Orchard API
Voice infrastructure on a private inference cluster — three products, one balance, one API. Transcribe audio to text (sync / async / webhooks), Synthesize text to audio across 13 languages, and Clone Voice — upload a 6-60 s reference and regenerate text in that speaker's voice. All three endpoints debit the same per-second balance, so your plan covers everything without separate quotas. Every request lives under /v1.
Quickstart
Three steps to your first request — same flow whether you start with Transcribe or Synthesize. One API key, one balance.
- Sign up at /signup.
- Activate a plan at /billing. Free includes 500 min/month (shared across all three products); paid plans start at $1.
- Generate an API key at /keys. The raw key (ork_…) is shown only once, so copy it right away.
Then, transcribe an MP3 — or jump to Synthesize for text-to-speech:
curl -X POST https://api.orchardrun.com/v1/audio/transcriptions \
-H "Authorization: Bearer ork_..." \
-F file=@audio.mp3 \
-F language=es \
-F response_format=verbose_json
TypeScript SDK
Skip the raw HTTP plumbing. The official @orchardrun/sdk package wraps every endpoint with typed methods, automatic retries on 5xx, and typed errors you can instanceof instead of string-matching. Universal — Node 18+, Bun, Deno, browsers, Cloudflare Workers, Vercel Edge.
Install
pnpm add @orchardrun/sdk
Quickstart
Set ORCHARD_API_KEY in your env (or pass apiKey directly to the constructor) and you're three lines from a transcript.
import Orchard from "@orchardrun/sdk";
import fs from "node:fs";
const orchard = new Orchard({ apiKey: process.env.ORCHARD_API_KEY });
// Speech to text
const { text } = await orchard.transcribe({
file: fs.readFileSync("./call.wav"),
language: "es",
});
console.log(text);
// Text to speech
const audio = await orchard.tts.generate({
text: "Hola mundo",
voice: "es_MX-claude",
});
await fs.promises.writeFile(
"./out.wav",
Buffer.from(await audio.arrayBuffer()),
);
// Voice cloning: register once, synth many times
const voice = await orchard.voices.create({
name: "Mateo · founder",
file: fs.readFileSync("./sample.wav"),
language: "es",
});
const synth = await orchard.voices.synthesize(voice.id, {
text: "Hola, soy Mateo.",
});
Vercel AI SDK provider
Orchard ships a first-class provider for the Vercel AI SDK. If you already use transcribe() or generateSpeech() from ai, switching to Orchard is a one-line provider swap.
import { experimental_transcribe as transcribe } from "ai";
import { orchard } from "@orchardrun/sdk/ai-sdk";
import fs from "node:fs";
const { text, language, durationInSeconds } = await transcribe({
model: orchard.transcription(),
audio: fs.readFileSync("./call.wav"),
mediaType: "audio/wav",
providerOptions: { orchard: { language: "es" } },
});
Typed errors
Every non-2xx maps to an OrchardError subclass. Catch the specific one for retry / upgrade-CTA logic — no string matching on error.message needed.
import {
OrchardRateLimitError,
OrchardQuotaError,
OrchardAuthError,
} from "@orchardrun/sdk";
try {
await orchard.tts.generate({ text, voice });
} catch (e) {
if (e instanceof OrchardRateLimitError) {
await sleep((e.retryAfterSeconds ?? 5) * 1000);
return retry();
}
if (e instanceof OrchardQuotaError) {
// 402 — out of balance. Surface upgrade CTA to your user.
return showUpgradeBanner();
}
if (e instanceof OrchardAuthError) {
// 401/403 — rotate the API key.
return null;
}
throw e;
}
Cancellation
Every method accepts an AbortSignal via the optional second argument. Useful when you're wiring up a "cancel transcription" button on a long upload.
const ac = new AbortController();
setTimeout(() => ac.abort(), 5_000); // 5-second hard cap
await orchard.transcribe(
{ file: bigAudioBuffer, language: "es" },
{ signal: ac.signal },
);
The full API surface, changelog, and contributing guide are on npm and GitHub.
Authentication
Every endpoint except POST /v1/auth/*, /v1/billing/webhook, and the public GET /v1/tts/voices/generic requires authentication. Two interchangeable methods:
- API key (recommended for programmatic access): Authorization: Bearer ork_... or X-API-Key: ork_....
- JWT (used by the web dashboard, obtainable via POST /v1/auth/login): Authorization: Bearer eyJhbGciOi....
Keys are SHA-256 hashed at rest and revocable instantly from /keys. The raw token is shown only on creation — there is no recovery flow.
Sync API
POST /v1/audio/transcriptions accepts a standard multipart upload and blocks until the cluster returns a transcript. Request and response shape follows the public transcription-API conventions — any SDK that speaks that format plugs in by pointing its base URL at Orchard.
curl -X POST https://api.orchardrun.com/v1/audio/transcriptions \
-H "Authorization: Bearer ork_..." \
-F file=@audio.mp3 \
-F language=es \
-F response_format=verbose_json
Multipart fields
| Field | Type | Description |
| --- | --- | --- |
| file | binary, required | Audio file. mp3, m4a, wav, mp4, ogg, flac, webm. Max 500 MB. |
| model | string | Accepted for client-library compatibility and ignored — every request runs on our latest STT engine, kept up to date by us. |
| language | string | ISO 639-1 hint (e.g. "es", "en"). Auto-detected when omitted. |
| response_format | string | json (default) · verbose_json · text |
| prompt | string | Accepted for parity; ignored. |
| temperature | number | Accepted for parity; ignored (decoder is greedy). |
Blocking: the request waits for the cluster to finish (timeout 600s). For long audio prefer the async endpoints below.
Async jobs
For long audio (or YouTube URLs) submit a job, get back a job_id, and poll until status is success. Concurrency is capped per plan (see Limits); excess requests queue rather than fail.
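The submit-then-poll flow can be sketched as below. This is a minimal illustration, not the SDK's implementation: `pollJob` and `fetchFromOrchard` are hypothetical names, the 5-second interval is an assumption, and the `failed` status is taken from the webhook payload documented later on this page. The fetcher is injected so the loop can be exercised without network access.

```typescript
// Job shape follows the poll responses shown in this section.
type JobStatus = "queued" | "running" | "success" | "failed";
interface Job {
  job_id: string;
  status: JobStatus;
  progress?: { current_ms: number; total_ms: number; percent: number };
  result?: { text: string };
  error?: string | null;
}

// Hypothetical helper: polls until the job reaches a terminal status.
// 360 attempts x 5 s matches the documented 1800 s async timeout.
async function pollJob(
  fetchJob: () => Promise<Job>,
  intervalMs = 5_000,
  maxAttempts = 360,
): Promise<Job> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status === "success") return job;
    if (job.status === "failed") throw new Error(job.error ?? "job failed");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("gave up waiting for the job");
}

// Fetcher for the documented poll endpoint (needs network and a valid key).
const fetchFromOrchard = (jobId: string, apiKey: string) => async (): Promise<Job> => {
  const res = await fetch(`https://api.orchardrun.com/v1/transcriptions/${jobId}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`poll failed with HTTP ${res.status}`);
  return (await res.json()) as Job;
};
```

Typical use would be `await pollJob(fetchFromOrchard(jobId, apiKey))`. When a webhook is available it is usually the better choice, since polling spends request quota.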
POST /v1/transcriptions · YouTube URL
curl -X POST https://api.orchardrun.com/v1/transcriptions \
-H "Authorization: Bearer ork_..." \
-H "Content-Type: application/json" \
-d '{"url":"https://youtu.be/dQw4w9WgXcQ","language":"en"}'
# → 202 Accepted
# {"job_id": "abc123...", "status": "queued"}
POST /v1/transcriptions/upload · file
curl -X POST https://api.orchardrun.com/v1/transcriptions/upload \
-H "Authorization: Bearer ork_..." \
-F file=@interview.mp3 \
-F language=es
# → 202 Accepted
# {"job_id": "...", "status": "queued"}
GET /v1/transcriptions/{id} · poll
Returns the job's status plus a live progress snapshot while running, and the full result payload once succeeded.
curl https://api.orchardrun.com/v1/transcriptions/abc123 \
-H "Authorization: Bearer ork_..."
# {"job_id":"abc123","status":"running",
# "progress":{"current_ms":699000,"total_ms":1099000,"percent":63}}
# Once done:
# {"job_id":"abc123","status":"success",
#  "result":{"text":"...","language":"es","duration_seconds":1099, ...}}
GET /v1/transcriptions/{id}/download · formatted
Same job, rendered in the format you ask for. Returns the file with a Content-Disposition: attachment header.
# Plain text
curl "https://api.orchardrun.com/v1/transcriptions/abc123/download?format=text" \
-H "Authorization: Bearer ork_..." -O
# SRT subtitles
curl "https://api.orchardrun.com/v1/transcriptions/abc123/download?format=srt" \
-H "Authorization: Bearer ork_..." -O
# Markdown with YAML frontmatter
curl "https://api.orchardrun.com/v1/transcriptions/abc123/download?format=md" \
-H "Authorization: Bearer ork_..." -O
Webhooks & n8n
Pass webhook_url when you create a job and Orchard POSTs the result to that URL when the job finishes. No polling needed — perfect for n8n / Zapier / Make.
Sending a job with a webhook
curl -X POST https://api.orchardrun.com/v1/transcriptions \
-H "Authorization: Bearer ork_..." \
-H "Content-Type: application/json" \
-d '{
"url": "https://youtu.be/dQw4w9WgXcQ",
"language": "en",
"webhook_url": "https://your.n8n.instance/webhook/orchard-result"
}'
Payload Orchard POSTs to your URL
Same shape as GET /v1/transcriptions/{id}:
{
"job_id": "abc123...",
"status": "success", // or "failed"
"result": {
"text": "...",
"language": "en",
"duration_seconds": 1099,
"segments": [ ... ],
"provider": "local",
"model": "orchard-stt-v1",
"elapsed_ms": 17430,
"post_processed": true
},
"error": null // string when status="failed"
}
Headers we send
| Header | Value | Notes |
| --- | --- | --- |
| User-Agent | Orchard-Webhook/1.0 | Identify Orchard in your logs. |
| X-Orchard-Job-Id | abc123... | Same id you got back from the POST. |
| X-Orchard-Attempt | 1, 2 or 3 | Retry counter — handle idempotently using job_id. |
| Content-Type | application/json | The body is always JSON. |
Retry & idempotency
- 3 attempts total with exponential backoff: 1s · 5s · 25s.
- 2xx = success. We stop retrying.
- 4xx = permanent failure. We do not retry; fix your endpoint.
- 5xx or network errors = retried up to 3 attempts.
- Use X-Orchard-Job-Id to dedupe, because the same id may arrive more than once.
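A receiver that follows these retry rules could look like the sketch below. It is illustrative only: `handleOrchardWebhook` is a hypothetical name, the in-memory `Set` stands in for whatever durable store you use, header names are assumed to arrive lower-cased (as Node's HTTP server delivers them), and `result` being `null` on failure is an assumption.

```typescript
interface WebhookPayload {
  job_id: string;
  status: "success" | "failed";
  result: { text: string } | null; // null on failure is an assumption
  error: string | null;
}

// In-memory dedupe store; a real service would use a database or cache.
const seenJobs = new Set<string>();

// Returns the HTTP status your endpoint should answer with.
function handleOrchardWebhook(
  headers: Record<string, string>,
  rawBody: string,
  onResult: (payload: WebhookPayload) => void,
): number {
  const jobId = headers["x-orchard-job-id"];
  if (!jobId) return 400; // 4xx is permanent: Orchard will not retry
  if (seenJobs.has(jobId)) return 200; // duplicate delivery, already processed

  let payload: WebhookPayload;
  try {
    payload = JSON.parse(rawBody) as WebhookPayload;
  } catch {
    return 400; // malformed body, retrying would not help
  }

  try {
    onResult(payload);
  } catch {
    return 500; // 5xx asks Orchard to retry (3 attempts in total)
  }
  seenJobs.add(jobId); // mark as done only after successful processing
  return 200;
}
```

Recording the job id only after `onResult` succeeds means a failed first attempt is processed again on the retry instead of being silently dropped as a duplicate.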
Ready-made n8n workflows
Importable JSON — open n8n → Workflows → Import from File. Replace the placeholder API key and destination IDs before running.
- orchard-youtube-to-notion.json — Webhook trigger → POST YouTube URL → wait for result → save to a Notion DB.
- orchard-upload-to-airtable.json — Webhook trigger receiving an audio file → POST to /upload → wait → save row in Airtable.
Both templates use n8n's Wait node in webhook mode. The {{$execution.resumeUrl}} expression resolves to a unique URL per run, so each transcription resumes the right execution.
Editor integrations
Orchard Dictate for VS Code / Cursor — hit a hotkey, speak, get the transcript inserted at your cursor. Built for dictating prompts to AI coding agents (Cursor, Copilot, Claude) without breaking flow.
Requirements
- sox installed (provides the rec binary the extension uses to capture mic audio).
  - macOS: brew install sox
  - Debian / Ubuntu: sudo apt-get install sox
- An Orchard API key (create one at /keys).
- Microphone permission for VS Code / Cursor in your OS privacy settings.
Install
- Install from the Visual Studio Marketplace: click Install, or in VS Code / Cursor open the Extensions sidebar and search for Orchard Dictate. From the command line: code --install-extension Orchardrun.orchard-dictate
- Run the command Orchard: Set API Key (or click the ⚙️ next to the 🎤 Dictate status bar item) and paste your ork_… key. That's it: the extension ships pointed at https://api.orchardrun.com.
Usage
- Cmd+Shift+8 (macOS) / Ctrl+Shift+8 (Win/Linux): start recording.
- Same shortcut again: stop, transcribe, insert at cursor.
Settings
| Setting | Values | Description |
| --- | --- | --- |
| orchardDictate.apiUrl | string | Orchard backend URL. Default https://api.orchardrun.com; override only if you're self-hosting. |
| orchardDictate.language | auto · es · en · pt · fr · de · it · hi | Language hint passed to the model. |
| orchardDictate.insertMode | cursor · clipboard | Paste at cursor (default) or copy to clipboard for manual paste. |
| orchardDictate.notifyOnSuccess | boolean | Show a toast on every success. Off by default, since the inserted text is feedback enough. |
Troubleshooting
- Nothing happens on the hotkey: sox is probably not installed. The extension probes /opt/homebrew/bin/rec, /usr/local/bin/rec, and /usr/bin/rec before falling back to PATH, so install location shouldn't matter as long as one of those exists.
- Recording runs but transcript is empty: mic permission missing for your editor in OS settings.
Synthesize · Text-to-Speech
POST /v1/tts/generate takes text + a voice id and returns a WAV. Synchronous — request blocks for ~1-2s on common short texts (the response body IS the audio file). Served by the same private inference cluster that handles transcription.
Request
curl -X POST https://api.orchardrun.com/v1/tts/generate \
-H "Authorization: Bearer ork_..." \
-F text="Hola, ¿qué tal todo por allá?" \
-F voice_id="es_MX-claude" \
-F voice_type="generic" \
--output out.wav
Form fields
| Field | Type | Description |
| --- | --- | --- |
| text | string · required · max 500 chars | What you want spoken. Punctuation and capitalisation are respected — they shape prosody. Longer texts: split into ≤500-char chunks and concatenate the WAVs client-side. |
| voice_id | string · required | One of the entries from the voice catalog below (e.g. "en_US-amy", "es_MX-claude"). The voice's language is implicit in the id. |
| voice_type | "generic" · default "clone" | Use "generic" for the public voice library. Cloned voices are not served by this endpoint in v2.0 (use POST /v1/voices/{id}/synthesize for those), so pass "generic" explicitly. |
| source_language | ISO 639-1 code · optional | Language hint for the input text when it differs from the voice's language. If set + different from voice language, the text is auto-translated before synthesis. Skip for same-language use. |
| translate | "auto" · "force" · "off" · default "auto" | Translation policy: auto = translate only when source_language ≠ voice language; force = always; off = synthesize the literal text in the voice's phoneme set even if mismatched. |
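For text over the 500-character limit, the client-side splitting mentioned in the `text` row could be done along these lines. This is a sketch under stated assumptions: `chunkText` and `hardSplit` are hypothetical helpers, sentence boundaries are detected naively on `.`, `!` and `?`, and length is measured in UTF-16 code units, which may differ from how the server counts characters.

```typescript
const MAX_TTS_CHARS = 500; // documented per-request limit

// Cut an over-long sentence into fixed-size pieces as a last resort.
function hardSplit(sentence: string, limit: number): string[] {
  const pieces: string[] = [];
  for (let i = 0; i < sentence.length; i += limit) {
    pieces.push(sentence.slice(i, i + limit));
  }
  return pieces;
}

// Pack whole sentences into chunks of at most `limit` characters.
function chunkText(text: string, limit = MAX_TTS_CHARS): string[] {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const pieces = sentence.length <= limit ? [sentence] : hardSplit(sentence, limit);
    for (const piece of pieces) {
      const candidate = current ? `${current} ${piece}` : piece;
      if (candidate.length <= limit) {
        current = candidate; // still fits, keep growing this chunk
      } else {
        chunks.push(current); // flush and start a new chunk
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk is then sent as its own request. Splitting on sentence boundaries keeps prosody natural, because punctuation shapes how each chunk is spoken.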
Response
- 200 OK with Content-Type: audio/wav. The body is the raw 22.05 kHz / 16-bit mono WAV, stream-friendly to disk or to a browser <audio> tag.
- 404 if voice_id is unknown.
- 402 if your balance is exhausted (see Limits & pricing).
- 429 if you exceed the concurrent-request limit for your plan.
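When several WAV responses have to be joined client-side (for example after splitting a long text), simply appending the files does not work because each one carries its own header. The sketch below shows one way to do it. `concatWavs` and `findData` are hypothetical helpers, and the code assumes every part is plain PCM with the same sample rate, bit depth and channel count, which should hold for clips generated with the same voice but is not verified here.

```typescript
// Find the payload of the "data" chunk by walking the RIFF chunk list.
function findData(wav: Uint8Array): { offset: number; length: number } {
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (pos: number): string =>
    String.fromCharCode(
      view.getUint8(pos), view.getUint8(pos + 1),
      view.getUint8(pos + 2), view.getUint8(pos + 3),
    );
  if (wav.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let pos = 12;
  while (pos + 8 <= wav.byteLength) {
    const size = view.getUint32(pos + 4, true); // chunk sizes are little-endian
    if (tag(pos) === "data") {
      return { offset: pos + 8, length: Math.min(size, wav.byteLength - pos - 8) };
    }
    pos += 8 + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("no data chunk found");
}

// Reuse the first file's header, append every payload, then fix both sizes.
function concatWavs(parts: Uint8Array[]): Uint8Array {
  const [first] = parts;
  if (!first) throw new Error("need at least one WAV");
  const headerLength = findData(first).offset;
  const payloads = parts.map((wav) => {
    const { offset, length } = findData(wav);
    return wav.subarray(offset, offset + length);
  });
  const dataLength = payloads.reduce((sum, p) => sum + p.byteLength, 0);
  const out = new Uint8Array(headerLength + dataLength);
  out.set(first.subarray(0, headerLength), 0);
  let pos = headerLength;
  for (const payload of payloads) {
    out.set(payload, pos);
    pos += payload.byteLength;
  }
  const view = new DataView(out.buffer);
  view.setUint32(4, out.byteLength - 8, true); // RIFF size = file size minus 8
  view.setUint32(headerLength - 4, dataLength, true); // data chunk size
  return out;
}
```

Any chunks that follow the audio data in the inputs (such as metadata) are dropped, which is normally harmless for playback.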
Billing
Each request debits the user's balance by the duration of the generated audio. A 30-second synthesized clip costs 30 seconds from the same balance Transcribe uses. The audio is generated at roughly 1000 characters / minute of speech, so a 500-char input consumes ~30 seconds.
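The arithmetic above can be wrapped in a small estimator for pre-flight checks. This is only an approximation built on the documented 1000 characters per minute figure; the function names are hypothetical, and the actual debit is the duration of the audio the server generates.

```typescript
const CHARS_PER_MINUTE = 1000; // documented approximation

// Rough number of balance seconds a synthesis request will consume.
function estimateTtsSeconds(charCount: number): number {
  return (charCount / CHARS_PER_MINUTE) * 60;
}

// True when the estimated cost fits in the remaining balance.
function canAfford(balanceSeconds: number, charCount: number): boolean {
  return estimateTtsSeconds(charCount) <= balanceSeconds;
}
```

For a 500-character input, `estimateTtsSeconds(500)` gives 30, matching the figure in the paragraph above.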
Cross-language synthesis
If your input text is in one language but you pick a voice in another, Orchard auto-translates before synthesis:
# Input is Spanish, voice is American English →
# server translates to English then synthesizes with Amy.
curl -X POST https://api.orchardrun.com/v1/tts/generate \
-H "Authorization: Bearer ork_..." \
-F text="Hola, ¿cómo estás?" \
-F voice_id="en_US-amy" \
-F source_language=es \
-F translate=force \
--output greeting_en.wav
Auto-translation falls back to the original text if the translation provider is unreachable; same-language synthesis always works.
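To predict on the client whether a request will be translated, the documented `translate` policy can be mirrored as below. This is a sketch of the rules as described in the form-field table, not the server's code; `voiceLanguage` and `willTranslate` are hypothetical names, and deriving the language from the id prefix relies on the statement that the voice's language is implicit in the id.

```typescript
type TranslatePolicy = "auto" | "force" | "off";

// "en_US-amy" -> "en": the language is the part before the underscore.
function voiceLanguage(voiceId: string): string {
  return (voiceId.split("_", 1)[0] ?? "").toLowerCase();
}

// Mirrors the documented rules: force = always, off = never,
// auto = only when source_language is set and differs from the voice.
function willTranslate(
  voiceId: string,
  sourceLanguage?: string,
  policy: TranslatePolicy = "auto",
): boolean {
  if (policy === "force") return true;
  if (policy === "off") return false;
  return sourceLanguage !== undefined && sourceLanguage.toLowerCase() !== voiceLanguage(voiceId);
}
```

A UI could use this to warn users that their Spanish text will be spoken in English before they spend balance on the request.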
Voice catalog
Generic voices available out of the box. Hit GET /v1/tts/voices/generic at runtime for the canonical list (it's public — no auth required, useful for populating a voice picker in your UI).
| Voice id | Language | Style |
| --- | --- | --- |
| en_US-amy | 🇺🇸 English (US) | Female · friendly, conversational |
| en_US-ryan | 🇺🇸 English (US) | Male · professional, narration |
| es_ES-davefx | 🇪🇸 Spanish (Spain) | Male · neutral broadcast tone |
| es_ES-sharvard | 🇪🇸 Spanish (Spain) | Multi-speaker · documentation reading |
| es_MX-claude | 🇲🇽 Spanish (Mexico) | Male · conversational LATAM |
| es_MX-ald | 🇲🇽 Spanish (Mexico) | Male · alternate LATAM voice |
| pt_BR-faber | 🇧🇷 Portuguese (Brazil) | Male · warm narration |
| fr_FR-siwis | 🇫🇷 French (France) | Female · neutral conversational |
| de_DE-thorsten | 🇩🇪 German | Male · clean broadcast |
| it_IT-paola | 🇮🇹 Italian | Female · narration |
| hi_IN-pratham | 🇮🇳 Hindi | Male · standard pronunciation |
| hi_IN-priyamvada | 🇮🇳 Hindi | Female · standard pronunciation |
Picking the right voice for your audience
The country code in the voice id (es_ES vs es_MX) drives accent. A LATAM audience listening to es_ES-davefx will hear a Castilian accent (the θ sound on c/z); for neutral LATAM use the es_MX voices.
Programmatic listing
curl https://api.orchardrun.com/v1/tts/voices/generic | jq
Returns an array of objects with voice_id, language (ISO 639-1), locale (e.g. es_MX), flag, gender, and description. Use this in production rather than hard-coding; the catalog grows as new voices ship.
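A voice picker built on this endpoint might filter by locale and fall back to the language when no exact match exists. The sketch below assumes the field names listed above, all typed as strings; `loadGenericVoices` and `voicesFor` are hypothetical helpers.

```typescript
interface GenericVoice {
  voice_id: string;
  language: string; // ISO 639-1, e.g. "es"
  locale: string; // e.g. "es_MX"
  flag: string;
  gender: string;
  description: string;
}

// Fetch the public catalog (no auth required, needs network).
async function loadGenericVoices(): Promise<GenericVoice[]> {
  const res = await fetch("https://api.orchardrun.com/v1/tts/voices/generic");
  if (!res.ok) throw new Error(`voice list failed with HTTP ${res.status}`);
  return (await res.json()) as GenericVoice[];
}

// Prefer voices with the exact locale; otherwise any voice in that language.
function voicesFor(voices: GenericVoice[], locale: string): GenericVoice[] {
  const exact = voices.filter((v) => v.locale === locale);
  if (exact.length > 0) return exact;
  const language = locale.slice(0, 2).toLowerCase();
  return voices.filter((v) => v.language === language);
}
```

With the catalog above, a request for `es_AR` would fall back to every Spanish voice, leaving the accent choice to the user.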
Clone Voice
Upload a 6-60 s reference recording, give it a name, and Orchard regenerates any text in that speaker's voice. Two-step flow: create persists the voice (one-time embed compute); synthesize generates audio from cached state on every subsequent call.
"premium-es" or "multilingual") so you can reason about expected quality and latency.POST /v1/voices · create from audio
curl -X POST https://api.orchardrun.com/v1/voices \
-H "Authorization: Bearer $ORCHARD_KEY" \
-F "audio=@reference.wav" \
-F "name=My voice" \
-F "language=es"
Form fields
| Field | Type | Description |
| --- | --- | --- |
| audio | file (required) | Reference audio: 6-60 s, any codec ffmpeg decodes (wav/webm/m4a/mp3). Normalised server-side to 24 kHz mono PCM. |
| name | string (required) | 1-80 chars. Shown in the playground voice list + the rename endpoint can change it later. |
| language | string (required) | ISO 639-1 of the reference: es / en / pt / fr / de / it / pl / tr / ru / nl / cs / ar / zh-cn / ja / hu / ko / hi. Drives engine routing (es → Premium, else Multilingual). |
Response
Returns 201 Created + JSON with id, name, language, tier ("premium-es" or "multilingual"), embed_bytes, created_at.
Returns 402 when you've hit your plan's cloned-voice quota — delete an existing voice (frees a slot) or upgrade. See the per-plan limits in Limits & pricing.
GET /v1/voices · list + quota
curl https://api.orchardrun.com/v1/voices \
-H "Authorization: Bearer $ORCHARD_KEY"
Response shape: { voices: [...], quota: { used: 2, limit: 3 } }. The quota block lets you render an "X of Y used" UI without an extra round-trip.
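Rendering that quota block is a one-liner; the sketch below also gates the "create voice" action so users do not run into the 402 described above. The helper names are hypothetical, and how an unlimited Enterprise quota is encoded is not documented, so this assumes `limit` is always a finite number.

```typescript
interface VoiceQuota {
  used: number;
  limit: number; // assumed finite; the unlimited encoding is not documented
}

// Text for an "X of Y used" label.
function quotaLabel(quota: VoiceQuota): string {
  return `${quota.used} of ${quota.limit} used`;
}

// Whether another cloned voice can be created without hitting the quota.
function canCreateVoice(quota: VoiceQuota): boolean {
  return quota.used < quota.limit;
}
```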
POST /v1/voices/{id}/synthesize · generate audio
curl -X POST https://api.orchardrun.com/v1/voices/$VOICE_ID/synthesize \
-H "Authorization: Bearer $ORCHARD_KEY" \
-F "text=Hola mundo, esta es mi voz clonada." \
-F "language=es" \
--output out.wav
Form fields
| Field | Type | Description |
| --- | --- | --- |
| text | string (required) | 1-500 chars. |
| language | string (required) | Same set as create. Multilingual voices can synthesize cross-lingually (your English voice speaking Italian); Premium ES voices stay in Spanish. |
Returns audio/wav bytes (16-bit PCM, 24 kHz mono). Debits from your shared per-second balance. Sampling hyperparameters are tuned server-side and not exposed — they're calibrated per engine for max fidelity.
PATCH /v1/voices/{id} · rename
curl -X PATCH https://api.orchardrun.com/v1/voices/$VOICE_ID \
-H "Authorization: Bearer $ORCHARD_KEY" \
-F "name=My better name"
The embedding stays intact; only the label changes. This avoids the delete-and-recreate workflow that would force a re-record.
DELETE /v1/voices/{id}
curl -X DELETE https://api.orchardrun.com/v1/voices/$VOICE_ID \
-H "Authorization: Bearer $ORCHARD_KEY"
Returns 204. Frees a slot in your quota immediately. The embedding bytes are deleted, so recreating the same voice requires a fresh reference recording.
Privacy notes
For Premium ES voices, the source WAV is persisted alongside the transcribed reference text — both are required for every synth on that tier. For Multilingual voices, only a compact conditioning embedding (~130 KB) is stored; the source WAV is discarded after the one-time embed compute. Either way: DELETE removes all artefacts permanently.
Response formats
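Five formats are supported on the download endpoint through the format query parameter: text, md, srt, vtt, and json. The sync API selects its format with the response_format field instead and supports json, verbose_json, and text.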
| Format | Content-Type | Contents |
| --- | --- | --- |
| text | text/plain | Just the cleaned text. No metadata. |
| md | text/markdown | Cleaned text + YAML frontmatter (language, duration, node). |
| srt | application/x-subrip | SubRip subtitles, one cue per segment. |
| vtt | text/vtt | WebVTT subtitles. |
| json | application/json | Full payload: text, raw_text, segments, language, duration_seconds, elapsed_ms, post_processed, node_id. |
| verbose_json | application/json | Standard shape: {task, language, duration, text, segments[]}. |
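For typed clients, the download formats and their content types can be captured in a lookup so an invalid format fails at compile time. This is a sketch: `downloadUrl` is a hypothetical helper, and it assumes that `vtt` and `json` are accepted by the `format` query parameter in the same way as the `text`, `srt` and `md` values shown in the curl examples.

```typescript
// Content types copied from the table above (download endpoint only).
const DOWNLOAD_CONTENT_TYPES = {
  text: "text/plain",
  md: "text/markdown",
  srt: "application/x-subrip",
  vtt: "text/vtt",
  json: "application/json",
} as const;

type DownloadFormat = keyof typeof DOWNLOAD_CONTENT_TYPES;

// Build the download URL for a finished job.
function downloadUrl(jobId: string, format: DownloadFormat): string {
  const id = encodeURIComponent(jobId);
  return `https://api.orchardrun.com/v1/transcriptions/${id}/download?format=${format}`;
}
```

A call such as `downloadUrl("abc123", "srt")` produces the same URL as the SRT curl example in the async section.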
Performance tips
We accept any audio container ffmpeg can decode (mp3, m4a, ogg, opus, flac, wav, mp4, webm). The format you pick changes upload latency, not transcription quality — every request runs on the same model.
Send Opus instead of WAV
WAV is uncompressed PCM. A 15-minute speech file is ~27 MB as WAV and only ~3.6 MB as Opus at 32 kbps (32 kbit/s × 900 s). The transcription is identical (Opus at speech bitrates is perceptually lossless for our STT engine); the upload is ~8x faster and you stop bumping into upload limits on long audio.
# Convert before upload
ffmpeg -i recording.wav -c:a libopus -b:a 32k recording.opus
# 27 MB → ~3.6 MB. Then upload normally.
Already using Opus? WhatsApp voice notes (.ogg) and browser MediaRecorder defaults are already Opus, so pass them through directly.
Use sync vs async deliberately
- Sync (POST /v1/audio/transcriptions) blocks. Best for short clips (≤2 min) where waiting on the response is fine.
- Async + webhook (POST /v1/transcriptions/upload with webhook_url) frees the caller immediately and pushes the result when ready. Best for n8n / Zapier and audio over a couple of minutes.
- Avoid polling when you can use a webhook: polling every 1-2s adds the same cumulative wait but burns request quota.
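The guidance above can be encoded as a simple routing rule. This is a sketch only: `chooseRoute` is a hypothetical helper, and the 120-second threshold is the "short clips (≤2 min)" rule of thumb from the list above, not a limit enforced by the API.

```typescript
const SYNC_MAX_SECONDS = 120; // rule of thumb from the guidance above

interface Route {
  endpoint: string;
  mode: "sync" | "async";
}

// Pick the blocking endpoint for short clips and the job endpoint otherwise.
function chooseRoute(durationSeconds: number): Route {
  return durationSeconds <= SYNC_MAX_SECONDS
    ? { endpoint: "/v1/audio/transcriptions", mode: "sync" }
    : { endpoint: "/v1/transcriptions/upload", mode: "async" };
}
```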
Hint header
Sync responses include X-Orchard-Hint when we notice an inefficient upload (e.g. WAV > 5 MB). Surface it in your client logs the first time it appears and you'll catch slow paths before they show up as user-visible latency.
Error codes
| 401 Unauthorized | — | Missing or invalid credentials. |
| 402 Payment Required | — | Account balance is 0 or negative. Top up at /billing. |
| 404 Not Found | — | Job, key, or user not found (also returned for jobs you don't own). |
| 409 Conflict | — | Download requested before the job finished. |
| 413 Payload Too Large | — | Upload exceeds 500 MB. |
| 504 Gateway Timeout | — | Sync transcription exceeded the cluster's timeout (600s). |
| 5xx | — | Server error — typically a worker or DB issue. Safe to retry. |
All error responses share the same shape: { "detail": "..." }.
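For clients that do not use the SDK's typed errors, the table can be turned into a small decision helper. This is one possible mapping, not official behaviour: `nextAction` and `errorDetail` are hypothetical names, treating 504 as a cue to switch to the async endpoints is a judgment call, and 429 is included because the Synthesize section documents it.

```typescript
type Action = "retry" | "switch-to-async" | "top-up" | "fix-credentials" | "give-up";

// Map an HTTP status from the table above to a client-side reaction.
function nextAction(status: number): Action {
  if (status === 401) return "fix-credentials";
  if (status === 402) return "top-up";
  if (status === 504) return "switch-to-async"; // sync timeout: retrying would likely time out again
  if (status === 429 || status >= 500) return "retry";
  return "give-up"; // 404, 409, 413 and other client errors
}

// Extract the message from the shared { "detail": "..." } error shape.
function errorDetail(rawBody: string): string {
  try {
    const parsed = JSON.parse(rawBody) as { detail?: unknown };
    return typeof parsed.detail === "string" ? parsed.detail : rawBody;
  } catch {
    return rawBody; // not JSON (or not an object): surface the raw body
  }
}
```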
Limits & pricing
One unified balance per plan, metered in seconds of audio. A 30-second transcription consumes 30 seconds; a 30-second synthesized clip consumes 30 seconds; a 30-second clip generated with a cloned voice consumes 30 seconds. Mix the three products freely, with no per-product cap to juggle.
| Limit | Value | Notes |
| --- | --- | --- |
| Plans | Free · $1 · $10 · $25 · Enterprise | Free (500 min/mo, opt-in), Hobby, Starter, Pro, custom Enterprise. See /billing. |
| Balance unit | 1 second of audio | Applies to Transcribe (input audio length), Synthesize (generated audio length), and Clone Voice (generated audio length). 1000 characters of TTS ≈ 60 seconds. |
| Effective rate | from $0.00042 / min on Pro | Pro tier ($25/mo for 60,000 min). Free tier requires explicit activation in the dashboard. |
| Cloned voices | 1 / 3 / 10 / 50 / ∞ | Free / Hobby / Starter / Pro / Enterprise. Each voice held costs ~130-500 KB of DB storage; the limit caps that + creates a natural upgrade moment. |
| Concurrency | 1 / 1 / 3 / 10 / 24 | Free / Hobby / Starter / Pro / Enterprise. Counts STT + TTS + Clone Voice in the same bucket. Excess requests queue rather than fail. |
| Rate limit | 30 / 60 / 300 / 900 / 5000 rpm | Free / Hobby / Starter / Pro / Enterprise. |
| Sync timeout | 600 seconds | POST /v1/audio/transcriptions blocks up to this long. |
| Async timeout | 1800 seconds | Per-job wall-clock cap. Practical audio length per file is ~2 hours. |
| Upload size (STT) | 500 MB | Per file. |
| Supported audio (STT) | mp3, m4a, wav, ogg, flac, mp4, webm | Anything ffmpeg can decode. |
| TTS text length | 500 chars per request | Split longer text client-side and concatenate the WAVs. |
| Clone Voice reference | 6-60 seconds | Shorter loses prosody; longer wastes upload + GPU time. Clean recording matters more than length. |
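The per-plan numbers in the table can be kept in one typed constant for client-side pacing and capacity checks. This is a sketch: the constant and helper names are hypothetical, the values are copied from the rows above and may change, and spacing requests evenly across the minute is just one simple way to stay under an rpm limit.

```typescript
// Values copied from the table above; check /billing for current figures.
const PLAN_LIMITS = {
  free: { clonedVoices: 1, concurrency: 1, rpm: 30 },
  hobby: { clonedVoices: 3, concurrency: 1, rpm: 60 },
  starter: { clonedVoices: 10, concurrency: 3, rpm: 300 },
  pro: { clonedVoices: 50, concurrency: 10, rpm: 900 },
  enterprise: { clonedVoices: Infinity, concurrency: 24, rpm: 5000 },
} as const;

type Plan = keyof typeof PLAN_LIMITS;

// Minimum spacing between requests that keeps a client under the rpm limit.
function minIntervalMs(plan: Plan): number {
  return 60_000 / PLAN_LIMITS[plan].rpm;
}

// Price per minute of audio for a plan price and its included minutes.
function effectiveRatePerMinute(priceUsd: number, includedMinutes: number): number {
  return priceUsd / includedMinutes;
}
```

For the Pro figures in the table, `effectiveRatePerMinute(25, 60_000)` rounds to the quoted $0.00042 per minute.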