●FUNDING: Rork raised a $15M seed led by Left Lane Capital, with Peak XV, True Ventures, Goodwater, and a16z Speedrun joining●ENGINE: Rork Max runs on Claude Code and Claude Opus 4.6; it drew 8M+ views on X and doubled annual revenue in two weeks●SWIFT: Rork Max is the first web-based Swift app builder, positioned to replace Apple's traditional Xcode●PRODUCT: Rork Max covers the whole Apple ecosystem: iPhone, iPad, Apple Watch, Apple TV, Vision Pro, and iMessage●CLASSIC: The original Rork uses React Native (Expo), building iOS/Android apps from a plain-English description●PRICING: Start free; paid plans begin at $25/mo, and Rork Max is $200/mo
Building an Immersive AI Language Learning App with Rork — Whisper Speech Recognition × Claude Conversational AI × ElevenLabs TTS
Production implementation notes for an immersive language learning app integrating Whisper, Claude, and ElevenLabs in Rork Max. Covers CEFR-adaptive curriculum, SM-2 spaced repetition, streaming latency optimization, and a freemium pricing model that holds a 55 percent margin.
What I Wished Existed While Preparing Solo Exhibitions in Europe
In 2024, while coordinating exhibitions in Berlin and Milan with local galleries, I kept hitting the same wall. I had prepped with the usual language apps, but the moment the video calls started I could not catch their intonation, and my own pronunciation did not land. The reason was obvious in hindsight: most apps over-index on reading practice and under-deliver on real-time speaking and listening loops.
As Masaki Hirokawa, an indie developer who has been shipping iOS and Android apps since 2014 (dolice.design), I sat down with those notes and asked: could Whisper, Claude, and ElevenLabs together inside Rork Max close that gap at an indie scale? This guide is the design document that came out of that experiment, written so you can move from API keys to production without re-doing the research. The flashcard-style basics live in the Rork Language Learning App Tutorial; from here we focus only on the next step.
The three AI APIs we combine are:
OpenAI Whisper API — speech-to-text transcription with word-level confidence values that we will repurpose for pronunciation scoring
Anthropic Claude API — adaptive conversation simulation, grammar feedback, and curriculum generation that respond to the learner's CEFR level (A1 through C2)
ElevenLabs TTS API — native-quality voice synthesis so listening and speaking practice happen in the same screen
One thing I will say upfront. It is tempting to assume that the voice AI quality is everything, but after running these pipelines against real users I am convinced the bigger factors for retention are loop latency and learning consistency. I will return to that observation throughout the guide.
System Architecture — Integrating Three AI APIs
Overall Structure
The immersive language learning app architecture consists of four layers:
The key design principle here is the sequential dependency: Whisper's output feeds into Claude, and Claude's output feeds into ElevenLabs. Because each API call is serial, latency optimization becomes critical — we'll address this later with streaming responses.
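To make the cost of that serial dependency concrete, here is a small back-of-the-envelope model. The stage timings are illustrative assumptions rather than measurements from this app, and the names StageTimings, serialLatencyMs, and timeToFirstAudioMs are hypothetical.

```typescript
// Hypothetical per-stage timings in milliseconds (assumed, not measured).
interface StageTimings {
  whisperMs: number;             // full transcription of the learner's audio
  claudeFirstSentenceMs: number; // time until Claude has produced one complete sentence
  claudeFullMs: number;          // time until Claude's whole reply is complete
  ttsFirstSentenceMs: number;    // synthesis time for the first sentence only
  ttsFullMs: number;             // synthesis time for the whole reply
}

// Fully serial: each stage waits for the previous one to finish completely.
function serialLatencyMs(t: StageTimings): number {
  return t.whisperMs + t.claudeFullMs + t.ttsFullMs;
}

// Streamed: TTS starts as soon as the first sentence exists.
function timeToFirstAudioMs(t: StageTimings): number {
  return t.whisperMs + t.claudeFirstSentenceMs + t.ttsFirstSentenceMs;
}

const exampleTimings: StageTimings = {
  whisperMs: 900,
  claudeFirstSentenceMs: 500,
  claudeFullMs: 1800,
  ttsFirstSentenceMs: 400,
  ttsFullMs: 1300,
};

console.log(serialLatencyMs(exampleTimings));    // 4000
console.log(timeToFirstAudioMs(exampleTimings)); // 1800
```

With these made-up numbers the learner waits 4 seconds in the serial design but hears the first audio after 1.8 seconds when the stages overlap, which is why the Whisper stage (which cannot be streamed here) sets the floor.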
WHAT YOU'LL LEARN
✦A three-stage Whisper x Claude x ElevenLabs pipeline you can copy-paste, plus the full Claude system prompt that adapts behavior across CEFR A1 to C2
✦A robust pronunciation scoring function built from Whisper word probabilities, along with the SM-2 spaced repetition and adaptive curriculum scoring code
✦API cost breakdown of about $0.05 per session, and the freemium design that keeps a 55 percent gross margin while shipping streaming TTS for sub-second perceived latency
Implementing Whisper Speech Recognition — Transcription and Pronunciation Scoring
Audio Recording with expo-av
Building on the fundamentals covered in Rork + OpenAI Whisper API Speech Recognition Guide, here's the recording implementation optimized for language learning:
Implementing Claude Conversational AI — Adaptive Curriculum and Grammar Feedback
Designing the Conversation Simulation
The most critical aspect of integrating Claude for language learning is system prompt design. The prompt must dynamically adjust vocabulary difficulty, speech pace guidance, and grammar focus based on the learner's CEFR level (A1 through C2).
// services/claudeConversationService.ts

// Minimal message shape for the conversation history.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

interface ConversationContext {
  userText: string;
  expectedPhrase?: string;
  history: Message[];
  learnerLevel: 'A1' | 'A2' | 'B1' | 'B2' | 'C1' | 'C2';
  targetLanguage: string;
  nativeLanguage: string;
  lessonTopic: string;
}

// Mirrors the RESPONSE FORMAT block in the system prompt below.
interface ConversationFeedback {
  responseText: string;
  pronunciationScore: number;
  grammarFeedback: {
    original: string;
    corrected: string;
    explanation: string;
    rule: string;
  }[];
  vocabularyTip: { word: string; meaning: string; exampleSentence: string };
  nextPrompt: string;
  encouragement: string;
}

const LEVEL_GUIDELINES: Record<ConversationContext['learnerLevel'], string> = {
  A1: `
- Use present tense only. Keep responses under 5 words per sentence
- Limit vocabulary to 500 basic words (family, food, weather, greetings)
- Correct only 1 mistake per response
- Always use encouraging tone: praise correct parts first
`,
  A2: `
- Use present and past tense. Up to 8 words per sentence
- Vocabulary around 1,500 words (shopping, travel, daily life)
- Correct up to 2 grammar mistakes. Suggest 1 alternative expression
`,
  B1: `
- All tenses allowed. Complex sentences OK. Natural pace
- Actively introduce idioms and phrasal verbs
- Provide feedback on nuance, not just grammar
`,
  B2: `
- Include business and academic vocabulary
- Ask questions that encourage logical arguments and opinions
- Comment on collocations and register appropriateness
`,
  C1: `
- Native-equivalent conversation. Include jargon and slang
- Point out subtle nuances and cultural context differences
- Teach rhetorical devices and persuasive expressions
`,
  C2: `
- Mastery level. Pursue refined expression
- Discuss style and tone variation, metaphor and irony
- Explain differences from common native speaker mistakes
`,
};

export async function generateConversationResponse(
  context: ConversationContext
): Promise<ConversationFeedback> {
  const systemPrompt = `You are an expert language tutor conducting an immersive ${context.targetLanguage} conversation lesson.

LEARNER PROFILE:
- Native language: ${context.nativeLanguage}
- Current level: ${context.learnerLevel} (CEFR)
- Lesson topic: ${context.lessonTopic}

LEVEL-SPECIFIC GUIDELINES:
${LEVEL_GUIDELINES[context.learnerLevel]}

RESPONSE FORMAT (JSON):
{
  "responseText": "Your conversational response in ${context.targetLanguage}",
  "pronunciationScore": 0-100,
  "grammarFeedback": [
    {
      "original": "what the learner said",
      "corrected": "the correct version",
      "explanation": "brief explanation in ${context.nativeLanguage}",
      "rule": "grammar rule name"
    }
  ],
  "vocabularyTip": {
    "word": "a new word used in your response",
    "meaning": "meaning in ${context.nativeLanguage}",
    "exampleSentence": "another example"
  },
  "nextPrompt": "a follow-up question to continue the conversation",
  "encouragement": "a brief positive comment in ${context.nativeLanguage}"
}

IMPORTANT:
- Always respond naturally as a conversation partner FIRST, tutor SECOND
- Keep corrections constructive and non-intimidating
- Gradually increase complexity when the learner performs well
- If pronunciation score is below 60, suggest specific phonemes to practice`;

  const expectedNote = context.expectedPhrase
    ? `\n(Expected response was similar to: "${context.expectedPhrase}")`
    : '';

  const messages = [
    ...context.history.map(m => ({ role: m.role, content: m.content })),
    {
      role: 'user' as const,
      content: `The learner said: "${context.userText}"${expectedNote}`,
    },
  ];

  // Run this from a server-side function: an API key bundled into the app
  // binary can be extracted by anyone who downloads it.
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY!,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      system: systemPrompt,
      messages,
    }),
  });

  // Fail loudly on HTTP errors instead of trying to parse an error payload.
  if (!response.ok) {
    throw new Error(`Claude API error: ${response.status}`);
  }

  const data = await response.json();
  const content = data.content[0].text;
  return JSON.parse(content) as ConversationFeedback;
}
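Model output is not guaranteed to arrive as bare JSON; it can come with a sentence of preamble or with a field out of range. A defensive parser could look like the following sketch. The names ParsedFeedback and parseFeedback are hypothetical, and only three of the response fields are validated here to keep it short.

```typescript
// Hypothetical subset of the feedback shape; only the fields checked below.
interface ParsedFeedback {
  responseText: string;
  pronunciationScore: number;
  grammarFeedback: unknown[];
}

// Pull the outermost {...} span out of the model output and validate it.
function parseFeedback(raw: string): ParsedFeedback | null {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let candidate: unknown;
  try {
    candidate = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON: let the caller retry or fall back
  }
  if (typeof candidate !== 'object' || candidate === null) return null;

  const obj = candidate as Record<string, unknown>;
  const responseText = obj['responseText'];
  if (typeof responseText !== 'string') return null;

  // Clamp the score so a stray value cannot break the UI.
  const rawScore = obj['pronunciationScore'];
  const score = typeof rawScore === 'number' ? rawScore : 0;
  const pronunciationScore = Math.min(100, Math.max(0, score));

  const grammar = obj['grammarFeedback'];
  return {
    responseText,
    pronunciationScore,
    grammarFeedback: Array.isArray(grammar) ? grammar : [],
  };
}
```

Returning null instead of throwing lets the caller decide whether to retry the request or show a plain-text fallback, which keeps a single bad response from ending the session.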
Adaptive Curriculum Engine
This engine automatically adjusts lesson content based on the learner's accuracy rates, pronunciation scores, and session frequency:
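One way such an engine could decide when to move a learner up or down a level is sketched here. The thresholds (85 and 60 percent accuracy, pronunciation scores of 75 and 50, three sessions per week) and every name in the block are assumptions chosen for illustration, not the values this app uses.

```typescript
type CefrLevel = 'A1' | 'A2' | 'B1' | 'B2' | 'C1' | 'C2';
const LEVELS: CefrLevel[] = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2'];

// The three signals named in the text, aggregated over recent sessions.
interface LearnerStats {
  accuracyRate: number;      // 0..1, share of exercises answered correctly
  avgPronunciation: number;  // 0..100
  sessionsLast7Days: number; // session frequency
}

// Promote only when all three signals are strong; demote when accuracy or
// pronunciation is clearly weak; otherwise stay at the current level.
function nextLevel(current: CefrLevel, s: LearnerStats): CefrLevel {
  const i = LEVELS.indexOf(current);
  const strong =
    s.accuracyRate >= 0.85 && s.avgPronunciation >= 75 && s.sessionsLast7Days >= 3;
  const struggling = s.accuracyRate < 0.6 || s.avgPronunciation < 50;

  if (strong) return LEVELS[Math.min(i + 1, LEVELS.length - 1)];
  if (struggling) return LEVELS[Math.max(i - 1, 0)];
  return current;
}
```

Requiring a minimum session count before promoting is a deliberate choice in this sketch: a single excellent session is weak evidence, while a demotion reacts faster so that a learner who is struggling is not kept at a level that is too hard.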
Implementing ElevenLabs TTS
The Rork TTS Implementation Guide covered basic text-to-speech with expo-speech, but language learning needs pronunciation that stays close to a native speaker's. ElevenLabs produces natural-sounding speech with realistic intonation and emotion.
// services/elevenLabsService.ts
import { Audio } from 'expo-av';
import * as FileSystem from 'expo-file-system';

// Voice IDs suited for language learning
const VOICE_PROFILES: Record<string, { male: string; female: string }> = {
  en: {
    male: 'pNInz6obpgDQGcFmaJgB',   // Adam: clear enunciation
    female: 'EXAVITQu4vr4xnSDxMaL', // Bella: calm delivery
  },
  ja: {
    male: 'bIHbv24MWmeRgasZH58o',
    female: 'jsCqWAovK2LkecY7zXl4',
  },
  es: {
    male: 'onwK4e9ZLuTAKqWW03F9',
    female: 'XB0fDUnXU5powFXDhCwa',
  },
};

interface TTSOptions {
  text: string;
  language: string;
  voiceGender: 'male' | 'female';
  speed: 'slow' | 'normal' | 'fast'; // Adjusts based on learner level
  stability: number; // 0-1: lower = more expressive, higher = more stable
}

// Convert binary audio to base64 in chunks. Spreading a whole clip into a
// single String.fromCharCode call can exceed the engine's argument limit.
function toBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  const CHUNK = 0x8000;
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}

export async function synthesizeSpeech(
  options: TTSOptions
): Promise<Audio.Sound> {
  // Fall back to the English female voice for unsupported languages.
  const voiceId =
    VOICE_PROFILES[options.language]?.[options.voiceGender] ??
    VOICE_PROFILES.en.female;

  // Note: speed only selects a similarity_boost value here; it does not
  // change the playback rate of the generated audio.
  const similarityBoost = options.speed === 'slow' ? 0.9 : 0.75;

  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'xi-api-key': process.env.ELEVENLABS_API_KEY!,
      },
      body: JSON.stringify({
        text: options.text,
        model_id: 'eleven_multilingual_v2',
        voice_settings: {
          stability: options.stability,
          similarity_boost: similarityBoost,
          style: 0.3,
          use_speaker_boost: true,
        },
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`ElevenLabs API error: ${response.status}`);
  }

  // Save audio data locally for playback
  const audioData = await response.arrayBuffer();
  const fileUri = `${FileSystem.cacheDirectory}tts_${Date.now()}.mp3`;
  await FileSystem.writeAsStringAsync(fileUri, toBase64(audioData), {
    encoding: FileSystem.EncodingType.Base64,
  });

  const { sound } = await Audio.Sound.createAsync({ uri: fileUri });
  return sound;
}

// Auto-adjust TTS parameters based on learner level
export function getTTSParamsForLevel(
  level: string
): Partial<TTSOptions> {
  switch (level) {
    case 'A1':
    case 'A2':
      return { speed: 'slow', stability: 0.9 };
    case 'B1':
    case 'B2':
      return { speed: 'normal', stability: 0.75 };
    case 'C1':
    case 'C2':
      return { speed: 'fast', stability: 0.5 };
    default:
      return { speed: 'normal', stability: 0.75 };
  }
}
Audio Caching for Cost Optimization
To reduce API costs, we cache synthesized audio for frequently used phrases:
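A minimal sketch of the caching idea follows, covering only the cache key and the eviction policy. It assumes that phrases differing only in case or whitespace can share one audio file, and the names CacheKeyParts, cacheKey, and AudioCache are hypothetical. A real app would store the files with expo-file-system and keep this map as the index.

```typescript
interface CacheKeyParts {
  text: string;
  voiceId: string;
  stability: number;
}

// Normalize so "Hello!" and " hello! " map to the same cache entry.
// Assumption: case and spacing do not change the synthesized audio enough to matter.
function cacheKey(p: CacheKeyParts): string {
  const normalized = p.text.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${p.voiceId}|${p.stability.toFixed(2)}|${normalized}`;
}

// Small LRU built on Map insertion order: the first key is the least recently used.
class AudioCache {
  private entries = new Map<string, string>(); // key -> local file URI
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): string | undefined {
    const uri = this.entries.get(key);
    if (uri === undefined) return undefined;
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, uri);
    return uri;
  }

  set(key: string, uri: string): void {
    this.entries.delete(key);
    this.entries.set(key, uri);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Including the voice ID and stability in the key matters because the same sentence synthesized with different settings is a different audio file; leaving them out would replay a slow A1 rendition to a C1 learner.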
Latency Optimization — Streaming and Parallel Processing
Claude Streaming Responses
The sequential Whisper → Claude → TTS pipeline introduces 3–5 seconds of latency between the user speaking and hearing a response. By using Claude's streaming API, we can start TTS conversion while text is still being generated:
// pipeline/StreamingPipeline.ts
import { synthesizeSpeech } from '../services/elevenLabsService';
// transcribeWithWhisper, buildSystemPrompt, buildMessages, and the shared
// ConversationContext / ConversationFeedback types come from the service modules.

export async function processWithStreaming(
  audioBlob: Blob,
  context: ConversationContext,
  callbacks: {
    onTranscription: (text: string) => void;
    onPartialResponse: (text: string) => void;
    onAudioReady: (sentence: string) => void; // receives the sentence being spoken
    onFeedback: (feedback: ConversationFeedback) => void;
  }
) {
  // Step 1: Whisper (batch processing only)
  const transcription = await transcribeWithWhisper(audioBlob);
  callbacks.onTranscription(transcription.text);

  // Step 2: Claude streaming
  let fullResponse = '';
  let firstSentenceProcessed = false;

  const stream = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY!,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      stream: true,
      system: buildSystemPrompt(context),
      messages: buildMessages(context, transcription.text),
    }),
  });

  if (!stream.ok || !stream.body) {
    throw new Error(`Claude streaming error: ${stream.status}`);
  }

  const reader = stream.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial SSE line that was split across chunks

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the incomplete tail for the next chunk

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;

      const data = JSON.parse(line.slice(6));
      if (data.type !== 'content_block_delta' || data.delta?.type !== 'text_delta') {
        continue;
      }

      fullResponse += data.delta.text;
      callbacks.onPartialResponse(fullResponse);

      // Start TTS for the first sentence as soon as it's complete
      if (!firstSentenceProcessed && fullResponse.includes('.')) {
        const firstSentence = fullResponse.split('.')[0] + '.';
        firstSentenceProcessed = true;

        // Begin TTS conversion asynchronously
        synthesizeSpeech({
          text: firstSentence,
          language: context.targetLanguage,
          voiceGender: 'female',
          speed: 'normal',
          stability: 0.75,
        })
          .then(sound => {
            callbacks.onAudioReady(firstSentence);
            return sound.playAsync();
          })
          .catch(err => console.warn('TTS for first sentence failed', err));
      }
    }
  }
}
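The listing detects the first sentence by looking for a period, which misses sentences ending in "?" or "!" and Japanese text ending in "。". An incremental splitter that emits every completed sentence could look like this sketch. SentenceChunker is a hypothetical helper, and its punctuation set is an assumption that covers only the three languages in VOICE_PROFILES; it will also split on decimals such as "3.5".

```typescript
// Runs of sentence-ending punctuation for en, es, and ja.
const SENTENCE_END = /[.!?。！？]+/;
// At least one character that is neither punctuation nor whitespace.
const HAS_CONTENT = /[^.!?。！？\s]/;

class SentenceChunker {
  private buffer = '';

  // Feed one streamed text delta; returns the sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];

    let match = SENTENCE_END.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(0, end).trim();
      // Skip fragments that are only punctuation, e.g. a split "...".
      if (HAS_CONTENT.test(sentence)) sentences.push(sentence);
      this.buffer = this.buffer.slice(end);
      match = SENTENCE_END.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the stream ends to collect trailing text without punctuation.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest;
  }
}
```

Each sentence returned by push could be handed to synthesizeSpeech in order, so that playback of the second sentence is already queued while the first is still being spoken.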
Subscription Monetization Design
Freemium Model Structure
A freemium model works best for language learning apps:
Free Plan: 3 AI conversation sessions per day (5 minutes each), basic flashcards, progress tracking
Pro Plan ($9.99/month): Unlimited AI sessions, all voice options, detailed pronunciation reports, offline caching, no ads
Premium Plan ($79.99/year): Everything in Pro + personalized curriculum generation, monthly native speaker feedback
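Assuming a Pro subscriber averages 3 sessions/day (90 sessions/month) at roughly $0.05 of API cost per session, monthly API cost is about $4.50 against $9.99 of revenue. That is a gross margin of about 55% before app store fees.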
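The margin arithmetic is easy to keep honest in code. This sketch uses the article's figures ($9.99 per month, 90 sessions, about $0.05 per session); the function names are hypothetical, and store fees and taxes are deliberately left out.

```typescript
// Gross margin as a fraction of price, counting only API cost.
function grossMargin(
  priceUsd: number,
  sessionsPerMonth: number,
  costPerSessionUsd: number
): number {
  const apiCost = sessionsPerMonth * costPerSessionUsd;
  return (priceUsd - apiCost) / priceUsd;
}

// Whole sessions a subscriber can run in a month before API cost exceeds the price.
function breakEvenSessions(priceUsd: number, costPerSessionUsd: number): number {
  return Math.floor(priceUsd / costPerSessionUsd);
}

const margin = grossMargin(9.99, 90, 0.05);
console.log(margin.toFixed(2));                // "0.55"
console.log(breakEvenSessions(9.99, 0.05));    // 199
```

The break-even figure is worth watching because the Pro plan is unlimited: at these prices a subscriber who runs more than about six or seven sessions every day costs more in API calls than they pay.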
Three Gotchas I Hit While Running This in Production
A pipeline that looks clean on paper starts surfacing quiet issues once you run a hundred sessions a day on real devices. Drawing on the operational logging habits I built up across roughly 50 million cumulative downloads of my wallpaper and meditation apps, here are the three traps that hurt language-learning apps the most.
Gotcha 1: Forecasting ElevenLabs by "conversation turns" is off by 2x
I initially budgeted at "1 session = 10 turns = ~1,000 characters" for Pro plan users. Actual measurement came in around 2,000 characters. The cause was that I was sending the corrected/explanation fields of grammar feedback and the in-prompt example sentences to TTS as well. Without an explicit split, the API bill at month three runs at 2x your projection.
// services/ttsRouter.ts: decide which texts get spoken vs only shown
// Always separate "must be spoken" from "screen-only assistance"
interface TTSCandidate {
  text: string;
  purpose: 'main_response' | 'pronunciation_hint' | 'grammar_explanation' | 'vocab_example';
}

const TTS_BUDGET_PER_SESSION = 1200; // character cap; anything above is screen-only

export function selectTTSCandidates(
  candidates: TTSCandidate[],
  spentCharsSoFar: number
): TTSCandidate[] {
  // main_response always gets TTS; the rest only if budget allows, in priority order
  const main = candidates.filter(c => c.purpose === 'main_response');
  const others = candidates.filter(c => c.purpose !== 'main_response');

  let budget = TTS_BUDGET_PER_SESSION - spentCharsSoFar;
  const result = [...main];
  budget -= main.reduce((s, c) => s + c.text.length, 0);

  // Priority: pronunciation_hint > vocab_example > grammar_explanation
  const priorityOrder: TTSCandidate['purpose'][] = [
    'pronunciation_hint',
    'vocab_example',
    'grammar_explanation',
  ];

  for (const purpose of priorityOrder) {
    const items = others.filter(c => c.purpose === purpose);
    for (const item of items) {
      if (budget >= item.text.length) {
        result.push(item);
        budget -= item.text.length;
      }
    }
  }

  return result;
}
Once this router was in place, per-user TTS character consumption dropped by roughly 40 percent at the median, and Pro plan unit economics started matching the spreadsheet.
Gotcha 2: A flat average of Whisper word probabilities inflates the score
If you compute the pronunciation score as a flat average of word probabilities, "ums" and breathy near-silence tokens come back at 0.95+ and unfairly lift the total. The fix is small but important: drop segments where no_speech_prob > 0.3 or avg_logprob < -1.0 before scoring.
// Add this to services/whisperService.ts
// Assumes each segment carries a words array with a per-word probability
// (the shape open-source Whisper emits with word timestamps enabled);
// check that your transcription endpoint actually returns that field.
export function calculateRobustPronunciationScore(
  response: WhisperDetailedResponse
): { overall: number; reliableWordCount: number } {
  // Keep only segments that Whisper itself considers real, confident speech
  const reliableSegments = response.segments.filter(s =>
    s.no_speech_prob < 0.3 && s.avg_logprob > -1.0
  );
  const words = reliableSegments.flatMap(s => s.words || []);

  // also drop words below 0.3, which are usually noise, coughs, or stutters
  const reliable = words.filter(w => w.probability >= 0.3);

  if (reliable.length === 0) {
    return { overall: 0, reliableWordCount: 0 };
  }

  const overall = Math.round(
    (reliable.reduce((s, w) => s + w.probability, 0) / reliable.length) * 100
  );
  return { overall, reliableWordCount: reliable.length };
}
When I introduced this filter, average pronunciation score dropped from 82 to 71. But user feedback shifted toward "the score actually matches what I feel about my own pronunciation." Scores should be honest signals, not vanity numbers. Honesty correlates better with retention.
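To see the size of the effect, here is a self-contained comparison on made-up data. The Segment and Word shapes and both function names are hypothetical stand-ins, and the sample values are invented to mimic one spoken segment followed by a near-silent filler segment.

```typescript
interface Word { word: string; probability: number; }
interface Segment { no_speech_prob: number; avg_logprob: number; words: Word[]; }

// Average word probability as a 0..100 score; 0 when there are no words.
function averageScore(words: Word[]): number {
  if (words.length === 0) return 0;
  const sum = words.reduce((s, w) => s + w.probability, 0);
  return Math.round((sum / words.length) * 100);
}

// Naive version: every word counts, including filler and near-silence.
function flatScore(segments: Segment[]): number {
  const words: Word[] = [];
  for (const s of segments) words.push(...s.words);
  return averageScore(words);
}

// Filtered version: same thresholds as the prose (0.3, -1.0, 0.3).
function robustScore(segments: Segment[]): number {
  const words: Word[] = [];
  for (const s of segments) {
    if (s.no_speech_prob < 0.3 && s.avg_logprob > -1.0) words.push(...s.words);
  }
  return averageScore(words.filter(w => w.probability >= 0.3));
}

const sample: Segment[] = [
  {
    no_speech_prob: 0.05,
    avg_logprob: -0.4,
    words: [
      { word: 'I', probability: 0.9 },
      { word: 'would', probability: 0.6 },
      { word: 'like', probability: 0.7 },
    ],
  },
  {
    // near-silent filler segment with deceptively confident tokens
    no_speech_prob: 0.6,
    avg_logprob: -0.3,
    words: [
      { word: 'um', probability: 0.96 },
      { word: 'uh', probability: 0.97 },
    ],
  },
];

console.log(flatScore(sample));   // 83
console.log(robustScore(sample)); // 73
```

Two filler tokens are enough to lift the score by ten points in this example, which is the same direction and rough size as the drop from 82 to 71 described above.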
Gotcha 3: Letting Rork Max float its Expo SDK can crash on iOS 17 launch
Rork Max scaffolds projects with a pinned Expo SDK major version. The interaction between AdMob, ElevenLabs, and expo-av has, at least twice in my experience, shifted on minor SDK updates — Audio.Recording.createAsync argument shape changed once and cost me two days of post-launch crash hunting in one of my wallpaper apps. Because recording is the heart of a language app, build the following into CI so you catch SDK drift before users do.
# .github/workflows/audio-smoke-test.yml (the meaningful part)
- name: Smoke test audio recording API
  run: |
    npx expo install --check
    npx expo-doctor
    # Verify the installed expo-av major matches what this Expo SDK expects.
    # Assumes the expo package ships bundledNativeModules.json with an expo-av entry.
    node -e "
      const installed = require('./node_modules/expo-av/package.json').version;
      const expected = require('./node_modules/expo/bundledNativeModules.json')['expo-av'];
      const major = (v) => parseInt(String(v).replace(/^[^0-9]*/, '').split('.')[0], 10);
      if (!expected || major(installed) !== major(expected)) {
        console.error('expo-av major version drift detected: installed ' + installed + ', expected ' + expected);
        process.exit(1);
      }
    "
With this in CI, you will be alerted the day Rork Max bumps Expo SDK behind your back.
Operational Checklist for Indie Developers
The implementation pieces are now in place. Before signing off, here is the weekly operational view I keep, plus a few related Rork Lab references for the broader stack.
Five numbers to track every week
API cost divided by paying users (compare against ARPU): if API cost climbs above 60 percent of plan price, immediately break down which provider (ElevenLabs characters / Whisper seconds / Claude tokens) is the cause
Day 7 and Day 30 retention: language learning takes longer to habit-form than typical apps. From my experience I treat 35 percent at Day 7 and 15 percent at Day 30 as the minimum acceptable line
Scatter plot of average pronunciation score vs session length: if low-scoring users consistently leave sessions early, the level is likely set too high
AdMob rewarded video completion rate (free tier): below 60 percent means you need to revisit ad placement or switch to interstitial flow
P95 latency for Whisper, Claude, and ElevenLabs separately: perceived speed is set by P95. If any single provider crosses 4 seconds, it is time to consider streaming or parallelization
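The per-provider P95 check from the last item can be computed with a few lines. This sketch uses the nearest-rank definition of a percentile and the 4-second threshold from the checklist; the function names and the shape of the latency log are assumptions.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const P95_LIMIT_MS = 4000; // "crosses 4 seconds" in the checklist

// Returns the providers whose P95 latency exceeds the limit.
function providersOverBudget(latenciesMs: Record<string, number[]>): string[] {
  return Object.keys(latenciesMs).filter(
    name => percentile(latenciesMs[name], 95) > P95_LIMIT_MS
  );
}

const weekly = {
  whisper: [500, 600, 700],
  claude: [900, 1500, 2100, 3900],
  elevenlabs: [1000, 5000, 1200],
};
console.log(providersOverBudget(weekly)); // [ 'elevenlabs' ]
```

Tracking each provider separately, rather than the end-to-end total, points directly at which stage to stream or parallelize first.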
Companion articles on Rork Lab
This guide stands on its own, but during operations the following pairs well:
Auth and billing: Building a Membership App with Rork and Supabase Auth
Thank you for staying with the guide this far. Every snippet here is written to be the kind of code you could paste during your morning commute and have running by lunch. The most rewarding first step, in my experience, is dropping calculateRobustPronunciationScore into a small sandbox project and recording yourself reading something. Watching your own scores change as you adjust your delivery is where the conviction comes from.
A language app is, fundamentally, an app where users meet a less polished version of themselves. That is why the score needs to be honest, the feedback needs to stay warm, and the loop needs to be short. I hope yours becomes the one that helps someone feel their world widen.