AI Models / 2026-04-08 / Advanced

Building an Immersive AI Language Learning App with Rork — Whisper Speech Recognition × Claude Conversational AI × ElevenLabs TTS

Production implementation notes for an immersive language learning app integrating Whisper, Claude, and ElevenLabs in Rork Max. Covers CEFR-adaptive curriculum, SM-2 spaced repetition, streaming latency optimization, and a freemium pricing model that holds a 55 percent margin.



What I Wished Existed While Preparing Solo Exhibitions in Europe

In 2024, while coordinating exhibitions in Berlin and Milan with local galleries, I kept hitting the same wall. I had prepped with the usual language apps, but the moment the video calls started I could not catch their intonation, and my own pronunciation did not land. The reason was obvious in hindsight: most apps over-index on reading practice and under-deliver on real-time speaking and listening loops.

I am Masaki Hirokawa, an indie developer who has been shipping iOS and Android apps since 2014 (dolice.design). I sat down with those notes and asked: could Whisper, Claude, and ElevenLabs, combined inside Rork Max, close that gap at indie scale? This guide is the design document that came out of that experiment, written so you can move from API keys to production without redoing the research. The flashcard-style basics live in the Rork Language Learning App Tutorial; from here we focus only on the next step.

The three AI APIs we combine are:

  • OpenAI Whisper API — speech-to-text transcription with word-level confidence values that we will repurpose for pronunciation scoring
  • Anthropic Claude API — adaptive conversation simulation, grammar feedback, and curriculum generation that respond to the learner's CEFR level (A1 through C2)
  • ElevenLabs TTS API — native-quality voice synthesis so listening and speaking practice happen on the same screen
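
The idea of repurposing word-level confidence for pronunciation scoring can be sketched as a small pure function. This is a minimal illustration, not the scoring function used in this guide: the `WordConfidence` shape, its field names, and the 0.6 threshold are all assumptions, and you would need to map whatever your transcription call returns onto this shape.

```typescript
// Hypothetical shape: one entry per recognized word with a 0..1 confidence.
// The field names are assumptions, not a documented response format.
interface WordConfidence {
  word: string;
  probability: number; // assumed range 0..1
}

// Average the per-word confidences and scale the result to 0-100.
// Values outside 0..1 are clamped; an empty utterance scores 0.
function scorePronunciation(words: WordConfidence[]): number {
  if (words.length === 0) return 0;
  const total = words.reduce(
    (sum, w) => sum + Math.min(1, Math.max(0, w.probability)),
    0
  );
  return Math.round((total / words.length) * 100);
}

// Words below the threshold are candidates for targeted practice.
// The default threshold is an arbitrary starting point, not a tuned value.
function weakWords(words: WordConfidence[], threshold = 0.6): string[] {
  return words.filter((w) => w.probability < threshold).map((w) => w.word);
}
```

A plain average is easy to reason about but treats short function words and long content words equally, so a production version would likely weight or filter words before averaging.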

One thing I will say upfront. It is tempting to assume that the voice AI quality is everything, but after running these pipelines against real users I am convinced the bigger factors for retention are loop latency and learning consistency. I will return to that observation throughout the guide.


System Architecture — Integrating Three AI APIs

Overall Structure

The immersive language learning app architecture consists of four layers:

  • Presentation Layer: React Native (Expo) UI components — conversation screen, lesson screen, progress dashboard
  • Orchestration Layer: Middleware logic controlling the AI API call sequence, managing the Whisper → Claude → TTS pipeline
  • AI Service Layer: Individual API clients for Whisper (STT), Claude (conversation/analysis), and ElevenLabs (TTS)
  • Data Persistence Layer: Supabase for learning history, progress tracking, and user profiles
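
One way to keep these four layers separate is to have the Orchestration Layer depend only on small interfaces, so each vendor client can be swapped or faked in tests. The sketch below is an assumption about how the boundaries could be drawn; the interface names, method names, and `runTurn` are hypothetical and do not come from any of the three SDKs.

```typescript
// Hypothetical contracts for the AI Service Layer (one per vendor client).
interface SpeechToText {
  transcribe(audio: Uint8Array): Promise<string>;
}
interface ConversationModel {
  reply(userText: string): Promise<string>;
}
interface TextToSpeech {
  synthesize(text: string): Promise<Uint8Array>;
}

// Hypothetical contract for the Data Persistence Layer.
interface LearningStore {
  saveTurn(userText: string, replyText: string): Promise<void>;
}

interface Services {
  stt: SpeechToText;
  llm: ConversationModel;
  tts: TextToSpeech;
  store: LearningStore;
}

interface TurnResult {
  userText: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// Orchestration Layer: knows the call order, not the vendors behind it.
async function runTurn(services: Services, audio: Uint8Array): Promise<TurnResult> {
  const userText = await services.stt.transcribe(audio);
  const replyText = await services.llm.reply(userText);
  const replyAudio = await services.tts.synthesize(replyText);
  await services.store.saveTurn(userText, replyText);
  return { userText, replyText, replyAudio };
}
```

With this split, the Presentation Layer only ever calls `runTurn`, and a unit test can pass in-memory fakes for all four services without touching the network.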

Pipeline Flow

A typical learning session where the user speaks in English follows this pipeline:

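// Full AI language learning pipeline flow
// pipeline/LessonPipeline.ts
// Note: transcribeWithWhisper, analyzeAndRespond, synthesizeSpeech and the
// Message, LearnerProfile, and ConversationFeedback types are not shown in
// this excerpt and must be defined or imported elsewhere in the project.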
 
interface LessonPipelineResult {
  transcription: string;        // Whisper transcription output
  pronunciationScore: number;   // Pronunciation score (0-100)
  feedback: ConversationFeedback; // Claude's feedback
  audioResponse: string;        // ElevenLabs audio as Base64
  nextPrompt: string;           // Next conversation prompt
}
 
export async function processUserUtterance(
  audioBlob: Blob,
  conversationHistory: Message[],
  userProfile: LearnerProfile
): Promise<LessonPipelineResult> {
  // Step 1: Whisper converts speech to text
  const transcription = await transcribeWithWhisper(audioBlob);
 
  // Step 2: Claude evaluates pronunciation + generates response
  const claudeResponse = await analyzeAndRespond({
    userText: transcription.text,
    expectedPhrase: conversationHistory.at(-1)?.expectedResponse,
    history: conversationHistory,
    learnerLevel: userProfile.level,
    targetLanguage: userProfile.targetLanguage,
  });
 
  // Step 3: ElevenLabs synthesizes the response as speech
  const audioResponse = await synthesizeSpeech(
    claudeResponse.responseText,
    userProfile.preferredVoice
  );
 
  return {
    transcription: transcription.text,
    pronunciationScore: claudeResponse.pronunciationScore,
    feedback: claudeResponse,
    audioResponse,
    nextPrompt: claudeResponse.nextPrompt,
  };
}

The key design principle here is the sequential dependency: Whisper's output feeds into Claude, and Claude's output feeds into ElevenLabs. Because each API call is serial, latency optimization becomes critical — we'll address this later with streaming responses.
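
Because the stages run in series, the time to first audio is roughly the sum of all three calls. A common way to soften that, shown here as a hedged sketch and not as the implementation this guide uses, is to cut the model's streamed text into sentences and hand each finished sentence to TTS immediately. `SentenceChunker` is a hypothetical helper, and its boundary rule (a `.`, `!`, or `?` followed by whitespace) is a simplification that will split on abbreviations and will not fit languages with different sentence punctuation.

```typescript
// Accumulates streamed text and emits complete sentences as soon as they end.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed text delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Simplified rule: a sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call once when the stream ends to release any trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

// Usage: feed deltas as they arrive and synthesize each sentence right away.
const chunker = new SentenceChunker();
const ready: string[] = [];
for (const delta of ["Hello the", "re. How are", " you?"]) {
  ready.push(...chunker.push(delta));
}
ready.push(...chunker.flush());
```

In this usage example `ready` ends up holding two sentences, and the first one is available after the second delta, which is the point at which TTS could already start speaking.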



