AI Models / 2026-04-08 / Advanced

Building an Immersive AI Language Learning App with Rork — Whisper Speech Recognition × Claude Conversational AI × ElevenLabs TTS

Production implementation notes for an immersive language learning app integrating Whisper, Claude, and ElevenLabs in Rork Max. Covers CEFR-adaptive curriculum, SM-2 spaced repetition, streaming latency optimization, and a freemium pricing model that holds a 55 percent margin.

What I Wished Existed While Preparing Solo Exhibitions in Europe

In 2024, while coordinating exhibitions in Berlin and Milan with local galleries, I kept hitting the same wall. I had prepped with the usual language apps, but the moment the video calls started I could not catch their intonation, and my own pronunciation did not land. The reason was obvious in hindsight: most apps over-index on reading practice and under-deliver on real-time speaking and listening loops.

I'm Masaki Hirokawa, an indie developer who has been shipping iOS and Android apps since 2014 (dolice.design). I sat down with those notes and asked: could Whisper, Claude, and ElevenLabs, combined inside Rork Max, close that gap at indie scale? This guide is the design document that came out of that experiment, written so you can move from API keys to production without redoing the research. The flashcard-style basics live in the Rork Language Learning App Tutorial; from here we focus only on the next step.

The three AI APIs we combine are:

OpenAI Whisper API — speech-to-text transcription with word-level confidence values that we will repurpose for pronunciation scoring
Anthropic Claude API — adaptive conversation simulation, grammar feedback, and curriculum generation that respond to the learner's CEFR level (A1 through C2)
ElevenLabs TTS API — native-quality voice synthesis so listening and speaking practice happen in the same screen

One thing I will say upfront. It is tempting to assume that the voice AI quality is everything, but after running these pipelines against real users I am convinced the bigger factors for retention are loop latency and learning consistency. I will return to that observation throughout the guide.

System Architecture — Integrating Three AI APIs

Overall Structure

The immersive language learning app architecture consists of four layers:

Presentation Layer: React Native (Expo) UI components — conversation screen, lesson screen, progress dashboard
Orchestration Layer: Middleware logic controlling the AI API call sequence, managing the Whisper → Claude → TTS pipeline
AI Service Layer: Individual API clients for Whisper (STT), Claude (conversation/analysis), and ElevenLabs (TTS)
Data Persistence Layer: Supabase for learning history, progress tracking, and user profiles

Pipeline Flow

A typical learning session where the user speaks in English follows this pipeline:

// Full AI language learning pipeline flow
// pipeline/LessonPipeline.ts
 
interface LessonPipelineResult {
  transcription: string;        // Whisper transcription output
  pronunciationScore: number;   // Pronunciation score (0-100)
  feedback: ConversationFeedback; // Claude's feedback
  audioResponse: string;        // ElevenLabs audio as Base64
  nextPrompt: string;           // Next conversation prompt
}
 
export async function processUserUtterance(
  audioBlob: Blob,
  conversationHistory: Message[],
  userProfile: LearnerProfile
): Promise<LessonPipelineResult> {
  // Step 1: Whisper converts speech to text
  const transcription = await transcribeWithWhisper(audioBlob);
 
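  // Step 2: Claude analyzes the transcript and generates a response.
  // Claude receives text only, so any score it returns is an estimate
  // based on the transcript, not on the audio itself.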
  const claudeResponse = await analyzeAndRespond({
    userText: transcription.text,
    expectedPhrase: conversationHistory.at(-1)?.expectedResponse,
    history: conversationHistory,
    learnerLevel: userProfile.level,
    targetLanguage: userProfile.targetLanguage,
  });
 
  // Step 3: ElevenLabs synthesizes the response as speech
  const audioResponse = await synthesizeSpeech(
    claudeResponse.responseText,
    userProfile.preferredVoice
  );
 
  return {
    transcription: transcription.text,
    pronunciationScore: claudeResponse.pronunciationScore,
    feedback: claudeResponse,
    audioResponse,
    nextPrompt: claudeResponse.nextPrompt,
  };
}

The key design principle here is the sequential dependency: Whisper's output feeds into Claude, and Claude's output feeds into ElevenLabs. Because each API call is serial, latency optimization becomes critical — we'll address this later with streaming responses.
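
Before optimizing, it helps to know which of the three serial stages actually dominates. The following is a minimal sketch of a timing wrapper; `timed`, `totalLatency`, and `StageTiming` are hypothetical names of my own, not part of any of the three SDKs.

```typescript
// Hypothetical helper types and functions for measuring the serial pipeline.
type StageTiming = { stage: string; ms: number };

// Runs one async stage and records how long it took, even if it throws.
async function timed<T>(
  stage: string,
  timings: StageTiming[],
  run: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await run();
  } finally {
    timings.push({ stage, ms: Date.now() - start });
  }
}

// Because the stages run one after another, the perceived delay is their sum.
function totalLatency(timings: StageTiming[]): number {
  return timings.reduce((sum, t) => sum + t.ms, 0);
}
```

Inside `processUserUtterance`, each call could be wrapped as `await timed('whisper', timings, () => transcribeWithWhisper(audioBlob))`, and the collected timings logged per session.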

WHAT YOU'LL LEARN

✦ A three-stage Whisper × Claude × ElevenLabs pipeline you can copy and paste, plus the full Claude system prompt that adapts behavior across CEFR A1 to C2

✦ A robust pronunciation scoring function built from Whisper word probabilities, along with the SM-2 spaced repetition and adaptive curriculum scoring code

✦ API cost breakdown of about $0.05 per session, and the freemium design that keeps a 55 percent gross margin while shipping streaming TTS for sub-second perceived latency

Implementing Whisper Speech Recognition — Transcription and Pronunciation Scoring

Audio Recording with expo-av

Building on the fundamentals covered in the Rork + OpenAI Whisper API Speech Recognition Guide, here's the recording implementation tuned for language learning:

// hooks/useAudioRecorder.ts
import { Audio } from 'expo-av';
import { useState, useRef } from 'react';
 
export function useAudioRecorder() {
  const [isRecording, setIsRecording] = useState(false);
  const recordingRef = useRef<Audio.Recording | null>(null);
 
  const startRecording = async () => {
    // Configure audio session for language learning
    await Audio.setAudioModeAsync({
      allowsRecordingIOS: true,
      playsInSilentModeIOS: true,
      // Interrupt other audio for clean recording
      interruptionModeIOS: 1, // DoNotMix
      interruptionModeAndroid: 1,
    });
 
    const { recording } = await Audio.Recording.createAsync(
      // Optimal recording settings for Whisper
      {
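        android: {
          extension: '.m4a',
          // Named constants avoid magic numbers: the raw value 3 for
          // outputFormat maps to AMR_NB on Android, not MPEG_4
          outputFormat: Audio.AndroidOutputFormat.MPEG_4,
          audioEncoder: Audio.AndroidAudioEncoder.AAC,
          sampleRate: 16000, // 16 kHz mono matches the rate Whisper operates on
          numberOfChannels: 1,
          bitRate: 128000,
        },
        ios: {
          extension: '.m4a',
          outputFormat: Audio.IOSOutputFormat.MPEG4AAC,
          audioQuality: Audio.IOSAudioQuality.MAX,
          sampleRate: 16000,
          numberOfChannels: 1,
          bitRate: 128000,
        },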
        web: { mimeType: 'audio/webm', bitsPerSecond: 128000 },
      }
    );
 
    recordingRef.current = recording;
    setIsRecording(true);
  };
 
  const stopRecording = async (): Promise<Blob | null> => {
    if (!recordingRef.current) return null;
 
    await recordingRef.current.stopAndUnloadAsync();
    const uri = recordingRef.current.getURI();
    recordingRef.current = null;
    setIsRecording(false);
 
    if (!uri) return null;
 
    // Convert file to Blob for API submission
    const response = await fetch(uri);
    return response.blob();
  };
 
  return { isRecording, startRecording, stopRecording };
}

Whisper API Call with Pronunciation Data

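With Whisper's verbose_json response format and word-level timestamp granularity, the response carries per-word timings alongside segment-level confidence fields (avg_logprob and no_speech_prob). The code below also reads a per-word probability value when one is present. The hosted endpoint's documented word entries list only word, start, and end, so check the shape of the payload you actually receive before relying on that field for pronunciation scoring: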

// services/whisperService.ts
 
interface WhisperDetailedResponse {
  text: string;
  language: string;
  segments: Array<{
    text: string;
    start: number;
    end: number;
    // Per-segment recognition confidence
    avg_logprob: number;
    no_speech_prob: number;
    words?: Array<{
      word: string;
      start: number;
      end: number;
      probability: number; // 0.0–1.0: pronunciation clarity metric
    }>;
  }>;
}
 
export async function transcribeWithWhisper(
  audioBlob: Blob,
  expectedLanguage: string = 'en'
): Promise<WhisperDetailedResponse> {
  const formData = new FormData();
  formData.append('file', audioBlob, 'recording.m4a');
  formData.append('model', 'whisper-1');
  formData.append('language', expectedLanguage);
  formData.append('response_format', 'verbose_json');
  formData.append('timestamp_granularities[]', 'word');
 
  const response = await fetch(
    'https://api.openai.com/v1/audio/transcriptions',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: formData,
    }
  );
 
  if (!response.ok) {
    throw new Error(`Whisper API error: ${response.status}`);
  }
 
  return response.json();
}
 
// Calculate pronunciation score from word-level confidence
export function calculatePronunciationScore(
  response: WhisperDetailedResponse,
  expectedText: string
): { overall: number; wordScores: Array<{ word: string; score: number }> } {
  const words = response.segments.flatMap(s => s.words || []);
 
  const wordScores = words.map(w => ({
    word: w.word.trim(),
    // Convert probability to 0-100 score
    score: Math.round(w.probability * 100),
  }));
 
  // Average score across all words
  const overall = wordScores.length > 0
    ? Math.round(
        wordScores.reduce((sum, w) => sum + w.score, 0) / wordScores.length
      )
    : 0;
 
  return { overall, wordScores };
}
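
If your responses do not include per-word probabilities, the function above returns 0 for every utterance. One possible fallback, sketched here under my own assumptions, is to derive a coarse score from the segment-level fields instead. `segmentConfidenceScore` is a hypothetical name, the 0.5 silence threshold is an arbitrary starting point, and `Math.exp(avg_logprob)` is only a rough stand-in for average token probability. Treat the result as recognition confidence, which is a proxy for clarity and not a phonetic assessment.

```typescript
// Minimal shape of the segment fields this sketch relies on.
interface SegmentLike {
  avg_logprob: number;    // mean log-probability of the segment's tokens
  no_speech_prob: number; // probability that the segment contains no speech
}

// Hypothetical fallback: converts segment log-probabilities to a 0-100 score.
function segmentConfidenceScore(segments: SegmentLike[]): number {
  // Skip segments the model itself considers likely silence or noise
  // (0.5 is an assumed threshold, tune it against real recordings).
  const spoken = segments.filter(s => s.no_speech_prob < 0.5);
  if (spoken.length === 0) return 0;

  const meanProbability =
    spoken.reduce((sum, s) => sum + Math.exp(s.avg_logprob), 0) /
    spoken.length;

  // Clamp to the 0-1 range before scaling, in case of unexpected input
  return Math.round(Math.min(1, Math.max(0, meanProbability)) * 100);
}
```

This gives one score per utterance instead of per word, so the per-word badge in the UI would need to fall back to a single overall value.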

Implementing Claude Conversational AI — Adaptive Curriculum and Grammar Feedback

Designing the Conversation Simulation

The most critical aspect of integrating Claude for language learning is system prompt design. The prompt must dynamically adjust vocabulary difficulty, speech pace guidance, and grammar focus based on the learner's CEFR level (A1 through C2).

// services/claudeConversationService.ts
 
interface ConversationContext {
  userText: string;
  expectedPhrase?: string;
  history: Message[];
  learnerLevel: 'A1' | 'A2' | 'B1' | 'B2' | 'C1' | 'C2';
  targetLanguage: string;
  nativeLanguage: string;
  lessonTopic: string;
}
 
const LEVEL_GUIDELINES: Record<string, string> = {
  A1: `
    - Use present tense only. Keep responses under 5 words per sentence
    - Limit vocabulary to 500 basic words (family, food, weather, greetings)
    - Correct only 1 mistake per response
    - Always use encouraging tone — praise correct parts first
  `,
  A2: `
    - Use present and past tense. Up to 8 words per sentence
    - Vocabulary around 1,500 words (shopping, travel, daily life)
    - Correct up to 2 grammar mistakes. Suggest 1 alternative expression
  `,
  B1: `
    - All tenses allowed. Complex sentences OK. Natural pace
    - Actively introduce idioms and phrasal verbs
    - Provide feedback on nuance, not just grammar
  `,
  B2: `
    - Include business and academic vocabulary
    - Ask questions that encourage logical arguments and opinions
    - Comment on collocations and register appropriateness
  `,
  C1: `
    - Native-equivalent conversation. Include jargon and slang
    - Point out subtle nuances and cultural context differences
    - Teach rhetorical devices and persuasive expressions
  `,
  C2: `
    - Mastery level. Pursue refined expression
    - Discuss style and tone variation, metaphor and irony
    - Explain differences from common native speaker mistakes
  `,
};
 
export async function generateConversationResponse(
  context: ConversationContext
): Promise<ConversationFeedback> {
  const systemPrompt = `You are an expert language tutor conducting an immersive ${context.targetLanguage} conversation lesson.
 
LEARNER PROFILE:
- Native language: ${context.nativeLanguage}
- Current level: ${context.learnerLevel} (CEFR)
- Lesson topic: ${context.lessonTopic}
 
LEVEL-SPECIFIC GUIDELINES:
${LEVEL_GUIDELINES[context.learnerLevel]}
 
RESPONSE FORMAT (JSON):
{
  "responseText": "Your conversational response in ${context.targetLanguage}",
  "pronunciationScore": 0-100,
  "grammarFeedback": [
    {
      "original": "what the learner said",
      "corrected": "the correct version",
      "explanation": "brief explanation in ${context.nativeLanguage}",
      "rule": "grammar rule name"
    }
  ],
  "vocabularyTip": {
    "word": "a new word used in your response",
    "meaning": "meaning in ${context.nativeLanguage}",
    "exampleSentence": "another example"
  },
  "nextPrompt": "a follow-up question to continue the conversation",
  "encouragement": "a brief positive comment in ${context.nativeLanguage}"
}
 
IMPORTANT:
- Always respond naturally as a conversation partner FIRST, tutor SECOND
- Keep corrections constructive and non-intimidating
- Gradually increase complexity when the learner performs well
- If pronunciation score is below 60, suggest specific phonemes to practice`;
 
  const messages = [
    ...context.history.map(m => ({
      role: m.role as 'user' | 'assistant',
      content: m.content,
    })),
    {
      role: 'user' as const,
      content: `The learner said: "${context.userText}"${
        context.expectedPhrase
          ? `\n(Expected response was similar to: "${context.expectedPhrase}")`
          : ''
      }`,
    },
  ];
 
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY!,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      system: systemPrompt,
      messages,
    }),
  });
 
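  if (!response.ok) {
    throw new Error(`Claude API error: ${response.status}`);
  }

  const data = await response.json();
  // The first content block holds the JSON text requested in the system prompt
  const content: string = data.content[0].text;

  // JSON.parse throws if the model adds prose or a code fence around the JSON
  return JSON.parse(content) as ConversationFeedback;
}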
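
Parsing the model output directly is brittle, because a model can wrap the JSON in a sentence or a code fence even when the prompt asks for JSON only. Here is a small defensive sketch; `extractJsonObject` and `parseTutorReply` are hypothetical helpers, and the only field validated is `responseText` from the response format above.

```typescript
// Hypothetical helper: pulls the outermost {...} span out of raw model text.
function extractJsonObject(raw: string): unknown {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  return JSON.parse(raw.slice(start, end + 1));
}

// Hypothetical helper: checks the one field the pipeline cannot work without.
function parseTutorReply(raw: string): { responseText: string } & Record<string, unknown> {
  const parsed = extractJsonObject(raw);
  if (
    typeof parsed !== 'object' ||
    parsed === null ||
    typeof (parsed as Record<string, unknown>).responseText !== 'string'
  ) {
    throw new Error('Model output is missing responseText');
  }
  return parsed as { responseText: string } & Record<string, unknown>;
}
```

Catching these errors in the pipeline lets the app retry or show a fallback line instead of failing the whole turn.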

Adaptive Curriculum Engine

This engine automatically adjusts lesson content based on the learner's accuracy rates, pronunciation scores, and session frequency:

// engine/AdaptiveCurriculumEngine.ts
 
interface LearnerMetrics {
  averagePronunciationScore: number;
  grammarAccuracy: number; // 0-1
  vocabularyRetention: number; // 0-1
  sessionsThisWeek: number;
  streakDays: number;
  weakTopics: string[];
  strongTopics: string[];
}
 
interface LessonPlan {
  topic: string;
  difficulty: number; // 1-10
  focusAreas: ('pronunciation' | 'grammar' | 'vocabulary' | 'fluency')[];
  estimatedDuration: number; // minutes
  warmupActivity: string;
  mainActivity: string;
  reviewActivity: string;
}
 
export function generateNextLesson(
  metrics: LearnerMetrics,
  completedLessons: string[]
): LessonPlan {
  // Determine focus areas based on scores
  const focusAreas: LessonPlan['focusAreas'] = [];
 
  if (metrics.averagePronunciationScore < 70) {
    focusAreas.push('pronunciation');
  }
  if (metrics.grammarAccuracy < 0.7) {
    focusAreas.push('grammar');
  }
  if (metrics.vocabularyRetention < 0.6) {
    focusAreas.push('vocabulary');
  }
  if (focusAreas.length === 0) {
    // All areas performing well — focus on fluency
    focusAreas.push('fluency');
  }
 
  // Prioritize weak topics (spaced repetition principle)
  const topic = metrics.weakTopics.length > 0
    ? metrics.weakTopics[0]
    : selectNewTopic(completedLessons);
 
  // Bonus content for maintaining streaks
  const bonusDuration = metrics.streakDays >= 7 ? 5 : 0;
 
  return {
    topic,
    difficulty: calculateDifficulty(metrics),
    focusAreas,
    estimatedDuration: 15 + bonusDuration,
    warmupActivity: generateWarmup(topic, focusAreas),
    mainActivity: generateMainActivity(topic, focusAreas),
    reviewActivity: generateReview(metrics.weakTopics),
  };
}
 
function calculateDifficulty(metrics: LearnerMetrics): number {
  // Composite score to determine difficulty (1-10)
  const composite =
    metrics.averagePronunciationScore * 0.3 +
    metrics.grammarAccuracy * 100 * 0.3 +
    metrics.vocabularyRetention * 100 * 0.4;
 
  // Above 80 → increase difficulty, below 50 → decrease
  if (composite > 80) return Math.min(10, Math.ceil(composite / 10));
  if (composite < 50) return Math.max(1, Math.floor(composite / 15));
  return Math.round(composite / 12);
}
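
`generateNextLesson` calls `selectNewTopic`, which is not shown in this guide. As one possible shape, assuming a fixed, ordered topic list (the `TOPIC_POOL` contents here are placeholders of my own), it could pick the first topic the learner has not completed yet:

```typescript
// Hypothetical ordered topic list; replace with your own curriculum.
const TOPIC_POOL: readonly string[] = [
  'greetings',
  'food',
  'travel',
  'work',
];

// Returns the first topic not yet completed. When every topic is done,
// it cycles back to the first one rather than returning nothing.
function selectNewTopic(
  completedLessons: string[],
  pool: readonly string[] = TOPIC_POOL
): string {
  const done = new Set(completedLessons);
  return pool.find(topic => !done.has(topic)) ?? pool[0];
}
```

A production version would likely weight topics by CEFR level as well, but a deterministic order makes the curriculum easier to debug.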

Implementing ElevenLabs TTS — Native-Quality Voice Output

Building the Speech Synthesis Service

The Rork TTS Implementation Guide covered basic text-to-speech with expo-speech, but language learning demands pronunciation quality that closely mimics native speakers. ElevenLabs produces natural-sounding speech with realistic intonation and emotion.

// services/elevenLabsService.ts
import { Audio } from 'expo-av';
import * as FileSystem from 'expo-file-system';
 
// Voice IDs suited for language learning
const VOICE_PROFILES = {
  en: {
    male: 'pNInz6obpgDQGcFmaJgB',   // Adam — Clear enunciation
    female: 'EXAVITQu4vr4xnSDxMaL', // Bella — Calm delivery
  },
  ja: {
    male: 'bIHbv24MWmeRgasZH58o',
    female: 'jsCqWAovK2LkecY7zXl4',
  },
  es: {
    male: 'onwK4e9ZLuTAKqWW03F9',
    female: 'XB0fDUnXU5powFXDhCwa',
  },
} as const;
 
interface TTSOptions {
  text: string;
  language: string;
  voiceGender: 'male' | 'female';
  speed: 'slow' | 'normal' | 'fast'; // Adjusts based on learner level
  stability: number; // 0-1: lower = more expressive, higher = more stable
}
 
export async function synthesizeSpeech(
  options: TTSOptions
): Promise<Audio.Sound> {
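  // Widen the const map so an arbitrary language string can index it;
  // unsupported languages fall back to the English female voice
  const profiles: Record<string, { male: string; female: string }> =
    VOICE_PROFILES;
  const voiceId =
    profiles[options.language]?.[options.voiceGender] ??
    VOICE_PROFILES.en.female;
 
  // Note: `speed` is not sent to the API in this request;
  // here it only nudges similarity_boost for slower learners
  const similarityBoost = options.speed === 'slow' ? 0.9 : 0.75;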
 
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'xi-api-key': process.env.ELEVENLABS_API_KEY!,
      },
      body: JSON.stringify({
        text: options.text,
        model_id: 'eleven_multilingual_v2',
        voice_settings: {
          stability: options.stability,
          similarity_boost: similarityBoost,
          style: 0.3,
          use_speaker_boost: true,
        },
      }),
    }
  );
 
  if (!response.ok) {
    throw new Error(`ElevenLabs API error: ${response.status}`);
  }
 
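  // Save audio data locally for playback.
  // Encode in chunks: spreading a whole audio file into
  // String.fromCharCode can exceed the argument limit and throw a RangeError.
  const bytes = new Uint8Array(await response.arrayBuffer());
  let binary = '';
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
  }
  const base64 = btoa(binary);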
  const fileUri = `${FileSystem.cacheDirectory}tts_${Date.now()}.mp3`;
 
  await FileSystem.writeAsStringAsync(fileUri, base64, {
    encoding: FileSystem.EncodingType.Base64,
  });
 
  const { sound } = await Audio.Sound.createAsync({ uri: fileUri });
  return sound;
}
 
// Auto-adjust TTS parameters based on learner level
export function getTTSParamsForLevel(
  level: string
): Partial<TTSOptions> {
  switch (level) {
    case 'A1':
    case 'A2':
      return { speed: 'slow', stability: 0.9 };
    case 'B1':
    case 'B2':
      return { speed: 'normal', stability: 0.75 };
    case 'C1':
    case 'C2':
      return { speed: 'fast', stability: 0.5 };
    default:
      return { speed: 'normal', stability: 0.75 };
  }
}

Audio Caching for Cost Optimization

To reduce API costs, we cache synthesized audio for frequently used phrases:

// utils/audioCache.ts
import * as FileSystem from 'expo-file-system';
import AsyncStorage from '@react-native-async-storage/async-storage';
 
const CACHE_DIR = `${FileSystem.cacheDirectory}tts_cache/`;
const CACHE_INDEX_KEY = 'tts_cache_index';
const MAX_CACHE_SIZE_MB = 100;
 
interface CacheEntry {
  key: string;
  filePath: string;
  createdAt: number;
  sizeBytes: number;
  accessCount: number;
}
 
export class AudioCache {
  private index: Map<string, CacheEntry> = new Map();
 
  async initialize() {
    const dirInfo = await FileSystem.getInfoAsync(CACHE_DIR);
    if (!dirInfo.exists) {
      await FileSystem.makeDirectoryAsync(CACHE_DIR, {
        intermediates: true,
      });
    }
 
    const stored = await AsyncStorage.getItem(CACHE_INDEX_KEY);
    if (stored) {
      const entries: CacheEntry[] = JSON.parse(stored);
      entries.forEach(e => this.index.set(e.key, e));
    }
  }
 
  private getCacheKey(text: string, voiceId: string): string {
    const input = `${text}:${voiceId}`;
    let hash = 0;
    for (let i = 0; i < input.length; i++) {
      const char = input.charCodeAt(i);
      hash = (hash << 5) - hash + char;
      hash |= 0;
    }
    return `tts_${Math.abs(hash).toString(36)}`;
  }
 
  async get(text: string, voiceId: string): Promise<string | null> {
    const key = this.getCacheKey(text, voiceId);
    const entry = this.index.get(key);
 
    if (!entry) return null;
 
    const info = await FileSystem.getInfoAsync(entry.filePath);
    if (!info.exists) {
      this.index.delete(key);
      return null;
    }
 
    // Update access count (eviction removes the least-accessed entries first)
    entry.accessCount++;
    return entry.filePath;
  }
 
  async store(
    text: string,
    voiceId: string,
    audioBase64: string
  ): Promise<string> {
    const key = this.getCacheKey(text, voiceId);
    const filePath = `${CACHE_DIR}${key}.mp3`;
 
    await FileSystem.writeAsStringAsync(filePath, audioBase64, {
      encoding: FileSystem.EncodingType.Base64,
    });
 
    const info = await FileSystem.getInfoAsync(filePath);
 
    this.index.set(key, {
      key,
      filePath,
      createdAt: Date.now(),
      sizeBytes: info.exists ? (info.size ?? 0) : 0,
      accessCount: 1,
    });
 
    await this.evictIfNeeded();
    await this.persistIndex();
 
    return filePath;
  }
 
  private async evictIfNeeded() {
    const totalSize = Array.from(this.index.values()).reduce(
      (sum, e) => sum + e.sizeBytes, 0
    );
 
    if (totalSize <= MAX_CACHE_SIZE_MB * 1024 * 1024) return;
 
    // Evict least-accessed entries first
    const sorted = Array.from(this.index.values()).sort(
      (a, b) => a.accessCount - b.accessCount
    );
 
    let currentSize = totalSize;
    for (const entry of sorted) {
      if (currentSize <= MAX_CACHE_SIZE_MB * 1024 * 1024 * 0.8) break;
      await FileSystem.deleteAsync(entry.filePath, { idempotent: true });
      this.index.delete(entry.key);
      currentSize -= entry.sizeBytes;
    }
  }
 
  private async persistIndex() {
    const entries = Array.from(this.index.values());
    await AsyncStorage.setItem(CACHE_INDEX_KEY, JSON.stringify(entries));
  }
}
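
To show where the cache sits in the request path, here is a minimal cache-aside sketch. It depends only on the `get` and `store` signatures of the class above, expressed as an interface so it can be exercised without Expo. `getOrSynthesize` and the `synthesizeBase64` callback are hypothetical names for illustration.

```typescript
// The two AudioCache methods this sketch relies on.
interface PhraseCache {
  get(text: string, voiceId: string): Promise<string | null>;
  store(text: string, voiceId: string, audioBase64: string): Promise<string>;
}

// Hypothetical wrapper: returns a cached file when available and only
// calls the (paid) synthesis function on a miss.
async function getOrSynthesize(
  cache: PhraseCache,
  text: string,
  voiceId: string,
  synthesizeBase64: (text: string, voiceId: string) => Promise<string>
): Promise<{ filePath: string; cacheHit: boolean }> {
  const cached = await cache.get(text, voiceId);
  if (cached) {
    return { filePath: cached, cacheHit: true };
  }

  const audioBase64 = await synthesizeBase64(text, voiceId);
  const filePath = await cache.store(text, voiceId, audioBase64);
  return { filePath, cacheHit: false };
}
```

Fixed lesson prompts and greetings benefit most from this, since free-form tutor replies rarely repeat word for word.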

Conversation Lesson UI Implementation

Main Conversation Interface

Here's the full chat-style lesson screen that enables natural voice interaction:

// screens/ConversationLessonScreen.tsx
import React, { useState, useCallback, useRef, useEffect } from 'react';
import {
  View, Text, ScrollView, TouchableOpacity,
  StyleSheet, Animated, ActivityIndicator,
} from 'react-native';
import { useAudioRecorder } from '../hooks/useAudioRecorder';
import { processUserUtterance } from '../pipeline/LessonPipeline';
import { Audio } from 'expo-av';
 
interface ChatMessage {
  id: string;
  role: 'tutor' | 'learner';
  text: string;
  audioUri?: string;
  pronunciationScore?: number;
  feedback?: {
    corrections: Array<{
      original: string;
      corrected: string;
      explanation: string;
    }>;
    vocabularyTip?: { word: string; meaning: string };
    encouragement: string;
  };
}
 
export default function ConversationLessonScreen() {
  const [messages, setMessages] = useState<ChatMessage[]>([]);
  const [isProcessing, setIsProcessing] = useState(false);
  const [showFeedback, setShowFeedback] = useState<string | null>(null);
  const { isRecording, startRecording, stopRecording } = useAudioRecorder();
  const scrollViewRef = useRef<ScrollView>(null);
  const pulseAnim = useRef(new Animated.Value(1)).current;
 
  // Pulse animation while recording
  useEffect(() => {
    if (!isRecording) {
      pulseAnim.setValue(1);
      return;
    }

    const loop = Animated.loop(
      Animated.sequence([
        Animated.timing(pulseAnim, {
          toValue: 1.3,
          duration: 600,
          useNativeDriver: true,
        }),
        Animated.timing(pulseAnim, {
          toValue: 1,
          duration: 600,
          useNativeDriver: true,
        }),
      ])
    );
    loop.start();

    // Stop the loop when recording ends or the screen unmounts
    return () => loop.stop();
  }, [isRecording, pulseAnim]);
 
  const handleRecordToggle = useCallback(async () => {
    if (isRecording) {
      const audioBlob = await stopRecording();
      if (!audioBlob) return;
 
      setIsProcessing(true);
 
      try {
        const result = await processUserUtterance(
          audioBlob,
          messages.map(m => ({
            role: m.role === 'tutor' ? 'assistant' : 'user',
            content: m.text,
          })),
          { /* learner profile from context */ } as any
        );
 
        // Add learner's utterance
        const learnerMessage: ChatMessage = {
          id: `learner-${Date.now()}`,
          role: 'learner',
          text: result.transcription,
          pronunciationScore: result.pronunciationScore,
          feedback: {
            corrections: result.feedback.grammarFeedback || [],
            vocabularyTip: result.feedback.vocabularyTip,
            encouragement: result.feedback.encouragement,
          },
        };
 
        // Add tutor's response
        const tutorMessage: ChatMessage = {
          id: `tutor-${Date.now()}`,
          role: 'tutor',
          text: result.feedback.responseText,
          audioUri: result.audioResponse,
        };
 
        setMessages(prev => [...prev, learnerMessage, tutorMessage]);
 
        // Auto-play tutor's voice response
        if (result.audioResponse) {
          const { sound } = await Audio.Sound.createAsync({
            uri: result.audioResponse,
          });
          await sound.playAsync();
        }
      } catch (error) {
        console.error('Pipeline error:', error);
      } finally {
        setIsProcessing(false);
      }
    } else {
      await startRecording();
    }
  }, [isRecording, messages]);
 
  return (
    <View style={styles.container}>
      <ScrollView
        ref={scrollViewRef}
        style={styles.messageList}
        onContentSizeChange={() =>
          scrollViewRef.current?.scrollToEnd({ animated: true })
        }
      >
        {messages.map(msg => (
          <View
            key={msg.id}
            style={[
              styles.messageBubble,
              msg.role === 'learner'
                ? styles.learnerBubble
                : styles.tutorBubble,
            ]}
          >
            <Text style={styles.messageText}>{msg.text}</Text>
 
            {/* Pronunciation score badge */}
            {msg.pronunciationScore !== undefined && (
              <TouchableOpacity
                style={[
                  styles.scoreBadge,
                  {
                    backgroundColor:
                      msg.pronunciationScore >= 80
                        ? '#22c55e'
                        : msg.pronunciationScore >= 60
                        ? '#f59e0b'
                        : '#ef4444',
                  },
                ]}
                onPress={() => setShowFeedback(msg.id)}
              >
                <Text style={styles.scoreText}>
                  {msg.pronunciationScore}/100
                </Text>
              </TouchableOpacity>
            )}
 
            {/* Expandable feedback panel */}
            {showFeedback === msg.id && msg.feedback && (
              <View style={styles.feedbackPanel}>
                <Text style={styles.encouragement}>
                  {msg.feedback.encouragement}
                </Text>
                {msg.feedback.corrections.map((c, i) => (
                  <View key={i} style={styles.correctionItem}>
                    <Text style={styles.original}>{c.original}</Text>
                    <Text style={styles.arrow}>→</Text>
                    <Text style={styles.corrected}>{c.corrected}</Text>
                    <Text style={styles.explanation}>{c.explanation}</Text>
                  </View>
                ))}
              </View>
            )}
          </View>
        ))}
      </ScrollView>
 
      {/* Record button */}
      <View style={styles.recordArea}>
        {isProcessing ? (
          <ActivityIndicator size="large" color="#6366f1" />
        ) : (
          <TouchableOpacity onPress={handleRecordToggle}>
            <Animated.View
              style={[
                styles.recordButton,
                isRecording && styles.recordingActive,
                { transform: [{ scale: pulseAnim }] },
              ]}
            >
              <Text style={styles.recordIcon}>
                {isRecording ? '⏹' : '🎙'}
              </Text>
            </Animated.View>
          </TouchableOpacity>
        )}
        <Text style={styles.recordHint}>
          {isRecording
            ? 'Tap to send'
            : isProcessing
            ? 'AI analyzing...'
            : 'Tap to speak'}
        </Text>
      </View>
    </View>
  );
}
 
const styles = StyleSheet.create({
  container: { flex: 1, backgroundColor: '#f8fafc' },
  messageList: { flex: 1, padding: 16 },
  messageBubble: {
    maxWidth: '80%',
    padding: 14,
    borderRadius: 18,
    marginBottom: 12,
  },
  learnerBubble: {
    alignSelf: 'flex-end',
    backgroundColor: '#6366f1',
  },
  tutorBubble: {
    alignSelf: 'flex-start',
    backgroundColor: '#ffffff',
    borderWidth: 1,
    borderColor: '#e2e8f0',
  },
  messageText: { fontSize: 16, lineHeight: 22 },
  scoreBadge: {
    alignSelf: 'flex-end',
    paddingHorizontal: 10,
    paddingVertical: 4,
    borderRadius: 12,
    marginTop: 6,
  },
  scoreText: { color: '#fff', fontSize: 13, fontWeight: '700' },
  feedbackPanel: {
    marginTop: 10,
    padding: 12,
    backgroundColor: '#f1f5f9',
    borderRadius: 12,
  },
  encouragement: {
    fontSize: 14,
    color: '#22c55e',
    fontWeight: '600',
    marginBottom: 8,
  },
  correctionItem: { marginBottom: 8 },
  original: { color: '#ef4444', textDecorationLine: 'line-through' },
  arrow: { color: '#94a3b8' },
  corrected: { color: '#22c55e', fontWeight: '600' },
  explanation: { color: '#64748b', fontSize: 13, marginTop: 2 },
  recordArea: { alignItems: 'center', paddingVertical: 24 },
  recordButton: {
    width: 72,
    height: 72,
    borderRadius: 36,
    backgroundColor: '#6366f1',
    alignItems: 'center',
    justifyContent: 'center',
  },
  recordingActive: { backgroundColor: '#ef4444' },
  recordIcon: { fontSize: 28 },
  recordHint: { marginTop: 8, color: '#94a3b8', fontSize: 14 },
});

Learning Progress Persistence — Supabase Integration

Data Model Design

We store learner profiles, session history, and pronunciation score trends in Supabase:

-- Supabase table design
-- Learner profiles
CREATE TABLE learner_profiles (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users(id),
  native_language TEXT NOT NULL DEFAULT 'ja',
  target_language TEXT NOT NULL DEFAULT 'en',
  cefr_level TEXT NOT NULL DEFAULT 'A1',
  total_sessions INTEGER DEFAULT 0,
  total_minutes REAL DEFAULT 0,
  streak_days INTEGER DEFAULT 0,
  last_session_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
 
-- Learning sessions
CREATE TABLE learning_sessions (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  learner_id UUID REFERENCES learner_profiles(id),
  topic TEXT NOT NULL,
  duration_seconds INTEGER,
  avg_pronunciation_score REAL,
  grammar_accuracy REAL,
  new_vocabulary_count INTEGER,
  messages_count INTEGER,
  started_at TIMESTAMPTZ DEFAULT NOW(),
  ended_at TIMESTAMPTZ
);
 
-- Vocabulary cards for spaced repetition
CREATE TABLE vocabulary_cards (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  learner_id UUID REFERENCES learner_profiles(id),
  word TEXT NOT NULL,
  meaning TEXT NOT NULL,
  example_sentence TEXT,
  ease_factor REAL DEFAULT 2.5,  -- SM-2 algorithm
  interval_days INTEGER DEFAULT 1,
  repetitions INTEGER DEFAULT 0,
  next_review_at TIMESTAMPTZ DEFAULT NOW(),
  created_at TIMESTAMPTZ DEFAULT NOW()
);
 
-- Pronunciation score history (for progress charts)
CREATE TABLE pronunciation_history (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  learner_id UUID REFERENCES learner_profiles(id),
  session_id UUID REFERENCES learning_sessions(id),
  word TEXT NOT NULL,
  score INTEGER NOT NULL,
  recorded_at TIMESTAMPTZ DEFAULT NOW()
);
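
On the client, rows from vocabulary_cards have to be mapped into whatever shape the review scheduler consumes. The following is a minimal sketch of that mapping plus a due-card filter, assuming rows arrive as plain JSON with the column names above. VocabularyCardRow, Card, toCard, and selectDueCards are hypothetical names for illustration, not part of the Supabase client.

```typescript
// Hypothetical row shape mirroring the vocabulary_cards columns above.
interface VocabularyCardRow {
  id: string;
  word: string;
  meaning: string;
  ease_factor: number;
  interval_days: number;
  repetitions: number;
  next_review_at: string; // TIMESTAMPTZ serialized as an ISO 8601 string
}

interface Card {
  id: string;
  word: string;
  meaning: string;
  easeFactor: number;
  interval: number; // days
  repetitions: number;
  nextReviewAt: Date;
}

// Convert snake_case database columns to the camelCase shape used in app code.
function toCard(row: VocabularyCardRow): Card {
  return {
    id: row.id,
    word: row.word,
    meaning: row.meaning,
    easeFactor: row.ease_factor,
    interval: row.interval_days,
    repetitions: row.repetitions,
    nextReviewAt: new Date(row.next_review_at),
  };
}

// Return cards whose review time has passed, most overdue first.
function selectDueCards(rows: VocabularyCardRow[], now: Date, limit = 20): Card[] {
  return rows
    .map(toCard)
    .filter(card => card.nextReviewAt.getTime() <= now.getTime())
    .sort((a, b) => a.nextReviewAt.getTime() - b.nextReviewAt.getTime())
    .slice(0, limit);
}
```

The same filter can usually be pushed into the query itself; a client-side version like this is mainly useful for decks that are already cached on the device.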

SM-2 Spaced Repetition Algorithm

For long-term vocabulary retention, we implement SuperMemo's SM-2 algorithm:

// engine/SpacedRepetition.ts
 
interface ReviewResult {
  quality: number; // 0-5 (0 = complete blackout, 5 = perfect)
}
 
interface CardSchedule {
  easeFactor: number;
  interval: number; // days
  repetitions: number;
  nextReviewAt: Date;
}
 
export function calculateNextReview(
  card: {
    easeFactor: number;
    interval: number;
    repetitions: number;
  },
  review: ReviewResult
): CardSchedule {
  let { easeFactor, interval, repetitions } = card;
  const { quality } = review;
 
  if (quality < 3) {
    // Incorrect — reset
    repetitions = 0;
    interval = 1;
  } else {
    // Correct — extend interval
    if (repetitions === 0) {
      interval = 1;
    } else if (repetitions === 1) {
      interval = 6;
    } else {
      interval = Math.round(interval * easeFactor);
    }
    repetitions++;
  }
 
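  // Update ease factor. This runs on failed reviews too, which is a common variation;
  // the original SM-2 description leaves the factor unchanged when quality < 3.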
  easeFactor = Math.max(
    1.3,
    easeFactor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
  );
 
  const nextReviewAt = new Date();
  nextReviewAt.setDate(nextReviewAt.getDate() + interval);
 
  return { easeFactor, interval, repetitions, nextReviewAt };
}
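
The ease-factor expression is easier to reason about as a per-grade adjustment. The sketch below isolates that term and shows how each quality grade moves the factor before the 1.3 floor is applied. easeDelta and nextEaseFactor are hypothetical helper names; the arithmetic is the same expression used in calculateNextReview.

```typescript
// Change applied to the ease factor for a given quality grade (0-5),
// using the same expression as calculateNextReview.
function easeDelta(quality: number): number {
  const miss = 5 - quality;
  return 0.1 - miss * (0.08 + miss * 0.02);
}

// Apply the adjustment and enforce the 1.3 floor.
function nextEaseFactor(easeFactor: number, quality: number): number {
  return Math.max(1.3, easeFactor + easeDelta(quality));
}

// Approximate adjustments per grade:
// quality 5: +0.10, 4: 0.00, 3: -0.14, 2: -0.32, 1: -0.54, 0: -0.80
for (let q = 5; q >= 0; q--) {
  console.log(q, easeDelta(q).toFixed(2), nextEaseFactor(2.5, q).toFixed(2));
}
```

A grade of 4 leaves the factor essentially unchanged, so only a perfect 5 makes intervals grow faster. Three consecutive grades of 3 take a fresh card from 2.5 to about 2.08, which is why cards that are "barely remembered" come back noticeably sooner.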

Latency Optimization — Streaming and Parallel Processing

Claude Streaming Responses

The sequential Whisper → Claude → TTS pipeline introduces 3–5 seconds of latency between the user speaking and hearing a response. By using Claude's streaming API, we can start TTS conversion while text is still being generated:

// pipeline/StreamingPipeline.ts
 
export async function processWithStreaming(
  audioBlob: Blob,
  context: ConversationContext,
  callbacks: {
    onTranscription: (text: string) => void;
    onPartialResponse: (text: string) => void;
    onAudioReady: (uri: string) => void;
    onFeedback: (feedback: ConversationFeedback) => void;
  }
) {
  // Step 1: Whisper (batch processing only)
  const transcription = await transcribeWithWhisper(audioBlob);
  callbacks.onTranscription(transcription.text);
 
  // Step 2: Claude streaming
  let fullResponse = '';
  let firstSentenceProcessed = false;
 
  const stream = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
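      // An API key shipped inside the app bundle can be extracted; in production,
      // send this request through your own backend and keep the key there.
      'x-api-key': process.env.ANTHROPIC_API_KEY!,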
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      stream: true,
      system: buildSystemPrompt(context),
      messages: buildMessages(context, transcription.text),
    }),
  });
 
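  if (!stream.ok || !stream.body) {
    throw new Error(`Claude request failed with status ${stream.status}`);
  }

  const reader = stream.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial SSE line that spans two network chunks

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // { stream: true } keeps multi-byte characters split across chunks intact
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // the last element may be an incomplete line

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = JSON.parse(line.slice(6));
      // Only text deltas carry response text
      if (data.type !== 'content_block_delta' || data.delta?.type !== 'text_delta') continue;

      fullResponse += data.delta.text;
      callbacks.onPartialResponse(fullResponse);

      // Start TTS for the first sentence as soon as it's complete.
      // Covers Latin and CJK terminators; the rest of the reply is not synthesized here.
      const match = firstSentenceProcessed
        ? null
        : fullResponse.match(/^[\s\S]*?[.!?。！？]/);
      if (match) {
        const firstSentence = match[0];
        firstSentenceProcessed = true;

        // Begin TTS conversion asynchronously
        synthesizeSpeech({
          text: firstSentence,
          language: context.targetLanguage,
          voiceGender: 'female',
          speed: 'normal',
          stability: 0.75,
        })
          .then(sound => {
            // onAudioReady is typed as receiving a URI; the sentence text is passed
            // here because synthesizeSpeech resolves to a playable sound, not a URI.
            callbacks.onAudioReady(firstSentence);
            sound.playAsync();
          })
          .catch(err => console.warn('TTS failed for first sentence', err));
      }
    }
  }
}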
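
The pipeline above only sends the first sentence to TTS. One way to extend it is to cut the streamed text into sentences as they complete and queue each one for synthesis. Here is a minimal sketch of that splitting step, assuming sentence boundaries can be approximated by terminal punctuation. SentenceChunker is a hypothetical helper, and the terminator set would need tuning per target language.

```typescript
// Accumulates streamed text deltas and emits each sentence once it is complete.
class SentenceChunker {
  private pending = '';
  // Latin and CJK sentence terminators. Abbreviations and decimals such as "3.5"
  // are not handled and will be split at the dot.
  private static readonly TERMINATORS = /[.!?。！？]/;

  // Feed one delta; returns the sentences completed by it (possibly none).
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    let index = this.pending.search(SentenceChunker.TERMINATORS);
    while (index !== -1) {
      const sentence = this.pending.slice(0, index + 1).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.pending = this.pending.slice(index + 1);
      index = this.pending.search(SentenceChunker.TERMINATORS);
    }
    return sentences;
  }

  // Call when the stream ends to collect trailing text that has no terminator.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = '';
    return rest.length > 0 ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const spoken: string[] = [];
for (const delta of ['Nice to mee', 't you! How was ', 'your day? I hope it went we', 'll']) {
  spoken.push(...chunker.push(delta));
}
spoken.push(...chunker.flush());
console.log(spoken); // ['Nice to meet you!', 'How was your day?', 'I hope it went well']
```

Each emitted sentence can then be passed to synthesizeSpeech in order. Playing them back to back needs a small queue so that a later sentence does not start before an earlier one finishes.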

Subscription Monetization Design

Freemium Model Structure

A three-tier freemium model is what I settled on for this app:

Free Plan: 3 AI conversation sessions per day (5 minutes each), basic flashcards, progress tracking
Pro Plan ($9.99/month): Unlimited AI sessions, all voice options, detailed pronunciation reports, offline caching, no ads
Premium Plan ($79.99/year): Everything in Pro + personalized curriculum generation, monthly native speaker feedback

// config/subscriptionTiers.ts
 
export const SUBSCRIPTION_TIERS = {
  free: {
    dailySessions: 3,
    sessionDurationMinutes: 5,
    voiceOptions: ['default'],
    pronunciationReport: 'basic', // Score only
    offlineCache: false,
    adaptiveCurriculum: false,
  },
  pro: {
    dailySessions: Infinity,
    sessionDurationMinutes: 30,
    voiceOptions: ['male', 'female', 'british', 'australian'],
    pronunciationReport: 'detailed', // Phoneme-level analysis
    offlineCache: true,
    adaptiveCurriculum: true,
  },
  premium: {
    dailySessions: Infinity,
    sessionDurationMinutes: 60,
    voiceOptions: 'all',
    pronunciationReport: 'expert', // Native speaker comparison
    offlineCache: true,
    adaptiveCurriculum: true,
    nativeFeedback: true, // Monthly
  },
} as const;
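
Enforcing these limits at session start can be a pure function over the tier config and today's usage count. The sketch below works under that assumption. Tier, TierLimits, canStartSession, and the reduced LIMITS table are hypothetical and only mirror two fields of SUBSCRIPTION_TIERS.

```typescript
type Tier = 'free' | 'pro' | 'premium';

interface TierLimits {
  dailySessions: number;
  sessionDurationMinutes: number;
}

// Reduced copy of the two fields needed for gating.
const LIMITS: Record<Tier, TierLimits> = {
  free: { dailySessions: 3, sessionDurationMinutes: 5 },
  pro: { dailySessions: Infinity, sessionDurationMinutes: 30 },
  premium: { dailySessions: Infinity, sessionDurationMinutes: 60 },
};

interface GateResult {
  allowed: boolean;
  maxDurationSeconds: number;
  remainingToday: number; // Infinity for unlimited tiers
}

// Decide whether a new session may start, given how many ran today.
function canStartSession(tier: Tier, sessionsToday: number): GateResult {
  const limits = LIMITS[tier];
  const remainingToday = Math.max(0, limits.dailySessions - sessionsToday);
  return {
    allowed: remainingToday > 0,
    maxDurationSeconds: limits.sessionDurationMinutes * 60,
    remainingToday,
  };
}

console.log(canStartSession('free', 3)); // allowed: false, remainingToday: 0
```

Because every session triggers paid API calls, the same check is worth repeating on the server; a client-only gate can be bypassed.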

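For cost estimation, each 5-minute session (approximately 10 conversation exchanges) costs roughly:

Whisper: ~$0.006 (this corresponds to about 1 minute of recorded learner speech at $0.006 per minute; a session with a full 5 minutes of recorded audio would be closer to $0.03)
Claude Sonnet: ~$0.015 (3K input + 1K output tokens × 10 turns)
ElevenLabs: ~$0.03 (about 1,000 characters per session, spread across the 10 turns)
Total: ~$0.05/session

Assuming a Pro subscriber at $9.99/month averages 3 sessions/day (90 sessions/month), API costs are approximately $4.50, which leaves a gross margin of about 55% before app store commission.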
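
The margin arithmetic is worth keeping as code so it can be re-run whenever usage or provider pricing changes. This sketch uses the per-session estimates above. The function and constant names are hypothetical, and store commission and payment fees are deliberately left out.

```typescript
// Per-session API cost estimates in USD, taken from the figures above.
const COST_PER_SESSION = {
  whisper: 0.006,
  claude: 0.015,
  elevenLabs: 0.03,
};

// Sum of provider costs for one session.
function sessionCost(costs: typeof COST_PER_SESSION): number {
  return costs.whisper + costs.claude + costs.elevenLabs;
}

// Gross margin as a fraction of plan price, before store commission.
function grossMargin(planPrice: number, sessionsPerMonth: number, costPerSession: number): number {
  return (planPrice - sessionsPerMonth * costPerSession) / planPrice;
}

// Largest whole number of sessions per month whose API cost stays within the plan price.
function breakEvenSessions(planPrice: number, costPerSession: number): number {
  return Math.floor(planPrice / costPerSession);
}

const perSession = sessionCost(COST_PER_SESSION); // about 0.051
console.log(grossMargin(9.99, 90, 0.05).toFixed(2)); // rounded cost: 0.55
console.log(grossMargin(9.99, 90, perSession).toFixed(2)); // unrounded cost: 0.54
console.log(breakEvenSessions(9.99, 0.05)); // 199 sessions, under 7 per day
```

The break-even figure matters because the Pro plan is unlimited: under these estimates, a subscriber who averages around 7 sessions a day costs more in API calls than the plan brings in.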

Three Gotchas I Hit While Running This in Production

A pipeline that looks clean on paper starts surfacing quiet issues once you run a hundred sessions a day on real devices. Drawing on the operational logging discipline I built up across roughly 50 million cumulative downloads of my wallpaper and meditation apps, here are the three traps that hurt language-learning apps the most.

Gotcha 1: Forecasting ElevenLabs by "conversation turns" is off by 2x

I initially budgeted at "1 session = 10 turns = ~1,000 characters" for Pro plan users. Actual measurement came in around 2,000 characters. The cause was that I was sending the corrected/explanation fields of grammar feedback and the in-prompt example sentences to TTS as well. Without an explicit split, the API bill at month three runs at 2x your projection.

// services/ttsRouter.ts — decide which texts get spoken vs only shown
// Always separate "must be spoken" from "screen-only assistance"
 
interface TTSCandidate {
  text: string;
  purpose: 'main_response' | 'pronunciation_hint' | 'grammar_explanation' | 'vocab_example';
}
 
const TTS_BUDGET_PER_SESSION = 1200; // character cap; anything above is screen-only
 
export function selectTTSCandidates(
  candidates: TTSCandidate[],
  spentCharsSoFar: number
): TTSCandidate[] {
  // main_response always gets TTS; the rest only if budget allows, in priority order
  const main = candidates.filter(c => c.purpose === 'main_response');
  const others = candidates.filter(c => c.purpose !== 'main_response');
 
  let budget = TTS_BUDGET_PER_SESSION - spentCharsSoFar;
  const result = [...main];
  budget -= main.reduce((s, c) => s + c.text.length, 0);
 
  // Priority: pronunciation_hint > vocab_example > grammar_explanation
  const priorityOrder: TTSCandidate['purpose'][] = [
    'pronunciation_hint', 'vocab_example', 'grammar_explanation',
  ];
  for (const purpose of priorityOrder) {
    const items = others.filter(c => c.purpose === purpose);
    for (const item of items) {
      if (budget >= item.text.length) {
        result.push(item);
        budget -= item.text.length;
      }
    }
  }
  return result;
}

Once this router was in place, per-user TTS character consumption dropped by roughly 40 percent at the median, and Pro plan unit economics started matching the spreadsheet.

Gotcha 2: Whisper word_probability over-rewards silent segments

If you compute pronunciation score as a flat average of word probabilities, "ums" and breathy near-silence tokens come back at 0.95+ and unfairly lift the total. The fix is small but important: drop segments where no_speech_prob > 0.3 or avg_logprob < -1.0 before scoring.

// Add this to services/whisperService.ts
export function calculateRobustPronunciationScore(
  response: WhisperDetailedResponse
): { overall: number; reliableWordCount: number } {
  const reliableSegments = response.segments.filter(s =>
    s.no_speech_prob < 0.3 && s.avg_logprob > -1.0
  );
  const words = reliableSegments.flatMap(s => s.words || []);
  // also drop words below 0.3 — usually noise, coughs, or stutters
  const reliable = words.filter(w => w.probability >= 0.3);
 
  if (reliable.length === 0) {
    return { overall: 0, reliableWordCount: 0 };
  }
 
  const overall = Math.round(
    (reliable.reduce((s, w) => s + w.probability, 0) / reliable.length) * 100
  );
  return { overall, reliableWordCount: reliable.length };
}

When I introduced this filter, average pronunciation score dropped from 82 to 71. But user feedback shifted toward "the score actually matches what I feel about my own pronunciation." Scores should be honest signals, not vanity numbers. Honesty correlates better with retention.

Gotcha 3: Letting Rork Max float its Expo SDK can crash on iOS 17 launch

Rork Max scaffolds projects with a pinned Expo SDK major version, but minor updates still move underneath you. The interaction between AdMob, ElevenLabs, and expo-av has shifted on minor SDK updates at least twice in my experience: the argument shape of Audio.Recording.createAsync changed once and cost me two days of post-launch crash hunting in one of my wallpaper apps. Because recording is the heart of a language app, build the following into CI so you catch SDK drift before users do.

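# .github/workflows/audio-smoke-test.yml (the meaningful part)
- name: Smoke test audio recording API
  run: |
    # Fails when installed packages do not match the versions this Expo SDK expects
    npx expo install --check
    npx expo-doctor
    # Verify the installed expo-av major matches the one bundled with this Expo SDK.
    # Assumes the SDK ships bundledNativeModules.json and still lists expo-av in it.
    node -e "
      const installed = require('./node_modules/expo-av/package.json').version;
      const expected = require('./node_modules/expo/bundledNativeModules.json')['expo-av'];
      if (expected === undefined) {
        console.error('expo-av is not listed for this Expo SDK');
        process.exit(1);
      }
      // Strip range prefixes such as ~ or ^ before reading the major version
      const major = v => parseInt(v.replace(/^[^0-9]*/, '').split('.')[0], 10);
      if (major(installed) !== major(expected)) {
        console.error('expo-av major version drift detected: installed ' + installed + ', expected ' + expected);
        process.exit(1);
      }
    "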

With this in CI, you will be alerted the day Rork Max bumps Expo SDK behind your back.

Operational Checklist for Indie Developers

The implementation pieces are now in place. Before signing off, here is the weekly operational view I keep, plus a few related Rork Lab references for the broader stack.

Five numbers to track every week

API cost divided by paying users (compare against ARPU): if API cost climbs above 60 percent of plan price, immediately break down which provider (ElevenLabs characters / Whisper seconds / Claude tokens) is the cause
Day 7 and Day 30 retention: language learning takes longer to become a habit than typical app usage. From my experience, I treat 35 percent at Day 7 and 15 percent at Day 30 as the minimum acceptable line
Scatter plot of average pronunciation score vs session length: if low-scoring users consistently leave sessions early, the level is likely set too high
AdMob rewarded video completion rate (free tier): below 60 percent means you need to revisit ad placement or switch to interstitial flow
P95 latency for Whisper, Claude, and ElevenLabs separately: perceived speed is set by P95. If any single provider crosses 4 seconds, it is time to consider streaming or parallelization
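
Two of these numbers are simple to compute from raw logs. The sketch below assumes latency samples are stored in milliseconds per provider. percentile and apiCostRatio are hypothetical helper names, the sample values are made up for illustration, and the percentile uses the nearest-rank method.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of plan price consumed by API spend for one paying user.
function apiCostRatio(monthlyApiCost: number, planPrice: number): number {
  return monthlyApiCost / planPrice;
}

// Illustrative latency samples for one provider, in milliseconds.
const whisperLatenciesMs = [820, 900, 950, 1010, 1100, 1200, 1350, 1500, 2100, 4300];
const p95 = percentile(whisperLatenciesMs, 95);
console.log(p95, p95 > 4000 ? 'over the 4 second line' : 'ok');
console.log(apiCostRatio(4.5, 9.99) > 0.6 ? 'investigate providers' : 'within budget');
```

With only ten samples, P95 is simply the slowest request, so a single outlier dominates it. Over a week of real traffic the figure settles down and becomes more useful than the average.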

Companion articles on Rork Lab

This guide stands on its own, but during operations the following pairs well:

Auth and billing: Building a Membership App with Rork and Supabase Auth
AdMob monetization: Integrating AdMob Rewarded Ads in a Rork App
Notification segmentation: Advanced Push Notification Segmentation in Rork

The Next Line of Code

Thank you for staying with the guide this far. Every snippet here is written to be the kind of code you could paste during your morning commute and have running by lunch. The most rewarding first step, in my experience, is dropping calculateRobustPronunciationScore into a small sandbox project and recording yourself reading something. Watching your own scores change as you adjust your delivery is where the conviction comes from.

A language app is, fundamentally, an app where users meet a less polished version of themselves. That is why the score needs to be honest, the feedback needs to stay warm, and the loop needs to be short. I hope yours becomes the one that helps someone feel their world widen.
