App Dev / 2026-07-03 / Intermediate

Adding Read-Aloud to a Rork Max App: AVSpeechSynthesizer Voice Selection and Live Word Highlighting

An implementation memo on adding read-aloud to a native Swift app generated by Rork Max — covering AVSpeechSynthesizer voice selection, highlighting the word being spoken, audio session design, and the pitfalls that bite specifically with Japanese text.

Tags: Rork Max, AVSpeechSynthesizer, Read Aloud, Accessibility, SwiftUI


When someone asked me to add read-aloud to a reading app, I assumed it would be a few lines. As an indie developer I tend to underestimate these small features. Hand text to AVSpeechSynthesizer and it talks — that part is true. But on a real device the small complaints piled up fast: the voice sounded muffled, you couldn't tell where on screen it was reading, and background music cut out abruptly. Making it speak is trivial; making it something you actually enjoy listening to takes several deliberate design choices.

Here I've written down, in the order I actually touched them, the steps for wiring read-aloud into the native Swift apps Rork Max produces. We start from the smallest working version, then move to choosing a voice, highlighting the word being spoken, tuning the audio session, and finally the things that tripped me up specifically in Japanese.

Make it speak — but keep the synthesizer alive

This is the minimal version. There is one landmine nearly every beginner hits: if you create AVSpeechSynthesizer as a local variable, it is deallocated when the function returns, before it finishes speaking, and you get silence. The rule is to keep it as a stored property on an object that outlives the call, not as a local inside the view's body or a function.

import AVFoundation
import SwiftUI
 
@MainActor
final class Reader: ObservableObject {
    // ❌ A local variable inside a func is released before it speaks -> silence
    // ✅ Keep it as a stored property
    private let synth = AVSpeechSynthesizer()
 
    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate  // ~0.5
        utterance.pitchMultiplier = 1.0
        utterance.postUtteranceDelay = 0.2
        synth.speak(utterance)  // enqueued and read in order
    }
 
    func stop() {
        synth.stopSpeaking(at: .immediate)
    }
}

speak(_:) enqueues the utterance rather than interrupting: if nothing is playing it starts right away, and if something is already playing the new utterance waits its turn. Call it repeatedly and utterances are read in order, so to "interrupt the current one and read the next," call stopSpeaking(at:) first. .immediate stops right away; .word finishes the current word before stopping. When the user taps the next sentence, .immediate felt right; for a pause button, pauseSpeaking(at: .word) was the natural choice, paired with continueSpeaking() to resume.
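The interrupt and pause behavior described above can be sketched roughly as follows. This is a minimal sketch, not the article's own implementation: the names InterruptibleReader, speakNow, togglePause, and SentenceList are hypothetical, and the view assumes the sentences are unique strings.

```swift
import AVFoundation
import SwiftUI

// Hypothetical variant of the Reader class, extended with
// "interrupt and read" and pause/resume. All names are illustrative.
@MainActor
final class InterruptibleReader: ObservableObject {
    // Stored property, so the synthesizer outlives each call.
    private let synth = AVSpeechSynthesizer()

    /// Drops whatever is playing or queued, then reads `text`.
    func speakNow(_ text: String) {
        // Harmless when nothing is playing: the call just returns false.
        synth.stopSpeaking(at: .immediate)
        synth.speak(AVSpeechUtterance(string: text))
    }

    /// Pauses after the current word, or resumes if already paused.
    func togglePause() {
        // Check isPaused first: isSpeaking stays true while paused.
        if synth.isPaused {
            synth.continueSpeaking()
        } else if synth.isSpeaking {
            synth.pauseSpeaking(at: .word)
        }
    }
}

struct SentenceList: View {
    // @StateObject keeps the reader (and its synthesizer) alive across view updates.
    @StateObject private var reader = InterruptibleReader()
    let sentences: [String]  // assumed unique, since they double as list IDs

    var body: some View {
        VStack {
            List(sentences, id: \.self) { sentence in
                // Tapping a sentence interrupts the current one.
                Button(sentence) { reader.speakNow(sentence) }
            }
            Button("Pause / Resume") { reader.togglePause() }
        }
    }
}
```

Holding the reader in @StateObject is one way to satisfy the "keep the synthesizer alive" rule in SwiftUI; an app-level object injected through the environment would work as well.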

Choosing a voice — installed voices differ in quality

This is what moves the perceived quality the most. If you specify only a language with AVSpeechSynthesisVoice(language:), you get that language's default voice, which isn't always the pleasant one. iOS mixes compact lightweight voices with higher-quality downloaded ones (.enhanced or .premium), and preferring the latter when present raises satisfaction noticeably.

extension AVSpeechSynthesisVoice {
    /// Returns the highest-quality voice for a language (premium > enhanced > default)
    static func bestVoice(for language: String) -> AVSpeechSynthesisVoice? {
        let candidates = speechVoices().filter {
            $0.language.hasPrefix(language)  // catches both "ja-JP" and "ja"
        }
        // quality is .default(1) < .enhanced(2) < .premium(3)
        return candidates.max(by: { $0.quality.rawValue < $1.quality.rawValue })
    }
}
 
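// Usage (inside speak(_:), after creating the utterance):
// prefer the best installed Japanese voice, else fall back to the ja-JP default
let voice = AVSpeechSynthesisVoice.bestVoice(for: "ja")
    ?? AVSpeechSynthesisVoice(language: "ja-JP")
utterance.voice = voice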

The catch: high-quality voices may not be installed. Unless the user has added them under Settings → Accessibility → Spoken Content → Voices, they never appear among the candidates. I placed a low-key hint inside the app ("for a more natural voice, add one from Settings") and made "doesn't break with the default voice" the floor. If you design around premium quality, the experience collapses on devices that never downloaded it.
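One way to decide whether the in-app hint is worth showing is to check whether any voice above the default tier is installed for the language. The sketch below assumes a hypothetical helper named hasHighQualityVoice(for:); it is not a system API, and it uses the same prefix match as the voice lookup.

```swift
import AVFoundation

extension AVSpeechSynthesisVoice {
    /// True when at least one enhanced or premium voice is installed for
    /// the language prefix (e.g. "ja"). Hypothetical helper, not a system API.
    static func hasHighQualityVoice(for language: String) -> Bool {
        let floor = AVSpeechSynthesisVoiceQuality.default.rawValue
        return speechVoices().contains {
            // Comparing raw values avoids naming .premium directly,
            // which is only available on iOS 16 and later.
            $0.language.hasPrefix(language) && $0.quality.rawValue > floor
        }
    }
}

// Show the "add a more natural voice from Settings" hint only when needed.
let shouldShowVoiceHint = !AVSpeechSynthesisVoice.hasHighQualityVoice(for: "ja")
```

Because the user can add or remove voices at any time, it is probably safer to re-evaluate this when the screen appears than to cache it at launch.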

The relationship between quality and availability looks like this:

| quality | Character | How to get it | Design stance |
|---|---|---|---|
| default | Lightweight, a bit robotic | Always present in the OS | Guarantee it as the fallback floor |
| enhanced | Natural, tens of MB | Manual download in Settings | Prefer if present; don't break without it |
| premium | Most natural (iOS 16+) | Manual download in Settings | Top priority if present; never assume it |
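A prefix match such as "en" spans every installed region, so the highest-quality voice may come from a region the user did not expect. The sketch below shows one possible refinement under that assumption: quality still wins, but within the same tier an exact language match is preferred. VoiceInfo and pickVoice are hypothetical names, and VoiceInfo is a plain stand-in for AVSpeechSynthesisVoice so the ranking can be checked without a device.

```swift
// Hypothetical stand-in for AVSpeechSynthesisVoice. qualityRank mirrors
// the raw values default(1) < enhanced(2) < premium(3).
struct VoiceInfo: Equatable {
    let identifier: String
    let language: String   // BCP 47 code such as "ja-JP"
    let qualityRank: Int
}

/// Picks a voice for `preferred` (e.g. "en-GB"): higher quality wins;
/// within the same tier, an exact language match wins.
func pickVoice(from voices: [VoiceInfo], preferred: String) -> VoiceInfo? {
    // Base language is the part before the region: "en-GB" -> "en".
    let base = preferred.split(separator: "-").first.map { String($0) } ?? preferred
    let candidates = voices.filter {
        $0.language == base || $0.language.hasPrefix(base + "-")
    }
    // Tuples compare element by element, so quality is checked first
    // and the exact-match flag only breaks ties.
    func score(_ v: VoiceInfo) -> (Int, Int) {
        (v.qualityRank, v.language == preferred ? 1 : 0)
    }
    return candidates.max { score($0) < score($1) }
}

// Example data (made up for illustration).
let sampleVoices = [
    VoiceInfo(identifier: "a", language: "en-US", qualityRank: 2),
    VoiceInfo(identifier: "b", language: "en-GB", qualityRank: 2),
    VoiceInfo(identifier: "c", language: "en-GB", qualityRank: 1),
    VoiceInfo(identifier: "d", language: "ja-JP", qualityRank: 3),
]
let chosen = pickVoice(from: sampleVoices, preferred: "en-GB")
```

On a device, the list could be built from AVSpeechSynthesisVoice.speechVoices() using each voice's identifier, language, and quality.rawValue, and the chosen entry turned back into a voice with AVSpeechSynthesisVoice(identifier:).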
