App Dev · 2026-07-03 · Intermediate

Adding Read-Aloud to a Rork Max App: AVSpeechSynthesizer Voice Selection and Live Word Highlighting

An implementation memo on adding read-aloud to a native Swift app generated by Rork Max — covering AVSpeechSynthesizer voice selection, highlighting the word being spoken, audio session design, and the pitfalls that bite specifically with Japanese text.

Rork Max · AVSpeechSynthesizer · Read Aloud · Accessibility · SwiftUI

When someone asked me to add read-aloud to a reading app, I assumed it would be a few lines. As an indie developer I tend to underestimate these small features. Hand text to AVSpeechSynthesizer and it talks — that part is true. But on a real device the small complaints piled up fast: the voice sounded muffled, you couldn't tell where on screen it was reading, and background music cut out abruptly. Making it speak is trivial; making it something you actually enjoy listening to takes several deliberate design choices.

Here I've written down, in the order I actually touched them, the steps for wiring read-aloud into the native Swift apps Rork Max produces. We start from the smallest working version, then move to choosing a voice, highlighting the word being spoken, tuning the audio session, and finally the things that tripped me up specifically in Japanese.

Make it speak — but keep the synthesizer alive

This is the minimal version. There's one landmine nearly every beginner hits: if you create AVSpeechSynthesizer as a local variable, it can be deallocated before it finishes speaking, and you get silence. The rule is to keep it as a stored property on an object that outlives the call, outside the view's body.

import AVFoundation
import SwiftUI
 
@MainActor
final class Reader: ObservableObject {
    // ❌ A local variable inside a func is released before it speaks -> silence
    // ✅ Keep it as a stored property
    private let synth = AVSpeechSynthesizer()
 
    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate  // ~0.5
        utterance.pitchMultiplier = 1.0
        utterance.postUtteranceDelay = 0.2
        synth.speak(utterance)  // enqueued and read in order
    }
 
    func stop() {
        synth.stopSpeaking(at: .immediate)
    }
}

speak(_:) enqueues rather than plays immediately. Call it repeatedly and utterances are read in order, so to "interrupt the current one and read the next," call stopSpeaking(at:) first. .immediate stops right away; .word finishes the current word before stopping. When the user taps the next sentence, .immediate felt right; for a pause button, pauseSpeaking(at: .word) was the natural choice.
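As a rough sketch of the interrupt and pause behavior described above: the class name PlaybackController and its method names are hypothetical, chosen only for illustration. The AVSpeechSynthesizer calls are the same ones already used in this article. It assumes a single synthesizer that is only touched from the main thread.

```swift
import AVFoundation

/// Hypothetical controller, named for illustration only.
final class PlaybackController {
    // Stored property, so the synthesizer outlives each call.
    private let synth = AVSpeechSynthesizer()

    /// True when nothing is speaking, queued, or paused.
    var isIdle: Bool { !synth.isSpeaking && !synth.isPaused }

    /// "Tap the next sentence": drop whatever is playing or queued, then read `text`.
    func speakNow(_ text: String) {
        if synth.isSpeaking || synth.isPaused {
            synth.stopSpeaking(at: .immediate)
        }
        synth.speak(AVSpeechUtterance(string: text))
    }

    /// Pause button: resume if paused, otherwise pause at the next word boundary.
    func togglePause() {
        if synth.isPaused {
            synth.continueSpeaking()
        } else if synth.isSpeaking {
            synth.pauseSpeaking(at: .word)
        }
    }
}
```

Because .word waits for the current word to finish, isPaused may not flip the instant togglePause() returns. If the button label depends on that state, updating it from the delegate's didPause and didContinue callbacks is probably the safer route.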

Choosing a voice — installed voices differ in quality

This is what moves the perceived quality the most. If you specify only a language with AVSpeechSynthesisVoice(language:), you get that language's default voice, which isn't always the pleasant one. iOS mixes compact lightweight voices with higher-quality downloaded ones (.enhanced or .premium), and preferring the latter when present raises satisfaction noticeably.

extension AVSpeechSynthesisVoice {
    /// Returns the highest-quality voice for a language (premium > enhanced > default)
    static func bestVoice(for language: String) -> AVSpeechSynthesisVoice? {
        let candidates = speechVoices().filter {
            $0.language.hasPrefix(language)  // catches both "ja-JP" and "ja"
        }
        // quality is .default(1) < .enhanced(2) < .premium(3)
        return candidates.max(by: { $0.quality.rawValue < $1.quality.rawValue })
    }
}
 
// Usage
let voice = AVSpeechSynthesisVoice.bestVoice(for: "ja")
        ?? AVSpeechSynthesisVoice(language: "ja-JP")
utterance.voice = voice

The catch: high-quality voices may not be installed. Unless the user added them under Settings → Accessibility → Spoken Content → Voices, .premium never appears among the candidates. I placed a low-key hint inside the app ("for a more natural voice, add one from Settings") and made "doesn't break with the default voice" the floor. If you design around premium quality, the experience collapses on devices that never downloaded it.

The relationship between quality and availability looks like this:

quality	Character	How to get it	Design stance
default	Lightweight, a bit robotic	Always present in the OS	Guarantee it as the fallback floor
enhanced	Natural, tens of MB	Manual download in Settings	Prefer if present; don't break without it
premium	Most natural (iOS 16+)	Manual download in Settings	Top priority if present; never assume it
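
One way to decide when to show the in-app "add a better voice" hint is to check whether any enhanced or premium voice for the language is installed. The sketch below is an assumption about how that check could look, not the code from the app: VoiceSummary, shouldSuggestVoiceDownload, and installedVoiceSummaries are hypothetical names. The decision logic is kept free of AVFoundation so it can be exercised with hand-made data.

```swift
import Foundation
#if canImport(AVFoundation)
import AVFoundation
#endif

/// Hypothetical, minimal view of a voice: just what the hint decision needs.
struct VoiceSummary {
    let language: String   // BCP 47 code such as "ja-JP"
    let qualityRank: Int   // 1 = default, 2 = enhanced, 3 = premium
}

/// True when no enhanced or premium voice for the language is installed,
/// i.e. when the hint is worth showing.
func shouldSuggestVoiceDownload(for languagePrefix: String,
                                installed: [VoiceSummary]) -> Bool {
    !installed.contains {
        $0.language.hasPrefix(languagePrefix) && $0.qualityRank > 1
    }
}

#if canImport(AVFoundation)
/// Bridges the installed system voices into the summary type above.
func installedVoiceSummaries() -> [VoiceSummary] {
    AVSpeechSynthesisVoice.speechVoices().map {
        VoiceSummary(language: $0.language, qualityRank: $0.quality.rawValue)
    }
}
#endif
```

On a device, shouldSuggestVoiceDownload(for: "ja", installed: installedVoiceSummaries()) would drive the hint. The default voice remains the guaranteed floor either way.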

Highlight the word being spoken — reflect the delegate's range in UI

What actually decided satisfaction in a reading app was less the voice itself and more the visualization of "where are we reading right now." AVSpeechSynthesizerDelegate's willSpeakRangeOfSpeechString hands you the character range about to be spoken as an NSRange. Reflect it into an AttributedString's background color and the words flow like karaoke.

final class Reader: NSObject, ObservableObject, AVSpeechSynthesizerDelegate {
    @Published var highlightedRange: NSRange? = nil
    @Published var fullText: String = ""
    private let synth = AVSpeechSynthesizer()
 
    override init() {
        super.init()
        synth.delegate = self
    }
 
    func speak(_ text: String) {
        fullText = text
        let u = AVSpeechUtterance(string: text)
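        // Fall back to the language default if no installed voice matches
        u.voice = AVSpeechSynthesisVoice.bestVoice(for: "ja")
            ?? AVSpeechSynthesisVoice(language: "ja-JP")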
        synth.speak(u)
    }
 
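    // The range about to be spoken. The callback thread isn't documented as
    // guaranteed, so hop to the main queue before touching @Published state.
    func speechSynthesizer(_ s: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance) {
        DispatchQueue.main.async { self.highlightedRange = characterRange }
    }
 
    // Clear the highlight when the utterance ends (same queue, so order is kept)
    func speechSynthesizer(_ s: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        DispatchQueue.main.async { self.highlightedRange = nil }
    }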
}

On the SwiftUI side, convert the NSRange into AttributedString indices to highlight. This is where Japanese trips you up most, so run the conversion through the canonical NSRange → Range<String.Index> path rather than naive Int addition.

struct ReadingView: View {
    @ObservedObject var reader: Reader
 
    var attributed: AttributedString {
        var s = AttributedString(reader.fullText)
        guard let r = reader.highlightedRange,
              let swiftRange = Range(r, in: reader.fullText),
              let lower = AttributedString.Index(swiftRange.lowerBound, within: s),
              let upper = AttributedString.Index(swiftRange.upperBound, within: s)
        else { return s }
        s[lower..<upper].backgroundColor = .yellow.opacity(0.5)
        return s
    }
 
    var body: some View {
        ScrollView { Text(attributed).font(.title3).padding() }
    }
}

Why the highlight drifts in Japanese — the UTF-16 and emoji pitfall

This is where I burned the most time. The NSRange returned by willSpeakRangeOfSpeechString is an offset in UTF-16 code units. Swift's String counts in Characters (grapheme clusters), and a single emoji or combined character can span two or more UTF-16 units, so passing the raw location into String.index(offsetBy:) drifts on any sentence that contains them. That's exactly why the code above uses Range(nsRange, in: string): that initializer interprets UTF-16 offsets correctly. Write it with your own integer math instead, and on a review sentence with emoji the highlight lands a few characters late, the classic "the spot it reads and the spot it lights don't match" bug.
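
To make the drift concrete, here is a small Foundation-only comparison. The function names and the sample sentence are made up for illustration. The naive version exists only to show the failure and isn't something to ship.

```swift
import Foundation

/// Canonical conversion: treats the NSRange as UTF-16 offsets.
func substring(of text: String, utf16Range: NSRange) -> String? {
    guard let range = Range(utf16Range, in: text) else { return nil }
    return String(text[range])
}

/// Naive conversion (comparison only): treats the offsets as Character counts.
func naiveSubstring(of text: String, range: NSRange) -> String? {
    guard let start = text.index(text.startIndex,
                                 offsetBy: range.location,
                                 limitedBy: text.endIndex),
          let end = text.index(start,
                               offsetBy: range.length,
                               limitedBy: text.endIndex)
    else { return nil }
    return String(text[start..<end])
}

// U+1F44D (thumbs up) is one Character but two UTF-16 code units.
let sample = "この本\u{1F44D}おすすめ"
// A range as the delegate might report it: UTF-16 offset 5, length 2.
let spoken = NSRange(location: 5, length: 2)

print(substring(of: sample, utf16Range: spoken) ?? "nil")   // prints "おす"
print(naiveSubstring(of: sample, range: spoken) ?? "nil")   // prints "すす"
```

The naive result starts one character late because the emoji before it consumed two UTF-16 units but only one Character. Each additional emoji earlier in the text pushes the highlight one more character off.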

There's a second point: because word boundaries in Japanese aren't as clean as in English, the ranges willSpeakRange returns vary between a phrase and one-to-a-few characters. I settled on "light up only the range that arrives, and clear it on the next one." Trying to keep prior ranges lit and accumulate them made the display messy at the wobbly boundaries. Reflecting only the incoming range each time ended up looking more natural.

Don't stop the music, speak on the lock screen — the audio session

Left at defaults, starting read-aloud can silence other apps' music, and the speech itself can go silent when the Ring/Silent switch is set to silent. Plenty of readers keep ambient sound playing, so I chose "keep other audio going, just duck it slightly while speaking."

import AVFoundation
 
func configureAudioSession() {
    let session = AVAudioSession.sharedInstance()
    do {
        // .playback = plays even with the ringer switch off;
        // .duckOthers = temporarily lowers other apps' volume
        try session.setCategory(.playback,
                                mode: .spokenAudio,
                                options: [.duckOthers])
        try session.setActive(true)
    } catch {
        // Read-aloud still works even if this fails; just log it
        print("audio session error:", error)
    }
}

The .spokenAudio mode is intended for continuous speech content, and .duckOthers lowers other apps' volume while your session is active. Ducking is tied to the session rather than to each utterance, so to bring the music back to full volume, deactivate the session once speech finishes (setActive(false, options: .notifyOthersOnDeactivation)). To fully stop the music, drop .duckOthers; to simply mix with ambient sound, switch to .mixWithOthers. To keep reading on the lock screen or while other apps are in front, you need the audio background mode ("Audio, AirPlay, and Picture in Picture" under Signing & Capabilities) in addition to the .playback category. If you go as far as building lock-screen controls (play/pause), the same MPNowPlayingInfoCenter approach from a separate article on background audio and lock-screen controls carries over directly.
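
Ducking normally lasts for as long as the audio session stays active, so one way to hand full volume back to the music is to deactivate the session when speech ends. The following is a minimal iOS-only sketch under that assumption. SessionReleasingDelegate is a hypothetical name, and it assumes one utterance at a time (with a queue of utterances you would want to deactivate only after the last one).

```swift
import AVFoundation

/// Hypothetical delegate, for illustration: releases the audio session when speech ends.
final class SessionReleasingDelegate: NSObject, AVSpeechSynthesizerDelegate {
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        releaseSession()
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didCancel utterance: AVSpeechUtterance) {
        releaseSession()
    }

    private func releaseSession() {
        do {
            // .notifyOthersOnDeactivation tells other apps they can resume or un-duck.
            try AVAudioSession.sharedInstance()
                .setActive(false, options: .notifyOthersOnDeactivation)
        } catch {
            // Deactivation can fail if audio is still running; log and move on.
            print("audio session deactivate error:", error)
        }
    }
}
```

In the Reader class from earlier, the same two callbacks could live next to the highlight-clearing code instead of in a separate object. Call configureAudioSession() again before the next speak so the session is active when speech starts.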

The prompt I hand to Rork Max

When generating from scratch with Rork Max, spelling the feature out in detail before handing it over gave me stable results. Specifying the boundaries cuts down rework compared to a vague "add read-aloud."

Add read-aloud to a SwiftUI reading screen. Requirements:
- Hold AVSpeechSynthesizer as a property on an ObservableObject (not a local var)
- For Japanese, pick the best voice by quality: premium > enhanced > default
- Use willSpeakRangeOfSpeechString to publish the range being read (@Published), and
  highlight it via the background color of a Text's AttributedString
  (convert NSRange -> Range correctly)
- Add play / pause (pauseSpeaking(at: .word)) / stop buttons
- Audio session: .playback / .spokenAudio / .duckOthers
After generating, verify the highlight doesn't drift on sentences with emoji.

The output usually leaves one of two things: "works but the synthesizer is a local variable," or "converts NSRange with Int addition." So I check exactly those two spots every time. Rather than trusting generated code wholesale, re-reading only the pitfall areas listed in this article keeps the cleanup short.

What I learned after shipping it

Read-aloud isn't a feature everyone uses just because it's there. In my reading prototypes, about one in ten people used it even once. But that tenth had a clearly higher week-two retention — there really is a segment that "reads with their ears" during a commute or between chores. Small in the numbers, yet a path that lets people keep touching content without stopping their hands seems to help them stick.

If I tackle the next piece, auto-scroll tied to the word highlight (keeping the line being read centered) has the most room to improve the experience. For accessibility work overall, considering read-aloud alongside VoiceOver and Dynamic Type widens the reach beyond read-aloud alone. For relaxation apps that combine it with ambient sound, the ambient audio loop design notes should help too.

Start by holding the synthesizer as a property and speaking one sentence with the default voice. Clear that step and you can add highlighting and voice quality one at a time.

Thank You for Reading

Rork Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.