App Dev / 2026-07-01 / Advanced

Building a Song-Recognition App with ShazamKit in Rork Max's Native Swift

Implementation notes on building a song-recognition app with SHManagedSession in Rork Max's native Swift. Covers the difference from hand-rolling AVAudioEngine, designing the idle / prerecording / matching states, using prerecording to improve initial accuracy, and the boundary design for bridging from Expo — with the pitfalls I actually hit.

Once, in a café I'd wandered into with a friend, a song played that felt familiar but neither of us could name, and we sat there stumped for a while. It would be fun, I thought, to have that "hold up your phone and instantly know" experience as a feature of one of my own small apps. I've built a handful of utility apps as an indie developer, and the feel of "point it and it just knows" carries a pleasure that needs no explanation.

ShazamKit is the way in. It is Apple's music-recognition framework, embeddable in your own app, and it matches a song from ambient sound or from your app's own playback. It used to require you to manage the recording yourself, but SHManagedSession, introduced in iOS 17, takes most of that off your hands. And because Rork Max now generates native Swift, this API slots cleanly into an indie app. Here is what I learned building a small "point it and it knows" app, step by step.

Hand-rolled classic, or the managed session?

ShazamKit offers two broad ways to build. Choosing this first settles the rest of your design.

Aspect	SHSession (hand-rolled)	SHManagedSession (iOS 17+)
Recording management	You build AVAudioEngine yourself	Left to the framework
Mic permission	You request and check it	The session takes care of it
Receiving results	Delegate	Async sequence (the results property), or a single result()
State visibility	Manage it yourself	Observe idle / prerecording / matching
Best for	Fine control, e.g. matching against playback	The classic "point and recognize ambient song"

For the classic use — hold up the phone and identify the song playing around you — I reach for SHManagedSession without hesitation. It carries the recording and mic-permission burden for you, so the amount you write visibly shrinks. I hand-rolled AVAudioEngine at first, meaning to learn, but permissions and buffer wrangling ate more time than expected; switching to SHManagedSession left only the essence.
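
For contrast, a hand-rolled version might look roughly like the sketch below. The class name ManualRecognizer, the onResult callback, and the buffer size are placeholders chosen for illustration. It also assumes mic permission has already been granted and NSMicrophoneUsageDescription is set, neither of which the sketch handles.

```swift
import AVFoundation
import ShazamKit

/// Hypothetical hand-rolled recognizer: you own the engine, the tap, and the delegate.
final class ManualRecognizer: NSObject, SHSessionDelegate {
    private let engine = AVAudioEngine()
    private let session = SHSession()          // matches against the Shazam catalog
    private(set) var isRunning = false

    /// Called with the first media item's title, or nil when nothing matched.
    /// Delegate callbacks are not guaranteed to arrive on the main thread.
    var onResult: ((String?) -> Void)?

    func start() throws {
        guard !isRunning else { return }
        session.delegate = self

        // Configure and activate the shared audio session for recording.
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .default)
        try audioSession.setActive(true)

        // Stream microphone buffers into the ShazamKit session.
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 2048, format: format) { [weak self] buffer, time in
            self?.session.matchStreamingBuffer(buffer, at: time)
        }

        engine.prepare()
        try engine.start()
        isRunning = true
    }

    func stop() {
        guard isRunning else { return }
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        isRunning = false
    }

    // MARK: SHSessionDelegate

    func session(_ session: SHSession, didFind match: SHMatch) {
        onResult?(match.mediaItems.first?.title)
    }

    func session(_ session: SHSession, didNotFindMatchFor signature: SHSignature, error: Error?) {
        onResult?(nil)
    }
}
```

Everything in start() and stop() is work that SHManagedSession takes over, which is where the saving comes from.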

Map the state onto SwiftUI

SHManagedSession conforms to Observable, so SwiftUI picks up its state changes directly. There are three states:

idle: waiting, neither recording nor matching
prerecording: prepare() has been called and the session is buffering audio ahead of a match
matching: actively attempting a match

Reflecting these three states directly in the button's appearance tells the user what the app is doing right now. Here's a minimal screen. To keep it short, it tracks a single isWorking flag of its own rather than switching on session.state.

import SwiftUI
import ShazamKit
 
@MainActor
final class RecognizerModel: ObservableObject {
    let session = SHManagedSession()
    @Published var title: String?
    @Published var artist: String?
    @Published var isWorking = false
 
    /// Point and match. Results arrive from the `results` async sequence (a property, not a method).
    func recognize() async {
        isWorking = true
        defer { isWorking = false }
 
        // Attempt one match, stop on the first result
        for await result in session.results {
            switch result {
            case .match(let match):
                if let item = match.mediaItems.first {
                    self.title = item.title
                    self.artist = item.artist
                }
                return                    // Stop once we've got a hit
            case .noMatch:
                self.title = "No match"
                return
            case .error(let error, _):
                self.title = "Error: \(error.localizedDescription)"
                return
            @unknown default:
                return
            }
        }
    }
}
 
struct RecognizeView: View {
    @StateObject private var model = RecognizerModel()
 
    var body: some View {
        VStack(spacing: 24) {
            if let title = model.title {
                Text(title).font(.title2).bold()
                if let artist = model.artist {
                    Text(artist).foregroundStyle(.secondary)
                }
            } else {
                Text(model.isWorking ? "Listening…" : "Tap to recognize a song")
                    .foregroundStyle(.secondary)
            }
 
            Button {
                Task { await model.recognize() }
            } label: {
                Image(systemName: "shazam.logo.fill")
                    .font(.system(size: 64))
            }
            .disabled(model.isWorking)
        }
        .padding()
    }
}

session.results is an async sequence of match results. The loop returns on the first result because one identified song is enough for this use. To keep identifying continuously, drop the return statements and keep iterating the sequence; results keep arriving as the playing song changes.
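
To drive the UI from the session's own state instead of a hand-kept flag, something like the following sketch could work. StateDrivenView, statusLabel(for:), and the label strings are hypothetical names for illustration. It uses result(), the single-attempt counterpart of the results sequence, and lumps noMatch and error together for brevity.

```swift
import SwiftUI
import ShazamKit

/// Maps the session state to a user-facing label. The strings are placeholders.
func statusLabel(for state: SHManagedSession.State) -> String {
    switch state {
    case .idle:         return "Tap to recognize a song"
    case .prerecording: return "Ready"
    case .matching:     return "Listening…"
    @unknown default:   return ""
    }
}

struct StateDrivenView: View {
    // SHManagedSession is Observable, so reading session.state in body
    // re-renders this view whenever the state changes.
    let session: SHManagedSession
    @State private var title: String?

    var body: some View {
        VStack(spacing: 16) {
            Text(title ?? statusLabel(for: session.state))

            Button("Recognize") {
                Task {
                    // result() makes one match attempt and returns its outcome.
                    let result = await session.result()
                    if case .match(let match) = result {
                        title = match.mediaItems.first?.title
                    } else {
                        title = "No match"   // noMatch and error treated alike here
                    }
                }
            }
            .disabled(session.state == .matching)
        }
        .padding()
    }
}
```

With this shape there is no separate isWorking flag to keep in sync; the button disables itself whenever the session reports matching.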

Use prerecording to cut first-touch misses

On a real device you notice that the first few seconds after the press decide success or failure. If recording starts only when the button is pressed, the opening of the audio is lost and matching gets harder. SHManagedSession's prerecording state fills that gap.

Call prepare() ahead of a likely match — say, right after the screen appears — and the session prerecords in anticipation. By the time the user presses the button, a few seconds of audio are already in hand, so first-touch misses drop visibly.

// Prerecord in anticipation the moment the screen appears
.task {
    await model.session.prepare()   // idle → prerecording
}

In my case, matching in a low-volume environment like a quiet café improved noticeably once I added this one line. Users are sensitive to "I pressed it and it didn't work," so capturing the first few seconds ahead of time makes a big difference in how the app feels. Note that the mic starts running the moment you call prepare(), so when you leave the screen, be sure to cancel the session and return it to idle. Neglect this and it keeps recording in the background.

.onDisappear {
    model.session.cancel()   // Return to idle, stop recording
}
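
The two snippets above can be bundled into one reusable modifier. The sketch below is one possible arrangement, not a required pattern: PrerecordWhileVisible and prerecordWhileVisible(_:) are hypothetical names, and the scene-phase handling is an extra assumption that the mic should also stop when the whole app goes to the background. It uses the two-parameter onChange closure from iOS 17.

```swift
import SwiftUI
import ShazamKit

/// Hypothetical modifier that ties prerecording to a screen's visible lifetime.
struct PrerecordWhileVisible: ViewModifier {
    let session: SHManagedSession
    @Environment(\.scenePhase) private var scenePhase

    func body(content: Content) -> some View {
        content
            .task {
                await session.prepare()          // idle → prerecording
            }
            .onDisappear {
                session.cancel()                 // back to idle, mic released
            }
            .onChange(of: scenePhase) { _, newPhase in
                switch newPhase {
                case .background:
                    session.cancel()             // don't hold the mic while backgrounded
                case .active:
                    // Resume prerecording only if nothing is in progress.
                    if session.state == .idle {
                        Task { await session.prepare() }
                    }
                case .inactive:
                    break
                @unknown default:
                    break
                }
            }
    }
}

extension View {
    /// Convenience wrapper so a screen can opt in with a single call.
    func prerecordWhileVisible(_ session: SHManagedSession) -> some View {
        modifier(PrerecordWhileVisible(session: session))
    }
}
```

A screen would then opt in with RecognizeBody().prerecordWhileVisible(session), where RecognizeBody stands for whatever view hosts the button, and the prepare/cancel pairing can no longer be forgotten on one side.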

When you want to match your own catalog

ShazamKit can match not only Apple's music catalog but audio you supply yourself — recognizing an in-venue announcement, or audio content you distribute. In that case you generate a signature from the audio ahead of time, register it in a custom catalog, and hand that catalog to the session.

// Build a catalog from your own audio and hand it to the session
// (needs import AVFoundation; run inside an async throwing context)
let catalog = SHCustomCatalog()
// generateSignature(from:) is async and takes an AVAsset (iOS 16+); myAudioFile is the file's URL
let signature = try await SHSignatureGenerator.generateSignature(from: AVURLAsset(url: myAudioFile))
let mediaItem = SHMediaItem(properties: [.title: "In-venue announcement A"])
try catalog.addReferenceSignature(signature, representing: [mediaItem])

let session = SHManagedSession(catalog: catalog)

With this, you can build an experience where the app reacts when a specific sound plays, without leaning on Apple's catalog. Event-venue audio, limited-release content: with a little imagination, this grows into interesting projects even for an indie developer.
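
Generating a signature takes time, so it may be worth building the catalog once and saving it. The sketch below wraps the steps in functions and adds persistence with write(to:) and add(from:). The function names and every URL are hypothetical, and whether you generate catalogs on device or ship a prebuilt file is a design choice this article does not settle.

```swift
import AVFoundation
import ShazamKit

/// Builds a custom catalog from one audio file. `audioURL` and `title` are caller-supplied.
func makeCatalog(from audioURL: URL, title: String) async throws -> SHCustomCatalog {
    // Reads the asset and produces its signature (iOS 16+).
    let asset = AVURLAsset(url: audioURL)
    let signature = try await SHSignatureGenerator.generateSignature(from: asset)
    let item = SHMediaItem(properties: [.title: title])

    let catalog = SHCustomCatalog()
    try catalog.addReferenceSignature(signature, representing: [item])
    return catalog
}

/// Loads a previously saved catalog file, for example one bundled with the app.
func loadCatalog(at catalogURL: URL) throws -> SHCustomCatalog {
    let catalog = SHCustomCatalog()
    try catalog.add(from: catalogURL)
    return catalog
}

/// Builds the catalog, saves it, and returns a session that matches against it.
func makeAnnouncementSession(audioURL: URL, savingTo catalogURL: URL) async throws -> SHManagedSession {
    let catalog = try await makeCatalog(from: audioURL, title: "In-venue announcement A")
    try catalog.write(to: catalogURL)     // persist so later launches can skip generation
    return SHManagedSession(catalog: catalog)
}
```

On later launches, SHManagedSession(catalog: try loadCatalog(at: catalogURL)) would skip signature generation entirely.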

Decide the boundary between Rork Max native and Expo

Because Rork Max generates native Swift, it can use an audio framework like ShazamKit directly. Meanwhile, many existing apps are built on Rork's Expo / React Native stack. Deciding the split up front is the trick to keeping the implementation light.

My split is as follows.

Layer	Owns	Doesn't own
Native Swift (Rork Max generated)	Starting SHManagedSession, matching, getting results	Where history is saved, navigation
React Native (JS side)	Start instruction, receiving the matched song, saving/sharing	Round-tripping raw audio data

The key is not to hand the audio itself to the JS side. Keep matching fully on the native side and return only the matched title, artist, and (if present) Apple Music identifier as an event — a one-directional flow. As a native module, exposing just one method, startRecognition, and an onMatch event is enough.

// Minimal surface with the Expo Modules API (sketch)
import ExpoModulesCore
import ShazamKit

public class ShazamModule: Module {
    let session = SHManagedSession()
 
    public func definition() -> ModuleDefinition {
        Name("Shazam")
 
        // JS only touches this method and the onMatch event
        AsyncFunction("startRecognition") {
            for await result in self.session.results {
                if case .match(let match) = result, let item = match.mediaItems.first {
                    self.sendEvent("onMatch", [
                        "title": item.title ?? "",
                        "artist": item.artist ?? "",
                        "appleMusicID": item.appleMusicID ?? ""
                    ])
                    return
                }
            }
        }
 
        Events("onMatch")
    }
}

Now the JS side is only ever "ask to recognize, receive the matched song, save or share it." The heavy audio work stays sealed on the native side, so it never clogs the bridge.
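
One way to make that boundary explicit is to give the payload its own small type, so the only thing that can cross the bridge is metadata. MatchPayload and eventBody are hypothetical names for this sketch. Unlike the module above, which sends an empty string for a missing Apple Music ID, this variant omits the key instead; either convention works as long as the JS side knows which one to expect.

```swift
import Foundation

/// Hypothetical value type for what crosses the bridge: metadata only, never audio.
struct MatchPayload: Equatable {
    let title: String
    let artist: String
    let appleMusicID: String?

    /// Dictionary form suitable for an event body; the ID key is omitted when absent.
    var eventBody: [String: String] {
        var body = ["title": title, "artist": artist]
        if let appleMusicID {
            body["appleMusicID"] = appleMusicID
        }
        return body
    }
}

let payload = MatchPayload(title: "Song A", artist: "Artist B", appleMusicID: nil)
print(payload.eventBody.keys.sorted())   // ["artist", "title"]
```

Because the type has no field for audio data, a later change cannot start leaking buffers to JS without someone editing the struct on purpose.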

Pitfalls I found by building it

Finally, a few things I stumbled on.

First, the mic-permission string. Put NSMicrophoneUsageDescription in Info.plist and state honestly why the mic is needed. Without the key, the app crashes the moment it touches the mic; with a vague or empty string, you invite App Review rejection and user distrust.

Second, forgetting to return to idle. As noted, prepare() starts the mic, so always call cancel() when leaving the screen. Recording in the background leads straight to battery complaints and privacy concerns.

Third, how you present a miss. noMatch is a normal outcome: the surroundings are noisy, or the track is an instrumental with few distinctive features. Adding a next step, like "No match. Try again a little closer," makes the experience kinder.

Fourth, where to start prerecording. Starting it as the screen appears feels natural, but the mic runs for as long as the session is prerecording, and that costs power. Calling prepare() only on screens where recognition is likely is the compromise with battery life.
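
The third point, presenting a miss with a next step, can be kept in one place by mapping every outcome to a message. This is a minimal sketch: RecognitionOutcome and userMessage(for:) are hypothetical app-level names, deliberately separate from ShazamKit's own result type, and the wording of each message is only an example.

```swift
import Foundation

/// Hypothetical app-level outcome, decoupled from ShazamKit's result type.
enum RecognitionOutcome {
    case matched(title: String, artist: String?)
    case noMatch
    case failed(reason: String)
}

/// Turns an outcome into a message; misses and failures always suggest a next step.
func userMessage(for outcome: RecognitionOutcome) -> String {
    switch outcome {
    case .matched(let title, let artist):
        if let artist { return "\(title) by \(artist)" }
        return title
    case .noMatch:
        return "No match. Try again a little closer to the speaker."
    case .failed(let reason):
        return "Couldn't listen (\(reason)). Check mic access and try again."
    }
}

print(userMessage(for: .noMatch))
```

Keeping the strings in a single function also makes it easy to check that no outcome falls through to a blank screen.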

Start by building "tap to identify one song" against Apple's catalog. Once that feels right, sharpening accuracy with prerecording, extending to your own catalog, and splitting work between Rork Max and Expo all follow naturally. I hope this helps anyone who wants to build their own "point it and it knows."
