Design On-Device Core ML So Cold Start and Heat Don't Break It
Put on-device Core ML in the native Swift that Rork Max generates and you hit two walls before accuracy: the first inference is slow, and the device heats up and slows down. Here is a design built around cold start and a thermal budget, with working Swift.
Because Rork Max can generate native Swift apps, on-device Core ML inference — long out of easy reach in React Native — is now within reach even for indie development. But put it on a real device and there are stumbling points before you ever get to accuracy. The first inference is oddly slow. After a while the device warms up and everything feels sluggish. Both are design problems about when and how much you run inference, not about whether the model is good.
As an indie developer at Dolice, I built on-device inference into an app I run and struggled with a roughly one-second freeze on the very first call. The cause was not model accuracy; it was running the first load and inference on the main thread right at launch.
This article lays out how to design Core ML around two constraints — cold start and a thermal budget — with Swift code.
Why the first inference is slow
A Core ML model runs two heavy operations the first time you use it. One is loading and compiling the model (optimized for the device's Neural Engine); the other is the first inference, which allocates internal buffers. It is normal for the first call to be an order of magnitude slower than later ones.
Stage           | What mainly happens         | Felt impact
----------------|-----------------------------|------------------------
First load      | Model compile and placement | Hundreds of ms to 1 s
First inference | Buffer allocation, warmup   | Tens to hundreds of ms
Subsequent      | Run on the allocated path   | Often around 10-30 ms
The problem is throwing that heavy first call at the moment the user is waiting for a result. The design goal is simple: move the heavy first call earlier, to a time when the user is not waiting.
Pull model load and warmup out of the launch flow
First, do not load the model synchronously at app launch. Defer the load until after the first screen appears, and run the load and warmup (one inference on dummy input) off the main thread.
    import CoreML

    // Thrown when the model could not be loaded.
    enum InferenceError: Error {
        case unavailable
    }

    // MyModel, MyModelInput, and MyModelOutput are the classes Xcode generates for your .mlmodel.
    actor InferenceEngine {
        private var model: MyModel?

        // Warm up in the background; call after the first screen appears, not at launch.
        func warmUp() async {
            guard model == nil else { return }
            let config = MLModelConfiguration()
            config.computeUnits = .all // let it use the Neural Engine too
            do {
                let loaded = try MyModel(configuration: config)
                // Run the first inference on dummy input to allocate buffers.
                // `.dummy` is a placeholder input you define yourself on MyModelInput.
                _ = try? loaded.prediction(input: .dummy)
                model = loaded
            } catch {
                model = nil // do not block launch even on failure
            }
        }

        func predict(_ input: MyModelInput) async throws -> MyModelOutput {
            if model == nil { await warmUp() }
            guard let model else { throw InferenceError.unavailable }
            return try model.prediction(input: input)
        }
    }
The caller invokes warmUp() in the brief gap after the first screen renders and before the user starts interacting.
    .task {
        // warm up during idle time after the screen shows
        await engine.warmUp()
    }
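What the dummy input looks like depends entirely on the model. As a rough sketch, assuming a model with a single multi-array input, a placeholder can be built with the untyped MLModel API. The feature name "input" and the shape [1, 3, 224, 224] are assumptions for illustration only; read the real values from your model's description.

```swift
import CoreML

// Builds a placeholder input for warmup.
// ASSUMPTION: the feature name "input" and shape [1, 3, 224, 224] are
// illustrative; replace them with your model's actual input description.
func makeDummyInput() throws -> MLFeatureProvider {
    let array = try MLMultiArray(shape: [1, 3, 224, 224], dataType: .float32)
    return try MLDictionaryFeatureProvider(
        dictionary: ["input": MLFeatureValue(multiArray: array)]
    )
}

// Runs one throwaway prediction so Core ML allocates its buffers.
// Errors are ignored on purpose: warmup must never block the app.
func warmUp(_ model: MLModel) {
    guard let input = try? makeDummyInput() else { return }
    _ = try? model.prediction(from: input)
}
```

The generated model class exposes its underlying MLModel through the `model` property, so `warmUp(loaded.model)` would be one way to call this.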
The actor is what pays off here. An actor runs one call at a time, and warmUp() has no suspension point between the nil check and the assignment, so even if predict is called from several places at once, the language guarantees the load does not run twice. Early on I forgot this protection and loaded the model again on every screen transition, needlessly bloating memory. Concurrent access is a quiet pitfall in on-device inference.
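That guarantee relies on the load being synchronous inside the actor. If the load ever becomes asynchronous, actor reentrancy lets a second caller start while the first is suspended. One common guard is to cache the in-flight Task so every caller awaits the same load. This is a minimal sketch; OnceLoader and its loader closure are hypothetical names, not part of Core ML.

```swift
import Foundation

// Hypothetical helper: loads a value at most once, even when the loader suspends.
actor OnceLoader<Value: Sendable> {
    private var task: Task<Value, Error>?
    private let load: @Sendable () async throws -> Value

    init(load: @escaping @Sendable () async throws -> Value) {
        self.load = load
    }

    func value() async throws -> Value {
        // A load is already running or finished: wait for the same task.
        if let task { return try await task.value }
        // No suspension happens between creating the task and storing it,
        // so a second caller cannot start another load.
        let newTask = Task { [load] in try await load() }
        task = newTask
        return try await newTask.value
    }
}
```

Note that this sketch also caches a failed load; whether to reset and retry after an error is a policy choice left out here.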
Gate inference on a thermal budget
On-device inference uses power and generates heat. Run it continuously and the device warms, the OS throttles clock speeds, and the whole app slows as a result. To avoid this, read ProcessInfo.thermalState and proactively reduce inference frequency or quality as heat rises.
    import Foundation

    enum InferenceBudget {
        case full      // as usual
        case reduced   // thin out frequency
        case suspended // stop inference, switch to a lightweight fallback

        static func current() -> InferenceBudget {
            switch ProcessInfo.processInfo.thermalState {
            case .nominal, .fair: return .full
            case .serious: return .reduced
            case .critical: return .suspended
            @unknown default: return .reduced
            }
        }
    }
The caller changes behavior by budget. For real-time frame inference, under reduced it infers only once every few frames and reuses the last result in between; under suspended it stops inferring and switches to a light alternative such as a rule-based path.
    // lastResult, frameCounter, fallbackHeuristic, and render belong to the
    // surrounding type that owns the frame pipeline.
    func process(frame: Frame) async {
        switch InferenceBudget.current() {
        case .full:
            lastResult = try? await engine.predict(frame.input)
        case .reduced:
            frameCounter += 1
            if frameCounter % 3 == 0 { // only once every 3 frames
                lastResult = try? await engine.predict(frame.input)
            }
        case .suspended:
            lastResult = fallbackHeuristic(frame) // stop inferring
        }
        render(lastResult)
    }
You might feel that reacting after heat rises is too late. The key is not to treat heat as all-or-nothing: thin out at the serious stage so you never reach critical. What paid off in production was reducing quietly before the device got hot, rather than stopping after it did.
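To tune where the step-down happens, it helps to see when the thermal state actually changes on a real device. The sketch below, under stated assumptions, logs each transition using ProcessInfo.thermalStateDidChangeNotification. ThermalLogger and the free function budget(for:) are hypothetical names; the enum is repeated only so the block compiles on its own.

```swift
import Foundation

// Repeated from the article so this sketch compiles on its own.
enum InferenceBudget { case full, reduced, suspended }

// Pure mapping, split out so it can be unit-tested without heating a device.
func budget(for state: ProcessInfo.ThermalState) -> InferenceBudget {
    switch state {
    case .nominal, .fair: return .full
    case .serious: return .reduced
    case .critical: return .suspended
    @unknown default: return .reduced
    }
}

// Hypothetical helper: prints a line whenever the system reports a new thermal state.
final class ThermalLogger {
    private var token: NSObjectProtocol?

    func start() {
        token = NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil,
            queue: .main
        ) { _ in
            let state = ProcessInfo.processInfo.thermalState
            print("thermal state changed, budget is now \(budget(for: state))")
        }
    }

    func stop() {
        if let token { NotificationCenter.default.removeObserver(token) }
        token = nil
    }

    deinit { stop() }
}
```

Keeping the mapping as a pure function means the thresholds can change in one place once real-device logs show where throttling begins.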
Release the model on memory warnings and backgrounding
A model can occupy tens of MB. Holding it while unused raises the chance of being killed under a memory warning. Release the model on backgrounding and memory warnings, and warm it up again on return.
    // Must live in the same file as InferenceEngine, because `model` is private.
    extension InferenceEngine {
        func release() { model = nil } // drop the reference and return memory
    }

    // app side
    .onChange(of: scenePhase) { _, phase in
        if phase == .background {
            Task { await engine.release() }
        } else if phase == .active {
            Task { await engine.warmUp() } // re-warm on return
        }
    }
Release plus re-warm looks like double the cost, but the warmup on return hides in the gap before the user starts interacting. Letting go and re-warming is more stable than holding on and hitting a memory warning. The trade-off depends on model size and how often inference runs, so it is safest to decide it from measurements.
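The scenePhase handler covers backgrounding; memory warnings arrive separately as a UIKit notification. One possible way to hook them from SwiftUI is sketched below. The modifier name onMemoryWarning is hypothetical; UIApplication.didReceiveMemoryWarningNotification is the system notification it listens for.

```swift
import SwiftUI
import UIKit
import Combine

extension View {
    // Hypothetical modifier: runs `release` whenever iOS posts a memory warning.
    func onMemoryWarning(
        perform release: @escaping @Sendable () async -> Void
    ) -> some View {
        onReceive(
            NotificationCenter.default.publisher(
                for: UIApplication.didReceiveMemoryWarningNotification
            )
        ) { _ in
            // Hop into a task because releasing the model is an async actor call.
            Task { await release() }
        }
    }
}
```

Attached to the root view, it could be used as `.onMemoryWarning { await engine.release() }`, with the next predict call warming the model up again.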
Do not decide without measuring
The numbers above (how many frames to skip between inferences, where to set thresholds) vary by app and target device. I strongly recommend measuring first load, first inference, and subsequent inference times with signposts and deciding from your own app's real data. Pinning computeUnits to .cpuOnly on a guess, or always relying on .all, can backfire on some devices.
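As a starting point for that measurement, the three stages can each be wrapped in a signpost interval and inspected in Instruments. This is a minimal sketch assuming iOS 15 or later for OSSignposter; the subsystem and category strings and the helper name measured are placeholders.

```swift
import os

// ASSUMPTION: subsystem and category are placeholders; use your own identifiers.
let signposter = OSSignposter(subsystem: "com.example.app", category: "Inference")

// Wraps a synchronous step in a signpost interval and returns its result.
func measured<T>(_ name: StaticString, _ work: () throws -> T) rethrows -> T {
    let state = signposter.beginInterval(name, id: signposter.makeSignpostID())
    defer { signposter.endInterval(name, state) }
    return try work()
}

// Intended usage with the article's types (not compiled here):
// let loaded = try measured("first load") { try MyModel(configuration: config) }
// _ = try measured("first inference") { try loaded.prediction(input: .dummy) }
// _ = try measured("inference") { try loaded.prediction(input: input) }
```

Using one name per stage keeps the first load, first inference, and steady-state intervals in separate lanes, which makes the cold-start cost easy to compare across devices.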
If you are unsure which of the three to start with, just "pull warmup out of the launch flow" is enough at first. Removing the initial freeze alone changes the feel dramatically. Thermal and memory control can be added only as needed, once you observe temperature and memory warnings on a real device.
On-device Core ML becomes far more practical when you can design "when to run it" before accuracy. Precisely because Rork Max puts native Swift within reach now, locking down this foundation first pays off later.