AI Models / 2026-06-12 / Advanced

A Three-Layer AI Cost Design for Rork Apps After Apple Opened Foundation Models to Small Developers

Apple now offers Foundation Models on Private Cloud Compute at no charge for developers under two million first downloads. Here is a three-layer cost architecture for Rork apps, with a simulation script and working bridge code.



Midway through following the WWDC 2026 announcements in June, one line in the State of the Union stopped me: developers with fewer than two million first downloads can use the Foundation Models running on Private Cloud Compute at no charge.

For indie apps, the cost of AI features — metered inference billing — has been the single tightest constraint on pricing. Your API bill grows in proportion to your DAU, while competitors ship the same feature for free. Protecting your margin means cutting into the experience. If part of that structure becomes free, the assumptions behind your architecture deserve a fresh look.

I covered the on-device side previously in Use Apple FoundationModels in Rork — A Practical Guide to On-Device LLM Apps That Keep Working Offline, where the working rule was a two-way split: narrow, fast tasks on the device; heavy tasks in the cloud. This announcement inserts a third option right between those two — a free cloud layer. What follows walks through that shift step by step, from cost simulation to implementation paths.

How to read the under-two-million free tier

Condensing the public information available at announcement time, four points matter.

What is free is the Foundation Models running on Private Cloud Compute (PCC) — the larger server-side model, not the roughly 3-billion-parameter on-device version. Developers under two million first downloads can call it without additional fees
The Foundation Models framework gains image input, so understanding tasks that take screenshots or photos can be written as a natural extension of the same Swift API
Server-side model integration arrives through the same Swift API, with third-party models such as Claude and Gemini connectable without changing the call site
The framework is slated to go open source this summer, which will make its internals inspectable while you design

The third point deserves a careful reading. Being able to call Claude through the same Swift API is a statement about connection convenience, not about cost. Third-party inference still bills through each provider's own keys and pricing. Mixing those two up in the excitement of launch week will quietly corrupt your cost model. The first thing I did was write that boundary down on paper.

The two-million threshold also has a practical meaning. Most indie apps live well below two million first downloads per app, so in effect this is a program aimed at individual developers and small teams. That said, the fine print — how downloads are counted, for instance — will settle in the official documentation, so treat the developer agreement as the source of truth before you take a dependency.

Three layers, and which tasks belong where

With a free PCC tier in the middle, AI placement becomes a three-layer decision. The properties line up like this.

Layer 1 (on-device): about 3 billion parameters, roughly a 4K-token context. Zero cost, works offline, lowest latency. Suited to classification, tagging, short rewrites, and templated responses
Layer 2 (Foundation Models on PCC): the larger server-side model and the workhorse of the free tier. Summarization, structured extraction, and image-input understanding that will not fit on-device. Requires a network connection
Layer 3 (third-party APIs): Claude or Gemini, metered. Reserve it for long-context reasoning, generation whose quality directly drives revenue, and the hard corners of multilingual work

Four criteria are enough to sort tasks: whether the task must work offline, whether it fits within about 4K tokens, whether output quality maps directly to revenue or ratings, and how frequently it runs. The higher the call frequency, the lower the layer it should sit in — pulling high-frequency tasks out of metered billing is what actually bends the cost curve.

As an indie developer running wallpaper apps, my own mapping looks like this: candidate tag generation for images sits in layer 1, draft replies to user reviews sit in layer 2, and first-pass localization of store listings into many languages sits in layer 3. Different tasks in the same app naturally land in different layers.


Estimate before you migrate — a cost simulation script

Re-homing tasks costs implementation time, so put a number on the savings first. The script below compares an all-third-party setup against a three-layer setup, using per-task monthly call counts and token volumes.

// ai-cost-simulation.ts — estimate the monthly effect of going three-layer
// run: npx tsx ai-cost-simulation.ts
 
type Layer = "on-device" | "pcc" | "third-party";
 
interface TaskProfile {
  name: string;
  monthlyCalls: number;    // calls per month
  avgInputTokens: number;  // input tokens per call
  avgOutputTokens: number; // output tokens per call
  layer: Layer;            // placement in the three-layer design
}
 
// Placeholder prices — replace with the current rates of the model you use
const THIRD_PARTY_PRICE = {
  inputPerMTok: 3.0,   // USD per million input tokens
  outputPerMTok: 15.0, // USD per million output tokens
};
 
const tasks: TaskProfile[] = [
  { name: "image tag candidates",   monthlyCalls: 120_000, avgInputTokens: 200,   avgOutputTokens: 20,  layer: "on-device" },
  { name: "review reply drafts",    monthlyCalls: 90_000,  avgInputTokens: 1_200, avgOutputTokens: 200, layer: "pcc" },
  { name: "long document analysis", monthlyCalls: 6_000,   avgInputTokens: 6_000, avgOutputTokens: 800, layer: "third-party" },
];
 
function thirdPartyCost(t: TaskProfile): number {
  const input  = (t.monthlyCalls * t.avgInputTokens)  / 1_000_000 * THIRD_PARTY_PRICE.inputPerMTok;
  const output = (t.monthlyCalls * t.avgOutputTokens) / 1_000_000 * THIRD_PARTY_PRICE.outputPerMTok;
  return input + output;
}
 
// every task on third-party APIs
const allThirdParty = tasks.reduce((sum, t) => sum + thirdPartyCost(t), 0);
 
// three-layer: on-device and pcc cost nothing (within the free tier)
const threeLayer = tasks
  .filter((t) => t.layer === "third-party")
  .reduce((sum, t) => sum + thirdPartyCost(t), 0);
 
console.log(`all third-party: $${allThirdParty.toFixed(2)} / month`);
console.log(`three-layer:     $${threeLayer.toFixed(2)} / month`);
console.log(`reduction:       ${((1 - threeLayer / allThirdParty) * 100).toFixed(1)}%`);
 
// expected output:
// all third-party: $882.00 / month
// three-layer:     $180.00 / month
// reduction:       79.6%

With these sample values, $882 a month becomes $180 — roughly an 80 percent cut. What matters is not the headline number but its composition. The high-frequency, low-token tag task and the mid-frequency reply-draft task left metered billing, so the remaining spend is only the low-frequency, high-cost work. Once your costs have that shape, user growth no longer drags your bill up at the same slope.
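The point about slope can be checked with a few lines of arithmetic. The sketch below reuses the sample prices and task volumes from the script and scales every task by the same usage multiplier. That uniform growth is an assumption made for illustration, and the names `meteredCost` and `monthlyBill` are hypothetical helpers, not part of the original script.

```typescript
// Sketch: how the monthly bill scales when usage grows by a constant factor.
// Prices and task volumes mirror the sample values; both are placeholders.
type Layer = "on-device" | "pcc" | "third-party";

interface TaskProfile {
  monthlyCalls: number;
  avgInputTokens: number;
  avgOutputTokens: number;
  layer: Layer;
}

const PRICE = { inputPerMTok: 3.0, outputPerMTok: 15.0 }; // USD, placeholder rates

const sampleTasks: TaskProfile[] = [
  { monthlyCalls: 120_000, avgInputTokens: 200,   avgOutputTokens: 20,  layer: "on-device" },
  { monthlyCalls: 90_000,  avgInputTokens: 1_200, avgOutputTokens: 200, layer: "pcc" },
  { monthlyCalls: 6_000,   avgInputTokens: 6_000, avgOutputTokens: 800, layer: "third-party" },
];

// Metered cost of one task at a given usage multiplier (1 = today's volume).
function meteredCost(t: TaskProfile, growth: number): number {
  const calls = t.monthlyCalls * growth;
  return (calls * t.avgInputTokens * PRICE.inputPerMTok +
          calls * t.avgOutputTokens * PRICE.outputPerMTok) / 1_000_000;
}

// Monthly bill when only tasks in the listed layers are metered.
function monthlyBill(tasks: TaskProfile[], growth: number, metered: Layer[]): number {
  return tasks
    .filter((t) => metered.indexOf(t.layer) !== -1)
    .reduce((sum, t) => sum + meteredCost(t, growth), 0);
}

const ALL_LAYERS: Layer[] = ["on-device", "pcc", "third-party"];
for (const growth of [1, 2, 5]) {
  const flat = monthlyBill(sampleTasks, growth, ALL_LAYERS);
  const layered = monthlyBill(sampleTasks, growth, ["third-party"]);
  console.log(`${growth}x usage: all third-party $${flat.toFixed(2)}, three-layer $${layered.toFixed(2)}`);
}
```

Both bills still grow linearly, but each doubling of usage adds $882 to the all-third-party bill and only $180 to the three-layer bill. The percentage saved stays the same; the absolute slope is what changes.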

When you plug in your own numbers, replace THIRD_PARTY_PRICE with the current rates of the model you actually use, and take monthlyCalls from recent measurements rather than optimistic guesses. A loose estimate here will misorder your migration priorities.

Keeping one call site — routing on-device and Private Cloud Compute

The most fragile version of a three-layer design is one where layer-selection logic is scattered across the app. Every time the free-tier conditions shift, or the finalized PCC API differs slightly at release, you end up hunting for call sites. I prefer to concentrate the decision in a single router and keep each layer's implementation swappable behind one function.

// AIRouter.swift — concentrate layer selection in one place
import FoundationModels
 
enum AILayer {
    case onDevice      // layer 1: in-device, free, works offline
    case privateCloud  // layer 2: Foundation Models on PCC (free tier)
    case thirdParty    // layer 3: metered APIs such as Claude / Gemini
}
 
struct AITask {
    let prompt: String
    let needsOffline: Bool
    let estimatedTokens: Int
    let qualityCritical: Bool
}
 
enum AIRouter {
    static func layer(for task: AITask) -> AILayer {
        // offline requirements force layer 1
        if task.needsOffline { return .onDevice }
        // only revenue-critical quality goes to metered layer 3
        if task.qualityCritical { return .thirdParty }
        // beyond ~4K tokens it will not fit on-device
        if task.estimatedTokens > 3_500 { return .privateCloud }
        return .onDevice
    }
 
    static func respond(to task: AITask) async throws -> String {
        switch layer(for: task) {
        case .onDevice:
            // always check device support (fall back to PCC if unsupported)
            guard case .available = SystemLanguageModel.default.availability else {
                return try await respondViaPrivateCloud(task)
            }
            let session = LanguageModelSession(
                instructions: "Answer briefly and calmly."
            )
            let response = try await session.respond(to: task.prompt)
            return response.content
        case .privateCloud:
            return try await respondViaPrivateCloud(task)
        case .thirdParty:
            return try await respondViaBackend(task) // through your own backend
        }
    }
 
    private static func respondViaPrivateCloud(_ task: AITask) async throws -> String {
        // Placeholder: as written, this still creates the default session, which is
        // the on-device model. Confirm the finalized PCC selection API in the release
        // docs, then replace only the body of this function (callers stay untouched).
        let session = LanguageModelSession(
            instructions: "Answer in short paragraphs without bullet lists."
        )
        let response = try await session.respond(to: task.prompt)
        return response.content
    }
 
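    private static func respondViaBackend(_ task: AITask) async throws -> String {
        // The backend's budget guard expects estimatedCostUsd alongside the prompt.
        // Placeholder rate of $3 per million tokens: replace with your layer-3 model's price.
        struct Payload: Encodable { let prompt: String; let estimatedCostUsd: Double }
        let estimate = Double(task.estimatedTokens) / 1_000_000 * 3.0
        var request = URLRequest(url: URL(string: "https://api.example.com/ai/heavy")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(Payload(prompt: task.prompt, estimatedCostUsd: estimate))
        let (data, response) = try await URLSession.shared.data(for: request)
        // a non-200 status (429 when the budget is spent) is surfaced so the caller can retreat a layer
        if let http = response as? HTTPURLResponse, http.statusCode != 200 {
            throw URLError(.badServerResponse)
        }
        struct Reply: Decodable { let text: String }
        return try JSONDecoder().decode(Reply.self, from: data).text
    }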
}

Two details carry the weight here. First, when a device cannot run layer 1, the router hands the task to respondViaPrivateCloud instead of throwing, so once that function is wired to the real PCC API the feature keeps working on devices outside the Apple Intelligence lineup. Second, respondViaPrivateCloud is deliberately written to be replaced. The exact shape of the PCC selection API is something to verify against the release documentation, but with the boundary drawn here, that future edit touches one function. In the week after an announcement, separating confirmed facts from unconfirmed details inside the structure of your code is the most durable kind of preparation I know.

Bridging from an Expo-based Rork app

Everything above is Swift, which drops in directly if you build with Rork Max ($200/month, generates native Swift). The standard Rork tier ($25/month and up) generates Expo (React Native) apps, and Foundation Models is a Swift framework, so JavaScript cannot call it directly. You need a bridge.

There are three realistic paths.

Move to Rork Max: shortest path if native capabilities are your main battleground, though the price gap is large enough that this one point alone should not decide the migration
Write a native module with the Expo Modules API: export your Rork-generated app, switch to a development build, and add your own module. If AI is a core feature and zeroing out layer 1–2 costs matters to your margin, the effort pays for itself
Keep calling third-party APIs through your backend: if AI remains a supporting feature, staying entirely in layer 3 is a perfectly sound decision

Here is the skeleton of the second path — the module config, the Swift side, and the JavaScript call site.

// expo-module.config.json — module declaration
{
  "platforms": ["apple"],
  "apple": {
    "modules": ["FoundationModelsBridgeModule"]
  }
}

// ios/FoundationModelsBridgeModule.swift — the Swift side of the bridge
import ExpoModulesCore
import FoundationModels
 
public class FoundationModelsBridgeModule: Module {
  public func definition() -> ModuleDefinition {
    Name("FoundationModelsBridge")
 
    AsyncFunction("isAvailable") { () -> Bool in
      if case .available = SystemLanguageModel.default.availability {
        return true
      }
      return false
    }
 
    // the closure awaits and can throw, so mark it async throws explicitly
    AsyncFunction("respond") { (prompt: String) async throws -> String in
      let session = LanguageModelSession(
        instructions: "Reply briefly and politely."
      )
      let response = try await session.respond(to: prompt)
      return response.content
    }
  }
}

// lib/foundation-models.ts — the JavaScript call site
import { requireNativeModule } from "expo-modules-core";
 
const FoundationModelsBridge = requireNativeModule("FoundationModelsBridge");
 
export async function localSummarize(text: string): Promise<string | null> {
  const available: boolean = await FoundationModelsBridge.isAvailable();
  if (!available) return null; // caller falls back to layer 2 or 3
  return FoundationModelsBridge.respond(`Summarize this text in three lines: ${text}`);
}

To be candid, this path steps outside Rork's managed development experience. A development build is required, so the loop no longer closes inside Rork's browser environment. Still, if it removes hundreds of thousands of monthly calls from metered billing, the export-and-bridge effort justifies itself in plain numbers. Run the simulation with your own figures first; the order of operations matters.

A budget ceiling for the third layer

Even after the migration, layer 3 stays metered, and without a ceiling the old failure mode remains: a viral spike or a misbehaving bot inflates your end-of-month invoice. I route every layer-3 call through my own backend and enforce a monthly budget in code. The full reasoning lives in Enforcing AI Cost Ceilings at Runtime — A Budget-Guard Architecture for Rork Apps; here is the minimal version adapted to the three-layer setup.

// server/ai-router.ts — put a monthly budget cap at the layer-3 entrance
import { Hono } from "hono";
 
type Bindings = {
  ANTHROPIC_API_KEY: string; // injected as an environment variable, never hard-coded
  AI_SPEND: KVNamespace;     // KV namespace holding the month-to-date spend
};
 
const MONTHLY_BUDGET_USD = 50;
 
const app = new Hono<{ Bindings: Bindings }>();
 
app.post("/ai/heavy", async (c) => {
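  const { prompt, estimatedCostUsd } = await c.req.json<{
    prompt: string;
    estimatedCostUsd: number;
  }>();

  // reject a missing or malformed estimate; a NaN here would silently disable the guard
  if (typeof prompt !== "string" || !Number.isFinite(estimatedCostUsd) || estimatedCostUsd < 0) {
    return c.json({ error: "invalid_request" }, 400);
  }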
 
  const monthKey = `spend:${new Date().toISOString().slice(0, 7)}`;
  const spent = Number((await c.env.AI_SPEND.get(monthKey)) ?? "0");
 
  if (spent + estimatedCostUsd > MONTHLY_BUDGET_USD) {
    // over budget: return 429 so the client retreats to layer 2 or 1
    return c.json({ fallback: true, reason: "budget_exceeded" }, 429);
  }
 
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": c.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  });
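  if (!res.ok) {
    // upstream failure: do not record spend for a call that returned nothing
    return c.json({ fallback: true, reason: "upstream_error" }, 502);
  }
  const data = (await res.json()) as { content: { text: string }[] };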
 
  await c.env.AI_SPEND.put(monthKey, String(spent + estimatedCostUsd));
  return c.json({ text: data.content[0].text });
});
 
export default app;

For brevity this accumulates the client-supplied estimate; in production you should read the usage field from the response and accumulate the actual cost server-side. A budget guard that trusts client declarations has a hole in it.
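A minimal sketch of that server-side accounting follows. It assumes the response body carries a usage object with input_tokens and output_tokens, which is how the Anthropic Messages API reports token counts. The per-million-token rates are placeholders, and `actualCostUsd` and `recordSpend` are hypothetical helper names.

```typescript
// Sketch: derive the real cost of one call from the usage block in the API response.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Placeholder rates in USD per million tokens; replace with your model's current pricing.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

// Cost of a single completed call, computed from measured token counts.
function actualCostUsd(usage: Usage): number {
  return (usage.input_tokens * PRICE_PER_MTOK.input +
          usage.output_tokens * PRICE_PER_MTOK.output) / 1_000_000;
}

// Month-to-date spend after recording one completed call.
function recordSpend(spentSoFar: number, usage: Usage): number {
  return spentSoFar + actualCostUsd(usage);
}

const updated = recordSpend(12.5, { input_tokens: 6_000, output_tokens: 800 });
console.log(updated.toFixed(3));
```

In the route handler, the client estimate can still serve as a cheap pre-check before the call, while the value written back to the spend key would be the result of `recordSpend`.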

Decide the over-budget behavior per task while you are at it. A client that receives a 429 can resend the task to layer 2, accept one step lower quality on layer 1, or queue the work for next month. With those retreat paths assigned in advance, the ceiling acts as graceful quality degradation rather than a feature outage.
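One way to pin those retreat paths down is a small policy table on the client. This is a sketch under stated assumptions: the task names, the `Handlers` interface, and `runWithRetreat` are all hypothetical, and each handler stands in for whatever function your app uses to reach that layer.

```typescript
// Sketch: per-task retreat paths for a layer-3 call that hits the budget ceiling.
type Retreat = "layer2" | "layer1" | "queue";

interface HeavyResult {
  status: number;
  text?: string;
}

// Hypothetical handlers supplied by the app; each wraps one layer or the queue.
interface Handlers {
  callLayer3: (prompt: string) => Promise<HeavyResult>;
  callLayer2: (prompt: string) => Promise<string>;
  callLayer1: (prompt: string) => Promise<string>;
  enqueue: (prompt: string) => Promise<void>;
}

// Assumed task names and policies, for illustration only.
const RETREAT_POLICY: Record<string, Retreat> = {
  "long-document-analysis": "queue",
  "review-reply-draft": "layer2",
  "short-rewrite": "layer1",
};

async function runWithRetreat(task: string, prompt: string, h: Handlers): Promise<string | null> {
  const res = await h.callLayer3(prompt);
  if (res.status === 200 && res.text !== undefined) return res.text;
  if (res.status !== 429) throw new Error(`layer 3 failed with status ${res.status}`);

  // Budget exceeded: follow the retreat path assigned to this task in advance.
  const retreat: Retreat = RETREAT_POLICY[task] ?? "layer2";
  if (retreat === "layer2") return h.callLayer2(prompt);
  if (retreat === "layer1") return h.callLayer1(prompt);
  await h.enqueue(prompt);
  return null; // null tells the caller the work was deferred
}
```

Unknown tasks default to layer 2 here. That default is a design choice worth revisiting if any of your tasks must never leave the device.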

What the free tier means for your pricing page

When the cost structure changes, the pricing page follows. Layers 1 and 2 have near-zero marginal cost, so features built on them can sit in your free plan without bleeding money. Layer 3 features carry real per-call costs, so they belong in a paid plan or behind a usage allowance.

Free plan: compose it from layer 1 and 2 tasks — tagging, summaries, reply drafts can be generous without hurting your margin
Paid plan: assign layer 3 quality here — long-document processing, high-grade generation, output that holds up in professional use
A middle tier: a monthly allowance of N layer-3 calls with overflow nudging users toward the subscription is now easy to structure
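The middle-tier allowance can be sketched as a single check that runs before a layer-3 call is attempted. The plan names, the allowance of 30 calls, and the `checkLayer3Allowance` helper are assumptions for illustration, not values from any real pricing page.

```typescript
// Sketch: a middle tier with a monthly allowance of layer-3 calls.
type Plan = "free" | "middle" | "paid";

// Hypothetical allowances; Infinity marks a plan with no per-user cap
// (the global budget guard on the backend still applies).
const LAYER3_ALLOWANCE: Record<Plan, number> = { free: 0, middle: 30, paid: Infinity };

interface Decision {
  allowed: boolean;           // may this call go to layer 3?
  remaining: number;          // calls left this month after this one
  showUpgradePrompt: boolean; // nudge toward the subscription on overflow
}

function checkLayer3Allowance(plan: Plan, usedThisMonth: number): Decision {
  const limit = LAYER3_ALLOWANCE[plan];
  const allowed = usedThisMonth < limit;
  const remaining = Math.max(0, limit - usedThisMonth - (allowed ? 1 : 0));
  return { allowed, remaining, showUpgradePrompt: !allowed && plan !== "paid" };
}

console.log(checkLayer3Allowance("middle", 29)); // last call inside the allowance
console.log(checkLayer3Allowance("middle", 30)); // overflow, upgrade prompt shown
```

When `allowed` is false, the task can still be served from layer 1 or 2, so hitting the allowance lowers quality rather than blocking the feature.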

This also pairs well with a free plan supported by ad revenue such as AdMob: the wider your zero-cost layers, the more resilient the plan becomes to swings in ad rates. The same reasoning underlies the principles in "Rork App Subscription Pricing Design: 5 Principles to Maximize LTV Without Losing Retention."

One caution before you spend the entire windfall on a more generous free plan: the moment you cross two million first downloads, the cost assumptions behind layer 2 change. Success walks you toward that cliff. I prefer to confirm, at simulation time, that the paid plan's gross margin could absorb every layer-2 task being rebilled at layer-3 rates. Treat the free tier as a tailwind, not as the foundation — that posture ages better.
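That margin check is a few lines of arithmetic. In the sketch below, the cost figures come from the sample simulation (reply drafts cost $594 a month at metered rates, long document analysis $180). The $2,000 monthly revenue and the 50 percent margin floor are hypothetical, and `stressTest` is an illustrative helper name.

```typescript
// Sketch: would the paid plan hold its margin if layer 2 were billed at layer-3 rates?
interface LayerCosts {
  layer2AtMeteredRates: number; // USD per month if PCC tasks were rebilled
  layer3: number;               // USD per month already metered today
}

function grossMargin(revenueUsd: number, costUsd: number): number {
  return (revenueUsd - costUsd) / revenueUsd;
}

function stressTest(revenueUsd: number, costs: LayerCosts, minMargin: number) {
  const today = grossMargin(revenueUsd, costs.layer3);
  const afterThreshold = grossMargin(revenueUsd, costs.layer3 + costs.layer2AtMeteredRates);
  return { today, afterThreshold, survives: afterThreshold >= minMargin };
}

// Revenue and the margin floor are hypothetical inputs; replace with your own.
const result = stressTest(2_000, { layer2AtMeteredRates: 594, layer3: 180 }, 0.5);
console.log(result);
```

With these inputs the margin falls from 91 percent to about 61 percent, which still clears the floor. If `survives` comes back false with your numbers, that is the signal to keep the free plan less generous.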

Pitfalls worth avoiding early

A few traps I nearly stepped into while designing around this.

Skipping the device support check: layer 1 runs only on Apple Intelligence devices. Omit the SystemLanguageModel.default.availability check and older devices crash at launch. Build the check and the PCC fallback into the router
Assigning PCC to offline-first features: layer 2 needs a network. Features used in airplane mode — notes, drafts, offline reading aids — belong in layer 1 or need a local fallback
Reading "Claude through the same Swift API" as "Claude is free": third-party model costs remain on each provider's meter. Unified connection and free inference are separate facts
Freezing the design until the open-source release or final docs: what is unconfirmed is the fine shape of the PCC selection API, not the three-layer structure. Abstract the router now and the later swap stays small
Hard-coding assumptions about the fine print: how first downloads are counted and similar details should come from the official developer agreement, not from launch-week summaries. Optimistic assumptions here are the most expensive ones to unwind

Start with an inventory of your AI calls

List every place your app calls an AI model, then fill in four columns per task: monthly calls, average token volume, offline requirement, and quality sensitivity. Feed that table into the simulation script above and the migration priorities fall out as numbers. The inventory takes maybe half an hour, and it becomes the foundation for decisions that reach all the way to your pricing page.
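As a rough illustration of how priorities fall out of the table, the sketch below ranks inventory rows by what they would save if they left metered billing. The `InventoryRow` shape, the blended rate of $5 per million tokens, and the rule that quality-sensitive tasks stay on layer 3 are simplifying assumptions. The full simulation script, with separate input and output prices, remains the better estimate.

```typescript
// Sketch: turn the four-column inventory into a migration priority list.
interface InventoryRow {
  task: string;
  monthlyCalls: number;
  avgTokens: number;         // input + output tokens per call
  needsOffline: boolean;
  qualitySensitive: boolean;
}

// Placeholder blended rate in USD per million tokens; replace with your own.
const BLENDED_PRICE_PER_MTOK = 5.0;

function monthlyMeteredCost(row: InventoryRow): number {
  return (row.monthlyCalls * row.avgTokens * BLENDED_PRICE_PER_MTOK) / 1_000_000;
}

// Tasks that are not quality-sensitive can leave metered billing;
// rank them by how much they would save per month.
function migrationPriorities(rows: InventoryRow[]) {
  return rows
    .filter((r) => !r.qualitySensitive)
    .map((r) => ({
      task: r.task,
      savingsUsd: monthlyMeteredCost(r),
      target: r.needsOffline ? "on-device" : "on-device or pcc",
    }))
    .sort((a, b) => b.savingsUsd - a.savingsUsd);
}

const inventory: InventoryRow[] = [
  { task: "image tag candidates",   monthlyCalls: 120_000, avgTokens: 220,   needsOffline: true,  qualitySensitive: false },
  { task: "review reply drafts",    monthlyCalls: 90_000,  avgTokens: 1_400, needsOffline: false, qualitySensitive: false },
  { task: "long document analysis", monthlyCalls: 6_000,   avgTokens: 6_800, needsOffline: false, qualitySensitive: true },
];

console.log(migrationPriorities(inventory));
```

With this sample inventory, reply drafts come first and tag candidates second, while long document analysis stays where it is.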

Redoing my own inventory after the announcement, I found fewer tasks genuinely needed layer 3 than I had assumed. I hope the same exercise proves useful for your app — thank you for reading.
