AI Models / 2026-06-30 / Advanced

When Your Rork Hybrid AI Quietly Drifts to the Cloud and the Bill Creeps Up — Field Notes on Instrumenting Routing Decisions

A router that splits work across on-device, edge, and cloud layers will quietly drift toward the cloud when no one logs its decisions — flat traffic, rising bill. These are field notes on instrumenting routing to isolate the cause.

I Only Noticed It on the Invoice

An app with a hybrid AI stack came in at 1.7x the previous month's API bill. The frustrating part: active users and total request count were essentially flat. Traffic wasn't up, but cost was. My first guess was a price change on the model side. The per-token price hadn't moved.

It took a while to find the cause because nothing recorded where the router actually sent each request. The design split work cleanly across three layers — on-device, edge, cloud — but nobody was watching how it split in practice. We had a blueprint and no flight log.

These notes are the record of bolting on that flight log after the fact and isolating why the bill grew. The code is written to drop into an app generated by Rork Max (a React Native + Expo project) without a rewrite.

After years of running apps as an indie developer, I keep meeting this class of bug — nothing is broken, yet the cost just climbs. In my own work the quiet degradations that no one complains about have always been harder to catch than loud crashes, and how fast I can respond comes down to whether I instrumented the thing beforehand. This was one more case of exactly that.

Why It Drifts "Quietly"

Hybrid routing usually starts as a small heuristic. Personal data goes on-device, anything needing fresh information or complex reasoning goes to the cloud, everything else goes to the edge.

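// src/ai/AIRouter.ts: a common first version that records nothing
export type AILayer = 'on-device' | 'edge' | 'cloud';

// Request shape inferred from the fields the router reads below.
export interface AIContext {
  offlineMode?: boolean; containsPersonalInfo?: boolean;
  requiresLatestInfo?: boolean; complexReasoning?: boolean;   // set upstream, not here
}
export interface AIRequest { message: string; context?: AIContext }

export function determineAILayer(req: AIRequest): AILayer {
  const c: AIContext = req.context ?? {};
  if (c.offlineMode || c.containsPersonalInfo) return 'on-device';   // must stay local
  if (c.requiresLatestInfo || c.complexReasoning) return 'cloud';    // escalate to the cloud model
  if (req.message.length <= 50) return 'on-device';                  // short prompts run locally
  return 'edge';                                                      // everything else
}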

The real question is who sets requiresLatestInfo or complexReasoning. In most apps a lightweight upstream classifier or some keyword check in the prompt layer sets them, loosely. Loosen that judgment by a hair and the share routed to the cloud quietly grows. Total traffic is unchanged, so none of the dashboards flag anything. The only place it shows up is the invoice.
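To make the "loosen by a hair" effect concrete, here is a minimal sketch. It assumes a hypothetical upstream classifier that emits a complexity score between 0 and 1 and sets complexReasoning once the score crosses a threshold. The scores, the thresholds, and the names flagComplex and flaggedSharePct are invented for illustration; the only point is that a small threshold change moves the cloud share while the request count stays the same.

```typescript
// Hypothetical: the prompt layer sets complexReasoning when score >= threshold.
function flagComplex(score: number, threshold: number): boolean {
  return score >= threshold;
}

// Share of requests (in %) that get flagged and therefore routed to cloud.
function flaggedSharePct(scores: number[], threshold: number): number {
  if (scores.length === 0) return 0;
  const flagged = scores.filter(s => flagComplex(s, threshold)).length;
  return (100 * flagged) / scores.length;
}

// Ten made-up scores standing in for one day's traffic.
const scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.8, 0.9];

console.log(flaggedSharePct(scores, 0.7)); // 20: two of ten requests flagged
console.log(flaggedSharePct(scores, 0.6)); // 40: four of ten, with identical traffic
```

Nudging the threshold from 0.7 to 0.6 doubles the flagged share in this toy data, and nothing about request volume or error rate changes.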

In my experience, silent cloud drift came down to one of three causes.

Cause	What happens	Why it hides
Over-eager flags	Upstream classifier sets `complexReasoning` too often, sending more to cloud	Every request still answers fine, so error rates stay clean
Implicit fallback	On-device model fails to init and silently routes to cloud	User experience holds, so no one complains
Growing history	Long conversation history exceeds the edge limit and escalates to cloud	Only happens late in a session, hard to reproduce

What the three share is that nothing looks broken from the user's side. That's exactly why they stay invisible without measurement.

Record the Decision First

Before fixing anything, make it visible. Widen the router's return value from "just a layer name" to a struct carrying the reason and estimated cost, and record it on every decision.

// src/ai/AIRouter.ts — the version that records its reasoning
export interface RoutingDecision {
  layer: AILayer;
  reason: string;       // why this layer was chosen — the axis we aggregate on later
  estCostUsd: number;   // estimated charge (zero on-device)
  estLatencyMs: number;
}
 
// Rough per-request estimates for each layer: cost in USD, latency in milliseconds.
const COST = { 'on-device': 0, edge: 0.0001, cloud: 0.002 } as const;
const LAT  = { 'on-device': 40, edge: 200, cloud: 1500 } as const;
 
export function decideLayer(req: AIRequest): RoutingDecision {
  const c = req.context ?? {};
  let layer: AILayer = 'edge';
  let reason = 'default-edge';
 
  if (c.offlineMode)            { layer = 'on-device'; reason = 'offline'; }
  else if (c.containsPersonalInfo) { layer = 'on-device'; reason = 'pii-local'; }
  else if (c.requiresLatestInfo)   { layer = 'cloud'; reason = 'needs-fresh'; }
  else if (c.complexReasoning)     { layer = 'cloud'; reason = 'complex'; }
  else if (req.message.length <= 50) { layer = 'on-device'; reason = 'short-msg'; }
 
  return { layer, reason, estCostUsd: COST[layer], estLatencyMs: LAT[layer] };
}

The key is keeping reason a short, fixed, human-readable string. Sum count and cost per reason later and it becomes obvious at a glance which decision is generating the cloud bill. Free text can't be aggregated, so the vocabulary has to be a small enum-like set.
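One way to keep the vocabulary fixed is to let the compiler enforce it. The sketch below lists the reason strings used in this article in a single array and derives a union type from it; the names ROUTING_REASONS, RoutingReason, and isRoutingReason are my own labels for illustration, not part of the router shown above.

```typescript
// Every reason string that appears in this article, in one place.
const ROUTING_REASONS = [
  'offline', 'pii-local', 'needs-fresh', 'complex',
  'short-msg', 'default-edge', 'fallback-ondevice-fail',
] as const;

// Union of the literal strings above; a typo in a reason fails to compile.
type RoutingReason = typeof ROUTING_REASONS[number];

// Runtime check for rows that arrive from older app versions or other clients.
function isRoutingReason(s: string): s is RoutingReason {
  return (ROUTING_REASONS as readonly string[]).includes(s);
}

console.log(isRoutingReason('complex'));           // true
console.log(isRoutingReason('felt-like-cloud'));   // false
```

Typing the reason field as RoutingReason instead of string would keep free text out at compile time, and the guard lets the server reject or bucket unknown values before aggregating.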

Next, push the decision and the measured values into a telemetry layer. estCostUsd is an up-front estimate; when you actually call the cloud, overwrite it with the cost computed from the real token count in the response. The gap between estimate and actual is itself an important signal.

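// src/ai/telemetry.ts: one row per request, aggregated locally
import type { AILayer } from './AIRouter';

export type Row = { reason: string; layer: AILayer; costUsd: number; latencyMs: number; fallback: boolean };
const buffer: Row[] = [];

// Assumption: the app registers its own transport via setSender.
// The default sender drops the batch, so nothing leaves the device until one is set.
let send: (rows: Row[]) => void = () => {};
export function setSender(fn: (rows: Row[]) => void) { send = fn; }

function flush() {
  send(buffer.splice(0, buffer.length));   // drain the buffer and hand off one batch
}

export function recordRouting(r: Row) {
  buffer.push(r);
  if (buffer.length >= 50) flush();   // batch every 50 rows
}

// Pure function: per-layer request share and total cost.
export function summarize(rows: Row[]) {
  const byLayer: Record<string, { n: number; cost: number }> = {};
  for (const x of rows) {
    const k = x.layer;
    byLayer[k] ??= { n: 0, cost: 0 };
    byLayer[k].n++; byLayer[k].cost += x.costUsd;
  }
  const total = rows.length || 1;   // avoid dividing by zero on an empty batch
  return Object.entries(byLayer).map(([layer, v]) => ({
    layer,
    share: +(100 * v.n / total).toFixed(1),   // per-layer share (%)
    costUsd: +v.cost.toFixed(4),
  }));
}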

Streaming an event per request from the device burns battery and network, so I batch 50 rows before sending. Keep summarize a pure function so the server can reuse it on the aggregated data — it makes verification much easier.
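On the estimate-versus-actual point, here is a minimal sketch of how the real cost could be derived before a row is recorded. It assumes the cloud response exposes input and output token counts; the field names inputTokens and outputTokens and the prices in PRICE_PER_MTOK are placeholders, not any provider's real API or pricing.

```typescript
// Assumed shape of the usage data on a cloud response; real SDKs name these differently.
interface TokenUsage { inputTokens: number; outputTokens: number }

// Hypothetical prices in USD per one million tokens. Not real pricing.
const PRICE_PER_MTOK = { input: 1.0, output: 4.0 };

// Cost computed from the tokens that were actually billed.
function actualCostUsd(u: TokenUsage): number {
  return (u.inputTokens * PRICE_PER_MTOK.input +
          u.outputTokens * PRICE_PER_MTOK.output) / 1e6;
}

// Ratio of actual to estimated cost. A value well above 1 suggests prompts or
// histories are growing even when the routing decision itself has not changed.
function costGapRatio(estCostUsd: number, u: TokenUsage): number {
  return estCostUsd > 0 ? actualCostUsd(u) / estCostUsd : 0;
}

const usage: TokenUsage = { inputTokens: 3000, outputTokens: 500 };
console.log(actualCostUsd(usage));         // 0.005
console.log(costGapRatio(0.002, usage));   // roughly 2.5
```

Storing the ratio alongside each cloud row would let the server flag a reason whose actual cost keeps outrunning its estimate.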

Read the Shares to Isolate the Cause

Run the instrumentation for a day and you get shares by reason and by layer. In my case the cloud layer — which was supposed to sit at 10–15% — held 38%. Split by reason, most of it came from complex, i.e. the complexReasoning flag. That narrows the cause fast.
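The per-reason breakdown can come from a second aggregation next to summarize. The following is a sketch under two assumptions: rows have the same Row shape as the telemetry module, and each reason maps to exactly one layer, as it does in decideLayer. The name summarizeByReason and the sample rows are illustrative.

```typescript
type AILayer = 'on-device' | 'edge' | 'cloud';
type Row = { reason: string; layer: AILayer; costUsd: number; latencyMs: number; fallback: boolean };

// Group rows by reason and sort by total cost, most expensive decision first.
function summarizeByReason(rows: Row[]) {
  const byReason: Record<string, { layer: AILayer; n: number; cost: number }> = {};
  for (const r of rows) {
    const e = byReason[r.reason] ?? { layer: r.layer, n: 0, cost: 0 };
    e.n++;
    e.cost += r.costUsd;
    byReason[r.reason] = e;
  }
  const total = rows.length || 1;   // avoid dividing by zero on an empty batch
  return Object.entries(byReason)
    .map(([reason, v]) => ({
      reason,
      layer: v.layer,
      share: +(100 * v.n / total).toFixed(1),   // share of all requests (%)
      costUsd: +v.cost.toFixed(4),
    }))
    .sort((a, b) => b.costUsd - a.costUsd);
}

// Synthetic rows: 3 complex (cloud), 2 default-edge, 5 short-msg (on-device).
const row = (reason: string, layer: AILayer, costUsd: number): Row =>
  ({ reason, layer, costUsd, latencyMs: 0, fallback: false });
const rows: Row[] = [
  row('complex', 'cloud', 0.002), row('complex', 'cloud', 0.002), row('complex', 'cloud', 0.002),
  row('default-edge', 'edge', 0.0001), row('default-edge', 'edge', 0.0001),
  ...Array.from({ length: 5 }, () => row('short-msg', 'on-device', 0)),
];
console.log(summarizeByReason(rows)[0]);   // the 'complex' entry: 30% of requests, 0.006 USD
```

Sorting by cost rather than count puts the reason that drives the bill at the top, even when a cheaper layer handles more requests.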

The isolation procedure I settled on:

1. Look at per-layer share. If cloud is higher than expected, drift is happening. If share moves while traffic is flat, it's routing, not pricing.
2. Break cloud down by reason. Whether it's needs-fresh or complex points to different causes. A lot of complex means you should suspect the upstream flag's threshold.
3. Count fallback: true rows. If that's high, the culprit isn't your branch logic; it's the implicit fallback when on-device init fails. It never reaches the error log, so without this column you'd never see it.
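The three steps can be sketched as a single pass over a day's rows. This is an illustration, not the exact tooling I ran: the function name diagnose is made up, EXPECTED_CLOUD_PCT is taken from the upper end of the 10–15% design target mentioned above, and the sample rows are synthetic.

```typescript
type AILayer = 'on-device' | 'edge' | 'cloud';
type Row = { reason: string; layer: AILayer; costUsd: number; latencyMs: number; fallback: boolean };

// Upper end of the expected cloud share; adjust to your own design target.
const EXPECTED_CLOUD_PCT = 15;

function diagnose(rows: Row[]) {
  const total = rows.length || 1;
  const cloudRows = rows.filter(r => r.layer === 'cloud');

  // Step 1: per-layer share. Above the expected ceiling means drift.
  const cloudPct = (100 * cloudRows.length) / total;

  // Step 2: break cloud down by reason and pick the largest contributor.
  const counts: Record<string, number> = {};
  for (const r of cloudRows) counts[r.reason] = (counts[r.reason] ?? 0) + 1;
  const ranked = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  const topCloudReason = ranked.length > 0 ? ranked[0][0] : null;

  // Step 3: fallback rate across all rows.
  const fallbackPct = (100 * rows.filter(r => r.fallback).length) / total;

  return { drifting: cloudPct > EXPECTED_CLOUD_PCT, cloudPct, topCloudReason, fallbackPct };
}

// Synthetic rows for illustration only.
function make(n: number, reason: string, layer: AILayer, fallback = false): Row[] {
  return Array.from({ length: n }, () => ({ reason, layer, costUsd: 0, latencyMs: 0, fallback }));
}
const sample: Row[] = [
  ...make(6, 'complex', 'cloud'),
  ...make(1, 'needs-fresh', 'cloud'),
  ...make(1, 'fallback-ondevice-fail', 'cloud', true),
  ...make(8, 'default-edge', 'edge'),
  ...make(4, 'short-msg', 'on-device'),
];
console.log(diagnose(sample));
// { drifting: true, cloudPct: 40, topCloudReason: 'complex', fallbackPct: 5 }
```

Reading the result top to bottom mirrors the procedure: drifting says whether to look at all, topCloudReason says which flag to suspect, and fallbackPct says whether the branch logic is even the culprit.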

The third one paid off. On some devices the on-device model didn't load in time, and the catch block silently routed to the cloud. User experience held, so there were no complaints — only a rising cloud bill. A fallback should be allowed to happen, but it must always be measured.

// a fallback that records the on-device failure instead of hiding it
async function runOnDevice(req: AIRequest): Promise<Result> {
  try {
    return await onDevice.infer(req);
  } catch (e) {
    recordRouting({ reason: 'fallback-ondevice-fail', layer: 'cloud',
                    costUsd: COST.cloud, latencyMs: LAT.cloud, fallback: true });
    return await cloud.infer(req);   // keep the experience, but make it visible
  }
}

Stop the Recurrence with a Budget Guard

Fixing the cause doesn't stop the threshold from loosening again. Every time you swap the upstream classifier or edit a prompt, routing shifts quietly. So I added something that surfaces it before the invoice does.

It compares the daily summarize output against thresholds and alerts when the cloud share crosses the line. The trick was to watch share, not dollars. Dollars rise with usage too, so they're hard to call an anomaly; share is normalized against traffic volume, so it isolates routing anomalies cleanly.

// daily budget guard (server side)
const BUDGET = { cloudSharePct: 20, fallbackPct: 2 };
 
export function checkBudget(rows: Row[]) {
  const s = summarize(rows);
  const cloud = s.find(x => x.layer === 'cloud')?.share ?? 0;
  const fbRate = 100 * rows.filter(r => r.fallback).length / (rows.length || 1);
  const alerts: string[] = [];
  if (cloud > BUDGET.cloudSharePct) alerts.push(`cloud share ${cloud}% > ${BUDGET.cloudSharePct}%`);
  if (fbRate > BUDGET.fallbackPct)  alerts.push(`fallback ${fbRate.toFixed(1)}% > ${BUDGET.fallbackPct}%`);
  return { ok: alerts.length === 0, alerts };
}

I set the thresholds at 20% cloud share and 2% fallback rate — cross either and routing has started to skew. Later, when a prompt update pushed cloud share to 26%, the guard caught it three weeks earlier than the invoice would have.
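The share-versus-dollars argument is easy to check with arithmetic. The sketch below reuses the per-request cost estimates from the router and compares three made-up days: a baseline, a day with doubled traffic and the same mix, and a day with flat traffic and a skewed mix. The request counts and the names dailyCostUsd and cloudShareOf are invented for illustration.

```typescript
type Counts = { onDevice: number; edge: number; cloud: number };

// Per-request cost estimates in USD, matching the router's COST table.
const COST_PER_REQ = { onDevice: 0, edge: 0.0001, cloud: 0.002 };

function dailyCostUsd(c: Counts): number {
  return c.onDevice * COST_PER_REQ.onDevice + c.edge * COST_PER_REQ.edge + c.cloud * COST_PER_REQ.cloud;
}

function cloudShareOf(c: Counts): number {
  const total = c.onDevice + c.edge + c.cloud;
  return total === 0 ? 0 : (100 * c.cloud) / total;
}

const baseline: Counts = { onDevice: 500, edge: 400, cloud: 100 };    // 1,000 requests
const growth: Counts   = { onDevice: 1000, edge: 800, cloud: 200 };   // 2,000 requests, same mix
const drift: Counts    = { onDevice: 450, edge: 350, cloud: 200 };    // 1,000 requests, skewed mix

for (const [name, c] of Object.entries({ baseline, growth, drift })) {
  console.log(name, dailyCostUsd(c).toFixed(3), `${cloudShareOf(c)}%`);
}
```

In this toy data the growth day costs about $0.48 and the drift day about $0.435, against a baseline of about $0.24. A dollar threshold would flag healthy growth ahead of the drift, while the cloud share stays at 10% for growth and jumps to 20% only on the drift day.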

What This Design Taught Me

The quality of a hybrid AI stack is decided less by how clever the routing logic is and more by whether the routing is visible. Writing a smart router isn't the hard part. The hard part is continuously catching the slow drift of those decisions in production.

If I had to rank the work, I'd put "record the decision, even for one layer" ahead of "build all three layers perfectly." Keep the decision in a small fixed vocabulary called reason, and watch three things: per-layer share, real token cost, and fallback rate. With just that minimum in place, you can describe your own app's behavior without waiting for the invoice.

Which of on-device, edge, or cloud is right keeps changing per user and per session. There's no fixed answer. That's exactly why making your own ongoing decisions measurable is, in my experience, the single most effective bit of preparation for running a hybrid AI system over the long haul.

Thanks for reading.
