When Your Finance App's AI Keeps Dumping Everything Into 'Other' — Field Notes on Catching Silent Classification Drift
An AI expense classifier in a Rork finance app can be accurate at launch and quietly decay over months until the monthly advice goes wrong. Here is how I instrumented confidence and category distribution to get ahead of the drift, with working code.
The Classification Used to Be Right, Then the Advice Started to Slip
I was running an AI finance app I had built on Rork as an indie developer. Snap a receipt, and Gemini pulls the amount, the store name, and a category; at month end it produces advice from your spending pattern. The first weeks felt great, and I used it every day myself.
The trouble came months later, creeping in. One month the advice said "your food spending is steady," but my gut said the opposite — my wallet felt lighter than usual. I opened the data and found grocery runs scattered into "Other." The classifier was breaking.
The way it broke was the nasty part. The app never crashed, never threw an error. It just slowly fattened "Other" while every real category thinned out — and the month-end advice, built on top of that classification, quietly went off target. The numbers came out cleanly every month; I simply could no longer trust them.
I run several apps myself as an indie developer, and these are my notes on how I traced that silent decay and what instrumentation I added to recover, with the code I actually run. I am writing this for anyone building a Rork app that lets AI sort expenses or receipts, so you do not end up a step behind the way I did.
"Other" Fattens Up in Silence
Before the cause, it is worth spelling out why I failed to notice. The classifier's failure was hard to see in two distinct ways.
First, seen one record at a time, "Other" always looks like a defensible choice. Putting a genuinely ambiguous expense into "Other" is not wrong. The problem was that its frequency climbed month over month. A judgment that looks fine in isolation traced an abnormal trend in aggregate.
Second, the classifier itself stayed quiet. The JSON Gemini returned carried a confidence value, but I only used it for the branch (below 0.5, ask the user) and never looked at it over time. Confidence was thrown away every single time, leaving no way to look back.
| Symptom | Seen one at a time | Seen in aggregate |
| --- | --- | --- |
| Assignment to "Other" | Looks reasonable | Ratio rises each month |
| Falling confidence | Waved through above the threshold | Average quietly trends down |
| Off-target advice | One line feels barely off | Systematically wrong vs. reality |
Looking back, several triggers had stacked up. More users brought more unfamiliar store names; a model update shifted behavior slightly; seasonal items changed the vocabulary on receipts. None was a single blow. That is exactly why I needed something that measures the trend, not each record.
Log Classification Confidence, One Record at a Time
The first move was to keep the confidence I had been throwing away. On every classification, I write three things into a dedicated table: the confidence, the category the AI chose, and (later, if it happens) the category the user corrected it to. The idea is to accumulate a history of the classification's quality, not of the classification result itself.
```typescript
// lib/classificationLog.ts: keep a time series of classification quality
import * as SQLite from 'expo-sqlite';

const db = SQLite.openDatabaseSync('finance.db');

export async function initClassificationLog(): Promise<void> {
  await db.execAsync(`
    CREATE TABLE IF NOT EXISTS classification_log (
      id TEXT PRIMARY KEY,
      expense_id TEXT NOT NULL,
      ai_category TEXT NOT NULL,
      confidence REAL NOT NULL,
      corrected_category TEXT, -- where the user moved it (NULL if untouched)
      created_at TEXT NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_cl_created ON classification_log(created_at);
  `);
}

export async function logClassification(entry: {
  id: string;
  expenseId: string;
  aiCategory: string;
  confidence: number;
}): Promise<void> {
  await db.runAsync(
    `INSERT INTO classification_log
       (id, expense_id, ai_category, confidence, corrected_category, created_at)
     VALUES (?, ?, ?, ?, NULL, ?)`,
    [entry.id, entry.expenseId, entry.aiCategory, entry.confidence, new Date().toISOString()]
  );
}

// when the user edits a category, record that fact after the fact
export async function recordCorrection(expenseId: string, correctedTo: string): Promise<void> {
  await db.runAsync(
    `UPDATE classification_log SET corrected_category = ? WHERE expense_id = ?`,
    [correctedTo, expenseId]
  );
}
```
The point is to keep confidence not only as branch fuel but as a record you can aggregate later. Each classification may look right in the moment, but if the average confidence slides from 0.82 to 0.63 over three months, that is a clear sign the classifier has started to hesitate. One log entry tells you nothing; bundled together, they show the classifier's temperature.
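To make the "bundled together" step concrete, here is a minimal sketch of the monthly roll-up as a pure function over rows already read from classification_log. The names `LogRow`, `monthlyAvgConfidence`, and `confidenceIsSliding` are hypothetical, and the `minDrop` default of 0.1 is a placeholder rather than a tuned value. It assumes `created_at` is the ISO string written by `logClassification`, so months are bucketed in UTC.

```typescript
// Hypothetical row shape: two columns of classification_log.
type LogRow = { confidence: number; created_at: string }; // created_at: ISO 8601, UTC

type MonthlyConfidence = { month: string; n: number; avg: number };

// Group rows by "YYYY-MM" and average the confidence per month.
function monthlyAvgConfidence(rows: LogRow[]): MonthlyConfidence[] {
  const buckets = new Map<string, { sum: number; n: number }>();
  for (const r of rows) {
    const month = r.created_at.slice(0, 7); // "2025-03-14T09:00:00.000Z" -> "2025-03"
    const b = buckets.get(month) ?? { sum: 0, n: 0 };
    b.sum += r.confidence;
    b.n += 1;
    buckets.set(month, b);
  }
  return Array.from(buckets.entries())
    .sort(([a], [b]) => a.localeCompare(b)) // "YYYY-MM" strings sort chronologically
    .map(([month, { sum, n }]) => ({ month, n, avg: Number((sum / n).toFixed(3)) }));
}

// True when the newest month's average sits more than `minDrop` below the oldest.
// minDrop = 0.1 is an assumed placeholder; tune it against your own data.
function confidenceIsSliding(series: MonthlyConfidence[], minDrop = 0.1): boolean {
  const first = series[0];
  const last = series[series.length - 1];
  if (series.length < 2 || !first || !last) return false;
  return first.avg - last.avg > minDrop;
}
```

Comparing only the first and last month is deliberately crude. With more history, a moving average or a simple slope would be less sensitive to one odd month.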
Measure the Drift in Category Distribution
After confidence, I measured the change in the category distribution itself: what share of expenses landed in each category last month versus this month. When that shape shifts sharply, I treat it as a cue to ask whether my behavior actually changed or the classification slipped.
A distribution difference can be quantified as the distance between two ratio vectors. I used the total variation distance (the sum of absolute ratio differences per category, divided by two) because it is light to compute and intuitive. Zero means unchanged, one means a full swap.
```typescript
// lib/distributionDrift.ts: measure category distribution drift
export type CategoryCounts = Record<string, number>;

function toRatio(counts: CategoryCounts): Record<string, number> {
  const total = Object.values(counts).reduce((a, b) => a + b, 0) || 1;
  const ratio: Record<string, number> = {};
  for (const [k, v] of Object.entries(counts)) ratio[k] = v / total;
  return ratio;
}

export function distributionDrift(prev: CategoryCounts, curr: CategoryCounts) {
  const pr = toRatio(prev), cr = toRatio(curr);
  const keys = new Set([...Object.keys(pr), ...Object.keys(cr)]);

  let tvd = 0; // total variation distance
  for (const k of keys) tvd += Math.abs((pr[k] ?? 0) - (cr[k] ?? 0));
  tvd /= 2;

  // watch "other" on its own too: broken classification pools here first
  const otherJump = (cr['other'] ?? 0) - (pr['other'] ?? 0);

  return {
    drift: Number(tvd.toFixed(3)),
    otherRatio: Number((cr['other'] ?? 0).toFixed(3)),
    otherJump: Number(otherJump.toFixed(3)),
    suspicious: tvd > 0.15 || (cr['other'] ?? 0) > 0.25 || otherJump > 0.08,
  };
}
```
Watching the "Other" ratio and its jump on their own, on top of the overall drift, comes from experience. When a classifier starts to hesitate, the slack pools into "Other" first. Even when the overall distribution has barely moved, an "Other" bucket that swells month over month was a cue to suspect classification decay rather than a change in spending. Keeping the value change (dining really did rise) and the sorting change (the classifier got lazy) as separate signals means I never have to wonder at month end whether it is a household story or an app story.
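A small worked example shows why the separate "Other" check earns its place. The counts below are invented for illustration, and the helpers `toShares` and `totalVariation` are hypothetical stand-ins that repeat the same arithmetic as the drift function so the snippet runs on its own. The thresholds quoted in the comments (0.15 and 0.08) are the ones from the code above.

```typescript
type Counts = Record<string, number>;

// Convert raw counts into shares that sum to 1.
function toShares(counts: Counts): Record<string, number> {
  const total = Object.values(counts).reduce((a, b) => a + b, 0) || 1;
  const shares: Record<string, number> = {};
  for (const [k, v] of Object.entries(counts)) shares[k] = v / total;
  return shares;
}

// Total variation distance: half the sum of absolute share differences.
function totalVariation(prev: Counts, curr: Counts): number {
  const p = toShares(prev);
  const c = toShares(curr);
  const keys = Array.from(new Set(Object.keys(p).concat(Object.keys(c))));
  let sum = 0;
  for (const k of keys) sum += Math.abs((p[k] ?? 0) - (c[k] ?? 0));
  return sum / 2;
}

// Invented data: 100 expenses each month, a few drained from every real category.
const lastMonth: Counts = { food: 40, transport: 20, shopping: 30, other: 10 };
const thisMonth: Counts = { food: 35, transport: 19, shopping: 27, other: 19 };

// (0.05 + 0.01 + 0.03 + 0.09) / 2 is about 0.09, under the 0.15 drift limit
const tvd = totalVariation(lastMonth, thisMonth);

// 0.19 - 0.10 is about 0.09, over the 0.08 jump limit
const otherJump = (toShares(thisMonth)['other'] ?? 0) - (toShares(lastMonth)['other'] ?? 0);

console.log({ tvd: tvd.toFixed(3), otherJump: otherJump.toFixed(3) });
```

No single category moved much here, so the overall distance stays quiet. Only the jump in "Other" crosses its limit, which is the pattern the dedicated check is meant to catch.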
The Correction Rate Is the Hidden Truth
Confidence and distribution are both views from inside the classifier. I wanted one more — an answer key from outside. That is the rate at which users correct a category: the correction rate.
When a user moves "Other" to "Food," or relabels "Entertainment" as "Education," that action is the most trustworthy ground truth the classifier has. Track how often people correct it, with a denominator, and you can measure the AI's hits and misses from the outside, without leaning on its internal confidence.
```typescript
// lib/correctionRate.ts: measure accuracy from the outside via corrections
import * as SQLite from 'expo-sqlite';

const db = SQLite.openDatabaseSync('finance.db');

type LogRow = { ai_category: string; corrected_category: string | null };

export async function monthlyCorrectionRate(month: string): Promise<{
  total: number;
  corrected: number;
  rate: number;
  byCategory: Record<string, number>;
}> {
  // month is a 'YYYY-MM' prefix of the ISO created_at string
  const rows = await db.getAllAsync<LogRow>(
    `SELECT ai_category, corrected_category FROM classification_log WHERE created_at LIKE ?`,
    [`${month}%`]
  );
  const total = rows.length || 1; // avoid dividing by zero in an empty month
  const corrected = rows.filter(
    r => r.corrected_category && r.corrected_category !== r.ai_category
  ).length;

  // which category gets corrected most = where the classifier slips
  const miss: Record<string, { n: number; wrong: number }> = {};
  for (const r of rows) {
    const c = r.ai_category;
    miss[c] ??= { n: 0, wrong: 0 };
    miss[c].n++;
    if (r.corrected_category && r.corrected_category !== c) miss[c].wrong++;
  }

  const byCategory: Record<string, number> = {};
  for (const [c, v] of Object.entries(miss)) {
    byCategory[c] = Number((v.wrong / v.n).toFixed(2));
  }

  return {
    total: rows.length,
    corrected,
    rate: Number((corrected / total).toFixed(3)),
    byCategory,
  };
}
```
Split the correction rate by category and it names exactly where the classifier slips. In my case, "Other" and "shopping" stood out: the former was the dumping ground for hesitation, the latter wobbled at the border between food and non-food. I added judgment criteria to the prompt for just those two and made the confirmation UI's default suggestion smarter, and the next month's correction rate dropped clearly. What worked was not fixing every category uniformly, but naming the slipping spots with numbers.
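One way to "name the slipping spots with numbers" is to rank categories by correction rate while ignoring categories with too few records to judge. The sketch below is an assumption about how that could look, not the code from the app: `slippingCategories`, `minSamples`, and `minRate` are hypothetical names, and the defaults of 20 and 0.2 are placeholders.

```typescript
// Per-category tallies: n = records the AI put here, wrong = records the user moved out.
type CategoryStat = { n: number; wrong: number };

type Slipping = { category: string; n: number; rate: number };

// Return categories whose correction rate is at least `minRate`, worst first.
// Categories with fewer than `minSamples` records are skipped, since a rate
// computed from a handful of records is mostly noise.
function slippingCategories(
  stats: Record<string, CategoryStat>,
  minSamples = 20, // assumed placeholder
  minRate = 0.2 // assumed placeholder
): Slipping[] {
  return Object.entries(stats)
    .filter(([, s]) => s.n >= minSamples)
    .map(([category, s]) => ({
      category,
      n: s.n,
      rate: Number((s.wrong / s.n).toFixed(2)),
    }))
    .filter((c) => c.rate >= minRate)
    .sort((a, b) => b.rate - a.rate);
}

// Invented tallies for illustration only.
const example = slippingCategories({
  other: { n: 50, wrong: 20 },
  shopping: { n: 40, wrong: 10 },
  food: { n: 100, wrong: 5 },
  pets: { n: 3, wrong: 2 }, // high rate, but too few records to trust
});
console.log(example.map((c) => `${c.category}: ${c.rate}`));
```

The sample-size floor matters more than the exact rate cutoff. Without it, a category with three records and two corrections would top the list and pull prompt work away from the categories that carry real volume.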
Doubt Yourself Before Giving Advice
With the measurements in place, the last thing I touched was the entrance to advice generation. I used to hand the aggregates to Gemini unconditionally once they were ready. Even with the classification broken, plausible advice would come out, grounded in broken numbers. That was the scariest failure of all.
So I inserted a gate that judges classifier health before building advice. If confidence, drift, or correction rate is in the danger zone, the advice is replaced with a maintenance message — "the classification needs review this month" — and no ordinary budget advice is produced.
```typescript
// lib/adviceGate.ts: stop advice in a month the classification can't be trusted
interface HealthInputs {
  avgConfidence: number;  // average confidence that month
  drift: number;          // distribution drift (total variation distance)
  otherRatio: number;     // "Other" ratio
  correctionRate: number; // user correction rate
}

export function adviceGate(h: HealthInputs): { ok: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (h.avgConfidence < 0.65) reasons.push(`low average confidence (${h.avgConfidence})`);
  if (h.drift > 0.15) reasons.push(`large distribution drift (${h.drift})`);
  if (h.otherRatio > 0.25) reasons.push(`high "Other" ratio (${h.otherRatio})`);
  if (h.correctionRate > 0.2) reasons.push(`high correction rate (${h.correctionRate})`);
  // if any one is in the danger zone, do not build advice on broken numbers
  return { ok: reasons.length === 0, reasons };
}

// usage: assemble the advice prompt only when the gate passes
export function buildMonthlyAdvice(gate: ReturnType<typeof adviceGate>, summary: string): string {
  if (!gate.ok) {
    return `This month, expense classification accuracy dropped, so budget advice was held back. ` +
      `We recommend reviewing classification in settings (reasons: ${gate.reasons.join(' / ')}).`;
  }
  return summary; // in practice, pass summary to Gemini to generate advice
}
```
Holding advice back gave me pause at first — it felt like withholding a feature. But advice built from wrong numbers does more harm than the feature does good. Rather than handing over the false comfort of "your food spending is steady," honestly saying "classification looks shaky this month, so I'll hold the advice" actually grew trust in the app. Honesty outlasts a feature count.
Fold It into Operations
Here is how I place all of this across the app and the monthly batch, in order. The key is to measure health before aggregating, and to accumulate records before measuring health.
| When | What runs | On danger |
| --- | --- | --- |
| Every classification | Log confidence + AI category to classification_log | Below 0.5 goes to the confirm UI |
| On user correction | Append corrected_category | Feed often-corrected stores into suggestion learning |
| Month-end batch | Aggregate avg confidence, drift, correction rate | Store as health |
| Before advice | Judge health via adviceGate | Swap in the maintenance message if in danger |
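The month-end batch step could be sketched as one pure function that turns two months of log rows into the four numbers the gate reads. This is an assumption about the shape of that batch, not the app's actual code: `Row`, `Health`, and `monthEndHealth` are hypothetical names, the rows are assumed to be already loaded from classification_log, and the category key for "Other" is assumed to be the lowercase string 'other'.

```typescript
// Hypothetical row shape: three columns of classification_log.
type Row = { ai_category: string; confidence: number; corrected_category: string | null };

// Same four fields the advice gate takes as input.
type Health = { avgConfidence: number; drift: number; otherRatio: number; correctionRate: number };

function countByCategory(rows: Row[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of rows) counts[r.ai_category] = (counts[r.ai_category] ?? 0) + 1;
  return counts;
}

// Returns null for a month with no classifications: there is nothing to judge.
function monthEndHealth(prevRows: Row[], currRows: Row[]): Health | null {
  const n = currRows.length;
  if (n === 0) return null;

  const avgConfidence = currRows.reduce((s, r) => s + r.confidence, 0) / n;
  const corrected = currRows.filter(
    (r) => r.corrected_category !== null && r.corrected_category !== r.ai_category
  ).length;

  const prev = countByCategory(prevRows);
  const curr = countByCategory(currRows);

  // With no previous month there is no baseline, so drift is reported as 0
  // instead of comparing against an empty distribution.
  let drift = 0;
  if (prevRows.length > 0) {
    const keys = Array.from(new Set(Object.keys(prev).concat(Object.keys(curr))));
    for (const k of keys) {
      drift += Math.abs((prev[k] ?? 0) / prevRows.length - (curr[k] ?? 0) / n);
    }
    drift /= 2; // total variation distance
  }

  const round = (x: number) => Number(x.toFixed(3));
  return {
    avgConfidence: round(avgConfidence),
    drift: round(drift),
    otherRatio: round((curr['other'] ?? 0) / n),
    correctionRate: round(corrected / n),
  };
}
```

Keeping this step free of database calls makes it easy to replay against old months when a threshold changes. The stored health record then only needs the month key alongside these four numbers.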
One implementation note: the classification log and the finance data live in the same SQLite database, but always in separate tables. The record of the spending itself and the record of how trustworthy the classification is serve entirely different purposes. The former is the user's money history; the latter is your AI's vital signs. Mix them, and finance-screen queries get heavy with classification metadata, and classification decay hides inside the household numbers. Split the outputs, and both stay light.
One more: do not try to nail the confidence thresholds or drift limits perfectly up front. I placed provisional numbers and tuned them against the actual correction rate over two or three months. Thresholds set against what your users actually corrected fit reality far better than ones set by theory.
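Tuning against the correction rate can be done with a small table of trade-offs per candidate threshold. The sketch below is one possible way to do it under stated assumptions: `Labeled`, `ThresholdReport`, and `evaluateThresholds` are hypothetical names, and a record counts as wrong only if the user corrected it, which undercounts errors nobody noticed.

```typescript
// One logged classification, reduced to what the tuning needs.
type Labeled = { confidence: number; corrected: boolean };

type ThresholdReport = {
  threshold: number;
  askedShare: number; // share of all records that would be sent to the confirm UI
  missedShare: number; // share of corrected records that would have skipped the confirm UI
};

// For each candidate threshold, estimate the user burden (askedShare) against
// the errors that would slip through unasked (missedShare).
function evaluateThresholds(records: Labeled[], candidates: number[]): ThresholdReport[] {
  const total = records.length || 1;
  const correctedTotal = records.filter((r) => r.corrected).length || 1;
  const round2 = (x: number) => Number(x.toFixed(2));
  return candidates.map((threshold) => {
    const asked = records.filter((r) => r.confidence < threshold).length;
    const missed = records.filter((r) => r.corrected && r.confidence >= threshold).length;
    return {
      threshold,
      askedShare: round2(asked / total),
      missedShare: round2(missed / correctedTotal),
    };
  });
}

// Invented history for illustration only.
const history: Labeled[] = [
  { confidence: 0.3, corrected: true },
  { confidence: 0.45, corrected: true },
  { confidence: 0.55, corrected: true },
  { confidence: 0.6, corrected: false },
  { confidence: 0.7, corrected: true },
  { confidence: 0.8, corrected: false },
  { confidence: 0.9, corrected: false },
  { confidence: 0.95, corrected: false },
];
console.table(evaluateThresholds(history, [0.5, 0.65]));
```

In this invented data, raising the threshold from 0.5 to 0.65 doubles how often the user is asked and halves the corrections that slip past. Which side of that trade is worth more depends on how tolerant your users are of confirmation prompts.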
Wrapping Up
What is truly frightening in an app that lets AI sort expenses is not the classification breaking loudly, but decaying quietly while keeping a correct face. A crash you notice. But the gentle rot of "Other" growing one record at a time only surfaces when the month-end advice goes off target — unless you have eyes that measure the trend.
First, keep the confidence you were throwing away. Then measure category distribution drift and the bloat of "Other," and check from the outside with the user correction rate. Finally, refuse to build advice on broken numbers. It is not a flashy feature, but this measurement layer is the foundation for whether you can trust the numbers in an AI finance app at all.
If you have a classifier running quietly right now, just start logging confidence in your next release. Three months later, a glance at the average will turn decay you could not see into numbers. I hope this helps your own build, and thank you for reading.