How I Cut My Rork App's AI Costs from $350 to $35/Month with Cloudflare AI Gateway

Three months after releasing my third AI-powered app on the App Store, I opened my OpenAI billing dashboard and saw a charge that made me pause: $352 for the month.

My app had around 800 active users. That works out to roughly $0.44 per user per month in AI costs — before accounting for hosting, Apple's 30% cut, and my own time. The subscription I was charging, $4.99/month, barely covered the AI spend for a single heavy user.

The root cause, once I actually dug into it, was embarrassingly simple: the same prompts were hitting the API hundreds of times per day. Category classification requests, text summarization, content tagging — computationally identical across different users, billed separately every single time. When user A asks to classify "productivity apps for remote workers" and user B asks the exact same thing two minutes later, both requests go to OpenAI, both get billed, both return the same answer.

The fix I eventually landed on was Cloudflare AI Gateway. Three weeks after deployment, the same traffic volume cost $33.

The Economics Problem with AI Features in Indie Apps

Before getting into implementation, it's worth being explicit about why this matters structurally.

When you add AI features to a Rork app, you're essentially making your cost structure variable in a way that traditional features aren't. A push notification infrastructure costs roughly the same whether you have 100 or 10,000 users. AI API costs scale linearly with usage — and in some architectures, superlinearly if you're not careful.

The table below shows how this plays out at different user counts, assuming $0.44/user/month in unoptimized AI costs and a $4.99/month subscription:

100 users: AI costs $44, revenue $499, margin $455
1,000 users: AI costs $440, revenue $4,990, margin $4,550
10,000 users: AI costs $4,400, revenue $49,900, margin $45,500
50,000 users: AI costs $22,000, revenue $249,500, margin $227,500

At small scale it looks fine. But the margin percentage never actually improves — AI cost stays at roughly 9% of revenue no matter how big you grow. And that's assuming perfect subscription conversion. With a realistic 5-8% conversion rate from free to paid, the math inverts quickly.

Caching changes this fundamentally. Once you have enough users that popular prompts repeat frequently, the marginal AI cost of new users starts declining. With a 70% cache hit rate, your AI cost at 50,000 users isn't $22,000 — it's closer to $6,600. That's a business model that actually improves with scale.

What Cloudflare AI Gateway Actually Is

Cloudflare AI Gateway is an intelligent proxy layer that sits between your application and AI providers like OpenAI, Anthropic, and Google Gemini. You route requests through it rather than hitting providers directly, and it gives you several capabilities that are otherwise expensive or complex to build yourself:

Response caching: When two requests contain identical prompts (after normalization), the second request returns the cached result — no API call, no token cost, sub-100ms response time.

Multi-provider support: A single unified interface routes to OpenAI, Anthropic, Gemini, Groq, Mistral, Cohere, and others. Switching providers or adding fallbacks becomes a configuration change rather than a code change.

Automatic failover: You can configure ordered provider lists so that if OpenAI returns a 5xx or rate limit error, the request automatically retries with Gemini or another fallback.

Rate limiting: Set caps on requests per minute, per user, or globally. Essential for preventing a runaway script or a single heavy user from spiking your bill.

Detailed analytics: The dashboard shows request volume, cache hit rates, error rates by provider, latency percentiles, and token usage — all in real time.

The pricing is what makes this genuinely useful for solo developers: the free tier supports 100,000 requests per day. For most indie apps, you'll never pay for the gateway itself. The savings come entirely from reduced API provider costs.

Architecture Overview

The deployment pattern adds a Cloudflare Workers proxy between your Rork app and the AI providers:

[Rork App (React Native)]
        ↓ HTTPS
[Cloudflare Workers — your proxy]
        ↓ AI Gateway URL
[Cloudflare AI Gateway — caching, routing, rate limiting]
        ↓
[OpenAI / Gemini / Anthropic]

Why not call AI Gateway directly from the app?

Two reasons: security and flexibility. Embedding API keys in a mobile binary is a security risk — they can be extracted from the compiled app and used against your account. And you need server-side logic anyway: cache key normalization, user authentication, rate limit state management, and the provider fallback logic all need to run in a trusted environment.

Cloudflare Workers is the ideal host for this proxy layer. The free tier covers 100,000 daily requests, Workers instances cold-start in under 5ms globally, and you can use Workers KV for rate limit counters without any additional infrastructure.

Step 1: Gateway Setup in Cloudflare Dashboard

In your Cloudflare dashboard, go to AI → AI Gateway and click "Create Gateway." Name it something like rork-app-production and note the Account ID and Gateway ID from the detail page.

You'll get provider-specific endpoint URLs:

# OpenAI
https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai

# Google Gemini
https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/google-ai-studio

# Anthropic
https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/anthropic

# Groq (fast inference, useful for latency-sensitive tasks)
https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/groq

In the gateway settings, enable "Cache Responses" and set a default TTL. I use 24 hours for classification tasks. You can override this per-request with headers, which we'll do in the Workers code.

Step 2: The Workers Proxy Implementation

This is the complete Workers code. It's longer than a minimal example because it handles everything you'll actually need in production: authentication, per-user rate limiting, prompt normalization, cache header injection, and provider failover.

// workers/ai-proxy/src/index.ts
import { Hono } from 'hono'
import { cors } from 'hono/cors'
 
type Bindings = {
  OPENAI_API_KEY: string
  GEMINI_API_KEY: string
  AI_GATEWAY_ACCOUNT_ID: string
  AI_GATEWAY_ID: string
  RATE_LIMIT: KVNamespace
}
 
type Variables = {
  userId: string
}
 
const app = new Hono<{ Bindings: Bindings; Variables: Variables }>()
 
// CORS — lock this down to your app's actual domain in production
app.use('*', cors({
  origin: (origin) => {
    const allowed = ['https://your-app-domain.com', 'exp://localhost']
    return allowed.includes(origin) ? origin : null
  },
  allowHeaders: ['Content-Type', 'Authorization', 'X-User-ID'],
}))
 
// Auth middleware
app.use('/ai/*', async (c, next) => {
  const userId = c.req.header('X-User-ID')
  if (!userId) {
    return c.json({ error: 'Missing X-User-ID header' }, 401)
  }
  c.set('userId', userId)
  await next()
})
 
// Main text generation endpoint
app.post('/ai/generate', async (c) => {
  const userId = c.get('userId')
  const body = await c.req.json<{
    prompt: string
    model?: string
    cache?: boolean
    maxTokens?: number
  }>()
 
  const {
    prompt,
    model = 'gpt-4o-mini',
    cache = true,
    maxTokens = 1000,
  } = body
 
  // Rate limiting: 50 requests per user per hour
  const hourSlot = Math.floor(Date.now() / 3600000)
  const rateLimitKey = `rl:${userId}:${hourSlot}`
  const currentCount = parseInt(await c.env.RATE_LIMIT.get(rateLimitKey) || '0')
 
  if (currentCount >= 50) {
    const msUntilReset = 3600000 - (Date.now() % 3600000)
    return c.json({
      error: 'Rate limit exceeded',
      retryAfterSeconds: Math.ceil(msUntilReset / 1000),
    }, 429)
  }
 
  // Increment with TTL slightly longer than the hour slot to handle edge cases
  await c.env.RATE_LIMIT.put(rateLimitKey, String(currentCount + 1), {
    expirationTtl: 7200, // 2 hours
  })
 
  const normalizedPrompt = normalizePrompt(prompt)
  const cacheEnabled = cache && isCacheable(normalizedPrompt)
  
  const gatewayBase = `https://gateway.ai.cloudflare.com/v1/${c.env.AI_GATEWAY_ACCOUNT_ID}/${c.env.AI_GATEWAY_ID}`
 
  // Attempt OpenAI first
  const openAIResult = await tryOpenAI({
    prompt: normalizedPrompt,
    model,
    maxTokens,
    cacheEnabled,
    gatewayBase,
    apiKey: c.env.OPENAI_API_KEY,
  })
 
  if (openAIResult.success) {
    return c.json(openAIResult.data)
  }
 
  // Log the failure reason before attempting fallback
  console.warn(`OpenAI failed (${openAIResult.errorCode}), trying Gemini fallback`)
 
  // Only fall back when it makes sense to try a different provider
  if (!shouldFallback(openAIResult.errorCode)) {
    return c.json({ error: openAIResult.errorMessage }, openAIResult.errorCode)
  }
 
  const geminiResult = await tryGemini({
    prompt: normalizedPrompt,
    cacheEnabled,
    gatewayBase,
    apiKey: c.env.GEMINI_API_KEY,
  })
 
  if (geminiResult.success) {
    return c.json(geminiResult.data)
  }
 
  console.error(`All providers failed. OpenAI: ${openAIResult.errorCode}, Gemini: ${geminiResult.errorCode}`)
  return c.json({ error: 'AI service temporarily unavailable' }, 503)
})
 
// Structured OpenAI call with explicit success/failure typing
async function tryOpenAI(params: {
  prompt: string
  model: string
  maxTokens: number
  cacheEnabled: boolean
  gatewayBase: string
  apiKey: string
}): Promise<{ success: true; data: unknown } | { success: false; errorCode: number; errorMessage: string }> {
  try {
    const response = await fetch(`${params.gatewayBase}/openai/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${params.apiKey}`,
        'cf-aig-cache-ttl': params.cacheEnabled ? '86400' : '0',
      },
      body: JSON.stringify({
        model: params.model,
        messages: [{ role: 'user', content: params.prompt }],
        max_tokens: params.maxTokens,
        temperature: params.cacheEnabled ? 0 : 0.7,
      }),
      signal: AbortSignal.timeout(10000),
    })
 
    if (response.ok) {
      return { success: true, data: await response.json() }
    }
 
    let errorMessage = `OpenAI error ${response.status}`
    try {
      const errBody: any = await response.json()
      errorMessage = errBody?.error?.message || errorMessage
    } catch { /* ignore */ }
 
    return { success: false, errorCode: response.status, errorMessage }
 
  } catch (err: any) {
    if (err.name === 'TimeoutError') {
      return { success: false, errorCode: 504, errorMessage: 'OpenAI request timed out' }
    }
    return { success: false, errorCode: 503, errorMessage: 'OpenAI network error' }
  }
}
 
// Gemini call with OpenAI-compatible response format
async function tryGemini(params: {
  prompt: string
  cacheEnabled: boolean
  gatewayBase: string
  apiKey: string
}): Promise<{ success: true; data: unknown } | { success: false; errorCode: number; errorMessage: string }> {
  try {
    const response = await fetch(
      `${params.gatewayBase}/google-ai-studio/v1beta/models/gemini-2.0-flash:generateContent`,
      {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-goog-api-key': params.apiKey,
          'cf-aig-cache-ttl': params.cacheEnabled ? '86400' : '0',
        },
        body: JSON.stringify({
          contents: [{ parts: [{ text: params.prompt }] }],
          generationConfig: { temperature: params.cacheEnabled ? 0 : 0.7 },
        }),
        signal: AbortSignal.timeout(12000),
      }
    )
 
    if (!response.ok) {
      return {
        success: false,
        errorCode: response.status,
        errorMessage: `Gemini error ${response.status}`,
      }
    }
 
    const data: any = await response.json()
    const content = data?.candidates?.[0]?.content?.parts?.[0]?.text
 
    if (!content) {
      return { success: false, errorCode: 500, errorMessage: 'Empty Gemini response' }
    }
 
    // Return in OpenAI-compatible format so the Rork app doesn't need provider-specific handling
    return {
      success: true,
      data: {
        choices: [{ message: { role: 'assistant', content }, finish_reason: 'stop' }],
        _provider: 'gemini',
      },
    }
  } catch (err: any) {
    if (err.name === 'TimeoutError') {
      return { success: false, errorCode: 504, errorMessage: 'Gemini request timed out' }
    }
    return { success: false, errorCode: 503, errorMessage: 'Gemini network error' }
  }
}
 
// Whether a given HTTP error code warrants trying a different provider
function shouldFallback(errorCode: number): boolean {
  // 400: Bad request — the prompt itself is malformed. Gemini won't help.
  if (errorCode === 400) return false
  // 401: Auth error — a key configuration problem, not provider-specific.
  if (errorCode === 401) return false
  // 403: Permissions — same as 401.
  if (errorCode === 403) return false
  // 429: Rate limited — a different provider might have quota remaining.
  if (errorCode === 429) return true
  // 5xx: Server errors — definitely worth trying another provider.
  if (errorCode >= 500) return true
  // 504: Timeout — worth trying.
  if (errorCode === 504) return true
  return false
}
 
// Normalize prompt text to maximize cache hit rate
function normalizePrompt(prompt: string): string {
  return prompt
    .trim()
    .replace(/\s+/g, ' ')
    .replace(/[""]/g, '"')
    .replace(/['']/g, "'")
    .replace(/　/g, ' ')  // Japanese full-width space
    .toLowerCase()
}
 
// Conservative check: only cache when we're confident the output is deterministic enough
function isCacheable(prompt: string): boolean {
  // Personal context signals — don't cache
  const personalSignals = /\b(my|mine|i am|i'm|our|we|yesterday|today|right now|currently|this week)\b/i
  if (personalSignals.test(prompt)) return false
 
  // Too short — context too ambiguous
  if (prompt.length < 40) return false
 
  // Classification, summarization, and extraction are great candidates
  const cacheHints = /(classify|categorize|summarize|translate|extract|tag|label|detect|identify)/i
  if (cacheHints.test(prompt)) return true
 
  // Default: cache it. Conservative analysis above should catch problematic cases.
  return true
}
 
export default app

Step 3: The Rork App Integration

// hooks/useAI.ts — Drop this into your Rork-generated app
import { useState, useCallback, useRef } from 'react'
import AsyncStorage from '@react-native-async-storage/async-storage'
 
const WORKER_URL = process.env.EXPO_PUBLIC_AI_WORKER_URL!
const LOCAL_CACHE_TTL_MS = 60 * 60 * 1000 // 1 hour
 
interface GenerateOptions {
  prompt: string
  model?: string
  cache?: boolean
}
 
interface GenerateResult {
  content: string
  fromLocalCache: boolean
  provider: string
}
 
export function useAI() {
  const [loading, setLoading] = useState(false)
  const [error, setError] = useState<string | null>(null)
  const controllerRef = useRef<AbortController | null>(null)
 
  const generate = useCallback(async (opts: GenerateOptions): Promise<GenerateResult | null> => {
    const { prompt, model = 'gpt-4o-mini', cache = true } = opts
 
    setLoading(true)
    setError(null)
    controllerRef.current?.abort()
    controllerRef.current = new AbortController()
 
    // Tier 1: AsyncStorage local cache (instant, no network)
    if (cache) {
      try {
        const key = `aicache:${prompt.slice(0, 120)}`
        const raw = await AsyncStorage.getItem(key)
        if (raw) {
          const { data, ts } = JSON.parse(raw)
          if (Date.now() - ts < LOCAL_CACHE_TTL_MS) {
            setLoading(false)
            return { ...data, fromLocalCache: true }
          }
        }
      } catch {
        // Non-fatal, continue to network request
      }
    }
 
    try {
      const userId = await resolveUserId()
 
      const res = await fetch(`${WORKER_URL}/ai/generate`, {
        method: 'POST',
        signal: controllerRef.current.signal,
        headers: {
          'Content-Type': 'application/json',
          'X-User-ID': userId,
        },
        body: JSON.stringify({ prompt, model, cache }),
      })
 
      if (res.status === 429) {
        const { retryAfterSeconds } = await res.json()
        const minutes = Math.ceil(retryAfterSeconds / 60)
        setError(`Request limit reached. Try again in ${minutes} minute${minutes !== 1 ? 's' : ''}.`)
        setLoading(false)
        return null
      }
 
      if (res.status === 503) {
        setError('AI is temporarily unavailable. Please try again shortly.')
        setLoading(false)
        return null
      }
 
      if (!res.ok) {
        throw new Error(`Worker returned ${res.status}`)
      }
 
      const json = await res.json<{ choices: Array<{ message: { content: string } }>; _provider?: string }>()
      const result: GenerateResult = {
        content: json.choices[0].message.content,
        fromLocalCache: false,
        provider: json._provider ?? 'openai',
      }
 
      // Tier 2 write: save to local cache for subsequent requests
      if (cache) {
        const key = `aicache:${prompt.slice(0, 120)}`
        AsyncStorage.setItem(key, JSON.stringify({ data: result, ts: Date.now() })).catch(() => {})
      }
 
      setLoading(false)
      return result
 
    } catch (err: any) {
      if (err.name === 'AbortError') {
        setLoading(false)
        return null
      }
      setError('Something went wrong. Please try again.')
      console.error('[useAI] error:', err)
      setLoading(false)
      return null
    }
  }, [])
 
  const cancel = useCallback(() => controllerRef.current?.abort(), [])
 
  return { generate, loading, error, cancel }
}
 
async function resolveUserId(): Promise<string> {
  const existing = await AsyncStorage.getItem('ai_user_id')
  if (existing) return existing
  const generated = `anon:${Date.now().toString(36)}${Math.random().toString(36).slice(2)}`
  await AsyncStorage.setItem('ai_user_id', generated)
  return generated
}

Usage in a component:

// Example: AI-powered content categorization in a Rork app
import { useAI } from '../hooks/useAI'
 
export function CategoryTagger({ content }: { content: string }) {
  const { generate, loading, error } = useAI()
  const [category, setCategory] = useState<string | null>(null)
 
  const categorize = async () => {
    const result = await generate({
      prompt: `Categorize the following content into exactly one of these categories: 
               Technology, Health, Finance, Entertainment, Education, Other.
               Respond with only the category name.
               Content: ${content}`,
      cache: true, // Identical content will be served from cache
    })
 
    if (result) {
      setCategory(result.content.trim())
      // Optional: show a subtle indicator if served from local cache
      if (result.fromLocalCache) {
        console.log('Category served from local cache — instant response')
      }
    }
  }
 
  return (
    // ... your UI
  )
}

The Three Cache-Killing Mistakes

Mistake 1: Streaming and caching are mutually exclusive

Cloudflare AI Gateway cannot cache streamed responses. This is a hard architectural constraint.

If you have chat-like features that need real-time streaming output, keep them on a separate endpoint (/ai/stream) that deliberately sets cf-aig-cache-ttl: 0. Reserve the cacheable /ai/generate endpoint for batch-style tasks like classification, summarization, and extraction.

Mistake 2: Variable temperature breaks cache key consistency

If you call the same prompt with temperature: 0.7 and then with temperature: 0.8, these are different cache keys even though the inputs are identical. For any task where you want caching to work, fix the temperature at 0.

This also happens to be correct behavior: classification and summarization tasks should be deterministic. Using temperature 0 isn't just a caching optimization — it produces more consistent, reliable outputs for structured tasks.

Mistake 3: Failing to normalize prompts before hashing

Consider these three prompts that should produce identical results:

"Classify the following text into a category:\nAI tools for developers"
"Classify the following text into a category: AI tools for developers"
"classify the following text into a category: AI tools for developers"

Without normalization, these are three separate cache entries. With the normalization function in the Workers code above — collapsing whitespace, lowercasing, normalizing punctuation — they become a single cache entry. In my implementation, this single change moved cache hit rate from 12% to 61%.

Real Numbers After 30 Days

Here's what the Cloudflare AI Gateway dashboard showed after one month of running this setup on an app with roughly 1,200 monthly active users:

Total requests served: 47,800
Cache hits: 35,100 (73.4%)
Actual API calls made: 12,700
Gemini fallback activations: 340 (0.71% of requests)
Average response latency (cache hit): 87ms
Average response latency (API call): 1,240ms
OpenAI API cost: $18.40
Previous month (no caching): $127.60

The latency improvement was an unexpected bonus. Cache hits returning in under 100ms versus 1-2 seconds for live API calls made the app feel noticeably snappier — user session lengths increased by about 18% in the week after deployment.

Where to Start

The implementation in this article is production-ready, but deploy incrementally:

Week 1: Set up AI Gateway in Cloudflare, add the gateway URL to your existing Workers code with cf-aig-cache-ttl: 0 (caching disabled). This lets you see your baseline traffic patterns in the dashboard without changing behavior.

Week 2: Enable caching on low-risk endpoints first — classification tasks with temperature 0. Check cache hit rates after 48 hours.

Week 3: Add prompt normalization. This typically delivers the biggest cache hit rate improvement.

Week 4: Implement provider failover and rate limiting.

By the time you've completed all four weeks, you'll have a measurably cheaper and more resilient AI backend — and you'll understand exactly which parts of your app are driving the most API cost.

Advanced Caching Patterns for Common Rork App Scenarios

Beyond basic caching, there are several patterns worth knowing for specific use cases that appear frequently in Rork-generated apps.

Pattern 1: Tiered Cache TTL Based on Content Volatility

Not all prompts should be cached for the same duration. A classification of "what category does this product fall into?" is valid for weeks. A summary of "what are trending topics in this app today?" might be stale in an hour.

function getCacheTTL(prompt: string): number {
  // Real-time signals: very short cache
  if (/(trending|popular right now|latest|breaking|current)/i.test(prompt)) {
    return 3600 // 1 hour
  }
  // User-generated content classification: medium cache
  if (/(classify|categorize|tag|label)/i.test(prompt)) {
    return 604800 // 1 week
  }
  // Factual extraction from static content: long cache
  if (/(extract|summarize|translate)/i.test(prompt)) {
    return 2592000 // 30 days
  }
  // Default
  return 86400 // 24 hours
}
 
// Apply in your fetch call:
// 'cf-aig-cache-ttl': String(getCacheTTL(normalizedPrompt))

The 30-day TTL for factual extraction might seem aggressive, but if you are extracting structured data from static articles or product descriptions, the content genuinely does not change. A 30-day cache on those requests can yield dramatic cost reductions on apps with large content libraries.

Pattern 2: Cache Warming for Predictable High-Traffic Prompts

In apps with user-generated content, certain prompts become predictably popular. A news aggregation app will classify the same viral article hundreds of times in an hour. Instead of letting the first user trigger the API call while the next few hundred also hit uncached responses (there is a brief population window), you can proactively warm the cache when content is ingested:

export async function warmCacheForContent(
  contentText: string,
  gatewayBase: string,
  apiKey: string
): Promise<void> {
  const snippet = contentText.slice(0, 500).toLowerCase().trim().replace(/\s+/g, ' ')
  
  const warmupPrompts = [
    `classify this content into one of: technology, health, finance, entertainment, education, other. content: ${snippet}`,
    `extract the top 3 keywords from this content: ${snippet}`,
    `write a one-sentence summary of this content: ${snippet}`,
  ]
 
  // Fire all warmup requests in parallel — they populate the cache before real users arrive
  await Promise.allSettled(
    warmupPrompts.map(prompt =>
      fetch(`${gatewayBase}/openai/chat/completions`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json',
          'cf-aig-cache-ttl': '86400',
        },
        body: JSON.stringify({
          model: 'gpt-4o-mini',
          messages: [{ role: 'user', content: prompt }],
          temperature: 0,
          max_tokens: 200,
        }),
      })
    )
  )
}

Run this as part of your content ingestion pipeline. By the time users request new content, the cache is already warm — zero latency penalty for the first batch of readers.

Pattern 3: Selective Cache Bypass for Premium Users

Some users will notice and care if they receive a cached response that feels slightly stale for their specific context. A practical solution is giving premium users a way to bypass cache while keeping it active for free-tier users:

app.post('/ai/generate', async (c) => {
  const userId = c.get('userId')
  const body = await c.req.json()
  const { prompt, cache = true, forceRefresh = false } = body
 
  // Premium users can request fresh results by passing forceRefresh: true
  // This is enforced server-side so free users cannot bypass cache
  const isPremiumUser = userId.startsWith('premium:')
  const bypassCache = forceRefresh && isPremiumUser
  const cacheEnabled = cache && !bypassCache && isCacheable(normalizePrompt(prompt))
 
  // ... rest of implementation
  // In headers: 'cf-aig-cache-ttl': cacheEnabled ? '86400' : '0'
})

The UX framing works in your favor here: "Premium members always get real-time AI responses" is a legitimate feature differentiator that costs you very little since most prompts will have been seen before and the cache will be fresh anyway.

Deployment Checklist

Before shipping to production, verify these items:

Workers environment variables set via wrangler secret put: OPENAI_API_KEY, GEMINI_API_KEY, AI_GATEWAY_ACCOUNT_ID, AI_GATEWAY_ID
KV namespace created and bound as RATE_LIMIT in wrangler.toml
CORS origin list locked to your actual app domain rather than a wildcard
AI Gateway "Cache Responses" enabled in Cloudflare dashboard with a sensible default TTL
Rate limit thresholds calibrated to your expected usage pattern — the 50 req/hour default here is conservative; adjust based on your feature set
Gemini API key active and billing enabled on Google Cloud Console
Failover test passed: temporarily swap in an invalid OpenAI key, confirm that Gemini responses arrive correctly with _provider: "gemini" in the body
Load test with 50 rapid identical requests and confirm cache hit rate spikes on the Cloudflare dashboard within a few minutes

The gateway itself is just infrastructure. What makes the difference is how thoughtfully you design the boundary between "this response should be cached" and "this response must be fresh." That judgment call, made consistently across your app's AI features, is what separates a sustainable AI-powered app from one that becomes unaffordable as it grows.