What Cloudflare’s AI Service Actually Offers (And What It Doesn’t)
A circulating article promises “free, limitless AI inference.” Here’s what the official documentation really says.
Cloudflare Workers AI is a real, production-ready service worth knowing. But a recently circulating writeup glosses over one critical detail that could catch developers off guard. Here is a precise, source-verified account of what the platform actually provides.
What is Cloudflare Workers AI?
Workers AI is Cloudflare’s serverless AI inference service. Instead of renting GPU capacity yourself, you call Cloudflare’s API and the compute runs on their global edge network — in over 300 cities worldwide. You don’t manage servers, runtime environments, or scaling. The model runs; you pay for what you use.
The service supports a range of task types: text generation, text embeddings, image classification, object detection, automatic speech recognition, text-to-image, image-to-text, translation, summarization, and text-to-speech. Popular models include Llama 3.1, Mistral, DeepSeek-R1, and Qwen series models — well over 50 in total.
The claim that Workers AI supports 50+ open-source models across multiple task types is accurate and matches the current Cloudflare model catalogue.
Fact-check: claim by claim
The following reviews every major claim in the circulating article against Cloudflare’s official documentation.
Correct
Fully accurate. Workers AI is serverless by design. You write code, call the API, and Cloudflare handles all GPU provisioning, scaling, and infrastructure.
Correct
Confirmed. The current catalogue includes Llama 3.1, Mistral, DeepSeek-R1, Qwen3, GPT-OSS, FLUX image models, and many others across multiple task types.
Correct
Accurate. You can point the OpenAI SDK at Cloudflare’s base URL and migrate existing code with minimal changes. The interface is compatible with OpenAI’s chat completions format.
Nuance
Partly accurate, but misleading. Pay-as-you-go billing only applies to users on the Workers Paid plan ($5/month). On the free tier, when you hit the daily limit, requests simply fail — there is no automatic billing rollover.
Incomplete
The article is evasive here. The actual limit is 10,000 Neurons per day, reset at 00:00 UTC. Exceeding this causes requests to return errors. Cloudflare’s own documentation states: “If you exceed any one of the above limits, further operations will fail.” There is no grace period and no overflow; requests fail until the next reset.
Nuance
Not guaranteed. Some sources note that the free tier may restrict access to certain models. The full catalogue is available on paid plans. Check the Cloudflare documentation for current per-model availability.
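Because the free tier fails hard at the daily ceiling, an application may want to track its own spend and stop sending requests before Cloudflare rejects them. The following is a minimal client-side sketch, assuming the 10,000 Neuron allowance and 00:00 UTC reset described above. The names `DailyNeuronBudget`, `trySpend`, and `msUntilUtcReset` are hypothetical and not part of any Cloudflare SDK, and the per-request Neuron figure must come from your own estimate.

```typescript
// Client-side guard for the free tier's daily Neuron allowance.
// The 10,000 figure and the 00:00 UTC reset come from the article;
// every name below is hypothetical and not part of a Cloudflare SDK.
const FREE_DAILY_NEURONS = 10_000;

// Milliseconds from `now` until the next 00:00 UTC.
function msUntilUtcReset(now: Date): number {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1, // Date.UTC normalizes month and year rollover
  );
  return nextMidnight - now.getTime();
}

class DailyNeuronBudget {
  private used = 0;
  private dayKey: string;
  private readonly limit: number;

  constructor(limit: number = FREE_DAILY_NEURONS, now: Date = new Date()) {
    this.limit = limit;
    this.dayKey = DailyNeuronBudget.utcDay(now);
  }

  // "YYYY-MM-DD" in UTC, used to detect that a new day has started.
  private static utcDay(d: Date): string {
    return d.toISOString().slice(0, 10);
  }

  // Clears the counter when the UTC date has changed since the last call.
  private rollOver(now: Date): void {
    const today = DailyNeuronBudget.utcDay(now);
    if (today !== this.dayKey) {
      this.dayKey = today;
      this.used = 0;
    }
  }

  // Records the spend and returns true if it fits in today's budget;
  // returns false (and records nothing) if it would exceed the limit.
  trySpend(neurons: number, now: Date = new Date()): boolean {
    this.rollOver(now);
    if (this.used + neurons > this.limit) return false;
    this.used += neurons;
    return true;
  }

  remaining(now: Date = new Date()): number {
    this.rollOver(now);
    return this.limit - this.used;
  }
}
```

When `trySpend` returns false, the caller can show a fallback message or wait `msUntilUtcReset(new Date())` milliseconds rather than issue a request that is expected to fail. The counter lives in memory, so it only approximates real usage; the Cloudflare dashboard remains the authoritative figure.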
Understanding “Neurons” — Cloudflare’s billing unit
Cloudflare does not bill in tokens. It uses a proprietary unit called a Neuron, which represents the GPU compute required for a given request. Neurons are calculated as a function of input tokens, output tokens, and a per-model coefficient — heavier models cost more Neurons per request.
For a mid-size model like Llama 3.1 8B, 10,000 Neurons translates roughly to several hundred to a few thousand short conversations per day — adequate for personal projects and experimentation, but not for production workloads without a paid plan.
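To make the token-to-Neuron relationship concrete, here is a rough estimator. It assumes Neurons scale linearly with input and output tokens at per-model rates, which is one plausible reading of the description above rather than Cloudflare's published formula. The rates in `exampleRates` are placeholders chosen for illustration, not figures from Cloudflare's pricing page, and all names are hypothetical.

```typescript
// Neuron rates per million tokens for one model. Field names are
// hypothetical; look up real per-model rates on Cloudflare's pricing page.
interface NeuronRates {
  neuronsPerMInputTokens: number;
  neuronsPerMOutputTokens: number;
}

// PLACEHOLDER rates for illustration only, not taken from Cloudflare's docs.
const exampleRates: NeuronRates = {
  neuronsPerMInputTokens: 25_000,
  neuronsPerMOutputTokens: 75_000,
};

// Assumes a linear model: tokens times rate, scaled from "per million".
function estimateNeurons(
  inputTokens: number,
  outputTokens: number,
  rates: NeuronRates,
): number {
  return (
    (inputTokens * rates.neuronsPerMInputTokens) / 1_000_000 +
    (outputTokens * rates.neuronsPerMOutputTokens) / 1_000_000
  );
}

// How many identical requests fit into a daily Neuron allowance.
function requestsPerDay(dailyNeurons: number, neuronsPerRequest: number): number {
  return Math.floor(dailyNeurons / neuronsPerRequest);
}

// A short exchange: 200 input tokens and 100 output tokens.
const perRequest = estimateNeurons(200, 100, exampleRates);
const dailyRequests = requestsPerDay(10_000, perRequest);
// With the placeholder rates: 12.5 Neurons per request, 800 requests per day.
console.log(perRequest, dailyRequests);
```

Longer prompts or heavier models shrink that number quickly, so it is worth rerunning the estimate with the real rates for the model you plan to use.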
| Plan | Daily free Neurons | Overage rate | What happens at limit |
|---|---|---|---|
| Workers Free | 10,000 / day | No overage billing | Requests fail with an error |
| Workers Paid ($5/month base) | 10,000 / day (included) | $0.011 per 1,000 Neurons | Billing continues automatically |
Both plans include the same free daily allowance. The difference is what happens after: Workers Paid keeps running and charges you; Workers Free stops.
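The table can be turned into a small calculator. This sketch uses only the figures quoted above (10,000 free Neurons per day, $0.011 per 1,000 Neurons, $5 per month base) and assumes the free allowance is deducted each day before overage is charged. The function and type names are hypothetical.

```typescript
const DAILY_FREE_NEURONS = 10_000;
const USD_PER_1000_NEURONS = 0.011;
const PAID_BASE_USD_PER_MONTH = 5;

type Plan = "free" | "paid";

interface DayOutcome {
  servedNeurons: number; // Neurons actually served that day
  overageUsd: number;    // usage-based charge for that day
  requestsFail: boolean; // true once the free-plan ceiling is exceeded
}

// Models one day of usage under either plan.
function settleDay(plan: Plan, requestedNeurons: number): DayOutcome {
  const overNeurons = Math.max(0, requestedNeurons - DAILY_FREE_NEURONS);
  if (plan === "free") {
    // Free plan: nothing is billed, but usage is capped and requests fail.
    return {
      servedNeurons: Math.min(requestedNeurons, DAILY_FREE_NEURONS),
      overageUsd: 0,
      requestsFail: overNeurons > 0,
    };
  }
  // Paid plan: everything is served and the excess is billed.
  return {
    servedNeurons: requestedNeurons,
    overageUsd: (overNeurons * USD_PER_1000_NEURONS) / 1000,
    requestsFail: false,
  };
}

// Base fee plus overage for a month of identical days (an assumption).
function paidMonthlyEstimate(dailyNeurons: number, days: number): number {
  return PAID_BASE_USD_PER_MONTH + days * settleDay("paid", dailyNeurons).overageUsd;
}

// 50,000 Neurons per day for 30 days: about $0.44 per day in overage,
// so roughly $18.20 for the month including the base fee.
console.log(paidMonthlyEstimate(50_000, 30).toFixed(2));
```

The same 50,000 Neuron day on the free plan would serve only the first 10,000 Neurons and reject the rest, which is the behaviour the original article leaves out.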
Making your first API call
The cURL example in the original article is correct. Here it is with annotations for clarity:
```shell
# Run inference against Llama 3.1 8B via the REST API.
# Replace {ACCOUNT_ID} and {API_TOKEN} with your own values.
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct \
  -H 'Authorization: Bearer {API_TOKEN}' \
  -d '{ "prompt": "Where did the phrase Hello World come from?" }'
```
For teams already using OpenAI’s SDK, migration requires only a base URL swap:
```javascript
import OpenAI from 'openai';

// `env` is assumed to hold your Cloudflare API key and account ID,
// for example as Worker environment bindings.
const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  // Point the SDK at Cloudflare's OpenAI-compatible endpoint.
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

const completion = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Say hello' }],
});

console.log(completion.choices[0].message.content);
```
Who should actually use this?
Workers AI is well-suited for:
- Individual developers and learners. The free tier is genuinely useful for experimentation, prototyping, and learning. 10,000 Neurons daily is enough to run hundreds of inference calls if you are working with smaller models.
- Startups validating AI features. The Workers Paid plan’s pricing ($0.011 per 1,000 Neurons) is significantly cheaper than equivalently sized OpenAI models. For cost-sensitive early-stage products, the economics are attractive.
- Teams already on Cloudflare. If you’re using Workers, Pages, or R2, Workers AI integrates natively without additional vendor accounts or networking configuration.
One caveat: most LLMs on Workers AI enforce a limit of 300 requests per minute, independent of Neuron usage. If you’re building batch-processing workflows or high-frequency apps, plan for this ceiling. On the paid plan, you can implement queue systems or add delays between requests to stay within limits.
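One way to stay under a requests-per-minute ceiling is a client-side sliding-window limiter. The sketch below assumes the 300 requests per minute figure quoted above. How Cloudflare measures the window on its side is not specified here, so this is a conservative client-side approximation, and the class name and `reserve` method are hypothetical.

```typescript
// Client-side sliding-window limiter. It tracks the send times of recent
// requests and reports how long to wait once the window is full.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number = 300, windowMs: number = 60_000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns 0 if a request may be sent at `nowMs` (and records it),
  // otherwise the number of milliseconds to wait before trying again.
  reserve(nowMs: number): number {
    // Drop send times that have aged out of the window.
    for (;;) {
      const oldest = this.timestamps[0];
      if (oldest === undefined || nowMs - oldest < this.windowMs) break;
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(nowMs);
      return 0;
    }
    // Window is full: the next slot opens when the oldest entry expires.
    const oldest = this.timestamps[0] ?? nowMs;
    return oldest + this.windowMs - nowMs;
  }
}
```

A batch job would call `reserve(Date.now())` before each request and sleep for the returned number of milliseconds whenever it is non-zero. Passing the clock in as an argument keeps the limiter easy to test without real delays.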
The bottom line
The circulating article’s enthusiasm is understandable — Workers AI is a genuinely good service. But the framing around the free tier is misleading. It is not a “no token anxiety” experience if you’re expecting OpenAI-style pay-as-you-go metering from day one. The free plan imposes a hard daily ceiling, and when you hit it, your application stops working until midnight UTC.
For developers who understand this constraint, Workers AI is excellent value: global GPU inference, 50+ models, no infrastructure management, and a price point well below the major cloud AI providers. Sign up for the Workers Paid plan if you intend to run anything in production — the $5/month base is low, and actual Neuron costs for moderate workloads remain modest.
Quick reference
