The Real Cost of Inference

A discount that stayed

A temporary discount that was supposed to expire is now permanent. Either DeepSeek found a cost advantage nobody else has, or they’re buying market share they can’t hold on price alone.

DeepSeek turned their V4 Pro introductory pricing into the new baseline. The silence from the rest of the AI API market is louder than any press release.

The question is which one this is — cost advantage or land grab — and what it means for you if you’re building applications or agents at scale.

Key Takeaway

Permanent pricing removes the “will they jack up rates next quarter” fear that kills enterprise procurement deals. Temporary discounts poison long-term planning. Permanent ones reveal strategy.

What the pricing page actually says

The DeepSeek API pricing page (api-docs.deepseek.com/quick_start/pricing) is the canonical reference. It’s a straightforward table of per-token prices for input, output, and cache-hit input, updated to reflect the new permanent pricing.

This matters because DeepSeek has been the one enforcing price discipline since V2. Every time OpenAI or Anthropic adjusts pricing, the market watches to see if DeepSeek follows. When DeepSeek cuts and holds, everyone else feels the margin pressure.

The timing lands right as enterprises are moving from “which model is best on a benchmark” to “which model can I run at scale without burning my infrastructure budget.” That’s exactly the conversation DeepSeek wants to lead.

“For all models, the input cache hit price has been reduced to 1/10 of the launch price. This price adjustment takes effect from 2026/4/26 12:15 UTC.”

DeepSeek API Pricing Docs · DeepSeek Docs

That quote matters. It’s not a promotion. It’s a permanent structural discount.

The real cost picture

DeepSeek V4 Pro now sits at permanent pricing that was previously a temporary launch discount. The numbers are aggressive, and they land in a very different market than the GPT-4o vs Claude 3.5 era:

Model	Input (per 1M tokens)	Cached input (per 1M tokens)	Output (per 1M tokens)
DeepSeek V4 Flash	$0.14	~$0.003	$0.28
DeepSeek V4 Pro	$2.00	$0.50	$8.00

DeepSeek V4 Flash at $0.14/MTok input and $0.28/MTok output is in a tier of its own. It’s not a direct competitor to frontier models; it’s a different category. For high-volume, latency-tolerant workloads like classification, extraction, summarization, and single-turn RAG, it’s absurdly cheap.

DeepSeek V4 Pro sits in the sweet spot: competitive input pricing, a 1M context window, and output pricing that makes multi-turn agentic loops economically viable in ways that $30/MTok output doesn’t.

The cache mechanism auto-detects repeated prompt prefixes and bills the matching input tokens at the $0.50 cached rate instead of the $2.00 standard rate. For production workloads with predictable prompt structures like RAG pipelines, code generation scaffolding, and agentic loops, that’s where the real savings live.

“DeepSeek V4 Pro at $8/MTok output changes the architecture math. When your output cost is significantly lower than competitors, you can afford to run more agent turns, more retries, more fallback chains. Cheap output changes how you design the loop.”

Pricing analysis, codegrit.dev
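To make the loop math concrete, here’s a rough sketch that prices one agent run at the V4 Pro rates from the table above. The workload numbers (20 turns, 30k input and 1k output tokens per turn, 80% cache hits) are my own hypothetical assumptions, and the $30 figure is just the comparison point mentioned earlier, not a quoted vendor price.

```python
def agent_loop_cost(
    turns: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    cache_hit_rate: float,
    input_price: float = 2.00,   # V4 Pro input, USD per 1M tokens (from the table)
    cached_price: float = 0.50,  # V4 Pro cached input, USD per 1M tokens
    output_price: float = 8.00,  # V4 Pro output, USD per 1M tokens
) -> float:
    """Estimated USD cost of one agent run under a flat per-turn token model."""
    blended_input = cache_hit_rate * cached_price + (1 - cache_hit_rate) * input_price
    input_cost = turns * input_tokens_per_turn / 1_000_000 * blended_input
    output_cost = turns * output_tokens_per_turn / 1_000_000 * output_price
    return input_cost + output_cost


# Hypothetical workload, chosen only for illustration.
workload = dict(
    turns=20,
    input_tokens_per_turn=30_000,
    output_tokens_per_turn=1_000,
    cache_hit_rate=0.8,
)

at_v4_pro = agent_loop_cost(**workload)
at_30_output = agent_loop_cost(**workload, output_price=30.00)
print(f"V4 Pro output price: ${at_v4_pro:.2f} per run")
print(f"$30/MTok output price: ${at_30_output:.2f} per run")
```

The second call holds the input side fixed, so the gap between the two numbers isolates what the output price alone does to the cost of each extra turn or retry.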

“The deepseek-v4-pro model API pricing will be officially adjusted to 1/4 of the original price after the 75% discount promotion ends on 2026/05/31 15:59 UTC.”

DeepSeek API Pricing Docs · DeepSeek Docs

Read the quote again and notice what’s really there. DeepSeek isn’t just publishing rates; they’re publishing infrastructure economics. This is a company treating inference as an engineering cost to be optimized rather than a margin to be maintained.

Figure: pricing comparison of the DeepSeek V4 Flash and DeepSeek V4 Pro tiers.

Bottom Line

The real cost driver isn’t the headline per-token rate — it’s your cache hit rate. A 60% cache hit on V4 Pro drops your effective input cost to $1.10 per million tokens. That’s a structural discount, not a promotional one. Factor in the 1M context window and 384K max output, and the architecture implications are real for teams running at volume.
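The $1.10 figure is just a weighted average of the two input prices. A minimal sketch, using the V4 Pro rates from the table and a few hit rates I picked for illustration:

```python
def effective_input_price(hit_rate: float, miss_price: float, hit_price: float) -> float:
    """Blended USD price per 1M input tokens for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


V4_PRO_INPUT = 2.00   # USD per 1M input tokens (cache miss)
V4_PRO_CACHED = 0.50  # USD per 1M input tokens (cache hit)

# Illustrative hit rates; only the 60% case comes from the article.
for rate in (0.0, 0.3, 0.6, 0.9):
    price = effective_input_price(rate, V4_PRO_INPUT, V4_PRO_CACHED)
    print(f"hit rate {rate:.0%}: ${price:.2f} per 1M input tokens")
```

Every ten points of hit rate is worth $0.15 per million input tokens at these prices, which is why prompt structure matters more than the headline rate.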

Yes, with conditions

Permanent pricing removes the uncertainty that makes enterprise procurement nervous. When a CTO signs off on a model integration, they want to know what the cost picture looks like 12 months out. Temporary discounts poison that conversation. Permanent pricing, even at the discounted level, is a cleaner signal.

The conditions are the operational gaps worth knowing about before you bet your architecture on them:

No cache analytics dashboard — there’s no API to check your effective cache hit rate. You infer it from your bill.
Latency behavior under high concurrent load is less documented than it should be.
Thin documentation on rate limits and retry behavior.
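Inferring the hit rate from the bill is just the blended-price formula run backwards. Here’s a sketch, assuming your bill lets you separate input spend from output spend (an assumption on my part) and using made-up monthly numbers:

```python
def inferred_hit_rate(
    input_cost_usd: float,
    input_tokens: int,
    miss_price: float,
    hit_price: float,
) -> float:
    """Back out the cache hit rate implied by total input spend.

    Prices are USD per 1M input tokens. The result is clamped to [0, 1]
    so rounding noise on the bill cannot produce an impossible rate.
    """
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    if miss_price <= hit_price:
        raise ValueError("miss_price must be greater than hit_price")
    blended = input_cost_usd / (input_tokens / 1_000_000)
    rate = (miss_price - blended) / (miss_price - hit_price)
    return min(1.0, max(0.0, rate))


# Hypothetical month: 500M input tokens billed at $550 of input spend on V4 Pro.
rate = inferred_hit_rate(550.0, 500_000_000, miss_price=2.00, hit_price=0.50)
print(f"implied cache hit rate: {rate:.0%}")
```

It only gives you one aggregate number per billing period, which is exactly the limitation: you can’t tell which prompts are hitting the cache and which aren’t.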

“Concurrency Limit — V4 Flash: 2500 | V4 Pro: 500”

DeepSeek API Pricing Docs · DeepSeek Docs

Those are high limits. 500 concurrent requests on V4 Pro is generous, but without visibility into your actual cache behavior, you’re flying partially blind.
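Whatever the limit is, you want to enforce it on your side rather than discover it through rejected requests. A minimal sketch with `asyncio.Semaphore`; `call_model` is a hypothetical stand-in for your real client call, not a DeepSeek SDK function:

```python
import asyncio

V4_PRO_CONCURRENCY_LIMIT = 500  # from the pricing docs quote above


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"


async def run_bounded(prompts: list[str], limit: int) -> list[str]:
    """Run all prompts, never holding more than `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    # gather() returns results in the same order as the input prompts.
    return await asyncio.gather(*(worker(p) for p in prompts))


prompts = [f"task {i}" for i in range(50)]
results = asyncio.run(run_bounded(prompts, limit=10))
print(f"completed {len(results)} requests")
```

In practice I’d set the cap somewhat below the documented limit to leave headroom for other services sharing the same account.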

I also want to see what happens when GPU supply tightens. DeepSeek’s cost advantage is partly architectural (MoE efficiency on V4 is genuinely impressive) and partly geopolitical. If global GPU supply gets squeezed, that advantage may compress.

“Permanent doesn’t mean forever. It means ‘for the foreseeable future, this is the price of admission.’ Take it, but keep your exit strategy warm.”

Pricing analysis, codegrit.dev

Where DeepSeek goes from here

If I were running the DeepSeek API product, I’d have done three things differently.

First: ship cache analytics on day one. The caching mechanism is one of V4 Pro’s strongest competitive advantages. It’s also invisible to the customer. Give every API user a dashboard showing cache hit rate by endpoint, by prompt prefix, by time window. That turns a passive pricing advantage into an active value lever, which increases switching costs without increasing prices.

Second: publish a reliability SLA for the discounted tier. The unspoken anxiety in every “cheaper AI API” conversation is reliability. Cheap doesn’t matter if the endpoint is flaky. DeepSeek should have paired the permanent pricing with a 99.5% uptime SLA on V4 Pro, backed by service credits.

Third: launch a “bring your own context” tier. A tier where you prepay to reserve cache capacity for your prompt prefixes, at below-market cache rates, would be a killer product for high-volume enterprise workloads. It’s what AWS did with Reserved Instances, applied to inference.

These three moves would turn a pricing announcement into a platform strategy.

Bottom Line

DeepSeek V4 Pro’s permanent discount is real and significant for teams running at volume. The 1M context window and structurally cheap output pricing open architectural options that weren’t viable at competitor rates. But the cache analytics gap, latency documentation, and operational unknowns mean you need strong infrastructure chops to capture the full value. Buy the price, but budget engineering overhead for monitoring, retry logic, and fallback strategies.
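What that overhead looks like in code is roughly this: retry with exponential backoff and jitter, then fall through to a backup provider. This is a sketch, not a production client. The providers are plain callables you supply, and catching bare `Exception` is a simplification you’d narrow to transient errors in real code.

```python
import random
import time
from typing import Callable, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying each with backoff before moving on."""
    last_error: Exception | None = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except Exception as exc:  # simplification: narrow this in real code
                last_error = exc
                if attempt < max_attempts - 1:
                    # Exponential backoff with jitter to avoid synchronized retries.
                    delay = base_delay * (2 ** attempt)
                    sleep(delay + random.uniform(0, delay))
    raise RuntimeError("all providers failed") from last_error
```

Put the cheap endpoint first and the expensive one second, and log every fallback: the fallback rate is the number that tells you whether the cheap price is actually the price you’re paying.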

The market just got a margin call

DeepSeek V4 Pro’s permanent discount is the closest thing the AI API market has seen to an honest price. The margin is thin, the infrastructure is engineered, and the cost advantage is real, as long as you bring your own ops maturity. Temporary discounts expire. Permanent ones reveal strategy. This one reveals that DeepSeek is playing the long game on inference economics.

The rest of the market just got a margin call.

Sources

DeepSeek API Pricing — https://api-docs.deepseek.com/quick_start/pricing
DeepSeek V4 Pro Model Documentation — https://api-docs.deepseek.com/models/deepseek-v4-pro
