Skip to content

DeepSeek V4 Flash

DeepSeek V4 Flash is DeepSeek's April 23, 2026 efficiency-tier model in the V4 series. It pairs a hybrid attention architecture with a context window of 1.0M tokens and supports reasoning, tool use, and implicit caching.

ReasoningTool UseImplicit Caching
index.ts
import { streamText } from 'ai'
const result = streamText({
model: 'deepseek/deepseek-v4-flash',
prompt: 'Why is the sky blue?'
})

Playground

Try out DeepSeek V4 Flash by DeepSeek. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider
Context
Latency
Throughput
Input
Output
Cache
Web Search
Per Query
Capabilities
ZDR
No Training
Release Date
DeepSeek
Legal:Terms
Privacy
1M
1.7s
78tps
$0.14/M$0.28/M
Read:$0.0/M
Write:
04/23/2026
Novita AI
Legal:Terms
Privacy
1M
2.5s
79tps
$0.14/M$0.28/M
Read:$0.03/M
Write:
04/23/2026
DeepInfra
Legal:Terms
Privacy
1M
0.6s
25tps
$0.14/M$0.28/M
Read:$0.03/M
Write:
04/23/2026
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by DeepSeek

Model
Context
Latency
Throughput
Input
Output
Cache
Web Search
Per Query
Capabilities
Providers
ZDR
No Training
Release Date
1M
1.2s
50tps
$0.43/M$0.87/M
Read:$0.0/M
Write:
deepinfra logo
deepseek logo
fireworks logo
+1
04/23/2026
164K
0.8s
58tps
$0.28/M$0.42/M
Read:$0.03/M
Write:
bedrock logo
deepinfra logo
deepseek logo
+2
12/01/2025
164K
0.8s
76tps
$0.28/M$0.42/M
Read:$0.03/M
Write:
bedrock logo
deepinfra logo
deepseek logo
+2
12/01/2025
131K
1.7s
26tps
$0.27/M$1.00/M
Read:$0.14/M
Write:
novita logo
09/22/2025
164K
0.2s
198tps
$0.50/M$1.50/M
Read:$0.13/M
Write:
baseten logo
deepinfra logo
fireworks logo
+3
08/21/2025
164K
1.0s
113tps
$0.77/M$0.77/M
Read:$0.14/M
Write:
baseten logo
novita logo
12/26/2024

About DeepSeek V4 Flash

DeepSeek V4 Flash was released April 23, 2026 as part of DeepSeek's V4 generation. The V4 series introduces a hybrid attention architecture that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), along with ManifoldConstrained Hyper-Connections (mHC) that refine standard residual connections. The combination targets efficient long-context inference at the 1.0M tokens window.

DeepSeek V4 Flash positions as the efficiency tier of the V4 lineup. It handles instruction following, classification, short-form Q&A, and other tasks where latency and per-token cost matter more than maximum reasoning depth. Maximum output is 1.0M tokens, the same budget as DeepSeek V4 Pro, so single-call response length is not the differentiator. The split between Flash and Pro is about capability depth and cost.

DeepSeek V4 Flash supports tool use and reasoning, and the model is tagged for implicit caching. Implicit caching reduces input-token charges for repeated prefixes without requiring explicit cache-control headers in the request. Access is through AI Gateway with an AI Gateway API key or OIDC token, so you don't need a separate DeepSeek platform account.

What To Consider When Choosing a Provider

  • Configuration: DeepSeek V4 Flash is tuned for speed and cost on shorter tasks. If your workload involves multi-step reasoning, complex agentic flows, or long synthesis chains, DeepSeek V4 Pro is the better fit within the same generation.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use DeepSeek V4 Flash

Best For

  • High-volume classification: Routing and short-answer pipelines where $0.14 input and $0.28 output keep unit economics tight
  • Short-form instruction following: Summarization of short inputs, structured extraction, and rewriting tasks without multi-step planning
  • Front-line agent steps: Intent detection and parameter parsing before handing off to a deeper-reasoning model
  • Implicit caching workloads: Long, repeated system prompts across many calls benefit from cached input pricing

Consider Alternatives When

  • Complex agent orchestration: Use DeepSeek V4 Pro within the same generation for multi-step reasoning and tool planning
  • Earlier-generation pricing: DeepSeek V3 family models may be lower cost when the 1.0M tokens window or V4 capabilities aren't required
  • Dedicated deep reasoning: DeepSeek-R1 remains the open-weights reasoning specialist for extended chain-of-thought workloads

Conclusion

DeepSeek V4 Flash is the efficiency tier of the V4 generation, suited to high-volume short-form tasks where cost and latency dominate. For deeper reasoning and agentic workflows within the same generation, step up to DeepSeek V4 Pro.

Frequently Asked Questions

  • What does the V4 hybrid attention architecture change for inference?

    DeepSeek V4 Flash combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), and uses ManifoldConstrained Hyper-Connections (mHC) in place of standard residual connections. The combination targets efficient inference at long context, including the full 1.0M tokens window.

  • When should I pick DeepSeek V4 Flash over DeepSeek V4 Pro?

    Pick DeepSeek V4 Flash for instruction following, classification, and short-form question answering where latency and per-token cost matter most. Use DeepSeek V4 Pro for complex reasoning, multi-step problem solving, and agentic tasks.

  • What is the context window and max output for DeepSeek V4 Flash?

    The context window is 1.0M tokens and the maximum output is 1.0M tokens.

  • What does implicit caching do for pricing?

    Implicit caching detects repeated input prefixes (typically long system prompts) and charges the cached input rate of $0.0028 per token instead of the standard $0.14 input rate. No explicit cache-control header is required.

  • Does DeepSeek V4 Flash support tool calls?

    Yes. DeepSeek V4 Flash is tagged for tool use and reasoning, so function calling works through the AI SDK as well as Chat Completions, Responses, and Messages API formats.

  • Does DeepSeek V4 Flash support zero data retention?

    Yes, Zero Data Retention is available for this model. Zero Data Retention is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.