DeepSeek V4 Flash

DeepSeek V4 Flash is the efficiency-tier model in DeepSeek's V4 series, released April 23, 2026. It pairs a hybrid attention architecture with a 1.0M-token context window and supports reasoning, tool use, and implicit caching.

Reasoning, Tool Use, Implicit Caching

index.ts

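import { streamText } from 'ai'

// Start a streaming generation against DeepSeek V4 Flash.
const result = streamText({
  model: 'deepseek/deepseek-v4-flash',
  prompt: 'Why is the sky blue?'
})

// Print each text chunk as it arrives.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}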

Playground

Try out DeepSeek V4 Flash by DeepSeek. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
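
As a rough sketch of what setting a provider preference could look like in code: the helper below builds a provider-options object holding an ordered list of slugs. The `gateway.order` shape is an assumption about the gateway's provider options and should be checked against the docs. The helper name `preferProviders` and the slugs `provider-a` and `provider-b` are hypothetical placeholders, not real provider slugs.

```typescript
// Assumed shape: an ordered list of provider slugs under `gateway.order`.
interface GatewayProviderOptions {
  gateway: { order: string[] };
}

// Hypothetical helper: builds the options object from copied provider slugs.
function preferProviders(slugs: string[]): GatewayProviderOptions {
  if (slugs.length === 0) {
    throw new Error('At least one provider slug is required');
  }
  // Copy the array so later changes to `slugs` do not leak into the options.
  return { gateway: { order: [...slugs] } };
}

// Placeholder slugs; replace them with slugs copied from the table below.
const providerOptions = preferProviders(['provider-a', 'provider-b']);
console.log(JSON.stringify(providerOptions));
```

If the assumption holds, the resulting object would be passed as `providerOptions` alongside `model` and `prompt` in the `streamText` call.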

Provider	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
(name not captured)	1.7s	78tps	$0.14/M	$0.28/M	$0.0/M	—	04/23/2026
(name not captured)	2.5s	79tps	$0.14/M	$0.28/M	$0.03/M	—	04/23/2026
(name not captured)	0.6s	25tps	$0.14/M	$0.28/M	$0.03/M	—	04/23/2026

More models by DeepSeek

Model	Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
(name not captured)	(not captured)	1.2s	50tps	$0.43/M	$0.87/M	$0.0/M	—	04/23/2026
(name not captured)	164K	0.8s	58tps	$0.28/M	$0.42/M	$0.03/M	—	12/01/2025
(name not captured)	164K	0.8s	76tps	$0.28/M	$0.42/M	$0.03/M	—	12/01/2025
(name not captured)	131K	1.7s	26tps	$0.27/M	$1.00/M	$0.14/M	—	09/22/2025
(name not captured)	164K	0.2s	198tps	$0.50/M	$1.50/M	$0.13/M	—	08/21/2025
(name not captured)	164K	1.0s	113tps	$0.77/M (input or output; one price not captured)		$0.14/M	—	12/26/2024

About DeepSeek V4 Flash

DeepSeek V4 Flash was released April 23, 2026 as part of DeepSeek's V4 generation. The V4 series introduces a hybrid attention architecture that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), along with Manifold-Constrained Hyper-Connections (mHC) that refine standard residual connections. The combination targets efficient long-context inference across the 1.0M-token window.

DeepSeek V4 Flash is positioned as the efficiency tier of the V4 lineup. It handles instruction following, classification, short-form Q&A, and other tasks where latency and per-token cost matter more than maximum reasoning depth. Maximum output is 1.0M tokens, the same budget as DeepSeek V4 Pro, so single-call response length is not the differentiator; the split between Flash and Pro is about capability depth and cost.

DeepSeek V4 Flash supports tool use and reasoning, and the model is tagged for implicit caching. Implicit caching reduces input-token charges for repeated prefixes without requiring explicit cache-control headers in the request. Access is through AI Gateway with an AI Gateway API key or OIDC token, so you don't need a separate DeepSeek platform account.
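
To make the tool-use support concrete, here is a minimal sketch of a request body in the Chat Completions format with one function tool attached. The `get_weather` tool, its parameters, and the helper name `buildToolCallRequest` are hypothetical and exist only for illustration. The sketch builds the body and sends nothing; the endpoint and auth header are left out because this page does not specify them.

```typescript
// Chat Completions-style function tool definition.
interface ToolDefinition {
  type: 'function';
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>; // JSON Schema for the arguments
  };
}

interface ChatRequest {
  model: string;
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
  tools: ToolDefinition[];
  tool_choice: 'auto' | 'none';
}

// Hypothetical tool: the name and schema are assumptions for this example.
const getWeatherTool: ToolDefinition = {
  type: 'function',
  function: {
    name: 'get_weather',
    description: 'Look up the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
  },
};

// Builds a request body that lets the model decide whether to call the tool.
function buildToolCallRequest(userMessage: string): ChatRequest {
  return {
    model: 'deepseek/deepseek-v4-flash',
    messages: [{ role: 'user', content: userMessage }],
    tools: [getWeatherTool],
    tool_choice: 'auto',
  };
}

const request = buildToolCallRequest('What is the weather in Lisbon?');
console.log(JSON.stringify(request, null, 2));
```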

What To Consider When Choosing a Provider

Configuration: DeepSeek V4 Flash is tuned for speed and cost on shorter tasks. If your workload involves multi-step reasoning, complex agentic flows, or long synthesis chains, DeepSeek V4 Pro is the better fit within the same generation.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use DeepSeek V4 Flash

Best For

High-volume classification: Routing and short-answer pipelines where $0.14/M input and $0.28/M output keep unit economics tight
Short-form instruction following: Summarization of short inputs, structured extraction, and rewriting tasks without multi-step planning
Front-line agent steps: Intent detection and parameter parsing before handing off to a deeper-reasoning model
Implicit caching workloads: Long, repeated system prompts across many calls benefit from cached input pricing
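
The effect of cached input pricing on a repeated system prompt can be estimated with some simple arithmetic. The sketch below uses the rates listed on this page and assumes the $0.0028 cached input figure is per million tokens, like the other rates. The `Usage` shape, the helper `estimateCostUsd`, and the token counts are hypothetical, and actual billing may differ by provider.

```typescript
// Rates in USD per million tokens, taken from this page.
const INPUT_PER_M = 0.14;
const OUTPUT_PER_M = 0.28;
// Assumption: the $0.0028 cached input figure is per million tokens.
const CACHED_INPUT_PER_M = 0.0028;

interface Usage {
  inputTokens: number;       // total prompt tokens, cached or not
  cachedInputTokens: number; // portion of the prompt served from cache
  outputTokens: number;
}

// Hypothetical helper: estimates the cost of one request in USD.
function estimateCostUsd(u: Usage): number {
  const uncachedInput = u.inputTokens - u.cachedInputTokens;
  const totalPerMillion =
    uncachedInput * INPUT_PER_M +
    u.cachedInputTokens * CACHED_INPUT_PER_M +
    u.outputTokens * OUTPUT_PER_M;
  return totalPerMillion / 1_000_000;
}

// Example: 8,000-token system prompt, 200-token user message, 100-token reply.
const coldCost = estimateCostUsd({ inputTokens: 8200, cachedInputTokens: 0, outputTokens: 100 });
const warmCost = estimateCostUsd({ inputTokens: 8200, cachedInputTokens: 8000, outputTokens: 100 });

console.log(`cold: $${coldCost.toFixed(6)}, warm: $${warmCost.toFixed(6)}`);
```

Under these assumptions, the cold call is dominated by the system prompt, so caching the prefix removes most of the input charge on later calls.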

Consider Alternatives When

Complex agent orchestration: Use DeepSeek V4 Pro within the same generation for multi-step reasoning and tool planning
Earlier-generation pricing: DeepSeek V3 family models may be lower cost when the 1.0M-token window or V4 capabilities aren't required
Dedicated deep reasoning: DeepSeek-R1 remains the open-weights reasoning specialist for extended chain-of-thought workloads
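
The Flash-versus-Pro guidance above can be sketched as a small routing function. The task categories and the helper `pickModel` are hypothetical, and the `deepseek/deepseek-v4-pro` slug is an assumption modeled on the Flash slug shown earlier on this page.

```typescript
// Hypothetical task categories drawn from the lists above.
type TaskKind =
  | 'classification'
  | 'structured-extraction'
  | 'short-form-qa'
  | 'intent-detection'
  | 'multi-step-reasoning'
  | 'agent-orchestration';

const FLASH_MODEL = 'deepseek/deepseek-v4-flash';
// Assumption: the Pro slug follows the same naming pattern as Flash.
const PRO_MODEL = 'deepseek/deepseek-v4-pro';

// Tasks where capability depth matters more than latency and per-token cost.
const DEEP_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>([
  'multi-step-reasoning',
  'agent-orchestration',
]);

// Routes deep tasks to Pro and everything else to Flash.
function pickModel(task: TaskKind): string {
  return DEEP_TASKS.has(task) ? PRO_MODEL : FLASH_MODEL;
}

console.log(pickModel('intent-detection'));
console.log(pickModel('agent-orchestration'));
```

A front-line step such as intent detection would run on Flash, and the detected intent could then decide whether the follow-up call goes to Pro.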

Conclusion

DeepSeek V4 Flash is the efficiency tier of the V4 generation, suited to high-volume short-form tasks where cost and latency dominate. For deeper reasoning and agentic workflows within the same generation, step up to DeepSeek V4 Pro.

Frequently Asked Questions

What does the V4 hybrid attention architecture change for inference?
DeepSeek V4 Flash combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), and uses Manifold-Constrained Hyper-Connections (mHC) in place of standard residual connections. The combination targets efficient inference at long context, including the full 1.0M-token window.

When should I pick DeepSeek V4 Flash over DeepSeek V4 Pro?
Pick DeepSeek V4 Flash for instruction following, classification, and short-form question answering where latency and per-token cost matter most. Use DeepSeek V4 Pro for complex reasoning, multi-step problem solving, and agentic tasks.

What is the context window and max output for DeepSeek V4 Flash?
The context window is 1.0M tokens and the maximum output is 1.0M tokens.

What does implicit caching do for pricing?
Implicit caching detects repeated input prefixes (typically long system prompts) and charges a cached input rate of $0.0028 per million tokens instead of the standard $0.14 per million input tokens. Cached read rates vary by provider. No explicit cache-control header is required.

Does DeepSeek V4 Flash support tool calls?
Yes. DeepSeek V4 Flash is tagged for tool use and reasoning, so function calling works through the AI SDK as well as the Chat Completions, Responses, and Messages API formats.

Does DeepSeek V4 Flash support zero data retention?
Yes. Zero Data Retention is available for this model and is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.
