Gemini 3.1 Flash Lite
Gemini 3.1 Flash Lite is the GA release of the efficiency tier in the Gemini 3.1 generation, with improvements in reasoning, multimodal understanding, agentic tool use, and long-context performance over 2.5 Flash Lite, plus four configurable thinking levels and a context window of 1M tokens.
import { streamText } from 'ai'
const result = streamText({ model: 'google/gemini-3.1-flash-lite', prompt: 'Why is the sky blue?' })
Playground
Try out Gemini 3.1 Flash Lite by Google. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
About Gemini 3.1 Flash Lite
Gemini 3.1 Flash Lite is the general-availability version of the efficiency tier in the Gemini 3.1 generation, released May 7, 2026. It outperforms Gemini 2.5 Flash Lite on overall quality and lands close to 2.5 Flash performance across key capability areas, including reasoning, multimodal understanding, agentic tool use, and long-context performance.
The model is positioned for high-volume use cases where unit economics, not peak capability, set the constraint. Gemini 3.1 Flash Lite accepts text, images, audio, and documents as input within a 1M-token context window and produces text output. Implicit caching and web search are available as runtime options to control cost and ground responses in current information.
Four configurable thinking levels (minimal, low, medium, high) allow a single deployment to serve mixed workloads without routing across models. A bulk extraction job can run at minimal to minimize latency and tokens; an edge case in the same pipeline can step up to medium or high when more deliberation pays off. Thinking tokens contribute to output token counts, so the level becomes a direct lever on cost.
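One way to picture a single deployment serving mixed workloads is a small helper that picks a thinking level per request. The sketch below is illustrative only: the task labels and the task-to-level mapping are hypothetical, and the option shape follows the providerOptions.google.thinkingConfig path with a thinkingLevel field that this page names.

```typescript
// The four thinking levels listed on this page.
type ThinkingLevel = 'minimal' | 'low' | 'medium' | 'high';

// Hypothetical task labels, for illustration only; a real pipeline defines its own.
type TaskKind = 'bulk-extraction' | 'translation' | 'edge-case' | 'multi-step-reasoning';

// Assumed mapping from task kind to thinking level; tune it against your own benchmarks.
const LEVEL_BY_TASK: Record<TaskKind, ThinkingLevel> = {
  'bulk-extraction': 'minimal',
  translation: 'low',
  'edge-case': 'medium',
  'multi-step-reasoning': 'high',
};

// Builds the per-request provider options object for a given task kind.
function thinkingOptionsFor(task: TaskKind) {
  return {
    google: { thinkingConfig: { thinkingLevel: LEVEL_BY_TASK[task] } },
  };
}
```

The returned object would be passed as providerOptions next to model: 'google/gemini-3.1-flash-lite' in a streamText call, so the model identifier stays fixed while the level changes per request.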
Compared to the preceding preview release, Gemini 3.1 Flash Lite is the stable surface for production deployments. Teams running 2.5 Flash Lite at scale get the 3.1 generation quality gains on the workloads that drive the most tokens, including translation, data extraction, and code completion, without moving up to the standard Flash tier.
What To Consider When Choosing a Provider
- Configuration: Gemini 3.1 Flash Lite exposes four thinking levels through providerOptions.google.thinkingConfig: minimal, low, medium, and high. Lower levels favor latency and per-token cost; higher levels add reasoning compute that counts toward output tokens. Benchmark total cost under realistic thinking settings before committing to a deployment.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
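The Configuration point above recommends benchmarking total cost under realistic thinking settings. A rough sketch of that arithmetic follows, assuming thinking tokens are billed at the output rate as the text describes. The prices and token counts are placeholders chosen for easy arithmetic, not actual pricing for this model.

```typescript
// Per-million-token prices in USD. Placeholder values only, NOT real pricing.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Token counts for one request, with thinking tokens tracked separately.
interface Usage {
  inputTokens: number;
  answerTokens: number;
  thinkingTokens: number;
}

// Thinking tokens are added to the answer tokens before applying the output price.
function estimateCostUsd(usage: Usage, pricing: Pricing): number {
  const billedOutputTokens = usage.answerTokens + usage.thinkingTokens;
  const totalMicroUsd =
    usage.inputTokens * pricing.inputPerMTok +
    billedOutputTokens * pricing.outputPerMTok;
  return totalMicroUsd / 1_000_000;
}

// Hypothetical prices and token counts for comparing two thinking settings.
const placeholderPricing: Pricing = { inputPerMTok: 1, outputPerMTok: 4 };
const costAtMinimal = estimateCostUsd(
  { inputTokens: 2000, answerTokens: 300, thinkingTokens: 0 },
  placeholderPricing,
);
const costAtHigh = estimateCostUsd(
  { inputTokens: 2000, answerTokens: 300, thinkingTokens: 1700 },
  placeholderPricing,
);
```

With the same input and answer length, the only difference between the two estimates is the thinking-token count, which is why the level acts as a cost lever. Real token counts should come from measured usage on representative requests.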
When to Use Gemini 3.1 Flash Lite
Best For
- High-volume agentic pipelines: Aggregate token cost is a binding constraint and per-step reasoning depth can be tuned per request
- Data extraction at scale: Structured extraction over millions of documents, invoices, transcripts, or scraped HTML where throughput economics dominate
- Bulk translation workloads: Per-token cost determines whether the use case is viable, with thinking levels available for nuanced passages
- Code completion and review: Inline suggestions, lint-style review, and refactor proposals at IDE or CI/CD scale
- Multimodal ingestion at the lite tier: Vision and file inputs are required, but pro-tier latency and cost are not
Consider Alternatives When
- Maximum reasoning depth: Complex multi-step problems benefit from google/gemini-3.1-pro-preview or google/gemini-3-pro-preview
- Native image output: Gemini 3.1 Flash Lite returns text only; google/gemini-3.1-flash-image-preview or google/gemini-3-pro-image generate images
- Pro-grade quality at flash latency: google/gemini-3-flash sits between Flash Lite and Pro on capability and cost
- Pure embedding workloads: Semantic retrieval and clustering fit a dedicated embedding model like google/gemini-embedding-001 better
Conclusion
Gemini 3.1 Flash Lite is the GA destination for teams that ran the 3.1 Flash Lite preview or are migrating from 2.5 Flash Lite. It brings the 3.1 generation quality lift to the highest-volume, most cost-sensitive workloads, with four thinking levels giving you a single deployment that adapts to mixed task difficulty without changing models.
Frequently Asked Questions
How is Gemini 3.1 Flash Lite different from google/gemini-3.1-flash-lite-preview?
Gemini 3.1 Flash Lite is the general-availability release of the same efficiency tier in the Gemini 3.1 family. The preview entry remains in the catalog for teams pinned to the earlier identifier; the GA model is the recommended target for new production work.
How does Gemini 3.1 Flash Lite compare to Gemini 2.5 Flash Lite?
Gemini 3.1 Flash Lite outperforms 2.5 Flash Lite on overall quality and lands close to 2.5 Flash across reasoning, multimodal understanding, agentic tool use, and long-context performance. For teams already running 2.5 Flash Lite at scale, it's a quality upgrade within the same lite tier.
What thinking levels does Gemini 3.1 Flash Lite support and how do they affect cost?
Four levels: minimal, low, medium, and high. Higher levels add reasoning compute that contributes to output token counts, so the choice trades off latency and per-request cost against quality on harder inputs.
Can I mix thinking levels across requests in the same application?
Yes. Set thinkingLevel per request in providerOptions.google.thinkingConfig. Routine requests can run at minimal while flagged hard cases run at medium or high without any architectural changes.
Does Gemini 3.1 Flash Lite support multimodal inputs?
Yes. Gemini 3.1 Flash Lite accepts text, images, audio, and documents as input within the 1M-token context window and returns text output. Web search and implicit caching are available as runtime options.
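As an illustration of a mixed text-and-image input, the sketch below builds a single user message with a text part and an image part. The local types mirror the AI SDK's text and image content parts and are declared here so the snippet stands alone; the helper name, prompt wording, and example URL are hypothetical.

```typescript
// Local shapes mirroring the AI SDK's text and image content parts
// (declared here as an assumption so the sketch needs no package install).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string }; // an http(s) URL string in this sketch
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pairs an extraction instruction with an invoice image.
function invoiceExtractionMessage(imageUrl: string): UserMessage {
  return {
    role: 'user',
    content: [
      { type: 'text', text: 'Extract the invoice number and total as JSON.' },
      { type: 'image', image: imageUrl },
    ],
  };
}
```

A message built this way would go in the messages array of a streamText call with model: 'google/gemini-3.1-flash-lite', in place of the prompt field; the response is still text.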
How do I call Gemini 3.1 Flash Lite on AI Gateway?
Use the identifier google/gemini-3.1-flash-lite with the AI SDK, the OpenAI-compatible Chat Completions endpoint, the Responses API, or any other supported interface. AI Gateway handles provider routing, retries, and failover automatically.
How does Zero Data Retention work with Gemini 3.1 Flash Lite through AI Gateway?
Zero Data Retention is available for this model and is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.
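Returning to the question of how to call the model: for the OpenAI-compatible Chat Completions route, a request could be assembled roughly as below. This is a sketch under assumptions. The base URL is assumed rather than taken from this page and should be confirmed in the AI Gateway documentation, and the helper name is hypothetical. Only the model identifier comes from the text above.

```typescript
// Assumed base URL for the gateway's OpenAI-compatible API; confirm in the AI Gateway docs.
const GATEWAY_BASE_URL = 'https://ai-gateway.vercel.sh/v1';

// Hypothetical helper: builds the URL and fetch options without sending anything.
function buildChatCompletionRequest(apiKey: string, prompt: string) {
  return {
    url: `${GATEWAY_BASE_URL}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      // Standard Chat Completions body: a model identifier plus a message list.
      body: JSON.stringify({
        model: 'google/gemini-3.1-flash-lite',
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}
```

The result can be handed to fetch(url, init) with an AI Gateway API key. Keeping request construction separate from the network call makes it easy to inspect or log the payload before sending.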