How is Gemini 3.1 Flash Lite different from `google/gemini-3.1-flash-lite-preview`?

Gemini 3.1 Flash Lite is the general-availability release of the same efficiency tier in the Gemini 3.1 family. The preview entry remains in the catalog for teams pinned to the earlier identifier; the GA model is the recommended target for new production work.

What thinking levels does Gemini 3.1 Flash Lite support and how do they affect cost?

Four levels: `minimal`, `low`, `medium`, and `high`. Higher levels add reasoning compute that contributes to output token counts, so the choice trades off latency and per-request cost against quality on harder inputs.

Can I mix thinking levels across requests in the same application?

Yes. Set `thinkingLevel` per request in `providerOptions.google.thinkingConfig`. Routine requests can run at `minimal` while flagged hard cases run at `medium` or `high` without any architectural changes.

Does Gemini 3.1 Flash Lite support multimodal inputs?

Yes. Gemini 3.1 Flash Lite accepts text, images, audio, and documents as input within the 1M tokens context window and returns text output. Web search and implicit caching are available as runtime options.

How do I call Gemini 3.1 Flash Lite on AI Gateway?

Use the identifier `google/gemini-3.1-flash-lite` with the AI SDK, the OpenAI-compatible Chat Completions endpoint, the Responses API, or any other supported interface. AI Gateway handles provider routing, retries, and failover automatically.

How does Zero Data Retention work with Gemini 3.1 Flash Lite through AI Gateway?

Yes, Zero Data Retention is available for this model. Zero Data Retention is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.

Gemini 3.1 Flash Lite

Gemini 3.1 Flash Lite is the GA release of the efficiency tier in the Gemini 3.1 generation, with improvements in reasoning, multimodal understanding, agentic tool use, and long-context performance over 2.5 Flash Lite, plus four configurable thinking levels and a context window of 1M tokens.

ReasoningTool UseImplicit CachingFile InputVision (Image)Web Search

index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'google/gemini-3.1-flash-lite',
  prompt: 'Why is the sky blue?'
})

Overview Playground About Providers Throughput Latency Uptime Status Similar FAQ

Playground

Try out Gemini 3.1 Flash Lite by Google. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	ZDR	No Training	Release Date

Legal:Terms

•

Privacy

1.3s

309tps

$0.25/M

$1.50/M

Read:$0.03/M

Write:—

$14.00/K

+ input costs

—

05/07/2026

Legal:Terms

•

Privacy

0.8s

206tps

$0.25/M

$1.50/M

Read:$0.03/M

Write:—

$14.00/K

+ input costs

—

05/07/2026

More models by Google

Model

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	Providers	ZDR	No Training	Release Date

2.4s

301tps

$1.50/M

$9.00/M

Read:$0.15/M

Write:—

$14.00/K

+ input costs

—

05/19/2026

0.8s

402tps

$0.25/M

$1.50/M

Read:$0.03/M

Write:—

$14.00/K

+ input costs

—

03/03/2026

3.7s

107tps

$2.00/M

$12.00/M

Read:

$0.2/M

Write:

—

$14.00/K

+ input costs

—

02/19/2026

0.7s

184tps

$0.50/M

$3.00/M

Read:

$0.05/M

Write:

—

$14.00/K

+ input costs

—

12/17/2025

0.3s

223tps

$0.10/M

$0.40/M

Read:$0.01/M

Write:—

$35.00/K

+ input costs

—

06/17/2025

0.4s

187tps

$0.30/M

$2.50/M

Read:$0.03/M

Write:—

$35.00/K

+ input costs

—

03/20/2025

About Gemini 3.1 Flash Lite

Gemini 3.1 Flash Lite is the general-availability version of the efficiency tier in the Gemini 3.1 generation, released May 7, 2026. It outperforms Gemini 2.5 Flash Lite on overall quality and lands close to 2.5 Flash performance across key capability areas, including reasoning, multimodal understanding, agentic tool use, and long-context performance.

The model is positioned for high-volume use cases where unit economics, not peak capability, set the constraint. Gemini 3.1 Flash Lite accepts text, images, audio, and documents as input within 1M tokens and produces text output, with implicit caching and web search available as runtime options to control cost and ground responses in current information.

Four configurable thinking levels (minimal, low, medium, high) allow a single deployment to serve mixed workloads without routing across models. A bulk extraction job can run at minimal to minimize latency and tokens; an edge case in the same pipeline can step up to medium or high when more deliberation pays off. Thinking tokens contribute to output token counts, so the level becomes a direct lever on cost.

Compared to the preceding preview release, Gemini 3.1 Flash Lite is the stable surface for production deployments. Teams running 2.5 Flash Lite at scale get the 3.1 generation quality gains on the workloads that drive the most tokens, including translation, data extraction, and code completion, without moving up to the standard Flash tier.

What To Consider When Choosing a Provider

Configuration: Gemini 3.1 Flash Lite exposes four thinking levels through providerOptions.google.thinkingConfig: minimal, low, medium, and high. Lower levels favor latency and per-token cost; higher levels add reasoning compute that counts toward output tokens. Benchmark total cost under realistic thinking settings before committing to a deployment.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Gemini 3.1 Flash Lite

Best For

High-volume agentic pipelines: Aggregate token cost is a binding constraint and per-step reasoning depth can be tuned per request
Data extraction at scale: Structured extraction over millions of documents, invoices, transcripts, or scraped HTML where throughput economics dominate
Bulk translation workloads: Per-token cost determines whether the use case is viable, with thinking levels available for nuanced passages
Code completion and review: Inline suggestions, lint-style review, and refactor proposals at IDE or CI/CD scale
Multimodal ingestion at the lite tier: Vision and file inputs are required, but pro-tier latency and cost are not

Consider Alternatives When

Maximum reasoning depth: Complex multi-step problems benefit from google/gemini-3.1-pro-preview or google/gemini-3-pro-preview
Native image output: Gemini 3.1 Flash Lite returns text only; google/gemini-3.1-flash-image-preview or google/gemini-3-pro-image generate images
Pro-grade quality at flash latency: google/gemini-3-flash sits between Flash Lite and Pro on capability and cost
Pure embedding workloads: Semantic retrieval and clustering fit a dedicated embedding model like google/gemini-embedding-001 better

Conclusion

Gemini 3.1 Flash Lite is the GA destination for teams that ran the 3.1 Flash Lite preview or are migrating from 2.5 Flash Lite. It brings the 3.1 generation quality lift to the highest-volume, most cost-sensitive workloads, with four thinking levels giving you a single deployment that adapts to mixed task difficulty without changing models.

Frequently Asked Questions

How is Gemini 3.1 Flash Lite different from google/gemini-3.1-flash-lite-preview?
Gemini 3.1 Flash Lite is the general-availability release of the same efficiency tier in the Gemini 3.1 family. The preview entry remains in the catalog for teams pinned to the earlier identifier; the GA model is the recommended target for new production work.
How does Gemini 3.1 Flash Lite compare to Gemini 2.5 Flash Lite?
Gemini 3.1 Flash Lite outperforms 2.5 Flash Lite on overall quality and lands close to 2.5 Flash across reasoning, multimodal understanding, agentic tool use, and long-context performance. For teams already running 2.5 Flash Lite at scale, it's a quality upgrade within the same lite tier.
What thinking levels does Gemini 3.1 Flash Lite support and how do they affect cost?
Four levels: minimal, low, medium, and high. Higher levels add reasoning compute that contributes to output token counts, so the choice trades off latency and per-request cost against quality on harder inputs.
Can I mix thinking levels across requests in the same application?
Yes. Set thinkingLevel per request in providerOptions.google.thinkingConfig. Routine requests can run at minimal while flagged hard cases run at medium or high without any architectural changes.
Does Gemini 3.1 Flash Lite support multimodal inputs?
Yes. Gemini 3.1 Flash Lite accepts text, images, audio, and documents as input within the 1M tokens context window and returns text output. Web search and implicit caching are available as runtime options.
How do I call Gemini 3.1 Flash Lite on AI Gateway?
Use the identifier google/gemini-3.1-flash-lite with the AI SDK, the OpenAI-compatible Chat Completions endpoint, the Responses API, or any other supported interface. AI Gateway handles provider routing, retries, and failover automatically.
How does Zero Data Retention work with Gemini 3.1 Flash Lite through AI Gateway?
Yes, Zero Data Retention is available for this model. Zero Data Retention is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.

AI Cloud

Core Platform

Security

Company

Learn

Open Source

Use Cases

Tools

Users

Gemini 3.1 Flash Lite

Playground

Providers

More models by Google

About Gemini 3.1 Flash Lite

What To Consider When Choosing a Provider

When to Use Gemini 3.1 Flash Lite

Best For

Consider Alternatives When

Conclusion

Frequently Asked Questions