What are the four thinking levels and what do they mean for cost and quality?

The four levels are `minimal`, `low`, `medium`, and `high`. Lower levels reduce the reasoning compute applied before the model generates a response, which lowers latency and token consumption but may reduce quality on complex tasks. `high` applies the most reasoning, at the highest latency and token cost of the four levels.

Which task categories saw the most improvement over Gemini 2.5 Flash Lite?

Translation, data extraction, and code completion saw the largest improvements over Gemini 2.5 Flash Lite. These are the high-volume task categories where the efficiency gains of the 3.1 generation have the most practical impact.

Is this model suitable for multi-agent architectures?

Yes. High-volume agentic tasks are a primary target. The model's low cost and configurable thinking levels make it appropriate for sub-agents in hierarchical agent systems.

Can I mix thinking levels in the same application?

Yes. You set `thinkingLevel` per request in `providerOptions.google.thinkingConfig`, so different request types within the same application can use different levels without any architectural changes.
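
As a minimal sketch of per-request levels, the helper below maps a task type to a `thinkingLevel` and builds the options object you would pass to `streamText` or `generateText`. The task names, the `buildRequest` helper, and the level chosen for each task are illustrative assumptions, not recommendations from Google; only the `providerOptions.google.thinkingConfig` path and the model ID come from this page.

```typescript
// Thinking levels named on this page.
type ThinkingLevel = 'minimal' | 'low' | 'medium' | 'high'

// Hypothetical task categories for illustration; adapt to your application.
type TaskType = 'extraction' | 'translation' | 'nuancedTranslation' | 'analysis'

// Assumed mapping from task type to level; tune it against your own quality checks.
const LEVEL_BY_TASK: Record<TaskType, ThinkingLevel> = {
  extraction: 'minimal',
  translation: 'low',
  nuancedTranslation: 'medium',
  analysis: 'high',
}

// Builds the call options for one request (hypothetical helper).
function buildRequest(task: TaskType, prompt: string) {
  return {
    model: 'google/gemini-3.1-flash-lite-preview',
    prompt,
    providerOptions: {
      google: { thinkingConfig: { thinkingLevel: LEVEL_BY_TASK[task] } },
    },
  }
}

// Usage (requires the 'ai' package):
//   streamText(buildRequest('extraction', 'Extract the invoice total: ...'))
```

Keeping the mapping in one table means a level can be changed for a whole request type without touching call sites.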

Does Gemini 3.1 Flash Lite Preview support streaming?

Yes. Use `streamText` from the AI SDK with `model: 'google/gemini-3.1-flash-lite-preview'` to stream responses.

How does Gemini 3.1 Flash Lite Preview differ from Gemini 3 Flash?

Gemini 3 Flash prioritizes pro-grade reasoning at flash speed and is positioned as the standard speed/quality balance point. Flash Lite is optimized for maximum cost efficiency and high-volume throughput, trading some capability headroom for a lower price point.

Is `includeThoughts` supported on this model?

Yes. Set `includeThoughts: true` in `providerOptions.google.thinkingConfig` to stream the model's reasoning tokens alongside the generated response.

What's the recommended thinking level for bulk translation tasks?

Start with `low` or `minimal` for straightforward translation tasks where throughput is the primary concern. Increase to `medium` for content requiring cultural nuance or domain-specific accuracy.

Gemini 3.1 Flash Lite Preview

Gemini 3.1 Flash Lite Preview is the efficiency-focused model in the Gemini 3.1 generation, built for budget-constrained, high-volume workloads. It offers notable gains over Gemini 2.5 Flash Lite in translation, data extraction, and code completion, plus four configurable thinking levels.

Capabilities: Reasoning, Tool Use, Implicit Caching, File Input, Vision (Image), Web Search

index.ts

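import { streamText } from 'ai'

// Start a streaming request using the model's gateway ID.
const result = streamText({
  model: 'google/gemini-3.1-flash-lite-preview',
  prompt: 'Why is the sky blue?'
})

// Print each text chunk as it arrives.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}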


Playground

Try out Gemini 3.1 Flash Lite Preview by Google. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider	Latency	Throughput	Input	Output	Cache Read	Cache Write	Web Search	Release Date
Legal: Terms • Privacy	0.8s	251tps	$0.25/M	$1.50/M	$0.03/M	—	$14.00/K + input costs	03/03/2026
Legal: Terms • Privacy	0.8s	234tps	$0.25/M	$1.50/M	$0.03/M	—	$14.00/K + input costs	03/03/2026
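
To make the listed rates concrete, here is a rough cost and time estimate. It assumes the $0.25 per million input tokens and $1.50 per million output tokens shown above, and it treats the 0.8s latency as time to first token and 251tps as a steady output rate, which are assumptions about how those metrics are measured. The function names and the workload size are hypothetical, and cache-read and web-search charges are ignored.

```typescript
// Rates copied from the provider listing above (USD per million tokens).
const INPUT_USD_PER_M = 0.25
const OUTPUT_USD_PER_M = 1.5

// Estimated charge for a batch, ignoring cache reads and web search fees.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_USD_PER_M + (outputTokens / 1e6) * OUTPUT_USD_PER_M
}

// Rough wall-clock time for one response: latency plus generation time.
function estimateSeconds(outputTokens: number, latencySec = 0.8, tokensPerSec = 251): number {
  return latencySec + outputTokens / tokensPerSec
}

// Hypothetical workload: 1M documents, 800 input and 150 output tokens each.
const docs = 1e6
const batchCost = estimateCostUsd(docs * 800, docs * 150)
console.log(batchCost.toFixed(2)) // prints 425.00
```

In this made-up workload the output tokens cost slightly more than the input tokens ($225 versus $200) despite being far fewer, so trimming response length matters as much as trimming prompts.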

More models by Google

Latency	Throughput	Input	Output	Cache Read	Cache Write	Web Search	Release Date
2.9s	323tps	$1.50/M	$9.00/M	$0.15/M	—	$14.00/K + input costs	05/19/2026
3.5s	131tps	$2.00/M	$12.00/M	$0.2/M	—	$14.00/K + input costs	02/19/2026
0.7s	179tps	$0.50/M	$3.00/M	$0.05/M	—	$14.00/K + input costs	12/17/2025
0.5s	248tps	$0.10/M	$0.40/M	$0.01/M	—	$35.00/K + input costs	06/17/2025
0.3s	192tps	$0.30/M	$2.50/M	$0.03/M	—	$35.00/K + input costs	03/20/2025
2.0s	123tps	$1.25/M	$10.00/M	$0.13/M	—	$35.00/K + input costs	03/20/2025

About Gemini 3.1 Flash Lite Preview

Gemini 3.1 Flash Lite Preview is Google's most cost-efficient model in the 3.1 generation, designed explicitly for high-volume agentic tasks, data extraction pipelines, and latency-sensitive applications where budget is the primary constraint. This model outperforms Gemini 2.5 Flash Lite on overall quality, with the most pronounced improvements in translation, data extraction, and code completion, three task categories that commonly drive the highest request volumes in production.

The four-level thinking configuration (minimal, low, medium, high) lets a single model deployment serve heterogeneous workloads without switching models: a bulk extraction job might run at minimal thinking to minimize latency and cost, while an edge-case translation that requires cultural nuance detection runs at medium. For teams running large-scale pipelines (content localization, automated data cleaning, code completion at IDE scale, or classification across millions of documents), Gemini 3.1 Flash Lite Preview provides the quality improvements of the 3.1 generation without the cost profile of the Pro or standard Flash tiers. Its position in the lineup is defined by throughput economics rather than maximum capability.

What To Consider When Choosing a Provider

Configuration: Gemini 3.1 Flash Lite Preview supports four thinking levels (minimal, low, medium, and high), giving you fine-grained control over the reasoning-to-cost tradeoff across different task types within the same deployment.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Gemini 3.1 Flash Lite Preview

Best For

High-volume data extraction: Pipelines processing millions of structured or semi-structured inputs
Bulk translation workloads: Per-token cost directly determines operational feasibility
Code completion and review: Inline suggestions and automated review at IDE or CI/CD pipeline scale
Parallel sub-agent systems: Agentic architectures where aggregate token cost is a binding constraint
Latency-first applications: Task complexity is moderate to low and response speed is the primary criterion
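
For the parallel sub-agent case, a fan-out with a concurrency cap can be sketched as follows. The `callModel` parameter is a hypothetical stand-in for whatever function issues one request (for example, a wrapper around `generateText`), injected so the sketch runs without network access. The `fanOut` name and the default limit of 4 are arbitrary assumptions.

```typescript
// One sub-agent call: takes a prompt, resolves to the model's text.
type CallModel = (prompt: string) => Promise<string>

// Runs all prompts through callModel with at most `limit` requests in flight,
// returning results in the same order as the input prompts.
async function fanOut(prompts: string[], callModel: CallModel, limit = 4): Promise<string[]> {
  const results: string[] = new Array(prompts.length)
  let next = 0

  // Each worker claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < prompts.length) {
      const i = next++
      results[i] = await callModel(prompts[i])
    }
  }

  const workers = Array.from({ length: Math.min(limit, prompts.length) }, () => worker())
  await Promise.all(workers)
  return results
}

// Stub sub-agent for a dry run; replace with a real model call.
const echoAgent: CallModel = async (prompt) => `done: ${prompt}`

fanOut(['classify A', 'classify B', 'classify C'], echoAgent, 2).then((out) => console.log(out))
```

Capping concurrency keeps aggregate token spend and rate-limit pressure predictable when many sub-agents share one budget.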

Consider Alternatives When

Lite-tier quality insufficient: Your task quality cannot tolerate the tradeoffs (consider google/gemini-3.1-flash-image-preview or google/gemini-3-flash for higher quality)
Multi-step reasoning: Your workflow requires multi-step reasoning over complex documents or images (consider google/gemini-3.1-pro-preview or google/gemini-3-pro-preview)
Native image generation: You need image output alongside text (consider google/gemini-3.1-flash-image-preview)
Maximum reasoning depth: The task requires deepest reasoning without cost constraints (consider google/gemini-3.1-pro-preview)

Conclusion

Gemini 3.1 Flash Lite Preview brings the quality advances of the 3.1 generation to the highest-volume, most cost-sensitive segment of the model market. With four thinking levels and documented improvements in the task categories that drive the most tokens in production, it's the natural choice for teams optimizing for scale economics rather than peak capability.

