Gemini 2.0 Flash Is Here: Real-Time Streaming, Multimodal Creation, and More
Gemini 2.0 Flash Brings Real-Time AI Tools to Google AI Studio—Available Now
Good morning!
This is just the first email of today’s news recap from us. You’ll hear from us again in a few hours!
This morning, Google DeepMind released Project Mariner, a browser task automation prototype powered by Gemini 2.0. It is basically a direct response to OpenAI’s real-time API and Anthropic’s Computer Use. It simulates human actions like clicking and typing to interact with desktop apps and websites. The difference? Google owns the Chrome web browser, which might make this a much deeper (and more efficient) integration.
But this news didn’t come alone. We are even more excited about Gemini 2.0 and its first model, Gemini 2.0 Flash, along with the new multimodal real-time live API in Google AI Studio — all accessible today. Google emphasizes that these models were built for “agentic workflows.”
Key points:
Gemini 2.0 Flash generates output twice as fast as Gemini 1.5 Pro.
Native multimodal outputs, including text, audio, and image generation.
A Multimodal Live API for real-time streaming of audio, video, and text, with integrated tool use such as Search and code execution.
Steerable text-to-speech with multiple voices, languages, and accents.
You can even stream your screen or camera feed, talk with the model in real time, and watch its dynamic tool calls happen in the background.
This is a big deal for applications that need real-time processing, such as apps that answer questions about what your camera sees or perform live translation.
Plus, Mariner (the web browser agent) is one of the early applications of Gemini 2.0, leveraging its multimodal understanding to automate browser workflows. It can:
Navigate and interact with websites autonomously.
Automate repetitive tasks while showing its reasoning and plans.
Respond to voice commands and keep users informed with visual feedback.
Performance benchmarks for Gemini 2.0 look promising:
ScreenSpot: 84% on GUI understanding.
WebVoyager: 90.5% for autonomous web interactions.
Compared to Anthropic’s Computer Use, which takes a broader, more versatile approach by simulating human actions across arbitrary apps, Mariner focuses entirely on Chrome. While this narrower focus may seem limiting, it could be an advantage: agent-based systems are still far from mature and reliable, and a constrained environment is easier to get right.
Gemini 2.0 Flash is available now through the Gemini API in Google AI Studio, with access to multimodal outputs and real-time streaming. Project Mariner, however, is still in its experimental phase and available only to trusted testers via a waitlist (we are waiting to get access and will report back!).
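If you want to try it right away, here is a minimal sketch of calling Gemini 2.0 Flash through the Gemini API’s REST endpoint. Treat the model identifier (`gemini-2.0-flash-exp`, the experimental name at launch) and the endpoint path as assumptions that may change; you’ll need your own API key from Google AI Studio.

```python
# Minimal sketch: text generation with Gemini 2.0 Flash via the Gemini REST API.
# Assumes a GEMINI_API_KEY environment variable and the launch-time model name
# "gemini-2.0-flash-exp" -- both may differ for your account.
import json
import os
import urllib.request

API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.0-flash-exp:generateContent")


def build_request(prompt: str) -> dict:
    """Build the JSON body expected by the generateContent endpoint."""
    return {"contents": [{"parts": [{"text": prompt}]}]}


def generate(prompt: str) -> str:
    """Send a prompt and return the first candidate's text reply."""
    key = os.environ["GEMINI_API_KEY"]
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_URL}?key={key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # The first candidate's first text part holds the model's reply.
    return data["candidates"][0]["content"]["parts"][0]["text"]


# Usage (requires a valid key and network access):
#   reply = generate("Summarize Gemini 2.0 Flash in one sentence.")
```

The same key and model name also work through the official `google-genai` SDK if you prefer a client library over raw HTTP.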
Google is making a big leap today, stealing the spotlight from OpenAI on this one! Still, we’ll return later today with our thoughts on today’s OpenAI news. See you soon!