Good morning!
Today, we got video live streaming, just as we expected on our bingo sheet!
The video live streaming feature, first demoed at OpenAI's Spring Update event in May 2024, will roll out to ChatGPT's Advanced Voice Mode over the next few weeks. It lets you live stream your camera or screen and ask ChatGPT questions about what it sees. It is very responsive, streaming visual and audio tokens in real time.
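For intuition, here is a minimal sketch of what the client side of such a pipeline might look like. To be clear, this is entirely speculative: OpenAI has not published its implementation, and the frame rate, resolution, and the `send_frame` callback below are hypothetical placeholders.

```python
import base64
import time

import cv2  # pip install opencv-python

TARGET_FPS = 2           # assumed sampling rate; the real rate is unknown
FRAME_SIZE = (512, 512)  # assumed input resolution; the real one is unknown


def stream_camera(send_frame):
    """Capture webcam frames at a throttled rate and hand each one,
    base64-encoded as JPEG, to `send_frame` (a hypothetical callback
    that would forward it to a multimodal model)."""
    cap = cv2.VideoCapture(0)  # default webcam
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.resize(frame, FRAME_SIZE)
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                send_frame(base64.b64encode(jpeg.tobytes()).decode("ascii"))
            time.sleep(1.0 / TARGET_FPS)  # throttle to the target FPS
    finally:
        cap.release()
```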
Unfortunately, since we have no details on the implementation, there is very little to say here. We do not know how many frames per second are sampled and sent to the multimodal model, nor at what resolution or aspect ratio the screenshots are sent. One thing is certain: it can see and remember a long sequence of images, though we don't yet know the full extent of its video (and audio) context and memory.
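On the memory question, one plausible (but, again, entirely speculative) design is a rolling window over recent frames, so the model always sees a bounded slice of the video. A toy sketch, assuming a fixed frame budget:

```python
from collections import deque

MAX_FRAMES = 60  # hypothetical context budget, e.g. 30 s of video at 2 FPS


class FrameMemory:
    """Keep only the most recent frames; older ones fall out of context."""

    def __init__(self, max_frames: int = MAX_FRAMES):
        self.frames = deque(maxlen=max_frames)

    def add(self, encoded_frame: str) -> None:
        self.frames.append(encoded_frame)  # evicts the oldest frame when full

    def context(self) -> list[str]:
        """Frames to attach to the next model request, oldest first."""
        return list(self.frames)
```

A real system would almost certainly be cleverer than this (e.g., compressing or summarizing older frames rather than dropping them), but a bounded window like this is the simplest way to reconcile "remembers a long sequence of images" with a finite context.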
In the demo, they showcased its capabilities by asking ChatGPT to teach them how to prepare coffee with a V60. ChatGPT instructed them to pour the water in circles, but the person on camera was pouring straight down and asked if they were doing it correctly. ChatGPT answered positively, even though that was not what it had instructed.
We suspect this was a failed part of the demo, where the goal was to show the AI identifying the mistake and instructing the person to pour in circles instead. But the AI didn't spot it, and the OpenAI team simply continued as if nothing had happened. In any case, the point is that ChatGPT missed it, which shows how AI still doesn't understand our world (especially its physics), even when we tell it and show it how things work. It needs more than words and visuals; it needs to experiment, much as babies and kids do: touching, feeling, trying, and breaking stuff!
Still, we want to finish on a positive note: adding real-time visual capability is an incredible step forward and a complex problem to solve. It is promising for many, many applications, including smart glasses that promise live interactions, Q&As, translations, street directions, and more!
P.S. OpenAI's 12 days of releases are turning into 12 days of free publicity from us for upcoming (not-yet-available) features.