Good morning, everyone!
Today, we're exploring OpenAI's newest o1 model series, which (re)introduces a new paradigm in AI: test-time computation or, in human terms, “reasoning.”
Let’s first address the title: we do not know. Some of the authors think it could be with extreme scaling; some don’t. One thing is certain: we are not at AGI yet, and we are nowhere near it.
Let's examine what makes o1 (a.k.a. the Strawberry project) unique and how it changes the game for complex tasks.
o1, o1-mini, o1-preview? What are they? Why 3 of them?
o1-mini is the small, efficient version. It was specifically trained on technical content like math and other academic topics, and it thus beats the other two variants on some benchmarks.
o1-preview is an early checkpoint (taken during training) of the larger, improved o1 model that will come out later this year.
This is why o1-mini looks particularly interesting in the plot below, which shows results on the AIME math benchmark.
What's New About o1?
o1 works through a multi-step reasoning process before generating a response, unlike traditional LLMs, which commit to a single reasoning trace as they generate their answer.
The thinking process is comparable to asking GPT-4o to “think step by step.” Adding this short line to your prompts is called “chain-of-thought” (CoT) prompting. We know that CoT prompting enables more complex problem-solving.
The difference from GPT-4o + CoT prompting is that o1 uses large-scale reinforcement learning to refine its thinking process, allowing it to adapt its reasoning steps to any task. It was specifically trained to improve its “reasoning” (the steps in a chain of thought). This behavior is built into o1, so you no longer need to prompt the model to do CoT. As Jason Wei (researcher at OpenAI) posted, “Don’t do CoT purely via prompting; train models to do better CoT using RL.”
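To make the contrast concrete, here is a minimal sketch (not official guidance) comparing explicit CoT prompting with simply asking o1 directly. It assumes the standard OpenAI Python SDK (openai >= 1.0) with an API key in the environment; the model names and the example question are purely illustrative.

```python
# Minimal sketch: explicit CoT prompting with GPT-4o versus simply asking o1,
# whose reasoning happens internally.
# Assumes the standard OpenAI Python SDK (openai >= 1.0) and OPENAI_API_KEY set;
# model names and the question are illustrative.
from openai import OpenAI

client = OpenAI()

question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

# Classic model: ask for the reasoning explicitly (chain-of-thought prompting).
cot_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{question}\nLet's think step by step."}],
)

# o1: just ask the question; the chain of thought is built in.
o1_response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)

print(cot_response.choices[0].message.content)
print(o1_response.choices[0].message.content)
```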
What is reinforcement learning?
Reinforcement learning is an advanced AI technique in which a system learns to make better decisions through trial and error, receiving feedback from its actions—much like how a business adjusts its strategies based on market responses to maximize performance and achieve goals.
This reinforcement learning approach enables the model to identify and rectify errors in its chain of thought, resulting in more resilient "reasoning." During training, each step of the thinking process is examined and improved, rather than only the final answer once it is fully generated, as in standard LLM training.
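For intuition, here is a minimal, purely illustrative reinforcement-learning loop on a toy two-armed bandit problem. It has nothing to do with o1's actual training setup, but it shows the trial-and-error-with-feedback principle described above.

```python
# Illustrative reinforcement learning: an agent learns which of two actions
# yields a higher reward purely through trial and error.
# Toy example for intuition only; this is not how o1 is trained.
import random

true_reward_prob = {"action_A": 0.3, "action_B": 0.7}  # hidden from the agent
value_estimates = {"action_A": 0.0, "action_B": 0.0}
counts = {"action_A": 0, "action_B": 0}
epsilon = 0.1  # exploration rate

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-looking action.
    if random.random() < epsilon:
        action = random.choice(list(value_estimates))
    else:
        action = max(value_estimates, key=value_estimates.get)

    # Environment feedback: reward 1 with the action's (hidden) probability, else 0.
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0

    # Incremental average update of the action's estimated value.
    counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)  # the agent ends up preferring action_B
```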
Test-time compute refers to performing additional computation while using a model. Instead of training the model to provide the best answer immediately, we design it to generate more output during usage, allowing it more time and computational resources to arrive at a solution.
Perhaps the most interesting aspect of o1 is its "test-time compute." This concept involves two scaling laws:
Train-time compute scaling: Performance improves with more reinforcement learning during training.
Test-time compute scaling: Performance improves when the model spends more time thinking (searching) during deployment.
o1 performance smoothly improves with both train-time and test-time compute
This shift towards two curves working in tandem, rather than just one, could be what finally overcomes the diminishing returns in language model capabilities.
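OpenAI has not disclosed how o1 spends its extra inference compute. As a purely illustrative stand-in, the sketch below shows one generic way to trade test-time compute for accuracy: sample several candidate answers and take a majority vote (often called self-consistency). The generate_answer function is a hypothetical noisy model, not an o1 call.

```python
# Illustrative test-time compute scaling: sample several candidate answers and
# take a majority vote (self-consistency). generate_answer() is a hypothetical
# stand-in for a model call; this is not o1's published inference procedure.
from collections import Counter
import random

def generate_answer(question: str) -> str:
    # Hypothetical noisy model: right 60% of the time on this toy question.
    return "42" if random.random() < 0.6 else str(random.randint(0, 100))

def answer_with_test_time_compute(question: str, n_samples: int) -> str:
    # More samples = more test-time compute = a more reliable majority vote.
    candidates = [generate_answer(question) for _ in range(n_samples)]
    return Counter(candidates).most_common(1)[0][0]

question = "What is 6 times 7?"
print(answer_with_test_time_compute(question, n_samples=1))   # cheap, noisy
print(answer_with_test_time_compute(question, n_samples=50))  # more compute, almost always "42"
```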
Thanks to this reasoning and working harder during “test time,” o1 has shown remarkable results on tasks requiring deep reasoning:
It solved 74% of problems on the American Invitational Mathematics Examination (AIME), compared to GPT-4o's 12%.
It achieved an Elo rating of 1807 in competitive programming contests, performing better than 93% of human competitors.
OpenAI - Learning to Reason with LLMs
While o1 represents a significant advancement, it also raises new questions, limitations, and concerns:
Computational Demands & Adaptation: Since o1's release, users have experienced slower responses due to increased inference time. While this may result in more thoughtful answers, it raises concerns about efficiency. o1's ability to adapt its reasoning in real time is valuable for complex problems but leads to delays, even for simple questions. Here, OpenAI (and other providers aiming to optimize test-time compute) may use smaller models for token prediction in the thinking process, along with other ways to parallelize the work.
Generalization and Adaptation: o1's ability to adapt its reasoning process in real-time could lead to more flexible and robust AI systems. This could be particularly valuable when the model encounters novel or complex problems.
Thanks to continuously falling token costs and faster inference times, it is clear that we're entering a new era of AI capabilities. The ability to reason, adapt, and self-correct in real time could lead to more powerful and reliable AI systems across various applications. Spending resources at inference time AND at training time is a promising avenue, allowing companies to optimize on both fronts.
Now that we’ve discussed the model and paradigm shift we are witnessing, let’s conclude with some practical advice we discovered from prompting with o1:
Break complex requests into numerous well-defined small tasks.
Start with an easy task as a warm-up.
Let the model successfully solve the first task, debugging if necessary.
Repeat the process for subsequent tasks.
Remember, o1's strength lies in its ability to solve complex problems by repeatedly applying its reasoning capabilities, and it is more likely to recover from errors.
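Here is a minimal sketch of that workflow, assuming the standard OpenAI Python SDK; the model name and the subtasks are illustrative placeholders for your own decomposition.

```python
# Minimal sketch of the iterative workflow above: well-defined subtasks solved
# one at a time, each result carried into the next prompt.
# Assumes the standard OpenAI Python SDK; model name and subtasks are illustrative.
from openai import OpenAI

client = OpenAI()

subtasks = [
    "Task 1 (warm-up): List the columns needed to compute monthly revenue from an orders table.",
    "Task 2: Write a SQL query that computes monthly revenue using those columns.",
    "Task 3: Explain how to sanity-check the query on a small sample of rows.",
]

context = ""
for task in subtasks:
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": f"{context}\n\n{task}".strip()}],
    )
    answer = response.choices[0].message.content
    context += f"\n\n{task}\n{answer}"  # feed each solved task into the next one
    print(answer)
```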
What are your thoughts on o1 and its implications for the future of AI? We'd love to hear your insights!