Good morning, everyone!
In AI, the general trend has been to make large language models (LLMs) bigger and bigger. The assumption is that a larger model generalizes better and, as a side effect, becomes more factually accurate. But there’s a catch: LLMs aren't primarily designed to store factual information. Their real strength lies in generating coherent, structured text based on patterns learned during training, and increasing their size just to reduce factual mistakes is not the best way forward.
The Problem with Bigger Models for Factual Accuracy
People often criticize LLMs for making factual mistakes (hallucinations), claiming they are "useless" because of this. But factual errors aren't really the LLM’s fault. An LLM's purpose is text generation, and its knowledge is limited to the data it was trained on, so it has no way to update its facts in real time. Despite this, companies push for larger and larger models, hoping that the sheer number of training examples will reduce hallucinations.
💡 Yes, a bigger model will reduce the amount of hallucination. But is it the most efficient way of dealing with hallucinations?
One major drawback of these massive models is their inability to stay current. Since their knowledge is mostly frozen at the time of training, they quickly become outdated, especially when dealing with fast-changing domains like current events, medicine, or technology. Updating these models isn’t as simple as clicking “refresh”—it’s a costly and time-consuming process that requires retraining the entire model. As a result, the bigger the model, the more expensive it is to keep current, and the longer it keeps serving outdated or inaccurate information.
The approach of growing models to reduce hallucinations also presents a challenge: much of the model’s computational capacity is being used inefficiently. Rather than baking vast numbers of examples directly into the model’s weights, we should shift the focus toward enhancing the model's ability to recognize patterns and reason effectively. A smaller model, though potentially less factual in terms of memorization, can leverage better reasoning capabilities at a lower cost, making it significantly more efficient than a larger model that relies heavily on brute-force knowledge retention.
But a problem remains… How can we “ensure” factual accuracy with smaller models?
Solution? RAG!
With Retrieval-Augmented Generation (RAG), we can shift away from the need for massive models. RAG combines the generative abilities of potentially smaller LLMs with the power to retrieve accurate, up-to-date information from external sources. This approach removes the burden of factual accuracy from the LLM and allows it to focus on what it does best: language generation.
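To make this concrete, here is a minimal sketch of the RAG flow in Python. The tiny corpus, the naive keyword retriever, and the `call_llm` placeholder are all illustrative assumptions rather than any specific library’s API: the point is simply that facts come from outside the model’s weights, and the model only has to write the answer.

```python
# Minimal RAG sketch: a toy keyword retriever over an in-memory corpus.
# `call_llm` is a hypothetical placeholder for whatever LLM client you use.

CORPUS = [
    "SFR-RAG is a 9-billion-parameter model from the Salesforce team.",
    "Retrieval-Augmented Generation grounds answers in retrieved documents.",
    "Smaller models are cheaper to fine-tune and faster at inference.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank passages by naive keyword overlap (a real system would use
    # embeddings and a vector index instead).
    words = set(query.lower().split())
    scored = sorted(CORPUS, key=lambda p: -len(words & set(p.lower().split())))
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your (potentially small) LLM of choice."""
    ...

def rag_answer(question: str) -> str:
    # 1. Pull up-to-date facts from outside the model's weights.
    context = "\n".join(f"- {p}" for p in retrieve(question))
    # 2. Ask the model to answer using only that context.
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```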
Smaller language models, when combined with external knowledge retrieval, can match or even surpass the performance of much larger models in tasks that rely heavily on accurate facts. This advantage stems from the fact that smaller models are easier to train and fine-tune, making them more efficient. By reducing the model size, the time required for training decreases significantly, and the fine-tuning process becomes more manageable, with fewer resources needed overall. This makes the approach not only more effective but also cost-efficient.
In addition, using techniques like RAG further enhances the performance of smaller models. While larger models often take more time during inference due to the need to process extensive parameter weights, smaller models with RAG can be faster and more accurate. This is because they don't rely solely on internal knowledge; instead, they retrieve relevant facts dynamically.
The accuracy of factual information improves dramatically through this process. By drawing on reliable sources in real-time, RAG helps prevent the model from generating misleading or incorrect information, a common issue in larger models that may rely on outdated or incomplete data. This combination of smaller models and real-time retrieval creates a more robust and accurate system, minimizing errors and improving performance across fact-dependent tasks.
💡 The Caveat of Smaller Models
While smaller models excel in efficiency and accuracy through external retrieval, they can be less adept at following complex instructions directly from the prompt, which means certain tasks require additional engineering before a response is generated. For example, suppose a prompt asks the model to search a database, fall back to a web search if nothing is found, transform the output into JSON, and attach a link. A larger model may be able to follow all of these instructions in one pass, while a smaller model will struggle. With a smaller model, you need to break the prompt into several unit tasks and orchestrate them with ordinary program logic, as in the sketch below.
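Here is a hedged sketch of what that orchestration could look like. The `search_database`, `search_web`, and `call_llm` helpers are hypothetical placeholders you would wire to your own systems; the point is that the control flow, the fallback, and the JSON formatting live in code, and each LLM call stays small and focused.

```python
import json

# Hypothetical helpers: wire these to your own database, search API, and LLM.
def search_database(query: str) -> tuple[str, str] | None:
    """Return (text, link) for a database hit, or None if nothing matched."""
    ...

def search_web(query: str) -> tuple[str, str]:
    """Return (text, link) from a web search."""
    ...

def call_llm(prompt: str) -> str:
    """One small, focused generation call to a smaller model."""
    ...

def answer_as_json(query: str) -> str:
    # Unit task 1: try the database first.
    hit = search_database(query)

    # Unit task 2: fall back to the web only if the database found nothing.
    if hit is None:
        hit = search_web(query)
    text, link = hit

    # Unit task 3: keep the model's job small and well-scoped.
    answer = call_llm(f"Answer the question '{query}' using this text:\n{text}")

    # Unit task 4: the JSON formatting and the link are handled in code,
    # not left to the model to get right.
    return json.dumps({"answer": answer, "link": link})
```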
Generalization vs. Factual Accuracy
It’s crucial to understand that while bigger LLMs are valuable for generalization, they aren’t necessarily the right tool for improving factual accuracy. Generalization allows models to understand patterns and generate fluent text, but it doesn't guarantee that the generated text will always be factually correct. Instead of scaling models for better factual outcomes, we should focus on hybrid systems like RAG, which balance generalization with accurate fact retrieval.
A recent report from the Salesforce team demonstrates a prime example of improving factual accuracy without scaling up: SFR-RAG, a small (9 billion parameters, “small” by current standards) but highly efficient LLM. SFR-RAG is specifically designed to maximize the benefits of RAG through context-grounded generation, minimizing hallucinations, and handling complex scenarios such as conflicting information or unanswerable questions.
The key innovation in SFR-RAG compared to traditional RAG systems is the introduction of Thoughts and Observations, where the model generates an internal chain of reasoning (similar to how OpenAI’s o1 works). This enables multi-hop reasoning, allowing the model to identify gaps in the information and call tools to retrieve additional data from external sources.
They’ve designed the system to distinguish between thoughts and observations by using specific tokens to mark the boundaries of each. This setup allows the model to be trained specifically on generating thoughts while relying on a retrieval system to gather the necessary information.
SFR-RAG represents a shift from the conventional RAG approach, where relevant information is simply appended to the initial prompt. Instead, it leverages the LLM to generate targeted search queries for missing information, while fine-tuning the model to produce high-quality thoughts that serve as effective search queries.
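To illustrate the idea, here is a simplified sketch of such a thought/observation loop. It is not the actual SFR-RAG implementation: the delimiter tokens, the `FINAL:` sentinel, and the helper functions are all made up for illustration, but the loop structure (the thought doubles as a search query, the observation comes from the retriever, and the cycle repeats until the model can answer) matches the mechanism described above.

```python
# Simplified sketch of a thought/observation loop in the spirit of SFR-RAG.
# The delimiter tokens and helpers below are illustrative assumptions, not
# the model's actual special tokens or API.

THOUGHT_START, THOUGHT_END = "<|thought|>", "<|/thought|>"
OBS_START, OBS_END = "<|observation|>", "<|/observation|>"

def call_llm(prompt: str) -> str:
    """Placeholder for the fine-tuned model."""
    ...

def retrieve(query: str) -> str:
    """Placeholder for the external retrieval system."""
    ...

def answer_multi_hop(question: str, max_hops: int = 3) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_hops):
        # The model emits a thought; in this setup the thought is also the
        # search query for the piece of information that is still missing.
        thought = call_llm(transcript + THOUGHT_START)
        if "FINAL:" in thought:
            # The model signals it has enough information to answer.
            return thought.split("FINAL:", 1)[1].strip()
        transcript += f"{THOUGHT_START}{thought}{THOUGHT_END}\n"

        # The retriever fills the gap; its result is appended as an
        # observation, clearly delimited from the model's own reasoning.
        observation = retrieve(thought)
        transcript += f"{OBS_START}{observation}{OBS_END}\n"

    # Fallback: answer with whatever has been gathered so far.
    return call_llm(transcript + "Answer: ")
```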
While this approach may take slightly longer to execute, the benefit lies in using a smaller, faster-to-train model that can match or even exceed the performance of much larger models.
💡 This system of thought mirrors the latest advancements in OpenAI’s models, where an automated chain of thought mimics the logical steps we typically perform manually with these systems.
Conclusion
The future of AI doesn’t have to be about building the biggest models. In fact, smaller LLMs combined with retrieval mechanisms like SFR-RAG offer a smarter, more efficient approach to maintaining accuracy while reducing the costs associated with training and inference. The key is to let LLMs do what they do best—generate text—while relying on external systems like RAG to provide the factual backbone. It’s time to shift our focus toward a more balanced and scalable approach to AI development that combines the strengths of smaller models with powerful retrieval systems.