Good morning everyone!
Today, we discuss an important aspect of developing LLM applications with your own data: how to evaluate Retrieval-Augmented Generation (RAG) systems.
RAG combines retrieval mechanisms (as in Google search) with generative models (e.g., GPT-4) to improve response quality and relevance. Evaluating RAG systems is important to confirm they meet performance and accuracy requirements. We will present the key evaluation metrics and methods we’ve found useful when developing those applications.
A quick reminder on RAG
RAG is simply the process of adding external knowledge to an existing LLM through the input prompt. It is most useful for private data or advanced topics the LLM might not have seen during training. In its most basic setup, you “inject” additional information along with your prompt so the model can use it to formulate a better answer. You can learn more about RAG here. From now on, we assume you are familiar with such systems.
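For a concrete picture of that “injection” step, here is a minimal sketch; the `retrieve` function is a hypothetical stand-in for whatever search or vector-store lookup you use, not a specific library call:

```python
# Minimal sketch of the RAG prompt "injection" step.
# `retrieve` is a hypothetical callable returning the top-k relevant text chunks.
def build_rag_prompt(question: str, retrieve, k: int = 3) -> str:
    chunks = retrieve(question, k=k)  # most relevant chunks from your indexed data
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```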
Important note before we start
The following sections assume that you have already curated your data: it should contain all the relevant information necessary for the task to guarantee effective performance. Data curation is a critical process and a topic worthy of its own discussion, which we plan to cover shortly.
Setting Up an Evaluation Pipeline
First, you will need an evaluation pipeline. A RAG system has two distinct parts you can evaluate. The first is the information retrieval component: without good retrieval, answer generation will suffer from poor input even if you use the latest and best LLM out there. This is where you need a good dataset of questions paired with the associated relevant information.
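With such a labeled set mapping each question to the chunk IDs that contain its answer, simple metrics like hit rate and mean reciprocal rank already tell you a lot. Here is a minimal sketch, assuming a hypothetical `retrieve_ids` function that returns ranked chunk IDs:

```python
# Sketch of retrieval evaluation: hit rate@k and MRR@k over a labeled dataset.
# `eval_set` maps each question to the set of chunk IDs known to contain its answer.
def evaluate_retrieval(eval_set: dict[str, set[str]], retrieve_ids, k: int = 5) -> dict:
    hits, reciprocal_ranks = 0, []
    for question, relevant_ids in eval_set.items():
        ranked_ids = retrieve_ids(question, k=k)  # top-k chunk IDs, best first
        # Rank (1-based) of the first relevant chunk, or None if not retrieved.
        rank = next((i + 1 for i, cid in enumerate(ranked_ids) if cid in relevant_ids), None)
        if rank is not None:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(eval_set),
        "mrr@k": sum(reciprocal_ranks) / len(eval_set),
    }
```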
The second part is the LLM’s answer generation. Here, you will need a dataset of questions and correct answers to review your overall system, which we cover in step three of this post.
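A common way to do this is to use a strong model as a judge, grading each generated answer against the reference answer. Below is a hedged sketch using the OpenAI Python client; the model name and the 1–5 scale are example choices, not a prescribed setup:

```python
# Sketch of generation evaluation with an LLM judge (OpenAI Python client >= 1.0).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, reference: str, generated: str, model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 correctness score of `generated` vs. `reference`."""
    prompt = (
        "Rate how well the candidate answer matches the reference answer "
        "for the question below. Reply with a single integer from 1 (wrong) to 5 (equivalent).\n\n"
        f"Question: {question}\nReference answer: {reference}\nCandidate answer: {generated}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The judge is expected to reply with a single integer.
    return int(response.choices[0].message.content.strip())
```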
The most important part of these systems is understanding the type of questions they have to answer. You can spend a lot of time and effort building a RAG system (indexing and curating the data) and then realize that your users ask the same three questions 90% of the time (if that’s the case, don’t even bother with an LLM; just use a regular decision-tree-based chatbot).
Put your energy into finding the right set of questions the system must be able to answer before anything else. Then, once you believe your system can handle this type of question, evaluate its retrieval capacity first and its generation capability second.
Step one: Retrieval - Create Your Eval Dataset
The first step is to assess the retrieval performance of your RAG pipeline. Here, you need a set of questions, each associated with the relevant data that contains its answer, so you can measure the system’s capacity to retrieve the right information. You can build this dataset in three different ways:
Leveraging Expertise: Use expert users or researchers to create high-quality ground truth data. They will write useful questions and pair them with the relevant information.
Yourself: If you have domain expertise, you can generate ground truth data manually.
An LLM: A powerful LLM can also assist in generating complex and contextually rich questions from your indexed data. Don’t forget to review them yourself too!
Then, regardless of how you want to tackle this, the general idea is the same: pair each question with the exact piece(s) of data that answer it, and review the pairs before treating them as ground truth.
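As an illustration of the LLM-assisted route, here is a hedged sketch that samples indexed chunks, asks a model to write one question per chunk, and keeps the (question, chunk) pairs for manual review; the chunk list, model name, and prompt wording are placeholders, not a prescribed pipeline:

```python
# Sketch: generate (question, source chunk) pairs for the retrieval eval set.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_eval_pairs(chunks: list[str], n_questions: int = 50, model: str = "gpt-4o") -> list[dict]:
    """Sample chunks and ask an LLM to write one question answerable from each chunk."""
    pairs = []
    for chunk in random.sample(chunks, min(n_questions, len(chunks))):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": "Write one realistic user question that can be answered "
                           f"using only this passage:\n\n{chunk}",
            }],
            temperature=0.7,
        )
        pairs.append({
            "question": response.choices[0].message.content.strip(),
            "relevant_chunk": chunk,
        })
    return pairs  # review these manually before using them as ground truth
```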