UpTrain Callback Handler
UpTrain [github || website || docs] is an open-source platform to evaluate and improve LLM applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and provides guidance for resolving them.
This notebook showcases the UpTrain callback handler seamlessly integrating into your pipeline, facilitating diverse evaluations. We have chosen a few evaluations that we deemed apt for evaluating the chains. These evaluations run automatically, with results displayed in the output. More details on UpTrain’s evaluations can be found here. Selected retrievers from LangChain are highlighted for demonstration:
1. Vanilla RAG
RAG plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations:
- Context Relevance: Determines if the context extracted from the query is relevant to the response.
- Factual Accuracy: Assesses if the LLM is hallucinating or providing incorrect information.
- Response Completeness: Checks if the response contains all the information requested by the query.
2. Multi Query Generation
MultiQueryRetriever creates multiple variants of a question that have a similar meaning to the original question. Given the complexity, we include the previous evaluations and add:
- Multi Query Accuracy: Assures that the multi-queries generated mean the same as the original query.
3. Context Compression and Reranking
Re-ranking involves reordering nodes based on relevance to the query and choosing the top n nodes. Since the number of nodes can reduce once the re-ranking is complete, we perform the following evaluations:
- Context Reranking: Checks if the order of re-ranked nodes is more relevant to the query than the original order.
- Context Conciseness: Examines whether the reduced number of nodes still provides all the required information.
Install Dependencies
Install `faiss-gpu` instead of `faiss-cpu` if you want to use the GPU-enabled version of the library.
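A typical install cell for this notebook might look like the following; the exact package list is an assumption based on the integrations used below (OpenAI models, FAISS, and FlashRank reranking):

```python
# Install the dependencies used in this notebook.
# Swap faiss-cpu for faiss-gpu if you have a CUDA-capable GPU.
%pip install -qU langchain langchain-community langchain-openai uptrain faiss-cpu flashrank
```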
Import Libraries
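A plausible import cell, assuming recent `langchain`, `langchain-community`, and `langchain-openai` packages (import paths can shift between LangChain versions):

```python
from getpass import getpass

from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.callbacks.uptrain_callback import UpTrainCallbackHandler
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
```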
Load the documents
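For example, with a plain-text file (the file name here is a placeholder; point it at any document you want to query):

```python
# Load a plain-text document to build the RAG pipeline over.
loader = TextLoader("state_of_the_union.txt")  # placeholder path
documents = loader.load()
```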
Split the document into chunks
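A sketch using RecursiveCharacterTextSplitter; the chunk size and overlap are illustrative defaults, not prescribed values:

```python
# Split into overlapping chunks so each piece fits comfortably in the context window.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
```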
Create the retriever
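One way to build the retriever is to embed the chunks into a FAISS vector store (assuming OpenAI embeddings):

```python
# Embed the chunks, index them in FAISS, and expose the index as a retriever.
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever()
```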
Define the LLM
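Any LangChain chat model works here; the model name below is an assumption:

```python
# The chat model used to generate answers (and, later, the multi-query rephrasings).
llm = ChatOpenAI(temperature=0, model="gpt-4")
```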
Setup
UpTrain provides you with:
- Dashboards with advanced drill-down and filtering options
- Insights and common topics among failing cases
- Observability and real-time monitoring of production data
- Regression testing via seamless integration with your CI/CD pipelines
1. UpTrain’s Open-Source Software (OSS)
You can use the open-source evaluation service to evaluate your model. In this case, you will need to provide an OpenAI API key. UpTrain uses GPT models to evaluate the responses generated by the LLM. You can get yours here. In order to view your evaluations in the UpTrain dashboard, you will need to set it up by running a few commands in your terminal; once it is running, the dashboard is available at http://localhost:3000/dashboard.
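The setup commands are likely along the following lines; this is an assumption based on the UpTrain repository, so check its README for the current instructions:

```bash
# Clone the UpTrain repository and start the local dashboard (assumed setup steps).
git clone https://github.com/uptrain-ai/uptrain
cd uptrain
bash run_uptrain.sh
```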
Parameters:
- `key_type="openai"`
- `api_key="OPENAI_API_KEY"`
- `project_name="PROJECT_NAME"`
2. UpTrain Managed Service and Dashboards
Alternatively, you can use UpTrain’s managed service to evaluate your model. You can create a free UpTrain account here and get free trial credits. If you want more trial credits, book a call with the maintainers of UpTrain here. The benefits of using the managed service are:
- No need to set up the UpTrain dashboard on your local machine.
- Access to many LLMs without needing their API keys.
Your evaluations will appear in the UpTrain dashboard at https://dashboard.uptrain.ai/dashboard.
Parameters:
- `key_type="uptrain"`
- `api_key="UPTRAIN_API_KEY"`
- `project_name="PROJECT_NAME"`
The `project_name` parameter will be the project name under which the evaluations performed will be shown in the UpTrain dashboard.
Set the API key
The notebook will prompt you to enter the API key. You can choose between the OpenAI API key or the UpTrain API key by changing the `key_type` parameter in the cell below.
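A minimal sketch of that cell, assuming the `KEY_TYPE` and `API_KEY` variable names used in the examples below:

```python
# Choose "openai" for UpTrain's open-source evaluators (needs an OpenAI key),
# or "uptrain" for the managed service (needs an UpTrain key).
KEY_TYPE = "openai"  # or "uptrain"
API_KEY = getpass("Enter your API key: ")
```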
1. Vanilla RAG
The UpTrain callback handler will automatically capture the query, context, and response once generated, and will run the following three evaluations (graded from 0 to 1) on the response:
- Context Relevance: Check if the context extracted from the query is relevant to the response.
- Factual Accuracy: Check how factually accurate the response is.
- Response Completeness: Check if the response contains all the information that the query is asking for.
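A minimal sketch of the chain wiring, assuming the `retriever`, `llm`, `KEY_TYPE`, and `API_KEY` defined above; the prompt, project name, and sample query are placeholders:

```python
# Build a simple RAG chain over the retriever.
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Attach the UpTrain callback handler so every invocation gets evaluated.
uptrain_callback = UpTrainCallbackHandler(
    key_type=KEY_TYPE, api_key=API_KEY, project_name="PROJECT_NAME"
)
config = {"callbacks": [uptrain_callback]}

query = "What did the author say about the topic?"  # placeholder query
response = chain.invoke(query, config=config)
```

The evaluation scores are printed in the cell output, and logged to the dashboard if one is configured.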
2. Multi Query Generation
The MultiQueryRetriever is used to tackle the problem that the RAG pipeline might not return the best set of documents based on the query. It generates multiple queries that mean the same as the original query and then fetches documents for each. To evaluate this retriever, UpTrain will run the following evaluation:
- Multi Query Accuracy: Checks if the multi-queries generated mean the same as the original query.
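A sketch reusing the chain components from the previous step, with the base retriever wrapped in a MultiQueryRetriever:

```python
# Wrap the base retriever so the LLM generates several rephrasings of each query.
multi_query_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

chain = (
    {"context": multi_query_retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
response = chain.invoke(query, config=config)
```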
3. Context Compression and Reranking
The reranking process involves reordering nodes based on relevance to the query and choosing the top n nodes. Since the number of nodes can reduce once the reranking is complete, we perform the following evaluations:
- Context Reranking: Check if the order of re-ranked nodes is more relevant to the query than the original order.
- Context Conciseness: Check if the reduced number of nodes still provides all the required information.
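One possible wiring, assuming the FlashRank reranker (from the `flashrank` package) as the document compressor; other compressors would slot in the same way:

```python
# Rerank and compress the retrieved documents with FlashRank before answering.
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# A RetrievalQA chain over the compression retriever, evaluated via the same callback.
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
response = chain.invoke({"query": query}, config=config)
```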
UpTrain’s Dashboard and Insights
Here’s a short video showcasing the dashboard and the insights: