- A way to extract text from files (PDF, PPT, DOCX, etc.).
- ML-based chunking that provides state-of-the-art performance.
- The Boomerang embeddings model.
- Its own internal vector database where text chunks and embedding vectors are stored.
- A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments, with support for Hybrid Search and multiple reranking options such as the multilingual relevance reranker, MMR, and the UDF reranker.
- An LLM for creating a generative summary based on the retrieved documents (context), including citations.
Vectara as SelfQueryRetriever.
Setup
To use the VectaraVectorStore you first need to install the partner package.
Getting Started
To get started, use the following steps:
- If you don’t already have one, sign up for your free Vectara trial.
- Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the “Create Corpus” button, then give your corpus a name and a description. Optionally, you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID at the top.
- Next you’ll need to create API keys to access the corpus. Click on the “Access Control” tab in the corpus view and then the “Create API Key” button. Give your key a name, and choose whether you want query-only or query+index for your key. Click “Create” and you now have an active API key. Keep this key confidential.
To use LangChain with Vectara, you need two values: corpus_key and api_key.
You can provide VECTARA_API_KEY to LangChain in two ways:
- Set the VECTARA_API_KEY environment variable, for example with os.environ and getpass.
- Pass it to the Vectara vectorstore constructor.
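The environment-variable option can be sketched with the standard library alone. This is a minimal illustration, not the guide's original snippet: the helper name `ensure_vectara_api_key` and its `prompt` parameter are hypothetical, and only `VECTARA_API_KEY` comes from the text above.

```python
import getpass
import os


def ensure_vectara_api_key(prompt=getpass.getpass) -> str:
    """Return VECTARA_API_KEY, asking for it only when it is not already set.

    `prompt` is a hypothetical hook (defaulting to getpass.getpass) so the
    helper can also be used in non-interactive scripts.
    """
    if not os.environ.get("VECTARA_API_KEY"):
        # getpass hides the typed key so it does not end up on screen or in logs.
        os.environ["VECTARA_API_KEY"] = prompt("Vectara API Key: ")
    return os.environ["VECTARA_API_KEY"]


# Interactive use (uncomment in a notebook or terminal session):
# api_key = ensure_vectara_api_key()
```

Prompting only when the variable is missing lets the same code run both interactively and in environments where the key is already configured.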
Connecting to Vectara from LangChain
In this example, we assume that you’ve created an account and a corpus, and added your VECTARA_CORPUS_KEY and VECTARA_API_KEY (created with permissions for both indexing and query) as environment variables.
We further assume the corpus has four fields defined as filterable metadata attributes: year, director, rating, and genre.
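As a rough illustration of what those four attributes look like on a document, the sketch below pairs each text with a metadata record and checks it against the expected fields. The sample records are made-up placeholders, not the dataset used in this guide, and the Python types chosen for each attribute (`int`, `str`, `float`, `str`) and the helper `invalid_fields` are assumptions for the example.

```python
# The four attributes the corpus is assumed to expose as filterable metadata.
# The Python types are an assumption made for this sketch.
FILTER_ATTRIBUTES = {"year": int, "director": str, "rating": float, "genre": str}

# Hypothetical sample records with invented values.
movies = [
    {
        "text": "A crew of explorers gets stranded on a distant moon.",
        "metadata": {
            "year": 2001,
            "director": "Jane Doe",
            "rating": 7.4,
            "genre": "science fiction",
        },
    },
    {
        "text": "Two rival bakers are forced to share one kitchen.",
        "metadata": {
            "year": 2015,
            "director": "John Roe",
            "rating": 6.8,
            "genre": "comedy",
        },
    },
]


def invalid_fields(metadata: dict) -> list[str]:
    """Return filter attributes that are missing or have an unexpected type."""
    return [
        name
        for name, expected in FILTER_ATTRIBUTES.items()
        if not isinstance(metadata.get(name), expected)
    ]


# Report any record that would not be filterable on all four attributes.
for movie in movies:
    problems = invalid_fields(movie["metadata"])
    if problems:
        print(f"Skipping {movie['text']!r}: bad fields {problems}")
```

Each record maps naturally onto a document with text content plus a metadata dictionary, which is the shape needed when uploading to a corpus with filter attributes.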
Dataset
We first define an example dataset of movies and upload them to the corpus, along with their metadata.
Self-query with Vectara
You don’t need self-query via the LangChain mechanism: enabling intelligent_query_rewriting on the Vectara platform achieves the same result.
Vectara offers an Intelligent Query Rewriting option, which enhances search precision by automatically generating metadata filter expressions from natural language queries. This capability analyzes user queries, extracts relevant metadata filters, and rephrases the query to focus on the core information need. For more details, see the Vectara documentation.
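To make the idea concrete, here is a toy, rule-based sketch of the same input/output shape: a natural language query goes in, and a metadata filter plus a trimmed query come out. It is not how Vectara implements the feature; the regex rules, the `RewrittenQuery` class, and the `rewrite` function are all hypothetical, and the `doc.<attribute>` filter syntax is assumed to match the corpus attributes described earlier.

```python
import re
from dataclasses import dataclass


@dataclass
class RewrittenQuery:
    query: str            # what remains of the query once filters are pulled out
    metadata_filter: str  # filter expression, or "" when nothing was extracted


# Hypothetical rules for this sketch: (pattern, filter template).
_RULES = [
    (re.compile(r"\bafter\s+(\d{4})\b", re.I), "doc.year > {0}"),
    (re.compile(r"\bbefore\s+(\d{4})\b", re.I), "doc.year < {0}"),
    (
        re.compile(r"\brated\s+(?:above|over)\s+(\d+(?:\.\d+)?)\b", re.I),
        "doc.rating > {0}",
    ),
]


def rewrite(query: str) -> RewrittenQuery:
    """Extract simple year/rating constraints and return the remaining query."""
    clauses = []
    for pattern, template in _RULES:
        match = pattern.search(query)
        if match:
            clauses.append(template.format(match.group(1)))
            # Remove the matched phrase so only the core information need remains.
            query = pattern.sub("", query, count=1)
    core = " ".join(query.split())  # collapse whitespace left by removed phrases
    return RewrittenQuery(query=core, metadata_filter=" and ".join(clauses))


result = rewrite("movies about toys rated above 8.5 after 1990")
print(result.query)            # movies about toys
print(result.metadata_filter)  # doc.year > 1990 and doc.rating > 8.5
```

A real rewriter handles far more phrasing than a few regexes can, which is the reason to let the platform do this step rather than maintaining rules by hand.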
Enable intelligent query rewriting on a per-query basis by setting the intelligent_query_rewriting parameter to true in VectaraQueryConfig.