Cassandra is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with vector search capabilities.
Note: in addition to access to the database, an OpenAI API Key is required to run the full example.
Setup and general dependencies
Use of the integration requires the cassio Python package. The packages datasets, openai, pypdf and tiktoken are also required for this demo, along with langchain-community.
Import the Vector Store
Connection parameters
The Vector Store integration shown in this page can be used with Cassandra as well as other derived databases, such as Astra DB, which use the CQL (Cassandra Query Language) protocol. DataStax Astra DB is a managed serverless database built on Cassandra, offering the same interface and strengths.
Depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when creating the vector store object.
Connecting to a Cassandra cluster
You first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. with network settings and authentication), but this might be something like:
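A minimal sketch of that step follows. The helper names (cluster_settings, open_session), the localhost address and the default port are assumptions made for illustration; adapt them to your cluster.

```python
def cluster_settings(hosts, port=9042):
    """Collect the keyword arguments passed to cassandra.cluster.Cluster."""
    return {"contact_points": list(hosts), "port": port}


def open_session(hosts, port=9042, username=None, password=None):
    """Open a cassandra.cluster.Session (hypothetical helper for this page)."""
    # Imported here so the module loads even where the driver is not installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    kwargs = cluster_settings(hosts, port)
    if username is not None:
        # Password authentication is optional and depends on the cluster setup.
        kwargs["auth_provider"] = PlainTextAuthProvider(username, password)
    return Cluster(**kwargs).connect()


# Example (requires a reachable cluster):
# session = open_session(["127.0.0.1"])
```

Keeping the settings in a plain dictionary makes it easy to add further driver options without changing the connection code.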
Setting the session globally through cassio.init, rather than passing it to each component, comes in handy if your application uses Cassandra in several ways (for instance, for vector store, chat memory and LLM response caching), as it lets you centralize credential and DB connection management in one place.
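As a sketch of the global-setting approach, the function below registers a session and keyspace with cassio.init and then creates the vector store without passing them again. The function name and the table name cassandra_vector_demo are placeholders; the embeddings object (for example an OpenAI embeddings instance) is assumed to be created elsewhere.

```python
def build_vector_store(session, keyspace, embeddings,
                       table_name="cassandra_vector_demo"):
    """Create the vector store on top of CassIO's global defaults."""
    # Imported lazily: both packages are only needed when this is called.
    import cassio
    from langchain_community.vectorstores import Cassandra

    # Register the session and keyspace once, as process-wide defaults.
    cassio.init(session=session, keyspace=keyspace)
    # With session=None and keyspace=None the store uses those defaults.
    return Cassandra(
        embedding=embeddings,
        table_name=table_name,
        session=None,
        keyspace=None,
    )
```

Passing session and keyspace directly to the Cassandra constructor is an alternative when only one component talks to the database.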
Connecting to Astra DB through CQL
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a “Database Administrator” token)
- Optionally a Keyspace name (if omitted, the default one for the database will be used)
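The parameters listed above can be passed to cassio.init roughly as follows. This is a sketch: the environment variable names (ASTRA_DB_ID, ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_KEYSPACE) and both helper names are assumptions chosen for the example, not something CassIO requires.

```python
import os


def astra_init_kwargs(env=os.environ):
    """Build the keyword arguments for cassio.init from a mapping."""
    kwargs = {
        "database_id": env["ASTRA_DB_ID"],
        "token": env["ASTRA_DB_APPLICATION_TOKEN"],
    }
    # The keyspace is optional; leave it out to use the database default.
    keyspace = env.get("ASTRA_DB_KEYSPACE")
    if keyspace:
        kwargs["keyspace"] = keyspace
    return kwargs


def init_astra(env=os.environ):
    """Initialize CassIO for Astra DB (needs the cassio package)."""
    import cassio

    cassio.init(**astra_init_kwargs(env))
```

Reading the token from the environment keeps the secret out of the source code.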
Load a dataset
Convert each entry in the source dataset into a Document, then write them into the vector store. The metadata dictionaries are created from the source data and are part of the Document.
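The conversion could look like the sketch below. The field names quote, author and tags, and the ";"-separated tag layout, are assumptions about the source rows; replace them with the columns of your dataset.

```python
def entry_to_parts(entry):
    """Split one source row into page text and a metadata dictionary."""
    metadata = {"author": entry["author"]}
    # Assumed layout: tags arrive as a ";"-separated string, possibly empty.
    for tag in filter(None, (entry.get("tags") or "").split(";")):
        metadata[tag] = "y"
    return entry["quote"], metadata


def ingest(vstore, entries):
    """Write the rows into the store and return the IDs it assigned."""
    # Imported lazily: needs the langchain-core package.
    from langchain_core.documents import Document

    docs = [
        Document(page_content=text, metadata=md)
        for text, md in map(entry_to_parts, entries)
    ]
    return vstore.add_documents(docs)
```

Storing each tag as its own metadata key makes it possible to filter on a single tag later.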
Add some more entries, this time with add_texts:
Note: you may want to speed up the execution of add_texts and add_documents by increasing the concurrency level for these bulk operations; check out the methods’ batch_size parameter for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary.
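A sketch of an add_texts call with explicit metadata, IDs and batch_size follows. The helper name, the ID format and the value 16 are illustrative assumptions; tune batch_size for your own network and client machine.

```python
def add_quotes(vstore, texts, authors, batch_size=16):
    """Insert raw texts with per-text metadata and caller-chosen IDs."""
    metadatas = [{"author": author} for author in authors]
    # Hypothetical ID scheme: author name plus a zero-padded position.
    ids = [f"{author}_{i:02d}" for i, author in enumerate(authors)]
    # batch_size controls how many writes are issued concurrently.
    return vstore.add_texts(
        texts=texts,
        metadatas=metadatas,
        ids=ids,
        batch_size=batch_size,
    )
```

Supplying your own IDs makes later deletions easier, since you do not need to keep the values returned by the store.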
Run searches
This section demonstrates metadata filtering and getting the similarity scores back:
MMR (Maximal-marginal-relevance) search
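The three kinds of search mentioned in this section (filtered, scored and MMR) can be sketched as one helper. The function name, the value k=3 and the author metadata key are assumptions for the example.

```python
def run_searches(vstore, query, author):
    """Run a filtered search, a scored search and an MMR search."""
    flt = {"author": author}
    # Metadata filtering: only entries whose metadata matches are considered.
    filtered = vstore.similarity_search(query, k=3, filter=flt)
    # Same search, but each hit comes back as a (Document, score) pair.
    scored = vstore.similarity_search_with_score(query, k=3, filter=flt)
    # MMR trades some relevance for diversity among the returned documents.
    diverse = vstore.max_marginal_relevance_search(query, k=3, filter=flt)
    return filtered, scored, diverse
```

The filter argument is optional in all three calls; omit it to search the whole table.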
Deleting stored documents
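A sketch of two ways to remove entries, assuming the IDs were kept from an earlier insertion. The helper names are hypothetical; they wrap the store's delete and clear methods.

```python
def delete_by_ids(vstore, ids):
    """Delete the entries with the given IDs; returns the store's result."""
    return vstore.delete(list(ids))


def delete_everything(vstore):
    """Empty the whole table while keeping the table itself."""
    vstore.clear()
```

Deleting by ID is the safer choice in shared tables, since clear removes every entry regardless of who wrote it.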
A minimal RAG chain
The next steps implement a simple RAG pipeline:
- download a sample PDF file and load it into the store;
- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;
- run the question-answering chain.
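The steps above can be sketched as follows. The prompt wording, the chunk sizes, k=3 and the helper names are assumptions; the PDF path and the chat model (llm) are supplied by the caller, and the PDF is assumed to be already downloaded.

```python
PROMPT_TEMPLATE = """Answer the question using only the supplied context.

Context: {context}
Question: {question}
Answer:"""


def load_pdf_into_store(vstore, pdf_path):
    """Split a local PDF into chunks and add them to the vector store."""
    # Lazy imports: need langchain-community, pypdf, langchain-text-splitters.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(PyPDFLoader(pdf_path).load())
    return vstore.add_documents(chunks)


def build_rag_chain(vstore, llm):
    """Compose retriever, prompt, model and parser into one LCEL chain."""
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough

    retriever = vstore.as_retriever(search_kwargs={"k": 3})
    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    # The question is sent to the retriever and also passed through unchanged.
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


# Example (requires a populated store and a chat model):
# chain = build_rag_chain(vstore, llm)
# print(chain.invoke("A question about the PDF content"))
```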
Cleanup
The following essentially retrieves the Session object from CassIO and runs a CQL DROP TABLE statement with it. (You will lose the data you stored in the table.)
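A sketch of that cleanup, assuming CassIO was initialized earlier in the same process. The helper names are hypothetical, and IF EXISTS is added here only so the call does not fail when the table is already gone.

```python
def drop_table_statement(keyspace, table_name):
    """Build the CQL statement that removes the table and all its data."""
    return f"DROP TABLE IF EXISTS {keyspace}.{table_name};"


def drop_demo_table(table_name):
    """Drop the table using the session and keyspace known to CassIO."""
    import cassio

    # Both values come from the earlier cassio.init call.
    session = cassio.config.resolve_session()
    keyspace = cassio.config.resolve_keyspace()
    session.execute(drop_table_statement(keyspace, table_name))
```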
Learn more
For extended quickstarts and additional usage examples, please visit the CassIO documentation on using the LangChain Cassandra vector store.
Attribution statement
Apache Cassandra, Cassandra and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.