Timescale Vector is a PostgreSQL++ vector database for AI applications.
This notebook shows how to use the Postgres vector database Timescale Vector. You’ll learn how to use TimescaleVector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) creating indexes to speed up queries.
What is Timescale Vector?
Timescale Vector enables you to efficiently store and query millions of vector embeddings in PostgreSQL.
- Enhances pgvector with faster and more accurate similarity search on 100M+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.
Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.
How to access Timescale Vector
Timescale Vector is available on Timescale, the cloud PostgreSQL platform. (There is no self-hosted version at this time.)
LangChain users get a 90-day free trial for Timescale Vector.
- To get started, sign up for Timescale, create a new database, and follow this notebook!
- See the Timescale Vector explainer blog for more details and performance benchmarks.
- See the installation instructions for more details on using Timescale Vector in Python.
Setup
Follow these steps to get ready for this tutorial.
This example uses OpenAIEmbeddings, so let’s load your OpenAI API key.
Next, import the timescale-vector library as well as the TimescaleVector LangChain vectorstore.
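Loading the key might look like the following minimal sketch. It assumes the key lives in the `OPENAI_API_KEY` environment variable and falls back to an interactive prompt when it is missing; the helper name `load_openai_api_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def load_openai_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key, prompting only when the env var is unset.

    `env` and `prompt` are parameters so the helper is easy to exercise
    without touching the real environment (an assumption of this sketch).
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # No key in the environment: ask for it without echoing to the screen.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("an OpenAI API key is required for OpenAIEmbeddings")
    return key
```

In a notebook you would typically call `load_openai_api_key()` once and store the result back in `os.environ["OPENAI_API_KEY"]`.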
1. Similarity Search with Euclidean Distance (Default)
First, we’ll look at an example of doing a similarity search query on the State of the Union speech to find the most similar sentences to a given query sentence. We’ll use the Euclidean distance as our similarity metric.
To connect to your PostgreSQL database, you’ll need your service URI, which can be found in the .env file you downloaded after creating a new database.
The URI will look something like this: postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require.
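Before passing the URI to the vectorstore, it can help to check that it has the expected shape. The following is a small sketch using only the standard library; the helper name `describe_service_url` and the sample password, id, and port are hypothetical stand-ins for the placeholders in the URI above.

```python
from urllib.parse import parse_qs, urlsplit


def describe_service_url(url: str) -> dict:
    """Split a PostgreSQL service URL into its parts, omitting the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    query = parse_qs(parts.query)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        # parse_qs returns lists; take the first value if the option is set.
        "sslmode": query.get("sslmode", [None])[0],
    }


# Hypothetical values standing in for <password>, <id>, and <port>.
example_url = (
    "postgres://tsdbadmin:s3cret@abc123.tsdb.cloud.timescale.com:5432/tsdb"
    "?sslmode=require"
)
info = describe_service_url(example_url)
```

The returned dictionary is safe to print or log because the password is deliberately left out.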
Using a Timescale Vector as a Retriever
After initializing a TimescaleVector store, you can use it as a retriever.
2. Similarity Search with time-based filtering
A key use case for Timescale Vector is efficient time-based vector search. Timescale Vector enables this by automatically partitioning vectors (and associated metadata) by time. This allows you to efficiently query vectors by both similarity to a query vector and time. Time-based vector search functionality is helpful for applications like:
- Storing and retrieving LLM response history (e.g. chatbots).
- Finding the most recent embeddings that are similar to a query vector (e.g. recent news).
- Constraining similarity search to a relevant time range (e.g. asking time-based questions about a knowledge base).
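To make the idea of "similar and within a time range" concrete, here is a toy pure-Python sketch. It is not how Timescale Vector works internally (the database relies on time partitions and indexes rather than a linear scan); the records and the `search` helper are made up for illustration.

```python
import math
from datetime import datetime


def search(records, query_vec, start, end, k=2):
    """Return the k records inside [start, end] closest to query_vec.

    Closeness is Euclidean distance, the default metric used in this notebook.
    """
    in_window = [r for r in records if start <= r["time"] <= end]
    return sorted(in_window, key=lambda r: math.dist(r["vec"], query_vec))[:k]


# Hypothetical two-dimensional "embeddings" with timestamps.
records = [
    {"id": "a", "time": datetime(2023, 8, 1), "vec": [0.0, 0.0]},
    {"id": "b", "time": datetime(2023, 8, 20), "vec": [1.0, 1.0]},
    {"id": "c", "time": datetime(2023, 8, 25), "vec": [0.2, 0.1]},
    {"id": "d", "time": datetime(2023, 9, 5), "vec": [0.1, 0.0]},
]

hits = search(records, [0.0, 0.0], datetime(2023, 8, 15), datetime(2023, 8, 31))
hit_ids = [r["id"] for r in hits]  # "a" and "d" are closer but fall outside the window
```

Note how the time filter changes the answer: the two globally nearest records are excluded because they fall outside the requested window.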
Extract content and metadata from git log JSON
First, let’s load the git log data into a new collection in our PostgreSQL database named timescale_commits.
We’ll define a helper function to create a uuid for a document and associated vector embedding based on its timestamp. We’ll use this function to create a uuid for each git log entry.
Important note: If you are working with documents and want the current date and time associated with each vector for time-based search, you can skip this step. By default, a uuid will be automatically generated when the documents are ingested.
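To illustrate how a timestamp can live inside a uuid, the sketch below builds a version-1 UUID from a datetime using only the standard library. This is an illustration of the general technique, not a description of the timescale-vector library’s internals; in practice you would use the library’s own helper. The function names are hypothetical.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_from_datetime(dt: datetime) -> uuid.UUID:
    """Build a version-1 UUID whose timestamp field encodes `dt`."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # treat naive datetimes as UTC
    micros = (dt - UNIX_EPOCH) // timedelta(microseconds=1)
    ticks = micros * 10 + UUID_EPOCH_OFFSET
    clock_seq = random.getrandbits(14)
    node = random.getrandbits(48) | (1 << 40)  # multicast bit marks a random node
    fields = (
        ticks & 0xFFFFFFFF,          # time_low
        (ticks >> 32) & 0xFFFF,      # time_mid
        (ticks >> 48) & 0x0FFF,      # time_hi (version bits are set below)
        (clock_seq >> 8) & 0x3F,     # clock_seq_hi (variant bits are set below)
        clock_seq & 0xFF,            # clock_seq_low
        node,
    )
    return uuid.UUID(fields=fields, version=1)


def datetime_from_uuid1(u: uuid.UUID) -> datetime:
    """Recover the UTC datetime stored in a version-1 UUID."""
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    return UNIX_EPOCH + timedelta(microseconds=micros)
```

Because the timestamp can be recovered from the uuid, a store can filter or partition rows by time without a separate timestamp column.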
Load documents and metadata into TimescaleVector vectorstore
Now that we have prepared our documents, let’s process them and load them, along with their vector embedding representations, into our TimescaleVector vectorstore. Since this is a demo, we will only load the first 500 records. In practice, you can load as many records as you want.
Next, we define the time_partition_interval argument, which is used as the interval for partitioning the data by time. Each partition will consist of data for the specified length of time. We’ll use 7 days for simplicity, but you can pick whatever value makes sense for your use case. For example, if you query recent vectors frequently you might want a smaller time delta like 1 day, and if you query vectors over a decade-long time period you might want a larger time delta like 6 months or 1 year.
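One way to reason about the interval is to count how many partitions a typical query would touch. The sketch below assumes fixed-width buckets aligned to the Unix epoch, which is a simplification for illustration; the actual partition boundaries chosen by the database may differ. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_index(ts: datetime, interval: timedelta) -> int:
    """Index of the fixed-width time bucket that contains `ts`."""
    return (ts - EPOCH) // interval


def partitions_touched(start: datetime, end: datetime, interval: timedelta) -> int:
    """Number of buckets a query over [start, end] has to look at."""
    return partition_index(end, interval) - partition_index(start, interval) + 1


# A one-month query window, compared under two candidate intervals.
start = datetime(2023, 8, 1, tzinfo=timezone.utc)
end = datetime(2023, 8, 31, tzinfo=timezone.utc)
weekly = partitions_touched(start, end, timedelta(days=7))
daily = partitions_touched(start, end, timedelta(days=1))
```

Smaller intervals mean more partitions per query window but less unrelated data inside each one, which is the trade-off described above.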
Finally, we’ll create the TimescaleVector instance. We specify the ids argument to be the uuid field in our metadata that we created in the pre-processing step above. We do this because we want the time part of our uuids to reflect dates in the past (i.e., when the commit was made). However, if we wanted the current date and time to be associated with our documents, we could remove the ids argument and uuids would be automatically created with the current date and time.
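Putting these pieces together, creating the store might look like the sketch below. It assumes the `langchain_community` import path for `TimescaleVector`, and that each document stores its time-based uuid under `metadata["id"]`; the metadata key and the `build_commit_store` wrapper are assumptions of this example, so adjust them to match your pre-processing step.

```python
from datetime import timedelta


def build_commit_store(docs, embeddings, service_url,
                       collection_name="timescale_commits", store_cls=None):
    """Create a time-partitioned vectorstore from pre-processed documents."""
    if store_cls is None:
        # Imported lazily; assumes the langchain-community package is installed.
        from langchain_community.vectorstores.timescalevector import (
            TimescaleVector as store_cls,
        )

    # Assumption: the uuid created during pre-processing lives under "id".
    ids = [doc.metadata["id"] for doc in docs]

    return store_cls.from_documents(
        embedding=embeddings,
        ids=ids,  # omit to have uuids generated with the current time
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=timedelta(days=7),
    )
```

The `store_cls` parameter exists only so the wrapper can be exercised with a stand-in class; leave it unset for normal use.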
Querying vectors by time and similarity
Now that we have loaded our documents into TimescaleVector, we can query them by time and similarity. TimescaleVector provides multiple methods for querying vectors by doing similarity search with time-based filtering. Let’s take a look at each method below:
3. Using ANN Search Indexes to Speed Up Queries
You can speed up similarity queries by creating an index on the embedding column. You should only do this once you have ingested a large part of your data. Timescale Vector supports the following indexes:
- timescale_vector index (tsv): a DiskANN-inspired graph index for fast similarity search (default).
- pgvector’s HNSW index: a hierarchical navigable small world graph index for fast similarity search.
- pgvector’s IVFFLAT index: an inverted file index for fast similarity search.
Calling the create_index() function without additional arguments will create a timescale_vector_index by default, using the default parameters.
You can also use the index_type argument to specify which index you’d like to create, and optionally specify the parameters for the index.
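Switching between index types could be wrapped as in the sketch below. It assumes the store exposes `drop_index()` alongside `create_index()`, and that the index types are named `"tsv"`, `"hnsw"`, and `"ivfflat"`; the `rebuild_index` helper itself is hypothetical. The `m` and `ef_construction` values in the usage comments are pgvector’s HNSW parameters, shown only as examples.

```python
ALLOWED_INDEX_TYPES = {"tsv", "hnsw", "ivfflat"}


def rebuild_index(store, index_type="tsv", drop_existing=True, **params):
    """Replace the index on the embedding column of `store`.

    Extra keyword arguments are forwarded as index parameters; leave them
    out to use the defaults.
    """
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"unknown index type: {index_type!r}")
    if drop_existing:
        store.drop_index()
    store.create_index(index_type=index_type, **params)


# Usage, assuming `db` is an initialized TimescaleVector store:
#   rebuild_index(db)                                   # default tsv index
#   rebuild_index(db, "hnsw", m=16, ef_construction=64) # pgvector HNSW
#   rebuild_index(db, "ivfflat")                        # pgvector IVFFLAT
```

Dropping the old index first keeps a single index on the embedding column, so it is clear which one serves your queries.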
4. Self Querying Retriever with Timescale Vector
Timescale Vector also supports the self-querying retriever functionality, which gives it the ability to query itself. Given a natural language query with a query statement and filters (single or composite), the retriever uses a query-constructing LLM chain to write a SQL query and then applies it to the underlying PostgreSQL database in the Timescale Vector vectorstore. For more on self-querying, see the docs. To illustrate self-querying with Timescale Vector, we’ll use the same gitlog dataset from Part 3.
5. Working with an existing TimescaleVector vectorstore
In the examples above, we created a vectorstore from a collection of documents. However, often we want to insert data into and query data from an existing vectorstore. Let’s see how to initialize, add documents to, and query an existing collection of documents in a TimescaleVector vector store. To work with an existing Timescale Vector store, we need to know the name of the table we want to query (COLLECTION_NAME) and the URL of the cloud PostgreSQL database (SERVICE_URL).
To load new data into the table, use the add_document() function. This function takes a list of documents and a list of metadata. The metadata must contain a unique id for each document.
If you want your documents to be associated with the current date and time, you do not need to create a list of ids. A uuid will be automatically generated for each document.
If you want your documents to be associated with a past date and time, you can create a list of ids using the uuid_from_time function in the timescale-vector Python library, as shown in Section 2 above. This function takes a datetime object and returns a uuid with the date and time encoded in the uuid.
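Adding documents with past timestamps could then look like the sketch below. It assumes the store’s `add_documents` method accepts an `ids` keyword, as LangChain vector stores generally do; the `add_with_timestamps` helper is hypothetical, and the uuid function is passed in so that you can supply `uuid_from_time` from the timescale-vector library.

```python
def add_with_timestamps(store, docs, timestamps, uuid_from_time):
    """Add `docs` to `store`, giving each one a uuid derived from its timestamp.

    `uuid_from_time` is any callable that maps a datetime to a uuid.
    """
    docs = list(docs)
    timestamps = list(timestamps)
    if len(docs) != len(timestamps):
        raise ValueError("need exactly one timestamp per document")
    # Store the uuids as strings so they can also be kept in metadata.
    ids = [str(uuid_from_time(ts)) for ts in timestamps]
    store.add_documents(docs, ids=ids)
    return ids


# Usage, assuming the timescale-vector package is installed:
#   from timescale_vector import client
#   add_with_timestamps(vectorstore, docs, commit_times, client.uuid_from_time)
```

Returning the ids makes it easy to delete or update the same documents later.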
Deleting Data
You can delete data by uuid or by a filter on the metadata.
Overriding a vectorstore
If you have an existing collection, you can override it by calling from_documents and setting pre_delete_collection = True.
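As a sketch, overriding a collection might be wrapped like this. It assumes the `langchain_community` import path for `TimescaleVector`; the `overwrite_collection` helper and its `store_cls` parameter are hypothetical conveniences for this example.

```python
def overwrite_collection(docs, embeddings, collection_name, service_url,
                         store_cls=None):
    """Recreate `collection_name` from `docs`, discarding its previous contents."""
    if store_cls is None:
        # Imported lazily; assumes the langchain-community package is installed.
        from langchain_community.vectorstores.timescalevector import (
            TimescaleVector as store_cls,
        )

    return store_cls.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name=collection_name,
        service_url=service_url,
        pre_delete_collection=True,  # delete the existing collection first
    )
```

Because this deletes the existing data, it is best kept out of code paths that run automatically.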