More test cases and updated README
tjgreen42 committed Feb 28, 2025
1 parent d041fee commit 06a5b70
Showing 2 changed files with 180 additions and 2 deletions.
127 changes: 125 additions & 2 deletions README.md
@@ -1,4 +1,3 @@

<p></p>
<div align=center>

@@ -13,6 +12,7 @@
pgvectorscale complements [pgvector][pgvector], the open-source vector data extension for PostgreSQL, and introduces the following key innovations for pgvector data:
- A new index type called StreamingDiskANN, inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm, based on research from Microsoft.
- Statistical Binary Quantization: developed by Timescale researchers, this compression method improves on standard Binary Quantization.
- Label-based filtered vector search: based on Microsoft's Filtered DiskANN research, this allows you to combine vector similarity search with label filtering for more precise and efficient results.

On a benchmark dataset of 50 million Cohere embeddings with 768 dimensions
each, PostgreSQL with `pgvector` and `pgvectorscale` achieves **28x lower p95
@@ -163,12 +163,128 @@ To enable pgvectorscale:
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;
```
Note: pgvectorscale currently supports: cosine distance (`<=>`) queries, for indices created with `vector_cosine_ops`; L2 distance (`<->`) queries, for indices created with `vector_l2_ops`; and inner product (`<#>`) queries, for indices created with `vector_ip_ops`. This is the same syntax used by `pgvector`. If you would like additional distance types,
[create an issue](https://github.com/timescale/pgvectorscale/issues). (Note: inner product indices are not compatible with plain storage.)
## Filtered Vector Search
pgvectorscale supports combining vector similarity search with metadata filtering. There are two approaches to filtering:
1. **Label-based filtering with the diskann index**: This provides optimized performance for filtering by labels.
2. **Arbitrary WHERE clause filtering**: This uses post-filtering after the vector search.
The label-based filtering implementation is based on the [Filtered DiskANN](https://dl.acm.org/doi/10.1145/3543507.3583552) approach developed by Microsoft researchers, which enables efficient filtered vector search while maintaining high recall.
### Label-based Filtering with diskann
For index-accelerated label filtering, include the labels column in the index definition when you create the index:
1. Create a table with an embedding column and a labels array:
```postgresql
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
embedding VECTOR(1536),
labels INTEGER[], -- Array of category labels
status TEXT,
created_at TIMESTAMPTZ
);
```
2. Create a StreamingDiskANN index on the embedding column, including the labels column:
```postgresql
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels);
```
3. Perform label-filtered vector searches using the `&&` operator (array overlap):
```postgresql
-- Find similar documents with specific labels
SELECT * FROM documents
WHERE labels && ARRAY[1, 3] -- Documents with label 1 OR 3
ORDER BY embedding <=> '[...]'
LIMIT 10;
```
The index directly supports this type of filtering, providing significantly better performance than post-filtering.
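As a rough mental model of the `&&` predicate used above, the sketch below reproduces array-overlap semantics in plain Rust: a row matches when it shares at least one label with the query array. The names `labels_overlap` and `count_matching` are hypothetical and exist only for illustration; this is not how the index evaluates the filter internally.

```rust
use std::collections::HashSet;

/// Returns true when the two label lists share at least one element,
/// mirroring the meaning of PostgreSQL's array overlap operator `&&`.
/// (Hypothetical helper for illustration only.)
fn labels_overlap(row_labels: &[i32], query_labels: &[i32]) -> bool {
    let query: HashSet<i32> = query_labels.iter().copied().collect();
    row_labels.iter().any(|label| query.contains(label))
}

/// Counts the rows whose labels overlap the query labels.
fn count_matching(rows: &[Vec<i32>], query_labels: &[i32]) -> usize {
    rows.iter()
        .filter(|row| labels_overlap(row, query_labels))
        .count()
}
```

With rows labelled `{1,2}`, `{1,3}` and `{2,3}`, `count_matching(&rows, &[1])` counts two rows and `count_matching(&rows, &[4])` counts none, because the order of labels in the query array does not matter and a single shared label is enough.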
#### Giving Semantic Meaning to Labels
While the labels must be stored as integers in the array for the index to work efficiently, you can give them semantic meaning by relating them to a separate labels table:
1. Create a labels table with meaningful descriptions:
```postgresql
CREATE TABLE label_definitions (
id INTEGER PRIMARY KEY,
name TEXT,
description TEXT,
attributes JSONB -- Can store additional metadata about the label
);
-- Insert some label definitions
INSERT INTO label_definitions (id, name, description, attributes) VALUES
(1, 'science', 'Scientific content', '{"domain": "academic", "confidence": 0.95}'),
(2, 'technology', 'Technology-related content', '{"domain": "technical", "confidence": 0.92}'),
(3, 'business', 'Business and finance content', '{"domain": "commercial", "confidence": 0.88}');
```
2. When inserting documents, use the appropriate label IDs:
```postgresql
-- Insert a document with science and technology labels
INSERT INTO documents (embedding, labels)
VALUES ('[...]', ARRAY[1, 2]);
```
3. When querying, you can join with the labels table to work with meaningful names:
```postgresql
-- Find similar science documents and include label information
SELECT d.*, array_agg(l.name) as label_names
FROM documents d
JOIN label_definitions l ON l.id = ANY(d.labels)
WHERE d.labels && ARRAY[1] -- Science label
GROUP BY d.id, d.embedding, d.labels, d.status, d.created_at
ORDER BY d.embedding <=> '[...]'
LIMIT 10;
```
4. You can also convert between label names and IDs when filtering:
```postgresql
-- Find documents with specific label names
SELECT d.*
FROM documents d
WHERE d.labels && (
SELECT array_agg(id)
FROM label_definitions
WHERE name IN ('science', 'business')
)
ORDER BY d.embedding <=> '[...]'
LIMIT 10;
```
This approach gives you the performance benefits of integer-based label filtering while still allowing you to work with semantically meaningful labels in your application.
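If the application prefers to resolve label names before building the query, the lookup can be sketched as follows. This is a minimal illustration that assumes the contents of `label_definitions` have already been loaded into memory; `example_definitions` and `resolve_label_ids` are hypothetical names, not part of pgvectorscale.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory copy of the `label_definitions` rows shown above.
fn example_definitions() -> HashMap<&'static str, i32> {
    HashMap::from([("science", 1), ("technology", 2), ("business", 3)])
}

/// Maps label names to their integer IDs, failing on the first unknown name
/// so that a typo does not silently widen or narrow the filter.
fn resolve_label_ids(
    definitions: &HashMap<&str, i32>,
    names: &[&str],
) -> Result<Vec<i32>, String> {
    names
        .iter()
        .map(|name| {
            definitions
                .get(name)
                .copied()
                .ok_or_else(|| format!("unknown label: {name}"))
        })
        .collect()
}
```

The resulting IDs can then be bound as the integer array on the right-hand side of `&&`.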
### Arbitrary WHERE Clause Filtering
You can also use any PostgreSQL WHERE clause with vector search, but these conditions will be applied as post-filtering:
```postgresql
-- Find similar documents with specific status and date range
SELECT * FROM documents
WHERE status = 'active' AND created_at > '2024-01-01'
ORDER BY embedding <=> '[...]'
LIMIT 10;
```
For these arbitrary conditions, the vector search happens first, and then the WHERE conditions are applied to the results. For best performance with frequently used filters, consider using the label-based approach described above.
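To see why post-filtering costs more when the predicate is selective, consider the conceptual sketch below. It assumes candidates arrive in ascending distance order and that the predicate is checked only after each candidate is produced; a brute-force sort stands in for the index traversal, and `Candidate` and `post_filter_search` are hypothetical names rather than pgvectorscale internals.

```rust
/// A row with a precomputed distance to the query vector and one
/// filterable attribute (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: u32,
    distance: f32,
    active: bool,
}

/// Walks candidates nearest-first, applies the predicate to each one, and
/// stops once `limit` rows have passed. Returns the matching ids together
/// with the number of candidates that had to be examined.
fn post_filter_search(mut candidates: Vec<Candidate>, limit: usize) -> (Vec<u32>, usize) {
    // The sort stands in for an index that yields nearest neighbors first.
    candidates.sort_by(|a, b| a.distance.total_cmp(&b.distance));

    let mut matches = Vec::new();
    let mut examined = 0;
    for candidate in &candidates {
        if matches.len() == limit {
            break;
        }
        examined += 1;
        if candidate.active {
            matches.push(candidate.id);
        }
    }
    (matches, examined)
}
```

The fewer rows satisfy the predicate, the more candidates are examined per returned row, which is the overhead a filter-aware index aims to avoid.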
## Tuning
The StreamingDiskANN index comes with **smart defaults** but also the ability to customize its behavior. There are two types of parameters: index build-time parameters that are specified when an index is created and query-time parameters that can be tuned when querying an index.
@@ -197,6 +313,13 @@ CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding) WITH(num_neighbors=50);
```
An example of creating an index with label-based filtering:
```sql
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding vector_cosine_ops, labels);
```
#### StreamingDiskANN query-time parameters
You can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting `diskann.query_rescore` to fine-tune accuracy.
55 changes: 55 additions & 0 deletions pgvectorscale/src/access_method/build.rs
@@ -1218,6 +1218,61 @@ pub mod tests {
        Ok(())
    }

    #[pg_test]
    pub unsafe fn test_tiny_labeled_index() -> spi::Result<()> {
        Spi::run(
            "CREATE TABLE test(embedding vector(3), labels integer[]);
            CREATE INDEX idxtest
            ON test
            USING diskann(embedding)
            WITH (num_neighbors=15, search_list_size=10);
            INSERT INTO test(embedding, labels) VALUES ('[1,2,3]', '{1,2}'), ('[4,5,6]', '{1,3}'), ('[7,8,10]', '{2,3}');
            ",
        )?;

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(3, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{1}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(2, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{2}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(2, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(2, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{1,3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(3, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{1,2,3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(3, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{4}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(0, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{1,4}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(2, res.unwrap());

        let res: Option<i64> = Spi::get_one(" set enable_seqscan = 0;
            WITH cte as (select * from test where labels && '{4,1}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
        assert_eq!(2, res.unwrap());

        Spi::run("DROP TABLE test;")?;

        Ok(())
    }

    #[pg_test]
    pub unsafe fn test_labeled_index() -> spi::Result<()> {
        let index_options = "num_neighbors=15, search_list_size=10";