More test cases and updated README

timescale · Feb 28, 2025 · 06a5b70
1 parent d041fee
commit 06a5b70
Showing 2 changed files with 180 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,3 @@
-
 <p></p>
 <div align=center>
 
@@ -13,6 +12,7 @@
 pgvectorscale complements [pgvector][pgvector], the open-source vector data extension for PostgreSQL, and introduces the following key innovations for pgvector data:
 - A new index type called StreamingDiskANN, inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm, based on research from Microsoft.
 - Statistical Binary Quantization: developed by Timescale researchers, This compression method improves on standard Binary Quantization.
+- Label-based filtered vector search: based on Microsoft's Filtered DiskANN research, this allows you to combine vector similarity search with label filtering for more precise and efficient results.
 
 On a benchmark dataset of 50 million Cohere embeddings with 768 dimensions
 each, PostgreSQL with `pgvector` and `pgvectorscale` achieves **28x lower p95
@@ -163,12 +163,128 @@ To enable pgvectorscale:
     SELECT *
     FROM document_embedding
     ORDER BY embedding <=> $1
-    LIMIT 10
+    LIMIT 10;
     ```
 
     Note: pgvectorscale currently supports: cosine distance (`<=>`) queries, for indices created with `vector_cosine_ops`; L2 distance (`<->`) queries, for indices created with `vector_l2_ops`; and inner product (`<#>`) queries, for indices created with `vector_ip_ops`.  This is the same syntax used by `pgvector`.  If you would like additional distance types,
     [create an issue](https://github.com/timescale/pgvectorscale/issues).  (Note: inner product indices are not compatible with plain storage.)
 
+## Filtered Vector Search
+
+pgvectorscale supports combining vector similarity search with metadata filtering. There are two approaches to filtering:
+
+1. **Label-based filtering with the diskann index**: This provides optimized performance for filtering by labels.
+2. **Arbitrary WHERE clause filtering**: This uses post-filtering after the vector search.
+
+The label-based filtering implementation is based on the [Filtered DiskANN](https://dl.acm.org/doi/10.1145/3543507.3583552) approach developed by Microsoft researchers, which enables efficient filtered vector search while maintaining high recall.
+
+### Label-based Filtering with diskann
+
+For the index to filter by label during the search, include the labels column in the index definition:
+
+1. Create a table with an embedding column and a labels array:
+
+    ```postgresql
+    CREATE TABLE documents (
+        id SERIAL PRIMARY KEY,
+        embedding VECTOR(1536),
+        labels INTEGER[],  -- Array of category labels
+        status TEXT,
+        created_at TIMESTAMPTZ
+    );
+    ```
+
+2. Create a StreamingDiskANN index on the embedding column, including the labels column:
+
+    ```postgresql
+    CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels);
+    ```
+
+3. Perform label-filtered vector searches using the `&&` operator (array overlap):
+
+    ```postgresql
+    -- Find similar documents with specific labels
+    SELECT * FROM documents
+    WHERE labels && ARRAY[1, 3]  -- Documents with label 1 OR 3
+    ORDER BY embedding <=> '[...]'
+    LIMIT 10;
+    ```
+
+    The index directly supports this type of filtering, providing significantly better performance than post-filtering.
+
+#### Giving Semantic Meaning to Labels
+
+While the labels must be stored as integers in the array for the index to work efficiently, you can give them semantic meaning by relating them to a separate labels table:
+
+1. Create a labels table with meaningful descriptions:
+
+    ```postgresql
+    CREATE TABLE label_definitions (
+        id INTEGER PRIMARY KEY,
+        name TEXT,
+        description TEXT,
+        attributes JSONB  -- Can store additional metadata about the label
+    );
+
+    -- Insert some label definitions
+    INSERT INTO label_definitions (id, name, description, attributes) VALUES
+    (1, 'science', 'Scientific content', '{"domain": "academic", "confidence": 0.95}'),
+    (2, 'technology', 'Technology-related content', '{"domain": "technical", "confidence": 0.92}'),
+    (3, 'business', 'Business and finance content', '{"domain": "commercial", "confidence": 0.88}');
+    ```
+
+2. When inserting documents, use the appropriate label IDs:
+
+    ```postgresql
+    -- Insert a document with science and technology labels
+    INSERT INTO documents (embedding, labels)
+    VALUES ('[...]', ARRAY[1, 2]);
+    ```
+
+3. When querying, you can join with the labels table to work with meaningful names:
+
+    ```postgresql
+    -- Find similar science documents and include label information
+    SELECT d.*, array_agg(l.name) AS label_names
+    FROM documents d
+    JOIN label_definitions l ON l.id = ANY(d.labels)
+    WHERE d.labels && ARRAY[1]  -- Science label
+    GROUP BY d.id, d.embedding, d.labels, d.status, d.created_at
+    ORDER BY d.embedding <=> '[...]'
+    LIMIT 10;
+    ```
+
+4. You can also convert between label names and IDs when filtering:
+
+    ```postgresql
+    -- Find documents with specific label names
+    SELECT d.*
+    FROM documents d
+    WHERE d.labels && (
+        SELECT array_agg(id)
+        FROM label_definitions
+        WHERE name IN ('science', 'business')
+    )
+    ORDER BY d.embedding <=> '[...]'
+    LIMIT 10;
+    ```
+
+This approach gives you the performance benefits of integer-based label filtering while still allowing you to work with semantically meaningful labels in your application.
+
+### Arbitrary WHERE Clause Filtering
+
+You can also use any PostgreSQL WHERE clause with vector search, but these conditions will be applied as post-filtering:
+
+```postgresql
+-- Find similar documents with specific status and date range
+SELECT * FROM documents
+WHERE status = 'active' AND created_at > '2024-01-01'
+ORDER BY embedding <=> '[...]'
+LIMIT 10;
+```
+
+For these arbitrary conditions, the vector search happens first, and then the WHERE conditions are applied to the results. For best performance with frequently used filters, consider using the label-based approach described above.
+
 ## Tuning
 
 The StreamingDiskANN index comes with **smart defaults** but also the ability to customize its behavior. There are two types of parameters: index build-time parameters that are specified when an index is created and query-time parameters that can be tuned when querying an index.
@@ -197,6 +313,13 @@ CREATE INDEX document_embedding_idx ON document_embedding
 USING diskann (embedding) WITH(num_neighbors=50);
 ```
 
+An example of creating an index with label-based filtering:
+
+```sql
+CREATE INDEX document_embedding_idx ON document_embedding
+USING diskann (embedding vector_cosine_ops, labels);
+```
+
 #### StreamingDiskANN query-time parameters
 
 You can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting `diskann.query_rescore` to fine-tune accuracy.
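
Review note: the `labels && ARRAY[...]` filter in the README examples uses PostgreSQL's array overlap operator, which matches a row when it shares at least one label with the query array. The sketch below models only that predicate in plain Rust, using the three rows inserted by `test_tiny_labeled_index`, so the expected counts in that test can be checked by hand. The helper names `labels_overlap` and `count_matching` are hypothetical and are not part of pgvectorscale; no vector search or index behavior is modeled here.

```rust
/// Returns true when the two label sets share at least one element.
/// This mirrors the semantics of PostgreSQL's `&&` (array overlap) operator;
/// it is an illustration only, not pgvectorscale code.
fn labels_overlap(row_labels: &[i32], query_labels: &[i32]) -> bool {
    row_labels.iter().any(|label| query_labels.contains(label))
}

/// Counts the rows whose labels overlap the query labels.
fn count_matching(rows: &[Vec<i32>], query_labels: &[i32]) -> usize {
    rows.iter()
        .filter(|row| labels_overlap(row, query_labels))
        .count()
}

fn main() {
    // Same label arrays as the rows inserted in `test_tiny_labeled_index`.
    let rows = vec![vec![1, 2], vec![1, 3], vec![2, 3]];

    // A few of the label filters exercised by the test.
    let queries: [&[i32]; 5] = [&[1], &[1, 3], &[4], &[1, 4], &[4, 1]];
    for query in queries.iter() {
        println!(
            "labels && {:?} matches {} row(s)",
            query,
            count_matching(&rows, query)
        );
    }
}
```

Because overlap only asks whether the two sets intersect, the order of the query labels does not matter, and an unknown label such as `4` contributes nothing to the match.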

diff --git a/pgvectorscale/src/access_method/build.rs b/pgvectorscale/src/access_method/build.rs
@@ -1218,6 +1218,61 @@ pub mod tests {
         Ok(())
     }
 
+    #[pg_test]
+    pub unsafe fn test_tiny_labeled_index() -> spi::Result<()> {
+        Spi::run(
+            "CREATE TABLE test(embedding vector(3), labels integer[]);
+
+            CREATE INDEX idxtest
+                  ON test
+               USING diskann(embedding, labels)
+                WITH (num_neighbors=15, search_list_size=10);
+
+            INSERT INTO test(embedding, labels) VALUES ('[1,2,3]', '{1,2}'), ('[4,5,6]', '{1,3}'), ('[7,8,10]', '{2,3}');
+            ",
+        )?;
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+                WITH cte as (select * from test order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(3, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+                WITH cte as (select * from test where labels && '{1}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(2, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+                WITH cte as (select * from test where labels && '{2}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(2, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+                WITH cte as (select * from test where labels && '{3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(2, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+        WITH cte as (select * from test where labels && '{1,3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(3, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+        WITH cte as (select * from test where labels && '{1,2,3}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(3, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+        WITH cte as (select * from test where labels && '{4}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(0, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+        WITH cte as (select * from test where labels && '{1,4}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(2, res.unwrap());
+
+        let res: Option<i64> = Spi::get_one("   set enable_seqscan = 0;
+        WITH cte as (select * from test where labels && '{4,1}' order by embedding <=> '[0,0,0]') SELECT count(*) from cte;")?;
+        assert_eq!(2, res.unwrap());
+
+        Spi::run("DROP TABLE test;")?;
+
+        Ok(())
+    }
+
     #[pg_test]
     pub unsafe fn test_labeled_index() -> spi::Result<()> {
         let index_options = "num_neighbors=15, search_list_size=10";