From 76185e2ef7014f9421e7c85a68f91051626ea594 Mon Sep 17 00:00:00 2001 From: PlanetScale Actions Bot Date: Mon, 21 Oct 2024 16:42:59 +0000 Subject: [PATCH] docs: upstream https://github.com/planetscale/www/commit/32921a2d7b8bf68956e326bff0bdc4c7dbd494c9 --- docs/navigation/standard.md | 14 ++ .../vectors.md => vectors/overview.md} | 111 ++------- docs/vectors/reference.md | 158 ++++++++++++ docs/vectors/terminology-and-concepts.md | 164 +++++++++++++ docs/vectors/use-cases.md | 225 ++++++++++++++++++ docs/vectors/using-with-an-orm.md | 210 ++++++++++++++++ 6 files changed, 791 insertions(+), 91 deletions(-) rename docs/{concepts/vectors.md => vectors/overview.md} (61%) create mode 100644 docs/vectors/reference.md create mode 100644 docs/vectors/terminology-and-concepts.md create mode 100644 docs/vectors/use-cases.md create mode 100644 docs/vectors/using-with-an-orm.md diff --git a/docs/navigation/standard.md b/docs/navigation/standard.md index 977d7a2f..054fd816 100644 --- a/docs/navigation/standard.md +++ b/docs/navigation/standard.md @@ -227,6 +227,20 @@ - label: 'Creating sequence tables' slug: '/docs/sharding/sequence-tables' +- label: 'Vectors' + icon: 'vectors' + items: + - label: 'Vectors overview' + slug: '/docs/vectors/overview' + - label: 'Concepts and terminology' + slug: '/docs/vectors/terminology-and-concepts' + - label: 'Use cases' + slug: '/docs/vectors/use-cases' + - label: 'Using with an ORM' + slug: '/docs/vectors/using-with-an-orm' + - label: 'Reference' + slug: '/docs/vectors/reference' + - label: 'Security and access' icon: 'security' items: diff --git a/docs/concepts/vectors.md b/docs/vectors/overview.md similarity index 61% rename from docs/concepts/vectors.md rename to docs/vectors/overview.md index 6d2d6ab4..1fdbdb44 100644 --- a/docs/concepts/vectors.md +++ b/docs/vectors/overview.md @@ -1,32 +1,22 @@ --- title: 'Vector search and storage' subtitle: 'Learn how to use PlanetScale vector search and storage.' -date: '2024-09-30' +date: '2024-10-29' --- -Welcome to the PlanetScale vectors beta! The goal of this private beta period is to get the product in the hands of our customers so you can build alongside us while we continue to improve the feature — with your feedback. +Welcome to the PlanetScale vectors beta! The goal of this beta period is to get the product in the hands of our customers so you can build alongside us while we continue to improve the feature — with your feedback. -If at any point you experience issues with vectors while using the beta, we highly encourage you to get in touch. Your feedback is extremely valuable during this beta period, so don’t hesitate to reach out. You can [submit a support ticket](/contact) to relay any feedback or issues. We also have a private [Discord](https://discord.com/invite/pDUGAAFEJx) channel for the vectors beta. If you'd like to added, fill out our [contact form](/contact). +If at any point you experience issues with vectors while using the beta, we highly encourage you to get in touch. Your feedback is extremely valuable during this beta period, so don’t hesitate to reach out. You can [submit a support ticket](/contact) to relay any feedback or issues. We also have a vectors channel in our [Discord](https://discord.com/invite/pDUGAAFEJx) where you can ask questions and share feedback. {% callout type="warning" %} This feature is still beta quality and is not intended for use in production workloads. We recommend limiting use of PlanetScale vector search and storage to testing and evaluation purposes only. PlanetScale vectors is considered a Beta Feature as noted in our Agreement with you, and any use of PlanetScale vectors is in accordance with the Agreement. {% /callout %} -This documentation outlines how to get started with vectors, known issues and limitations, some example usage, and how to share feedback. - -## Known issues and limitations - -- Building a **one-shot** **index** (an index built in bulk on an existing set of vectors) requires enough RAM to fit all of the vectors in memory simultaneously. This limitation will be lifted by the time the beta is complete. -- **Incremental indexes** (indexes that begin empty and update as new vectors are added) function correctly, but are significantly slower to build compared to a one-shot index. Disk usage is much higher due to potentially very high InnoDB blob fragmentation issues, so it's much easier to run out of disk space. -- **Online DDL** and deploy requests do not work well yet, because they build incremental indexes. Please use direct DDL for now. We plan to improve this significantly during the beta. -- Since this is a beta, there may be bugs, performance, and security issues that have not yet been uncovered. We also may need to change query or DDL syntax before the feature is generally available. Don’t run this on a production database. -- Once you opt a branch into the vectors feature, that branch must continue to run a vectors-enabled version of MySQL. You can remove your vector columns/tables, but you cannot downgrade that branch to its prior version of MySQL. - ## Overview -Vectors are a data structure that captures opaque semantic meaning about something and allows a database to search for resources by similarity based on this opaque meaning. As a data type, a vector is just an array of floating-point numbers. Those numbers are generated by submitting some resource — a word, a string, a document, an image, audio, etc — to an *embedding*¹, which converts the resource to a vector. +Vectors are a data structure that captures opaque semantic meaning about something and allows a database to search for resources by similarity based on this opaque meaning. As a data type, a vector is just an array of floating-point numbers. Those numbers are generated by submitting some resource — a word, a string, a document, an image, audio, etc. — to an _embedding model_,¹ which converts the resource to a vector. -A vector database stores those vector embeddings alongside other relational data. In practice, that might look like a table with columns for ID (a primary key), content (as a BLOB or VARCHAR), and a vector. Then it becomes possible to perform queries that find content similar to a search query, like so: +A vector database stores those vector embeddings alongside other relational data. In practice, that might look like a table with columns for ID (a primary key), content (as a BLOB or VARCHAR), and a vector. Then it becomes possible to perform queries that find content similar to a search vector, like so: ```sql SELECT id @@ -35,17 +25,15 @@ SELECT id LIMIT 10; ``` -Possible applications include recommendation engines that show products similar to a user's purchase history, or search engines that find documents or other resources based on natural-language queries. +Possible applications include recommendation engines that show products similar to a user's purchase history, or search engines that find documents or other resources based on natural-language queries. Read our [applications of vector databases](/docs/vectors/use-cases) docs to learn more about how vector databases can be applied in the real world. PlanetScale has added support for vector columns, vector distance functions, and vector indexes, as described below. -[¹]: PlanetScale does not currently provide an embedding service. You can find several good cloud-based options like OpenAI or AWS Titan, or local options like Python sentence_transformers. +[¹]: PlanetScale does not currently provide an embedding service. You can find several good cloud-based options like OpenAI or AWS Titan, or local options like Python `sentence_transformers`. ## Enrolling in the PlanetScale vectors beta -This is currently a closed beta. To access the beta, you must have received an invite. If you did not receive an invite and wish to join, or you would like to enroll a different organization, please fill out our [contact form](/contact). - -PlanetScale has a series of MySQL images that have been extended with vector support. Vector support can be enabled on a per-branch basis, however, you have to first opt-in to the beta from your database settings page. After that, you will choose which branch(es) you’d like to opt-in to the vectors beta. The branch will be updated to the vectors-enabled version of MySQL at the time of opting the branch in. +PlanetScale has a custom version of MySQL that has been extended with vector support. Vector support can be enabled on a per-branch basis, however, you have to first opt-in to the beta from your database settings page. After that, you will choose which branch(es) you’d like to opt-in to the vectors beta. The branch will be updated to the vectors-enabled version of MySQL at the time of opting the branch in. To enable the vector support on a branch: 1. Click on the database that you’d like to enroll in the vectors beta. @@ -56,7 +44,7 @@ To enable the vector support on a branch: 6. Click on the small gear icon underneath the “Connect” button on the right. 7. Click the toggle next to “Enable vectors”. 8. Click “Save branch settings”. -9. The branch will upgrade asynchronously to the correct version of MySQL, which may take 30-60 minutes. You can confirm when this process is complete by executing a “SELECT @@version” query. The vector-enabled version is 8.0.37. +9. The branch will upgrade asynchronously to the correct version of MySQL, which may take 30-60 minutes. While this happens, the database dashboard will show an "Enabling vectors" badge, which changes to a "Vector-enabled" badge when the upgrade is complete. ## Adding vector columns @@ -88,12 +76,12 @@ SELECT id, DISTANCE(TO_VECTOR('[3, 3, 3, 3]'), embedding, 'L2_SQUARED') AS d Use an `EXPLAIN` query to confirm that the query uses the new index. This query actually won’t use the index until the table has around 50 rows in it. -Note that vector indexes provide approximate results. An unindexed query with LIMIT 100 returns exactly the 100 rows closest to the reference vector, after performing a full table scan and a sort. An indexed query returns, on average, about 100 of the top 105 (around 95%) of the rows closest to the reference vector, but much faster than a full table scan. This is expected, because all efficient vector indexes, including PlanetScale’s vector indexes, perform approximate nearest neighbor (ANN) searches. +Vector indexes provide approximate results. An unindexed query with LIMIT 100 returns exactly the 100 rows closest to the reference vector, after performing a full table scan and a sort. An indexed query returns, on average, about 100 of the top 105 (around 95%) of the rows closest to the reference vector, but much faster than a full table scan. This is expected, because all efficient vector indexes, including PlanetScale’s vector indexes, perform approximate nearest neighbor (ANN) searches. -If you are adding vectors from an app, you may want to use prepared statements, although we do not recommend it. `TO_VECTOR` works in that setting, but serializing the vectors on the client side and uploading them as binary is faster. The serialized format is IEEE-754 32-bit floats, which you can serialize with code like this: +If you are adding vectors to your database from an application, you may want to use prepared statements, although we do not recommend it. `TO_VECTOR` works in that setting, but serializing the vectors on the client side and uploading them as binary is faster. The serialized format is IEEE-754 32-bit floats, which you can serialize with code like this: -- Python: `struct.pack('ffff', *float_array)` -- Ruby: `float_array.pack(“ffff”)` +- Python: `struct.pack(f'{len(float_array)}f', *float_array)` +- Ruby: `float_array.pack(“f*”)` - Rust: `float_array.map(|f| f.to_ne_bytes()).flatten().collect()` You can use the resulting blob (which will be 4 bytes times the number of dimensions in the vector) in an `INSERT` statement like this: @@ -147,76 +135,17 @@ This query selects the ten products from a given seller that are closest to some The MySQL query planner chooses whether to use the vector index or some other index automatically based on the query and based on the contents of the table, to maximize query performance. Use `EXPLAIN` on any given query to see how it will execute. -As part of the private beta, we’re looking for feedback on how well MySQL plans vector queries. If you believe you’ve hit an edge case or something looks wrong, please [open a support ticket](/contact) and let us know. - -## Vector function reference - -**`TO_VECTOR(string)`** -Converts a text string to a binary vector value. The text string is an array of floating point numbers in JSON format. - -- alias `STRING_TO_VECTOR(string)` -- Example: `SELECT TO_VECTOR('[1, 2.78, 3.14]');` - - `-> 0x0000803F85EB3140C3F54840` - -**`FROM_VECTOR(string)`** -Converts a binary vector to a human-readable string. - -- alias `VECTOR_TO_STRING(vector)` -- Example: `SELECT FROM_VECTOR(0x0000803F85EB3140C3F54840);` - - `-> [1.00000e+00,2.78000e+00,3.14000e+00]` - -**`VECTOR_DIM(string)`** -Calculates the dimension of a vector - -- Example: `SELECT VECTOR_DIM(TO_VECTOR('[1,2,3]')); -> 3` - -**`DISTANCE(vector1, vector2, [metric])`** -Calculates the distance between two vectors. The optional third parameter specifies which distance metric is to be used: `DOT`, `COSINE`, L2 (`EUCLIDEAN)`, or L2_SQUARED (`EUCLIDEAN_SQUARED)`. If the distance metric is omitted, it defaults to `DOT`. - -- `DOT` means the dot product. Example: +As part of the beta, we’re looking for feedback on how well MySQL plans vector queries. If you believe you’ve hit an edge case or something looks wrong, please [open a support ticket](/contact) and let us know. - `SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'DOT');` +## Known issues and limitations for the beta - `-> 13` - -- `COSINE` means the cosine of the angle between the two vectors, which is the same as the dot product divided by the magnitude of the two vectors. Example: - - `SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'COSINE');` - - `-> 0.09204061549954834` - -- `L2` (or `EUCLIDEAN`) means the length of a line between the ends of the vectors. Example: - - `SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'L2');` - - `-> 4.47213595499958` - -- `L2_SQUARED` (or `EUCLIDEAN_SQUARED`) is the square of the Euclidean distance - - `SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'L2_SQUARED');` - - `-> 20` - -**`DISTANCE_DOT(vector1, vector2)`** -Is the same as `DISTANCE(vector1, vector2, 'DOT')` - -**`DISTANCE_COSINE(vector1, vector2)`** -Is the same as `DISTANCE(vector1, vector2, 'COSINE')` - -**`DISTANCE_L2(vector1, vector2)`** -Is the same as `DISTANCE(vector1, vector2, 'L2')` - -- alias: `DISTANCE_EUCLIDEAN(vector1, vector2)` - -**`DISTANCE_L2_SQUARED(vector1, vector2)`** -Is the same as `DISTANCE(vector1, vector2, 'L2_SQUARED')` - -- alias: `DISTANCE_EUCLIDEAN_SQUARED(vector1, vector2)` +- Building a **one-shot** **index** (an index built in bulk on an existing set of vectors) requires enough RAM to fit roughly half of all the vector dataset in memory simultaneously. This will be improved throughout the beta. +- **Incremental indexes** (indexes that begin empty and update as new vectors are added) function correctly, but are significantly slower to build compared to a one-shot index. Disk usage is much higher due to potentially very high InnoDB blob fragmentation issues, so it's much easier to run out of disk space. +- Since this is a beta, there may be bugs, performance, and security issues that have not yet been uncovered. We also may need to change query or DDL syntax before the feature is generally available. Don’t run this on a production database. +- Once you opt a branch into the vectors feature, that branch must continue to run a vectors-enabled version of MySQL. You can remove your vector columns/tables, but you cannot downgrade that branch to its prior version of MySQL. ## Feedback We want to make our vectors offering as reliable, fast, and feature-rich as possible. Feedback from our early users will help make this possible. If you encounter any issues, crashes, unexpected errors or poor performance, please [submit a support ticket](/contact). You are also welcome to reach out with general feedback and suggestions. -We also have a private [Discord](https://discord.com/invite/pDUGAAFEJx) channel for the vectors beta where you can ask questions, share feedback, and discuss what you’re working on. If you'd like to added, please fill out the [contact form](/contact). +We also have a [Discord](https://discord.com/invite/pDUGAAFEJx) channel for the vectors beta where you can ask questions, share feedback, and discuss what you’re working on. diff --git a/docs/vectors/reference.md b/docs/vectors/reference.md new file mode 100644 index 00000000..a040753e --- /dev/null +++ b/docs/vectors/reference.md @@ -0,0 +1,158 @@ +--- +title: 'Vector type and index reference' +subtitle: 'Learn how to use vectors in MySQL on PlanetScale' +label: 'Beta feature' +date: '2024-10-29' +--- + +{% callout type="note" %} +PlanetScale MySQL vectors is [still in beta](/docs/vectors/overview) and is not intended for use in production workloads. +{% /callout %} + +## Vector type + +PlanetScale MySQL provides a `VECTOR(X)` type that can be used to store vectors. +To add a vector column to a table, set it to type `VECTOR(X)` where `X` is the dimension of the vectors to be stored in this column. + +### Example + +```sql +CREATE TABLE t1 ( + id INT PRIMARY KEY auto_increment, + embedding VECTOR(4) +); +``` + +## Vector index + +PlanetScale MySQL provides a new `VECTOR INDEX` to facilitate fast and scalable approximate nearest neighbor (ANN) search on vector data. + +Statements that create a vector index may take optional parameters, which can be specified as JSON key-value pairs, via the `SECONDARY_ENGINE_ATTRIBUTE` variable. +There are two options that can be specified in the JSON: + +- `type`: specifies the algorithm used to build and query the vector index. + - Supported values: `spann` (more info on the [SPANN algorithm](#TODO-link-to-other-doc)) +- `distance` specifies the distance metric that queries will use. + - Supported values: + - `dot` for the dot product + - `cosine` for the cosine of the angle between the two vectors, which is the same as the dot product divided by the magnitude of the two vectors + - `l2` or `euclidean` for the length of a line between the ends of the vectors + - `l2_squared` or `euclidean_squared` for the square of the Euclidean distance. This is the default. + +The distance metric specified at index creation time must match the distance metric used at query time, or the index cannot be used, and MySQL will perform a full-table scan instead. + +### Examples + +```sql +CREATE /*vt+ QUERY_TIMEOUT_MS=0 */ + VECTOR INDEX embedding_index ON t1(embedding); +``` + +```sql +CREATE /*vt+ QUERY_TIMEOUT_MS=0 */ + VECTOR INDEX embedding_index ON t1(embedding) + SECONDARY_ENGINE_ATTRIBUTE='{"type":"spann", "distance":"cosine"}'; +``` + +## Vector functions + +PlanetScale MySQL includes several new functions for working with vectors. + +## `TO_VECTOR(string)` or `STRING_TO_VECTOR(string)` + +Converts a text string to a binary vector value. The text string is an array of floating point numbers in JSON format. + +### Example + +```sql +SELECT TO_VECTOR('[1, 2.78, 3.14]'); + -> 0x0000803F85EB3140C3F54840 +``` + +## `FROM_VECTOR(string)` or `VECTOR_TO_STRING(vector)` + +Converts a binary vector to a human-readable string. + +### Example + +```sql +SELECT FROM_VECTOR(0x0000803F85EB3140C3F54840); + -> [1.00000e+00,2.78000e+00,3.14000e+00] +``` + +## `VECTOR_DIM(string)` + +Calculates the dimension of a vector. + +### Example + +```sql +SELECT VECTOR_DIM(TO_VECTOR('[1,2,3]')); + -> 3 +``` + +## `DISTANCE(vector1, vector2, [metric])` + +Calculates the distance between `vector1` and `vector2`. +The optional third parameter specifies which distance metric is to be used: `DOT`, `COSINE`, `L2` (`EUCLIDEAN)`, or `L2_SQUARED` (`EUCLIDEAN_SQUARED)`. + +- `DOT` means the dot product. +- `COSINE` means the cosine of the angle between the two vectors, which is the same as the dot product divided by the magnitude of the two vectors. Example: +- `L2` (or `EUCLIDEAN`) means the length of a line between the ends of the vectors. Example: +- `L2_SQUARED` (or `EUCLIDEAN_SQUARED`) is the square of the Euclidean distance + +If the distance metric is omitted, it defaults to `DOT`. + +### Examples + +```sql +SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'DOT'); + -> 13 +``` + +```sql +SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'COSINE'); + -> 0.9079593845004517 +``` + +```sql +SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'L2'); + -> 4.47213595499958 +``` + +```sql +SELECT DISTANCE(TO_VECTOR('[1,2]'), TO_VECTOR('[5,4]'), 'L2_SQUARED'); + -> 20 +``` + +```sql +SELECT id, price, seller_id + FROM products + WHERE price < 20.0 + ORDER BY DISTANCE(TO_VECTOR('[1.2, 3.4, 5.6]'), embedding, 'L2_SQUARED') + LIMIT 10; +``` + +## `DISTANCE_DOT(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'DOT')` + +## `DISTANCE_COSINE(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'COSINE')` + +## `DISTANCE_L2(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'L2')` + +## `DISTANCE_EUCLIDEAN(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'L2')` + +## `DISTANCE_L2_SQUARED(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'L2_SQUARED')` + +## `DISTANCE_EUCLIDEAN_SQUARED(vector1, vector2)` + +Is the same as `DISTANCE(vector1, vector2, 'L2_SQUARED')` diff --git a/docs/vectors/terminology-and-concepts.md b/docs/vectors/terminology-and-concepts.md new file mode 100644 index 00000000..01c2640c --- /dev/null +++ b/docs/vectors/terminology-and-concepts.md @@ -0,0 +1,164 @@ +--- +title: 'Vector database terminology and concepts' +subtitle: 'Get familiar with the terminology and algorithms often referred to in the world of vector databases.' +label: 'Beta feature' +date: '2024-10-29' +--- + +{% callout type="note" %} +PlanetScale MySQL vectors is [still in beta](/docs/vectors/overview) and is not intended for use in production workloads. +{% /callout %} + +There are many concepts, algorithms and data structures that are used when discussing vector databases. +Here, we provide an overview of many of these concepts and describe the indexing technique that PlanetScale MySQL uses (SPANN). + +## General terminology + +Common terms used when discussing vector databases and indexes. + +#### Vector (Embedding) + +Vectors (more specifically, Embeddings) are a data structure that captures opaque semantic meaning about something and allows a database to search for resources by similarity based on this opaque meaning. +As a data type, a vector is just an array of floating-point numbers. +Those numbers are generated by submitting some resource — a word, a string, a document, an image, audio, etc — to an _embedding model_, which converts the resource to a vector. + +#### K-Nearest Neighbors (KNN) + +KNN refers to the K-Nearest Neighbors to some point P in a high-dimensional space. +With vector databases, we sometimes want to store millions or billions of high-dimensional vectors. +A common query on a vector database is to find the other vectors in our data set that are similar to some input search vector. +If the vectors are embeddings generated from an AI model, two vectors being similar (or near each other) in the high-dimensional space means that they have similar opaque meaning. +We can ask the vector database to give us the 100 closest vectors, which is a form of KNN search, where K = 100. + +#### Approximate Nearest Neighbors (ANN) + +ANN It is similar to KNN, but instead of finding the _exact_ K closest neighbors to some vector, we instead look for the _approximately_ K closest neighbors in the high-dimensional space. +This means that we might miss some of the _actual_ closest neighbors. +However, if we relax our requirement and use ANN instead of KNN, we can often get significant performance improvement using specialized vector search indexes, which are further discussed later on this page. + +We measure how good a job an algorithm does at producing nearest neighbors with ANN via a measure known as _recall_. + +#### Recall + +In the context of ANN search, the _recall_ is the percentage of the actual nearest neighbors found in an ANN search. +For example, say that we are performing a search on a large set of vectors and want to get the nearest 100 neighbors. + +If we perform KNN search, this will give us the exact 100 top searches. +Instead, we may perform ANN search. +If that ANN search returns 95 of the actual top 100 and then 5 results that are outside of the top 100, we would say the recall is 95%. +If instead it returned 80 of the actual top 100 and then 20 results outside, we say the recall was 80%. + +Most vector indexes provide ANN search. +Ideally, we'd like to perform these searches with high recall. +Often, there is a trade-off between speed and recall %. + +#### Product Quantization (PQ) + +PQ is a technique for compressing a vector into a more compact representation. +Many vector indexes use PQ to compress the vector data stored in the index to reduce the memory footprint. +Sometimes, the full vector representation will still be stored somewhere on disk. + +## Types of search algorithms and indexes + +There are many algorithms and indexes that have been developed over the years, each with their strengths and weaknesses. + +#### Brute-force search + +Brute-force search on an vector data set for nearest neighbors is one that does not use any index. +Instead, it compares the search vector to all other vectors to find the nearest neighbors. +The advantage of this is that is can produce exact KNN results. +The disadvantage is that it does not scale well beyond a few hundred or a few thousand vectors. +It is impractical for any large-scale data set. + +#### Inverted File Index (IVF) + +IVF is a type of index used to speed up ANN search. +In an IVF index, all of the vectors are partitioned into chunks of similar indexes. +The number of chunks to partition the data into is typically configurable. +In the image below, we have our vectors (blue dots) partitioned into 5 chunks (colored regions). +This shows a representation in 2D space, but is applicable to N-dimensional space as well. + +![IVF Visual](/assets/docs/concepts/vector-indexes/ivf.png) + +When a vector search is performed, the index finds the partition(s) that are most likely to contain similar vectors, and only performs similarity search on vectors in those. + +This technique allows for much faster search compared to a full scan of all vectors, as we eliminate much of the vector data from the search early in the process. +However, it might miss some similar vectors in the chunks that it skips, and therefore is only capable of performant ANN searches. + +- **Pros:** Speeds up searches, can be combined with other methods. +- **Cons:** Large data sets either cause slowdown or reduced recall, or both. + +#### Hierarchical Navigable Small World (HNSW) + +HNSW is one of the most commonly implemented vector indexes in modern vector databases. +This is because it provides efficient ANN search for small and medium-sized data sets, and is a widely-adopted algorithm. + +HNSW indexes map every vector in the data set onto a graph, with nodes representing each vector and edges between nodes that are near each-other in the high-dimensional space. + +HNSW builds multiple graph layers. +The bottom layer is the full similarity graph, and each level above is a sparser version of the graph below (in other words, some nodes and edges get pruned). + +![HNSW Visual](/assets/docs/concepts/vector-indexes/hnsw.png) + +HNSW begins its search for a vector at the top layer. +Since the graph is sparse, it can quickly navigate through and find the nearest similar vector. +It then drops down to the next level, and continues to search for the closest vector match in that region of the graph. +This process continues until it finds the closest match(es) in the lowest level. + +HNSW works really well for small and medium sized vector data sets when the index can all fit in RAM. +However, once the index no longer fits in RAM, performance takes a drastic hit. +This is because the search on the graph produces a lot of non-sequential I/O. +In memory this is fine, but this leads to poorly-optimized disk access patterns, even for SSDs. + +Another downside of HNSW indexes is that they cannot be updated incrementally, so they require periodically re-building the index with the underlying vector data. + +- **Pros:** Popular, easy to implement, fast searches. +- **Cons:** Does not scale well for large data sets, requires periodically re-building the index + +#### DiskANN + +DiskANN is another graph-based ANN search algorithm akin to HNSW. +However, DiskANN does not use the multi-layer graph approach like HNSW does. +Instead, it builds a single large graph of the vector data in several phases. + +Initially, a graph node is created for each vector in the data set, and random edges are added, leading to a very "messy" graph. +Then, two phases of optimization and pruning occur. +The first optimization phase does significant adjustment and pruning to optimize similarity search with short edges. +The second optimization phase build longer edges into the graph, allowing for faster graph traversals. +This graph construction algorithm is known as _Vamana_. + +![DiskANN Visual](/assets/docs/concepts/vector-indexes/diskann.png) + +As the name suggests, DiskANN has better performance than HNSW when the index does not all fit into RAM. +DiskANN uses a different technique for building the graph compared to HNSW. +This technique leads to different neighbors for each node and a different memory / disk layout. +When the index cannot fit into memory, this graph allows for a significant reduction in the number of disk read operations needed to fulfill a search compared to HNSW. + +DiskANN scales well, but suffers from worse query performance. +While it can be modified to allow incremental updates, these are not particularly efficient and are hard to map to transactional SQL semantics. + +- **Pros:** Fast searches, scales better than HNSW or IVF +- **Cons:** Relies on basic graph traversal, incremental index updates are expensive + +#### Space-Partitioned Approximate Nearest Neighbors (SPANN) + +SPANN is a hybrid vector indexing and search algorithm that uses both graph and tree structures, and was specifically designed to work well for indexes requiring SSD usage. + +A graph is created for the vector data, with edges representing nearby neighbors. +The graph is partitioned into many small clusters called _posting lists_. +In SPANN, nodes that are near the boundary between two posting lists may reside in multiple posting lists to help improve recall. + +The full set of posting lists are stored on disk. +The center-most node of each posting list (known as the _centroid_) is stored in a special SPTAG index, which is designed to fit in memory. + +![SPANN Visual](/assets/docs/concepts/vector-indexes/spann.png) + +When an ANN search is performed, the search algorithm can quickly identify a small set of the nearest centroids to the search vector using the in-memory index. +Then, a relatively small number of disk reads can take place to load only the relevant parts of the graph into memory, and then the search can be completed. +According to the [SPANN research paper](https://www.microsoft.com/en-us/research/uploads/prod/2021/11/SPANN_finalversion1.pdf), this leads to a 2x performance improvement over DiskANN. + +- **Pros:** Fast search, scales to huge data sets, the SPFresh variant allows for efficient incremental updates +- **Cons:** High complexity, high disk usage because of replicated data on-disk + +**PlanetScale MySQL uses a variant of the SPANN algorithm for vector indexes.** +Though challenging to implement, we wanted to provide the best solution for our users to scale their vector data sets. diff --git a/docs/vectors/use-cases.md b/docs/vectors/use-cases.md new file mode 100644 index 00000000..1e89d67d --- /dev/null +++ b/docs/vectors/use-cases.md @@ -0,0 +1,225 @@ +--- +title: 'Common use cases for Vector search' +subtitle: 'Vector search is an extremely powerful tool to leverage in your database. Learn common ways to use it to improve your product.' +label: 'Beta feature' +date: '2024-10-29' +--- + +{% callout type="note" %} +PlanetScale MySQL vectors is [still in beta](/docs/vectors/overview) and is not intended for use in production workloads. +{% /callout %} + +The ability to store and search vector data in your MySQL database, right alongside the rest of your relational data, is a powerful tool when applied correctly. +Some are already keenly familiar with the types of features and products that you can build with this. +However, many are not. +Here, we provide some examples of the types of features you can build with vector search when it is built directly in to your MySQL database. + +## Semantic search + +We use search all the time, from getting answers to questions via Google, looking for products on Amazon, finding a video on YouTube, or even searching for a document in our local file system. +However, not all search systems are created equal, and they can be implemented in a variety of ways. + +Let's say we have a corpus of blog posts that we want to be able to search through to find relevant articles to read. +The table to store the blog posts in looks like this: + +```sql +CREATE TABLE blog ( + id BIGINT NOT NULL AUTO_INCREMENT, + url VARCHAR(2048) NOT NULL, + title VARCHAR(1024), + subtitle VARCHAR(1024), + author VARCHAR(128), + PRIMARY KEY(id) +); +``` + +One way we could accomplish this is using plain-text matching. +For our search feature, a user would type in a search string, and then the text of each blog post's title and subtitle is searched for exact string (or substring) matches of our search term. + +```sql +SELECT title, url + FROM blog + WHERE title LIKE '%$SEARCH_TERM%' + OR subtitle LIKE '%$SEARCH_TERM%' + LIMIT 10; +``` + +(Note: we could also use a MySQL `FULLTEXT` index to perform faster text-matching search). + +This could give some good results, however there would be some instances where it would be problematic. +Searching only for exact matches may miss some of the relevant results. +For example, a user might search for the term "dogs" and end up with some posts about dogs. +However, it would miss results that do not use the term dog in favor of words like "puppy" or "hound." +It also might miss documents that are about "wolves" or "coyotes." +This search knows nothing about the _meaning_ of the word "dog." + +This is where **vector similarity search** comes into play. +With this type of search, we would generate an [embedding](/docs/vectors/terminology-and-concepts#vector-embedding-) for each blog post in our data set. +An embedding is an N-dimensional vector that captures opaque meaning about some piece of data — in this case, the title + subtitle of a blog post. +This vector would then be stored right along with the corresponding blog post row in the database. + +```sql +CREATE TABLE blog ( + id BIGINT NOT NULL AUTO_INCREMENT, + url VARCHAR(2048) NOT NULL, + title VARCHAR(1024), + subtitle VARCHAR(1024), + author VARCHAR(128), + embedding VECTOR(384), /* <-- New column for the embedding */ + PRIMARY KEY(id) +); +``` + +Whenever a search occurs on our database, we will also generate an embedding for that search term. +Then, we can use vector similarity search to find the top 10 results that have the most similar _meaning_ to the search term. + +```sql +SELECT title, url + FROM blog + ORDER BY DISTANCE($SEARCH_TERM_VECTOR, embedding, 'L2_squared') + LIMIT 10; +``` + +This type of search will be able to include those other posts, since we are now searching by meaning rather than text matches. + +## Recommendation systems + +Recommendation systems are also common in many products. +Amazon may recommend purchases similar to ones you view, and streaming services may recommend shows to you based on your watch history. +Similar types of systems can be built using vector similarity search. + +Perhaps we have an e-commerce platform. +In our database, we have a `product` table, a `user` table, and ` purchase` table to track which items each user purchases. + +```sql +CREATE TABLE product ( + product_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + title VARCHAR(256) NOT NULL, + description VARCHAR(1024) NOT NULL, + price INT NOT NULL, + ... +); + +CREATE TABLE user ( + user_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + username VARCHAR(256) NOT NULL, + email VARCHAR(256) NOT NULL, + ... +); + +CREATE TABLE purchase ( + purchase_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + product_id BIGINT NOT NULL, + user_id BIGINT NOT NULL, + purchased_at DATETIME NOT NULL, + ... +); +``` + +A great way to add a purchase recommendation feature to our application would be to use vectors. +The first step would be to add a `VECTOR` column to the `product` table: + +```sql +CREATE TABLE product ( + product_id bigint not null auto_increment primary key, + title varchar(256) not null, + description varchar(1024) not null, + price INT NOT NULL, + embedding VECTOR(384), /* <-- New column for the embedding */ + ... +); +``` + +We could generate an embedding for each row by feeding the title and description into an embedding model and storing the results in this new column. +Whenever a user logs in to our platform, we want to show them a list of 5 recommended purchases, based on similarity to their most recently purchased item. +To do this, we need a query that finds a user's most recent purchase, and then performs a KNN search for the 5 most similar products in the product table, based on vector similarity. +This query would look something like this: + +```sql +SET @uid = 12345; + +SET @recentEmbedding = + (SELECT product.embedding + FROM product + JOIN purchase ON product.product_id = purchase.product_id + JOIN user ON purchase.user_id = user.user_id + WHERE user.user_id = @uid + ORDER BY purchase.purchased_at DESC + LIMIT 1); + +SET @recommendationIDs = + (SELECT product_id + FROM product + ORDER BY DISTANCE(@recentEmbedding, embedding, 'L2_squared') + LIMIT 5); +``` + +We can then use the products in `@recommendationIDs` and display those to the user. + +## Retrieval-Augmented Generation (RAG) + +RAG is a popular technique for augmenting and enhancing results produced by an LLM. +LLMs such as GPT-4.0 or Sonnet-3.5 are extremely powerful, as they have been trained on immense data sets. +However, these LLMs are not trained on the entire universe of data, and it is often useful to pass them additional context to help answer a query. + +Suppose we have a private question/answer platform, internal to our organization. +None of the information on this platform is on the public internet and was not used to train any public LLMs. +This platform stores questions and answers like so: + +```sql +CREATE TABLE question ( + question_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + title VARCHAR(256) NOT NULL, + text VARCHAR(2048) NOT NULL +); + +CREATE TABLE answer( + answer_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + question_id BIGINT NOT NULL, + text VARCHAR(2048) NOT NULL +); +``` + +We would like to add a feature to this internal Q/A platform to allow a user to ask an LLM-powered chatbot questions instead of posting a question for other humans to answer. + +One way we could do this is allow the user to enter a question, send the prompt directly to the OpenAI API (or similar APIs), fetch the result, and display to the user. +This may not work well for questions that are specific to our organization, as OpenAI may not have sufficient knowledge in this area. +What we can do instead is leverage similarity search to augment the user's question with additional context before sending the prompt to OpenAI, which will allow it to produce better answers. + +To do this, we will yet again need to add a vector column to the `answer` table and populate them with embeddings: + +```sql +CREATE TABLE answer( + answer_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, + question_id BIGINT NOT NULL, + text VARCHAR(2048) NOT NULL, + embedding VECTOR(384) /* <-- New column for the embedding */ +); +``` + +Whenever a user enters a query for the chatbot, we will first generate an embedding for that question. +Then, we will perform similarity search to find existing answers on our platform that are related to this: + +```sql +SELECT answer.text + FROM answer + JOIN question ON question.question_id = answer.question_id + ORDER BY DISTANCE($QUERY_EMBEDDING, question.embedding, 'L2_squared') + LIMIT 5; +``` + +Now we have the user's initial question and several related answers. +We will construct a string formatted something like this: + +``` +Here is some information that is related to a question that a user has: + +[RELATED ANSWERS] + +Using this information, please answer the following user question: + +[THE USER'S ORIGINAL QUERY] +``` + +This new larger string will get passed on to our AI API for a response. +Since we have added context to the prompt, it will be able to do a better job at giving the user a satisfactory answer. diff --git a/docs/vectors/using-with-an-orm.md b/docs/vectors/using-with-an-orm.md new file mode 100644 index 00000000..bcd42612 --- /dev/null +++ b/docs/vectors/using-with-an-orm.md @@ -0,0 +1,210 @@ +--- +title: 'Vectors with an ORM' +subtitle: 'Learn how to PlanetScale MySQL vectors with popular ORMs' +label: 'Beta feature' +date: '2024-10-29' +--- + +{% callout type="note" %} +PlanetScale MySQL vectors is [still in beta](/docs/vectors/overview) and is not intended for use in production workloads. +{% /callout %} + +Many modern web development frameworks provide ORMs to make mapping data between your database and the objects in your application a seamless experience. +Here, we show several examples of how you can use the `VECTOR` MySQL type with ORMs. +These docs to not provide a comprehensive list of how to use vectors with all ORMs. +Rather, small examples for several popular choices are provided (Drizzle, Prisma, Rails). +If you use a different ORM and are having trouble getting it to work with vectors, please [reach out](/contact). + +Also note that for now, using PlanetScale MySQL vectors with these ORMs requires either a custom type or running raw MySQL queries. + +## Drizzle + +Here, we'll show how you can create and insert rows into a table that has a `VECTOR` column using Drizzle. +First, modify your `schema.ts` file to import the appropriate items and define two new types. +The first type is for the distance functions, and the second is to represent a vector. + +```typescript +import { sql } from 'drizzle-orm' +import { mysqlTable, customType, text, MySqlColumn, bigint } from 'drizzle-orm/mysql-core' + +export type DistanceFunction = 'COSINE' | 'DOT' | 'L2' | 'L2_SQUARED' + +export const vector = customType<{ + data: ArrayBuffer + config: { length: number } + configRequired: true + driverData: Buffer +}>({ + dataType(config) { + return `VECTOR(${config.length})` + }, + fromDriver(value) { + return value.buffer as ArrayBuffer + }, + toDriver(value) { + return Buffer.from(value) + } +}) +``` + +With these in place, we can go ahead and declare our table, also in `schema.ts`. + +```typescript +export const product = mysqlTable('product', { + id: bigint('id', { mode: 'number', unsigned: true }).autoincrement().primaryKey(), + name: text('name'), + description: text('description'), + embedding: vector('embedding', { length: 384 }) +}) +``` + +This would be in addition to other schema declarations you need for your application. + +We now have an object to model a table with a vector embedding column. +Elsewhere in our application, we can perform typical operations like inserts, deletes, searches, etc. +For example, to insert a row into this table, do the following: + +```typescript +const embedding = // generate an embedding with your preferred API +const serializedEmbedding = Buffer.from((new Float32Array(embedding)).buffer) +await db.insert(product).values({ + description: 'hi there', + embedding: serializedEmbedding +}) +``` + +We can also perform searches on the data: + +```typescript +const rankedItems = await db + .select({ description: product.description }) + .from(product) + .orderBy(`DISTANCE(TO_VECTOR(${embedding}), ${product.embedding}, L2_SQUARED)`) + .limit(10) +``` + +If you have a large data set, you'll want to make sure you create an index on this vector column. + +## Prisma + +Next, let's look at what it takes to get a table with a `VECTOR` column working with Prisma. +As of this writing, [Prisma does not support custom types](https://github.com/prisma/prisma/issues/5039). + +Until Prisma provides support, you can still use vectors in a Prisma-powered application by using the `Unsupported` function in your `schema.prisma` and then use raw queries to perform vector searches. +We can add a new table with a `VECTOR(4)` column by adding the following to our `schema.prisma` file. + +```typescript +model Product { + id BigInt @id @default(autoincrement()) @db.UnsignedBigInt + name String? @db.Text + description String? @db.Text + embedding Unsupported("vector(4)")? +} +``` + +When running `prisma db push`, it will create the table with the `embedding` column having type `VECTOR(4)`, even though Prisma does not technically support vectors. + +From here, you can use Prisma's `queryRaw` feature to run raw SQL queries. +For example, to insert a row into this table. + +```typescript +const name = 'Pots' +const description = 'For cooking' +const embedding = '[0.5, 0.4, 0.3, 0.2]' +await prisma.$queryRaw( + Prisma.sql` + INSERT INTO + Product (name, description, embedding) + VALUES(${name}, ${description}, TO_VECTOR(${embedding}))` +) +``` + +And to retrieve results from this table: + +```typescript +const result = await prisma.$queryRaw( + Prisma.sql`SELECT * FROM Product ORDER BY DISTANCE(TO_VECTOR(${query_vector}), embedding, 'l2_squared')` +) +console.log(result) +``` + +If you have a large data set, you'll want to make sure you create an index on this vector column. + +## Ruby on Rails + +Lets look at how you can work with a vector column in a Ruby on Rails application. +Say you have an existing object that models rows in a table, and you are using ActiveRecord to manage the mapping between your objects and your database. +For example, an object representing a tweet in `app/models/tweet.rb`. + +```ruby +class Tweet < ApplicationRecord + belongs_to :user + has_one_attached :image + default_scope -> { order(created_at: :desc) } + validates :content, presence: true, length: { maximum: 140 } + validates :user_id, presence: true +end +``` + +Since vectors are a new type for MySQL, we will add a custom migration to handle adding and dropping a `VECTOR` column to this table in the database. +To do this, add a new migration in `db/migrations` like so: + +```ruby +class AddTweetEmbeddings < ActiveRecord::Migration[7.0] + def up + execute <<-SQL + ALTER TABLE tweets ADD COLUMN embedding VECTOR(1536); + SQL + execute <<-SQL + CREATE VECTOR INDEX idx_tweet_embeddings ON tweets(embedding) '{"type":"spann","distance":"cosine"}'; + SQL + end + + def down + execute <<-SQL + ALTER TABLE tweets DROP COLUMN embedding; + SQL + end +end +``` + +You'll also need to make the appropriate adjustments in your project to ensure that this gets executed when you run `rails db:migrate`. +With the schema updated appropriately, you can run raw SQL queries via ActiveRecord to insert rows with vectors and perform search. +For example, to insert a new row with a vector, you can do something like this: + +```ruby +embedding = [1.0, 0.5, 0.25, 0.125].to_json +created_at = Time.now.to_s +updated_at = Time.now.to_s +content = 'A tweet!' +user_id = 100 + +sql = <<-SQL + INSERT INTO tweets (user_id, content, content_embeddings, created_at, updated_at) + VALUES ( + #{ActiveRecord::Base.connection.quote(user_id)}, + #{ActiveRecord::Base.connection.quote(content)}, + TO_VECTOR(#{ActiveRecord::Base.connection.quote(embedding)}), + #{ActiveRecord::Base.connection.quote(created_at)}, + #{ActiveRecord::Base.connection.quote(updated_at)} + ) +SQL + +ActiveRecord::Base.connection.execute(sql) +``` + +To perform a similarity search based on an input embedding, you can execute the following: + +```ruby +embedding = [1.0, 0.5, 0.25, 0.125] +sql = <<-SQL + SELECT id, content, distance(content_embeddings, TO_VECTOR('#{embedding}'), 'COSINE') as d FROM tweets ORDER BY d DESC LIMIT 10 +SQL + +results = ActiveRecord::Base.connection.execute(sql) +``` + +## Other ORMs + +If you have requests for other ORMs you'd like to see documented for using with vectors, please [reach out](/contact) with your questions. +On the other hand, if you get PlanetScale MySQL vectors working with an ORM not listed here and would like to share your technique, please do as well.