-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update README: new vecton index syntax #62
Conversation
Co-authored-by: Mini256 <[email protected]>
# vector index currently depends on tiflash | ||
session.execute(text('ALTER TABLE sqlalchemy_documents SET TIFLASH REPLICA 1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be encapsulated? User don't know what is TiFlash.
In the syntax design, an explicit SET TIFLASH REPLICA is intentional, because otherwise, adding a vector index to an existing table with existing data may cause replicating a huge amount of data (because of implicit set replica 1) which is out of user expectation. However in this driver, the index can be added right after table creation, thus there is no such risk.
On the other hand, when index is added while table is created is allowed without explicitly setting a TiFlash replica factor, because there is no such risk:
CREATE TABLE t(
vec VECTOR(3),
VECTOR INDEX ((VEC_COSINE_DISTANCE(vec))
)
README.md
Outdated
index = Index( | ||
'idx_embedding', | ||
func.vec_cosine_distance(Document.embedding), | ||
mysql_prefix="vector", | ||
mysql_using="hnsw" | ||
) | ||
index.create(engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other vector databases looks like the index is generally specified when table is created. I think this could be a better UX. It may be no need to add index after data is inserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlalchemy will create the table in two steps: 1. Creating a table without indexes. 2. Creating indexes. Since we need to explicitly set the tiflash replica when creating vector index separately, we cannot define the index in the table class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wd0517 Thanks for the explain. So it is not possible to execute multiple statements there right? Let's raise some discussions with the PM to check how this can be workarounded.
README.md
Outdated
Add hnsw index | ||
|
||
```python | ||
# vector index currently depends on tiflash | ||
db.execute_sql(SQL( | ||
"ALTER TABLE peewee_documents SET TIFLASH REPLICA 1;" | ||
)) | ||
DocumentModel.add_index(SQL( | ||
"CREATE VECTOR INDEX idx_embedding ON peewee_documents ((vec_cosine_distance(embedding))) USING HNSW" | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid explicit SQL? We can follow how pgvector does: https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee
IMO now we provide same capability as pgvector (the index can be created at any time, both when table is created or after table is created), we should provide the same interface and UX as pgvectors', in order to minimize user learning cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peewee does not support the vector
prefix for defining an index, so that we still need to use raw sql. https://github.com/coleifer/peewee/blob/master/peewee.py#L2964
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice digging! Looks like it is indeed very hard. When you are free, could you try with something like our own derived Index? Like:
class VectorIndex(peewee.Index):
...
Add hnsw index | ||
|
||
```python | ||
# vector index currently depends on tiflash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is by design and permanent. If TiFlash component is not deployed corresponding operation should fail, and asking user to deploy a TiFlash in order to use Vector Index (just like FULL TEXT INDEX will fail in MySQL if ENGINE is MEMORY).
No description provided.