Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support embedders setting #554

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

CommanderStorm
Copy link
Contributor

@CommanderStorm CommanderStorm commented Mar 2, 2024

Pull Request

Related issue

Fixes #541
Fixes #612
Fixes #621
Fixes #646

What does this PR do?

  • Adds the required settings

    • with_embedders does use the same "API" (not using impl AsRef for items passed) as with_synonyms, as this is the closest existing
    • given set_embedders has not been implemented upstream (at least when I try to PATCH the object, it does not work)
    • only {get,reset}_embedders settings have been implemented.
      Said implementation goes with the work done in Implement vector search experimental feature v2 (v1.6) meilisearch-python#924
  • adds the hybrid field to search via the vector search to add an end-to-end test of this feature with the huggingface configuration.

    userProvided seens more brittle, but we may want change to this instead
    using userProvided instead would mean (at the cost of hardcoding stuff)
    => lower cpu effort
    => no higher timeout necceeseary
    => aligning with meilisearch/meilisearch to only have a test for userProvided)

TODO:

  • find a combination of semantic search model + configuration that does not fail the assumptions (see search testcase) spectacularly

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@CommanderStorm CommanderStorm changed the title Vector search embedder [v1.6] support embedders setting Mar 2, 2024
@CommanderStorm CommanderStorm marked this pull request as ready for review March 2, 2024 19:52
@CommanderStorm CommanderStorm force-pushed the vector-search-embedder branch from 1076241 to 8ffb555 Compare March 11, 2024 11:42
@curquiza
Copy link
Member

curquiza commented Apr 15, 2024

Hello @CommanderStorm

Can you rebase your branch? We made changes recently to improve the library

Sorry for the inconvenience, I try to review your PR as soon as possible after the rebase

@CommanderStorm CommanderStorm force-pushed the vector-search-embedder branch 2 times, most recently from 51820a9 to 8ade09d Compare April 16, 2024 03:48
@CommanderStorm
Copy link
Contributor Author

CommanderStorm commented Apr 16, 2024

Can you rebase your branch?

Done ^^

Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @CommanderStorm,

Code-wise, the PR is very nice and well-documented; I love it!


But I’m hitting way too many timeouts to really help you; sorry 😖

I don’t think we’ll be able to merge this PR with tests that take about 10 minutes. Could you mock Meilisearch as we did here:

async fn test_get_tasks_with_params() -> Result<(), Error> {
let mut s = mockito::Server::new_async().await;
let mock_server_url = s.url();
let client = Client::new(mock_server_url, Some("masterKey")).unwrap();
let path =
"/tasks?indexUids=movies,test&statuses=equeued&types=documentDeletion&uids=1&limit=0&from=1";
let mock_res = s.mock("GET", path).with_status(200).create_async().await;
let mut query = TasksSearchQuery::new(&client);
query
.with_index_uids(["movies", "test"])
.with_statuses(["equeued"])
.with_types(["documentDeletion"])
.with_from(1)
.with_limit(0)
.with_uids([&1]);
let _ = client.get_tasks_with(&query).await;
mock_res.assert_async().await;
Ok(())

That way, we simply ensure that meilisearch-rust sends a valid payload and hope meilisearch works as expected.

userProvided seens more brittle, but we may want change to this instead

Or we could do that and actually send the payload to meilisearch.
Or do both; let me know what you prefer, but I would very much like the tests to not take that much time to run 😖

Where in the Meilisearch codebase could I find an e2e-test how to use feature?

I don’t think there is. I believe you’re right, and we only wrote tests for user-provided vectors

introduces the experimental-vector-search-feature flag

I don’t know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).
Since it doesn’t add any dependency, I don’t see much point in putting it behind a feature flag

@curquiza
Copy link
Member

I don’t know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).

Yes, that's ok not to do feature flag 😊

@CommanderStorm CommanderStorm force-pushed the vector-search-embedder branch from 0285e62 to 9033232 Compare April 17, 2024 15:37
@CommanderStorm
Copy link
Contributor Author

CommanderStorm commented Apr 17, 2024

I did a few tests, and on my side it never took 120s, the quickest execution took 150s but most of the time I was over 200s

Time after being migrated to userProvided seems fine in CI (this is likely the slowest machine running the tests). ✅

Or we could do that and actually send the payload to meilisearch.

I think keeping the testcases from requiring a active internet connection is better as otherwise the test might be unnessesarily flaky in CI.

Could you mock Meilisearch as we did here

I can add a testcase where I mock the routes.
I am unsure if you actually would want to mock this for an experimental feature (= where the api might change => requiring changes)

I have tweaked a bunch with the vectors available and can't get the test_hybrid testcase to work without fully using the same dataset as in tests/search/hybrid.rs.

It seems to always return everything when I set semantic_ratio=1.0...
Is this operating as intended?

I have tried similar 2D vectors as the upstream test and went as far as to use 1D vectors with considerable (1/10/1k/1m) spread (=>something that as far as I understand embeddings should not match)

=> I am missing something. 😅
Could you point me into the right direction?

Note

I can obviously steal the testcase from tests/search/hybrid.rs, but for longer-term maintainability I would like this testcase to not be such an oddball compared to the existing testcases in this repo ^^

@CommanderStorm CommanderStorm force-pushed the vector-search-embedder branch from 9033232 to 0e044b1 Compare April 17, 2024 16:18
@curquiza curquiza requested a review from irevoire May 14, 2024 09:25
@curquiza
Copy link
Member

curquiza commented Jul 1, 2024

@CommanderStorm really sorry for this.
Can you fix the git conflicts? 😊

@CommanderStorm
Copy link
Contributor Author

Thanks for the pinging (github does not give me notifications for this) ^^
No need to worry. If this takes a few months that is fine.

Copy link
Member

@curquiza curquiza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CommanderStorm thank you again! Can you add the rest and ollama models we added in v1.8.0?
Sorry for the late notice again!
And thank you for your involvement 🙏

@NoodleSamaChan NoodleSamaChan mentioned this pull request Jul 2, 2024
3 tasks
@irevoire
Copy link
Member

irevoire commented Jul 3, 2024

Hey @CommanderStorm, since we introduced a new rest embedder, I think we could write better tests by mocking the rest embedder and ensuring it works with meilisearch; here's some example I wrote earlier this week in meilisearch: meilisearch/meilisearch@2141cb3

@CommanderStorm
Copy link
Contributor Author

I think I have adressed the requested changes ^^
Sorry for the long delay, I was on vacation and then a bit buisy with other work. Hope you understand.

AND we should absolutely remove these two tests

I have only removed the test which disables the vector store, but have kept the test which .set_vector_store(true)'s.
I have also improved a test a bit to also set, but this should be minor ^^

Next time, I will split such PRs earlier ^^

Copy link
Contributor Author

@CommanderStorm CommanderStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This updates the PR to the breaking 1.11 changes.

@CommanderStorm
Copy link
Contributor Author

CommanderStorm commented Nov 17, 2024

binaryQuantized will have to be in a different PR, as according to the docs this is only avaliable for updating via PATCH, which is currently not supported.

Docs for this feature: https://meilisearch.notion.site/Binary-quantization-usage-v1-11-2a9c9559461a4a9d9fa3e0ea5149ad68

@CommanderStorm
Copy link
Contributor Author

@irevoire sorry to ping you.
I think it would be much more maintainable to do this work in different PRs.

Is merging this feature in the experimental stage still something you are interested in, or would this feature need to be stabilised first?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants