
[v2] Refactor text tasks to use DataLoader #2198

Open · wants to merge 11 commits into base v2.0.0

Conversation

@Samoed (Collaborator) commented Feb 28, 2025

Ref #1606

Now models' encode functions will receive a DataLoader that yields batches of the form:

{
    "text": [...],   # default text
    "image": [...],
    "audio": [...],
    "body": [...],   # models are allowed to construct the text from the body + title if they wish
    "title": [...],
}
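
For illustration, a minimal sketch (with a hypothetical helper name, not the PR's actual implementation) of how a model's encode function might consume such a DataLoader:

import numpy as np
from torch.utils.data import DataLoader

def encode(self, inputs: DataLoader, **kwargs) -> np.ndarray:
    # Each batch is a dict mapping column name -> list of values for that batch,
    # e.g. {"text": [...], "title": [...], "body": [...]}.
    embeddings = []
    for batch in inputs:
        texts = batch["text"]
        # A model could instead build its own text, e.g. from title + body:
        # texts = [f"{title} {body}" for title, body in zip(batch["title"], batch["body"])]
        embeddings.append(self._embed_batch(texts))  # _embed_batch: placeholder for the model's own batching
    return np.concatenate(embeddings)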

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

@Samoed added the `v2` label (Issues and PRs related to `v2` branch) on Feb 28, 2025
@Samoed (Collaborator, Author) commented Mar 1, 2025

> Right now it is very much a quick wrapper. Wouldn't we prefer working directly with the dataset for datasets? (I know that this is more code to write)

It's not easy, because datasets have different column names and most datasets require encoding two columns, and I don't have a clear solution for handling that. Also, in most tasks a list of sentences is passed to the evaluators, and Datasets can't be used there for now, but we can change that. Additionally, some datasets return a dictionary instead of a Dataset, and PairClassification expects all data to be in the first row (as I recall). I could pass the dataset directly and select columns, but that would be a similar approach to using a wrapper.
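
To make the column problem concrete, a rough sketch with hypothetical column names and task types (not the PR's code):

from datasets import Dataset
from torch.utils.data import DataLoader

# Different tasks keep their inputs under different column names, and some need
# two columns encoded per example (e.g. sentence pairs), so a generic
# "pass the Dataset" path still needs a per-task column mapping.
COLUMNS_TO_ENCODE = {
    "Classification": ["text"],
    "STS": ["sentence1", "sentence2"],
    "PairClassification": ["sentence1", "sentence2"],
}

def loaders_for_task(dataset: Dataset, task_type: str, batch_size: int = 32) -> list[DataLoader]:
    # Selecting the relevant columns and batching each one ends up looking a lot
    # like the wrapper approach this PR currently uses.
    return [
        DataLoader(dataset.select_columns([column]), batch_size=batch_size)
        for column in COLUMNS_TO_ENCODE[task_type]
    ]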

@Samoed changed the title from "update text tasks except retrieval" to "[v2] Refactor text tasks to use DataLoader" on Mar 1, 2025
@Samoed marked this pull request as ready for review on March 1, 2025, 15:21
@KennethEnevoldsen (Contributor) left a comment


So I would really like to see what a DataLoader-native AbsTask would look like. Can we try to do it with just Classification?

I am also afraid of how much this influences throughput; can we do a quick test, e.g. using minishlab models?

It is a bit annoying that we have to convert everything in the encode functions (it might be the right solution). We could consider whether it is better to just hand off the Dataset object to the model? (But I assume that does not work for images?)
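
A rough way to sanity-check throughput, assuming a model that exposes encode() over a list of strings (e.g. a small minishlab/model2vec static model) and a Dataset with a "text" column; this is a hypothetical sketch, not part of the PR:

import time
from torch.utils.data import DataLoader

def compare_throughput(model, dataset, batch_size: int = 256) -> tuple[float, float]:
    # Time encoding via DataLoader batches vs. encoding the raw list of sentences.
    loader = DataLoader(dataset, batch_size=batch_size)
    start = time.perf_counter()
    for batch in loader:
        model.encode(batch["text"])
    loader_time = time.perf_counter() - start

    start = time.perf_counter()
    model.encode(dataset["text"])
    list_time = time.perf_counter() - start
    return loader_time, list_time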

if isinstance(queries[0], list):
# Encode only unique queries using the dataloader
if isinstance(query_list[0], list):
# For conversations, still use the original encode_conversations method
Contributor

Hmm don't we want to standardize everything?

Collaborator Author

We want to, but I still don't know what to do with them, because we don't have an implementation for any model #1330

Contributor

I pinged him. Can't we just convert it to text and keep the conversation in a column as well?

Collaborator Author

Yes, I can change it like that.
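
A hedged sketch of that idea, with assumed column names and turn format (not the PR's code): flatten each conversation into plain text while keeping the structured turns in their own column:

def add_flattened_text(example: dict) -> dict:
    # Keep the structured turns in "conversation" and add a plain-text rendering
    # in "text" for models that have no conversation support.
    example["conversation"] = example["query"]
    example["text"] = " ".join(str(turn) for turn in example["query"])
    return example

# queries: a datasets.Dataset whose "query" column holds a list of turns per row
queries = queries.map(add_flattened_text)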

@Samoed (Collaborator, Author) commented Mar 2, 2025

I've updated the clustering and classification tasks to use Dataset more natively.
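
For context, a hypothetical sketch of what "more native" Dataset usage could look like for a classification task (names are illustrative, not the PR's implementation):

import numpy as np
from torch.utils.data import DataLoader

def evaluate_split(self, model, split: str = "test"):
    ds = self.dataset[split]  # a datasets.Dataset with "text" and "label" columns
    loader = DataLoader(ds.select_columns(["text"]), batch_size=32)
    embeddings = model.encode(loader)  # the model's encode consumes the DataLoader directly
    labels = np.asarray(ds["label"])
    # ...then fit the usual logistic-regression classifier on (embeddings, labels)
    return embeddings, labels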
