
[v2] Refactor text tasks to use DataLoader #2198

Open · wants to merge 11 commits into base v2.0.0

Conversation

@Samoed (Collaborator) commented Feb 28, 2025

Ref #1606

Now models' encode functions will receive a DataLoader that yields batches of the form:

{
    "text": [...],   # default text
    "image": [...],
    "audio": [...],
    "body": [...],   # models are allowed to construct the text from the body + title if they wish
    "title": [...],
}
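
For illustration, a minimal sketch (with a hypothetical helper name, not the PR's actual implementation) of how a model's encode function might consume such a DataLoader:

import numpy as np
from torch.utils.data import DataLoader

def encode(self, inputs: DataLoader, **kwargs) -> np.ndarray:
    # Each batch is a dict mapping column name -> list of values for that batch,
    # e.g. {"text": [...], "title": [...], "body": [...]}.
    embeddings = []
    for batch in inputs:
        texts = batch["text"]
        # A model could instead build its own text, e.g. from title + body:
        # texts = [f"{title} {body}" for title, body in zip(batch["title"], batch["body"])]
        embeddings.append(self._embed_batch(texts))  # _embed_batch: placeholder for the model's own batching
    return np.concatenate(embeddings)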

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

@Samoed added the `v2` label (Issues and PRs related to `v2` branch) on Feb 28, 2025
@Samoed (Collaborator, Author) commented Mar 1, 2025

> Right now it is very much a quick wrapper. Wouldn't we prefer working directly with the dataset for datasets? (I know that this is more code to write)

It's not easy, because datasets have different column names and most datasets require encoding two columns, and I don't have a clear solution for handling that. Also, in most tasks a list of sentences is passed to the evaluators, and Datasets can't be used there for now, but we can change that. Additionally, some datasets return a dictionary instead of a Dataset, and PairClassification expects all data to be in the first row (as I recall). I could pass the dataset directly and select columns, but that would be a similar approach to using a wrapper.
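
To make the column problem concrete, a rough sketch with hypothetical column names and task types (not the PR's code):

from datasets import Dataset
from torch.utils.data import DataLoader

# Different tasks keep their inputs under different column names, and some need
# two columns encoded per example (e.g. sentence pairs), so a generic
# "pass the Dataset" path still needs a per-task column mapping.
COLUMNS_TO_ENCODE = {
    "Classification": ["text"],
    "STS": ["sentence1", "sentence2"],
    "PairClassification": ["sentence1", "sentence2"],
}

def loaders_for_task(dataset: Dataset, task_type: str, batch_size: int = 32) -> list[DataLoader]:
    # Selecting the relevant columns and batching each one ends up looking a lot
    # like the wrapper approach this PR currently uses.
    return [
        DataLoader(dataset.select_columns([column]), batch_size=batch_size)
        for column in COLUMNS_TO_ENCODE[task_type]
    ]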

@Samoed changed the title from "update text tasks except retrieval" to "[v2] Refactor text tasks to use DataLoader" on Mar 1, 2025
@Samoed marked this pull request as ready for review on March 1, 2025, 15:21
@KennethEnevoldsen (Contributor) left a comment


So I would really like to see what a DataLoader-native AbsTask would look like. Can we try to do it with just Classification?

I am also afraid of how much this influences throughput; can we do a quick test, e.g. using minishlab models?

It is a bit annoying that we have to convert everything in the encode functions (it might be the right solution). We could consider whether it is better to just hand off the Dataset object to the model? (But I assume that does not work for images?)
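
A rough way to sanity-check throughput, assuming a model that exposes encode() over a list of strings (e.g. a small minishlab/model2vec static model) and a Dataset with a "text" column; this is a hypothetical sketch, not part of the PR:

import time
from torch.utils.data import DataLoader

def compare_throughput(model, dataset, batch_size: int = 256) -> tuple[float, float]:
    # Time encoding via DataLoader batches vs. encoding the raw list of sentences.
    loader = DataLoader(dataset, batch_size=batch_size)
    start = time.perf_counter()
    for batch in loader:
        model.encode(batch["text"])
    loader_time = time.perf_counter() - start

    start = time.perf_counter()
    model.encode(dataset["text"])
    list_time = time.perf_counter() - start
    return loader_time, list_time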

if isinstance(queries[0], list):
# Encode only unique queries using the dataloader
if isinstance(query_list[0], list):
# For conversations, still use the original encode_conversations method
Contributor

Hmm don't we want to standardize everything?

Collaborator Author

We want to, but I still don't know what to do with them, because we don't have an implementation for any model #1330

Contributor

I pinged him. Can't we just convert it to text and keep the conversation in a column as well?

Collaborator Author

Yes, I can change it like that.
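
A hedged sketch of that idea, with assumed column names and turn format (not the PR's code): flatten each conversation into plain text while keeping the structured turns in their own column:

def add_flattened_text(example: dict) -> dict:
    # Keep the structured turns in "conversation" and add a plain-text rendering
    # in "text" for models that have no conversation support.
    example["conversation"] = example["query"]
    example["text"] = " ".join(str(turn) for turn in example["query"])
    return example

# queries: a datasets.Dataset whose "query" column holds a list of turns per row
queries = queries.map(add_flattened_text)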

@Samoed (Collaborator, Author) commented Mar 2, 2025

I've updated the clustering and classification tasks to use Dataset more natively.
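
For context, a hypothetical sketch of what "more native" Dataset usage could look like for a classification task (names are illustrative, not the PR's implementation):

import numpy as np
from torch.utils.data import DataLoader

def evaluate_split(self, model, split: str = "test"):
    ds = self.dataset[split]  # a datasets.Dataset with "text" and "label" columns
    loader = DataLoader(ds.select_columns(["text"]), batch_size=32)
    embeddings = model.encode(loader)  # the model's encode consumes the DataLoader directly
    labels = np.asarray(ds["label"])
    # ...then fit the usual logistic-regression classifier on (embeddings, labels)
    return embeddings, labels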
