Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] Remove multilingual task #1926

Merged
merged 14 commits into from
Feb 4, 2025
Merged

[v2] Remove multilingual task #1926

merged 14 commits into from
Feb 4, 2025

Conversation

Samoed
Copy link
Collaborator

@Samoed Samoed commented Feb 2, 2025

Closes #1832

I've removed MultilingualTask and explicitly added is_multilingual to tasks, which was previously handled by MultilingualTask. I also moved the fast_loading method from MultilingualTask to AbsTask, but I think we should reupload these datasets in the correct format.

Additionally, we might want to determine is_multilingual automatically based on eval_langs, but I’m not entirely in favor of this since it feels a bit like magic. Another option is to integrate is_multilingual into TaskMetadata.

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Feb 2, 2025
@KennethEnevoldsen
Copy link
Contributor

I have also made some changes on this over on main. I would probably do a merge before (my fix removed .is_multilingual). It was required to allow for the aggregate task I believe.

Co-authored-by: Kenneth Enevoldsen <[email protected]>
@Samoed
Copy link
Collaborator Author

Samoed commented Feb 2, 2025

No, is_multilingual is still in AbsTask.

is_multilingual: bool = False

@KennethEnevoldsen
Copy link
Contributor

(we could also consider merging this and resolving it in a merge/partial merge)

@KennethEnevoldsen
Copy link
Contributor

No, is_multilingual is still in AbsTask.

Yes, but it is not in .evaluate (which I see it also isn't here, that is my mistake).

So yea, no merge is needed

Copy link
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, looking a bit more into this. Are there cases where we can't replace .is_multilingual?

@Samoed
Copy link
Collaborator Author

Samoed commented Feb 2, 2025

No, we can totally remove is_multilingual. If you think we should do this, I will remove it

@Samoed Samoed force-pushed the remomove_multilingual_task branch from b74cbf3 to d0cdc39 Compare February 2, 2025 12:48
Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something: how do we determine which subsets to load? Does is_multilingual play a role or do we just check for the presence of a default subset?

@Samoed
Copy link
Collaborator Author

Samoed commented Feb 2, 2025

how do we determine which subsets to load?

By hf_subsets. If user want to run some part of them, then they require to run mteb.get_tasks that will filter required subsets. For example, for language filter.

self.hf_subsets = subsets_to_keep

Does is_multilingual play a role or do we just check for the presence of a default subset?

Previously yes, but now we can get subsets with hf_subsets_to_langscripts (maybe we can create separate method just for getting splits), that return them correctly. For now, I think we can totally remove it

@Samoed Samoed linked an issue Feb 2, 2025 that may be closed by this pull request
@isaac-chung
Copy link
Collaborator

Sorry, I didn't phrase the question right. What I'm curious about is the evidence supporting this statement:

we can totally remove is_multilingual.

@Samoed Samoed force-pushed the remomove_multilingual_task branch from ef07074 to d0eb782 Compare February 3, 2025 07:47
@Samoed
Copy link
Collaborator Author

Samoed commented Feb 3, 2025

Sorry, I didn't phrase the question right. What I'm curious about is the evidence supporting this statement:

we can totally remove is_multilingual.

Yes, for now, is_multilingual is only used to check the default split.

hf_subsets = list(self.dataset) if self.is_multilingual else ["default"]

@isaac-chung
Copy link
Collaborator

Yes, for now, is_multilingual is only used to check the default split.

Hmm it seems like there are a few more references, like

Do they seem easy to refactor?

@Samoed
Copy link
Collaborator Author

Samoed commented Feb 3, 2025

I think we can make is_multilingual just property of TaskMetadata that would be calculated based on eval_splits for using in some parts of mteb

@Samoed Samoed requested a review from isaac-chung February 4, 2025 10:34
Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just got one small suggestion.

@Samoed Samoed merged commit a619ad8 into v2.0.0 Feb 4, 2025
10 checks passed
@Samoed Samoed deleted the remomove_multilingual_task branch February 4, 2025 12:29
@Samoed Samoed mentioned this pull request Feb 4, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2 Issues and PRs related to `v2` branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v2] Remove MultilingualTask
3 participants