Table to count number of tasks per language #686
-
We can add it at the top of the existing task overview. You can see how the previous script inserts the table at a specific spot. I'm unsure whether we should make two tables (one for languages and one for language groups) or a single nested table.
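For the "insert at a specific spot" step, here is a minimal sketch of one common approach: keep a pair of marker comments in the overview file and replace whatever sits between them. The marker strings and the `insert_table` helper are hypothetical and are not taken from the existing script, which may well do this differently.

```python
import re

# Hypothetical markers; the real overview file may use different ones.
START = "<!-- LANGUAGE TABLE START -->"
END = "<!-- LANGUAGE TABLE END -->"


def insert_table(document: str, table: str) -> str:
    """Replace everything between START and END (inclusive) with a fresh table."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), flags=re.DOTALL)
    if not pattern.search(document):
        raise ValueError("markers not found in document")
    replacement = f"{START}\n{table}\n{END}"
    # A function replacement keeps backslashes in the table from being read as escapes.
    return pattern.sub(lambda _: replacement, document, count=1)


doc = "# Tasks\n\n" + START + "\nold table\n" + END + "\n\nRest of the overview.\n"
new_doc = insert_table(doc, "| Language | Tasks |\n|---|---|\n| eng | 3 |")
print(new_doc)
```

Because the markers are written back together with the table, rerunning the script just overwrites the previous table instead of appending a second one.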
-
For the paper, we might also create an overview similar to: Also seen e.g. here. We probably don't want the fields to be datasets but rather domains, e.g. something like:
-
Should we make |
-
I think it's a good idea. To facilitate this, I just want to bring to your attention, @isaac-chung, a (fairly) new method that should make it easy to aggregate the data this way, e.g.:

from mteb import get_tasks

task_types = ["BitextMining", "Classification", "Clustering", "InstructionRetrieval", "PairClassification", "Reranking", "Retrieval", "STS", "Summarization"]
# Print the per-language task counts for each task type in turn.
for tt in task_types:
    print(get_tasks(task_types=[tt]).count_languages())
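To go from those per-task-type counts to a single table, something like the sketch below could work. It assumes `count_languages()` returns a mapping from language code to number of tasks. The numbers are made up, and `to_markdown` is a hypothetical helper, not part of `mteb`.

```python
from collections import Counter

# Made-up counts standing in for get_tasks(task_types=[tt]).count_languages();
# the shape (language code -> number of tasks) is an assumption.
counts_per_type = {
    "Classification": Counter({"eng": 3, "dan": 2}),
    "Retrieval": Counter({"eng": 4, "fra": 1}),
}


def to_markdown(counts_per_type):
    """Build a language x task-type markdown table with a per-language total."""
    task_types = sorted(counts_per_type)
    languages = sorted({lang for c in counts_per_type.values() for lang in c})
    header = "| Language | " + " | ".join(task_types) + " | Total |"
    divider = "|" + "---|" * (len(task_types) + 2)
    rows = [header, divider]
    for lang in languages:
        # Languages missing from a task type count as zero tasks.
        cells = [counts_per_type[tt].get(lang, 0) for tt in task_types]
        rows.append(f"| {lang} | " + " | ".join(map(str, cells)) + f" | {sum(cells)} |")
    return "\n".join(rows)


table = to_markdown(counts_per_type)
print(table)
```

Swapping the hard-coded dictionary for `{tt: get_tasks(task_types=[tt]).count_languages() for tt in task_types}` should give the real table, provided the return type matches the assumption above.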
-
With the growing number of datasets in the multilingual part of MTEB, maybe a table is needed to tally how many datasets there are per task per language. I would imagine something like this is also needed for the submission (e.g. some stats grouped by language family). Where would this live? Maybe as a script under docs/mmteb? Or in a separate repo like https://github.com/embeddings-benchmark/mtebscripts, or in that repo itself?
Tagging @imenelydiaker @KennethEnevoldsen @Muennighoff and anyone who might be interested.
For example: