
Implement Dataset Transformation Scripts for ColPali and Msmarco (training+corpus) #168

Merged
merged 2 commits into tevatron-v2
Feb 9, 2025

Conversation

Samantha-Zhan
Collaborator

@Samantha-Zhan Samantha-Zhan commented Feb 8, 2025

The Design

A unified training data format

{
	"query_id": str,
	"query_text": str,
	"query_image": PIL.Image,
	"positive_document_ids": List[str],
	"negative_document_ids": List[str],
	"source": 'msmarco',
	"answer": str,
}
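
A minimal sketch of how one record could be checked against this format. The field names come from the schema above; the helper `check_training_record`, the example values, and the choice to leave `query_image` as `None` for a text-only query are assumptions made for illustration, not part of the scripts in this PR.

```python
from typing import Any, Dict, List

# Field names follow the training format above; the type checks are an assumption.
TRAINING_FIELDS = {
    "query_id": str,
    "query_text": str,
    "positive_document_ids": list,
    "negative_document_ids": list,
    "source": str,
    "answer": str,
}


def check_training_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    for name, expected in TRAINING_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    # query_image is a PIL.Image in the schema; only its presence is checked
    # here so the sketch does not depend on Pillow.
    if "query_image" not in record:
        problems.append("missing field: query_image")
    return problems


# Hypothetical record, for illustration only.
example = {
    "query_id": "q1",
    "query_text": "what is dense retrieval",
    "query_image": None,  # assumption: a text-only query carries no image
    "positive_document_ids": ["d1"],
    "negative_document_ids": ["d7", "d9"],
    "source": "msmarco",
    "answer": "",  # assumption: empty when the source has no answer
}
print(check_training_record(example))
```

A check like this could run on a handful of rows before pushing a transformed dataset, to catch a renamed or missing column early.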

A unified corpus data format

{
	"docid": str,
	"image": PIL.Image,
	"text": str,
	"source": str,
}
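
As a rough illustration of the corpus format, the sketch below maps one raw row into the four fields. The function `to_corpus_record`, the raw row, and the choice to prepend the title to the text are hypothetical; the actual scripts may build records differently.

```python
from typing import Any, Dict, Optional


def to_corpus_record(
    docid: Any,
    text: str = "",
    image: Optional[Any] = None,
    source: str = "",
) -> Dict[str, Any]:
    """Build one record with the four fields of the unified corpus format."""
    return {
        "docid": str(docid),  # the format declares docid as str
        "image": image,       # a PIL.Image in the schema; None here for text-only data
        "text": text,
        "source": source,
    }


# Hypothetical raw row from a text-only corpus.
raw_row = {"docid": 42, "title": "Dense retrieval", "text": "Dense retrievers encode text."}

record = to_corpus_record(
    docid=raw_row["docid"],
    # Assumption: title and body are joined into a single text field.
    text=f"{raw_row['title']} {raw_row['text']}",
    source="msmarco",
)
print(record)
```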

Execution

Navigate to /tevatron/scripts/dataset_transform_scripts/ and run python ./<script_name>

Results

We created 4 new Hugging Face datasets by running the scripts, which transform the original datasets to the new schema:

Msmarco

ColPali

@Samantha-Zhan Samantha-Zhan self-assigned this Feb 8, 2025
@MXueguang
Contributor

Let's follow snake case for variable and function names, where each word is separated by an underscore character.

e.g. loadDatasets -> load_datasets
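
For reference, a small sketch of the renaming rule. The helper `to_snake_case` is hypothetical and only handles simple camelCase names like the one above.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a simple camelCase name to snake_case.

    An underscore is inserted before every uppercase letter that is not the
    first character, then the result is lowercased. Runs of capitals
    (acronyms) are split letter by letter, so those need manual renaming.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(to_snake_case("loadDatasets"))
```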

@MXueguang MXueguang merged commit 869cf92 into tevatron-v2 Feb 9, 2025