[WIP] refactor of dataset builder and executor #537
Open: cyruszhang wants to merge 74 commits into main from feat/cyruszhang/data-downloader
Changes from 60 commits
Commits
d11f89c  ignore __dj__produced_data__  (cyruszhang)
41dea26  add download framework; add wiki support  (cyruszhang)
50f8d3d  refactor formatter; add dataset_builder  (cyruszhang)
817caab  merge with master  (cyruszhang)
a089de4  add config files and test entry  (cyruszhang)
5a717d7  initial dataset_builder  (cyruszhang)
9c79844  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
ffba7e7  add mixture dataset support; type/subtype  (cyruszhang)
79ae980  RayExecutor with ExecutorBase  (cyruszhang)
e6a6e71  get rid of subtype for local dataset; depending on ext for proper rou…  (cyruszhang)
eb300f0  use source instead of sub_type for remote dataset configs  (cyruszhang)
456eea1  arxiv downloader return Dataset instead of DJDataset  (cyruszhang)
c25e40f  rewrite CLI datapath with test cases  (cyruszhang)
75ffe3f  add executor and dataload strategy logic  (cyruszhang)
4ec1ef9  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
4fb6e17  add layered load strategies  (cyruszhang)
84803cd  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
cb5b80a  fix circular dependency; add dataset config test  (cyruszhang)
daf7a85  update dataset_path parsing in config  (cyruszhang)
7c48892  fix download test case; add wildcard matching for load strategy  (cyruszhang)
940b44d  add test case for load strategy wild card matching  (cyruszhang)
b80f991  add more test cases for datapath rewrite logic; fix rewrite to handle…  (cyruszhang)
0d5d4ba  materialize symlinks for duplicates  (cyruszhang)
f3a4ec4  add load strategy validation framework  (cyruszhang)
70fffd2  add DataValidator logic  (cyruszhang)
bbc303d  data validator as separate pre-processing  (cyruszhang)
4b6065f  update data validator logic and add/fix test cases  (cyruszhang)
0b153ab  [nit] rename test  (cyruszhang)
171b361  [nit] rename test again  (cyruszhang)
6841d19  add builder test cases; update ds config validation logic  (cyruszhang)
3128d05  [minor] update test case naming  (cyruszhang)
7b6b2bd  add support for max_sample_num in dataset configs; add tests  (cyruszhang)
161f059  fix test cases and update dataset builder code  (cyruszhang)
8cb322f  merge main  (cyruszhang)
afe906d  handle weights and sample_nums  (cyruszhang)
1217e61  support ExecutorType enum  (cyruszhang)
755abca  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
5dd17fe  flip on DatasetBuilder; replace formatter  (cyruszhang)
eb3b123  minor fix  (cyruszhang)
7c171fb  add ExecutorBase to RayExecutor  (cyruszhang)
195aff8  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
dd95df0  fix bugs; use str for executor_type  (cyruszhang)
530efa8  add add_same_content_to_new_column reference  (cyruszhang)
3b726bd  ray data defaults to json  (cyruszhang)
cac8e5e  fix dataset_path bug; add ray config test  (cyruszhang)
a99c9b5  tests video on ray config  (cyruszhang)
3c9caf5  add default cfg logic; fix data_mixture demo  (cyruszhang)
b9f6a99  default executor + local data; fix analyzer bug  (cyruszhang)
e05f146  Merge branch 'main' into feat/cyruszhang/data-downloader  (cyruszhang)
acccc01  pass through num_proc param for ray executor when loading dataset  (cyruszhang)
1823cd6  fix bugs for huggingface dataset loading; add sample config  (cyruszhang)
2963118  fix typo in configs  (cyruszhang)
4472aef  remove absolute path logic; remove dup test files  (cyruszhang)
7964867  update .gitignore for dup files in tests  (cyruszhang)
96207ba  fix RayDataset schema validation issue  (cyruszhang)
9b1d738  fix wiki downloader tests  (cyruszhang)
828e7ba  remove mixture formatter; logic captured in dataloader  (cyruszhang)
4ffb3cf  remove unused mixture formatter  (cyruszhang)
7c16b23  minor fixes for CR comments  (cyruszhang)
f73dd41  resolve eager RayExecutor importing  (cyruszhang)
8aae265  bugfix: handle missing configs  (cyruszhang)
1d65a3a  add schema support for datasets  (cyruszhang)
96a4997  bugfix: handle relative path problem in tests  (cyruszhang)
2f49eec  fix test cases  (cyruszhang)
643e7d7  add schema support for DJDataset; remove eager Ray imports; add data …  (cyruszhang)
0412e36  revert relative path for demo multi-modal data  (cyruszhang)
17e70cc  proper type mapping for HF and ray datasets; add test cases  (cyruszhang)
2073660  add get method for DJDataset and tests  (cyruszhang)
f3f5e13  add proper validators and test cases for SwiftMessage and DJ_conversa…  (cyruszhang)
c5c4d0a  add validation demo; add validators config entry  (cyruszhang)
4019fd7  fix test bug; _strategies is class variable and could cause dirty dat…  (cyruszhang)
38204d0  add ray relative path resolution logic, for both config file and data…  (cyruszhang)
f8849d7  merge master  (cyruszhang)
2514b1f  revert to lazy loading  (cyruszhang)
@@ -0,0 +1,6 @@
# global parameters
project_name: 'dataset-local-json'
dataset:
  configs:
    - type: 'local'
      path: 'path/to/json/file'
@@ -0,0 +1,6 @@
# global parameters
project_name: 'dataset-local-parquet'
dataset:
  configs:
    - type: 'local'
      path: 'path/to/parquet/file'
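The two local configs above differ only in the file they point to, and one commit message in this PR mentions dropping the subtype for local datasets and depending on the file extension. A minimal sketch of that idea follows. The `EXT_TO_FORMAT` table and the `infer_format` helper are hypothetical names for illustration, not code from the PR.

```python
from pathlib import Path

# Assumed extension-to-format table; the table actually used in the PR is not shown.
EXT_TO_FORMAT = {
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".csv": "csv",
}


def infer_format(path: str) -> str:
    """Guess the loader format of a local dataset from its file extension."""
    ext = Path(path).suffix.lower()
    if ext not in EXT_TO_FORMAT:
        raise ValueError(f"unsupported extension: {ext!r}")
    return EXT_TO_FORMAT[ext]
```

With a table like this, a `type: 'local'` entry would need no extra key to say whether the file is JSON or Parquet.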
@@ -0,0 +1,10 @@
project_name: 'dataset-mixture'
dataset:
  max_sample_num: 10000
  configs:
    - type: 'local'
      weight: 1.0
      path: 'path/to/json/file'
    - type: 'local'
      weight: 1.0
      path: 'path/to/csv/file'
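The mixture config combines per-dataset `weight` values with a global `max_sample_num`. How the PR turns these into sample counts is not shown in this diff. The sketch below illustrates one plausible reading, a proportional split, and `allocate_samples` is a hypothetical helper rather than the PR's implementation.

```python
from typing import List


def allocate_samples(weights: List[float], max_sample_num: int) -> List[int]:
    """Split max_sample_num across datasets in proportion to their weights.

    Assumption: weights are normalized by their sum. The PR may do otherwise.
    """
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be a non-empty list of non-negative numbers")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    # Floor each share, then give the leftover samples to the largest remainders
    # so that the counts always add up to max_sample_num.
    shares = [w / total * max_sample_num for w in weights]
    counts = [int(s) for s in shares]
    leftover = max_sample_num - sum(counts)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

Under this reading, the two equally weighted sources in the config above would each contribute half of the 10000 samples.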
@@ -0,0 +1,10 @@
# global parameters
project_name: 'dataset-remote-arxiv'
dataset:
  configs:
    - type: 'remote'
      source: 'arxiv'
      lang: 'en'
      dump_date: 'latest'
      force_download: false
      url_limit: 2
@@ -0,0 +1,11 @@
# global parameters
project_name: 'dataset-remote-commoncrawl'
dataset:
  configs:
    - type: 'remote'
      source: 'commoncrawl'
      start_snapshot: '2020-50'
      end_snapshot: '2021-04'
      aws: true
      force_download: false
      url_limit: 2
@@ -0,0 +1,10 @@
# global parameters
project_name: 'dataset-remote-huggingface'
dataset:
  configs:
    - type: 'remote'
      source: 'huggingface'
      path: "HuggingFaceFW/fineweb"
      name: "CC-MAIN-2024-10"
      split: "train"
      limit: 1000
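The `path`, `name` and `split` keys in this config share their names with arguments of `datasets.load_dataset`. The sketch below shows how one such entry might be reduced to those arguments. `to_load_dataset_kwargs` is a hypothetical helper, and the handling of `limit` is an assumption, since the PR's loader code is not part of this diff.

```python
from typing import Any, Dict

# Config keys that map directly onto load_dataset(path, name=..., split=...).
_LOAD_KEYS = ("path", "name", "split")


def to_load_dataset_kwargs(ds_config: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the load_dataset arguments out of one remote/huggingface entry."""
    if ds_config.get("type") != "remote" or ds_config.get("source") != "huggingface":
        raise ValueError("expected a remote/huggingface dataset config")
    if "path" not in ds_config:
        raise ValueError("'path' is required")
    return {k: ds_config[k] for k in _LOAD_KEYS if k in ds_config}


cfg = {
    "type": "remote", "source": "huggingface",
    "path": "HuggingFaceFW/fineweb", "name": "CC-MAIN-2024-10",
    "split": "train", "limit": 1000,
}
kwargs = to_load_dataset_kwargs(cfg)

# With the `datasets` package installed, the load itself would be:
#   from datasets import load_dataset
#   ds = load_dataset(**kwargs)
# One possible way to apply `limit` (assumed, not confirmed by the PR):
#   ds = ds.select(range(min(cfg["limit"], len(ds))))
```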
@@ -0,0 +1,10 @@
# global parameters
project_name: 'dataset-remote-modelscope'
dataset:
  configs:
    - type: 'remote'
      source: 'modelscope'
      path: 'modelscope/clue'
      subset_name: 'afqmc'
      split: 'train'
      limit: 1000
@@ -0,0 +1,10 @@
# global parameters
project_name: 'dataset-remote-wiki'
dataset:
  configs:
    - type: 'remote'
      source: 'wiki'
      lang: 'en'
      dump_date: 'latest'
      force_download: false
      url_limit: 2
@@ -0,0 +1,18 @@
dataset:
  configs:
    - type: local
      path: path/to/data.json

validators:
  - type: conversation
    min_turns: 2
    max_turns: 20
  - type: required_fields
    required_fields:
      - "text"
      - "metadata"
      - "language"
    field_types:
      text: "str"
      metadata: "dict"
      language: "str"
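The `required_fields` validator entry lists the fields each sample must carry and the type expected for each. A rough sketch of such a per-sample check is given below. `check_required_fields` and the `TYPE_NAMES` mapping are hypothetical. The PR's DataValidator classes are not shown in this diff and may behave differently.

```python
from typing import Any, Dict, List

# Assumed mapping from the type names used in the YAML to Python types.
TYPE_NAMES = {"str": str, "dict": dict, "int": int, "float": float, "list": list}


def check_required_fields(sample: Dict[str, Any],
                          required_fields: List[str],
                          field_types: Dict[str, str]) -> List[str]:
    """Return the problems found in one sample; an empty list means it passed."""
    problems = []
    # Presence check for every required field.
    for field in required_fields:
        if field not in sample:
            problems.append(f"missing field '{field}'")
    # Type check only for fields that are actually present.
    for field, type_name in field_types.items():
        if field in sample and not isinstance(sample[field], TYPE_NAMES[type_name]):
            problems.append(
                f"field '{field}' should be {type_name}, "
                f"got {type(sample[field]).__name__}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a sample at once. That is a design choice of this sketch, not something the PR states.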
@@ -0,0 +1,22 @@
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset:
  configs:
    - type: 'remote'
      source: 'huggingface'
      path: 'hugfaceguy0001/retarded_bar'
      name: 'question'
      split: 'train'

np: 4  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8
@@ -1,7 +1,7 @@
-from .config import (export_config, get_init_configs, init_configs,
-                     merge_config, prepare_side_configs)
+from .config import (export_config, get_default_cfg, get_init_configs,
+                     init_configs, merge_config, prepare_side_configs)
 
 __all__ = [
     'init_configs', 'get_init_configs', 'export_config', 'merge_config',
-    'prepare_side_configs'
+    'prepare_side_configs', 'get_default_cfg'
 ]
@@ -0,0 +1,9 @@
from .dj_dataset import (DJDataset, NestedDataset,
                         add_same_content_to_new_column,
                         wrap_func_with_nested_access)
from .ray_dataset import RayDataset

__all__ = [
    'DJDataset', 'NestedDataset', 'RayDataset', 'wrap_func_with_nested_access',
    'add_same_content_to_new_column'
]
@@ -0,0 +1,61 @@
from typing import Dict


class ConfigValidationError(Exception):
    """Custom exception for validation errors"""
    pass


class ConfigValidator:
    """Mixin class for configuration validation"""

    # Define validation rules for each strategy type
    CONFIG_VALIDATION_RULES = {
        'required_fields': [],  # Fields that must be present
        'optional_fields': [],  # Fields that are optional
        'field_types': {},  # Expected types for fields
        'custom_validators': {}  # Custom validation functions
    }

    def validate_config(self, ds_config: Dict) -> None:
        """
        Validate the configuration dictionary.

        Args:
            ds_config: Configuration dictionary to validate

        Raises:
            ConfigValidationError: If validation fails
        """
        # Check required fields
        missing_fields = [
            field for field in self.CONFIG_VALIDATION_RULES['required_fields']
            if field not in ds_config
        ]
        if missing_fields:
            raise ConfigValidationError(
                f"Missing required fields: {', '.join(missing_fields)}")

        # Optional fields
        # no need for any special checks

        # Check field types
        for field, expected_type in self.CONFIG_VALIDATION_RULES[
                'field_types'].items():
            if field in ds_config:
                value = ds_config[field]
                if not isinstance(value, expected_type):
                    raise ConfigValidationError(
                        f"Field '{field}' must be of "
                        f"type '{expected_type.__name__}', "
                        f"got '{type(value).__name__}'")

        # Run custom validators
        for field, validator in self.CONFIG_VALIDATION_RULES[
                'custom_validators'].items():
            if field in ds_config:
                try:
                    validator(ds_config[field])
                except Exception as e:
                    raise ConfigValidationError(
                        f"Validation failed for field '{field}': {str(e)}")
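To show how the mixin above is meant to be driven by class-level rules, here is a small usage sketch. It repeats a condensed stand-in for `ConfigValidator` so the example runs on its own. `LocalJsonStrategy`, `_positive` and `try_validate` are hypothetical names, and the concrete rules are assumptions, not the rules the PR defines for its load strategies.

```python
from typing import Dict


class ConfigValidationError(Exception):
    """Raised when a dataset config fails validation."""


class ConfigValidator:
    """Condensed stand-in for the mixin in the diff above."""

    CONFIG_VALIDATION_RULES = {
        'required_fields': [], 'optional_fields': [],
        'field_types': {}, 'custom_validators': {},
    }

    def validate_config(self, ds_config: Dict) -> None:
        rules = self.CONFIG_VALIDATION_RULES
        missing = [f for f in rules['required_fields'] if f not in ds_config]
        if missing:
            raise ConfigValidationError(
                f"Missing required fields: {', '.join(missing)}")
        for field, expected in rules['field_types'].items():
            if field in ds_config and not isinstance(ds_config[field], expected):
                raise ConfigValidationError(
                    f"Field '{field}' must be of type '{expected.__name__}', "
                    f"got '{type(ds_config[field]).__name__}'")
        for field, validator in rules['custom_validators'].items():
            if field in ds_config:
                try:
                    validator(ds_config[field])
                except Exception as e:
                    raise ConfigValidationError(
                        f"Validation failed for field '{field}': {e}") from e


def _positive(value: float) -> None:
    """Custom validator: reject zero and negative numbers."""
    if value <= 0:
        raise ValueError("must be positive")


class LocalJsonStrategy(ConfigValidator):
    """Hypothetical strategy that overrides the class-level rules."""

    CONFIG_VALIDATION_RULES = {
        'required_fields': ['path'],
        'optional_fields': ['weight'],
        'field_types': {'path': str},
        'custom_validators': {'weight': _positive},
    }


def try_validate(cfg: Dict) -> str:
    """Return 'ok' or the validation error message for a config dict."""
    try:
        LocalJsonStrategy().validate_config(cfg)
        return "ok"
    except ConfigValidationError as e:
        return str(e)
```

Because the rules live in a class attribute, each strategy subclass should assign its own dictionary, as `LocalJsonStrategy` does here, instead of mutating the shared default.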
I suggest making this new style explicit in config_all.yaml (perhaps commented out), and referring users to this example and the files under configs/datasets/.