Commit

Merge branch 'main' of https://github.com/t1101675/LMOps
t1101675 committed Oct 28, 2024
2 parents c38563c + 6944b4c commit 0390cd2
Showing 4 changed files with 11 additions and 5 deletions.
9 changes: 8 additions & 1 deletion data_selection/README.md
@@ -3,8 +3,11 @@
[paper](https://arxiv.org/abs/2410.07064) | [huggingface](https://huggingface.co/Data-Selection)

<div>Theory Overview:</div>

<img src="./figures/theory.png" width="70%"/>
<br>
<div>Training Framwork PDS:</div>

<img src="./figures/method.png" width="70%"/>

## Overview of the Training Framework
@@ -14,7 +17,11 @@
4. Filter CC with the scores.
5. Pre-train the model.

-## Selected Data and Pre-Trained Models
+## Pre-Trained Models
++ [Models](https://huggingface.co/collections/Data-Selection/baseline-models-670550972a59015f6c8870ab) Trained on Redpajama CC (Conventional Pre-Training, Baselines)
++ [Models](https://huggingface.co/collections/Data-Selection/pds-models-6705504096a78d10a30837c0) Trained PDS-Selected Data
+
+## Selected Data
TODO

## Details of the Pipeline & How to run
1 change: 0 additions & 1 deletion data_selection/install.sh
@@ -9,7 +9,6 @@ pip3 install rich
pip3 install accelerate
pip3 install datasets
pip3 install sentencepiece
-pip3 install peft
pip3 install matplotlib
pip3 install wandb
pip3 install cvxpy
4 changes: 2 additions & 2 deletions data_selection/scripts/data_scorer/infer.sh
@@ -14,8 +14,8 @@ DISTRIBUTED_ARGS="--num_gpus $GPUS_PER_NODE \

# model
BASE_PATH=${1-"/home/MiniLLM"}
-CKPT="${BASE_PATH}/results/data_scorer/cc-sgd100-160M-10k-lima-163840/fairseq_125M/e5-w10-bs16-lr0.0001cosine1e-07-G2-N16-NN2/mean-bias-linear/best"
-CKPT_NAME="cc-sgd100-160M-10k-lima"
+CKPT="${BASE_PATH}/results/data_scorer/"
+CKPT_NAME="cc-160M-lima"
# data
DATA_DIR="${BASE_PATH}/processed_data/data_scorer_infer/cc/mistral-fairseq-1024"
# hp
2 changes: 1 addition & 1 deletion data_selection/scripts/pmp_solver/160M.sh
@@ -27,7 +27,7 @@ OPTS=""
OPTS+=" --type pmp_solver"
# model
OPTS+=" --model-type mistral"
-OPTS+=" --model-path ${BASE_PATH}/results/pretrain/cc/mistral_160M/t100K-w2K-bs8-lr0.0006cosine6e-05-G4-N16-NN2-scr/10000"
+OPTS+=" --model-path ${BASE_PATH}/results/pretrain/mistral_160M-10K/"
OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --ckpt-name 160M-10k"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"