Add code for running the Eval Harness in t5x #10
Conversation
Haven't read the entire code, but it would be nice to check that the whole pipeline works. For example, are you able to load one of the checkpoints and run evaluation? Otherwise awesome work!
num_partitions = 4
model_parallel_submesh = (2, 1, 1, 1)

TASK_FEATURE_LENGTHS = {"inputs": 512, "targets": 114}
I'm confused by this at inference: how do you make a sample fit inside these lengths?
Not quite sure what you mean, but fit as in fit in memory?
In that case I didn't play with it too much, but since we're not storing gradients and the batch size is small, everything seems to work out fine even for the XXL model. I just reduced it because we can't partition the small model four ways.
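To illustrate why a 4-way partition can fail for a small model, here is a minimal sketch. The submesh shape and the head count of 6 are illustrative assumptions (not taken from this PR): the model-parallel submesh multiplies out to the partition count, and each sharded dimension of the model must be divisible by it.

```python
import math

# model_parallel_submesh = (x, y, z, cores); its product is the number of
# devices each model replica is sharded over.
model_parallel_submesh = (2, 1, 1, 1)
num_partitions = math.prod(model_parallel_submesh)  # 2

# Hypothetical attention-head count for a small model (assumption for
# illustration only): heads must divide evenly across partitions.
num_heads = 6
assert num_heads % num_partitions == 0  # 6 heads split 2 ways: OK
assert num_heads % 4 != 0               # 6 heads cannot split 4 ways
```

This is why the config above reduces `num_partitions` for the small model while the XXL model tolerates a larger submesh.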
If you mean the length of the features, I should probably find a way to make sure input is never truncated. The tasks I've looked at in the Eval Harness are quite short, so it didn't seem to be an issue, but I should probably add an assert. It will be more of an issue if we look at few-shot instead of zero-shot.
Ah okay, I see, it automatically pads to those sequence lengths, right? As for the truncation problem: we tried tracking the length of each task in a Google sheet (shared internally), and it seems to be okay-ish to truncate (most samples will fit); RACE might be problematic though.
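The assert discussed above could look something like the following sketch. The helper name and the example format (a dict of feature name to token-id list) are assumptions for illustration; shorter sequences are assumed to be padded up to the fixed lengths, so only over-length examples need to be caught.

```python
# Fixed feature lengths from the gin config above.
TASK_FEATURE_LENGTHS = {"inputs": 512, "targets": 114}

def check_no_truncation(examples, feature_lengths=TASK_FEATURE_LENGTHS):
    """Raise if any feature of any example exceeds its fixed length.

    `examples` is an iterable of dicts mapping feature name -> token-id list.
    Anything longer than the fixed length would be silently truncated.
    """
    for i, ex in enumerate(examples):
        for feature, max_len in feature_lengths.items():
            n = len(ex[feature])
            if n > max_len:
                raise ValueError(
                    f"example {i}: '{feature}' has {n} tokens but the fixed "
                    f"length is {max_len}; it would be truncated")

# A 500-token input fits; a 600-token input would raise ValueError.
check_no_truncation([{"inputs": [0] * 500, "targets": [0] * 100}])
```

Running such a check once over each task before evaluation would flag the few-shot case, where concatenated demonstrations can easily exceed 512 input tokens.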
utils.RestoreCheckpointConfig:
  path = %CHECKPOINT_PATH
  mode = 'specific'
  dtype = 'bfloat16'
I'm saving them in float32; I don't know if it has any impact if you load a float32 checkpoint in bfloat16. I have some earlier checkpoints, and if you could run inference on them that'd be awesome!
Good point, but I don't think it should be an issue: since training runs in bfloat16, inference should work in it as well. I'll check and see if it makes any difference though.
Sure, send me a path and I can test it.
Running on checkpoints is just a matter of running python3 ${T5X_DIR}/t5x/eval_harness.py. Not tested on our current checkpoints, but it should just be a matter of changing the checkpoint path.
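For intuition on what loading a float32 checkpoint with dtype = 'bfloat16' does to each weight, here is a pure-Python sketch (not the t5x implementation): bfloat16 keeps the float32 sign and exponent but only the top 7 mantissa bits, so each value is rounded to roughly 2-3 significant decimal digits.

```python
import struct

def to_bfloat16(x):
    """Round a float32 value to the nearest representable bfloat16."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low mantissa bits being discarded.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_bfloat16(1.0))  # exactly representable: 1.0
print(to_bfloat16(0.1))  # rounds to 0.10009765625
```

Since bfloat16 shares float32's exponent range, the cast only loses precision, never range, which is why bfloat16 training (and hence bfloat16 inference) tends to be well-behaved.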
Based on successful evaluation runs (that seemed coherent), let's merge this to allow others to run evaluation more easily. I haven't had the time to review it in detail though.
Adding support for running the EleutherAI Evaluation Harness directly, addressing issue #4.