Allow for tokenizers/preprocessors to change batch size #866
base: main
Conversation
Thanks for this. I think it's not quite right in the case of preemption. In particular, we use open_shard_at_row with this value when we resume tokenization. What we would need to do is save both "rows_in" and "rows_out" and only use rows_in for open_shard_at_row. Does that make sense?
@dlwh ah yeah totally! I'll try to get that worked in.
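A minimal sketch of the idea being discussed (illustrative names; not Levanter's actual code, though open_shard_at_row and the shard_rows fields come from the thread): track input rows consumed and output rows written per shard separately, and resume reading from the input-side count after preemption.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LedgerSketch:
    # Input rows consumed per shard -- this is what resumption must use.
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # Processed rows written per shard -- may differ if the processor
    # merges or splits examples.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

def resume_shard(source, ledger: LedgerSketch, shard_name: str):
    # On restart, skip the input rows already consumed, not the (possibly
    # different) number of rows the tokenizer/processor emitted.
    rows_in = ledger.shard_rows.get(shard_name, 0)
    return source.open_shard_at_row(shard_name, rows_in)
```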
sorry, one small rename then I'm happy
```diff
@@ -473,7 +473,11 @@ def _monitor_metrics(self):
 class CacheLedger:
     # NB: unlike the old cache, the mere existence of a ledger doesn't mean the cache is finished
     total_num_rows: int
-    shard_rows: Dict[str, int]
+    """Number of outputted rows in the cache"""
+    shard_rows_in: Dict[str, int]
```
sorry, can we rename this one back to just shard_rows (leave a comment) so that we don't invalidate all the other caches?
Ah gotcha, yeah... Do I need to do some other logic to make shard_rows_out an optional key during deserialization?
I'm pretty sure if you give it a default value it will be fine. I'll check it against a cache before merging.
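A sketch of the default-value approach, assuming the ledger is a dataclass deserialized from JSON (the loader below is illustrative, not Levanter's actual one): because the new field has a default, ledgers written before this change still load.

```python
import dataclasses
import json
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CacheLedgerSketch:
    total_num_rows: int = 0
    shard_rows: Dict[str, int] = field(default_factory=dict)
    # New field: defaults to empty, so old ledgers missing this key
    # still deserialize without error.
    shard_rows_out: Dict[str, int] = field(default_factory=dict)

def load_ledger(path: str) -> CacheLedgerSketch:
    with open(path) as f:
        data = json.load(f)
    # Only pass through keys the dataclass knows about; any missing
    # field (e.g. shard_rows_out in an old ledger) takes its default.
    names = {f.name for f in dataclasses.fields(CacheLedgerSketch)}
    return CacheLedgerSketch(**{k: v for k, v in data.items() if k in names})
```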
Small change to how the number of batches is counted. Instead of assuming $n$ examples from the shard iterator means $n$ examples after tokenizing/processing, it gets the length from the output of the tokenizer. I had a use case that involved combining multiple samples during the processing stage, and ran into a bug where the number of batches stored in the offsets was incorrect.
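A hedged sketch of the fix described above (function and names are illustrative, not the actual patch): count output rows from what the processor actually returns instead of assuming one output row per input row.

```python
def count_batches(batches, processor):
    """Return (rows_in, rows_out) after running `processor` over `batches`."""
    rows_in = rows_out = 0
    for batch in batches:
        processed = processor(batch)
        rows_in += len(batch)
        # Previously the count was effectively len(batch); a processor that
        # merges or splits examples makes len(processed) differ, so count
        # the output instead.
        rows_out += len(processed)
    return rows_in, rows_out

# e.g. a processor that packs pairs of examples together halves the row count:
pair_packer = lambda batch: [batch[i:i + 2] for i in range(0, len(batch), 2)]
print(count_batches([list(range(10))], pair_packer))  # (10, 5)
```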