Baal seems to ignore eval_batch_size causing gpu memory issues #280
Hello,
Thank you for submitting the issue. I should be able to take a look over the weekend.
I'm a bit puzzled because we simply call self.get_eval_dataloader(dataset) which is managed by HuggingFace. I'll know more this weekend.
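For reference, the batch size that get_eval_dataloader ends up using comes straight from TrainingArguments; a quick sketch of the HuggingFace side (assuming a single-process setup):
from transformers import TrainingArguments
args = TrainingArguments(output_dir='/tmp', per_device_eval_batch_size=2)
# eval_batch_size is per_device_eval_batch_size * max(1, n_gpu)
print(args.eval_batch_size)  # 2 on a single-GPU machine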
I know!
I dug into the code, and that's why I set the dataloader args.
But apparently they're getting ignored, so I don't know how to troubleshoot this. Maybe there is some environment variable playing a role here, I don't know.
I was thinking about it and maybe it's because of the stacking we perform. It is fairly trivial to add this feature, so I'll do that.
Progress bar stuff
I just tested the progress bar problem and it seems to work. 🤔
from datasets import load_dataset
from transformers import pipeline, TrainingArguments, DataCollatorWithPadding
from baal.transformers_trainer_wrapper import BaalTransformersTrainer
TEXT_COL = 'sentence'
ds = load_dataset('sst2')['test'].remove_columns('label')
pipe = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
tokenizer = pipe.tokenizer
model = pipe.model
def preprocess_function(examples):
    return tokenizer(examples[TEXT_COL], truncation=True)
tokenized_ds = ds.map(preprocess_function, batched=True)
training_args = TrainingArguments(
    output_dir='/tmp',
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = BaalTransformersTrainer(model=model, args=training_args, tokenizer=tokenizer,
                                  data_collator=data_collator)
print("Total examples", len(tokenized_ds))
print(
    f"Dataloader length={len(trainer.get_eval_dataloader(tokenized_ds))}, batch_size={training_args.per_device_eval_batch_size}")
trainer.predict_on_dataset(tokenized_ds, iterations=10)
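To make the stacking hypothesis concrete: if the MC-Dropout iterations are stacked into the batch dimension before the forward pass (a sketch of that assumption below), the model effectively sees per_device_eval_batch_size * iterations examples at once.
import torch
batch_size, iterations = 2, 10
batch = torch.randn(batch_size, 128)   # toy eval batch
stacked = batch.repeat(iterations, 1)  # one copy of the batch per MC iteration
print(stacked.shape)                   # torch.Size([20, 128]) -> 10x the activation memory of a single batch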
I just opened #281 which should allow you to run your experiment. If you can install Baal from source from this branch, you could update your code with:
and that should fix it. In any case, I should be able to get the PR merged this week and will release a minor version along with it :)
I'm sorry for any miscommunication; what I meant by manually setting the batch_size to 8 is the following:
where
So the iterations are still 30 and my max batch_size is 8, so the number of inputs it's loading into the model is 8*30. Anyway, I'll install Baal from #281 and see whether that removes the need for my forced chunking solution.
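For illustration, the forced chunking is along these lines (a rough sketch, reusing the trainer and tokenized_ds names from the snippet above; not something Baal provides):
import numpy as np
CHUNK = 8  # the largest batch that fits in GPU memory for me
chunks = []
for start in range(0, len(tokenized_ds), CHUNK):
    subset = tokenized_ds.select(range(start, min(start + CHUNK, len(tokenized_ds))))
    chunks.append(trainer.predict_on_dataset(subset, iterations=30))
predictions = np.concatenate(chunks, axis=0)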
Describe the bug
When setting the batch_size to 2 in Baal, it appears to be using a batch_size of 16 instead, which causes a CUDA out of memory error. Despite setting per_device_eval_batch_size and train_batch_size to 2 in TrainingArguments, the predict_on_dataset function seems to be using a batch_size of 16. I am letting Baal sort 1e6 (1 million) examples; when I run the predict_on_dataset function I see the following in the logs:
meaning it is using a batch_size of 16 instead of the specified 2.
A batch size of 8 would also work (if I manually downsample the input dataframe to 8 inputs).
To Reproduce
which gives:
Expected behavior
The predict_on_dataset function should respect the batch_size specified in TrainingArguments and not cause a CUDA out of memory error.
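Concretely, with names like trainer and training_args from the reproducer above, I would expect a check along these lines to pass (a sketch):
eval_loader = trainer.get_eval_dataloader(tokenized_ds)
assert eval_loader.batch_size == training_args.per_device_eval_batch_size  # expect 2, not 16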
Version (please complete the following information):
version : 1.9.1
description : Library to enable Bayesian active learning in your research or labeling work.
dependencies
Additional context
I am running this on AWS Batch on a p3 instance.