Baal seems to ignore eval_batch_size causing gpu memory issues #280

Open
hugocool opened this issue Oct 20, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@hugocool

Describe the bug
When setting the batch size to 2 in Baal, it appears to use a batch size of 16 instead, which causes a CUDA out-of-memory error. Despite setting per_device_eval_batch_size and train_batch_size to 2 in TrainingArguments, the predict_on_dataset function seems to use a batch size of 16. I am letting Baal sort 1e6 (1 million) examples, and when I run predict_on_dataset I see the following in the logs:

0%| | 0/62500 [00:00<?, ?it/s] 0%| | 0/62500 [00:01<?, ?it/s]

meaning it is using a batch size of 16 instead of the specified 2 (1,000,000 examples over 62,500 steps works out to 16 per step).
A batch size of 8 would also work (if I manually downsample the input dataframe to 8 inputs at a time).

To Reproduce

    from datasets import Dataset
    from transformers import TrainingArguments

    from baal.bayesian.dropout import patch_module
    from baal.transformers_trainer_wrapper import BaalTransformersTrainer

    # Enable MC-Dropout at inference time.
    model = patch_module(model)

    args = TrainingArguments(output_dir="/", per_device_eval_batch_size=2)
    args = args.set_dataloader(
        train_batch_size=2, eval_batch_size=2, auto_find_batch_size=False
    )

    trainer = BaalTransformersTrainer(
        model=model,
        args=args,
    )

    dataset = Dataset.from_pandas(tokenized_X)
    predictions = trainer.predict_on_dataset(dataset, iterations=30)

which gives:

"CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 15.78 GiB total capacity; 14.49 GiB already allocated; 397.75 MiB free; 14.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

Expected behavior
The predict_on_dataset function should respect the batch_size specified in TrainingArguments and not cause a CUDA out of memory error.

Version (please complete the following information):

  • OS: Ubuntu 22
  • Python: 3.10.4
  • Baal:
    version: 1.9.1
    description: Library to enable Bayesian active learning in your research or labeling work.

dependencies

  • h5py >=3.4.0,<4.0.0
  • matplotlib >=3.4.3,<4.0.0
  • numpy >=1.21.2,<2.0.0
  • Pillow >=6.2.0
  • scikit-learn >=1.0.0,<2.0.0
  • scipy >=1.7.1,<2.0.0
  • structlog >=21.1.0,<22.0.0
  • torch >=1.6.0
  • torchmetrics >=0.9.3,<0.10.0
  • tqdm >=4.62.2,<5.0.0

Additional context
I am running this on AWS batch on a p3 instance.

hugocool added the bug label on Oct 20, 2023
@Dref360 (Member) commented Oct 20, 2023

Hello,

Thank you for submitting the issue. I should be able to take a look over the weekend.

I'm a bit puzzled because we simply call self.get_eval_dataloader(dataset) which is managed by HuggingFace. I'll know more this weekend.
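
For reference, a quick way to sanity-check which batch size the eval dataloader actually uses (an illustrative sketch, assuming the trainer and dataset from the report above):

    dl = trainer.get_eval_dataloader(dataset)
    # If the setting is respected, len(dl) * per_device_eval_batch_size should be
    # roughly len(dataset): 1,000,000 examples at batch size 2 means 500,000 steps,
    # whereas the reported 62,500 steps implies an effective batch size of 16.
    print(len(dataset), len(dl), trainer.args.per_device_eval_batch_size)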

@hugocool (Author) commented Oct 20, 2023 via email

@Dref360 (Member) commented Oct 21, 2023

I was thinking about it, and maybe it's because of the stacking we perform.
For our HF implementation, we always perform MC-Dropout in a single pass, meaning that batch_size=2 results in a batch of 2 * ITERATIONS being fed to the model. You said that a manual batch size of 8 is your maximum, so 2 * 30 = 60 is too much.

Our ModelWrapper implementation has a flag replicate_in_memory which avoids this stacking, but we don't have it for the HF wrapper.

It is fairly trivial to add this feature, so I'll do that.
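
To make the memory impact concrete, here is a minimal sketch (not Baal's actual code; the helper name is made up) of what this in-memory stacking amounts to for a HF classifier batch:

    import torch

    def stacked_mc_dropout_step(model, batch, iterations: int):
        # Tile every example `iterations` times along the batch dimension so all
        # MC-Dropout samples are computed in a single forward pass.
        ids = batch["input_ids"].repeat_interleave(iterations, dim=0)        # [batch_size * iterations, seq_len]
        mask = batch["attention_mask"].repeat_interleave(iterations, dim=0)
        with torch.no_grad():
            logits = model(input_ids=ids, attention_mask=mask).logits
        # Regroup into [batch_size, num_labels, iterations]; with batch_size=2 and
        # iterations=30, the forward pass above saw 60 rows at once.
        num_labels = logits.shape[-1]
        return logits.view(-1, iterations, num_labels).permute(0, 2, 1)

So the activation memory of a single pass scales with batch_size * iterations, which is why even per_device_eval_batch_size=2 can exhaust the GPU at 30 iterations.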

Progress bar stuff

I just tested the progress bar problem and it seems to work. 🤔

from datasets import load_dataset
from transformers import pipeline, TrainingArguments, DataCollatorWithPadding

from baal.transformers_trainer_wrapper import BaalTransformersTrainer

TEXT_COL = 'sentence'
ds = load_dataset('sst2')['test'].remove_columns('label')
pipe = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
tokenizer = pipe.tokenizer
model = pipe.model


def preprocess_function(examples):
    return tokenizer(examples[TEXT_COL], truncation=True)


tokenized_ds = ds.map(preprocess_function, batched=True)

training_args = TrainingArguments(
    output_dir='/tmp',
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = BaalTransformersTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
print("Total examples", len(tokenized_ds))
print(
    f"Dataloader length={len(trainer.get_eval_dataloader(tokenized_ds))}, "
    f"batch_size={training_args.per_device_eval_batch_size}"
)
trainer.predict_on_dataset(tokenized_ds, iterations=10)

@Dref360 (Member) commented Oct 21, 2023

I just opened #281 which should allow you to run your experiment.

If you can install Baal from source from this branch, you could update your code with:

trainer = BaalTransformersTrainer(
    model=model,
    replicate_in_memory=False,
    args=args,
)

and that should fix it.
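
For comparison, here is a minimal sketch (again illustrative, not Baal's implementation; the helper name is made up) of what disabling in-memory replication amounts to: the same batch is pushed through the model `iterations` times in a loop, so peak activation memory scales with batch_size alone rather than batch_size * iterations.

    import torch

    def looped_mc_dropout_step(model, batch, iterations: int):
        outputs = []
        with torch.no_grad():
            for _ in range(iterations):
                # Dropout stays active at inference time because the model was
                # patched with patch_module, so each pass draws a new sample.
                outputs.append(model(**batch).logits)
        # Stack into [batch_size, num_labels, iterations] to match the output layout
        # of predict_on_dataset.
        return torch.stack(outputs, dim=-1)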

In any case, I should be able to get the PR merged this week and will release a minor version along with it :)

@hugocool (Author) commented Oct 21, 2023

I'm sorry for any miscommunication; what I meant by manually setting the batch_size to 8 is the following:

    import numpy as np

    iterations = 30
    predictions = np.empty((0, model.num_labels, iterations))

    # Feed the data to predict_on_dataset in small chunks so that only a handful
    # of rows are stacked per MC-Dropout pass.
    for chunk in df_chunker(tokenized_X, batch_size=2):
        dataset = Dataset.from_pandas(chunk)
        # shape: (len(chunk), model.num_labels, iterations)
        _predictions = trainer.predict_on_dataset(dataset, iterations=iterations)
        predictions = np.concatenate((predictions, _predictions), axis=0)

where

from typing import Generator

import pandas as pd


def df_chunker(
    df: pd.DataFrame, batch_size: int = 1000
) -> Generator[pd.DataFrame, None, None]:
    """
    Splits a pandas DataFrame into smaller chunks of a specified batch size.

    Args:
        df (pandas.DataFrame): The DataFrame to be split.
        batch_size (int): The number of rows in each chunk.

    Yields:
        pandas.DataFrame: A chunk of the original DataFrame with the specified number of rows.
    """
    for i in range(0, len(df), batch_size):
        yield df.iloc[i : i + batch_size]

So the iterations are still 30 and my max batch size is 8, so the number of inputs being loaded into the model is 8 * 30.
I'm basically forcing the predict function to only take 8 inputs at a time.
However, when I don't force-chunk the batch size, it seems to predict in much larger batches, which causes memory overflows.
What is so weird about this bug is that it might not even be Baal related; it might just seem so because of the progress bar.

Anyway, I'll install Baal from #281 and see whether that removes the need for my forced chunking workaround.
Thanks!

@parmidaatg (Collaborator)

Hi @hugocool, wanted to check whether the above issue was resolved with the fix from #281?
