# fast_imagenet_ffcv

Fast ImageNet training code with FFCV

## Setting up

Installation: install PyTorch and FFCV (https://github.com/libffcv/ffcv).

Run via `python train_ffcv.py`.

## Notes

The datasets are already preprocessed (available in /self/scr-sync/nlp/imagenet_ffcv on each NLP cluster machine). Training is fast because the entire ImageNet dataset is cached in memory. Tested on 4 3090s, it takes 10 hours to train a ResNet-50 (~10 epochs/hour for 100 epochs); on 1 3090, it takes 36 hours; on 4 A100s, it takes 5 hours. Accuracy for a ResNet-50 on 4 GPUs with batch size 256 per GPU (default parameters) is 75.4%.
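If you are not on the NLP cluster and need to regenerate the `.ffcv` files yourself, FFCV's `DatasetWriter` is the standard way to do it. This is a generic sketch following the ffcv library's API, not the exact parameters used to produce the cluster copies; the paths and field settings are illustrative:

```python
import torchvision
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

# Plain torchvision ImageFolder over the raw ImageNet train split.
ds = torchvision.datasets.ImageFolder('/path/to/imagenet/train')

# Serialize (image, label) pairs into a single memory-mappable .ffcv file.
writer = DatasetWriter('/path/to/train.ffcv', {
    'image': RGBImageField(max_resolution=512, jpeg_quality=90),
    'label': IntField(),
}, num_workers=16)
writer.from_indexed_dataset(ds)
```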

## Slurm advice

Since the dataset itself is ~70 GB, I would request 100 GB of memory for the job to be safe. I would also request a number of CPUs equal to 2 * (num_workers + 1) * num_gpus, so 40 CPUs for 4 GPUs (worked out in the snippet below). The reasoning: each GPU requires num_workers dataloading threads plus one thread for the model, and you multiply by 2 because the dataloading transforms are JIT-compiled, so hyperthreading doesn't add much performance. On machines with only 32 Slurm CPUs, it is probably best to use 3 workers per GPU (but you should try for yourself). Otherwise, I found 4 workers per GPU with a total request of 40 CPUs to give fast speeds on the 3090 setup.
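The rule of thumb above, written out:

```python
def slurm_cpu_request(num_gpus: int, num_workers: int) -> int:
    # num_workers dataloading threads + 1 model thread per GPU,
    # doubled because the JIT-compiled transforms saturate physical
    # cores and hyperthreading doesn't help much.
    return 2 * (num_workers + 1) * num_gpus

assert slurm_cpu_request(num_gpus=4, num_workers=4) == 40
assert slurm_cpu_request(num_gpus=4, num_workers=3) == 32  # fits 32-CPU machines
```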

## How I wrote this

This is essentially just the standard PyTorch ImageNet training script (https://github.com/pytorch/examples/blob/main/imagenet/main.py) with a modified dataloader. The meat of the dataloader changes is here: https://github.com/tatsu-lab/fast_imagenet_ffcv/blob/main/train_ffcv.py#L123-L154.
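For orientation, here is a minimal sketch of what such an FFCV ImageNet loader looks like, following the patterns in FFCV's own ImageNet example. The actual pipelines in `train_ffcv.py` are at the link above and may differ in details (dtype, crop size, distributed flags); the dataset path below is a placeholder:

```python
import numpy as np
import torch
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                             ToDevice, ToTensor, ToTorchImage)

# Standard ImageNet statistics, scaled to the 0-255 pixel range FFCV uses.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255

device = torch.device('cuda:0')

# Decode, augment, and normalize images entirely inside the loader.
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    RandomHorizontalFlip(),
    ToTensor(),
    ToDevice(device, non_blocking=True),
    ToTorchImage(),
    NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16),
]
label_pipeline = [IntDecoder(), ToTensor(), Squeeze(),
                  ToDevice(device, non_blocking=True)]

loader = Loader('/path/to/train.ffcv',  # e.g. a file under /self/scr-sync/nlp/imagenet_ffcv
                batch_size=256,
                num_workers=4,
                order=OrderOption.RANDOM,
                os_cache=True,  # this is what keeps all of ImageNet in memory
                drop_last=True,
                pipelines={'image': image_pipeline, 'label': label_pipeline})

for images, labels in loader:
    ...  # standard training step, as in the PyTorch ImageNet example
```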
