I use Qwen2-72B as the teacher model and Qwen2.5-32B as the student model, training on 8×80GB A100 GPUs.
When I load the Qwen2-72B teacher during training, I find that it is not sharded across the GPUs; instead the complete 72B model is loaded on every GPU, resulting in OOM.
When I test model loading on its own, Qwen2-72B is split across multiple GPUs as expected, so I don't understand why this happens during training.
Have you tried larger model pairs in the MiniLLM experiments? The largest teacher model in the paper appears to be only 13B.