Does the chatgpt rlhf training example support shard_init + tp + lora? #3303
taishiciR asked this question in Community | Q&A
-
@taishiciR TP is currently unnecessary for most cases. You can use lora+zero+gemini for large models.
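For context, here is a minimal sketch of the kind of setup that reply suggests (LoRA with ZeRO/Gemini, tensor parallelism and shard init left off). It assumes the chatgpt example's ColossalAIStrategy and GPTActor interfaces; the module paths and parameter names (stage, placement_policy, shard_init, lora_rank) may differ in your version of the example.

```python
# Sketch only: LoRA + ZeRO stage 3 + Gemini, without TP or shard init,
# so the LoRA matrices keep the full hidden size. Names assume the chatgpt
# example's interfaces and may differ in your version.
import torch

from chatgpt.models.gpt import GPTActor
from chatgpt.trainer.strategies import ColossalAIStrategy

# ZeRO stage 3 with Gemini placement; shard_init is left off.
strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', shard_init=False)

with strategy.model_init_context():
    # lora_rank > 0 enables the low-rank adapters on the linear projections.
    actor = GPTActor(lora_rank=16).to(torch.cuda.current_device())
```

With shard_init off, the LoRA weights stay at the full hidden size, so the lora.py matmul quoted below does not hit the shape mismatch.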
-
When turning shard_init on in the RLHF step of the chatgpt example, I ran into a runtime error like "mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)" in the actor's generate function.
So I wonder:
Does [shard_init + tp] simply conflict with [lora] for now?
Or am I using it the wrong way?
│ /usr/local/python3.9.16/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py:184 in │
│ torch_function │
│ │
│ 181 │ │ │ │ return backward_tensor.backward(**tensor_kwargs) │
│ 182 │ │ │
│ 183 │ │ with torch._C.DisableTorchFunction(): │
│ ❱ 184 │ │ │ ret = func(*args, **kwargs) │
│ 185 │ │ │ if func in _get_my_nowrap_functions(): │
│ 186 │ │ │ │ return ret │
│ 187 │ │ │ else: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)
The error message comes from chatgpt/models/lora.py line 89:
"result = result + (self.lora_dropout(x) @ self.lora_A.t() @ self.lora_B.t()) * self.scaling"
when multiplying self.lora_dropout(x) and self.lora_A.t(), whose shapes are torch.Size([8, 96, 2048]) and torch.Size([512, 16]).
By the way, I observed that:
If shard init is turned on, the lora_A param is split by world_size, i.e. hidden_size / world_size:
decoder.layers.0.self_attn.k_proj.lora_A
torch.Size([16, 512])
If shard init is turned off, the lora_A param keeps the full hidden_size:
decoder.layers.0.self_attn.k_proj.lora_A
torch.Size([16, 2048])
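The mismatch can be reproduced in plain torch from the shapes above; the lora_B shape below is assumed for illustration (it is not shown in the error output):

```python
# Reproduces the reported failure outside the example: with shard init on,
# lora_A is [r, hidden_size / world_size] = [16, 512], so lora_A.t() is
# [512, 16] and cannot be multiplied with an input of hidden size 2048.
import torch

x = torch.randn(8, 96, 2048)      # self.lora_dropout(x): [batch, seq, hidden]
lora_A = torch.randn(16, 512)     # sharded: r x (hidden_size / world_size)
lora_B = torch.randn(2048, 16)    # assumed [out_features, r], for illustration

try:
    out = x @ lora_A.t() @ lora_B.t()   # same expression as lora.py line 89
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)

# With shard init off, lora_A keeps the full hidden size and the product works.
lora_A_full = torch.randn(16, 2048)
out = x @ lora_A_full.t() @ lora_B.t()  # -> torch.Size([8, 96, 2048])
```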
@ht-zhou