Does the chatgpt rlhf training example support shard_init + tp + lora? #3303
taishiciR asked this question in Community | Q&A
-
@taishiciR TP is currently unnecessary for most cases. You can use lora+zero+gemini for large models.
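For context, here is a minimal sketch of the kind of setup that reply suggests (LoRA with ZeRO/Gemini, tensor parallelism and shard init left off). It assumes the chatgpt example's ColossalAIStrategy and GPTActor interfaces; the module paths and parameter names (stage, placement_policy, shard_init, lora_rank) may differ in your version of the example.

```python
# Sketch only: LoRA + ZeRO stage 3 + Gemini, without TP or shard init,
# so the LoRA matrices keep the full hidden size. Names assume the chatgpt
# example's interfaces and may differ in your version.
import torch

from chatgpt.models.gpt import GPTActor
from chatgpt.trainer.strategies import ColossalAIStrategy

# ZeRO stage 3 with Gemini placement; shard_init is left off.
strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', shard_init=False)

with strategy.model_init_context():
    # lora_rank > 0 enables the low-rank adapters on the linear projections.
    actor = GPTActor(lora_rank=16).to(torch.cuda.current_device())
```

With shard_init off, the LoRA weights stay at the full hidden size, so the lora.py matmul quoted below does not hit the shape mismatch.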
-
When turning shard_init on in the RLHF step of the chatgpt example, I ran into a runtime error like "mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)" in the actor's generate function.
So I wonder:
Does [shard_init + tp] simply conflict with [lora] for now?
Or am I using it the wrong way?
│ /usr/local/python3.9.16/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py:184 in │
│ torch_function │
│ │
│ 181 │ │ │ │ return backward_tensor.backward(**tensor_kwargs) │
│ 182 │ │ │
│ 183 │ │ with torch._C.DisableTorchFunction(): │
│ ❱ 184 │ │ │ ret = func(*args, **kwargs) │
│ 185 │ │ │ if func in _get_my_nowrap_functions(): │
│ 186 │ │ │ │ return ret │
│ 187 │ │ │ else: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)
The error message comes from chatgpt/models/lora.py line 89:
"result = result + (self.lora_dropout(x) @ self.lora_A.t() @ self.lora_B.t()) * self.scaling"
when multiplying self.lora_dropout(x) and self.lora_A.t(), whose shapes are torch.Size([8, 96, 2048]) and torch.Size([512, 16]).
By the way, I observed that:
If shard init is turned on, the lora_A param is split by world_size, i.e. hidden_size / world_size:
decoder.layers.0.self_attn.k_proj.lora_A
torch.Size([16, 512])
If shard init is turned off, the lora_A param keeps the full hidden_size:
decoder.layers.0.self_attn.k_proj.lora_A
torch.Size([16, 2048])
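The mismatch can be reproduced in plain torch from the shapes above; the lora_B shape below is assumed for illustration (it is not shown in the error output):

```python
# Reproduces the reported failure outside the example: with shard init on,
# lora_A is [r, hidden_size / world_size] = [16, 512], so lora_A.t() is
# [512, 16] and cannot be multiplied with an input of hidden size 2048.
import torch

x = torch.randn(8, 96, 2048)      # self.lora_dropout(x): [batch, seq, hidden]
lora_A = torch.randn(16, 512)     # sharded: r x (hidden_size / world_size)
lora_B = torch.randn(2048, 16)    # assumed [out_features, r], for illustration

try:
    out = x @ lora_A.t() @ lora_B.t()   # same expression as lora.py line 89
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (768x2048 and 512x16)

# With shard init off, lora_A keeps the full hidden size and the product works.
lora_A_full = torch.randn(16, 2048)
out = x @ lora_A_full.t() @ lora_B.t()  # -> torch.Size([8, 96, 2048])
```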
@ht-zhou