You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
First off, big thank you for the huge amount of work that has gone into open sourcing the implementation of your research, it is highly appreciated!
While going through the repo and trying to deeply understand the method I discovered that there are some issues with the test script.
the test script does not appear to be running different attention methods and is only ever comparing against the default setting. My initial impression from the code was that by setting the 'attention_label' it would update the config and run the attention mechanism associate with that label (i.e standard, ring blockwise etc.) however after further inspection it seems like this no longer does anything and the method will always run based on what has been defined in the base config using the scan_attention, scan_mlp, scan_layers and mesh_dim arguments. In order to actually compare methods you have to update the config at each iteration.
it appears that it isn't possible to change the mesh_dims as this is defined once at the start of the testing and is used as a context manager for the whole test. So I think we can't change between ring and blockwise during the test.
It doesn't look like the grads being returned are a 'FrozenDict' , so the unfreeze at line 163 is not needed (I think its fine that its not frozen in this case).
After applying my naive updates to compare Standard with Ring I am now seeing a larger diff in the logits and grads then expected.
Is this similar to your own results or should the results be more aligned to Standard Attention as my understanding is that the Blockwise Ring Attention is numerically equivalent. Please could you confirm if my configs are correct for comparing these methods, there is a good chance I have made a mistake somewhere. For reference, I am running on a TPU v4-8, so I only have 4 devices.
Would like to confirm if you agree with these observations, or have I just done something silly when applying my changes? If these are in-fact issues that have crept in I am happy to submit a fix 😃
Cheers,
Donal
The text was updated successfully, but these errors were encountered:
Hi Hao,
First off, big thank you for the huge amount of work that has gone into open sourcing the implementation of your research, it is highly appreciated!
While going through the repo and trying to deeply understand the method I discovered that there are some issues with the test script.
it appears that it isn't possible to change the mesh_dims as this is defined once at the start of the testing and is used as a context manager for the whole test. So I think we can't change between ring and blockwise during the test.
It doesn't look like the grads being returned are a 'FrozenDict' , so the unfreeze at line 163 is not needed (I think its fine that its not frozen in this case).
After applying my naive updates to compare Standard with Ring I am now seeing a larger diff in the logits and grads then expected.
Is this similar to your own results or should the results be more aligned to Standard Attention as my understanding is that the Blockwise Ring Attention is numerically equivalent. Please could you confirm if my configs are correct for comparing these methods, there is a good chance I have made a mistake somewhere. For reference, I am running on a TPU v4-8, so I only have 4 devices.
Would like to confirm if you agree with these observations, or have I just done something silly when applying my changes? If these are in-fact issues that have crept in I am happy to submit a fix 😃
Cheers,
Donal
The text was updated successfully, but these errors were encountered: