Llama and Mistral model inputs change after using different computation #104
Comments
Hi @kahfizulkifli, thanks for raising the issue. At a high level, although the two computations are the same, the input format can be different, which leads to different outputs. So here, according to the comment in the code, our goal is to make …
Thanks for the response! I get that the input format changes if we use specific features: for example, with the BSH attention layout, the parameter.1 value is transposed relative to the original baseline. What I don't understand is why the input changes when we change the computation below, since that change doesn't affect the input parameters. Here are snippets of the parameter.1 value from the Llama-3 baseline, the Llama-3 correct BSH version, and the Llama-3 bug BSH version, respectively.
Snippet 1. Parameter.1 from baseline
Snippet 2. Parameter.1 from wrong BSH
Snippet 3. Parameter.1 from correct BSH
I get that snippet 3 is a transpose of snippet 1, since https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L60C9-L63C68 uses different configurations for hidden_sizes, and the values match. But I don't understand why the snippet 3 tensor values are different from the snippet 2 values. This is taken from the Llama-3 example, and the same thing is evident in the Mistral model as well. For the GPT2 model, snippet 2 and snippet 3 have the same values.
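For concreteness, here is a tiny numpy sketch (with made-up values; the real parameter is much larger) of the transpose relation I mean between snippet 1 and snippet 3:

```python
import numpy as np

# Made-up stand-in for parameter.1, only to illustrate the relation,
# not the actual values from the model.
baseline = np.arange(6, dtype=np.float32).reshape(2, 3)   # plays the role of snippet 1
correct_bsh = baseline.T                                   # plays the role of snippet 3

print(baseline.shape, correct_bsh.shape)     # (2, 3) (3, 2)
print(correct_bsh[2, 1] == baseline[1, 2])   # True: same values, swapped layout
```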
Do you mean that if you just apply the single commit, the parameters change? Or are other commits also applied?
From your snippets, only snippet 1 and snippet 3 are related. Snippet 2 looks totally different. I would suggest you print out the shapes if you want to understand more. Also, just to clarify, you are just curious about how the code works, and this is actually not a bug, right?
Yes, with just the single commit applied, the parameters change. I'm curious how this works, because the parameter.1 values don't change in the GPT2 model but do change in the Llama-3 and Mistral models. I don't know whether this is a bug or not, where specific model types behave differently even though the same change is applied.
Sorry, I just edited my previous comment on the parameter values. I used the wrong captions for snippet 2 and snippet 3; I have now corrected them.
Hi all, this is just a quick follow-up, and I guess maybe these additional snippets will help. So attached below are the parameter values from the GPT-2 model.
Snippet 1. Parameter.1 value from most recent commit or correct BSH computation
Snippet 2. Parameter.1 value from applying previous commit or wrong BSH computation
Using different versions of the commit doesn't change the parameter values for the GPT-2 model, but the parameter values do change for the Llama-3 and Mistral models. I'm wondering whether this is expected behavior or not. Thanks!
Description
This issue is about commit 69d039d, where a reshape(x, (n_seqs, n_active_tokens, hidden_size)) is converted into a transpose(reshape(x, (n_seqs, n_active_tokens, hidden_size)), 0, 1). I ran inference with the BSH attention layout on the GPT2, Llama-3, and Mistral-7b models with these two types of computations.
Looking at the generated sequences, Llama-3 and Mistral-7b both generate different sequences when using these two different computations, but this is not the case for GPT2.
So I tested whether these two computations are equivalent in numpy, using the generated HLO graph as a reference for the numpy test.
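To make the check concrete, here is a minimal numpy sketch of this kind of equivalence check, with made-up shapes (I assume n_active_tokens = 1, i.e. the token-generation case):

```python
import numpy as np

# Made-up sizes; n_active_tokens = 1 corresponds to token generation.
n_seqs, n_active_tokens, hidden_size = 4, 1, 8
x = np.random.rand(n_seqs * n_active_tokens, hidden_size).astype(np.float32)

# Old computation: reshape only -> (n_seqs, n_active_tokens, hidden_size)
old = x.reshape(n_seqs, n_active_tokens, hidden_size)

# New computation from commit 69d039d: reshape, then swap axes 0 and 1
# -> (n_active_tokens, n_seqs, hidden_size)
new = np.swapaxes(x.reshape(n_seqs, n_active_tokens, hidden_size), 0, 1)

print(old.shape, new.shape)                               # (4, 1, 8) (1, 4, 8)
# With a single active token, both results contain the same values;
# only the two leading axes are swapped.
print(np.array_equal(old.reshape(-1), new.reshape(-1)))   # True
```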
Based on this simple numpy test, the tensors are equivalent, but I'm still curious why the generated sequence differs in the Llama-3 and Mistral-7b models but not in the GPT2 model. So I decided to print out the tensor inputs generated by the inputs function in layers/transformer.py at https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L19.
The interesting part is this: the inputs generated for the Llama-3 and Mistral models are different when we use a different computation, but the inputs generated for the GPT2 model are the same even with different computations. It seems that even though the change is introduced in later parts of the computation, the inputs to the model end up different.
My question is this: how is it that the operator sequence transpose(reshape(x, (n_seqs, n_active_tokens, hidden_size)), 0, 1) makes the generated sequence of the models equal to running the model with a tp degree of 1 (the baseline), while the reshape(x, (n_seqs, n_active_tokens, hidden_size)) operation causes the generated sequence to differ from the baseline? Is it because the Llama-3 and Mistral models generate 2 HLO graphs, one for token generation and one for context decoding, which causes the inputs of the models to change?
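For context, here is the same numpy sketch with more than one active token (roughly the context-encoding case); with made-up shapes where both leading dimensions are greater than 1, the two operator sequences no longer produce the same memory layout, which is part of why I'm asking about the separate context-decoding graph:

```python
import numpy as np

# Made-up sizes with more than one active token (roughly the context-encoding case).
n_seqs, n_active_tokens, hidden_size = 2, 3, 4
x = np.arange(n_seqs * n_active_tokens * hidden_size, dtype=np.float32)
x = x.reshape(n_seqs * n_active_tokens, hidden_size)

old = x.reshape(n_seqs, n_active_tokens, hidden_size)     # reshape only
new = np.swapaxes(old, 0, 1)                              # reshape + transpose(0, 1)

print(old.shape, new.shape)                               # (2, 3, 4) (3, 2, 4)
# Unlike the single-token case, the flattened values are now in a different order.
print(np.array_equal(old.reshape(-1), new.reshape(-1)))   # False
```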
Code changes
This is introduced in https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/attention.py#L774
If you want to run the correct/buggy version, you can just comment/uncomment the corresponding computation.
To get the input values, add a global_debugger at this line (https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L89).
Version
I use transformers-neuronx version 0.12.313, installed from pip.
Environment
I use the trn1.32xlarge instance type
Files
These are the scripts I used to run the model inference. I zipped them into one folder.
TNX.zip
To Reproduce
The --debug flag is for printing out the model parameters.
If you run the gpt driver, the parameter values for the old and current versions of the BSH computation are the same, but for the llama and mistral models they are different.
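As a convenience, here is a hypothetical helper for comparing two dumped parameter.1 tensors offline. It assumes you save the printed tensors as .npy files; the scripts in TNX.zip do not do this by themselves (the --debug flag only prints them), so the saving step and the file names below are assumptions:

```python
import numpy as np

def compare_params(path_a, path_b):
    """Compare two saved parameter.1 dumps (e.g. old-BSH run vs current-BSH run)."""
    a, b = np.load(path_a), np.load(path_b)
    print("shapes:", a.shape, b.shape)
    if a.shape == b.shape:
        print("max abs diff:", float(np.max(np.abs(a - b))))
    else:
        print("flattened values equal:",
              np.array_equal(a.reshape(-1), b.reshape(-1)))

# Hypothetical file names:
# compare_params("gpt2_param1_old.npy", "gpt2_param1_new.npy")      # expected: identical
# compare_params("llama3_param1_old.npy", "llama3_param1_new.npy")  # expected: different
```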