
Llama and Mistral model inputs change after using different computation #104

kahfizulkifli opened this issue Dec 20, 2024 · 7 comments


@kahfizulkifli

Description

This issue is about commit 69d039d, where a reshape(x, (n_seqs, n_active_tokens, hidden_size)) was converted into transpose(reshape(x, (n_active_tokens, n_seqs, hidden_size)), 0, 1). I ran inference with the BSH attention layout on the GPT2, Llama-3 and Mistral-7b models using both computations.

Looking at the generated sequences, Llama-3 and Mistral-7b both generate different sequences when using these two different computations, but this is not the case for GPT2.

So I tested in numpy whether these two computations are equivalent, using the generated HLO graphs as a reference.

# BSH current version
%reshape.168 = f32[1,4,768]{2,1,0} reshape(f32[4,768]{1,0} %add.167)
%transpose.169 = f32[4,1,768]{2,1,0} transpose(f32[1,4,768]{2,1,0} %reshape.168), dimensions={1,0,2}

# BSH old version
%reshape.168 = f32[4,1,768]{2,1,0} reshape(f32[4,768]{1,0} %add.167)

# numpy equivalence check
>>> import numpy as np
>>> x = np.random.rand(4,768)
>>> np.allclose(x.reshape(4,1,768), np.transpose(x.reshape(1,4,768), (1,0,2)), 1e-05, 1e-08)
True

Based on this simple numpy test the tensors are equivalent, but I'm still curious why the generated sequences are different for the Llama-3 and Mistral-7b models but not for the GPT2 model. So I decided to print out the tensor inputs generated by the inputs function in layers/transformer.py at https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L19.

The interesting part is this: the inputs generated for the Llama-3 and Mistral models differ between the two computations, but the inputs generated for the GPT2 model are the same under both. It seems that even though the change is introduced in a later part of the computation, the inputs to the model become different.

My question is this: how is it that the operator sequence transpose(reshape(x, (n_active_tokens, n_seqs, hidden_size)), 0, 1) makes the generated sequence of the models match a run with tp degree 1 (the baseline), while the reshape(x, (n_seqs, n_active_tokens, hidden_size)) operation causes the generated sequence to differ from the baseline? Is it because the Llama-3 and Mistral models generate two HLO graphs, one for token generation and one for context decoding, which causes the inputs of the models to change?

Code changes

The change is introduced at https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/attention.py#L774:

    if bsh_output or bsh_collective:
        # (s * b, h) => (b, s, h)
        
        # inject bug 
        result = hlo.reshape(result, (n_seqs, n_active_tokens, hidden_size))

        # correct one
        # result = hlo.reshape(result, (n_active_tokens, n_seqs, hidden_size))
        # result = hlo.transpose(result, 0, 1)
    else:
        # (s * b, h) => (h, s, b)
        result = hlo.transpose(result, 0, 1)
        result = hlo.reshape(result, hidden_sizes)

To switch between the correct and the buggy computation, comment/uncomment the corresponding lines.

To get the input values, add a global_debugger tap at this line (https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L89):

for tensor in (hidden, cache_ids, start_ids, last_token_id):
    global_debugger.tap(vars(tensor)['instruction'].name, tensor)

Version

I use transformers-neuronx version 0.12.313, installed from pip.

Environment

I use the trn1.32xlarge instance type.

Files

These are the scripts I used to run the model inference. I zipped them into one folder.
TNX.zip

To Reproduce

  1. Download the pretrained models of GPT2, Mistral-7b and Llama-3.1.
  2. Change the config of each model so that it uses 1 layer to speed up inference (this is the n_layer field for GPT2 and the num_hidden_layers field for the Mistral and Llama models; a sketch of this step is included at the end of this comment).
  3. Run inference, printing the input parameters, with both the old and the current BSH computation, using these commands.
python gpt_driver.py run <model_path> --tp_degree=1 --attention_layout="BSH" --debug > <log_file> 
python llama_driver.py run <model_path> --tp_degree=1 --attention_layout="BSH" --debug > <log_file> 
python mistral_driver.py run <model_path> --tp_degree=1 --attention_layout="BSH" --debug > <log_file> 

The --debug flag is for printing out the model parameters.

If you run the GPT driver, the parameter values for the old and the current version of the BSH computation are the same, but for the Llama and Mistral models they are different.
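For step 2 above, here is a minimal sketch of how the layer count could be trimmed, assuming the Hugging Face transformers API is used; the model path and output directory are placeholders, not taken from the attached scripts.

# Hypothetical helper for step 2: trim a pretrained checkpoint to a single layer
# so compilation and inference are quick to iterate on.
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/path/to/llama-3-checkpoint"        # placeholder checkpoint path
config = AutoConfig.from_pretrained(model_path)
config.num_hidden_layers = 1                      # use config.n_layer = 1 for GPT2
model = AutoModelForCausalLM.from_pretrained(model_path, config=config)
model.save_pretrained("/path/to/llama-3-1layer")  # placeholder output directory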

@seanlatias

Hi @kahfizulkifli, thanks for raising the issue. At a high level, although the two computations are the same, the input format can be different, which leads to the different outputs. Here, according to the comment in the code, the goal is to turn (s*b, h) into (b, s, h). If you simply apply a reshape to an input laid out as (s*b, h), the data order will be incorrect; the reshape is only correct if the input is laid out as (b*s, h). This also explains why you see different input tensors: they have a different layout from the beginning. As for why the input format was changed, we are always pursuing the best performance, and layout is one of the key factors. Please let me know if this answers your question.
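A minimal numpy sketch of this data-order point (sizes chosen purely for illustration; in the HLO snippet from the issue one of the leading dimensions is 1, which is why the allclose check there happened to pass):

import numpy as np

# Illustrative sizes: s = n_active_tokens, b = n_seqs, h = hidden_size.
s, b, h = 3, 2, 4

# Attention output laid out as (s * b, h): row k holds token k // b of sequence k % b.
x = np.arange(s * b * h, dtype=np.float32).reshape(s * b, h)

# Correct path: reshape to (s, b, h), then swap the first two axes -> (b, s, h).
correct = x.reshape(s, b, h).transpose(1, 0, 2)

# Plain reshape to (b, s, h): same shape, but the rows land in the wrong order
# unless s == 1 or b == 1.
buggy = x.reshape(b, s, h)

print(np.array_equal(correct, buggy))   # False here; True only when s == 1 or b == 1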

@kahfizulkifli
Author

kahfizulkifli commented Dec 20, 2024

Thanks for the response!

I get that the input format changes if we use specific features; for example, with the BSH attention layout, the parameter.1 value will be transposed relative to the original baseline.

What I don't understand is why the input changes when we change the computation below, which should not affect the input parameters.

Here are snippets of the parameter.1 value from the Llama-3 baseline, the buggy BSH version, and the correct BSH version, respectively.

parameter.1
tensor([[ 0.0101,  0.0002, -0.0024, -0.0056]])
tensor([[ 0.0070, -0.0014,  0.0059,  0.0190]])
tensor([[-0.0012, -0.0019, -0.0131, -0.0009]])
tensor([[ 0.0137,  0.0016,  0.0035, -0.0081]])
tensor([[-0.0168, -0.0131, -0.0037,  0.0045]])
tensor([[-9.2697e-04, -8.1787e-03,  7.9870e-06, -1.2024e-02]])

Snippet 1. Parameter.1 from baseline

parameter.1
tensor([[-4.4861e-03,  3.9368e-03, -2.1057e-03, -6.5613e-03,  1.3367e-02,
         -5.8365e-04, -1.1108e-02,  8.0566e-03, -1.0986e-02,  7.4158e-03,
          5.7983e-03, -1.6602e-02,  5.9509e-03,  2.1240e-02,  1.2283e-03,
         -1.1597e-03, -1.5625e-02,  1.7944e-02, -1.8555e-02, -1.9379e-03,
          1.4526e-02,  1.4832e-02, -6.8970e-03, -1.7090e-03,  6.6528e-03,

Snippet 2. Parameter.1 from wrong BSH

parameter.1
tensor([[ 1.0132e-02,  7.0496e-03, -1.1978e-03,  1.3672e-02, -1.6846e-02,
         -9.2697e-04,  2.2736e-03, -5.2795e-03, -9.6436e-03, -1.9684e-03,
          9.4604e-03,  5.3711e-03,  1.1292e-02, -1.2512e-02,  1.2695e-02,
          2.7924e-03, -1.4305e-05,  5.7068e-03, -6.4087e-03, -1.8433e-02,
         -6.4087e-03,  7.8125e-03, -8.1787e-03, -2.4170e-02,  7.7820e-03,

Snippet 3. Parameter.1 from correct BSH

I get that snippet 3 is a transpose of snippet 1, since https://github.com/aws-neuron/transformers-neuronx/blob/main/src/transformers_neuronx/layers/transformer.py#L60C9-L63C68 uses different configurations for hidden_sizes, and the values match. But I don't understand why the snippet 3 tensor values are different from the snippet 2 values. This is taken from the Llama-3 example, and the same is evident in the Mistral model. For the GPT2 model, snippet 2 and snippet 3 have the same values.

@seanlatias

Do you mean that the parameters change if you apply just this single commit? Or are other commits also applied?

@seanlatias

From your snippets, only snippets 1 and 3 are related; snippet 2 looks totally different. I would suggest printing out the shapes if you want to understand more. Also, just to clarify: you are simply curious about how the code works, and this is not actually a bug, right?

@kahfizulkifli
Author

Yes, the parameters change with just the single commit applied.

I'm curious how this works, because the parameter.1 values don't change for the GPT2 model but do change for the Llama-3 and Mistral models. I don't know whether this is a bug, since specific model types behave differently even though the same change is applied.

@kahfizulkifli
Author

Sorry, I just edited my previous comment on the parameter values. I had used the wrong captions for snippet 2 and snippet 3; they are corrected now.

@kahfizulkifli
Author

Hi all, just a quick follow-up; maybe these additional snippets will help.

Attached below are the parameter values from the GPT-2 model.

tensor([[-1.5737e-01, -1.7294e-01,  3.1421e-02, -7.1631e-02,  1.5819e-01,
         -1.2367e-01, -3.6575e-01, -1.0455e-01, -1.6234e-01,  1.5181e-01,
         -1.4173e-01, -3.8613e-01, -1.0296e-01, -2.5501e-02, -9.7907e-02,
         -6.2348e-02,  8.0500e-02, -8.9016e-02,  1.8280e-01, -7.2021e-02,
         -4.6476e-02,  1.4756e-01,  7.5920e-02,  6.7053e-02,  3.8734e-01,

Snippet 1. Parameter.1 value from most recent commit or correct BSH computation

tensor([[-1.5737e-01, -1.7294e-01,  3.1421e-02, -7.1631e-02,  1.5819e-01,
         -1.2367e-01, -3.6575e-01, -1.0455e-01, -1.6234e-01,  1.5181e-01,
         -1.4173e-01, -3.8613e-01, -1.0296e-01, -2.5501e-02, -9.7907e-02,
         -6.2348e-02,  8.0500e-02, -8.9016e-02,  1.8280e-01, -7.2021e-02,
         -4.6476e-02,  1.4756e-01,  7.5920e-02,  6.7053e-02,  3.8734e-01,

Snippet 2. Parameter.1 value from applying previous commit or wrong BSH computation

Using different versions of the commit doesn't change the parameter values for the GPT-2 model, but the parameter values change for the Llama-3 and Mistral models. I'm wondering whether this is expected behavior. Thanks!
