Internal Tensorizer Error #1052

Open
ewang675 opened this issue Dec 5, 2024 · 13 comments

ewang675 commented Dec 5, 2024

We are getting an Internal Tensorizer Error. The error logs are copied below.

[TEN404] Internal tensorizer error: TensorInitialization:Expect NeuronReduceMacro! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.

Traceback (most recent call last):
  File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 219, in <module>
    test_result = test_correctness_conv2d_kernel(
  File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 85, in test_correctness_conv2d_kernel
    out = kernel(*args, **kwargs)
  File "neuronxcc/nki/compile.py", line 92, in neuronxcc.nki.compile.GenericKernel.__call__
  File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 174, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.Kernel.__call__
  File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 422, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
  File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 425, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
  File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 508, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel._compile
RuntimeError: Compilation failed for fused_conv2d_maxpool with error Command '['neuronx-cc', 'compile', '--framework', 'XLA', 'penguin.py', '--internal-tensorizer-opt-level=nki', '--pipeline', 'compile', 'SaveTemps', '--target', 'trn1', '--disable-internal-io-dge', '--output=file.neff']' returned non-zero exit status 70.

The source code is attached. The issue seems to be on line 173 of conv2d.py, in the call to nl.max(...).

The code can be run using the command:
python3 test_harness.py --test_maxpool

Please use the --test_maxpool flag; without it, the relevant segment of code will not run and the error will not be produced.

part2.zip

AWSNB commented Dec 5, 2024

@ewang675 a couple of quick things to try:

Change out_tile += broadcasted_bias to out_tile = out_tile + broadcasted_bias.
(We likely have a bug in += when it is used as an in-place update on sbuf or outside an affine loop.)

In general, change all a += b to a = a + b for now.
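
For example, something along these lines (shapes here are illustrative, and out_tile / broadcasted_bias are stand-ins for the tensors in your kernel):

import neuronxcc.nki.language as nl

P, F = 128, 512  # illustrative tile shape
out_tile = nl.zeros((P, F), nl.float32, buffer=nl.sbuf)
broadcasted_bias = nl.zeros((P, F), nl.float32, buffer=nl.sbuf)

# instead of the in-place update: out_tile += broadcasted_bias
out_tile = out_tile + broadcasted_bias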

ewang675 commented Dec 5, 2024

Hello, thank you! When I make this change, the first 'a' in 'a = a + b' seems to get interpreted as a new variable. For example:

res_psum = nl.zeros((TILE_C_OUT, TILE_H * out_width), nl.float32, buffer=nl.psum)
for k in nl.affine_range(n_tiles_c_in):
    # ... other code ...
    res_psum = res_psum + nl.matmul(weight_[k, :, :, i, j], rhs_tile, transpose_x=True)

conv_result += nl.copy(res_psum, dtype=X_out.dtype)

I get the error:

conv_result += nl.copy(res_psum, dtype=X_out.dtype)
SyntaxError: local variable 'res_psum' is referenced outside of its parent scope (loop k at f/home/ubuntu/asst4-trainium/part2/conv2d.py:149)!

Similar errors occur when replacing 'a += b' with 'a = a + b' in other places.

AWSNB commented Dec 5, 2024

@aws-zhehongb ^^

AWSNB commented Dec 5, 2024

@ewang675 I asked our experts to look at it, but my initial suspicion is that += nl.copy() is the issue.

Could you try:

# copy to sbuf
temp_result = nl.copy(res_psum, dtype=X_out.dtype)
# add temp to accumulated results
conv_result = nl.add(conv_result, temp_result)
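
For reference, a self-contained sketch of those two steps (shapes are illustrative, and nl.float32 stands in for X_out.dtype; in the real kernel res_psum comes from the nl.matmul accumulation):

import neuronxcc.nki.language as nl

P, F = 128, 512
conv_result = nl.zeros((P, F), nl.float32, buffer=nl.sbuf)
res_psum = nl.zeros((P, F), nl.float32, buffer=nl.psum)

temp_result = nl.copy(res_psum, dtype=nl.float32)  # copy the PSUM tile into SBUF
conv_result = nl.add(conv_result, temp_result)     # explicit add instead of +=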

ewang675 commented Dec 5, 2024

Hello, thank you for taking a look! We actually managed to get things working with your initial suggestion: out_tile = out_tile + bias still worked even though the first out_tile was treated as a distinct object of some kind (since this wasn't in an inner loop). However, for the specific case named above, changing the code to:

res_psum = nl.zeros((TILE_C_OUT, TILE_H * out_width), nl.float32, buffer=nl.psum)
for k in nl.affine_range(n_tiles_c_in):
    # ... other code ...
    res_psum = res_psum + nl.matmul(weight_[k, :, :, i, j], rhs_tile, transpose_x=True)

temp_result = nl.copy(res_psum, dtype=X_out.dtype)
conv_result = nl.add(conv_result, temp_result)

results in a different error:

SyntaxError: local variable 'conv_result' is referenced outside of its parent scope (loop j at f/home/ubuntu/asst4-trainium/part2/conv2d.py:145)!

(For context, conv_result is declared outside the loop that this whole block sits in, and is referenced after that loop.)

Really appreciate your help above, though!

AWSNB commented Dec 5, 2024

@aws-zhehongb

@aws-serina-tan

To assign to the originally declared tensor, we would need conv_result[...] = nl.add(conv_result, res_psum). Can you give that a try?

AWSNB commented Dec 5, 2024

@ewang675 the following info explains many of the issues you have seen so far, and matches your understanding of the error:

a = nl.copy(...) creates a new variable a.
If you want to assign to a pre-declared variable a that was created with nl.ndarray, nl.zeros, etc., you should use a[...] = nl.copy(...).

There are a few places like this in your code.
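
As a quick illustration of the distinction (shapes are illustrative, and the snippet is assumed to sit inside an NKI kernel body):

import neuronxcc.nki.language as nl

acc = nl.zeros((128, 512), nl.float32, buffer=nl.sbuf)   # pre-declared result tensor
tile = nl.zeros((128, 512), nl.float32, buffer=nl.sbuf)  # some tile to copy from

new_acc = nl.copy(tile)   # rebinding: creates a brand-new tile; acc is untouched
acc[...] = nl.copy(tile)  # indexed assignment: writes into the tensor declared
                          # above, so the value survives loop scopes and later reads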

AWSNB commented Dec 5, 2024

@ewang675 did it work?

ewang675 commented Dec 5, 2024

I'm getting a different error now:

conv_result[...] = conv_result + nl.copy(res_psum, dtype=X_out.dtype)

SyntaxError: Unexpected output dependencies, missing indices in the dst access: j, i

This error doesn't happen with:

conv_result += nl.copy(res_psum, dtype=X_out.dtype)

@aws-zhehongb

Could you change the i and j loops to nl.sequential_range?
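
A minimal sketch of that change (the loop bounds of 3 are illustrative stand-ins for the actual i and j ranges in conv2d.py); nl.sequential_range runs iterations in order and allows loop-carried dependencies, whereas nl.affine_range assumes the iterations are independent:

import neuronxcc.nki.language as nl

# before: for i in nl.affine_range(3): ...
for i in nl.sequential_range(3):
    for j in nl.sequential_range(3):
        pass  # body that reads and updates conv_result across iterations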

@aws-serina-tan

Is the new error related to #1055?

@aws-zhehongb

A generic reduction like

conv_result[...] = conv_result + nl.copy(res_psum, dtype=X_out.dtype)

is treated as a loop-carried dependency. If you see

SyntaxError: Unexpected output dependencies, missing indices in the dst access: j, i

check this out: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.errors.html#err-unexpected-output-dependencies
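
Putting the two suggestions in this thread together, here is a minimal, self-contained sketch: nl.sequential_range for the loops whose body reads and writes the same pre-declared tensor, and indexed assignment to write the result back. The shapes, loop bounds, and the update tile are illustrative stand-ins, not the code from part2.zip.

import neuronxcc.nki.language as nl

P, F = 128, 512                                        # illustrative tile shape
acc = nl.zeros((P, F), nl.float32, buffer=nl.sbuf)     # pre-declared result tensor
update = nl.zeros((P, F), nl.float32, buffer=nl.sbuf)  # stand-in for the per-iteration term

for i in nl.sequential_range(3):      # sequential: iteration (i, j) reads the acc
    for j in nl.sequential_range(3):  # value written by the previous iteration
        acc[...] = nl.add(acc, update)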
