Internal Tensorizer Error #1052
Comments
@ewang675 A couple of quick things to try: change out_tile += broadcasted_bias to out_tile = out_tile + broadcasted_bias. In general, change every a += b to a = a + b for now.
Hello, thank you! When I make this change, I think the first 'a' in 'a = a + b' gets interpreted as a new variable. For example:
res_psum = nl.zeros((TILE_C_OUT, TILE_H * out_width), nl.float32, buffer=nl.psum)
conv_result += nl.copy(res_psum, dtype=X_out.dtype)
I get the error:
Similar errors occur when replacing 'a += b' with 'a = a + b' in other places.
@ewang675 I asked our experts to look at it. Could you try:
Hello, thank you for taking a look! We actually managed to get things working with your initial suggestion: out_tile = out_tile + bias still worked even if the first out_tile was seen as a distinct object of some kind (since this wasn't in an inner loop). However, for the specific case named above, changing the code to:
res_psum = nl.zeros((TILE_C_OUT, TILE_H * out_width), nl.float32, buffer=nl.psum)
for k in nl.affine_range(n_tiles_c_in):
temp_result = nl.copy(res_psum, dtype=X_out.dtype)
conv_result = nl.add(conv_result, temp_result)
results in a different error:
(For context, conv_result is declared outside the loop that this whole block sits in, and is referenced after that loop.)
Really appreciate your help above, though!
To assign to the originally declared tensor, we would need
@ewang675 The following info explains many of the issues you have seen so far, and matches your understanding of the error: a = nl.copy will create a new variable a. There are a few places like this in your code.
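The distinction between binding a new variable and writing into the originally declared one can be sketched in plain Python. This is only an analogy using ordinary lists, not NKI tensors, and the helper names `rebind` and `write_in_place` are hypothetical; it assumes nothing about how the NKI tracer treats assignment beyond what is said above.

```python
# Plain-Python analogy (lists, not NKI tensors): rebinding a name vs.
# writing into the object that was declared earlier.

def rebind(buf, delta):
    # `buf = ...` binds the local name to a brand-new list.
    # The caller's original object is left untouched.
    buf = [x + d for x, d in zip(buf, delta)]
    return buf

def write_in_place(buf, delta):
    # Slice assignment stores into the existing list object,
    # so every reference to that object sees the update.
    buf[:] = [x + d for x, d in zip(buf, delta)]
    return buf

original = [1.0, 2.0, 3.0]
bias = [10.0, 10.0, 10.0]

rebound = rebind(original, bias)
snapshot_after_rebind = list(original)  # copy of the still-unchanged original

written = write_in_place(original, bias)
```

After `rebind`, the sum lives in a separate object and `original` is unchanged; after `write_in_place`, `original` itself holds the sum.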
@ewang675 Did it work?
I'm getting a different error now:
SyntaxError: Unexpected output dependencies, missing indices in the dst access: j, i
This error doesn't happen with:
Could you change the i and j loops to sequential_range?
New error related to #1055?
A generic reduction like
is considered a loop-carried dependency, if we see
check this out: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.errors.html#err-unexpected-output-dependencies
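As a rough illustration of what "loop-carried dependency" means in general, here is a minimal plain-Python sketch. It is not NKI code and says nothing about how the compiler analyzes loops; the functions `independent` and `carried` are hypothetical. It shows that a loop whose iterations are independent can run in any order, while a loop in which one iteration reads what an earlier one wrote must run in sequence, which is the kind of loop the thread suggests moving to sequential_range.

```python
def independent(xs, order):
    # Each iteration writes only out[i] from xs[i]. No iteration reads
    # another iteration's result, so any visiting order is equivalent.
    out = [0] * len(xs)
    for i in order:
        out[i] = xs[i] * 2
    return out

def carried(xs, order):
    # Running sum: iteration i reads out[i - 1], written by iteration i - 1.
    # That read-after-write across iterations is a loop-carried dependency.
    out = [0] * len(xs)
    for i in order:
        prev = out[i - 1] if i > 0 else 0
        out[i] = prev + xs[i]
    return out

xs = [1, 2, 3, 4]
forward = list(range(len(xs)))
backward = forward[::-1]

same_when_reordered = independent(xs, forward) == independent(xs, backward)
differs_when_reordered = carried(xs, forward) != carried(xs, backward)
```

Reordering the iterations leaves the independent loop's output unchanged but changes the running sum, because the reversed order reads slots that have not been written yet.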
We are getting an Internal Tensorizer Error. The error logs are copied below.
[TEN404] Internal tensorizer error: TensorInitialization:Expect NeuronReduceMacro! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
Traceback (most recent call last):
File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 219, in <module>
test_result = test_correctness_conv2d_kernel(
File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 85, in test_correctness_conv2d_kernel
out = kernel(*args, **kwargs)
File "neuronxcc/nki/compile.py", line 92, in neuronxcc.nki.compile.GenericKernel.__call__
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 174, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.Kernel.__call__
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 422, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 425, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 508, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel._compile
RuntimeError: Compilation failed for fused_conv2d_maxpool with error Command '['neuronx-cc', 'compile', '--framework', 'XLA', 'penguin.py', '--internal-tensorizer-opt-level=nki', '--pipeline', 'compile', 'SaveTemps', '--target', 'trn1', '--disable-internal-io-dge', '--output=file.neff']' returned non-zero exit status 70.
The source code is attached. The issue seems to be in the conv2d.py file on line 173 on the call to nl.max(...).
The code can be run using the command:
python3 test_harness.py --test_maxpool
Please use the --test_maxpool flag; without it, the relevant segment of code will not run and the error will not be produced.
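The error message mentions the XLA_IR_DEBUG and XLA_HLO_DEBUG environment variables. A minimal sketch of rerunning the same command with both set might look like the following. The value "1" is an assumption, since the message does not say what value to use, and the helper names `build_debug_run` and `run_debug` are hypothetical.

```python
import os
import subprocess

def build_debug_run(base_env=None):
    # Copy the environment so the current process is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: "1" enables the extra debug info; the error message
    # names the variables but not the value to set.
    env["XLA_IR_DEBUG"] = "1"
    env["XLA_HLO_DEBUG"] = "1"
    # Same command as above, including the required --test_maxpool flag.
    cmd = ["python3", "test_harness.py", "--test_maxpool"]
    return cmd, env

def run_debug():
    # Runs the harness with the debug variables set; the failing
    # exit status is returned to the caller rather than raised.
    cmd, env = build_debug_run()
    return subprocess.run(cmd, env=env, check=False)
```

Calling `run_debug()` from the directory containing test_harness.py would reproduce the failure with the extra variables present.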
part2.zip