Compile Op-By-Op changes SHLO decompositions of some ops #105

LPanosTT · 2024-12-09T15:14:51Z

nn.MaxPool2d gets lowered to %result0, %result1 = torch.aten.max_pool2d_with_indices. %result0 is the actual output of the maxpool, and %result1 holds the indices of the selected maxima. If %result1 is not used (and it usually isn't), then compiling the full graph converts this op to:

%cst = stablehlo.constant dense<0xFF80> : tensor<bf16>
%0 = "stablehlo.reduce_window"(%arg0, %cst) <{padding = dense<0> : tensor<4x2xi64>, window_dilations = array<i64: 1, 1, 1, 1>, window_dimensions = array<i64: 1, 1, 2, 2>, window_strides = array<i64: 1, 1, 2, 2>}> ({
^bb0(%arg1: tensor<bf16>, %arg2: tensor<bf16>):
  %1 = stablehlo.maximum %arg1, %arg2 : tensor<bf16>
  stablehlo.return %1 : tensor<bf16>
}) : (tensor<1x1x224x224xbf16>, tensor<bf16>) -> tensor<1x1x112x112xbf16>

This form can be lowered to TTIR. Note that it has only one result.

However, when compiling op-by-op, we construct the single-op graph to return all results of whatever torch op we're looking at. This causes the maxpool to be lowered to:

%cst = stablehlo.constant dense<0xFF80> : tensor<bf16>
%c = stablehlo.constant dense<0> : tensor<i64>
%0 = stablehlo.iota dim = 2 : tensor<1x1x50176xi64>
%1 = stablehlo.reshape %0 : (tensor<1x1x50176xi64>) -> tensor<1x1x224x224xi64>
%2:2 = "stablehlo.reduce_window"(%arg0, %1, %cst, %c) <{padding = dense<0> : tensor<4x2xi64>, window_dilations = array<i64: 1, 1, 1, 1>, window_dimensions = array<i64: 1, 1, 2, 2>, window_strides = array<i64: 1, 1, 2, 2>}> ({
^bb0(%arg1: tensor<bf16>, %arg2: tensor<i64>, %arg3: tensor<bf16>, %arg4: tensor<i64>):
  %3 = stablehlo.compare  GE, %arg1, %arg3,  FLOAT : (tensor<bf16>, tensor<bf16>) -> tensor<i1>
  %4 = stablehlo.select %3, %arg1, %arg3 : tensor<i1>, tensor<bf16>
  %5 = stablehlo.compare  EQ, %arg1, %arg3,  FLOAT : (tensor<bf16>, tensor<bf16>) -> tensor<i1>
  %6 = stablehlo.minimum %arg2, %arg4 : tensor<i64>
  %7 = stablehlo.select %3, %arg2, %arg4 : tensor<i1>, tensor<i64>
  %8 = stablehlo.select %5, %6, %7 : tensor<i1>, tensor<i64>
  stablehlo.return %4, %8 : tensor<bf16>, tensor<i64>
}) : (tensor<1x1x224x224xbf16>, tensor<1x1x224x224xi64>, tensor<bf16>, tensor<i64>) -> (tensor<1x1x112x112xbf16>, tensor<1x1x112x112xi64>)

This cannot be lowered to TTIR.
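
For anyone reading the two-result form, here is a rough pure-Python model of what its reducer body computes: it keeps the larger value and, on a tie, the smaller flattened index. This is only an illustrative sketch on nested lists, not how the compiler evaluates the op; the names reducer and max_pool2d_with_indices_ref are made up for this example, and the flattened index r * width + c mirrors the iota + reshape above.

```python
from functools import reduce


def reducer(a, b):
    """Mirror of the reduce_window body: operands are (value, index) pairs."""
    (val_a, idx_a), (val_b, idx_b) = a, b
    ge = val_a >= val_b                      # %3 = compare GE
    value = val_a if ge else val_b           # %4 = select %3
    eq = val_a == val_b                      # %5 = compare EQ
    min_idx = min(idx_a, idx_b)              # %6 = minimum
    winner_idx = idx_a if ge else idx_b      # %7 = select %3
    index = min_idx if eq else winner_idx    # %8 = select %5
    return value, index


def max_pool2d_with_indices_ref(x, kernel=2, stride=2):
    """Hypothetical reference: 2-D max pool over a list of rows, no padding."""
    height, width = len(x), len(x[0])
    init = (float("-inf"), 0)  # same init values as %cst (-inf) and %c (0)
    values, indices = [], []
    for top in range(0, height - kernel + 1, stride):
        row_values, row_indices = [], []
        for left in range(0, width - kernel + 1, stride):
            # Pair every element in the window with its flattened position.
            window = [
                (x[r][c], r * width + c)
                for r in range(top, top + kernel)
                for c in range(left, left + kernel)
            ]
            value, index = reduce(reducer, window, init)
            row_values.append(value)
            row_indices.append(index)
        values.append(row_values)
        indices.append(row_indices)
    return values, indices
```

The first return value corresponds to %result0 and the second to %result1; the single-result form above only needs the maximum, which is why its reducer is a plain stablehlo.maximum.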

You can use the following test to try it out:

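import torch
from torch import nn

# Import paths below are assumed from the tt-torch repository layout.
from tt_torch.tools.utils import CompilerConfig, CompileDepth
from tt_torch.tools.verify import verify_module


def test_maxpool2d():
    class Basic(nn.Module):
        def __init__(self):
            super().__init__()

        def forward(self, x):
            return torch.nn.functional.max_pool2d(x, kernel_size=2, stride=2)

    cc = CompilerConfig()
    cc.compile_depth = CompileDepth.COMPILE_OP_BY_OP  # Comment this line out to compile full graph
    verify_module(Basic(), inputs=[torch.randn(1, 1, 224, 224).to(torch.bfloat16)], compiler_config=cc)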
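
One possible direction (the commits referencing this issue are titled "Remove unused getitems in op by op flow") is for the single-op graph to return only the results that the full graph actually consumes. The sketch below is a toy model of that idea, not the project's implementation: Node and used_result_indices are hypothetical names, and the graph is a hand-built stand-in for a traced torch graph in which each result of a multi-output op is read through a getitem node.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Hypothetical minimal graph node: an op name, its args, and its users."""
    name: str
    op: str
    args: tuple = ()
    users: list = field(default_factory=list)


def used_result_indices(op_node):
    """Return the result indices of op_node that are read by a getitem
    which itself has at least one user."""
    return sorted(
        user.args[1]
        for user in op_node.users
        if user.op == "getitem" and user.users
    )


# Toy graph: the pooled output feeds a relu, the indices are never used.
pool = Node("pool", "max_pool2d_with_indices")
get0 = Node("get0", "getitem", (pool, 0))
get1 = Node("get1", "getitem", (pool, 1))
relu = Node("relu", "relu", (get0,))
pool.users = [get0, get1]
get0.users = [relu]

live = used_result_indices(pool)  # only result 0 is consumed here
```

Under this assumption, an op-by-op graph for the maxpool would return only result 0, which should leave the indices unused in the same way as in the full-graph compile.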

FYI: @AleksKnezevic

LPanosTT assigned brataTT Dec 9, 2024

brataTT added a commit that referenced this issue Jan 7, 2025

Remove unused getitems in op by op flow [#105]

6e28e4b

brataTT added a commit that referenced this issue Jan 7, 2025

Remove unused getitems in op by op flow [#105]

78ceecd

brataTT added a commit that referenced this issue Jan 7, 2025

Remove unused getitems in op by op flow [#105]

16fb250
