Refactor code generation for pointwise operation & PointwiseDynamicFunction #167
Conversation
iclementine
commented
Aug 16, 2024
- Now we generate code with n-d tiles and a 1-d grid with a grid-stride loop, where n is the ndim of the task space;
- Add some simple logic to simplify the task space (when all operands have the same shape and the same strides, and all of them are non-overlapping and dense, we simplify the task space into a 1-d space; a better policy is possible, but we leave that for future work);
- Use a smarter policy for output layout inference (the output follows the stride order of the first tensor whose shape equals the broadcast shape, with pre-defined outputs taking priority over all input tensors; otherwise, the output is C-contiguous);
- Make tile size and grid size in generated code configurable;
- Work around the problem that a store to a block pointer does not automatically cast the value to the pointer's dtype;
- Work around the problem that a value loaded from a pointer-to-bool has dtype int8, and a block pointer built from a pointer-to-bool also has dtype int8;
- Fix the bitwise-* operators without those workarounds, and add test cases with bool inputs & outputs for them;
- Add TypedPtr and StridedBuffer as drop-in replacements for torch.Tensor in generated Triton kernels & wrappers, allowing some unsafe reinterpretation of tensors (dtype, shape, stride, offset) that cannot be done through torch APIs;
- Fix a bug in the flip op where the flipped view (shifted data pointer and negative strides) of the input was applied to the output.
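The task-space simplification described above can be sketched in plain Python. This is a hypothetical illustration, not the actual FlagGems implementation; the function names `is_non_overlapping_and_dense` and `simplify_task_space` are invented for this sketch, and shapes/strides are modeled as tuples rather than torch tensors.

```python
def is_non_overlapping_and_dense(shape, strides):
    # A layout is non-overlapping and dense if, after sorting dims by
    # stride, each stride equals the product of the extents of the
    # dims inside it (i.e. some permutation of a contiguous layout).
    dims = sorted(range(len(shape)), key=lambda i: strides[i])
    expected = 1
    for i in dims:
        if shape[i] == 1:
            continue
        if strides[i] != expected:
            return False
        expected *= shape[i]
    return True

def simplify_task_space(shapes, strides_list):
    """Collapse to a 1-d task space when every operand shares the same
    shape and strides and the layout is non-overlapping and dense;
    otherwise keep the n-d task space."""
    first_shape, first_strides = shapes[0], strides_list[0]
    same_layout = all(
        s == first_shape and st == first_strides
        for s, st in zip(shapes, strides_list)
    )
    if same_layout and is_non_overlapping_and_dense(first_shape, first_strides):
        numel = 1
        for d in first_shape:
            numel *= d
        return (numel,)  # collapsed 1-d task space
    return first_shape   # fall back to the n-d task space
```

For example, two C-contiguous (2, 3) operands collapse to a 1-d space of 6 elements, while operands with mismatched layouts keep the 2-d task space.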
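The output-layout inference policy can likewise be sketched. Again this is an illustrative approximation of the policy described above, not the PR's actual code; `infer_output_strides`, `strides_like`, and `c_contiguous_strides` are hypothetical names, and tensors are modeled as (shape, strides) tuples.

```python
def c_contiguous_strides(shape):
    # Standard C-order (row-major) strides for a given shape.
    strides, acc = [0] * len(shape), 1
    for i in reversed(range(len(shape))):
        strides[i] = acc
        acc *= shape[i]
    return tuple(strides)

def strides_like(shape, src_strides):
    # Reuse the stride *order* of src: rank dims by src stride, then
    # assign contiguous strides in that same order.
    order = sorted(range(len(shape)), key=lambda i: src_strides[i])
    strides, acc = [0] * len(shape), 1
    for i in order:
        strides[i] = acc
        acc *= shape[i]
    return tuple(strides)

def infer_output_strides(broadcast_shape, outputs, inputs):
    """outputs/inputs are lists of (shape, strides) tuples.
    Pre-defined outputs are scanned before inputs, so they take
    priority; fall back to C-contiguous when nothing matches."""
    for shape, strides in list(outputs) + list(inputs):
        if shape == broadcast_shape:
            return strides_like(broadcast_shape, strides)
    return c_contiguous_strides(broadcast_shape)
```

With this policy, an F-ordered input of the broadcast shape yields an F-ordered output, while a broadcast from a smaller input falls back to C-contiguous.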
… broadcasting, does not allocate outputs, and keeps outputs type(Tensor or StridedBuffer)
…by block_pointer; 2. heuristics_for_tile_sizes now prefer large size inner dimension, since we change to C order
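The tile-size heuristic mentioned in this commit could look roughly like the following. This is a hedged sketch assuming a power-of-two element budget per tile; the signature of the real `heuristics_for_tile_sizes` and its budget policy are not shown in this thread, so everything here is illustrative.

```python
def heuristics_for_tile_sizes(shape, max_tile_elems=512):
    """Assign power-of-two tile sizes across dims, filling from the
    innermost (fastest-varying, C-order) dimension first so that the
    inner dimension gets the largest share and loads stay coalesced."""
    tile_sizes = [1] * len(shape)
    budget = max_tile_elems
    for i in reversed(range(len(shape))):  # innermost dim first
        # Smallest power of two covering the extent, capped by budget.
        size = 1
        while size < shape[i] and size < budget:
            size *= 2
        tile_sizes[i] = size
        budget //= size
        if budget <= 1:
            break
    return tile_sizes
```

For a (4, 1000) task space with a 512-element budget, the inner dimension takes the whole budget (tile sizes [1, 512]); for (8, 16), both dims fit and the result is [8, 16].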
Otherwise, LGTM.
2. add scalar function name and ndim as part of the name of the generated functions.
Since there is some performance degradation in the new PointwiseDynamicFunction, we will try to improve it before we proceed. Update: this is fixed in 8eac7f0.
… the key when checking for existing overload; the key is an integer
…en triton version is less than 3
LGTM
lgtm
…nction (#167) * Refactor code generation for pointwise operation & PointwiseDynamicFunction 1. Now we generate code with n-d tiles with a 1-d grid, or 1-d tiles with a 1-d grid with a grid-stride loop, where n is the ndim of the task space; 2. Add some simple logic to simplify the task space (when all operands have the same shape and the same strides, and all of them are non-overlapping and dense, we simplify the task space into a 1-d space; a better policy is possible, but we leave that for future work); 3. Use a smarter policy for output layout inference (the output follows the stride order of the first tensor whose shape equals the broadcast shape, with pre-defined outputs taking priority over all input tensors; otherwise, the output is C-contiguous); 4. Make tile size and grid size in generated code configurable; 5. Work around the problem that a store to a block pointer does not automatically cast the value to the pointer's dtype; 6. Work around the problem that a value loaded from a pointer-to-bool has dtype int8, and a block pointer built from a pointer-to-bool also has dtype int8; 7. Fix the bitwise-* operators without those workarounds, and add test cases with bool inputs & outputs for them; 8. Add TypedPtr and StridedBuffer as drop-in replacements for torch.Tensor in generated Triton kernels & wrappers, allowing some unsafe reinterpretation of tensors (dtype, shape, stride, offset) that cannot be done through torch APIs; 9. Fix a bug in the flip op where the flipped view (shifted data pointer and negative strides) of the input was applied to the output; add a test for the flip op with input that is not C-contiguous. 10. Add config as a parameter for PointwiseDynamicFunction.
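The grid-stride loop from item 1 can be illustrated in pure Python. This is a CPU emulation of the pattern (a real generated kernel would be a Triton `@triton.jit` function using `tl.program_id` and `tl.num_programs`); the function name and parameters here are hypothetical.

```python
def grid_stride_loop_add(out, x, y, block_size=4, grid_size=2):
    """Emulate a 1-d grid of `grid_size` programs, each processing one
    block of `block_size` elements per step and then striding forward
    by grid_size * block_size, so a fixed-size grid covers any task
    space regardless of its size."""
    n = len(x)
    stride = grid_size * block_size
    for pid in range(grid_size):              # one pass per program id
        start = pid * block_size
        for base in range(start, n, stride):  # the grid-stride loop
            for i in range(base, min(base + block_size, n)):
                out[i] = x[i] + y[i]
    return out
```

With `grid_size=2` and `block_size=4`, a 10-element task space is covered as program 0 → elements 0–3 and 8–9, program 1 → elements 4–7; no launch-time dependence on the problem size is needed.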