[Operator] scatter & gather #96
Conversation
```python
if d == dim:
    idx_dim = add_on
idx = idx // shape[d]
# FIXME: Should we write a fast div/mod
```
I think we do need a fast div-mod functor, but I wonder if it already exists in Triton?
I checked out Triton's source code and found that its fast division is implemented using NumPy's np.divide, and as far as I know, NumPy's integer division is optimized. So do you think we just need to import NumPy's div/mod operators in this file and use them?
BTW, Triton doesn't have a divmod-like functor that can return the pair (div i b, mod i b).
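For reference, a minimal sketch (plain Python, not an existing Triton helper; the name `div_mod` is hypothetical) of a div-mod pair that spends only one integer division:

```python
def div_mod(i, b):
    # One integer division; the remainder is recovered with a multiply-subtract
    # instead of a second division/modulo.
    q = i // b
    r = i - q * b
    return q, r
```

The same multiply-subtract trick works elementwise on integer tensors, so it could be reused inside a kernel if nothing equivalent is exposed.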
Should we benchmark these ops?
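If we do, a quick latency check could look like the sketch below (illustrative shapes only, not the PR's reported numbers), using Triton's built-in bench helper:

```python
import torch
import triton

# Illustrative latency check for a gather-like op; the eager torch.gather call
# stands in for whichever implementation is being measured.
inp = torch.randn(1024, 1024, device="cuda")
index = torch.randint(0, 1024, (1024, 1024), device="cuda")

ms = triton.testing.do_bench(lambda: torch.gather(inp, 1, index))
print(f"gather latency: {ms:.3f} ms")
```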
```diff
@@ -137,3 +139,30 @@ def dim_compress(inp, dims):
     sorted_reduction_dim = sorted(dims, key=lambda x: stride[x], reverse=True)
     order = batch_dim + sorted_reduction_dim
     return inp.permute(order).contiguous()
+
+
+def offsetCalculator(inp, idx, strides, dim, isInp):
```
Can we document this function to help others understand it?
The offset calculation incurs massive overhead. Let's try to do it in a Triton kernel, shall we?
* Use Triton to do the offset calculations; the perf test results can be seen in the scatter & gather doc.
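For illustration, here is a minimal sketch of what a device-side offset calculation could look like. This is an assumption-laden sketch, not the PR's actual kernel: the function names are hypothetical, the `dim`/`isInp` special-casing of `offsetCalculator` is ignored, and shape/strides are passed as small GPU tensors to sidestep the tuple-argument limitation discussed further down.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def flat_offsets_kernel(idx_ptr, out_ptr, strides_ptr, shape_ptr, n_elements,
                        NDIM: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    rem = tl.load(idx_ptr + offs, mask=mask, other=0)  # flat index of each element
    acc = rem * 0                                      # integer accumulator, same shape as rem
    # Peel coordinates from the innermost dim outward: offset = sum(coord_d * stride_d).
    for k in range(NDIM):
        d = NDIM - 1 - k
        size = tl.load(shape_ptr + d)
        stride = tl.load(strides_ptr + d)
        acc += (rem % size) * stride
        rem = rem // size
    tl.store(out_ptr + offs, acc, mask=mask)


def flat_offsets(idx, shape, strides):
    # Host-side wrapper (idx is assumed to be an int64 CUDA tensor). Shape and
    # strides are materialized as small int64 tensors because a Python tuple
    # cannot be passed to a jitted function in Triton 2.2.
    idx = idx.contiguous()
    n = idx.numel()
    out = torch.empty_like(idx)
    shape_t = torch.tensor(shape, dtype=torch.int64, device=idx.device)
    strides_t = torch.tensor(strides, dtype=torch.int64, device=idx.device)
    BLOCK = 1024
    grid = (triton.cdiv(n, BLOCK),)
    flat_offsets_kernel[grid](idx, out, strides_t, shape_t, n,
                              NDIM=len(shape), BLOCK=BLOCK)
    return out
```

The real `offsetCalculator` additionally has to treat the scatter/gather `dim` (and the `isInp` flag) specially, so this only shows the skeleton of the stride walk.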
Hello reviewers, I've added the Triton-kernel version of the offsets calculator in the latest commits! Based on the perf test results (which show some improvement over the previous version, but still lag behind Torch, with latency levels remaining the same as before...), I've temporarily switched the offset calculations in scatter & gather to the kernel implementation. Please take a look at this part of the code during review.
Thanks Xiaoyan. The improvement is significant and we appreciate any effort that contributes to a better codebase.
LGTM
```python
idx = torch.arange(0, index.numel(), device=inp.device).reshape(index.shape)
# Temporarily call offsetCalculator() outside the block (although it can actually proceed in parallel),
# because a Triton jit function cannot accept a Tuple as input in version 2.2.0 (in 3.0.0 it is available),
# and we do need **the whole stride[]** to accomplish this calculation!
# FIXME: If stride[] can be passed wholly to the Triton jit function, we can do this calculation in the kernel
# so that the offset calculation can proceed in parallel.
inp_offsets = offset_calculator(inp_strided, idx, inp.stride(), dim, isInp=True)
```
It seems that if idx is always passed as a trivial iterator like this, it may not need to be materialized.
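Building on the hypothetical `flat_offsets_kernel` sketch above: since idx is just 0..numel-1 laid out in index's shape, a kernel could regenerate it from its block offsets instead of loading it, so the torch.arange buffer would never be allocated. Only the load line changes:

```python
# Delta relative to the illustrative flat_offsets_kernel above (not the PR's code):
# the flat index is regenerated in-kernel instead of being loaded from idx_ptr.
offs = pid * BLOCK + tl.arange(0, BLOCK)
rem = offs   # replaces: rem = tl.load(idx_ptr + offs, mask=mask, other=0)
```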
We have completed the development of the `select`, `scatter`, and `gather` operators. Specifically:

The corresponding ATen operators are `select.int`, `scatter.src`, `scatter.reduce`, `scatter_add`, `gather`, and `gather.out`.
According to the documentation, since `scatter.reduce` is still in beta, we have only implemented the add and multiply reduce options, which were originally supported in the `scatter` operator's parameters. Torch plans to implement additional reduce options such as mean, amax, and amin in the future. We can evaluate whether to add these reduce options based on our future work plans.

See also:
https://pytorch.org/docs/stable/generated/torch.Tensor.scatter_reduce_.html#torch.Tensor.scatter_reduce_
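For reference, a minimal usage sketch of the two reduce modes covered here, written against the standard PyTorch scatter API (illustrative shapes and values only):

```python
import torch

inp = torch.ones(2, 4)
src = torch.arange(8, dtype=torch.float32).reshape(2, 4)
index = torch.tensor([[0, 1, 2, 3], [3, 2, 1, 0]])

# reduce="add":      out[i][index[i][j]] += src[i][j]
added = inp.clone().scatter_(1, index, src, reduce="add")
# reduce="multiply": out[i][index[i][j]] *= src[i][j]
multiplied = inp.clone().scatter_(1, index, src, reduce="multiply")
```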
The `scatter` and `gather` operators face reproducibility issues in their output due to potentially non-deterministic results caused by non-unique indices. It's worth noting that this issue is not exclusive to scatter-related operators, so further discussion may be needed on potential solutions, for example, following Torch's approach of sacrificing some performance to ensure reproducibility for the same set of inputs.

See also:
https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms
https://pytorch.org/docs/stable/notes/randomness.html#reproducibility
To avoid issues caused by non-deterministic results, we designed test cases with unique indices to ensure consistent output.
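As an illustration of that test strategy (a sketch with made-up shapes, not the PR's actual test code), unique indices can be built from per-row permutations so that no destination is written twice:

```python
import torch

M, N = 4, 8
# Each row is a permutation of 0..N-1, so along dim=1 no index repeats and
# scatter has no colliding writes -> the result is deterministic.
index = torch.stack([torch.randperm(N) for _ in range(M)])
src = torch.randn(M, N)
out = torch.zeros(M, N).scatter_(1, index, src)
```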