Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make Coder.getEncodedElementByteSize public to allow performance improvements in higher level coders #33626

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

stankiewicz
Copy link
Contributor

Fixes #33620

Half of the sdk coders have getEncodedElementByteSize public because they are invoked outside of beam.sdk.coders package.
Some accumulators like ComposedAccumulatorCoder should have access to optimized methods for calculating size but due to it can be any Coder.
Optimized way of calculating size shouldn't be limited to Coders implemented in beam.sdk.coders only.
This change changes it to public and fixes all the coders in beam sdk.

This will break coders implemented in customer codebase requiring small patch to change visibility.

@stankiewicz
Copy link
Contributor Author

Alternative to breaking change would be keeping things as is and moving problematic accumulator to sdk but it won't help if someone else writes problematic coder outside of beam.

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

long size = 0;
for (int i = 0; i < value.length; ++i) {
Coder<Object> objectCoder = coders.get(i);
if (objectCoder instanceof StructuredCoder) {
Copy link
Contributor Author

@stankiewicz stankiewicz Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this type of conditionhere. StructuredCoder doesn't enforce overriding getEncodedElementByteSize so I'm not sure if this is best practice. Ideally I would like to loop and invoke this method for every coder (extending CustomCoder/StructuredCoder/Coder) on the list. If underlying coder has default implementation, it will just serialize as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the typecheck on StructuredCoder, just loop and add for every coder. The only downside is that there may be overhead constructing multiple CountingOutputStreams rather than creating one, but as you say there's no contract between StructuredCoder and implementing getEncodedElementByteSize.

Go ahead and remove the check in LengthPrefixCoder too, there's no benefit for ever skipping this attempt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertwb thanks, changed.

@stankiewicz stankiewicz changed the title make Coder.getEncodedElementByteSize public. make Coder.getEncodedElementByteSize public to allow performance improvements in higher level coders Jan 17, 2025
Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about the backwards incompatibility implications of changing the visibility here. If we do need to call it from outside a Coder (which seems OK to me), what if we instead make a public static method on Coder that calls this?

long size = 0;
for (int i = 0; i < value.length; ++i) {
Coder<Object> objectCoder = coders.get(i);
if (objectCoder instanceof StructuredCoder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the typecheck on StructuredCoder, just loop and add for every coder. The only downside is that there may be overhead constructing multiple CountingOutputStreams rather than creating one, but as you say there's no contract between StructuredCoder and implementing getEncodedElementByteSize.

Go ahead and remove the check in LengthPrefixCoder too, there's no benefit for ever skipping this attempt.

@stankiewicz
Copy link
Contributor Author

I am a bit concerned about the backwards incompatibility implications of changing the visibility here. If we do need to call it from outside a Coder (which seems OK to me), what if we instead make a public static method on Coder that calls this?

added static method, thanks for suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: StateBackedIterable serializes elements size for every element when ComposedCombine is used
2 participants