Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support comparison semantics for batch serialize/deserialize of Column #9756

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

guo-shaoge
Copy link
Contributor

@guo-shaoge guo-shaoge commented Jan 1, 2025

What problem does this PR solve?

Issue Number: close #9761

Problem Summary: #9553 add serialize/deserialize interface by column-wise, but it didn't handle collator for ColumnString and real copy format for ColumnDecimal([sign, limb_count, limb_data])

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 1, 2025
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
This reverts commit c07d13a.
This reverts commit abd55ac.
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
@guo-shaoge
Copy link
Contributor Author

/retest

Signed-off-by: guo-shaoge <[email protected]>
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jan 10, 2025
Copy link
Contributor

ti-chi-bot bot commented Jan 10, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-01-10 05:57:26.14022174 +0000 UTC m=+505989.429053445: ☑️ agreed by yibin87.

@guo-shaoge guo-shaoge requested a review from gengliqi January 10, 2025 08:55
@guo-shaoge guo-shaoge changed the title support batch serialize/deserialize method for Column support unique family of batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique family of batch serialize/deserialize of Column support unique semantics of batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique semantics of batch serialize/deserialize of Column support unique semantics for batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique semantics for batch serialize/deserialize of Column support comparison semantics for batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge force-pushed the batch_serialize branch 2 times, most recently from 06ff1d4 to 7572680 Compare January 10, 2025 09:27
}
else
{
inline_memcpy(pos[i], &data[array_offsets[start + i - 1]], len * sizeof(T));
if (len <= 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a simple optimization. If the length is very small, copying them one by one is faster than std::memcpy.

{
assert(sizeAt(i) >= 1);
// Minus 1 because of terminating zero.
byte_size[i] += sizeof(UInt32) + (sizeAt(i) - 1) * max_bytes_one_char;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size * max_bytes_one_char may waste lots of memory. For example, a utf8 character is 3 bytes, max_bytes_one_char is 4 bytes. 4 bytes is enough but here need 12 bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

const void * src = &chars[offsetAt(start + i)];
if constexpr (has_collator)
{
auto sort_key = collator->sortKey(reinterpret_cast<const char *>(src), str_size - 1, *sort_key_container);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sortKey is a virtual function. How about adding a batch version to reduce the overhead of virtual functions?

Copy link
Contributor Author

@guo-shaoge guo-shaoge Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some marcos so we can static_cast collator to Derived class and call its method

Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
@guo-shaoge guo-shaoge requested a review from gengliqi January 14, 2025 06:23
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
dbms/src/Columns/ColumnString.cpp Outdated Show resolved Hide resolved
dbms/src/Columns/ColumnString.cpp Show resolved Hide resolved
dbms/src/Columns/ColumnString.cpp Show resolved Hide resolved
Signed-off-by: guo-shaoge <[email protected]>
@guo-shaoge guo-shaoge requested a review from gengliqi January 20, 2025 05:55
{
size_t prev_size = offsets.size();
size_t char_size = chars.size();
size_t size = pos.size();

#ifdef TIFLASH_ENABLE_AVX_SUPPORT
if (use_nt_align_buffer)
if constexpr (!add_terminating_zero)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a terminating zero in serializeToPos? If so, this branch parameter can be removed and align_buffer can be used as original version.

size_t maxBytesForOneChar() const override
{
// Every char have 8 uint16 at most.
return 8 * sizeof(uint16_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems too large. Is it necessary to use 16 bytes?

@guo-shaoge
Copy link
Contributor Author

this new interface is not needed anymore.
For ColumnDecimal, a new PR will make sure the semantics of batch serialize and row serialize are same.
For ColumnString, ICollator::sortKey() will be called before calling countSerializeByteSize/serializeToPos. This can avoid using too large memory because of collator::sortKey. (max bytes of one char of UCACICollator is very large, but not all bytes are used)

@guo-shaoge guo-shaoge closed this Jan 20, 2025
@guo-shaoge guo-shaoge reopened this Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support serialize/deserialize by column-wise for collator
3 participants