Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add stringview support to encode and decode and bit_length #13332

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jonathanc-n
Copy link
Contributor

@jonathanc-n jonathanc-n commented Nov 10, 2024

Which issue does this PR close?

Closes #13330 .

Rationale for this change

Add string view to some functions to prevent regression during @timsaucer df python migration.

What changes are included in this PR?

Added support for utf8view to bit_length, array_to_string, and encode

Are these changes tested?

@jonathanc-n
Copy link
Contributor Author

@timsaucer Do you think testing is necessary here? Can add if so. Also, are you able to show how to replicate the optimizer error for ConcatFunc.

@timsaucer
Copy link
Contributor

Wonderful. Thank you. I'll try to review tomorrow.

I did dive into the ConcatFunc and I found (and resolved) two issues:

  1. when doing concat on pure literals we aren't allowing for utf8view or large utf8view with values. This isn't caught in the unit tests because they don't have these variants
  2. simplify_concat is changing the merged types to utf8, which is what caused the schema error.

I'll put up the PR today or tomorrow to fix both of these, with unit tests.

@timsaucer
Copy link
Contributor

ConcatFunc: #13346

@jonathanc-n
Copy link
Contributor Author

Will add tests tomorrow.

@alamb alamb marked this pull request as draft November 13, 2024 13:51
@alamb
Copy link
Contributor

alamb commented Nov 13, 2024

Thank you @jonathanc-n -- I marked this PR as draft as you seem to be planning to add more tests. Please mark it as ready for review with it is ready for another look

Thanks a lot for sorting this out

@jonathanc-n
Copy link
Contributor Author

jonathanc-n commented Nov 13, 2024

@timsaucer Will push some tests, however after going through it, date_bin need to wait on this conversion to work:
Unsupported CAST from Timestamp(Nanosecond, None) to Utf8View (a issue can be opened for this in arrow-rs to do this as a double cast is probably not ideal). named_struct also has this problem, if you want the output to be defaulted to be utf8view this has to be solved in arrow for array_struct. Even if i defined the schema in datafusion to output Utf8View or input Utf8View, there will be a schema mismatch or a conversion error, due to the data originally being Utf8 (please correct me if i'm wrong on this).

I changed this pr to only deal with support stringview for array_to_string, encoding, and bit_length

@jonathanc-n jonathanc-n marked this pull request as ready for review November 13, 2024 22:41
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 13, 2024
@Omega359
Copy link
Contributor

Omega359 commented Nov 14, 2024

I updated array_to_string and string_to_array in a separate PR - it's a bit more extensive than what this PR contains.

@Omega359
Copy link
Contributor

@timsaucer Will push some tests, however after going through it, date_bin need to wait on this conversion to work: Unsupported CAST from Timestamp(Nanosecond, None) to Utf8View (a issue can be opened for this in arrow-rs to do this as a double cast is probably not ideal).

Is this an arrow-rs issue or is this #13363 ?

@Omega359
Copy link
Contributor

Omega359 commented Nov 14, 2024

@timsaucer Will push some tests, however after going through it, date_bin need to wait on this conversion to work: Unsupported CAST from Timestamp(Nanosecond, None) to Utf8View (a issue can be opened for this in arrow-rs to do this as a double cast is probably not ideal).

Is this an arrow-rs issue or is this #13363 ?

Possibly both. I think arrow-cast is missing the _,Utf8View for primitive & temporal casting here: https://github.com/apache/arrow-rs/blob/1d580ec1996e75bb9be4c1c880871d0dd2718ad4/arrow-cast/src/cast/mod.rs#L1465

This PR in arrow-rs I think may fix the issue

@alamb alamb changed the title feat: Add stringview for functions to prevent regression feat: Add stringview support to encode Nov 15, 2024
@alamb alamb changed the title feat: Add stringview support to encode feat: Add stringview support to encode and decode Nov 15, 2024
@alamb alamb changed the title feat: Add stringview support to encode and decode feat: Add stringview support to encode and decode and bit_length Nov 15, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jonathanc-n @Omega359 and @timsaucer

I updated the title of this PR to reflect what it currently contains and it now looks good to me

Please let me know if I missed anything

@@ -224,6 +224,7 @@ fn encode_process(value: &ColumnarValue, encoding: Encoding) -> Result<ColumnarV
ColumnarValue::Array(a) => match a.data_type() {
DataType::Utf8 => encoding.encode_utf8_array::<i32>(a.as_ref()),
DataType::LargeUtf8 => encoding.encode_utf8_array::<i64>(a.as_ref()),
DataType::Utf8View => encoding.encode_utf8_array::<i32>(a.as_ref()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this means the output type will be Utf8 but for this type i think that is totally fine (there is no need to preserve the view encoding)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for string view to a few functions
4 participants