Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error running crypto functions on Dictionary arrays such as md5 #13444

Open
alamb opened this issue Nov 15, 2024 · 3 comments
Open

Error running crypto functions on Dictionary arrays such as md5 #13444

alamb opened this issue Nov 15, 2024 · 3 comments
Assignees
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@alamb
Copy link
Contributor

alamb commented Nov 15, 2024

Describe the bug

A regression appears to have been introduced in

T

To Reproduce

This used to work in DataFusion 42.0.0 but doesn't in DataFusion 43.0.0:

DataFusion CLI v43.0.0
> create table t as values (arrow_cast('Foo', 'Dictionary(Int32, LargeUtf8)'));
0 row(s) fetched.
Elapsed 0.013 seconds.

> select md5(column1) from t;
Arrow error: Compute error: Internal Error: Cannot cast Utf8View to StringArray of expected type

Expected behavior

Query should complete

Additional context

I found this while writing tests in #13443 -- Feel free to take those tests to work on this issue

Those tests fail like this:

External error: query failed: DataFusion error: Arrow error: Invalid argument error: column types must match schema types, expected LargeUtf8 but found Utf8 at column index 0
[SQL] select md5(ascii_1) from test_basic_operator;
at test_files/string/./string_query.slt.part:1381
at test_files/string/large_string.slt:93
@alamb alamb added bug Something isn't working regression Something that used to work no longer does labels Nov 15, 2024
@Omega359
Copy link
Contributor

I think this may have been a pre-existing issue. The core issue I think is that apparently md5 is required to return utf8:

// md5 requires special handling because of its unique utf8 return type
but the return_type function says it can return other types:

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    use DataType::*;
    Ok(match &arg_types[0] {
        LargeUtf8 | LargeBinary => LargeUtf8,
        Utf8View | Utf8 | Binary => Utf8,
        Null => Null,
        Dictionary(_, t) => match **t {
            LargeUtf8 | LargeBinary => LargeUtf8,
            Utf8 | Binary => Utf8,
            Null => Null,
            _ => {
                return plan_err!(
                    "the md5 can only accept strings but got {:?}",
                    **t
                );
            }
        },
        other => {
            return plan_err!(
                "The md5 function can only accept strings. Got {other}"
            );
        }
    })
}

Should be an easy fix to the return_type fn.

@Omega359
Copy link
Contributor

take

@jayzhan211
Copy link
Contributor

I think adding casting from binary to string in TypeSignature::String and apply it to md5 is able to fix this too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
Development

No branches or pull requests

3 participants