BUG (string dtype): comparison of string column to mixed object column fails #60228

Open
Tracked by #54792
jorisvandenbossche opened this issue Nov 7, 2024 · 2 comments
Labels: Bug, Strings (String extension data type and string data)

@jorisvandenbossche
Member

At the moment, you can freely compare a string column with a mixed object-dtype column:

>>> ser_string = pd.Series(["a", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0    False
1     True
dtype: bool

But with the string dtype enabled (using pyarrow), this now raises an error:

>>> pd.options.future.infer_string = True
>>> ser_string = pd.Series(["a", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:510, in ArrowExtensionArray._box_pa_array(cls, value, pa_type, copy)
...
--> 510     pa_array = pa.array(value, from_pandas=True)
...
ArrowInvalid: Could not convert 'b' with type str: tried to convert to int64

This happens because ArrowExtensionArray (the ArrowEA) also tries to convert the other operand to an Arrow array, which fails for mixed types.

In general, I think our rule is that an == comparison never fails and instead just gives False where the values are not comparable.
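
The rule described here can be illustrated in plain Python, without pandas. This is only a minimal sketch of the intended semantics, not how pandas implements comparisons; `elementwise_eq` is a hypothetical helper name.

```python
def elementwise_eq(left, right):
    """Compare two equal-length sequences position by position.

    Pairs that are not comparable (for example 1 and "b") give False
    instead of raising.
    """
    if len(left) != len(right):
        raise ValueError("operands must have the same length")
    result = []
    for a, b in zip(left, right):
        try:
            result.append(bool(a == b))
        except Exception:
            # Assumption for this sketch: an __eq__ that raises is
            # treated as "not comparable", so the answer is False.
            result.append(False)
    return result


print(elementwise_eq(["a", "b"], [1, "b"]))  # [False, True]
```

The output matches the object-dtype result shown at the top of the issue: False for the non-comparable pair, True for the matching strings.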

@jorisvandenbossche jorisvandenbossche added Bug Strings String extension data type and string data labels Nov 7, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024
@jorisvandenbossche
Member Author

It seems we actually have a comment in the code about this issue for the object-dtype case:

try:
    result = pc_func(self._pa_array, self._box_pa(other))
except pa.ArrowNotImplementedError:
    # TODO: could this be wrong if other is object dtype?
    #  in which case we need to operate pointwise?
    result = ops.invalid_comparison(self, other, op)
    result = pa.array(result, type=pa.bool_())
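
One possible reading of that TODO is that, when converting the other operand fails, the comparison could fall back to operating pointwise. The sketch below models only that control flow in plain Python. It is not a proposed patch: `strict_convert` is a hypothetical stand-in for a single-type conversion such as `pa.array`, and `ConversionError` and `compare` are likewise invented for illustration.

```python
import operator


class ConversionError(TypeError):
    """Raised by the stand-in converter when values have mixed types."""


def strict_convert(values):
    # Hypothetical stand-in for a single-type conversion: accept the
    # input only if every value has the same Python type.
    kinds = {type(v) for v in values}
    if len(kinds) > 1:
        names = sorted(k.__name__ for k in kinds)
        raise ConversionError(f"mixed types: {names}")
    return list(values)


def compare(left, right, op=operator.eq):
    """Apply == or != position by position, never raising on mixed input."""
    if op not in (operator.eq, operator.ne):
        raise ValueError("only == and != are modelled in this sketch")
    try:
        other = strict_convert(right)
    except ConversionError:
        # Fallback: operate pointwise on the unconverted values.
        # Non-comparable pairs count as "not equal".
        not_equal = op is operator.ne
        result = []
        for a, b in zip(left, right):
            try:
                result.append(bool(op(a, b)))
            except Exception:
                result.append(not_equal)
        return result
    # Conversion succeeded: the uniform-type path.
    return [bool(op(a, b)) for a, b in zip(left, other)]


print(compare(["a", "b"], [1, "b"]))               # [False, True]
print(compare(["a", "b"], [1, "b"], operator.ne))  # [True, False]
```

The point of the design is that a failed conversion of the other operand changes only which path is taken, not whether the comparison returns a result.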

@TEARFEAR

take
