
BUG: .convert_dtypes(dtype_backend="pyarrow") strips timezone from tz-aware pyarrow timestamp Series #60237

jamesdow21 opened this issue Nov 8, 2024 · 7 comments
Labels: Arrow (pyarrow functionality), Bug, Dtype Conversions (unexpected or buggy dtype conversions), good first issue, Localization (internationalization of data)

@jamesdow21 commented Nov 8, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> s = pd.Series(pd.to_datetime(range(5), utc=True, unit="h"), dtype="timestamp[ns, tz=UTC][pyarrow]")
>>> s
0    1970-01-01 00:00:00+00:00
1    1970-01-01 01:00:00+00:00
2    1970-01-01 02:00:00+00:00
3    1970-01-01 03:00:00+00:00
4    1970-01-01 04:00:00+00:00
dtype: timestamp[ns, tz=UTC][pyarrow]
>>> s.convert_dtypes(dtype_backend="pyarrow")
0    1970-01-01 00:00:00
1    1970-01-01 01:00:00
2    1970-01-01 02:00:00
3    1970-01-01 03:00:00
4    1970-01-01 04:00:00
dtype: timestamp[ns][pyarrow]

Issue Description

Calling .convert_dtypes(dtype_backend="pyarrow") on a Series that already has a timezone-aware pyarrow timestamp dtype strips the timezone information.

Testing on older versions, this appears to be a regression introduced sometime between 2.0.3 and 2.1.0rc0.

Expected Behavior

No change should be made to the dtype; the Series should keep its timestamp[ns, tz=UTC][pyarrow] dtype.
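
A quick check of the expected invariant (an illustrative sketch only, not an existing pandas test):

    import pandas as pd

    s = pd.Series(
        pd.to_datetime(range(5), utc=True, unit="h"),
        dtype="timestamp[ns, tz=UTC][pyarrow]",
    )
    result = s.convert_dtypes(dtype_backend="pyarrow")
    # Expected: the dtype is unchanged. On affected versions this assertion
    # fails because result.dtype is timestamp[ns][pyarrow] (the tz is dropped).
    assert result.dtype == s.dtype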

Installed Versions

INSTALLED VERSIONS

commit : 3f7bc81
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 3.0.0.dev0+1654.g3f7bc81ae6
numpy : 2.1.3
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 18.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

jamesdow21 added the Bug and Needs Triage labels Nov 8, 2024
@jamesdow21 (Author) commented Nov 8, 2024

Side note: this is somewhat similar to the fix in #53382, but that change was handling a DatetimeTZDtype being converted to a pyarrow timestamp dtype.

After a bit of further investigation, I've mostly tracked down what changed between 2.0.3 and 2.1.0rc0 to cause this.

Call stack setup (which is just forwarding along the keyword arguments in this case):

  • pd.Series.convert_dtypes is inherited from pandas.core.generic.NDFrame
  • which calls its block manager's .convert_dtypes
  • which then calls the block manager's .apply
  • which then calls .convert_dtypes on each Block
  • which loops over each column in the Block
  • and finally calls pandas.core.dtypes.cast.convert_dtypes with the underlying .values array (as the argument input_array)

In this case, input_array is an ArrowExtensionArray still with the original "timestamp[ns, tz=UTC][pyarrow]" ArrowDtype
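
The dtype decision can be reproduced directly at that level. A minimal sketch, assuming convert_dtypes accepts the array as its first argument and dtype_backend as a keyword (as the call stack above suggests); this is an internal function and its signature may change between versions:

    import pandas as pd
    from pandas.core.dtypes.cast import convert_dtypes

    s = pd.Series(
        pd.to_datetime(range(5), utc=True, unit="h"),
        dtype="timestamp[ns, tz=UTC][pyarrow]",
    )
    arr = s.array  # ArrowExtensionArray with dtype timestamp[ns, tz=UTC][pyarrow]
    # roughly the call the Block ends up making (other keywords left at defaults)
    print(convert_dtypes(arr, dtype_backend="pyarrow"))
    # prints timestamp[ns][pyarrow] on affected versions -- the tz is gone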

The first big if and elif branches in pandas.core.dtypes.cast.convert_dtypes evaluate to False, so execution falls through to the else branch, which sets
inferred_dtype = input_array.dtype

The remainder of the function is

    if dtype_backend == "pyarrow":
        from pandas.core.arrays.arrow.array import to_pyarrow_type
        from pandas.core.arrays.string_ import StringDtype

        assert not isinstance(inferred_dtype, str)

        if (
            (convert_integer and inferred_dtype.kind in "iu")
            or (convert_floating and inferred_dtype.kind in "fc")
            or (convert_boolean and inferred_dtype.kind == "b")
            or (convert_string and isinstance(inferred_dtype, StringDtype))
            or (
                inferred_dtype.kind not in "iufcb"
                and not isinstance(inferred_dtype, StringDtype)
            )
        ):
            if isinstance(inferred_dtype, PandasExtensionDtype) and not isinstance(
                inferred_dtype, DatetimeTZDtype
            ):
                base_dtype = inferred_dtype.base
            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype
            elif isinstance(inferred_dtype, StringDtype):
                base_dtype = np.dtype(str)
            else:
                base_dtype = inferred_dtype
            if (
                base_dtype.kind == "O"  # type: ignore[union-attr]
                and input_array.size > 0
                and isna(input_array).all()
            ):
                import pyarrow as pa

                pa_type = pa.null()
            else:
                pa_type = to_pyarrow_type(base_dtype)
            if pa_type is not None:
                inferred_dtype = ArrowDtype(pa_type)
    elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
        # GH 53648
        inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]

    # error: Incompatible return value type (got "Union[str, Union[dtype[Any],
    # ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
    return inferred_dtype  # type: ignore[return-value]

inferred_dtype.kind is "M" and it is not a StringDtype, so the last condition of the if is True.

inferred_dtype is an ArrowDtype, so inferred_dtype.numpy_dtype gets assigned to base_dtype.

In pandas 2.0.3, the numpy dtype was dtype('O'), so to_pyarrow_type returned None and inferred_dtype was never modified.

But in pandas >= 2.1.0rc0, the numpy dtype is dtype('<M8[ns]'), and to_pyarrow_type converts it to a tz-naive timestamp dtype because numpy datetimes do not carry timezone information.
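
A minimal sketch of where the timezone is dropped, assuming the behavior described above:

    import pandas as pd
    import pyarrow as pa
    from pandas.core.arrays.arrow.array import to_pyarrow_type

    dtype = pd.ArrowDtype(pa.timestamp("ns", tz="UTC"))
    base_dtype = dtype.numpy_dtype
    print(base_dtype)                    # datetime64[ns] on >= 2.1.0rc0 (object on 2.0.3)
    print(to_pyarrow_type(base_dtype))   # timestamp[ns] -- no tz, since numpy dtypes have none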

@jamesdow21 (Author) commented Nov 8, 2024

The changes to ArrowDtype.numpy_dtype in #51800 are the root cause of this regression

    def numpy_dtype(self) -> np.dtype:
        """Return an instance of the related numpy dtype"""
+        if pa.types.is_timestamp(self.pyarrow_dtype):
+            # pa.timestamp(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow timestamp units.
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"datetime64[{self.pyarrow_dtype.unit}]")
+        if pa.types.is_duration(self.pyarrow_dtype):
+            # pa.duration(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow duration units
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"timedelta64[{self.pyarrow_dtype.unit}]")
        if pa.types.is_string(self.pyarrow_dtype):
            # pa.string().to_pandas_dtype() = object which we don't want
            return np.dtype(str)
        try:
            return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
        except (NotImplementedError, TypeError):
            return np.dtype(object)

Without those branches, self.pyarrow_dtype.to_pandas_dtype() returns a DatetimeTZDtype, which raises a TypeError when passed to np.dtype, so the fallback object numpy dtype is returned instead.
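
For reference, a sketch of that older code path, relying on the to_pandas_dtype and np.dtype behavior described above:

    import numpy as np
    import pyarrow as pa

    pandas_dtype = pa.timestamp("ns", tz="UTC").to_pandas_dtype()
    print(pandas_dtype)  # datetime64[ns, UTC] (a DatetimeTZDtype)
    try:
        np.dtype(pandas_dtype)
    except TypeError:
        # pandas 2.0.3 hit this branch and fell back to np.dtype(object);
        # to_pyarrow_type then returned None, so the ArrowDtype was left alone
        print("falls back to object")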

The linked pyarrow issue (apache/arrow#34462) has been resolved, so those if blocks could presumably now just be removed (while ensuring that the minimum supported pyarrow version includes the fix), which would resolve this bug.
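
One way to check whether the workaround is still needed with the installed pyarrow (the thread doesn't state which pyarrow version fixed apache/arrow#34462, so treat the expected outputs as assumptions):

    import pyarrow as pa

    # If the upstream fix is present, these should preserve the unit instead
    # of always reporting nanoseconds.
    print(pa.timestamp("us").to_pandas_dtype())  # expect datetime64[us] if fixed
    print(pa.duration("s").to_pandas_dtype())    # expect timedelta64[s] if fixed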

An even simpler fix, though, is to modify the line in pandas.core.dtypes.cast.convert_dtypes to be if dtype_backend == "pyarrow" and not isinstance(inferred_dtype, ArrowDtype):

I'm not sure why it's currently trying to go from arrow dtype -> numpy dtype -> arrow dtype.

It seems intentional though, since there's explicitly:

            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype

@rhshadrach (Member) commented

Thanks for the report, confirmed on main. The suggested fix of removing the if statements seems to resolve the issue, and I'm seeing no test failures locally. PRs are welcome!

rhshadrach added the Dtype Conversions, Timezones, Arrow, Localization, and good first issue labels and removed the Needs Triage and Timezones labels Nov 8, 2024
@jasonmokk (Contributor) commented

take

jasonmokk removed their assignment Nov 9, 2024
@road-1 commented Nov 9, 2024

take

@jamesdow21 (Author) commented

The table in this section of the data types API docs reads to me as implying that a pyarrow tz-aware timestamp dtype should map to a numpy datetime64 dtype.

I would definitely defer to someone else's judgement on whether that is correct, or whether the linked table should distinguish between a pa.timestamp() type with and without a timezone.

@AmeyGaneshMedewar commented

take
