
BUG: .convert_dtypes(dtype_backend="pyarrow") strips timezone from tz-aware pyarrow timestamp Series #60237

jamesdow21 opened this issue Nov 8, 2024 · 7 comments
Labels: Arrow (pyarrow functionality), Bug, Dtype Conversions (unexpected or buggy dtype conversions), good first issue, Localization (internationalization of data)

@jamesdow21 commented Nov 8, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> s = pd.Series(pd.to_datetime(range(5), utc=True, unit="h"), dtype="timestamp[ns, tz=UTC][pyarrow]")
>>> s
0    1970-01-01 00:00:00+00:00
1    1970-01-01 01:00:00+00:00
2    1970-01-01 02:00:00+00:00
3    1970-01-01 03:00:00+00:00
4    1970-01-01 04:00:00+00:00
dtype: timestamp[ns, tz=UTC][pyarrow]
>>> s.convert_dtypes(dtype_backend="pyarrow")
0    1970-01-01 00:00:00
1    1970-01-01 01:00:00
2    1970-01-01 02:00:00
3    1970-01-01 03:00:00
4    1970-01-01 04:00:00
dtype: timestamp[ns][pyarrow]

Issue Description

Calling .convert_dtypes(dtype_backend="pyarrow") on a Series that already has a timezone-aware pyarrow timestamp dtype strips the timezone information.

Testing on older versions, this appears to be a regression introduced sometime between 2.0.3 and 2.1.0rc0.

Expected Behavior

No change should be made to the dtype; the Series should keep its timestamp[ns, tz=UTC][pyarrow] dtype.
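
A quick check of the expected invariant (an illustrative sketch only, not an existing pandas test):

    import pandas as pd

    s = pd.Series(
        pd.to_datetime(range(5), utc=True, unit="h"),
        dtype="timestamp[ns, tz=UTC][pyarrow]",
    )
    result = s.convert_dtypes(dtype_backend="pyarrow")
    # Expected: the dtype is unchanged. On affected versions this assertion
    # fails because result.dtype is timestamp[ns][pyarrow] (the tz is dropped).
    assert result.dtype == s.dtype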

Installed Versions

INSTALLED VERSIONS

commit : 3f7bc81
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 3.0.0.dev0+1654.g3f7bc81ae6
numpy : 2.1.3
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 18.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

jamesdow21 added the Bug and Needs Triage labels Nov 8, 2024
@jamesdow21 (Author) commented Nov 8, 2024

Side note: this is somewhat similar to the fix in #53382, but that change was handling a DatetimeTZDtype being converted to a pyarrow timestamp dtype.

After a bit of further investigation, I've mostly tracked down what changed between 2.0.3 and 2.1.0rc0 to cause this.

Call stack setup (which is just forwarding along the keyword arguments in this case):

  • pd.Series.convert_dtypes is inherited from pandas.core.generic.NDFrame
  • which calls its block manager's .convert_dtypes
  • which then calls the block manager's .apply
  • which then calls .convert_dtypes on each Block
  • which loops over each column in the Block
  • and finally calls pandas.core.dtypes.cast.convert_dtypes with the underlying .values array (as the argument input_array)

In this case, input_array is an ArrowExtensionArray still with the original "timestamp[ns, tz=UTC][pyarrow]" ArrowDtype
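
The dtype decision can be reproduced directly at that level. A minimal sketch, assuming convert_dtypes accepts the array as its first argument and dtype_backend as a keyword (as the call stack above suggests); this is an internal function and its signature may change between versions:

    import pandas as pd
    from pandas.core.dtypes.cast import convert_dtypes

    s = pd.Series(
        pd.to_datetime(range(5), utc=True, unit="h"),
        dtype="timestamp[ns, tz=UTC][pyarrow]",
    )
    arr = s.array  # ArrowExtensionArray with dtype timestamp[ns, tz=UTC][pyarrow]
    # roughly the call the Block ends up making (other keywords left at defaults)
    print(convert_dtypes(arr, dtype_backend="pyarrow"))
    # prints timestamp[ns][pyarrow] on affected versions -- the tz is gone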

The first big if and elif branches in pandas.core.dtypes.cast.convert_dtypes evaluate to False, so execution falls through to the else branch, which sets
inferred_dtype = input_array.dtype

The remainder of the function is

    if dtype_backend == "pyarrow":
        from pandas.core.arrays.arrow.array import to_pyarrow_type
        from pandas.core.arrays.string_ import StringDtype

        assert not isinstance(inferred_dtype, str)

        if (
            (convert_integer and inferred_dtype.kind in "iu")
            or (convert_floating and inferred_dtype.kind in "fc")
            or (convert_boolean and inferred_dtype.kind == "b")
            or (convert_string and isinstance(inferred_dtype, StringDtype))
            or (
                inferred_dtype.kind not in "iufcb"
                and not isinstance(inferred_dtype, StringDtype)
            )
        ):
            if isinstance(inferred_dtype, PandasExtensionDtype) and not isinstance(
                inferred_dtype, DatetimeTZDtype
            ):
                base_dtype = inferred_dtype.base
            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype
            elif isinstance(inferred_dtype, StringDtype):
                base_dtype = np.dtype(str)
            else:
                base_dtype = inferred_dtype
            if (
                base_dtype.kind == "O"  # type: ignore[union-attr]
                and input_array.size > 0
                and isna(input_array).all()
            ):
                import pyarrow as pa

                pa_type = pa.null()
            else:
                pa_type = to_pyarrow_type(base_dtype)
            if pa_type is not None:
                inferred_dtype = ArrowDtype(pa_type)
    elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
        # GH 53648
        inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]

    # error: Incompatible return value type (got "Union[str, Union[dtype[Any],
    # ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
    return inferred_dtype  # type: ignore[return-value]

inferred_dtype.kind is "M" and it is not a StringDtype, so the last condition of the if is True.

inferred_dtype is an ArrowDtype, so inferred_dtype.numpy_dtype gets assigned to base_dtype.

In pandas 2.0.3, the numpy dtype was dtype('O'), so to_pyarrow_type returned None and inferred_dtype was never modified.

But in pandas >= 2.1.0rc0, the numpy dtype is dtype('<M8[ns]'), and to_pyarrow_type converts it to a tz-naive timestamp dtype because numpy datetimes do not carry timezone information.
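
A minimal sketch of where the timezone is dropped, assuming the behavior described above:

    import pandas as pd
    import pyarrow as pa
    from pandas.core.arrays.arrow.array import to_pyarrow_type

    dtype = pd.ArrowDtype(pa.timestamp("ns", tz="UTC"))
    base_dtype = dtype.numpy_dtype
    print(base_dtype)                    # datetime64[ns] on >= 2.1.0rc0 (object on 2.0.3)
    print(to_pyarrow_type(base_dtype))   # timestamp[ns] -- no tz, since numpy dtypes have none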

@jamesdow21 (Author) commented Nov 8, 2024

The changes to ArrowDtype.numpy_dtype in #51800 are the root cause of this regression

    def numpy_dtype(self) -> np.dtype:
        """Return an instance of the related numpy dtype"""
+        if pa.types.is_timestamp(self.pyarrow_dtype):
+            # pa.timestamp(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow timestamp units.
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"datetime64[{self.pyarrow_dtype.unit}]")
+        if pa.types.is_duration(self.pyarrow_dtype):
+            # pa.duration(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow duration units
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"timedelta64[{self.pyarrow_dtype.unit}]")
        if pa.types.is_string(self.pyarrow_dtype):
            # pa.string().to_pandas_dtype() = object which we don't want
            return np.dtype(str)
        try:
            return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
        except (NotImplementedError, TypeError):
            return np.dtype(object)

Without those branches, self.pyarrow_dtype.to_pandas_dtype() returns a DatetimeTZDtype, which raises a TypeError when passed to np.dtype, so the fallback object numpy dtype is returned instead.
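
For reference, a sketch of that older code path, relying on the to_pandas_dtype and np.dtype behavior described above:

    import numpy as np
    import pyarrow as pa

    pandas_dtype = pa.timestamp("ns", tz="UTC").to_pandas_dtype()
    print(pandas_dtype)  # datetime64[ns, UTC] (a DatetimeTZDtype)
    try:
        np.dtype(pandas_dtype)
    except TypeError:
        # pandas 2.0.3 hit this branch and fell back to np.dtype(object);
        # to_pyarrow_type then returned None, so the ArrowDtype was left alone
        print("falls back to object")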

The linked pyarrow issue (apache/arrow#34462) has been resolved, so those if blocks could presumably now just be removed (while ensuring that the minimum supported pyarrow version includes the fix), which would resolve this bug.
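
One way to check whether the workaround is still needed with the installed pyarrow (the thread doesn't state which pyarrow version fixed apache/arrow#34462, so treat the expected outputs as assumptions):

    import pyarrow as pa

    # If the upstream fix is present, these should preserve the unit instead
    # of always reporting nanoseconds.
    print(pa.timestamp("us").to_pandas_dtype())  # expect datetime64[us] if fixed
    print(pa.duration("s").to_pandas_dtype())    # expect timedelta64[s] if fixed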

An even simpler fix, though, is to modify the line in pandas.core.dtypes.cast.convert_dtypes to be if dtype_backend == "pyarrow" and not isinstance(inferred_dtype, ArrowDtype):

I'm not sure why it's currently trying to go from arrow dtype -> numpy dtype -> arrow dtype.

It seems intentional though, since there's explicitly:

            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype

@rhshadrach (Member) commented

Thanks for the report, confirmed on main. The suggested fix of removing the if statements seems to resolve the issue, and I'm seeing no test failures locally. PRs are welcome!

rhshadrach added the Dtype Conversions, Timezones, Arrow, Localization, and good first issue labels and removed the Needs Triage and Timezones labels Nov 8, 2024
@jasonmokk (Contributor) commented

take

jasonmokk removed their assignment Nov 9, 2024
@road-1 commented Nov 9, 2024

take

@jamesdow21 (Author) commented

The table in this section of the data types API docs reads to me as implying that a pyarrow tz-aware timestamp dtype should map to a numpy datetime64 dtype.

I would definitely defer to someone else's judgement on whether that is correct, or whether the linked table should distinguish between a pa.timestamp() type with and without a timezone.

@AmeyGaneshMedewar commented

take
