BUG: .convert_dtypes(dtype_backend="pyarrow") strips timezone from tz-aware pyarrow timestamp Series #60237
Comments
Side note at the beginning: this is somewhat similar to the fix in #53382, but that was handling a DatetimeTZDtype being converted to a pyarrow timestamp dtype.

After a touch of further investigation, I've mostly found what changed between 2.0.3 and 2.1.0rc0 to cause this.

Call stack setup (which is just forwarding along the keyword arguments in this case):
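Roughly, and assuming current pandas internals rather than an exact listing, the relevant call chain is:

# Sketch of the call chain (assumed, not the exact frames):
# Series.convert_dtypes(dtype_backend="pyarrow")            pandas/core/generic.py
#     -> Series._convert_dtypes(...)                        pandas/core/series.py
#         -> pandas.core.dtypes.cast.convert_dtypes(...)    the function walked through below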
In this case, the first big chunk of the function isn't where the behavior changed; the remainder of the function is:

if dtype_backend == "pyarrow":
    from pandas.core.arrays.arrow.array import to_pyarrow_type
    from pandas.core.arrays.string_ import StringDtype

    assert not isinstance(inferred_dtype, str)

    if (
        (convert_integer and inferred_dtype.kind in "iu")
        or (convert_floating and inferred_dtype.kind in "fc")
        or (convert_boolean and inferred_dtype.kind == "b")
        or (convert_string and isinstance(inferred_dtype, StringDtype))
        or (
            inferred_dtype.kind not in "iufcb"
            and not isinstance(inferred_dtype, StringDtype)
        )
    ):
        if isinstance(inferred_dtype, PandasExtensionDtype) and not isinstance(
            inferred_dtype, DatetimeTZDtype
        ):
            base_dtype = inferred_dtype.base
        elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
            base_dtype = inferred_dtype.numpy_dtype
        elif isinstance(inferred_dtype, StringDtype):
            base_dtype = np.dtype(str)
        else:
            base_dtype = inferred_dtype
        if (
            base_dtype.kind == "O"  # type: ignore[union-attr]
            and input_array.size > 0
            and isna(input_array).all()
        ):
            import pyarrow as pa

            pa_type = pa.null()
        else:
            pa_type = to_pyarrow_type(base_dtype)
        if pa_type is not None:
            inferred_dtype = ArrowDtype(pa_type)
elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
    # GH 53648
    inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]
# error: Incompatible return value type (got "Union[str, Union[dtype[Any],
# ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
return inferred_dtype  # type: ignore[return-value]
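To see which path a tz-aware pyarrow timestamp dtype takes through that branch, a quick check (a sketch; the names mirror the snippet above, and the comments describe what I'd expect on pandas >= 2.1):

import pandas as pd
import pyarrow as pa

inferred_dtype = pd.ArrowDtype(pa.timestamp("ns", tz="UTC"))

# kind is not in "iufcb" and this isn't a StringDtype, so the outer `if` is entered
print(inferred_dtype.kind)

# the ArrowDtype elif then takes base_dtype from numpy_dtype, and this is where the
# timezone disappears on pandas >= 2.1 (it prints datetime64[ns], not a tz-aware dtype)
print(inferred_dtype.numpy_dtype)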
In pandas 2.0.3, the numpy dtype here ended up as object (np.dtype() rejects the DatetimeTZDtype that to_pandas_dtype() returns for a tz-aware timestamp), so to_pyarrow_type returned None and the original dtype was kept. But in pandas >=2.1.0rc0, the numpy dtype is a timezone-naive datetime64, which gets converted back into a pyarrow timestamp with no timezone.
The changes to ArrowDtype.numpy_dtype:

def numpy_dtype(self) -> np.dtype:
    """Return an instance of the related numpy dtype"""
+   if pa.types.is_timestamp(self.pyarrow_dtype):
+       # pa.timestamp(unit).to_pandas_dtype() returns ns units
+       # regardless of the pyarrow timestamp units.
+       # This can be removed if/when pyarrow addresses it:
+       # https://github.com/apache/arrow/issues/34462
+       return np.dtype(f"datetime64[{self.pyarrow_dtype.unit}]")
+   if pa.types.is_duration(self.pyarrow_dtype):
+       # pa.duration(unit).to_pandas_dtype() returns ns units
+       # regardless of the pyarrow duration units
+       # This can be removed if/when pyarrow addresses it:
+       # https://github.com/apache/arrow/issues/34462
+       return np.dtype(f"timedelta64[{self.pyarrow_dtype.unit}]")
    if pa.types.is_string(self.pyarrow_dtype):
        # pa.string().to_pandas_dtype() = object which we don't want
        return np.dtype(str)
    try:
        return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
    except (NotImplementedError, TypeError):
        return np.dtype(object)

Without those added branches, the tz-aware case falls through to the object fallback and the original dtype is left alone. The linked pyarrow issues have been resolved, so those branches could presumably be removed now. An even simpler fix, though, is to just modify the relevant line in convert_dtypes. I'm not sure why it currently tries to go from arrow dtype -> numpy dtype -> arrow dtype; it seems intentional, though, since there's explicitly:

elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
    base_dtype = inferred_dtype.numpy_dtype
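A quick way to see why the arrow dtype -> numpy dtype -> arrow dtype round trip drops the timezone (a sketch; to_pyarrow_type is assumed to fall back to pa.from_numpy_dtype for plain numpy dtypes):

import numpy as np
import pyarrow as pa

# once the dtype has been flattened to a plain numpy datetime64, there is no
# timezone left for the pyarrow side to recover
print(pa.from_numpy_dtype(np.dtype("datetime64[ns]")))  # timestamp[ns]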
Thanks for the report, confirmed on main. The suggested fix of removing the if statements seems to resolve the issue, and I'm seeing no test failures locally. PRs are welcome!
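For concreteness, this is roughly what ArrowDtype.numpy_dtype would look like with the two added branches removed (a sketch reconstructed from the diff above, not the actual patch):

@property
def numpy_dtype(self) -> np.dtype:
    """Return an instance of the related numpy dtype"""
    if pa.types.is_string(self.pyarrow_dtype):
        # pa.string().to_pandas_dtype() = object which we don't want
        return np.dtype(str)
    try:
        # for tz-aware timestamps, to_pandas_dtype() returns a DatetimeTZDtype,
        # which np.dtype() rejects, so the except branch returns object and
        # convert_dtypes leaves the original ArrowDtype untouched
        return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
    except (NotImplementedError, TypeError):
        return np.dtype(object)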
take

take
The table in this section of the data types API docs reads to me as implying that a pyarrow tz-aware timestamp dtype should map to a numpy datetime64 dtype. I would definitely defer to someone else's judgement on whether that is correct, or if there should be a distinction in that linked table between a tz-naive and a tz-aware pyarrow timestamp dtype.
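For reference, pyarrow's own mapping does distinguish the two cases (a quick check, assuming a recent pyarrow):

import pyarrow as pa

print(pa.timestamp("ns").to_pandas_dtype())            # datetime64[ns] (a numpy dtype)
print(pa.timestamp("ns", tz="UTC").to_pandas_dtype())  # datetime64[ns, UTC] (a pandas DatetimeTZDtype)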
take
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
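A minimal sketch of a reproduction consistent with the description below (the timestamp values and the UTC timezone are illustrative, not from the original report):

import pandas as pd
import pyarrow as pa

ser = pd.Series(
    pd.to_datetime(["2024-01-01 00:00:00"], utc=True),
    dtype=pd.ArrowDtype(pa.timestamp("ns", tz="UTC")),
)
print(ser.dtype)  # timestamp[ns, tz=UTC][pyarrow]

result = ser.convert_dtypes(dtype_backend="pyarrow")
# expected: timestamp[ns, tz=UTC][pyarrow]; observed on pandas >= 2.1.0rc0: timestamp[ns][pyarrow]
print(result.dtype)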
Issue Description
Calling .convert_dtypes(dtype_backend="pyarrow") on a Series that is already a timezone-aware pyarrow timestamp dtype strips the timezone information. Testing on older versions, this seems to be a regression introduced sometime between versions 2.0.3 and 2.1.0rc0.
Expected Behavior
No change should be made to the dtype
Installed Versions
INSTALLED VERSIONS
commit : 3f7bc81
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 3.0.0.dev0+1654.g3f7bc81ae6
numpy : 2.1.3
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 18.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None