[BUG-REPORT] Arrow columns in interchange dataframes have erroneous null behaviour #2083
Comments
I encountered the same problem, with at least string types.
# Pandas 1.5.3
import pandas as pd
import vaex
# Pyarrow
import pyarrow as pa
import pyarrow.interchange
In [42]: pd_df = pd.DataFrame({"a": ["aa", "bb", "cc"], "b": [1, 2, 3]})
In [43]: pd_df
Out[43]:
a b
0 aa 1
1 bb 2
2 cc 3
In [44]: vaex_df = vaex.from_pandas(pd_df)
In [45]: vaex_df
Out[45]:
# a b
0 aa 1
1 bb 2
2 cc 3
In [64]: pa.interchange.from_dataframe(vaex_df)
Out[64]:
pyarrow.Table
a: string
b: int64
----
a: [[null,null,null]]
b: [[1,2,3]]
In [50]: pd.api.interchange.from_dataframe(vaex_df)
Out[50]:
a b
0 NaN 1
1 NaN 2
2 NaN 3
Interchange protocol is not supported when a Vaex dataframe is created from an Arrow Table:
In [54]: vaex_df_from_arrow = vaex.from_arrow_table(vaex_df.to_arrow_table())
(raylet) [2023-02-13 11:58:51,725 E 64323 64400] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-02-13_11-23-07_977223_63582 is over 95% full, available space: 4589871104; capacity: 982141468672. Object creation will fail if spilling is required.
In [55]: vaex_df_from_arrow
Out[55]:
# a b
0 aa 1
1 bb 2
2 cc 3
In [59]: pa.interchange.from_dataframe(vaex_df_from_arrow)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[59], line 1
----> 1 pa.interchange.from_dataframe(vaex_df_from_arrow)
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:85, in from_dataframe(df, allow_copy)
82 if not hasattr(df, "__dataframe__"):
83 raise ValueError("`df` does not support __dataframe__")
---> 85 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
86 allow_copy=allow_copy)
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:108, in _from_dataframe(df, allow_copy)
106 batches = []
107 for chunk in df.get_chunks():
--> 108 batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
109 batches.append(batch)
111 table = pa.Table.from_batches(batches)
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:151, in protocol_df_chunk_to_pyarrow(df, allow_copy)
143 dtype = col.dtype[0]
144 if dtype in (
145 DtypeKind.INT,
146 DtypeKind.UINT,
(...)
149 DtypeKind.DATETIME,
150 ):
--> 151 columns[name] = column_to_array(col, allow_copy)
152 elif dtype == DtypeKind.BOOL:
153 columns[name] = bool_column_to_array(col, allow_copy)
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:181, in column_to_array(col, allow_copy)
162 def column_to_array(
163 col: ColumnObject,
164 allow_copy: bool = True,
165 ) -> pa.Array:
166 """
167 Convert a column holding one of the primitive dtypes to a PyArrow array.
168 A primitive type is one of: int, uint, float, bool (1 bit).
(...)
179 pa.Array
180 """
--> 181 buffers = col.get_buffers()
182 data = buffers_to_array(buffers, col.size(),
183 col.describe_null,
184 col.offset,
185 allow_copy)
186 return data
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/vaex/dataframe_protocol.py:585, in _VaexColumn.get_buffers(self)
564 """
565 Return a dictionary containing the underlying buffers.
566
(...)
582 buffer.
583 """
584 buffers = {}
--> 585 buffers["data"] = self._get_data_buffer()
586 try:
587 buffers["validity"] = self._get_validity_buffer()
File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/vaex/dataframe_protocol.py:625, in _VaexColumn._get_data_buffer(self)
623 dtype = self._dtype_from_vaexdtype(self._col.dtype)
624 elif self.dtype[0] == _k.STRING:
--> 625 bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers()
627 if string_bytes is None:
628 string_bytes = np.array([], dtype="uint8")
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'buffers'
I guess #2134 is still not solved.
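For context on the `bitmap_buffer, offsets, string_bytes` unpacking in the traceback: an Arrow string array is backed by three buffers (validity bitmap, int32 offsets, UTF-8 bytes), and the validity buffer can be absent when there are no nulls. Below is a minimal pure-Python sketch of how those buffers map back to values. The function name `decode_utf8_column` and its parameters are hypothetical; this is not Vaex's or pyarrow's code, and it assumes little-endian int32 offsets and a bitmap where a set bit means "valid".

```python
import struct
from typing import List, Optional


def decode_utf8_column(
    validity: Optional[bytes],
    offsets: bytes,
    data: bytes,
    length: int,
) -> List[Optional[str]]:
    """Rebuild string values from Arrow-style buffers (hypothetical helper).

    validity: bitmap with one bit per element (1 = valid), or None for "no nulls".
    offsets:  length + 1 little-endian int32 values delimiting each string.
    data:     concatenated UTF-8 bytes of all strings.
    """
    starts = struct.unpack(f"<{length + 1}i", offsets[: 4 * (length + 1)])
    values: List[Optional[str]] = []
    for i in range(length):
        # A missing validity buffer means every element is valid.
        is_valid = validity is None or bool((validity[i // 8] >> (i % 8)) & 1)
        if is_valid:
            values.append(data[starts[i]:starts[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


offsets = struct.pack("<4i", 0, 2, 4, 6)
data = b"aabbcc"
print(decode_utf8_column(None, offsets, data, 3))            # ['aa', 'bb', 'cc']
print(decode_utf8_column(bytes([0b101]), offsets, data, 3))  # ['aa', None, 'cc']
```

Under these assumptions, a consumer that treats an absent validity buffer as "all null" rather than "all valid" would produce the all-null column shown earlier for `a`; whether that is what actually happens here is a guess.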
Maybe instead of implementing the interchange protocol directly in Vaex, it might be better for now to delegate it to pyarrow. Converting the Vaex dataframe to an Arrow table and then going through pyarrow's interchange support works:
In [66]: pd.api.interchange.from_dataframe(pa.interchange.from_dataframe(vaex_df.to_arrow_table()))
Out[66]:
a b
0 aa 1
1 bb 2
2 cc 3
Arrow columns in a Vaex dataframe seem to have incorrect null bitmasks, although it could very well be a problem in the specification or implementation.

Maybe this is just an issue with `vaex.dataframe_protocol`'s adoption of the interchange protocol, but in any case I'll use it in the example as it's what I'm familiar with 😅 The following example shows that the null mask you can infer from an interchange column ends up falsely marking non-null elements as null.

Logic to get `mask` is lifted from `vaex.dataframe_protocol.buffer_to_ndarray()`. Is there a chance it's doing something wrong?

I say `mask` should be `[True, True]` because, assuming `0`/`False` indicates a missing value, right now Vaex is erroneously saying all our values in `df` are null. I'm not familiar with Arrow and have been assuming this specification of Arrow's null representations from vaex/packages/vaex-core/vaex/dataframe_protocol.py, lines 466 to 471 in ac56da0.

This affects the dataframe interchange introduced to `pandas` in pandas-dev/pandas#46141, e.g.

(I have a very WIP test suite for the interchange protocol at honno/dataframe-interchange-tests, where I originally found this bug.)

Vaex was built locally from source (upstream `master`) on Ubuntu 20.04. Let me know if there's any useful information I can provide!
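
To make the expected `[True, True]` concrete, here is a small pure-Python sketch of how a validity list can be derived from a bitmask. It assumes Arrow's convention (least-significant-bit numbering, a `0` bit marks a null) and takes the "which value means null" flag as a parameter, in the spirit of the second element of the interchange protocol's `describe_null`. The function name `bitmask_to_validity` is hypothetical and not part of Vaex, pandas, or pyarrow.

```python
from typing import List


def bitmask_to_validity(
    bitmask: bytes,
    length: int,
    null_value: int = 0,
    offset: int = 0,
) -> List[bool]:
    """Return one bool per element; True means the element is valid (non-null).

    null_value is the bit value (0 or 1) that marks a null element.
    offset is the number of leading elements to skip in the bitmask.
    """
    validity = []
    for i in range(offset, offset + length):
        byte = bitmask[i // 8]
        bit = (byte >> (i % 8)) & 1  # least-significant bit first
        validity.append(bit != null_value)
    return validity


# Two non-null elements: bits 0 and 1 are set.
print(bitmask_to_validity(bytes([0b00000011]), length=2))  # [True, True]
# Second of three elements is null.
print(bitmask_to_validity(bytes([0b00000101]), length=3))  # [True, False, True]
```

Reading the same bits with the opposite convention (`null_value=1`) flips every entry, which is one way non-null values could end up reported as null.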