[BUG-REPORT] Arrow columns in interchange dataframes have erroneous null behaviour #2083

Open
honno opened this issue Jun 15, 2022 · 2 comments

honno commented Jun 15, 2022

Arrow columns in a Vaex dataframe seem to have incorrect null bitmasks, although it could very well be a problem in the specification or its implementation.

Maybe this is just an issue with vaex.dataframe_protocol's adoption of the interchange protocol, but in any case I'll use it in the example as it's what I'm familiar with 😅. The following example shows that the null mask you can infer from an interchange column ends up falsely marking non-null elements as null.

>>> import pyarrow as pa
>>> table = pa.Table.from_pydict({"foo_col": pa.array([7, 42])})
>>> import vaex
>>> df = vaex.from_arrow_table(table)
>>> df
#  foo_col
0        7
1       42
>>> protocol_df = df.__dataframe__()
>>> col = protocol_df.get_column(0)  # i.e. foo_col
>>> col.dtype
(<_DtypeKind.INT: 0>, 64, '<i8', '=')
>>> bufinfo = col.get_buffers()
>>> col.describe_null
(3, 0)  # i.e. a bitmask represents nulls, where False indicates a missing value
>>> validity_buf, validity_dtype = bufinfo["validity"]
>>> validity_dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')
>>> import ctypes
>>> data_pointer = ctypes.cast(validity_buf.ptr, ctypes.POINTER(ctypes.c_bool))
>>> import numpy as np
>>> mask = np.ctypeslib.as_array(data_pointer, shape=(2,))
>>> mask
array([False, False])  # should be array([True, True])

The logic to get the mask is lifted from vaex.dataframe_protocol.buffer_to_ndarray(). Is there a chance it's doing something wrong?

I say the mask should be [True, True] because, assuming 0/False indicates a missing value, Vaex is currently saying that all of our values in df are null. I'm not familiar with Arrow, and have been assuming this specification of Arrow's null representation from the snippet below (a decoding sketch follows it):

if kind in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL, _k.STRING):
    if self._col.dtype.is_arrow:
        # arrow arrays always allow for null values
        # where 0 encodes a null/missing value
        null = 3
        value = 0

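For reference, here is how I'd expect a kind-3 (bit mask) validity buffer to be decoded, assuming it's an Arrow-style bit-packed bitmap (one bit per value, least-significant bit first). This is only a sketch of my own, not vaex code, reusing validity_buf from the example above:

import ctypes
import numpy as np

n = 2  # length of the column
nbytes = (n + 7) // 8  # a bit mask packs 8 values into each byte
byte_ptr = ctypes.cast(validity_buf.ptr, ctypes.POINTER(ctypes.c_uint8))
packed = np.ctypeslib.as_array(byte_ptr, shape=(nbytes,))
mask = np.unpackbits(packed, bitorder="little")[:n].astype(bool)
# For two valid values an Arrow bitmap's first byte would be 0b00000011,
# so mask should come out as [True, True].

Either way the buffer is interpreted (one byte per value as in buffer_to_ndarray, or one bit per value as above), a column with two non-null values should produce an all-True mask.
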
This affects the dataframe interchange protocol support introduced to pandas in pandas-dev/pandas#46141, e.g.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo_col": [7, 42]})
>>> df
   foo_col
0        7
1       42
>>> from vaex.dataframe_protocol import from_dataframe_to_vaex
>>> vaex_df = from_dataframe_to_vaex(df)
>>> vaex_df
#  foo_col
0        7
1       42
>>> from pandas.api.exchange import from_dataframe as pandas_from_dataframe
>>> roundtrip_df = pandas_from_dataframe(vaex_df)
>>> roundtrip_df
   foo_col
0      NaN
1      NaN

(I have a very WIP test suite for the interchange protocol at honno/dataframe-interchange-tests, where I originally found this bug.)

Vaex was built locally from source (upstream master) on Ubuntu 20.04. Let me know if there's any useful information I can provide!

ghuls commented Feb 13, 2023

I encountered the same problem, with at least string types.

# Pandas 1.5.3
import pandas as pd
import vaex
# Pyarrow
import pyarrow as pa
import pyarrow.interchange

In [42]: pd_df = pd.DataFrame({"a": ["aa", "bb", "cc"], "b": [1, 2, 3]})

In [43]: pd_df
Out[43]: 
    a  b
0  aa  1
1  bb  2
2  cc  3

In [44]: vaex_df = vaex.from_pandas(pd_df)

In [45]: vaex_df
Out[45]: 
  #  a      b
  0  aa     1
  1  bb     2
  2  cc     3

In [64]: pa.interchange.from_dataframe(vaex_df)
Out[64]: 
pyarrow.Table
a: string
b: int64
----
a: [[null,null,null]]
b: [[1,2,3]]


In [50]: pd.api.interchange.from_dataframe(vaex_df)
Out[50]: 
     a  b
0  NaN  1
1  NaN  2
2  NaN  3

The interchange protocol is not supported at all when a Vaex dataframe is created from an Arrow Table:

In [54]: vaex_df_from_arrow = vaex.from_arrow_table(vaex_df.to_arrow_table())

In [55]: vaex_df_from_arrow
Out[55]: 
  #  a      b
  0  aa     1
  1  bb     2
  2  cc     3

In [59]: pa.interchange.from_dataframe(vaex_df_from_arrow)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[59], line 1
----> 1 pa.interchange.from_dataframe(vaex_df_from_arrow)

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:85, in from_dataframe(df, allow_copy)
     82 if not hasattr(df, "__dataframe__"):
     83     raise ValueError("`df` does not support __dataframe__")
---> 85 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
     86                        allow_copy=allow_copy)

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:108, in _from_dataframe(df, allow_copy)
    106 batches = []
    107 for chunk in df.get_chunks():
--> 108     batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
    109     batches.append(batch)
    111 table = pa.Table.from_batches(batches)

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:151, in protocol_df_chunk_to_pyarrow(df, allow_copy)
    143 dtype = col.dtype[0]
    144 if dtype in (
    145     DtypeKind.INT,
    146     DtypeKind.UINT,
   (...)
    149     DtypeKind.DATETIME,
    150 ):
--> 151     columns[name] = column_to_array(col, allow_copy)
    152 elif dtype == DtypeKind.BOOL:
    153     columns[name] = bool_column_to_array(col, allow_copy)

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/pyarrow/interchange/from_dataframe.py:181, in column_to_array(col, allow_copy)
    162 def column_to_array(
    163     col: ColumnObject,
    164     allow_copy: bool = True,
    165 ) -> pa.Array:
    166     """
    167     Convert a column holding one of the primitive dtypes to a PyArrow array.
    168     A primitive type is one of: int, uint, float, bool (1 bit).
   (...)
    179     pa.Array
    180     """
--> 181     buffers = col.get_buffers()
    182     data = buffers_to_array(buffers, col.size(),
    183                             col.describe_null,
    184                             col.offset,
    185                             allow_copy)
    186     return data

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/vaex/dataframe_protocol.py:585, in _VaexColumn.get_buffers(self)
    564 """
    565 Return a dictionary containing the underlying buffers.
    566 
   (...)
    582                  buffer.
    583 """
    584 buffers = {}
--> 585 buffers["data"] = self._get_data_buffer()
    586 try:
    587     buffers["validity"] = self._get_validity_buffer()

File ~/software/polars/py-polars/venv/lib/python3.8/site-packages/vaex/dataframe_protocol.py:625, in _VaexColumn._get_data_buffer(self)
    623         dtype = self._dtype_from_vaexdtype(self._col.dtype)
    624 elif self.dtype[0] == _k.STRING:
--> 625     bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers()
    627     if string_bytes is None:
    628         string_bytes = np.array([], dtype="uint8")

AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'buffers'

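For what it's worth, .buffers() is only defined on pyarrow.Array, not on pyarrow.ChunkedArray, so the chunked column would need to be collapsed into a single array first. A rough sketch of the distinction (illustrative only, not a patch for _get_data_buffer):

import pyarrow as pa

# .buffers() exists on pa.Array but not on pa.ChunkedArray, so concatenate
# the chunks into one Array before asking for the underlying buffers.
chunked = pa.chunked_array([pa.array(["aa", "bb", "cc"])])
arr = pa.concat_arrays(chunked.chunks)
bitmap_buffer, offsets, string_bytes = arr.buffers()
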
I guess #2134 is still not solved.

ghuls commented Feb 13, 2023

Maybe instead of implementing the interchange protocol directly in vaex, it might be better for now to delegate it to pyarrow.

Converting the vaex DataFrame to Arrow and then using pa.interchange.from_dataframe works:

In [66]: pd.api.interchange.from_dataframe(pa.interchange.from_dataframe(vaex_df.to_arrow_table()))
Out[66]: 
    a  b
0  aa  1
1  bb  2
2  cc  3
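
A rough sketch of what that delegation could look like (purely illustrative, the function name is made up, and it assumes pyarrow >= 11, where pa.Table implements __dataframe__):

def vaex_df_to_interchange(vaex_df, nan_as_null=False, allow_copy=True):
    # Hypothetical helper (not vaex's API): convert to an Arrow table and
    # reuse pyarrow's interchange-protocol implementation instead of vaex's own.
    # Requires pyarrow >= 11.0, where pa.Table implements __dataframe__.
    table = vaex_df.to_arrow_table()
    return table.__dataframe__(nan_as_null=nan_as_null, allow_copy=allow_copy)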
