
[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283

Open
progval opened this issue Jan 16, 2025 · 5 comments

Comments


progval commented Jan 16, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Since pyarrow v19, this file cannot be read anymore:

test.parquet.gz (decompress it with gunzip; I had to compress it so GitHub would accept the upload)

with pyarrow 18.1.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
[{'id': 0, 'type': 'ori', 'sha1_git': b'\x8fP\xd3\xf6\x0e\xae7\r\xdb\xf8\\\x86!\x9cU\x10\x8a5\x01e'}, {'id': 2, 'type': 'ori', 'sha1_git': b'\x83@O\x99Q\x18\xbd%wOJ\xc1D"\xa8\xf1u\xe7\xa0T'}, {'id': 3, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t'}, {'id': 4, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10'}, {'id': 6, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03'}, {'id': 7, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02'}, {'id': 8, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05'}, {'id': 9, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06'}, {'id': 10, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04'}, {'id': 11, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'}, {'id': 12, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08'}, {'id': 13, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07'}, {'id': 14, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12'}, {'id': 15, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11'}, {'id': 16, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13'}, {'id': 17, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'}, {'id': 18, 'type': 'cnt', 'sha1_git': 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15'}, {'id': 19, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00!'}, {'id': 20, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'}, {'id': 21, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x19'}, {'id': 22, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x17'}, {'id': 23, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14'}]
>>> 

with pyarrow 19.0.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 574, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3865, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Repetition level histogram size mismatch

Component(s)

Parquet


raulcd commented Jan 16, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics:
https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

@raulcd raulcd changed the title "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 [Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 Jan 16, 2025

progval commented Jan 16, 2025

With version 53.3.0 of the Rust parquet crate.

My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads

I can work on a smaller reproduction if you think that would help.


raulcd commented Jan 16, 2025

From my understanding, parquet-rs writes statistics by default, so now that we process those in parquet-cpp you might have hit an incompatibility. I guess the issue stops occurring if you disable statistics on the writer, right? Could you validate that, please?

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_enabled

edit: correct API to set statistics


progval commented Jan 16, 2025

Confirmed: this happens with both EnabledStatistics::Page and EnabledStatistics::Chunk, but not with EnabledStatistics::None. Specifically, it happens iff statistics are enabled on the type column, which is defined as a non-nullable Dictionary(Int8.into(), Utf8.into()).


wgtmac commented Jan 16, 2025

The file schema is as follows. All columns are required, so their max_repetition_levels are 0 and the corresponding repetition level histograms are omitted. This case is not handled by parquet-cpp yet. Let me fix this.

message arrow_schema {
  required int64 id (INTEGER(64,false));
  required binary type (STRING);
  required fixed_len_byte_array(20) sha1_git;
}

deepyaman added a commit to dagster-io/dagster that referenced this issue Jan 17, 2025
## Summary & Motivation

I was probably partially premature in blaming Polars; they are likely
just using the Rust crate under the hood to write Parquet, but the issue
actually needs to be fixed on the Arrow side. Linking the
[issue](apache/arrow#45283) so we have something to track.