[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283
Comments
Do you know how the parquet file was generated? This is related to the newly implemented size statistics: cc @wgtmac
With version 53.3.0 of the Rust
My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads
I can work on a smaller repro if you think that would help.
From my understanding, parquet-rs builds statistics by default, so now that parquet-cpp processes them you might have found an incompatibility. I guess the error goes away if you disable statistics on the writer, right? Could you validate that, please? Edit: corrected the API used to set statistics.
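To make the suspected incompatibility concrete, here is a minimal sketch in plain Python of the kind of consistency check a reader might apply to a size-statistics level histogram. It is not the parquet-cpp code: `check_level_histogram` and the `allow_omitted_at_zero` flag are hypothetical names for illustration. It assumes, as I read the Parquet format's SizeStatistics description, that a histogram has `max_level + 1` entries when present and may be omitted when the max level is 0.

```python
from typing import Optional, Sequence


def check_level_histogram(
    histogram: Optional[Sequence[int]],
    max_level: int,
    allow_omitted_at_zero: bool,
) -> None:
    """Raise OSError if a level histogram has an unexpected size.

    Hypothetical helper for illustration; not the actual parquet-cpp check.
    """
    expected = max_level + 1  # one bucket per level 0..max_level
    actual = 0 if histogram is None else len(histogram)
    if actual == expected:
        return
    # Assumption: a writer may omit the histogram when max_level is 0,
    # so a lenient reader accepts an empty one in that case.
    if actual == 0 and max_level == 0 and allow_omitted_at_zero:
        return
    raise OSError(
        "Repetition level histogram size mismatch: "
        f"expected {expected}, got {actual}"
    )
```

With `allow_omitted_at_zero=False`, a flat column (max level 0) written without a histogram is rejected, while the lenient setting accepts it. Whether that is what happens with this file is a guess on my part.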
Confirmed, this happens both with
The file schema is as below. All columns are
## Summary & Motivation
I was probably partially premature in blaming Polars; they are probably just using the Rust crate under the hood to write Parquet, and the issue actually needs to be fixed on the Arrow side. Linking the [issue](apache/arrow#45283) so we have something to track.
Describe the bug, including details regarding any error messages, version, and platform.
Since pyarrow v19, this file cannot be read anymore:
test.parquet.gz (decompress it with gunzip; I had to compress it for GitHub to accept the upload)
With pyarrow 18.1.0:
With pyarrow 19.0.0:
Component(s)
Parquet