
[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283

Open
progval opened this issue Jan 16, 2025 · 5 comments

Comments


progval commented Jan 16, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Since pyarrow v19, this file cannot be read anymore:

test.parquet.gz (decompress it with gunzip; I had to compress it so GitHub would accept the upload)

with pyarrow 18.1.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
[{'id': 0, 'type': 'ori', 'sha1_git': b'\x8fP\xd3\xf6\x0e\xae7\r\xdb\xf8\\\x86!\x9cU\x10\x8a5\x01e'}, {'id': 2, 'type': 'ori', 'sha1_git': b'\x83@O\x99Q\x18\xbd%wOJ\xc1D"\xa8\xf1u\xe7\xa0T'}, {'id': 3, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t'}, {'id': 4, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10'}, {'id': 6, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03'}, {'id': 7, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02'}, {'id': 8, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05'}, {'id': 9, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06'}, {'id': 10, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04'}, {'id': 11, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'}, {'id': 12, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08'}, {'id': 13, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07'}, {'id': 14, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12'}, {'id': 15, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11'}, {'id': 16, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13'}, {'id': 17, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'}, {'id': 18, 'type': 'cnt', 'sha1_git': 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15'}, {'id': 19, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00!'}, {'id': 20, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'}, {'id': 21, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x19'}, {'id': 22, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x17'}, {'id': 23, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14'}]
>>> 

with pyarrow 19.0.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 574, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3865, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Repetition level histogram size mismatch

Component(s)

Parquet


raulcd commented Jan 16, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics:
https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

@raulcd raulcd changed the title "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 [Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 Jan 16, 2025

progval commented Jan 16, 2025

With version 53.3.0 of the Rust parquet crate.

My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads

I can work on a smaller reproduction if you think that would help.


raulcd commented Jan 16, 2025

From my understanding, parquet-rs writes statistics by default, so now that we process those in parquet-cpp you might have hit an incompatibility. I guess the issue stops occurring if you disable statistics on the writer, right? Could you validate that, please?

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_enabled

edit: correct API to set statistics


progval commented Jan 16, 2025

Confirmed: this happens with both EnabledStatistics::Page and EnabledStatistics::Chunk, but not with EnabledStatistics::None. Specifically, it happens iff statistics are enabled on the type column, which is defined as a non-nullable Dictionary(Int8.into(), Utf8.into()).


wgtmac commented Jan 16, 2025

The file schema is as follows. All columns are required, so their max_repetition_levels are 0 and the corresponding repetition level histograms are omitted. This case is not handled by parquet-cpp yet. Let me fix this.

message arrow_schema {
  required int64 id (INTEGER(64,false));
  required binary type (STRING);
  required fixed_len_byte_array(20) sha1_git;
}

deepyaman added a commit to dagster-io/dagster that referenced this issue Jan 17, 2025
## Summary & Motivation

I was probably partially premature in blaming Polars; they are likely
just using the Rust crate under the hood to write Parquet, but the issue
actually needs to be fixed on the Arrow side. Linking the
[issue](apache/arrow#45283) so we have something to track.