-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Support parquet RLE decoding for booleans #3477
[FEAT] Support parquet RLE decoding for booleans #3477
Conversation
CodSpeed Performance ReportMerging #3477 will not alter performanceComparing Summary
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3477 +/- ##
==========================================
- Coverage 77.49% 77.35% -0.15%
==========================================
Files 687 696 +9
Lines 84483 84851 +368
==========================================
+ Hits 65474 65637 +163
- Misses 19009 19214 +205 |
|
||
impl Decodable for u32 { | ||
fn from_le_bytes(pack: &[u8]) -> Self { | ||
let mut bytes = [0u8; std::mem::size_of::<u32>()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you have to do this memcpy thing?
and not just do u32::from_le_bytes(pack)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah turns out that this was because from_le_bytes
expects a fixed size array instead of a slice of bytes.
We can do
let bytes = pack.try_into().expect("Parquet should always encode u32 values");
let value = u32::from_le_bytes(bytes);
which will do a runtime check that the slice is [u8; 4]
, which would still be faster than copying the bytes over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, unfortunately in some cases pack
contains less than 4 bytes (test_invalid_utf8_parquet_reading
contains an example) so we can't do this.. reverting to the state prior that does byte-wise copying.
#3329 shows that we do not currently support reading boolean values from parquet files when they are RLE-encoded. This PR adds support for this.
#3329 shows that we do not currently support reading boolean values from parquet files when they are RLE-encoded.
This PR adds support for this.