Move ORC-rust into this repo #1

Merged
merged 284 commits into from
Oct 24, 2024
Commits
284 commits
f10f4e0
Update README.md (#8)
WenyXu Oct 5, 2023
b7ccf67
chore: bump arrow to 47.0 (#9)
waynexia Oct 16, 2023
46fadc0
chore: bump 0.2.43 (#10)
WenyXu Oct 17, 2023
4f79a3a
chore: remove MIT LICENSE
WenyXu Nov 2, 2023
8d4c660
chore: update Cargo.toml
WenyXu Nov 2, 2023
486e591
chore: add comments for tracking the provenance
WenyXu Nov 2, 2023
36a000c
chore: remove unstable features (#4)
WenyXu Nov 3, 2023
755660b
Crate for generating proto.rs from orc_proto.proto (#3)
Jefffrey Nov 3, 2023
61526a4
docs: update CI badger in README.md (#6)
waynexia Nov 3, 2023
c4d03e8
Add timestamp instant to README (#16)
Jefffrey Nov 4, 2023
5c98eea
feat: Support all decompression types (#20)
Jefffrey Nov 4, 2023
06d5e02
Update README to reflect decompression support
Jefffrey Nov 4, 2023
dd93878
Merge pull request #21 from datafusion-contrib/chore/update_readme_de…
WenyXu Nov 4, 2023
3f76450
feat: support to read tinyint
WenyXu Nov 4, 2023
5101ab0
feat: map tinyint to i8
WenyXu Nov 5, 2023
493a753
chore: apply suggestions from CR
WenyXu Nov 5, 2023
e1af701
Merge pull request #22 from WenyXu/feat/support-tiny-int
WenyXu Nov 5, 2023
c09363a
Refactor varint handling (#23)
Jefffrey Nov 5, 2023
884ae22
Support for Int RLE v1 encoding (#24)
Jefffrey Nov 5, 2023
60bd3e1
Fix bug where not reading all stripes in file
Jefffrey Nov 5, 2023
7743c3c
Add async test
Jefffrey Nov 5, 2023
2d929ff
Merge pull request #25 from datafusion-contrib/fix/read_entire_file
WenyXu Nov 5, 2023
421c92e
feat: support to struct datatype
WenyXu Nov 5, 2023
63a9577
chore: apply suggestions from CR
WenyXu Nov 7, 2023
ea6fde9
Merge pull request #26 from WenyXu/feat/struct
WenyXu Nov 7, 2023
ad43524
Update README.md
WenyXu Nov 7, 2023
32da199
Merge pull request #28 from datafusion-contrib/WenyXu-patch-1
WenyXu Nov 7, 2023
7a40cf0
refactor: refactor byte/boolean iter
WenyXu Nov 9, 2023
a0f8ee8
chore: add codecov config (#31)
WenyXu Nov 9, 2023
de6557e
Merge pull request #29 from WenyXu/refactor/refactor-boolean-iter
WenyXu Nov 10, 2023
ca64017
feat: support to list datatype
WenyXu Nov 9, 2023
5ab6f3b
chore: apply suggestions from CR
WenyXu Nov 11, 2023
e99885e
Merge pull request #30 from WenyXu/feat/list
WenyXu Nov 11, 2023
dabcc64
feat: support to map datatype
WenyXu Nov 9, 2023
51e08cb
Update README.md (#35)
WenyXu Nov 11, 2023
90ee4c8
Merge pull request #32 from WenyXu/feat/map
WenyXu Nov 11, 2023
b5398a6
Update README.md
WenyXu Nov 12, 2023
6572d90
Refactor stream retrieval and datatype iterators (#36)
Jefffrey Nov 12, 2023
3fc091c
Merge pull request #37 from datafusion-contrib/WenyXu-patch-1
WenyXu Nov 12, 2023
cdfd79c
Remove Chrono dependency (#38)
Jefffrey Nov 12, 2023
7f38e3a
Remove Cargo.lock, add to .gitignore (#40)
Jefffrey Nov 14, 2023
a92d93f
Add initial simple benchmark (#39)
Jefffrey Nov 14, 2023
7cebe8c
Refactor synchronous parsing of file tail metadata (#43)
Jefffrey Nov 17, 2023
38d953e
Refactor to decouple from relying directly on proto (#44)
Jefffrey Nov 18, 2023
0392dd9
Refactor schema/type handling (#45)
Jefffrey Nov 19, 2023
ed84772
Remove Reader struct, condense into Cursor (#48)
Jefffrey Nov 27, 2023
8e81a10
chore(deps): bump arrow to 48.0 (#49)
waynexia Nov 27, 2023
0c569c2
Refactor Integer RLE V2 handling (#50)
Jefffrey Dec 4, 2023
c217f3f
Introduce ProjectionMask (#51)
Jefffrey Dec 29, 2023
4819916
Switch CI toolchain to stable (#54)
Jefffrey Dec 30, 2023
b4bc924
Add ArrowReaderBuilder (#53)
Jefffrey Dec 30, 2023
590231e
Use assert_batches_eq in tests
Jefffrey Feb 28, 2024
de24088
Introduce generics into RLE Int decoders (#57)
Jefffrey Mar 2, 2024
dc58d95
Use specialized Int decoders when decoding Integer columns
Jefffrey Mar 2, 2024
3ab5141
Remove unused integer decoding impl's
Jefffrey Mar 2, 2024
5fa5bb9
Simplify fetching RLE iterator
Jefffrey Mar 2, 2024
bd22488
Split up arrow_reader.rs, use mod.rs pattern
Jefffrey Mar 2, 2024
5503019
Remove bool return from Decoder::append_value
Jefffrey Mar 2, 2024
ff8457b
Replace float iter macro with generics
Jefffrey Mar 2, 2024
aa32941
Refactor NInt from_be_bytes to use associated Bytes type
Jefffrey Mar 2, 2024
9bda02c
Move number_of_rows from Column to Stripe
Jefffrey Mar 2, 2024
63c437b
Don't return Option in NullableIterator::collect_chunk
Jefffrey Mar 2, 2024
18abaca
Simplify NullableIterator::next
Jefffrey Mar 3, 2024
8f2540d
Refactor to dyn array trait based column decoders
Jefffrey Mar 4, 2024
4ce9bcd
Centralize string column decoding into decoder/string.rs
Jefffrey Mar 4, 2024
f78cddd
Refactor string column handling to read contents directly to StringAr…
Jefffrey Mar 5, 2024
511b5e5
Consolidate binary decoding logic with strings
Jefffrey Mar 5, 2024
9226c9d
Don't return Option in ArrayBatchDecoder::next_batch
Jefffrey Mar 5, 2024
808534f
Mark ArrowReader as Send (#60)
progval Mar 5, 2024
0b77485
Introduce get_present_vec to optionally get present stream
Jefffrey Mar 5, 2024
cd4402e
Refactor common code
Jefffrey Mar 5, 2024
0760723
Replace usage of new_present_iter in binary decoder
Jefffrey Mar 5, 2024
254269a
Remove NullableIterator in favour of explicitly handling an optional …
Jefffrey Mar 5, 2024
3a3234d
Comment
Jefffrey Mar 5, 2024
4f1e505
Add minimal example of integration with DataFusion
Jefffrey Mar 6, 2024
6c4606d
Light refactoring
Jefffrey Mar 6, 2024
fe5116e
Put async functionality behind feature (#64)
progval Mar 10, 2024
8c215de
Remove variable_length.rs
Jefffrey Mar 13, 2024
e5710c3
Rename decode.rs to decode/mod.rs
Jefffrey Mar 13, 2024
4463f64
Add integration tests using example files from apache/orc (#65)
progval Mar 13, 2024
df6dde7
Unit tests for FloatIter
Jefffrey Mar 13, 2024
1f6406d
Simplify Float trait bounds
Jefffrey Mar 13, 2024
a4a3e6b
Run examples as part of CI (#69)
Jefffrey Mar 14, 2024
802a2cc
Cast dictionary encoded string column stripes to regular StringArray
Jefffrey Mar 20, 2024
26ba19d
Minor refactor and comments
Jefffrey Mar 23, 2024
733f7e3
Decimal support
Jefffrey Mar 23, 2024
07c492a
Generate expected data for integration tests as feather files (#73)
Jefffrey Mar 23, 2024
eda2dee
Compare concatenated RecordBatches in integration tests
Jefffrey Mar 23, 2024
fe731a1
Align Map Arrow datatype derivation with MapArrayDecoder
Jefffrey Mar 23, 2024
77a4fee
Update comments
Jefffrey Mar 23, 2024
1fcbc3f
Enable test1 integration test by fixing MapArray children names
Jefffrey Mar 23, 2024
07c7e3d
Enable empty_file integration test by fixing MapArray children names
Jefffrey Mar 23, 2024
2e00975
Edge case where required streams may be missing
Jefffrey Mar 24, 2024
4d85bd9
Enable over1k_bloom integration test and add comments
Jefffrey Mar 24, 2024
fb90d01
Minor refactor
Jefffrey Mar 24, 2024
d79f3c5
Initial orc-metadata CLI tool
Jefffrey Mar 25, 2024
943515a
Enhance orc-metadata bin to show basic stripe metadata
Jefffrey Mar 26, 2024
b15e1f3
Update Spark test data and add PyArrow timestamp data generator
Jefffrey Mar 26, 2024
8996277
Support TIMESTAMP_INSTANT
Jefffrey Mar 29, 2024
bfb1c08
Refactor Stripe
Jefffrey Mar 29, 2024
39fd853
Add Tz to Stripe
Jefffrey Mar 29, 2024
223592b
Fix TIMESTAMP to align with ORC impl
Jefffrey Mar 31, 2024
f824998
Refactor to consistent mod structure
Jefffrey Mar 31, 2024
bd285ad
Display file format version with orc-metadata
Jefffrey Mar 31, 2024
ac7879b
Update comments
Jefffrey Mar 31, 2024
8330368
Comment out failing orc_11_format integration test
Jefffrey Apr 1, 2024
7d1073a
Support decoding Union with <= 127 variants into Sparse UnionArrays
Jefffrey Apr 1, 2024
4bffda8
Enable test_seek integration test
Jefffrey Apr 1, 2024
27e13df
Typo
Jefffrey Apr 1, 2024
95998fa
Cleanup tests
Jefffrey Apr 1, 2024
1ff6b32
Update documentation and cleanup root level files
Jefffrey Apr 2, 2024
21debc0
Fix crate name usages
Jefffrey Apr 2, 2024
001c4e2
Update README on Union limitation
Jefffrey Apr 4, 2024
ffa7306
Empty changelog
Jefffrey Apr 6, 2024
d245234
Add explicit path for examples
Jefffrey Apr 6, 2024
66896dc
Add explicit path for benches
Jefffrey Apr 6, 2024
3972a52
Release orc-rust v0.3.0
Jefffrey Apr 6, 2024
3f3837e
Fix bench path
Jefffrey Apr 6, 2024
2527286
Comments
Jefffrey Apr 6, 2024
e3f24fe
Remove CHANGELOG.md
Jefffrey Apr 6, 2024
030dc64
Remove unused builder.rs
Jefffrey Apr 6, 2024
244b929
chore: bump `arrow` version to 51 (#83)
WenyXu Apr 10, 2024
6eae187
#62 cli tool for printing file stats (#84)
klangner Apr 16, 2024
6a82e82
#62 Added cli tool to export data in a csv format (#85)
klangner Apr 19, 2024
38a11c9
#62 added filtering by rows and columns (#87)
klangner Apr 22, 2024
2eb9014
Revamp DataFusion integration example and support projection
Jefffrey May 12, 2024
8b448ff
Move DataFusion integration code into separate feature
Jefffrey May 12, 2024
90a5a39
Make proto private
Jefffrey May 12, 2024
72b3690
Split up datafusion integration files
Jefffrey May 12, 2024
aaab562
Limit public API
Jefffrey May 12, 2024
a92d170
Update documentation
Jefffrey May 12, 2024
8d7dbbc
Refactor Result to be more versatile
Jefffrey May 12, 2024
a520059
Consolidate async impl
Jefffrey May 12, 2024
e98e02a
Reorganize arrow_reader/decoder to array_decoder mod
Jefffrey May 12, 2024
6b32379
Reorganize arrow_reader/column/timestamp to reader/decode/timestamp
Jefffrey May 12, 2024
1afdfb1
Reorganize arrow_reader
Jefffrey May 12, 2024
aeb1ec9
Cleanup async cfg
Jefffrey May 12, 2024
0caa5c4
Refactor out metadata Byte copy
Jefffrey May 12, 2024
ef94231
Refactor out Bytes copy in Stripe footer
Jefffrey May 12, 2024
4f04eba
Minor refactoring around decompression handling
Jefffrey May 13, 2024
33121c0
Fix unreachable encoding for bit width decoder util
Jefffrey May 23, 2024
1697a3e
Error instead of panic on timestamp overflow (#91)
progval May 24, 2024
719495b
Avoid adding a NullBuffer when decoding timestamp offsets (#90)
progval May 24, 2024
79a20b7
Pass target arrow type to array_decoder_factory (#92)
progval Jun 5, 2024
b515866
Add support for configuring time units through ArrowReaderBuilder::wi…
progval Jun 16, 2024
22d1888
Add ArrowReaderBuilder::schema() (#94)
progval Jun 19, 2024
ac92a25
Write target time unit in DecodeTimestampSnafu (#95)
progval Jun 20, 2024
e3d9a43
Add support for decoding Timestamp as Decimal128 (#96)
progval Jun 20, 2024
6babdb9
Make dependency on async-trait optional (#98)
progval Jun 24, 2024
fe1305f
Create dependabot.yml
WenyXu Jun 28, 2024
1e61096
Revert "Create dependabot.yml" (#104)
WenyXu Jun 28, 2024
0b937e0
refactor: Bump arrow 52 and datafusion 39 (#105)
Xuanwo Jul 1, 2024
dc4642a
fix(bin): Expose needed types public (#108)
Xuanwo Jul 1, 2024
cc76ece
chore: release version 0.3.1 (#109)
WenyXu Jul 1, 2024
f102a23
chore: Reorganize deps and upgrade them (#110)
Xuanwo Jul 2, 2024
d4c5377
fix: read array<float> columns correctly (#112)
youngsofun Jul 2, 2024
85344c1
feat: Add opendal native support (#117)
Xuanwo Aug 5, 2024
56c9497
Relax check on `patch_bits` overflows in delta decoding (#118)
progval Aug 6, 2024
67db60d
fix: execute "select count(*) from tbl" always getting zero (#114)
harveyyue Aug 7, 2024
141215c
chore: add license header (#121)
waynexia Aug 15, 2024
2ea9a11
Remove unused error types
Jefffrey May 30, 2024
0b9b797
Initial write support (#122)
Jefffrey Aug 21, 2024
6bd8503
Move float encoding code to encoding mod
Jefffrey Aug 21, 2024
bb678fe
Consolidate decimal/timestamp/byte/boolean encoding code to encoding mod
Jefffrey Aug 21, 2024
4768358
Consolidate int rle code into encoding mod
Jefffrey Aug 21, 2024
5019f32
Move EstimateMemory trait into own memory.rs file
Jefffrey Aug 21, 2024
2f971e9
Move PrimitiveValueEncoder to encoding mod
Jefffrey Aug 21, 2024
f221d72
Rename PrimitiveStripeEncoder to PrimitiveColumnEncoder
Jefffrey Aug 21, 2024
b31466f
Direct string writing to ORC file
Jefffrey Aug 22, 2024
f8d96ab
Rename primitive column encoders
Jefffrey Aug 22, 2024
a5db02e
Support for Binary type writing to ORC file
Jefffrey Aug 22, 2024
f2b639c
Rename PresentStreamEncoder to BooleanEncoder and move under encoding…
Jefffrey Aug 22, 2024
3cc57a2
Support for writing BooleanArrays
Jefffrey Aug 22, 2024
9b3c0ae
Move src/reader/decompress.rs to src/compression.rs
Jefffrey Aug 22, 2024
4e6826c
Remove Decompressor and StreamMap from public API
Jefffrey Aug 22, 2024
9512883
Documentation
Jefffrey Aug 22, 2024
1413948
Minor comment
Jefffrey Aug 25, 2024
eb694a6
Remove unnecessary u32 overflow error condition in patched base
Jefffrey Sep 11, 2024
30b57c4
Extract common read NInt from big endian functionality
Jefffrey Sep 11, 2024
483ba7f
Reorganize python scripts
Jefffrey Sep 19, 2024
905e147
Script to generate TPCH data in ORC format
Jefffrey Sep 19, 2024
3d6f467
Flamegraph options
Jefffrey Sep 20, 2024
b527667
Disable DataFusion as a default feature (#124)
Jefffrey Sep 20, 2024
20f1fdf
Enable lz4 feature for arrow-ipc to fix tests
Jefffrey Sep 21, 2024
7546bfb
Switch ByteRleReader to emit i8 instead of u8
Jefffrey Sep 21, 2024
cfe514e
Cargo.toml formatting
Jefffrey Sep 21, 2024
0b9b005
Refactor away macro usage in timestamp array decoder
Jefffrey Sep 22, 2024
dc13b51
Simplify retrieval of timestamp decoder
Jefffrey Sep 22, 2024
1234838
Simplify generic usage in timestamp decoder
Jefffrey Sep 22, 2024
d3b7d15
Extract timestamp as decimal with timezone iterator into struct
Jefffrey Sep 22, 2024
b8ec195
Introduce PrimitiveValueDecoder to enable batch decoding of values (#…
Jefffrey Sep 23, 2024
5f6b8da
Fix TPCH data conversion
Jefffrey Sep 23, 2024
98c7816
Minor refactoring on decimal array decoder
Jefffrey Sep 23, 2024
fef9ba6
Refactor PrimitiveArrayDecoder to fully use PrimitiveValueDecoder::de…
Jefffrey Sep 23, 2024
d2be243
Simplify PrimitiveValueDecoder::decode to error if buffer isn't filled
Jefffrey Sep 23, 2024
9912183
Implement PrimitiveValueDecoder::decode for FloatIter
Jefffrey Sep 23, 2024
1847a45
Minor refactoring
Jefffrey Sep 23, 2024
f265a8f
Introduce PrimitiveValueDecoder::decode_spaced for nullable streams
Jefffrey Sep 24, 2024
30209ac
Rename FloatIter to FloatDecoder
Jefffrey Sep 24, 2024
202ad30
Use PrimitiveValueDecoder::decode_spaced in string/list/map lengths d…
Jefffrey Sep 24, 2024
fd95d83
Rename ByteRleReader to ByteRleDecoder
Jefffrey Sep 25, 2024
d18185e
Rename ByteRleWriter to ByteRleEncoder
Jefffrey Sep 25, 2024
5d30b65
Simplify ByteRleDecoder
Jefffrey Sep 25, 2024
233cfda
Rename BooleanIter to BooleanDecoder
Jefffrey Sep 25, 2024
8c531ac
Refactor BooleanArrayDecoder to use PrimitiveValueDecoder
Jefffrey Sep 25, 2024
1682c46
Default implementation for PrimitiveValueDecoder::decode_spaced
Jefffrey Sep 28, 2024
29c90dc
Implement PrimitiveValueDecoder::decode for Timestamp decoders
Jefffrey Sep 28, 2024
0b471c9
Rename TimestampIterator to TimestampDecoder
Jefffrey Sep 28, 2024
c5cd664
Rename TimestampNanosecondAsDecimalIterator to TimestampNanosecondAsD…
Jefffrey Sep 28, 2024
11830a7
Implement PrimitiveValueDecoder for UnbounadedVarintStreamDecoder
Jefffrey Sep 28, 2024
60b24f2
Implement PrimitiveValueDecoder::decode for DecimalScaleRepairIter
Jefffrey Sep 28, 2024
3fd7781
Rename DecimalScaleRepairIter to DecimalScaleRepairDecoder
Jefffrey Sep 28, 2024
320c56b
Implement PrimitiveValueDecoder::decode for UnboundedVarintStreamDecoder
Jefffrey Sep 28, 2024
7b7641b
Implement PrimitiveValueDecoder::decode for TimestampNanosecondAsDeci…
Jefffrey Sep 28, 2024
4883442
Rename TimestampNanosecondAsDecimalWithTzIterator to TimestampNanosec…
Jefffrey Sep 28, 2024
38c1b35
Implement PrimitiveValueDecoder::decode for RleReaderV1
Jefffrey Sep 28, 2024
7f6019c
Implement PrimitiveValueDecoder::decode for ByteRleDecoder
Jefffrey Sep 28, 2024
9259cc1
Implement PrimitiveValueDecoder::decode for BooleanDecoder
Jefffrey Sep 28, 2024
9292e0b
Remove default implemention of PrimitiveValueDecoder::decode
Jefffrey Sep 28, 2024
fededeb
Remove Iterator implementations for Decoders
Jefffrey Sep 28, 2024
6410549
Remove number_of_rows from Column
Jefffrey Sep 28, 2024
2377272
Minor refactoring
Jefffrey Sep 28, 2024
a03bbe5
Simplify PrimitiveValueDecoder::decode_spaced iterator
Jefffrey Sep 28, 2024
2684b35
Minor refactoring
Jefffrey Sep 28, 2024
d341c28
Minor refactoring
Jefffrey Sep 28, 2024
7098a4c
Make present decoding batch based using NullBuffer
Jefffrey Sep 29, 2024
e1e3142
Move DecimalArrayDecoder to array_decoder/decimal.rs
Jefffrey Sep 29, 2024
0dceaad
Refactor array decoder factory to reduce duplication
Jefffrey Sep 29, 2024
ebda90c
Fix bug with RLEv1 decoding
Jefffrey Sep 29, 2024
aea5c46
Align RLEv1 and RLEv2 structure
Jefffrey Sep 29, 2024
7c811e9
Move integer RLE files into new encoding/integer mod
Jefffrey Sep 29, 2024
1b20b69
Support filter out strip by provided range (#126)
harveyyue Sep 29, 2024
122528b
Move integer RLE specific util functions
Jefffrey Sep 29, 2024
4bd1c97
Consolidate integer encoding code
Jefffrey Sep 29, 2024
332283d
Privatise the (integer) utilities
Jefffrey Sep 29, 2024
d3f55f1
Refactor integer RLEv1
Jefffrey Sep 29, 2024
bd9a114
Fix visibility on RLEv2 mods
Jefffrey Sep 29, 2024
4c1c116
Rename RLEv1/v2 reader/writer to decoder/encoder
Jefffrey Sep 29, 2024
279f758
Bump to 0.4.0 (#128)
Jefffrey Oct 2, 2024
7a66adc
Introduce bytemuck to simplify float decoding
Jefffrey Oct 8, 2024
75616d9
Further simplify Float encoding
Jefffrey Oct 9, 2024
86db222
Consolidate common RLE decoding logic into new GenericRle trait
Jefffrey Oct 9, 2024
a33c746
Release 0.4.1 to fix features on docs.rs page (#130)
Jefffrey Oct 15, 2024
a48dd87
feat: Add Cargo Deny license check to CI workflow (#131)
waynexia Oct 24, 2024
3e100fe
chore: add nextest config (#134)
WenyXu Oct 24, 2024
d3fb27b
chore: Add license headers for toml/yml/proto (#135)
Xuanwo Oct 24, 2024
b15df36
skip reading unused columns (#133)
richox Oct 24, 2024
bc498c1
chore: remove datafusion feature
waynexia Oct 24, 2024
6c63cae
Merge remote-tracking branch 'datafusion-orc/split-datafusion-integra…
waynexia Oct 24, 2024
c31214e
ci: remove examples job from CI workflow
waynexia Oct 24, 2024
19 changes: 19 additions & 0 deletions .config/nextest.toml
@@ -0,0 +1,19 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

[profile.default]
slow-timeout = { period = "60s", terminate-after = 3, grace-period = "30s" }
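For reviewers less familiar with nextest's `slow-timeout` setting, the arithmetic implied by the values above can be sketched in Rust. A test is flagged as slow each time the period elapses, is asked to terminate after the configured number of periods, and then gets the grace period to shut down before it is killed. The `SlowTimeout` struct, its methods, and `default_profile` are hypothetical helpers written for illustration; they are not part of nextest or of this PR.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `slow-timeout` table in `.config/nextest.toml`.
/// Illustration only; nextest's own types are not used here.
pub struct SlowTimeout {
    /// A test is reported as slow each time this much time elapses.
    pub period: Duration,
    /// Number of periods after which the test is asked to terminate.
    pub terminate_after: u32,
    /// Extra time allowed for a clean shutdown before a forced kill.
    pub grace_period: Duration,
}

impl SlowTimeout {
    /// Elapsed time at which the termination signal is sent.
    pub fn terminate_at(&self) -> Duration {
        self.period * self.terminate_after
    }

    /// Latest elapsed time at which the test is forcibly killed.
    pub fn kill_at(&self) -> Duration {
        self.terminate_at() + self.grace_period
    }
}

/// The values configured under `[profile.default]` in this PR.
pub fn default_profile() -> SlowTimeout {
    SlowTimeout {
        period: Duration::from_secs(60),
        terminate_after: 3,
        grace_period: Duration::from_secs(30),
    }
}
```

With these values, `default_profile().terminate_at()` is 180 seconds and `default_profile().kill_at()` is 210 seconds, so a hung test should hold up a CI run for at most three and a half minutes.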
184 changes: 184 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,184 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]
    paths-ignore:
      - 'docs/**'
      - 'config/**'
      - '**.md'
      - '.dockerignore'
      - 'docker/**'
      - '.gitignore'
  push:
    branches:
      - develop
      - main
    paths-ignore:
      - 'docs/**'
      - 'config/**'
      - '**.md'
      - '.dockerignore'
      - 'docker/**'
      - '.gitignore'
  workflow_dispatch:

name: CI

env:
  RUST_TOOLCHAIN: stable

jobs:
  typos:
    name: Spell Check with Typos
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: crate-ci/[email protected]

  check:
    name: Check
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 60
    strategy:
      matrix:
        features:
          - ''
          - '--no-default-features'
          - '--all-features'
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
      - name: Rust Cache
        uses: Swatinem/rust-cache@v2
      - name: Run cargo check
        run: cargo check --workspace --all-targets ${{ matrix.features }}

  toml:
    name: Toml Check
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
      - name: Rust Cache
        uses: Swatinem/rust-cache@v2
      - name: Install taplo
        run: cargo install taplo-cli --version ^0.8 --locked
      - name: Run taplo
        run: taplo format --check

  fmt:
    name: Rustfmt
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
          components: rustfmt
      - name: Rust Cache
        uses: Swatinem/rust-cache@v2
      - name: Run cargo fmt
        run: cargo fmt --all -- --check

  clippy:
    name: Clippy
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 60
    strategy:
      matrix:
        features:
          - ''
          - '--no-default-features'
          - '--all-features'
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
          components: clippy
      - name: Rust Cache
        uses: Swatinem/rust-cache@v2
      - name: Run cargo clippy
        run: cargo clippy --workspace --all-targets ${{ matrix.features }} -- -D warnings

  license-header:
    name: Check license header
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check license headers
        uses: korandoru/hawkeye@v5

  cargo-deny:
    name: Cargo Deny License Check
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: EmbarkStudios/cargo-deny-action@v1
        with:
          command: check license

  coverage:
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 60
    needs: [clippy]
    steps:
      - uses: actions/checkout@v3
      - uses: KyleMayes/install-llvm-action@v1
        with:
          version: "14.0"
      - name: Install toolchain
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
          components: llvm-tools-preview
      - name: Rust Cache
        uses: Swatinem/rust-cache@v2
      - name: Install latest nextest release
        uses: taiki-e/install-action@nextest
      - name: Install cargo-llvm-cov
        uses: taiki-e/install-action@cargo-llvm-cov
      - name: Collect coverage data
        run: cargo llvm-cov nextest --workspace --lcov --output-path lcov.info --all-features
        env:
          CARGO_BUILD_RUSTFLAGS: "-C link-arg=-fuse-ld=lld"
          RUST_BACKTRACE: 1
          CARGO_INCREMENTAL: 0
          UNITTEST_LOG_DIR: "__unittest_logs"
      - name: Codecov upload
        uses: codecov/codecov-action@v2
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          files: ./lcov.info
          flags: rust
          fail_ci_if_error: false
          verbose: true
12 changes: 11 additions & 1 deletion .gitignore
@@ -18,4 +18,14 @@ Cargo.lock
 # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/
+
+venv
+/benchmark_data
+
+private/
+*.txt
+
+/perf.*
+/flamegraph.svg
+
107 changes: 107 additions & 0 deletions Cargo.toml
@@ -0,0 +1,107 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

[package]
name = "orc-rust"
version = "0.4.1"
edition = "2021"
homepage = "https://github.com/datafusion-contrib/datafusion-orc"
repository = "https://github.com/datafusion-contrib/datafusion-orc"
authors = ["Weny <[email protected]>", "Jeffrey Vo <[email protected]>"]
license = "Apache-2.0"
description = "Implementation of Apache ORC file format using Apache Arrow in-memory format"
keywords = ["arrow", "orc", "arrow-rs", "datafusion"]
include = ["src/**/*.rs", "Cargo.toml"]
rust-version = "1.73"

[package.metadata.docs.rs]
all-features = true

[dependencies]
arrow = { version = "52", features = ["prettyprint", "chrono-tz"] }
bytemuck = { version = "1.18.0", features = ["must_cast"] }
bytes = "1.4"
chrono = { version = "0.4.37", default-features = false, features = ["std"] }
chrono-tz = "0.9"
fallible-streaming-iterator = { version = "0.1" }
flate2 = "1"
lz4_flex = "0.11"
lzokay-native = "0.1"
num = "0.4.1"
prost = { version = "0.12" }
snafu = "0.8"
snap = "1.1"
zstd = "0.12"

# async support
async-trait = { version = "0.1.77", optional = true }
futures = { version = "0.3", optional = true, default-features = false, features = ["std"] }
futures-util = { version = "0.3", optional = true }
tokio = { version = "1.28", optional = true, features = [
    "io-util",
    "sync",
    "fs",
    "macros",
    "rt",
    "rt-multi-thread",
] }

# cli
anyhow = { version = "1.0", optional = true }
clap = { version = "4.5.4", features = ["derive"], optional = true }

# opendal
opendal = { version = "0.48", optional = true, default-features = false }

[dev-dependencies]
arrow-ipc = { version = "52.0.0", features = ["lz4"] }
arrow-json = "52.0.0"
criterion = { version = "0.5", default-features = false, features = ["async_tokio"] }
opendal = { version = "0.48", default-features = false, features = ["services-memory"] }
pretty_assertions = "1.3.0"
proptest = "1.0.0"
serde_json = { version = "1.0", default-features = false, features = ["std"] }

[features]
default = ["async"]

async = ["async-trait", "futures", "futures-util", "tokio"]
cli = ["anyhow", "clap"]
# Enable opendal support.
opendal = ["dep:opendal"]

[[bench]]
name = "arrow_reader"
harness = false
required-features = ["async"]
# Some issue when publishing and path isn't specified, so adding here
path = "./benches/arrow_reader.rs"

[profile.bench]
debug = true

[[bin]]
name = "orc-metadata"
required-features = ["cli"]

[[bin]]
name = "orc-export"
required-features = ["cli"]

[[bin]]
name = "orc-stats"
required-features = ["cli"]
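To make the interaction between `[features]` and the `required-features` on the three `[[bin]]` targets easier to review, here is a small Rust sketch of how the feature table above expands. It is a deliberately simplified model, not Cargo's actual resolver: it treats features and optional dependencies as one namespace, and the names `FEATURES`, `BINS`, `resolve`, and `buildable_bins` are hypothetical helpers introduced only for this illustration.

```rust
use std::collections::BTreeSet;

/// Feature table copied from the `[features]` section of this Cargo.toml.
/// Each feature maps to the features or optional dependencies it enables.
const FEATURES: &[(&str, &[&str])] = &[
    ("default", &["async"]),
    ("async", &["async-trait", "futures", "futures-util", "tokio"]),
    ("cli", &["anyhow", "clap"]),
    ("opendal", &["dep:opendal"]),
];

/// Binaries and the feature each one lists under `required-features`.
const BINS: &[(&str, &str)] = &[
    ("orc-metadata", "cli"),
    ("orc-export", "cli"),
    ("orc-stats", "cli"),
];

/// Simplified model of feature resolution (an assumption of this sketch):
/// transitively expand the requested features, optionally with `default`.
pub fn resolve(requested: &[&str], use_default: bool) -> BTreeSet<String> {
    let mut pending: Vec<&str> = requested.to_vec();
    if use_default {
        pending.push("default");
    }
    let mut enabled = BTreeSet::new();
    while let Some(name) = pending.pop() {
        // `dep:foo` refers to the optional dependency `foo`; this model
        // records it under the plain name.
        let name = name.strip_prefix("dep:").unwrap_or(name);
        if !enabled.insert(name.to_string()) {
            continue; // already expanded
        }
        if let Some((_, implied)) = FEATURES.iter().find(|(f, _)| *f == name) {
            pending.extend(implied.iter().copied());
        }
    }
    enabled
}

/// Binaries whose required feature is present in the enabled set.
pub fn buildable_bins(enabled: &BTreeSet<String>) -> Vec<&'static str> {
    BINS.iter()
        .filter(|(_, required)| enabled.contains(*required))
        .map(|(bin, _)| *bin)
        .collect()
}
```

Under this model, `resolve(&[], true)` yields `default`, `async`, and the four async dependencies, and `buildable_bins` returns an empty list for that set. In other words, a plain `cargo build` skips `orc-metadata`, `orc-export`, and `orc-stats`; they only build when `cli` is requested, for example with `cargo build --features cli`.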
17 changes: 17 additions & 0 deletions Makefile
@@ -0,0 +1,17 @@
.PHONY: fmt
fmt: ## Format all the Rust code.
	cargo fmt --all


.PHONY: clippy
clippy: ## Check clippy rules.
	cargo clippy --workspace --all-targets -- -D warnings


.PHONY: fmt-toml
fmt-toml: ## Format all TOML files.
	taplo format --option "indent_string= "

.PHONY: check-toml
check-toml: ## Check all TOML files.
	taplo format --check --option "indent_string= "