Out of range error #120

Closed
Deduction42 opened this issue Dec 1, 2020 · 7 comments · Fixed by #123
Comments

@Deduction42

Deduction42 commented Dec 1, 2020

I couldn't read a Parquet file in Julia, so I read it in Python with fastparquet and split it into smaller files. I can read every split file except the second one.

Opening the second file works:

parFile
Parquet file: /home/USER/Desktop/Parquet Data Transformation/Split/split02.parquet
    version: 1
    nrows: 253120
    created by: fastparquet-python version 1.0.0 (build 111)
    cached: 146 column chunks

But I couldn't even start a cursor on it, and the error message doesn't even tell me where the error in the file is. I suspect it's within one of the columns that uses strings.

cursor = BatchedColumnsCursor(parFile, batchsize=20000, use_threads=false, reusebuffer=false)
ERROR: BoundsError: attempt to access 980329-element Array{UInt8,1} at index [33448:1699185319]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:541
 [2] checkbounds at ./abstractarray.jl:506 [inlined]
 [3] getindex at ./array.jl:815 [inlined]
 [4] read_plain_byte_array(::Parquet.InputState, ::Int32) at /home/USER/.julia/packages/Parquet/yx9gp/src/codec.jl:142
 [5] read_plain_byte_array at /home/USER/.julia/packages/Parquet/yx9gp/src/codec.jl:140 [inlined]
 [6] read_plain_values(::Parquet.InputState, ::Parquet.OutputState{String}, ::Int32, ::typeof(logical_string), ::Int32) at /home/USER/.julia/packages/Parquet/yx9gp/src/codec.jl:196
 [7] iterate(::Parquet.ColumnChunkPageValues{String}, ::Int64) at /home/USER/.julia/packages/Parquet/yx9gp/src/reader.jl:283
 [8] iterate at /home/USER/.julia/packages/Parquet/yx9gp/src/reader.jl:237 [inlined]
 [9] setrow(::Parquet.ColCursor{String}, ::Int64) at /home/USER/.julia/packages/Parquet/yx9gp/src/cursor.jl:114
 [10] Parquet.ColCursor(::Parquet.File, ::Array{String,1}; rows::UnitRange{Int64}, row::Int64) at /home/USER/.julia/packages/Parquet/yx9gp/src/cursor.jl:62
 [11] ColCursor at /home/USER/.julia/packages/Parquet/yx9gp/src/cursor.jl:57 [inlined]
 [12] #51 at ./none:0 [inlined]
 [13] iterate at ./generator.jl:47 [inlined]
 [14] collect_to!(::Array{Parquet.ColCursor,1}, ::Base.Generator{Array{Array{String,1},1},Parquet.var"#51#53"{Parquet.File}}, ::Int64, ::Int64) at ./array.jl:732
 [15] collect_to!(::Array{Parquet.ColCursor{Float64},1}, ::Base.Generator{Array{Array{String,1},1},Parquet.var"#51#53"{Parquet.File}}, ::Int64, ::Int64) at ./array.jl:740
 [16] collect_to_with_first!(::Array{Parquet.ColCursor{Float64},1}, ::Parquet.ColCursor{Float64}, ::Base.Generator{Array{Array{String,1},1},Parquet.var"#51#53"{Parquet.File}}, ::Int64) at ./array.jl:710
 [17] collect(::Base.Generator{Array{Array{String,1},1},Parquet.var"#51#53"{Parquet.File}}) at ./array.jl:691
 [18] BatchedColumnsCursor(::Parquet.File; rows::UnitRange{Int64}, batchsize::Int64, reusebuffer::Bool, use_threads::Bool) at /home/USER/.julia/packages/Parquet/yx9gp/src/cursor.jl:254
 [19] top-level scope at REPL[3]:1
@tanmaykm
Member

tanmaykm commented Dec 2, 2020

It will help if you could share a sample file that can be used to replicate the issue.

@Deduction42
Author

TestFile.parquet.zip

@Deduction42
Author

I wouldn't be surprised if it is some sort of weird character in one of the strings.

@tanmaykm
Member

tanmaykm commented Dec 3, 2020

Thanks. I see what the issue is. Will work out a fix shortly.

tanmaykm added a commit that referenced this issue Dec 3, 2020
There may be trailing bytes in bitpack encoded data. The reader should be able to skip those while reading bit packed runs.

fixes #120
@tanmaykm
Member

tanmaykm commented Dec 3, 2020

For some reason one extra byte was present in the encoding of one of the columns which was causing the reading to go out of whack. Should be fixed by #123.
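
To illustrate why a single stray byte can produce an out-of-range read like the one in the stack trace, here is a minimal Python sketch. It is not Parquet.jl's code or the change made in #123. The page layout, the helper names (`encode_byte_arrays`, `read_byte_arrays`) and the sample strings are assumptions made up for the example. The only format detail relied on is that PLAIN-encoded byte arrays are stored as a 4-byte little-endian length followed by the bytes.

```python
import struct

def encode_byte_arrays(values):
    # PLAIN encoding of BYTE_ARRAY: 4-byte little-endian length, then the raw bytes.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

def read_byte_arrays(buf, pos, count):
    # Read `count` length-prefixed values starting at byte offset `pos`.
    out = []
    for _ in range(count):
        if pos + 4 > len(buf):
            raise IndexError(f"no room for a length prefix at offset {pos}")
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if pos + n > len(buf):
            raise IndexError(f"length {n} at offset {pos - 4} exceeds {len(buf)}-byte buffer")
        out.append(buf[pos:pos + n])
        pos += n
    return out

# Hypothetical page: a length-prefixed block (standing in for the encoded
# data that precedes the values) followed by two string values.
decoded_part = b"\x03"   # bytes the block decoder actually consumes
trailing = b"\x00"       # one extra byte the writer left inside the block
block = decoded_part + trailing
page = struct.pack("<I", len(block)) + block + encode_byte_arrays([b"George", b"Ann"])

(block_len,) = struct.unpack_from("<I", page, 0)
block_start = 4

# Fragile reader: resumes where decoding stopped, so it starts one byte early
# and interprets unrelated bytes as a length prefix.
try:
    read_byte_arrays(page, block_start + len(decoded_part), 2)
    fragile_error = None
except IndexError as exc:
    fragile_error = str(exc)

# Robust reader: jumps to the declared end of the block, skipping trailing bytes.
robust_values = read_byte_arrays(page, block_start + block_len, 2)

# The range in the reported BoundsError spans this many bytes; viewed as a
# 4-byte little-endian length prefix it contains printable text characters.
bogus_length = 1699185319 - 33448 + 1
bogus_length_bytes = bogus_length.to_bytes(4, "little")

print(fragile_error)
print(robust_values)        # [b'George', b'Ann']
print(bogus_length_bytes)   # b'\x00\x00Ge'
```

The last lines are only an observation drawn from the numbers in the error message: the requested length decodes to bytes that include ASCII letters, which is consistent with string data being read as a length after the offset drifted. The thread itself does not state this.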

@Deduction42
Author

Okay, it's probably what's causing issue #122. Can you tell me when it gets released so I can test it on some other files?

By the way, I want to thank you guys for all the work you're doing on this particular project. Working with Parquet files has been a real pain-point with me and Julia; I currently rely on PyCall to fastparquet which sometimes segfaults if the versions have LLVM conflicts and will always segfault if I try any kind of parallelism. I've noticed huge improvements in the package's usability over the past 9 months.

@tanmaykm
Member

tanmaykm commented Dec 4, 2020

👍 v0.8.0 released now with this
