Access to Undefined Reference #122

Open
Deduction42 opened this issue Dec 2, 2020 · 4 comments

Comments

@Deduction42

I can only partially iterate over a file created by parquet-mr. I can iterate through it once, but trying to do this a second time yields:

iterate(cursor)
ERROR: UndefRefError: access to undefined reference
[1] getindex at ./array.jl:809 [inlined]
 [2] colcursor_values(::Parquet.ColCursor{String}, ::Int64, ::Type{Array{Union{Missing, String},1}}, ::Nothing) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:289
 [3] (::Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}})(::Tuple{Parquet.ColCursor{String},DataType}) at ./none:0
 [4] iterate at ./generator.jl:47 [inlined]
 [5] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:732
 [6] collect_to!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}) at ./array.jl:740
 [7] collect_to_with_first!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Array{Union{Missing, Decimals.Decimal},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}) at ./array.jl:710
 [8] collect(::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}) at ./array.jl:691
 [9] iterate(::BatchedColumnsCursor{NamedTuple{...}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:336
 [10] iterate(::BatchedColumnsCursor{NamedTuple{...}) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:350
 [11] top-level scope at REPL[5]:1

Note that NamedTuple{...} is abridged because the actual tuple is a very long list covering the entire file schema. I can't give you the original file for this one, but I wouldn't be surprised if it has something to do with initializing a mutable type with #undef and failing to populate it. There could be sizable gaps in the data for some of the columns. Note that the file was created by parquet-mr:

Parquet file: Input/input_data.parquet
version: 1
nrows: 4887400
created by: parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
cached: 157 column chunks
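
To illustrate the #undef hypothesis above in isolation, here is a minimal sketch in plain Julia. It does not use Parquet.jl and is not taken from the package's source; the variable names are made up for illustration. It only shows that an uninitialized array of a non-bits type such as String raises the same UndefRefError when an unpopulated slot is read.

```julia
# Allocate a vector of Strings without initializing it.
# Every slot holds an undefined reference (#undef) until it is assigned.
vals = Vector{String}(undef, 3)
vals[1] = "a"                    # populate only the first slot

# isassigned checks a slot without throwing.
println(isassigned(vals, 1))     # true
println(isassigned(vals, 2))     # false

# Reading an unpopulated slot throws UndefRefError.
err = try
    vals[2]
    nothing
catch e
    e
end
println(err isa UndefRefError)   # true
```

If a column buffer were allocated this way and a gap in the data left some slots unwritten, a later read would fail like this. Whether that is what happens in cursor.jl is an assumption, not something confirmed here.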

@Deduction42
Author

This issue could be related to Issue #120. There may be a string column that fails to parse a delimiter, putting a large chunk of data into a single cell and affecting the rest of the columns.

@Deduction42
Author

If I iterate through the file with no batch size specified, I get an InexactError (from trying to convert a NaN to Int32):

cursor = BatchedColumnsCursor(parFile, use_threads=false, reusebuffer=false)
df = DataFrame(iterate(cursor)[1])
ERROR: InexactError: Int32(NaN)
Stacktrace:
 [1] Int32 at ./float.jl:689 [inlined]
 [2] read_plain_values(::Parquet.InputState, ::Parquet.OutputState{Decimals.Decimal}, ::Int32, ::Parquet.var"#32#35"{Int32,DataType}, ::Int32) at /home/user/.julia/packages/Parquet/yx9gp/src/codec.jl:170
 [3] iterate(::Parquet.ColumnChunkPageValues{Decimals.Decimal}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/reader.jl:283
 [4] iterate at /home/user/.julia/packages/Parquet/yx9gp/src/reader.jl:237 [inlined]
 [5] setrow(::Parquet.ColCursor{Decimals.Decimal}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:114
 [6] colcursor_advance at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:267 [inlined]
 [7] colcursor_values(::Parquet.ColCursor{Decimals.Decimal}, ::Int64, ::Type{Array{Union{Missing, Decimals.Decimal},1}}, ::Nothing) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:296
 [8] (::Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}})(::Tuple{Parquet.ColCursor{Decimals.Decimal},DataType}) at ./none:0
 [9] iterate at ./generator.jl:47 [inlined]
 [10] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:732
 [11] collect_to!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:740
 [12] collect_to_with_first!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Array{Union{Missing, Decimals.Decimal},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Tuple{Int64,Int64}) at ./array.jl:710
 [13] collect(::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}) at ./array.jl:691
 [14] iterate(::BatchedColumnsCursor{NamedTuple{...}}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:336
 [15] iterate(::BatchedColumnsCursor{NamedTuple{...}}) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:350
 [16] top-level scope at REPL[12]:1
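
The InexactError itself can be reproduced in plain Julia without the file. The sketch below shows only the conversion failure. What codec.jl actually computes before calling Int32 is not known here, and to_int32_or_missing is a hypothetical helper, not part of Parquet.jl.

```julia
# NaN has no exact integer representation, so the conversion throws.
x = NaN
err = try
    Int32(x)
    nothing
catch e
    e
end
println(err isa InexactError)    # true

# Hypothetical guard: convert only finite, integer-valued, in-range inputs.
function to_int32_or_missing(v::Real)
    ok = isfinite(v) && isinteger(v) && typemin(Int32) <= v <= typemax(Int32)
    return ok ? Int32(v) : missing
end

println(to_int32_or_missing(2.0))   # 2
println(to_int32_or_missing(NaN))   # missing
```

A guard like this would only hide the symptom. A NaN reaching an integer conversion suggests that an earlier computation produced an invalid value.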

@Deduction42
Author

Deduction42 commented Dec 4, 2020

I just tested the fix for Issue #120. Unfortunately, it doesn't fix this problem, so this issue is still open.

@nickrobinson251

nickrobinson251 commented Feb 12, 2021

I also get exactly this UndefRefError when trying to read a Parquet file written using Python/pandas, reading with Parquet v0.8.0 on Julia v1.5.3.
