Memory Usage with ZSTD Compressed Files #159

Open · koosh85 opened this issue Oct 11, 2021 · 1 comment

Comments

koosh85 commented Oct 11, 2021

First I wanted to say thanks for maintaining this package, it's great to have Parquet support with Julia.

I noticed that there seems to be a memory leak when I read parquet files that use ZSTD compression. The easiest way for me to reproduce the issue was to create a parquet file and then repeatedly read it in Julia while monitoring memory usage.

Creating the file in Python with pyarrow (I wasn't sure how to create a similar file in Julia):

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

NUM_ROWS = 10000000
SCHEMA = pa.schema({'c1': pa.float64(), 'c2': pa.float64(), 'c3': pa.float64()})

writer = pq.ParquetWriter("/tmp/mem.parquet", SCHEMA, use_dictionary = False, compression = "ZSTD")
tab = pa.Table.from_pydict({"c1": np.random.rand(NUM_ROWS), "c2": np.random.rand(NUM_ROWS), "c3": np.random.rand(NUM_ROWS)})
writer.write_table(tab)
writer.close()
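
For anyone who would rather stay in Julia, a rough equivalent of the writer above might look like the sketch below. It assumes Parquet.jl's write_parquet accepts a compression-codec keyword and understands "ZSTD"; the exact keyword name and accepted values may differ between versions, so treat this as an unverified sketch rather than a recipe.

# Sketch of creating a similar file from Julia: three Float64 columns of
# 10M random values each, written with ZSTD compression.
# `compression_codec` is an assumption about Parquet.jl's writer keyword.
using DataFrames
using Parquet

const NUM_ROWS = 10_000_000

tbl = DataFrame(c1 = rand(NUM_ROWS), c2 = rand(NUM_ROWS), c3 = rand(NUM_ROWS))
write_parquet("/tmp/mem.parquet", tbl; compression_codec = "ZSTD")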

Reading the same file a few times and monitoring memory usage:

using DataFrames
using Parquet

# Build the DataFrame inside a function so nothing keeps a reference to it
# after the call returns; the return value is just a placeholder.
function BuildDF()::Int64
    df = DataFrame(read_parquet("/tmp/mem.parquet"))
    return 1
end

for i in 1:10
    BuildDF()
    GC.gc()                               # force a full collection
    run(`ps -p $(getpid()) -h -o rss`)    # print resident set size (KB)
end
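
One way to narrow this down (a sketch, not something I have verified against this file) is to print Base.gc_live_bytes() alongside the RSS. If the RSS keeps climbing while the GC-tracked bytes stay roughly flat after GC.gc(), the growth is probably coming from native allocations (for example the zstd codec's buffers) rather than from Julia objects the GC can see.

# Same loop, additionally reporting what the Julia GC believes is live.
for i in 1:10
    BuildDF()
    GC.gc()
    println("gc_live_bytes = ", Base.gc_live_bytes())  # GC-tracked heap bytes
    run(`ps -p $(getpid()) -h -o rss`)                 # process RSS (KB)
end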

I see output like this:

774600
979464
1384296
1398920
1801576
2024260
2030104
2432612
2662292
2883896

Changing the compression type of the parquet file to Snappy or uncompressed doesn't show the same memory growth. GZip compression shows some growth but not as large as ZSTD.

I wasn't able to dig deeper to see where the memory usage may be coming from. Any ideas?

@sschmidhuber

I just wanted to mention that I ran into the same issue with ZSTD compression, so I'm moving to SNAPPY.
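
For existing ZSTD files, a one-off conversion along these lines could serve as a stopgap (again assuming write_parquet takes a compression-codec keyword in the installed Parquet.jl version):

# Read the ZSTD file once and rewrite it with SNAPPY so subsequent reads
# avoid the zstd path. `compression_codec` is an assumed keyword name.
using DataFrames, Parquet

df = DataFrame(read_parquet("/tmp/mem.parquet"))
write_parquet("/tmp/mem_snappy.parquet", df; compression_codec = "SNAPPY")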
