Memory Usage with ZSTD Compressed Files #159

Open · koosh85 opened this issue Oct 11, 2021 · 1 comment

Comments

koosh85 commented Oct 11, 2021

First I wanted to say thanks for maintaining this package, it's great to have Parquet support with Julia.

I noticed that there seems to be a memory leak when I read parquet files that use ZSTD compression. The easiest way for me to reproduce the issue was to create a parquet file and then repeatedly read it in Julia while monitoring memory usage.

Creating the file in Python with pyarrow (I wasn't sure how to create a similar file in Julia):

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

NUM_ROWS = 10000000
SCHEMA = pa.schema({'c1': pa.float64(), 'c2': pa.float64(), 'c3': pa.float64()})

writer = pq.ParquetWriter("/tmp/mem.parquet", SCHEMA, use_dictionary = False, compression = "ZSTD")
tab = pa.Table.from_pydict({"c1": np.random.rand(NUM_ROWS), "c2": np.random.rand(NUM_ROWS), "c3": np.random.rand(NUM_ROWS)})
writer.write_table(tab)
writer.close()
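
For anyone who would rather stay in Julia, a rough equivalent of the writer above might look like the sketch below. It assumes Parquet.jl's write_parquet accepts a compression-codec keyword and understands "ZSTD"; the exact keyword name and accepted values may differ between versions, so treat this as an unverified sketch rather than a recipe.

# Sketch of creating a similar file from Julia: three Float64 columns of
# 10M random values each, written with ZSTD compression.
# `compression_codec` is an assumption about Parquet.jl's writer keyword.
using DataFrames
using Parquet

const NUM_ROWS = 10_000_000

tbl = DataFrame(c1 = rand(NUM_ROWS), c2 = rand(NUM_ROWS), c3 = rand(NUM_ROWS))
write_parquet("/tmp/mem.parquet", tbl; compression_codec = "ZSTD")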

Reading the same file a few times and monitoring memory usage:

using DataFrames
using Parquet

# Build the DataFrame inside a function so nothing keeps a reference to it
# after the call returns; the return value is just a placeholder.
function BuildDF()::Int64
    df = DataFrame(read_parquet("/tmp/mem.parquet"))
    return 1
end

for i in 1:10
    BuildDF()
    GC.gc()                               # force a full collection
    run(`ps -p $(getpid()) -h -o rss`)    # print resident set size (KB)
end
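
One way to narrow this down (a sketch, not something I have verified against this file) is to print Base.gc_live_bytes() alongside the RSS. If the RSS keeps climbing while the GC-tracked bytes stay roughly flat after GC.gc(), the growth is probably coming from native allocations (for example the zstd codec's buffers) rather than from Julia objects the GC can see.

# Same loop, additionally reporting what the Julia GC believes is live.
for i in 1:10
    BuildDF()
    GC.gc()
    println("gc_live_bytes = ", Base.gc_live_bytes())  # GC-tracked heap bytes
    run(`ps -p $(getpid()) -h -o rss`)                 # process RSS (KB)
end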

I see output like this:

774600
979464
1384296
1398920
1801576
2024260
2030104
2432612
2662292
2883896

Changing the compression type of the parquet file to Snappy or uncompressed doesn't show the same memory growth. GZip compression shows some growth but not as large as ZSTD.

I wasn't able to dig deeper to see where the memory usage may be coming from. Any ideas?

@sschmidhuber

I just wanted to mention that I ran into the same issue with ZSTD compression, so I'm moving to SNAPPY.
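
For existing ZSTD files, a one-off conversion along these lines could serve as a stopgap (again assuming write_parquet takes a compression-codec keyword in the installed Parquet.jl version):

# Read the ZSTD file once and rewrite it with SNAPPY so subsequent reads
# avoid the zstd path. `compression_codec` is an assumed keyword name.
using DataFrames, Parquet

df = DataFrame(read_parquet("/tmp/mem.parquet"))
write_parquet("/tmp/mem_snappy.parquet", df; compression_codec = "SNAPPY")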
