[C++] Metadata related memory leak when reading parquet dataset #45287
Comments
Data can be generated with the code in the notebook linked in the issue description.
I haven't tried to track down the precise source of memory consumption (yet?), but some quick comments already:
A quick back of the envelope calculation says that this is roughly 2 kB per column per file.
Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...
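For concreteness, a rough check of those figures (using the ~260-file, 10,000-column dataset shape and the ~6 GB / ~11 GB peaks reported elsewhere in this thread; treat as approximate):

```python
files, columns = 260, 10_000           # reported dataset shape
print(6e9 / (files * columns))         # ~2.3 kB per column per file (short names)
print(11e9 / (files * columns))        # ~4.2 kB per column per file (+128-char prefix)
```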
I would stress that "a single row and 10 kB columns" is never going to be a good use case for Parquet, which is designed from the ground up as a columnar format. If you're storing less than e.g. 1k rows (regardless of the number of columns), the format will certainly impose a lot of overhead. Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.
I was expecting the metadata memory usage to be more like O(C) where C=number_columns, instead of O(C * F) where C=number_columns and F=number_files. Once a Parquet file is loaded into a pyarrow Table, we don't need to keep the metadata around (all files have the same schema), but perhaps I am misunderstanding how reading Parquet works.
Yeah, it certainly feels like there are multiple copies of the string for the column name even though all files/partitions have the same schema.
Yeah, this is an extreme case just to show the repro. In practice there are a couple thousand rows per file.
It would be great to reduce metadata memory usage when the files being read all have the same schema, since this is quite a common case, I think.
Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?
Definitely.
How many row groups per file (or rows per row group)? It turns out much of the Parquet metadata consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(C * R * F) ColumnChunks in your dataset, with C=number_columns, R=number_row_groups_per_file and F=number_files.
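To illustrate how quickly that adds up for the dataset shape reported here (a rough lower bound that only counts the deserialized ColumnChunk structs themselves, not the strings such as column paths hanging off them):

```python
C, R, F = 10_000, 1, 260               # columns, row groups per file, files
column_chunks = C * R * F              # 2.6 million ColumnChunk entries
print(column_chunks * 640 / 1e9)       # ~1.7 GB at 640 bytes per entry
```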
I will let my colleague @timothydijamco provide details here.
Thanks for helping look into this.
We typically use one row group per file. For some additional background, one of the situations where we originally observed high memory usage is this:
In that dataset I observed that the metadata region in one of the .parquet files is 1,082,066 bytes long, and since the metadata region is read in full, the reader needs to read ~120 bytes of metadata per data value -- so I think some memory usage overhead would be expected because of this. However, our main concern is that the memory usage doesn't seem to be constant -- it keeps increasing and isn't freed after the read is done.
I think it's peak memory usage after loading the data into an Arrow Table. However, I'm not sure whether the dataset object is still alive or not at that point. I'll work on a C++ repro and share it here.
Yes, unfortunately with the current version of the Parquet format it's difficult to avoid that overhead. There are discussions in the Parquet community about a redesign of the Parquet metadata precisely to avoid the issue of metadata loading overhead with very wide schemas. A preliminary proof of concept gave encouraging results, but the whole project still needs to be pushed forward with actual specs and implementations.
When you say it isn't freed, how does your use case look exactly? Do you:
A typical use case looks like this: in a Jupyter notebook, read the dataset once with a column selection. What we observed is that...
In this ^ example I believe the dataset object is still alive after the read cell completes. Also, here's a C++ repro script. Observations:
Ok, I don't know how Jupyter works in that regard, but I know that the IPython console (in the command line) keeps past results alive by default. See %reset.
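A quick way to test that hypothesis in a notebook (a sketch; `test_data/` is a placeholder path, and the `%`-magics mentioned in the comments only exist in IPython/Jupyter):

```python
import gc
import pyarrow.dataset as ds

table = ds.dataset("test_data/", format="parquet").to_table()   # the big read

# In IPython/Jupyter the cell result also stays reachable through the output
# cache (Out[n], _, __, ...); %reset -f or %xdel table clears those entries.
# In plain Python, dropping the last reference and collecting is enough:
del table
gc.collect()
```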
Great, thank you!
That might also have to do with how memory allocators work (they often keep some cache of deallocated memory for better performance instead of returning it to the OS). There are several things that you could try and report results for:
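One common experiment here is switching Arrow's allocator and checking whether freed memory is returned to the OS. A hedged pyarrow sketch (whether jemalloc/mimalloc are available depends on the build):

```python
import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated(), pool.max_memory())

# Try the plain system allocator instead of mimalloc/jemalloc, either via the
# environment (ARROW_DEFAULT_MEMORY_POOL=system, set before importing pyarrow)
# or programmatically before any data is read:
pa.set_memory_pool(pa.system_memory_pool())

# If jemalloc is the backend, ask it to return freed memory to the OS promptly
# (this call may raise if jemalloc isn't compiled in):
pa.jemalloc_set_decay_ms(0)
```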
By the way, you may also try memray to further diagnose the issue, but I would recommend selecting the "system" allocator as I don't think memray is able to intercept Arrow's mimalloc/jemalloc allocations.
Also of note: some allocators allow printing useful information. For example mimalloc: https://microsoft.github.io/mimalloc/environment.html and glibc: https://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html
Thanks! These are interesting points. I'll take a look at memray and the memory allocators.
In #45359 I'm adding a method to print memory pool statistics.
Further diagnosis using #45359 suggests that, on a 1000-column * 300-chunk table, around 250 MB of memory is spent on per-chunk metadata, which is about 100 64-bit words per column chunk. I suspect it's a combination of the fields on the various per-chunk structures.
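Back-of-the-envelope check of that figure:

```python
per_chunk = 250e6 / (1000 * 300)       # ~833 bytes of overhead per column chunk
print(per_chunk, per_chunk / 8)        # i.e. roughly 104 eight-byte words
```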
@pitrou to clarify, how does "chunks" map to "number of files in the dataset"? I assume that each file is at least one chunk, but one file can map to multiple chunks if it contains more than one row group?
Awesome, thanks. Can you point me to how you were able to tell 250 MB was spent on column chunk metadata using the memory pool statistics debugging? I think I was only getting high-level summary statistics. I may be seeing what you're saying about column chunk metadata in the profiling output I'm looking at. At the peak (middle of the graph), the top three things using memory seem to be:
What I did in a Python prompt:
The diff between steps 4 and 1 is the space saved when combining the table chunks.
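A hedged reconstruction of that kind of measurement (assuming readings 1 and 4 are memory-pool numbers taken before and after combining chunks; the dataset path is a placeholder):

```python
import pyarrow as pa
import pyarrow.dataset as ds

pool = pa.default_memory_pool()

table = ds.dataset("test_data/", format="parquet").to_table()
before = pool.bytes_allocated()        # (1) bytes held by the chunked table

combined = table.combine_chunks()      # (2)/(3) merge per-row-group chunks
del table                              # drop the heavily chunked version

after = pool.bytes_allocated()         # (4) bytes held by the combined table
print(before - after)                  # space saved by combining chunks
```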
That's interesting, thank you. How many columns and chunks does your reproducer have?
From what I can read in the Dataset source code (but it's quite complex), the row groups for each file will actually be concatenated and then re-chunked according to the scan batch size.
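A hedged pyarrow illustration of that re-chunking (path, column name and batch size are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("test_data/", format="parquet")
# Within each file, rows from its row groups are gathered into record batches
# of at most `batch_size` rows before being handed to the rest of the scan:
table = dataset.scanner(columns=["col_00000"], batch_size=64 * 1024).to_table()
print(table.column("col_00000").num_chunks)
```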
Ahh I see, makes sense.
My repro run was scanning a dataset with 260 .parquet files, each with 1 row and 10,000 columns. Each file contains one row group, so I think that means the dataset contains 260 row groups in total. Looking at the "Arrow chunk" side of things, I'm not sure how many Arrow chunks the data materializes to -- I also just remembered that in my repro I'm iterating over batches of the data instead of accumulating it into a table, so theoretically it shouldn't accumulate overhead attached to Arrow column chunk objects?
Well, if you're only reading one column, the Arrow column chunks will not be a problem then. Looks like the memory consumption is entirely on the Parquet metadata handling side.
Adding some results:
(This is using a C++ repro scanning a dataset with 250 files, 10k columns, and 200-character-long column names; one scan. I'll share the new C++ code I used for generating this test data and getting these results tomorrow, after I clean it up a bit, in case it's useful.)
I printed out info about the default memory pool after every batch is read from the scanner. My shaky understanding of Arrow memory pools and allocators is that the memory usage I'm hoping to reduce is memory that is not allocated on the Arrow memory pool?
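A Python equivalent of that check (the C++ repro itself is separate; psutil and the path/column name are assumptions): if the process RSS keeps growing while the pool counters stay flat, the growth comes from memory that is not allocated through the Arrow memory pool, e.g. Thrift-deserialized Parquet metadata allocated with plain new/malloc.

```python
import os
import psutil                          # assumption: psutil is installed
import pyarrow as pa
import pyarrow.dataset as ds

pool = pa.default_memory_pool()
proc = psutil.Process(os.getpid())

dataset = ds.dataset("test_data/", format="parquet")
for i, batch in enumerate(dataset.to_batches(columns=["col_00000"])):
    if i % 50 == 0:
        print(i, "pool:", pool.bytes_allocated(), "rss:", proc.memory_info().rss)
```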
I see, that's fair.
I did some memory profiling. Here's the memory usage graph of a C++ program that scans my synthetic "250 files (one row per file), 10k columns, 200-character-long column names" dataset two times:
And for good measure, here's the same thing but on a dataset with twice as many files (from 250 files -> 500 files) to show memory accumulation better:
Overall, clearing the cached metadata after each scan seems like it would help here.
It's not obvious that clearing the cached metadata would be safe, though.
Superficially (in the context of "scan node") it looks safe from what I can tell from the code:
Where is
In summary, it seems that today clearing it would be safe. I locally added the below check to the end of your "Parquet cached metadata" test and it runs OK. I think this exercises that code path.
Whether it makes sense in the overall design is another question.
Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I have observed a memory leak when loading a Parquet dataset, which I think is related to the file metadata.
I ran with PyArrow 19.0.0. Here is the code to repro:
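Roughly, the repro boils down to reading the whole dataset into a Table with pyarrow.dataset; a hedged sketch (`test_data/` is a placeholder for the generated dataset):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("test_data/", format="parquet")
table = dataset.to_table()
print(table.num_rows, table.num_columns, table.nbytes)
```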
Here is the description of the dataset:
The dataset roughly looks like this:
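For a concrete picture, a hedged sketch that writes a dataset of roughly this shape (counts taken from the discussion above; the path and column names are made up):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

num_files, num_columns = 260, 10_000
names = [f"col_{i:05d}" for i in range(num_columns)]

os.makedirs("test_data", exist_ok=True)
for f in range(num_files):
    # One row per file, 10,000 float64 columns -> tiny data, huge metadata.
    table = pa.table({name: [float(f)] for name in names})
    pq.write_table(table, f"test_data/part-{f:04d}.parquet")
```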
When running the code above with "time -v", it shows the memory usage is about 6 GB, which is significantly larger than the data loaded, so I think there is some metadata-related memory leak. I also noticed that the memory usage increases if I use longer column names; e.g., if I prepend a 128-character prefix to the column names, the memory usage is about 11 GB.
This issue probably has the same root cause as #37630.
There is a script that can be used to generate the dataset for the repro, but it has permissioned access (due to company policy); I'm happy to give permission to whoever is looking into this:
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb
Component(s)
Parquet, C++