[C++] Metadata related memory leak when reading parquet dataset #45287
Comments
Data can be generated with the code in the notebook linked in the issue description.
I haven't tried to track down the precise source of memory consumption (yet?), but some quick comments already:
A quick back of the envelope calculation says that this is roughly 2 kB per column per file.
Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...
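For concreteness, a rough check of those figures (using the ~260-file, 10,000-column dataset shape and the ~6 GB / ~11 GB peaks reported elsewhere in this thread; treat as approximate):

```python
files, columns = 260, 10_000           # reported dataset shape
print(6e9 / (files * columns))         # ~2.3 kB per column per file (short names)
print(11e9 / (files * columns))        # ~4.2 kB per column per file (+128-char prefix)
```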
I would stress that "a single row and 10 kB columns" is never going to be a good use case for Parquet, which is designed from the ground up as a columnar format. If you're storing less than e.g. 1k rows (regardless of the number of columns), the format will certainly impose a lot of overhead. Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.
I was expecting the metadata memory usage to be more like O(C) where C=number_columns, instead of O(C * F) where C=number_columns and F=number_files. Once a Parquet file is loaded into a pyarrow Table, we don't need to keep the metadata around (all files have the same schema), but perhaps I am misunderstanding how reading Parquet works.
Yeah, it certainly feels like there are multiple copies of the string for the column name even though all files/partitions have the same schema.
Yeah, this is an extreme case just to show the repro. In practice there are a couple thousand rows per file.
It would be great to reduce metadata memory usage when the files being read all have the same schema, since this is quite a common case, I think.
Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?
Definitely.
How many row groups per file (or rows per row group)? It turns out much of the Parquet metadata consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(C * R * F) ColumnChunks in your dataset, with C=number_columns, R=number_row_groups_per_file and F=number_files.
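To illustrate how quickly that adds up for the dataset shape reported here (a rough lower bound that only counts the deserialized ColumnChunk structs themselves, not the strings such as column paths hanging off them):

```python
C, R, F = 10_000, 1, 260               # columns, row groups per file, files
column_chunks = C * R * F              # 2.6 million ColumnChunk entries
print(column_chunks * 640 / 1e9)       # ~1.7 GB at 640 bytes per entry
```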
I will let my colleague @timothydijamco provide details here.
Thanks for helping look into this.
We typically use one row group per file. For some additional background, one of the situations where we originally observed high memory usage is this:
In that dataset I observed that the metadata region in one of the .parquet files is 1,082,066 bytes long, and since the metadata region is read in full, the reader needs to read ~120 bytes of metadata per data value -- so I think some memory usage overhead would be expected because of this. However, our main concern is that the memory usage doesn't seem to be constant -- it keeps increasing and isn't freed after the read is done.
I think it's peak memory usage after loading the data into an Arrow Table. However, I'm not sure whether the dataset object is still alive or not at that point. I'll work on a C++ repro and share it here.
Yes, unfortunately with the current version of the Parquet format it's difficult to avoid that overhead. There are discussions in the Parquet community about a redesign of the Parquet metadata precisely to avoid the issue of metadata loading overhead with very wide schemas. A preliminary proof of concept gave encouraging results, but the whole project still needs to be pushed forward with actual specs and implementations.
When you say it isn't freed, how does your use case look exactly? Do you:
A typical use case looks like this: in a Jupyter notebook, read the dataset once with a column selection. What we observed is that...
In this ^ example I believe the dataset object is still alive after the read cell completes. Also, here's a C++ repro script. Observations:
Ok, I don't know how Jupyter works in that regard, but I know that the IPython console (in the command line) keeps past results alive by default. See %reset.
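A quick way to test that hypothesis in a notebook (a sketch; `test_data/` is a placeholder path, and the `%`-magics mentioned in the comments only exist in IPython/Jupyter):

```python
import gc
import pyarrow.dataset as ds

table = ds.dataset("test_data/", format="parquet").to_table()   # the big read

# In IPython/Jupyter the cell result also stays reachable through the output
# cache (Out[n], _, __, ...); %reset -f or %xdel table clears those entries.
# In plain Python, dropping the last reference and collecting is enough:
del table
gc.collect()
```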
Great, thank you!
That might also have to do with how memory allocators work (they often keep some cache of deallocated memory for better performance instead of returning it to the OS). There are several things that you could try and report results for:
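One common experiment here is switching Arrow's allocator and checking whether freed memory is returned to the OS. A hedged pyarrow sketch (whether jemalloc/mimalloc are available depends on the build):

```python
import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated(), pool.max_memory())

# Try the plain system allocator instead of mimalloc/jemalloc, either via the
# environment (ARROW_DEFAULT_MEMORY_POOL=system, set before importing pyarrow)
# or programmatically before any data is read:
pa.set_memory_pool(pa.system_memory_pool())

# If jemalloc is the backend, ask it to return freed memory to the OS promptly
# (this call may raise if jemalloc isn't compiled in):
pa.jemalloc_set_decay_ms(0)
```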
By the way, you may also try memray to further diagnose the issue, but I would recommend selecting the "system" allocator as I don't think memray is able to intercept Arrow's mimalloc/jemalloc allocations.
Also of note: some allocators allow printing useful information. For example mimalloc: https://microsoft.github.io/mimalloc/environment.html and glibc: https://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html
Thanks! These are interesting points. I'll take a look at memray and the memory allocators.
In #45359 I'm adding a method to print memory pool statistics.
Further diagnosis using #45359 suggests that, on a 1000-column * 300-chunk table, around 250 MB of memory is spent on per-chunk metadata, which is about 100 64-bit words per column chunk. I suspect it's a combination of the fields on the various per-chunk structures.
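Back-of-the-envelope check of that figure:

```python
per_chunk = 250e6 / (1000 * 300)       # ~833 bytes of overhead per column chunk
print(per_chunk, per_chunk / 8)        # i.e. roughly 104 eight-byte words
```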
@pitrou to clarify, how does "chunks" map to "number of files in the dataset"? I assume that each file is at least one chunk, but one file can map to multiple chunks if it contains more than one row group?
Awesome, thanks. Can you point me to how you were able to tell 250 MB was spent on column chunk metadata using the memory pool statistics debugging? I think I was only getting high-level summary statistics. I may be seeing what you're saying about column chunk metadata in the profiling output I'm looking at. At the peak (middle of the graph), the top three things using memory seem to be:
What I did in a Python prompt:
The diff between steps 4 and 1 is the space saved when combining the table chunks.
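A hedged reconstruction of that kind of measurement (assuming readings 1 and 4 are memory-pool numbers taken before and after combining chunks; the dataset path is a placeholder):

```python
import pyarrow as pa
import pyarrow.dataset as ds

pool = pa.default_memory_pool()

table = ds.dataset("test_data/", format="parquet").to_table()
before = pool.bytes_allocated()        # (1) bytes held by the chunked table

combined = table.combine_chunks()      # (2)/(3) merge per-row-group chunks
del table                              # drop the heavily chunked version

after = pool.bytes_allocated()         # (4) bytes held by the combined table
print(before - after)                  # space saved by combining chunks
```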
That's interesting, thank you. How many columns and chunks does your reproducer have?
From what I can read in the Dataset source code (but it's quite complex), the row groups for each file will actually be concatenated and then re-chunked according to the scan batch size.
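A hedged pyarrow illustration of that re-chunking (path, column name and batch size are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("test_data/", format="parquet")
# Within each file, rows from its row groups are gathered into record batches
# of at most `batch_size` rows before being handed to the rest of the scan:
table = dataset.scanner(columns=["col_00000"], batch_size=64 * 1024).to_table()
print(table.column("col_00000").num_chunks)
```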
Ahh I see, makes sense.
My repro run was scanning a dataset with 260 .parquet files, each with 1 row and 10,000 columns. Each file contains one row group, so I think that means the dataset contains 260 row groups in total. Looking at the "Arrow chunk" side of things, I'm not sure how many Arrow chunks the data materializes to -- I also just remembered that in my repro I'm iterating over batches of the data instead of accumulating it into a table, so theoretically it shouldn't accumulate overhead attached to Arrow column chunk objects?
Well, if you're only reading one column, the Arrow column chunks will not be a problem then. Looks like the memory consumption is entirely on the Parquet metadata handling side.
Adding some results:
(This is using a C++ repro scanning a dataset with 250 files, 10k columns, and 200-character-long column names; one scan. I'll share the new C++ code I used for generating this test data and getting these results tomorrow, after I clean it up a bit, in case it's useful.)
I printed out info about the default memory pool after every batch is read from the scanner. My shaky understanding of Arrow memory pools and allocators is that the memory usage I'm hoping to reduce is memory that is not allocated on the Arrow memory pool?
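A Python equivalent of that check (the C++ repro itself is separate; psutil and the path/column name are assumptions): if the process RSS keeps growing while the pool counters stay flat, the growth comes from memory that is not allocated through the Arrow memory pool, e.g. Thrift-deserialized Parquet metadata allocated with plain new/malloc.

```python
import os
import psutil                          # assumption: psutil is installed
import pyarrow as pa
import pyarrow.dataset as ds

pool = pa.default_memory_pool()
proc = psutil.Process(os.getpid())

dataset = ds.dataset("test_data/", format="parquet")
for i, batch in enumerate(dataset.to_batches(columns=["col_00000"])):
    if i % 50 == 0:
        print(i, "pool:", pool.bytes_allocated(), "rss:", proc.memory_info().rss)
```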
I see, that's fair.
I did some memory profiling. Here's the memory usage graph of a C++ program that scans my synthetic "250 files (one row per file), 10k columns, 200-character-long column names" dataset two times:
And for good measure, here's the same thing but on a dataset with twice as many files (from 250 files -> 500 files) to show memory accumulation better:
Overall, clearing the cached metadata after each scan seems like it would help here.
It's not obvious that clearing the cached metadata would be safe, though.
Superficially (in the context of "scan node") it looks safe from what I can tell from the code:
Where is
In summary, it seems that today clearing it would be safe. I locally added the below check to the end of your "Parquet cached metadata" test and it runs OK. I think this exercises that code path.
Whether it makes sense in the overall design is another question.
Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I have observed a memory leak when loading a Parquet dataset, which I think is related to the file metadata.
I ran with PyArrow 19.0.0. Here is the code to repro:
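Roughly, the repro boils down to reading the whole dataset into a Table with pyarrow.dataset; a hedged sketch (`test_data/` is a placeholder for the generated dataset):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("test_data/", format="parquet")
table = dataset.to_table()
print(table.num_rows, table.num_columns, table.nbytes)
```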
Here is the description of the dataset:
The dataset roughly looks like this:
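For a concrete picture, a hedged sketch that writes a dataset of roughly this shape (counts taken from the discussion above; the path and column names are made up):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

num_files, num_columns = 260, 10_000
names = [f"col_{i:05d}" for i in range(num_columns)]

os.makedirs("test_data", exist_ok=True)
for f in range(num_files):
    # One row per file, 10,000 float64 columns -> tiny data, huge metadata.
    table = pa.table({name: [float(f)] for name in names})
    pq.write_table(table, f"test_data/part-{f:04d}.parquet")
```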
When running the code above with "time -v", it shows the memory usage is about 6 GB, which is significantly larger than the data loaded, so I think there is some metadata-related memory leak. I also noticed that the memory usage increases if I use longer column names; e.g., if I prepend a 128-character prefix to the column names, the memory usage is about 11 GB.
This issue probably has the same root cause as #37630.
There is a script that can be used to generate the dataset for the repro, but it has permissioned access (due to company policy); I'm happy to give permission to whoever is looking into this:
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb
Component(s)
Parquet, C++