[C++] Potential memory leak in Parquet reading with Dataset #37630
Comments
You could upload it in a gist and link that here, perhaps?
I ended up changing the file extension and that worked :)
Not too familiar with the async/future code (nor with massif outputs), but does this look like the Future objects are somehow not destructed (each of which holds a reference to the parquet metadata object)? (You need to expand the massif output in the visualizer to see the Future object at the bottom of the call chain.)
cc @mapleFU @wgtmac @pitrou @felipecrv (folks who have been opining on the mailing list)
Future in Arrow is a shared future, which means that if a reference is held, the object might not be destructed. I guess it may not be that problem, but I'll try to run the script and reproduce the issue later.
Note that FileFragment in the datasets API caches the parquet metadata (with no option to disable this at the moment). So if you are scanning many files you will see memory grow over the lifetime of the scan as more and more metadata is cached. I would expect a second scan would not grow the memory.
Thanks @westonpace, can you give a pointer to where that is happening?
In
They initially start empty/null and are initialized during a call to |
Hmm, what do you mean by "they are cached"? Doesn't ParquetFileFragment get destructed after we have scanned the fragment?
Oh, I see: we keep a vector of Fragment, and each Fragment has a shared_ptr to the schema of the file.
Yes, the fragments are part of the dataset, so they stay around until the dataset is destroyed, which has to be after the scan.
In a local build I cleared the fragment's cached metadata and manifest after scanning a fragment, and things appear to be in better shape: memory usage goes from 4G to 700+MB. There still seems to be some leaking that I couldn't figure out, but at least things are in much better shape after clearing those two fields.
I was so busy these two weeks, sorry for the late reply. Have you found out the reason? It seems you're suffering from too many arrow::Field objects.
I believe I might be experiencing this same problem through the Python API. Having to maintain a local build of Arrow doesn't sound like the right solution, so I wonder if there are ideas on how to achieve the same result as @icexelloss's last comment, but through Arrow's normal APIs?
Result<std::vector<std::shared_ptr<FileFragment>>>
ParquetDatasetFactory::CollectParquetFragments(const Partitioning& partitioning) {
std::vector<std::shared_ptr<FileFragment>> fragments(paths_with_row_group_ids_.size());
size_t i = 0;
for (const auto& e : paths_with_row_group_ids_) {
const auto& path = e.first;
auto metadata_subset = metadata_->Subset(e.second);
auto row_groups = Iota(metadata_subset->num_row_groups());
auto partition_expression =
partitioning.Parse(StripPrefix(path, options_.partition_base_dir))
.ValueOr(compute::literal(true));
ARROW_ASSIGN_OR_RAISE(
auto fragment,
format_->MakeFragment({path, filesystem_}, std::move(partition_expression),
physical_schema_, std::move(row_groups)));
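    // SetMetadata stores the metadata subset and the manifest in the fragment,
    // so they stay alive for as long as the dataset holds the fragment.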
RETURN_NOT_OK(fragment->SetMetadata(metadata_subset, manifest_));
fragments[i++] = std::move(fragment);
}
return fragments;
}
I noticed that this metadata and manifest are attached to every fragment here and kept alive for as long as the dataset is; could that be where the memory goes?
Maybe, but I think there must be more to it. In my example I have a partitioned parquet dataset on local disk, 8.6GB in total, with 13 columns and 38,747 fragments. Writing this dataset to a new location on disk (i.e. to compact the fragments) consumes all 8GB of RAM on my machine and then swaps to disk. I can't imagine that 13 columns, or even 13×38,747 ≈ 500,000 column-chunk metadata entries, take upwards of 8GB of memory.
@mapleFU Sorry, I have not found out the reason for the seeming leak of "arrow::Field" in the #37630 (comment) post.
I don't think you can at the moment, because there is no way to tell Dataset not to cache this metadata through the normal API. I had a local fix but didn't find time to push it out (the fix is actually quite simple: clear the metadata from a fragment after we are done reading that fragment).
I do plan to push the patch upstream when I have some spare time. FWIW, my local patch simply clears the fragment's cached metadata and manifest once the fragment has been read.
You can call reset() on those shared_ptrs [1].
[1] https://en.cppreference.com/w/cpp/memory/shared_ptr/reset
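A minimal sketch of that idea (a hypothetical helper, not the actual local patch; the member names metadata_ and manifest_ are assumptions based on the CollectParquetFragments snippet above):

// Hypothetical sketch, not the actual patch: once a fragment has been fully
// scanned, drop its cached footer metadata and schema manifest so that memory
// can be reclaimed even though the fragment object itself stays alive in the
// dataset's fragment vector.
void ParquetFileFragment::ClearCachedMetadata() {
  metadata_.reset();  // releases this fragment's parquet::FileMetaData subset
  manifest_.reset();  // releases the cached schema manifest
}

The scanner would call this after emitting the last batch of a fragment; a later scan of the same fragment would then re-read the file footer instead of reusing the cache.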
We should definitely make metadata caching an optional feature of the scanner and/or dataset. I think the API could be as simple as...
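For illustration, a hedged sketch of such an option (the field name and the struct it lives on are assumptions, not the actual Arrow API):

// Illustrative only: a single boolean knob on the scan options controlling
// whether fragments keep their parsed parquet metadata after being scanned.
struct ScanOptions {
  // ... existing scan options ...

  // If false, a fragment's cached metadata/manifest is released as soon as the
  // fragment has been scanned; a later scan re-reads the file footer instead.
  bool cache_metadata = true;
};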
Any place that is using a dataset "temporarily" should also set this to false (e.g. when we run pyarrow.parquet.read_table it creates a dataset and scans it; that temporary dataset should NOT cache metadata).
Oh, in your graph the
I've submitted a draft PR that adds a cache_metadata option. Here are some memory consumption measurements on a synthetic dataset similar to the one in #45287:
$ /usr/bin/time -f "\n%E real\n%U user\n%S sys\n%M kB peak RSS" python -c 'import pyarrow.dataset as pd; pd.dataset("/home/antoine/arrow/data/bamboo-streaming-parquet-test-data/1000col-dataset/").to_table()'
0:11.01 real
107.66 user
4.37 sys
2069284 kB peak RSS
$ /usr/bin/time -f "\n%E real\n%U user\n%S sys\n%M kB peak RSS" python -c 'import pyarrow.dataset as pd; pd.dataset("/home/antoine/arrow/data/bamboo-streaming-parquet-test-data/1000col-dataset/").to_table(cache_metadata=False)'
0:10.97 real
109.11 user
3.87 sys
1381696 kB peak RSS
@icexelloss It would be useful if you could test the above PR on a more real-world dataset.
Thanks @pitrou! We actually already have an internal patch similar to what you have now, but still observed higher-than-expected memory usage when scanning a parquet dataset; my colleague @timothydijamco is working on a repro and should have it soon. I also noticed that you also cleared
Describe the bug, including details regarding any error messages, version, and platform.
Version
Arrow 12.0
Platform
Debian 5.4.228
Description
I have been testing "what is the max RSS needed to scan through ~100GB of data in a Parquet dataset stored in GCS using Arrow C++".
The current answer is about ~6GB of memory, which seems a bit high, so I looked into it. What I observed during the process led me to think that there are some potential cache/memory issues in the dataset/parquet C++ code.
Main observations:
(1) As I scan through the dataset, I print out (a) the memory allocated by the memory pool from ScanOptions and (b) the process RSS (a rough sketch of this kind of measurement follows the list below). I found that while (a) stays pretty stable throughout the scan (< 1G), (b) keeps increasing during the scan (roughly linear in the number of files scanned).
(2) I tested the Arrow ScanNode as well as an in-house library that implements its own "S3Dataset" similar to the Arrow dataset; both show similar RSS usage. (This led me to think the issue is more likely in the parquet C++ code than in the dataset code.)
(3) Scanning the same dataset twice in the same process doesn't increase the max RSS.
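For reference, a minimal sketch of how (a) and (b) can be compared during a scan on Linux. This is illustrative only, not the test code from the gist below, and the helper names are made up; the pool passed in would be the one configured on ScanOptions.

#include <fstream>
#include <iostream>
#include <string>

#include <arrow/memory_pool.h>

// Read the process resident set size in kB from /proc/self/status (Linux only).
long ReadRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) return std::stol(line.substr(6));
  }
  return -1;
}

// Called periodically (e.g. per scanned batch): compares what the memory pool
// reports as allocated against what the OS actually charges the process.
void ReportMemory(arrow::MemoryPool* pool) {
  std::cout << "pool bytes_allocated=" << pool->bytes_allocated()
            << " rss_kb=" << ReadRssKb() << std::endl;
}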
Following suggestions from the mailing list, I also did memory profiling of the test program, and the results seem to indicate a potential memory leak.
Test code
https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
Massif output
massif.out.3505660.txt
Component(s)
C++, Parquet