
Converting LargeArray to arrow::Buffer can cause memory leaks #189

Open
arthurp opened this issue Apr 27, 2021 · 2 comments
Labels
bug Something isn't working

Comments


arthurp commented Apr 27, 2021

The obvious way to convert a katana::LargeArray into an arrow::Buffer leaks memory, because Buffer often doesn't own its underlying data: it assumes some other object owns the data and that the Buffer's lifetime is contained within the data's lifetime. Specifically, the following code leaks two blocks of memory:

auto numeric_array_out_indices =
    std::make_shared<arrow::NumericArray<arrow::UInt64Type>>(
        static_cast<int64_t>(num_nodes),
        arrow::MutableBuffer::Wrap(out_indices.release()->data(), num_nodes));

This code, and other code like it that probably exists in the codebase, leaks:

  • The LargeArray instance out_indices, because release is called without passing ownership of the instance to any other object.
  • The buffer out_indices->data(), because arrow::MutableBuffer does not own its data by default, so it will never deallocate the buffer passed to arrow::MutableBuffer::Wrap.

Further, Arrow would not be able to correctly deallocate the LargeArray data pointer anyway, since it makes no assumptions about which allocator was used for any given buffer. So there is no simple take_ownership flag.

The solution will be to do the following:

  • Implement a KatanaMemoryPool subclass of arrow::MemoryPool that uses Katana's NUMA-aware allocator and deallocator. Our memory pool can be passed to arrow::ArrayBuilders, arrow::AllocateBuffer, and even Parquet to control how their memory is allocated and freed (a sketch follows this list).
  • Make katana::LargeArray a subclass of arrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics of LargeArray are simple enough that they should fit into the Buffer interface just fine.
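
A minimal sketch of the first item, assuming the arrow::MemoryPool virtual interface in the Arrow version we currently build against (the exact virtual signatures vary across Arrow releases), and using katana::AllocateNumaLocal / katana::DeallocateNumaLocal as hypothetical stand-ins for whatever NUMA-aware allocation hooks we expose:

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <cstring>
    #include <string>

    #include <arrow/memory_pool.h>
    #include <arrow/status.h>

    class KatanaMemoryPool : public arrow::MemoryPool {
     public:
      arrow::Status Allocate(int64_t size, uint8_t** out) override {
        // Hypothetical NUMA-aware allocation hook.
        *out = static_cast<uint8_t*>(katana::AllocateNumaLocal(size));
        if (*out == nullptr && size > 0) {
          return arrow::Status::OutOfMemory("KatanaMemoryPool: allocation failed");
        }
        bytes_allocated_ += size;
        return arrow::Status::OK();
      }

      arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
        // Simplest correct behavior: allocate, copy, free the old block.
        uint8_t* new_ptr = nullptr;
        ARROW_RETURN_NOT_OK(Allocate(new_size, &new_ptr));
        std::memcpy(new_ptr, *ptr, static_cast<size_t>(std::min(old_size, new_size)));
        Free(*ptr, old_size);
        *ptr = new_ptr;
        return arrow::Status::OK();
      }

      void Free(uint8_t* buffer, int64_t size) override {
        // Hypothetical NUMA-aware deallocation hook.
        katana::DeallocateNumaLocal(buffer, size);
        bytes_allocated_ -= size;
      }

      int64_t bytes_allocated() const override { return bytes_allocated_; }

      std::string backend_name() const override { return "katana-numa"; }

     private:
      std::atomic<int64_t> bytes_allocated_{0};
    };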

In fixing this, be careful of the ownership assumptions in Arrow. Buffer often doesn't own its data, and ArrayData and Array share their data with other instances (to implement zero-copy slicing). These objects generally use shared_ptr to manage the underlying data, but be a little careful.
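
To make the ownership concrete, here is one way to fix the leaking example. This is a wrapper that owns its LargeArray rather than the subclassing proposed above, shown only to illustrate the fix; it assumes katana::LargeArray<T> is movable and exposes data() and size():

    #include <cstdint>
    #include <memory>
    #include <utility>

    #include <arrow/buffer.h>

    template <typename T>
    class OwningLargeArrayBuffer : public arrow::MutableBuffer {
     public:
      explicit OwningLargeArrayBuffer(katana::LargeArray<T>&& array)
          : arrow::MutableBuffer(
                reinterpret_cast<uint8_t*>(array.data()),
                static_cast<int64_t>(array.size() * sizeof(T))),
            array_(std::move(array)) {}

      // The LargeArray (and therefore its allocation) is released only when the
      // last shared_ptr to this Buffer goes away, matching Arrow's assumption
      // that the data outlives the Buffer.
     private:
      katana::LargeArray<T> array_;
    };

    // Usage corresponding to the leaking example above:
    //   auto buffer = std::make_shared<OwningLargeArrayBuffer<uint64_t>>(
    //       std::move(out_indices));
    //   auto numeric_array_out_indices =
    //       std::make_shared<arrow::NumericArray<arrow::UInt64Type>>(
    //           static_cast<int64_t>(num_nodes), buffer);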

@arthurp arthurp added the bug Something isn't working label Apr 27, 2021

arthurp commented Apr 27, 2021

@gurbinder533, this is the issue we talked about on Slack. If you have any comments, please add them. This should probably get assigned to someone soon, though maybe after M1.


arthurp commented Apr 27, 2021

Fixing this with a new MemoryPool could actually improve performance for data loaded via Parquet or passed through some Arrow functions that allocate data, since it would allow them to use our allocators instead of the default allocator.
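
For example (hedged, since the exact reader entry points depend on the Arrow/Parquet version), both array builders and the Parquet reader accept a MemoryPool*, so a pool like the KatanaMemoryPool sketch above could be threaded through:

    #include <memory>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    void Example(const std::shared_ptr<arrow::io::RandomAccessFile>& infile) {
      KatanaMemoryPool pool;  // hypothetical pool from the sketch above

      // Builders allocate their buffers from the pool they are constructed with.
      arrow::UInt64Builder builder(&pool);

      // The Parquet-to-Arrow reader also takes a MemoryPool, so decoded column
      // data would come from Katana's allocator as well.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, &pool, &reader));
    }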
