You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The obvious way to convert a katana::LargeArray into an arrow::Buffer leaks memory because Buffer often doesn't own it's underlying data and assumes some other object owns it and that the Buffer lifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:
This code, and other code like it that probably exists in the codebase, leaks:
The LargeBuffer instance out_indices because of the call to release without passing ownership of the instance to some other object.
The buffer out_indices->data() is leaking because arrow::MutableBuffer does not own it's data by default, so it will never deallocate the buffer passed to arrow::MutableBuffer::Wrap.
Further, arrow would not be able to correctly deallocate the LargeArray data pointer anyway, since arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simple take_ownership flag.
The solution will be to do the following:
Implement a KatanaMemoryPool subclass of arrow::MemoryPool, which would use Katana's NUMA aware allocator and deallocator. Our memory pool can be passed to arrow::ArrayBuilders, arrow::AllocateBuffer, and even parquet to control how it's memory is allocated and freed.
Make katana::LargeArray into a subclass of arrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics of LargeArray are simple enough they should fit into the Buffer interface just fine.
In fixing this, be careful of the ownership assumptions in Arrow. Buffer often doesn't own it's data. ArrayData and Arrayshare their data with other instances (to implement zero-copy slicing). These objects generally use shared_ptr to manage the underlying data, but be a little careful.
The text was updated successfully, but these errors were encountered:
@gurbinder533, this is the issue we talked about on Slack. If you have any comments please add them. This should probably get assigned to someone soon, though maybe after M1.
Fixing this with a new MemoryPool could actually improve performance for data loaded via Parquet or passed through some Arrow functions that allocate data, since it would allow them to use our allocators instead of the default allocator.
The obvious way to convert a
katana::LargeArray
into anarrow::Buffer
leaks memory becauseBuffer
often doesn't own it's underlying data and assumes some other object owns it and that theBuffer
lifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:katana/libgalois/src/analytics/subgraph_extraction/subgraph_extraction.cpp
Lines 92 to 95 in 5a8c284
This code, and other code like it that probably exists in the codebase, leaks:
out_indices
because of the call to release without passing ownership of the instance to some other object.out_indices->data()
is leaking becausearrow::MutableBuffer
does not own it's data by default, so it will never deallocate the buffer passed toarrow::MutableBuffer::Wrap
.Further, arrow would not be able to correctly deallocate the
LargeArray
data pointer anyway, since arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simpletake_ownership
flag.The solution will be to do the following:
KatanaMemoryPool
subclass ofarrow::MemoryPool
, which would use Katana's NUMA aware allocator and deallocator. Our memory pool can be passed toarrow::ArrayBuilder
s,arrow::AllocateBuffer
, and even parquet to control how it's memory is allocated and freed.katana::LargeArray
into a subclass ofarrow::MutableBuffer
, so that it can be passed directly into Arrow without conversion. The semantics ofLargeArray
are simple enough they should fit into theBuffer
interface just fine.In fixing this, be careful of the ownership assumptions in Arrow.
Buffer
often doesn't own it's data.ArrayData
andArray
share their data with other instances (to implement zero-copy slicing). These objects generally useshared_ptr
to manage the underlying data, but be a little careful.The text was updated successfully, but these errors were encountered: