Access memory address of start of chunk #6320
Comments
@ritchie46 Looks like this only works for Series with an integer dtype. I tried with boolean and string types and I get an error. Is there a chance we can get this to work for all dtypes that are supported by the interchange protocol? That is, no nested types.
Boolean we can still add. But for other datatypes it doesn't make any sense. A string, for instance, is not really useful without its offsets. It is represented as a list of bytes, i.e. nested.
On second thought, I also think the boolean buffer is useless without an offset into that array.
For string columns, I create an offsets buffer like this (maybe this is completely off the mark, but it made sense to me):
Of course, it would be more efficient if I could just access the underlying offsets buffer. I'm not sure why boolean buffers would need offsets, since they are not variable length?
But shouldn't it be zero copy? The booleans are represented as a bitmask in a byte slice. Given an array of bytes, you need to know at which bit the data starts in the first byte and how many bits are valid (i.e. the length).
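To illustrate why a bit-packed boolean buffer needs both a bit offset and a length, here is a minimal sketch in plain Python. It assumes least-significant-bit-first ordering within each byte, as in the Arrow format; the function name `unpack_bitmask` is hypothetical and not part of polars or pyarrow.

```python
def unpack_bitmask(buf: bytes, bit_offset: int, length: int) -> list[bool]:
    """Read `length` booleans from a bit-packed buffer, starting at `bit_offset`.

    Assumes LSB-first bit order inside each byte (an assumption of this sketch).
    """
    out = []
    for i in range(length):
        pos = bit_offset + i                 # absolute bit position in the buffer
        byte = buf[pos // 8]                 # byte that holds this bit
        out.append(bool((byte >> (pos % 8)) & 1))
    return out


# One byte, bits from least to most significant: 0 0 1 0 1 1 0 1
packed = bytes([0b10110100])

# The same bytes give different values depending on offset and length.
from_bit_2 = unpack_bitmask(packed, bit_offset=2, length=4)
from_bit_0 = unpack_bitmask(packed, bit_offset=0, length=4)
```

Without the offset and length, a consumer holding only the pointer to `packed` cannot tell which of the two readings is the intended one.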
I admit I got a bit lazy when I saw all the corners that were cut in the Pandas implementation; indeed, that code is not zero copy. But I think we're not going to finish this without cutting a few corners ourselves. I think I'm going to go a different route for now for the interchange: I'll write something that utilizes the pyarrow implementation of the protocol, and throw an error when the user specifies a zero-copy requirement and there are categoricals in there. And then we can work our way up from there.
The offsets are measured in bytes, not in chars. I can give access to that array. Pyarrow could also give it. Maybe it is good to use pyarrow and polars as a hybrid for this.
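The difference between byte offsets and character offsets can be sketched as follows. This is an illustration only, not the code from the comment above and not how polars builds its buffers; `build_utf8_buffers` is a hypothetical helper that copies the data, so it is not zero copy.

```python
def build_utf8_buffers(values: list[str]) -> tuple[bytes, list[int]]:
    """Concatenate strings as UTF-8 and record byte offsets.

    offsets[i] and offsets[i + 1] delimit the bytes of values[i].
    """
    data = bytearray()
    offsets = [0]
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # length in bytes, not in characters
    return bytes(data), offsets


values = ["a", "é", "日本"]
data, offsets = build_utf8_buffers(values)

# Character counts are 1, 1, 2, but the UTF-8 byte lengths are 1, 2, 6.
char_offsets = [0, 1, 2, 4]

# Slicing the data buffer by byte offsets recovers each string.
decoded = [
    data[offsets[i]:offsets[i + 1]].decode("utf-8")
    for i in range(len(values))
]
```

Slicing `data` with `char_offsets` instead would cut through multi-byte characters, which is why the offsets have to be counted in bytes.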
Yeah I should've used
I'll get back to you on this, let me wrestle with the pyarrow thing for a bit! pyarrow 11.0.0 was released recently, and it includes the protocol.
Required for #5662
In order to finish the DataFrame Interchange Protocol, we need to be able to specify the memory address of where a chunk starts.
Definition stated in the protocol: "Pointer to start of the buffer as an integer."
This should be available as a method on the PySeries Rust object. It does not need to be a method of the Series class - this is strictly for use in the interchange protocol.
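As a rough illustration of what "pointer to start of the buffer as an integer" means, the sketch below uses only the Python standard library on a plain bytearray. It does not show the requested PySeries method; `buffer_address` is a hypothetical helper, and the returned address is only valid while the bytearray is alive and not resized.

```python
import ctypes


def buffer_address(buf: bytearray, byte_offset: int = 0) -> int:
    """Return the memory address of buf[byte_offset] as a plain Python int."""
    if not buf:
        raise ValueError("buffer must not be empty")
    if not 0 <= byte_offset < len(buf):
        raise ValueError("byte_offset out of range")
    # Create a ctypes view that shares memory with `buf` (no copy is made).
    view = (ctypes.c_uint8 * len(buf)).from_buffer(buf)
    return ctypes.addressof(view) + byte_offset


data = bytearray(b"\x0a\x0b\x0c\x0d")

# Address of the first byte, and of a chunk that starts two bytes in.
start = buffer_address(data)
chunk_start = buffer_address(data, byte_offset=2)

# A consumer can read the bytes back from the integer address alone.
roundtrip = ctypes.string_at(start, len(data))
chunk_bytes = ctypes.string_at(chunk_start, 2)
```

A consumer of the protocol would combine such an integer with the buffer size and dtype to reinterpret the memory without copying it.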