to_iterable #109
Comments
I'd say that yes, adding Of course, longer-term it'd be also great if Matplotlib & co. gained support for |
Right, thanks - I think this is good enough for now |
If we would be pragmatic, we would add |
@jorisvandenbossche I'll note that

    class Column:
        def __dlpack__(self, *, stream=None):
            if stream is not None:
                raise NotImplementedError('my_df_lib does not support CUDA streams')
            # `_arr` is the numpy array that you'd want to return if you had implemented __array__
            return self._arr.__dlpack__()

And then Do you agree that that is pragmatic enough? |
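As a rough illustration of the proposal above, here is a minimal sketch of a NumPy-backed column that can be consumed with `np.from_dlpack`. The constructor and the `__dlpack_device__` method are assumptions added for completeness (the DLPack protocol pairs `__dlpack__` with `__dlpack_device__`); this is not taken from any dataframe library.

```python
import numpy as np


class Column:
    """Hypothetical column wrapper backed by a NumPy array."""

    def __init__(self, arr):
        self._arr = np.asarray(arr)

    def __dlpack__(self, *, stream=None):
        if stream is not None:
            raise NotImplementedError("this sketch does not support CUDA streams")
        # Hand out the DLPack capsule of the underlying NumPy array.
        return self._arr.__dlpack__()

    def __dlpack_device__(self):
        # Assumption: report the device of the underlying array (CPU for NumPy).
        return self._arr.__dlpack_device__()


column = Column([1.0, 2.0, 3.0])
# The consumer only needs NumPy, not the dataframe library itself.
values = np.from_dlpack(column)
```

The resulting `values` is an ordinary `numpy.ndarray`, so it could be passed on to a plotting call without the consumer knowing which library produced the column.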
That only works for numeric data types and for dataframe libraries that use dlpack-compatible memory under the hood, though? So for example, I don't think that works for datetimes and strings (either as numpy's fixed width dtype or as object dtype)? Both those are supported by matplotlib when using numpy arrays. And what are the expectations around implementing |
Ah, good point, that's a gap. There's a standards vs. pragmatism tension there. It'd be nice if that was solved with something that's in principle library-independent (e.g., |
Perhaps let's talk about this tomorrow. Rethinking about:
Sure, but in that case we wouldn't need to do any work on the Standard if they're just going to use the interchange protocol directly? For matplotlib, I think all they need is something they can iterate over. E.g. this can be plotted:

    import matplotlib.pyplot as plt
    import numpy as np

    class MyIter:
        def __init__(self, arr):
            self.arr = arr

        def __getitem__(self, idx):
            return self.arr[idx]

        def __len__(self):
            return len(self.arr)

    myiter = MyIter(np.array([1,2,3]))
    fig, ax = plt.subplots()
    ax.plot(myiter)

We're explicitly ruling out letting consumers iterate over elements in a Perhaps we just need a |
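For context on why an object like `MyIter` is iterable at all: Python falls back to the legacy sequence protocol when a class defines `__getitem__` but no `__iter__`. The sketch below shows only that fallback, without matplotlib; how matplotlib itself consumes such an object is not shown here.

```python
import numpy as np


class MyIter:
    def __init__(self, arr):
        self.arr = arr

    def __getitem__(self, idx):
        return self.arr[idx]

    def __len__(self):
        return len(self.arr)


myiter = MyIter(np.array([1, 2, 3]))

# No __iter__ is defined: iteration calls __getitem__ with 0, 1, 2, ...
# and stops when the underlying array raises IndexError.
elements = [int(x) for x in myiter]
```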
That example doesn't make it dataframe library independent, as matplotlib would then still need to use some specific dataframe library (pandas in your example) to get the actual data, while all it wants is an array. I think a goal should be that libraries like matplotlib could accept any dataframe-like object without having to rely on a specific one being installed. Tapping into Marco's latest comment, it also doesn't necessarily need to hardcode "numpy". We could also have a |
I think I quite like this idea. That allows pandas to return a numpy array, cuDF a cupy array, and so on. The main follow-up question I have here is: what guarantees do we give about the returned array object? |
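To make the idea concrete, here is a minimal sketch in which each library returns its own native array type from a column method. The method name `to_array`, both column classes, and the `values_for_plotting` helper are hypothetical and chosen only for illustration; the stdlib `array.array` stands in for a non-NumPy backend.

```python
import array

import numpy as np


class NumpyBackedColumn:
    """Hypothetical column from a NumPy-based dataframe library."""

    def __init__(self, values):
        self._arr = np.asarray(values, dtype="float64")

    def to_array(self):  # hypothetical method name
        return self._arr


class StdlibBackedColumn:
    """Hypothetical column from a library with a different native array type."""

    def __init__(self, values):
        self._arr = array.array("d", values)

    def to_array(self):  # hypothetical method name
        return self._arr


def values_for_plotting(column):
    # The consumer never imports the dataframe library; it only converts
    # whatever array object the column hands back.
    return np.asarray(column.to_array())


a = values_for_plotting(NumpyBackedColumn([1, 2, 3]))
b = values_for_plotting(StdlibBackedColumn([1, 2, 3]))
```

This leaves open exactly the question raised above: the consumer still has to know what it may assume about the returned object (here, only that `np.asarray` accepts it).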
I think there's two separate things folks may want here:
The two could be the same object or they could be different objects. I.e., you could imagine a distributed library where (2) returns a distributed array implementation whereas (1) guarantees local memory. |
Right, let's try to get this in, as it's a fairly important one. We can always revisit later if what we get into the first version isn't good enough.
Concretely, what methods does the return value need to have? You wrote above that |
That's a good point. For (1) I think
I think we should focus on (1) - interchange to an actual array object containing the data. For that, I'd say the primary spec should be "
|
does it work for bool?
In [19]: np.array([True, True]).__dlpack__()
---------------------------------------------------------------------------
BufferError Traceback (most recent call last)
Cell In [19], line 1
----> 1 np.array([True, True]).__dlpack__()
BufferError: DLPack only supports signed/unsigned integers, float and complex dtypes. |
argh, |
The |
And maybe also a target dtype? (although the question then becomes how to specify this dtype, unless that's something the array API spec has resolved?) |
let's take the conversation on |
regarding if we have |
I think before we can commit to this we need to have alignment on what type is returned from |
this sounds fine, we can note that it's implementation-specific |
Sounds fine to me too - and that is generically true for any Python objects (scalars, tuples, etc.); they can (almost?) always be replaced by duck-type-compatible objects that the library prefers. |
It was brought up in the last call that there's probably a need to be able to access the values inside a Column - for example, if passing a DataFrame to a plotting library, to be able to do:

Problem is - how to do that?

Should each Column have a __dlpack__ method, so that one can call e.g. np.from_dlpack(column) and get a numpy array they can pass to matplotlib?