-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework "NA" (i.e., null, invalid) handling in PropertyViews #25
Comments
I'm definitely open to improvements here, as the current state isn't the greatest. A couple interesting notes:
Because of point 1, we've already been able to load and access data with nulls while preserving the assumption of validity in analytics. But point 3 still leaves significant challenges. Do you have an overall vision of how to address this? |
To be clear, nothing I'm talking about here will affect My plan was simply to remove BTW, I am concerned about the "write to Arrow Array thing". But I'm not trying to address that here, and I'm leaving that to the RDG people. |
It would induce substantial changes to distributed because the Partitioner uses PropertyView on arrays with nulls. That's on the enterprise repo as well though, so I don't have a good answer. |
This would break the partitioner. Properties are expected to be sparse, so I don't understand why it is an error to create a properties array with nulls? |
The issue with nulls in PropertyViews is that they only work correctly if the PropertyView is not mutated (or at least the sparsity doesn't change). PropertyView doesn't update the null bitmap when writes occur. This is dangerously inconsistent I think. What if we split PropertyView in to two parts: PropertyView which supports nulls as it does now, but will no longer allow writes (it will return const references); and MutablePropertyView which allows writes (non-const references), but returns an error if the property is sparse. This is a very preliminary plan. I'm not sure how this would be handled in PropertyGraph, yet. Maybe separate accessors for sparse properties that always return const references. In general, will this approach work for the partitioner? Does it need to mutate the data in it's sparse PropertyViews? |
It might be reasonable to assert no nulls in |
Great.This issue can wait until that stuff lands. I mainly created this issue to make sure we didn't forget that there was a potential problem that needs to be fixed eventually. |
Other than PODPropertyView, other views are read-only views. I'm strongly in favor of limiting all these views, including PODPropertyView, to read-only views. For read-write use csaes, @danielmawhirter has some wrappers around arrow Builders. We should be using those for read-write use cases. The partitioner uses PropertyView for read-only views because it handles string and boolean types in addition to POD types. So limiting to PropertyViews to read-only does not affect it. |
Looks like they're in this repo, https://github.com/KatanaGraph/katana/blob/master/libgalois/include/galois/ArrowRandomAccessBuilder.h The version that's there currently uses vector for its initial storage, and converts to arrow at the end. The unfinished part I mentioned removes the need for vector (at least for POD properties). I'll ping this issue when I get to working on that! |
If I understand it correctly, there are basically three current use-cases for PropertyView:
I don't think the builder API is going to support case 1 that well. My suggestion is the @danielmawhirter + @arthurp approach:
|
@ddn0 makes good points here. I support this overall.
I don't think we can safely support nulls unless we use a setter API instead of an "assign to reference" API. And writing to bitmaps in parallel is really hard (oh the false sharing). So I think we will want to ban nulls on mutable views and only reintroduce them if we have a good design and a specific need. |
I'm uncomfortable about (1) both due to the restriction on nulls and due to the restriction to fixed-width types. However, I'm not strongly against it. |
@roshandathathri can you provide specific reasons for your discomfort? Such things are very useful. Like a use case or an expectation that a user might have. |
My discomfort about (1) is about whether we can automatically check and enforce the assumptions like fixed-width types and no nulls without facing an undue burden to the application developer (who is the user I'm concerned about here). If there's a clear and enforceable choice for the application developer, then I would be more comfortable with it. As an example consider applications like matrix-completion and mrbc where the node data is a vector. If we are developing these applications, then we would know which datatype is best to use. If they are implemented by users of Katana (python interface), do we have the same confidence that they can pick the right datatype? |
Currently handling of Arrow invalid values (also called nulls and NAs) in PropertyViews, is ad hoc and probably not what people would expect. Currently reading the valid bit works (
IsValid
), but the valid bits are never updated even when the user writes to the property data.We don't really expect to use NAs much at all and there are lots of tools to eliminate NAs (e.g., in pandas). So my approach would be to remove all support for NAs and add checks to fail as early as possible if an NA is encountered. Arrow has specific support for arrays that never contain NA; they are represented with no null_bitmap. We should be able to use that to eliminate many dangerous cases.
The main reason we may need to do this is that in the python ecosystem the arrow arrays we use may originate from other libraries that has NAs (like pandas) and those NAs will be maintained by pyarrow. So we need to be aware of them and provide useful errors so users know to eliminate NAs.
The text was updated successfully, but these errors were encountered: