-
Notifications
You must be signed in to change notification settings - Fork 10
Conversation
I personally think the MLLabelUtils.jl interface needs a lot of cleaning, but for now it is copied over as-is, and we can save that for later work. |
this PR needs a rebase |
I'm not sure I understand the keyword arg problem this PR is trying to fix. I'm pretty sure though that I very much favour the clarity of
|
Shadowing is not bad, but I fear a situation where someone overrides Since we had the routing getobs(x, i; obsdim) = getobs(x, i) If a type The alternative is removing the default rerouting and to force implementers to define |
But we do want people to have their own internal implementation module MyModule
import LearnBase
LearnBase.getobs(d::MyDataset, i; obsdim) = _getobs(d, i, obsdim)
# Some internal function of MyModule
_getobs(d, i, obsdim) = ...
end I'm not sure I understand the concern
Yes I think we should remove the routing and tell implementers to implement
even if obsdims won't make sense in their case. The alternative is to be more lax and allow for the implementation
but then datasets' consumers such Dataloaders cannot rely on the presence of obsdims. For DataLoaders this is fine, e.g. I didn't use it in FluxML/Flux.jl#1683, but maybe a lot of code in MLDataPattern will have to be changed ? |
b15e1f4
to
cc0aec4
Compare
Okay I have rebased and swapped back to the old interface. Array and tuple code has been migrated back over. I opted to use the implementations in MLDataPattern.jl for consistency, but we can drop the error checking if we want. We also need to decide whether to move over the |
Something that we could put in the pipeline for the next breaking release (not necessarily in this PR) is to stop pirating |
Do we want every method to have to handle the |
…e the implementation is now there
Sorry to hijack this thread, but have you considered using the Tables.jl interface to represent data? It's very flexible and can be used not only with tabular types such as |
@nalimilan I think integration with Tables.jl would be nice, but it only handles a subset of data needed for ML use cases. Higher dimensional arrays, images, graphs, audio/waveforms and more don't really fit a tabular interface. |
Also note that the interface that we are "refreshing" is also an established interface (though maybe not as popular as Tables.jl). |
OK. I just hoped we could find consistent definitions for table-like cases across JuliaML and JuliaStats. In particular, the fact that Tables.jl and JuliaStats packages use rows as observations, but that JuliaML packages use column as observations by default isn't great. |
I'm not familiar with how JuliaStats packages handle array inputs, do you mean they always slice/index the first dimension? Either way, that shouldn't affect integration with Tables.jl because we've managed to successfully integrate both interfaces before. The code in FastAI could use some updating to work with RE coordination, I mentioned in JuliaML/MLUtils.jl#2 (comment) that it would be great to have folks from each org talking again. I'm not sure how these ecosystems became so siloed in the first place, but it benefits nobody to keep things that way. So if you're game, we could look into setting something up in the new year :) |
Worth noting that this code should eventually be "standardized" in MLDatasets.jl as per JuliaML/MLDatasets.jl#73. |
In general when a matrix is passed observations are in rows and variables in columns: that e.g.
Cool!
Sure, let's do that. Some of the discussion could continue at JuliaStats/StatsAPI.jl#3, but Slack might be a good venue for less formal exchanges. |
I was notified about this discussion here but since I'm not involved in this package at all feel free to ignore my comment 🙃 I think it is quite annoying in general to have to propagate and support For convenience, one could still support An additional advantage of using vectors, and e.g. |
You're not the only one, see the prototyping work going on at JuliaML/MLUtils.jl#1. |
Closing in favor of #55. |
Based on comments, the proposed interface here is basically the same as before with one small exception: authors of new types must specify the
obsdim
keyword. There is no longer any default rerouting that drops the keyword.I also made a couple changes to reduce the complexity of the type hierarchy and added some defaults. Now, if a type doesn't define
getobs(x, idx)
, it defaults togetindex(x, idx)
(we can remove this if people don't like it). The only abstract types are nowAbstractDataContainer
andAbstractDataIterator
. At minimum, a type must definegetobs(x, idx)
andnobs(x)
to buy into the interface. If it also subtypesAbstractDataContainer
, thengetindex
anditerate
are defined for that type based ongetobs
. This should reduce the overhead / boilerplate of adopting this interface by default. I also addedAbstractDataIterator
though it doesn't serve a purpose just yet. This is mainly forward-looking to address things like: lorenzoh/DataLoaders.jl#26I should also note that @racinmat has taken over the corresponding MLDataPattern.jl and MLLabelUtils.jl PRs to see this work through. I haven't had the bandwidth to work on it, so I appreciate their help finishing the job.