Add parallel and shuffle support to eachobs and DataLoader #82
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main      #82      +/-  ##
==========================================
+ Coverage   89.97%   90.27%   +0.29%
==========================================
  Files          14       13       -1
  Lines         479      473       -6
==========================================
- Hits          431      427       -4
+ Misses         48       46       -2
```

Continue to review the full report at Codecov.
Just saw that …
Co-authored-by: Carlo Lucibello <[email protected]>
Based on the feedback, a single … Is being able to pass a preallocated buffer necessary? The code would be cleaner without it, and based on the arguments to …
Hence I suggest we drop that behavior, and make …
And we could rename the … Since we need a stateful object in any case, it makes sense to just use the DataLoader we already have, and …
I'm not sure; I've never used it. But it would be good to avoid breaking changes if not necessary.
True, it's looking like … In any case, this adds buffering and parallel loading to it.
So which of the three buffers should it accept? Should it validate them? The original docstring also does not say anything about this, so I wonder if it was used: …
With the parallel stuff added, I feel this would bloat the code a little. We can easily make this non-breaking simply by allowing a buffer to be passed in, but not using it and instead setting …
…ML/MLUtils.jl into lo/unify-eachobs-reshuffle
I changed it so that it is now very similar to …
In terms of the naming, I suggest either consolidating around …
Beyond backward compat, is …
…ML/MLUtils.jl into lo/unify-eachobs-reshuffle
Batch size 1 is not the same as no batching, e.g. …
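To make the distinction concrete, here is an illustrative snippet (not from the thread) using `getobs` and `BatchView` as they exist in MLUtils: without batching an observation is returned as-is, while a batch of size 1 keeps its batch dimension.

```julia
using MLUtils

X = rand(3, 4)   # 4 observations stored as columns

# No batching: each observation is a plain vector.
getobs(X, 1)                         # 3-element Vector{Float64}

# Batch size 1: each batch keeps the trailing batch dimension.
first(BatchView(X; batchsize = 1))   # 3×1 Matrix{Float64}
```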
I agree. I was thinking that …
Ah, missed that, thanks. Here's a slightly different implementation that I prefer. I think it offers greater flexibility than boolean switches while still having the user-facing convenience that we want.

```julia
function eachobs(
        data;
        buffer = false,
        parallel = false,
        shuffle = false,
        batchsize::Int = -1,
        partial::Bool = true,
        rng::AbstractRNG = Random.GLOBAL_RNG)
    buffer = buffer isa Bool ? buffer : true
    initialize = if batchsize > 1 && shuffle
        x -> BatchView(shuffleobs(x); batchsize, partial)
    elseif batchsize > 1
        x -> BatchView(x; batchsize, partial)
    elseif shuffle
        shuffleobs
    else
        identity
    end
    return EachObs(data, initialize, buffer, parallel, rng)
end

struct EachObs{T, F, R<:AbstractRNG}
    data::T
    initialize::F
    buffer::Bool
    parallel::Bool
    rng::R
end

function Base.iterate(e::EachObs)
    data = e.initialize(e.data)
    iter = if e.parallel
        eachobsparallel(data; e.buffer)
    else
        if e.buffer
            buf = getobs(data, 1)
            (getobs!(buf, data, i) for i in 1:numobs(data))
        else
            (getobs(data, i) for i in 1:numobs(data))
        end
    end
    obs, state = iterate(iter)
    return obs, (iter, state)
end

function Base.iterate(::EachObs, (iter, state))
    ret = iterate(iter, state)
    isnothing(ret) && return
    obs, state = ret
    return obs, (iter, state)
end

Base.length(e::EachObs) = numobs(e.initialize(e.data))
Base.eltype(e::EachObs) = eltype(e.initialize(e.data))
```

We could also offer a more declarative syntax like …
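If the sketch above were adopted, usage might look like the following. The keyword names are taken from the sketch, not from a released API:

```julia
X = rand(3, 100)   # 100 observations stored as columns

# Plain sequential iteration over single observations.
for x in eachobs(X)
    # x is a length-3 vector, one column of X
end

# Reshuffled every epoch, batched, buffered, and loaded in parallel.
for batch in eachobs(X; shuffle = true, batchsize = 16, buffer = true, parallel = true)
    # batch is a 3×16 matrix (the last one may be smaller since partial = true)
end
```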
Possibly as a performance optimization, …
@darsnack I think adding an initialization function is a useful addition, but I'm wondering about the semantics. One issue I see with making shuffling and batching part of this initialization function is that when one wants to add a custom initialization function "on top", they would have to rewrite the shuffling and batching part inside that custom function. Maybe a …

@CarloLucibello I think it's clear that only one of …
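One possible direction, purely illustrative and not necessarily what the truncated comment goes on to propose: compose a user-supplied step with the default shuffle/batch step rather than baking everything into a single `initialize`.

```julia
using MLUtils

# Hypothetical composition: a custom preprocessing step chained with the
# shuffle-and-batch step, instead of restating the latter inside the former.
custom_step(x) = x                                                  # placeholder for user preprocessing
default_step(x) = BatchView(shuffleobs(x); batchsize = 16, partial = true)

initialize = default_step ∘ custom_step   # custom step runs first, then shuffle + batch
initialize(rand(3, 100))                  # a BatchView over the shuffled, preprocessed data
```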
As for the naming, `eachobs(args...; kws...) = DataLoader(args...; kws...)` …
What if we did …
Re …: As long as we store the shuffling/batching logic within …
@lorenzoh I took the liberty of pushing this forward and entirely replacing the previous DataLoader implementation with this. Right now the only difference between …
```diff
@@ -11,7 +11,7 @@
     for (i,x) in enumerate(eachobs(X, buffer=b))
         @test x == X[:,i]
     end
-    @test b == X[:,end]
+    @test_broken b == X[:,end]
```
This is the only test that is now failing with this PR. Why is that? It seems we don't use the buffer at all...
@lorenzoh what changed here?
See #82 (comment); I wasn't sure what the right semantics here are, so I didn't implement the preallocated buffer handling. See also #82 (comment).
Is this change okay?

I wouldn't consider this breaking in any case, since `ret = getobs!(buf, data, i)` will fall back to `getobs`, so one should always be using the return value and not rely on `buf` being mutated.
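A small self-contained sketch of that fallback (the `ToyData` container is made up for illustration): when a data type implements only `getobs`, `getobs!` silently ignores the buffer, so the return value is the only reliable result.

```julia
using MLUtils
import MLUtils: numobs, getobs

# A toy container that implements `getobs` but no in-place `getobs!`.
struct ToyData
    x::Matrix{Float64}
end
numobs(d::ToyData) = size(d.x, 2)
getobs(d::ToyData, i) = d.x[:, i]

d   = ToyData(rand(3, 5))
buf = zeros(3)

ret = getobs!(buf, d, 1)   # no getobs! method for ToyData, so this falls back to getobs(d, 1)
ret == d.x[:, 1]           # true: the observation comes back as the return value
buf == zeros(3)            # true: buf was never touched, so don't rely on it being mutated
```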
Maybe we can reinstate the old behavior only in the `parallel = false` branch. But it's also fine to keep this PR as it is and consider it non-breaking.
Title changed from "Add parallel and shuffle support to eachobs" to "Add parallel and shuffle support to eachobs and DataLoader".
Should I merge? This + #88 + a …
Closes #66. Closes #13.

Unifies data iteration under `eachobs` as discussed in #66.

- `parallel = true` uses the parallel data loader. It is `false` by default since parallel loading does not guarantee correct ordering (see Deterministic, parallel data iteration #68).
- `shuffle = true` reshuffles the observations every time you start iterating (unlike `shuffleobs`). This required changing the implementation of non-parallel `eachobs`: it used generator expressions before, but now has a wrapper type, so that we can reach into it when reshuffling.

With this, I think we can start advertising `eachobs` as the go-to data iteration function. We still need to decide how to proceed with `DataLoader`.
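An illustrative contrast between the per-epoch reshuffling described above and `shuffleobs` (hypothetical snippet; results depend on the RNG):

```julia
using MLUtils

X = collect(1:10)

# shuffleobs fixes one permutation up front; every pass sees the same order.
fixed = shuffleobs(X)
collect(fixed) == collect(fixed)   # true

# shuffle = true draws a new permutation each time iteration starts.
it = eachobs(X; shuffle = true)
collect(it) == collect(it)         # almost always false: reshuffled per epoch
```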