
Add parallel and shuffle support to eachobs and DataLoader #82

Merged
CarloLucibello merged 11 commits into main from lo/unify-eachobs-reshuffle on May 27, 2022

Conversation

lorenzoh
Contributor

@lorenzoh lorenzoh commented Apr 30, 2022

Closes #66. Closes #13

Unifies data iteration under eachobs as discussed in #66.

  • parallel = true uses the parallel data loader; it is false by default since parallel loading does not guarantee correct ordering (see Deterministic, parallel data iteration #68)
  • shuffle = true reshuffles the observations every time you start iterating (unlike shuffleobs). This required changing the implementation of non-parallel eachobs: it previously used generator expressions, but now uses a wrapper type that we can reach into when reshuffling.

With this, I think we can start advertising eachobs as the go-to data iteration function. We still need to decide how to proceed with DataLoader.
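The reshuffle-on-every-iteration behavior can be sketched in plain Julia (the type and names below are illustrative only, not the actual implementation):

```julia
using Random

# Sketch of an iterator that draws a fresh permutation every time iteration
# starts, unlike a one-off shuffleobs. Names here are illustrative only.
struct ReshufflingIterator{T}
    data::Vector{T}
    rng::AbstractRNG
end

function Base.iterate(it::ReshufflingIterator)
    perm = randperm(it.rng, length(it.data))  # new shuffle per epoch
    return iterate(it, (perm, 1))
end

function Base.iterate(it::ReshufflingIterator, (perm, i))
    i > length(perm) && return nothing
    return it.data[perm[i]], (perm, i + 1)
end

Base.length(it::ReshufflingIterator) = length(it.data)
```

Two successive `collect` calls generally yield different orders but always the same multiset of observations.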

@codecov-commenter

codecov-commenter commented Apr 30, 2022

Codecov Report

Merging #82 (5df0d7a) into main (c1dc3c2) will increase coverage by 0.29%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   89.97%   90.27%   +0.29%     
==========================================
  Files          14       13       -1     
  Lines         479      473       -6     
==========================================
- Hits          431      427       -4     
+ Misses         48       46       -2     
Impacted Files Coverage Δ
src/eachobs.jl 100.00% <100.00%> (ø)
src/parallel.jl 94.82% <100.00%> (-0.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1dc3c2...5df0d7a.

@lorenzoh lorenzoh marked this pull request as draft April 30, 2022 09:06
@lorenzoh
Contributor Author

Just saw that BatchView does not collate by default, so actually BatchView needs collating support before replacing DataLoader. But that can be done in another PR.

@lorenzoh lorenzoh marked this pull request as ready for review April 30, 2022 09:10
@lorenzoh lorenzoh requested a review from CarloLucibello April 30, 2022 09:11
@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

Based on the feedback, a single EachObs would require the least code; what is now being done in the constructor eachobs could then be done in the first call to iterate, including the reshuffling.

Is being able to pass a preallocated buffer necessary? The code would be cleaner without it, and depending on the arguments to eachobs, one would have to pass different things:

  • if batchsize == -1 a valid buffer is a single observation
  • if batchsize > 0 a valid buffer is a batch of observations
  • additionally, if parallel = true, one needs a buffer for every thread anyway which will be created inside eachobs.

Hence I suggest we drop that behavior, and make eachobs accept only a Bool for buffer.
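The three cases above can be sketched with plain arrays (illustrative shapes only, not MLUtils internals):

```julia
X = rand(3, 8)  # 8 observations with 3 features each

# batchsize == -1: a valid buffer holds a single observation
buf_obs = similar(X, 3)       # 3-element vector

# batchsize > 0 (say 4): a valid buffer holds a whole batch
buf_batch = similar(X, 3, 4)  # 3×4 matrix

# parallel = true: one buffer per loader thread, allocated internally
bufs = [similar(X, 3) for _ in 1:Threads.nthreads()]
```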

@CarloLucibello
Member

CarloLucibello commented Apr 30, 2022

Based on the feedback, a single EachObs would require the least code; what is now being done in the constructor eachobs could then be done in the first call to iterate, including the reshuffling.

And we could rename the EachObs in this PR to DataLoader and delete the old implementation 🤯

Since we need a stateful object in any case, it makes sense to just use the DataLoader we already have, and eachobs becomes just an alternative syntax to create a DataLoader object.

@CarloLucibello
Member

Is being able to pass a preallocated buffer necessary?

I'm not sure, I've never used it. But it would be good to avoid breaking changes if not necessary.

@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

And we could rename the EachObs in this PR to DataLoader and delete the old implementation 🤯

True, it's looking like DataLoader + buffering + parallel now 🙈 .

Since we need a stateful object in any case, it makes sense to just use the DataLoader we already have, and eachobs becomes just an alternative syntax to create a DataLoader object.

In any case, this adds buffering and parallel loading to it

@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

Is being able to pass a preallocated buffer necessary?

I'm not sure, I've never used it. But it would be good to avoid breaking changes if not necessary.

So which of the three buffers should it accept? Should it validate them? The original docstring also does not say anything about this, so I wonder if it was used:

  • buffer. If buffer=true and supported by the type of data,
    a buffer will be allocated and reused for memory efficiency.

With the parallel code added, I feel this would bloat the code a little. We can easily make this non-breaking simply by allowing a buffer to be passed in, but not using it and instead setting buffer = true. What do you think, Carlo?

@lorenzoh
Contributor Author

I changed it so that it is now very similar to DataLoader + parallel + buffer; I could merge one into the other to consolidate. For now there is no preallocated-buffer support, but if we find good semantics for the different argument combinations (see thread above), I can re-add it (at the cost of some bloat).

@darsnack
Member

In terms of the naming, I suggest either consolidating around DataLoader or eachobs and not having one as convenience for another. If we keep all of them, we don't really solve the confusion from #79 that prompted some of this.

@darsnack
Member

Beyond backward compat, is batchsize = -1 really necessary? Why not just treat batchsize = 1 as this case?

@lorenzoh
Contributor Author

Beyond backward compat, is batchsize = -1 really necessary? Why not just treat batchsize = 1 as this case?

Batch size 1 is not the same as no batching, e.g. [obs] (batch size 1) and obs (no batch).
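In plain Julia, with observations stored as columns:

```julia
X = reshape(1:12, 3, 4)  # 4 observations, 3 features each

obs    = X[:, 1]    # no batching: a 3-element vector
batch1 = X[:, 1:1]  # batch of size 1: a 3×1 matrix

size(obs)     # (3,)
size(batch1)  # (3, 1)
```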

@lorenzoh
Contributor Author

In terms of the naming, I suggest either consolidating around DataLoader or eachobs and not having one as convenience for another. If we keep all of them, we don't really solve the confusion from #79 that prompted some of this.

I agree. eachobs is more idiomatic and in line with the rest of the library while DataLoader is easier both for users coming from PyTorch or migrating from DataLoaders.jl.

I was thinking that DataLoader could refer to the same function but with better defaults for deep learning, i.e. parallel = true and collate = true (once added). But then we get into problems with backwards compatibility again.

@darsnack
Member

darsnack commented Apr 30, 2022

Batch size 1 is not the same as no batching, e.g. [obs] (batch size 1) and obs (no batch).

Ah, missed that, thanks.


Here's a slightly different implementation that I prefer. I think it offers greater flexibility than boolean switches while still having the user-facing convenience that we want.

using Random  # for AbstractRNG and Random.GLOBAL_RNG

function eachobs(
        data;
        buffer = false,
        parallel = false,
        shuffle = false,
        batchsize::Int = -1,
        partial::Bool = true,
        rng::AbstractRNG = Random.GLOBAL_RNG)
    # a non-Bool buffer (i.e. a preallocated one) implies buffering is on
    buffer = buffer isa Bool ? buffer : true

    initialize = if batchsize > 0 && shuffle
        x -> BatchView(shuffleobs(x); batchsize, partial)
    elseif batchsize > 0
        x -> BatchView(x; batchsize, partial)
    elseif shuffle
        shuffleobs
    else
        identity
    end

    return EachObs(data, initialize, buffer, parallel, rng)
end

struct EachObs{T, F, R<:AbstractRNG}
    data::T
    initialize::F
    buffer::Bool
    parallel::Bool
    rng::R
end


function Base.iterate(e::EachObs)
    data = e.initialize(e.data)

    iter = if e.parallel
        eachobsparallel(data; buffer = e.buffer)
    else
        if e.buffer
            buf = getobs(data, 1)
            (getobs!(buf, data, i) for i in 1:numobs(data))
        else
            (getobs(data, i) for i in 1:numobs(data))
        end
    end
    ret = iterate(iter)
    isnothing(ret) && return nothing  # empty data
    obs, state = ret
    return obs, (iter, state)
end


function Base.iterate(::EachObs, (iter, state))
    ret = iterate(iter, state)
    isnothing(ret) && return nothing
    obs, state = ret
    return obs, (iter, state)
end


Base.length(e::EachObs) = numobs(e.initialize(e.data))

Base.eltype(e::EachObs) = eltype(e.initialize(e.data))

We could also offer a more declarative syntax like eachobs(initialize, data; kwargs...) which lets the user define initialize using a do block, or define it as a data pipeline and pass it in. Personally, I find this syntax friendlier than the keywords. E.g. do I set batchsize = 0 or = -1 to disable batching? With a declarative syntax, I know what's going to happen because I just wrote BatchView or I didn't.
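A self-contained sketch of what such a declarative method could look like (SimpleEachObs and eachobs_sketch are hypothetical names, not part of this PR):

```julia
# The user-supplied `initialize` builds the pipeline (shuffling, batching, ...)
# and is applied when iteration starts.
struct SimpleEachObs{T,F}
    data::T
    initialize::F
end

eachobs_sketch(initialize, data) = SimpleEachObs(data, initialize)

function Base.iterate(e::SimpleEachObs)
    pipeline = e.initialize(e.data)  # e.g. BatchView(shuffleobs(d); ...)
    ret = iterate(pipeline)
    ret === nothing && return nothing
    obs, st = ret
    return obs, (pipeline, st)
end

function Base.iterate(::SimpleEachObs, (pipeline, st))
    ret = iterate(pipeline, st)
    ret === nothing && return nothing
    obs, st2 = ret
    return obs, (pipeline, st2)
end

Base.length(e::SimpleEachObs) = length(e.initialize(e.data))

# do-block syntax: the pipeline is spelled out instead of toggled by keywords
it = eachobs_sketch(collect(1:6)) do d
    reverse(d)  # stands in for, say, BatchView(shuffleobs(d); batchsize, partial)
end
collect(it)  # == [6, 5, 4, 3, 2, 1]
```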

@darsnack
Member

Possibly as a performance optimization, EachObs could be mutable and store a cache of e.initialize(data), but at least for the common case variants of e.initialize, construction is cheap.

@lorenzoh
Contributor Author

lorenzoh commented May 4, 2022

@darsnack I think adding an initialization function is a useful addition, but I'm wondering about the semantics. One issue I see with making shuffling and batching part of this initialization function is that when one wants to add a custom initialization function "on top", they would have to rewrite the shuffling and batching inside it. Maybe an eachobs(data; buffer = false, parallel = false, shuffle = false, batchsize::Int = -1, partial::Bool = true, rng::AbstractRNG = Random.GLOBAL_RNG, initialize = identity) where initialize is applied before shuffling and batching? An additional method eachobs(initialize, data; kwargs...) for do syntax could also work then. Could this be addressed in a follow-up PR, or would that require breaking changes?

@CarloLucibello I think it's clear that only one of eachobs and DataLoader needs to be implemented. Can you give your opinion on which should hold the implementation and what should happen to the other symbol?

@CarloLucibello
Member

As for the naming, eachobs is more consistent with the naming convention in the library.
We cannot remove DataLoader since it would break a lot of code, and also having DataLoader(..) return an EachObs type instead of a DataLoader type would be breaking (I myself have some code dispatching on the iterable type).
So maybe the best solution for the time being is to just go with DataLoader and have eachobs as a placeholder:

eachobs(args...; kws...) = DataLoader(args...; kws...)

@lorenzoh
Contributor Author

lorenzoh commented May 4, 2022

What if we did const DataLoader = EachObs? Then I suppose your dispatching code would not break, right?
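A const binding creates an alias, not a new type, so dispatch on either name keeps working (the demo names below are illustrative):

```julia
struct EachObsDemo{T}
    data::T
end

const DataLoaderDemo = EachObsDemo  # alias: one type, two names

# A method written against the alias also matches the original name:
describe(::DataLoaderDemo) = "a data loader"

describe(EachObsDemo([1, 2, 3]))  # "a data loader"
DataLoaderDemo === EachObsDemo    # true
```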

@darsnack
Member

darsnack commented May 4, 2022

Re initialize: I could see an initialize = identity as applying on top of shuffling/batching making sense in the first method that I wrote above. It's easy enough to set shuffle = false and batchsize = -1 to use only the initialize part if someone wants to.

As long as we store the shuffling/batching logic within initialize, we can always add eachobs(initialize, data; kwargs...) later. If shuffle, etc. are stored as fields of the struct, then it would be breaking (though EachObs itself is unexported, so maybe not?). Personally, I would put everything inside initialize since there are few places where we need to replicate the batchsize > 0 code, so the code is cleaner. But I have no strong feelings on this.

@CarloLucibello
Member

@lorenzoh I took the liberty of pushing this forward and entirely replacing the previous DataLoader implementation with this.
I've been conservative and kept the object name DataLoader (I don't want surprises for Flux's users or problems in the docs, etc.). We can change the naming scheme in the future.

Right now the only difference between eachobs and DataLoader is the default value for batchsize (-1 and 1, respectively). The -1 default seems more sensible, but I don't want this PR to be breaking.

@@ -11,7 +11,7 @@
 for (i,x) in enumerate(eachobs(X, buffer=b))
     @test x == X[:,i]
 end
-@test b == X[:,end]
+@test_broken b == X[:,end]
Member

This is the only test that is now failing with this PR. Why is that? It seems we don't use the buffer at all...

Member

@lorenzoh what changed here?

Contributor Author

See #82 (comment); I wasn't sure what the right semantics here are, so I didn't implement the preallocated buffer handling. See also #82 (comment).

Contributor Author

Is this change okay?

I wouldn't consider this breaking in any case, since ret = getobs!(buf, data, i) will fall back to getobs, so one should always be using the return value and not rely on buf being mutated.
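The fallback behavior can be sketched with stand-in functions (demo names; they only mirror the getobs!/getobs contract described above):

```julia
# When a data type does not support in-place loading, getobs! falls back to
# getobs and leaves the buffer untouched; callers must use the return value.
getobs_demo(data, i) = data[i]
getobs_demo!(buf, data, i) = getobs_demo(data, i)  # fallback: buf is ignored

data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
buf = zeros(3)
ret = getobs_demo!(buf, data, 1)

ret  # [1.0, 2.0, 3.0] -- the loaded observation
buf  # still [0.0, 0.0, 0.0]: relying on mutation would be a bug here
```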

Member

Maybe we can reinstate the old behavior only in the parallel = false branch. But it's also fine to keep this PR as it is and consider it non-breaking.

@CarloLucibello CarloLucibello changed the title Add parallel and shuffle support to eachobs Add parallel and shuffle support to eachobs and DataLoader May 22, 2022
@lorenzoh
Contributor Author

lorenzoh commented May 27, 2022

Should I merge? This + #88 + a collate kwarg to DataLoader and I can start transitioning FastAI.jl to MLUtils.jl 🥳

@CarloLucibello CarloLucibello merged commit be64c14 into main May 27, 2022
@CarloLucibello CarloLucibello deleted the lo/unify-eachobs-reshuffle branch June 28, 2022 03:39
Successfully merging this pull request may close these issues: "API for iterator/view variants" and "Add multitasking to DataLoader".