
Add parallel and shuffle support to eachobs and DataLoader #82

Merged
CarloLucibello merged 11 commits into main from lo/unify-eachobs-reshuffle on May 27, 2022

Conversation

lorenzoh
Contributor

@lorenzoh lorenzoh commented Apr 30, 2022

Closes #66. Closes #13

Unifies data iteration under eachobs as discussed in #66.

  • parallel = true uses the parallel data loader; it is false by default since parallel loading does not guarantee correct ordering (see Deterministic, parallel data iteration #68)
  • shuffle = true reshuffles the observations every time you start iterating (unlike shuffleobs). This required changing the implementation of non-parallel eachobs: it previously used generator expressions, but now uses a wrapper type that we can reach into when reshuffling.

With this, I think we can start advertising eachobs as the go-to data iteration function. We still need to decide how to proceed with DataLoader.
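The reshuffle-on-every-iteration behavior can be sketched in plain Julia (the type and names below are illustrative only, not the actual implementation):

```julia
using Random

# Sketch of an iterator that draws a fresh permutation every time iteration
# starts, unlike a one-off shuffleobs. Names here are illustrative only.
struct ReshufflingIterator{T}
    data::Vector{T}
    rng::AbstractRNG
end

function Base.iterate(it::ReshufflingIterator)
    perm = randperm(it.rng, length(it.data))  # new shuffle per epoch
    return iterate(it, (perm, 1))
end

function Base.iterate(it::ReshufflingIterator, (perm, i))
    i > length(perm) && return nothing
    return it.data[perm[i]], (perm, i + 1)
end

Base.length(it::ReshufflingIterator) = length(it.data)
```

Two successive `collect` calls generally yield different orders but always the same multiset of observations.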

@codecov-commenter

codecov-commenter commented Apr 30, 2022

Codecov Report

Merging #82 (5df0d7a) into main (c1dc3c2) will increase coverage by 0.29%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   89.97%   90.27%   +0.29%     
==========================================
  Files          14       13       -1     
  Lines         479      473       -6     
==========================================
- Hits          431      427       -4     
+ Misses         48       46       -2     
Impacted Files Coverage Δ
src/eachobs.jl 100.00% <100.00%> (ø)
src/parallel.jl 94.82% <100.00%> (-0.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1dc3c2...5df0d7a.

@lorenzoh lorenzoh marked this pull request as draft April 30, 2022 09:06
@lorenzoh
Contributor Author

Just saw that BatchView does not collate by default, so actually BatchView needs collating support before replacing DataLoader. But that can be done in another PR.

@lorenzoh lorenzoh marked this pull request as ready for review April 30, 2022 09:10
@lorenzoh lorenzoh requested a review from CarloLucibello April 30, 2022 09:11
@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

Based on the feedback, a single EachObs would require the least code; what is now being done in the constructor eachobs could then be done in the first call to iterate, including the reshuffling.

Is being able to pass a preallocated buffer necessary? The code would be cleaner without it, and depending on the arguments to eachobs, one would have to pass different things:

  • if batchsize == -1 a valid buffer is a single observation
  • if batchsize > 0 a valid buffer is a batch of observations
  • additionally, if parallel = true, one needs a buffer for every thread anyway which will be created inside eachobs.

Hence I suggest we drop that behavior, and make eachobs accept only a Bool for buffer.
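The three cases above can be sketched with plain arrays (illustrative shapes only, not MLUtils internals):

```julia
X = rand(3, 8)  # 8 observations with 3 features each

# batchsize == -1: a valid buffer holds a single observation
buf_obs = similar(X, 3)       # 3-element vector

# batchsize > 0 (say 4): a valid buffer holds a whole batch
buf_batch = similar(X, 3, 4)  # 3×4 matrix

# parallel = true: one buffer per loader thread, allocated internally
bufs = [similar(X, 3) for _ in 1:Threads.nthreads()]
```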

@CarloLucibello
Member

CarloLucibello commented Apr 30, 2022

Based on the feedback, a single EachObs would require the least code; what is now being done in the constructor eachobs could then be done in the first call to iterate, including the reshuffling.

And we could rename the EachObs in this PR to DataLoader and delete the old implementation 🤯

Since we need a stateful object in any case, it makes sense to just use the DataLoader we already have, and eachobs becomes just an alternative syntax to create a DataLoader object.

@CarloLucibello
Member

Is being able to pass a preallocated buffer necessary?

I'm not sure, I've never used it. But it would be good to avoid breaking changes if not necessary.

@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

And we could rename the EachObs in this PR to DataLoader and delete the old implementation 🤯

True, it's looking like DataLoader + buffering + parallel now 🙈 .

Since we need a stateful object in any case, it makes sense to just use the DataLoader we already have, and eachobs becomes just an alternative syntax to create a DataLoader object.

In any case, this adds buffering and parallel loading to it

@lorenzoh
Contributor Author

lorenzoh commented Apr 30, 2022

Is being able to pass a preallocated buffer necessary?

I'm not sure, I've never used it. But it would be good to avoid breaking changes if not necessary.

So which of the three buffers should it accept? Should it validate them? The original docstring also does not say anything about this, so I wonder if it was used:

  • buffer. If buffer=true and supported by the type of data,
    a buffer will be allocated and reused for memory efficiency.

With the parallel code added, I feel this would bloat the code a little. We can easily make this non-breaking simply by allowing a buffer to be passed in, but not using it and instead setting buffer = true. What do you think, Carlo?

@lorenzoh
Contributor Author

I changed it so that it is now very similar to DataLoader + parallel + buffer; I could merge one into the other to consolidate. For now there is no preallocated-buffer support, but if we find good semantics for the different argument combinations (see thread above), I can re-add it (at the cost of some bloat).

@darsnack
Member

In terms of the naming, I suggest either consolidating around DataLoader or eachobs and not having one as convenience for another. If we keep all of them, we don't really solve the confusion from #79 that prompted some of this.

@darsnack
Member

Beyond backward compat, is batchsize = -1 really necessary? Why not just treat batchsize = 1 as this case?

@lorenzoh
Contributor Author

Beyond backward compat, is batchsize = -1 really necessary? Why not just treat batchsize = 1 as this case?

Batch size 1 is not the same as no batching, e.g. [obs] (batch size 1) and obs (no batch).
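In plain Julia, with observations stored as columns:

```julia
X = reshape(1:12, 3, 4)  # 4 observations, 3 features each

obs    = X[:, 1]    # no batching: a 3-element vector
batch1 = X[:, 1:1]  # batch of size 1: a 3×1 matrix

size(obs)     # (3,)
size(batch1)  # (3, 1)
```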

@lorenzoh
Contributor Author

In terms of the naming, I suggest either consolidating around DataLoader or eachobs and not having one as convenience for another. If we keep all of them, we don't really solve the confusion from #79 that prompted some of this.

I agree. eachobs is more idiomatic and in line with the rest of the library while DataLoader is easier both for users coming from PyTorch or migrating from DataLoaders.jl.

I was thinking that DataLoader could refer to the same function but with better defaults for deep learning, i.e. parallel = true and collate = true (once added). But then we get into problems with backwards compatibility again.

@darsnack
Member

darsnack commented Apr 30, 2022

Batch size 1 is not the same as no batching, e.g. [obs] (batch size 1) and obs (no batch).

Ah, missed that, thanks.


Here's a slightly different implementation that I prefer. I think it offers greater flexibility than boolean switches while still having the user-facing convenience that we want.

using Random  # for AbstractRNG and Random.GLOBAL_RNG

function eachobs(
        data;
        buffer = false,
        parallel = false,
        shuffle = false,
        batchsize::Int = -1,
        partial::Bool = true,
        rng::AbstractRNG = Random.GLOBAL_RNG)
    # a non-Bool buffer (i.e. a preallocated one) implies buffering is on
    buffer = buffer isa Bool ? buffer : true

    initialize = if batchsize > 0 && shuffle
        x -> BatchView(shuffleobs(x); batchsize, partial)
    elseif batchsize > 0
        x -> BatchView(x; batchsize, partial)
    elseif shuffle
        shuffleobs
    else
        identity
    end

    return EachObs(data, initialize, buffer, parallel, rng)
end

struct EachObs{T, F, R<:AbstractRNG}
    data::T
    initialize::F
    buffer::Bool
    parallel::Bool
    rng::R
end


function Base.iterate(e::EachObs)
    data = e.initialize(e.data)

    iter = if e.parallel
        eachobsparallel(data; buffer = e.buffer)
    else
        if e.buffer
            buf = getobs(data, 1)
            (getobs!(buf, data, i) for i in 1:numobs(data))
        else
            (getobs(data, i) for i in 1:numobs(data))
        end
    end
    ret = iterate(iter)
    isnothing(ret) && return nothing  # empty data
    obs, state = ret
    return obs, (iter, state)
end


function Base.iterate(::EachObs, (iter, state))
    ret = iterate(iter, state)
    isnothing(ret) && return nothing
    obs, state = ret
    return obs, (iter, state)
end


Base.length(e::EachObs) = numobs(e.initialize(e.data))

Base.eltype(e::EachObs) = eltype(e.initialize(e.data))

We could also offer a more declarative syntax like eachobs(initialize, data; kwargs...) which lets the user define initialize using a do block, or define it as a data pipeline and pass it in. Personally, I find this syntax friendlier than the keywords. E.g. do I set batchsize = 0 or = -1 to disable batching? With a declarative syntax, I know what's going to happen because I just wrote BatchView or I didn't.
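A self-contained sketch of what such a declarative method could look like (SimpleEachObs and eachobs_sketch are hypothetical names, not part of this PR):

```julia
# The user-supplied `initialize` builds the pipeline (shuffling, batching, ...)
# and is applied when iteration starts.
struct SimpleEachObs{T,F}
    data::T
    initialize::F
end

eachobs_sketch(initialize, data) = SimpleEachObs(data, initialize)

function Base.iterate(e::SimpleEachObs)
    pipeline = e.initialize(e.data)  # e.g. BatchView(shuffleobs(d); ...)
    ret = iterate(pipeline)
    ret === nothing && return nothing
    obs, st = ret
    return obs, (pipeline, st)
end

function Base.iterate(::SimpleEachObs, (pipeline, st))
    ret = iterate(pipeline, st)
    ret === nothing && return nothing
    obs, st2 = ret
    return obs, (pipeline, st2)
end

Base.length(e::SimpleEachObs) = length(e.initialize(e.data))

# do-block syntax: the pipeline is spelled out instead of toggled by keywords
it = eachobs_sketch(collect(1:6)) do d
    reverse(d)  # stands in for, say, BatchView(shuffleobs(d); batchsize, partial)
end
collect(it)  # == [6, 5, 4, 3, 2, 1]
```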

@darsnack
Member

Possibly as a performance optimization, EachObs could be mutable and store a cache of e.initialize(data), but at least for the common case variants of e.initialize, construction is cheap.

@lorenzoh
Contributor Author

lorenzoh commented May 4, 2022

@darsnack I think adding an initialization function is a useful addition, but I'm wondering about the semantics. One issue I see with making shuffling and batching part of this initialization function is that when one wants to add a custom initialization function "on top", they would have to rewrite the shuffling and batching inside it. Maybe an eachobs(data; buffer = false, parallel = false, shuffle = false, batchsize::Int = -1, partial::Bool = true, rng::AbstractRNG = Random.GLOBAL_RNG, initialize = identity) where initialize is applied before shuffling and batching? An additional method eachobs(initialize, data; kwargs...) for do syntax could also work then. Could this be addressed in a follow-up PR, or would that require breaking changes?

@CarloLucibello I think it's clear that only one of eachobs and DataLoader needs to be implemented. Can you give your opinion on which should hold the implementation and what should happen to the other symbol?

@CarloLucibello
Member

As for the naming, eachobs is more consistent with the naming convention in the library.
We cannot remove DataLoader since it would break a lot of code, and also having DataLoader(..) return an EachObs type instead of a DataLoader type would be breaking (I myself have some code dispatching on the iterable type).
So maybe the best solution for the time being is to just go with DataLoader and have eachobs as a placeholder:

eachobs(args...; kws...) = DataLoader(args...; kws...)

@lorenzoh
Contributor Author

lorenzoh commented May 4, 2022

What if we did const DataLoader = EachObs? Then I suppose your dispatching code would not break, right?
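A const binding creates an alias, not a new type, so dispatch on either name keeps working (the demo names below are illustrative):

```julia
struct EachObsDemo{T}
    data::T
end

const DataLoaderDemo = EachObsDemo  # alias: one type, two names

# A method written against the alias also matches the original name:
describe(::DataLoaderDemo) = "a data loader"

describe(EachObsDemo([1, 2, 3]))  # "a data loader"
DataLoaderDemo === EachObsDemo    # true
```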

@darsnack
Member

darsnack commented May 4, 2022

Re initialize: I could see an initialize = identity as applying on top of shuffling/batching making sense in the first method that I wrote above. It's easy enough to set shuffle = false and batchsize = -1 to use only the initialize part if someone wants to.

As long as we store the shuffling/batching logic within initialize, we can always add eachobs(initialize, data; kwargs...) later. If shuffle, etc. are stored as fields of the struct, then it would be breaking (though EachObs itself is unexported, so maybe not?). Personally, I would put everything inside initialize since there are few places where we need to replicate the batchsize > 0 code, so the code is cleaner. But I have no strong feelings on this.

@CarloLucibello
Member

@lorenzoh I took the liberty of pushing this forward and entirely replacing the previous DataLoader implementation with this.
I've been conservative and kept the object name DataLoader (I don't want surprises for Flux's users or problems in the docs, etc.). We can change the naming scheme in the future.

Right now the only difference between eachobs and DataLoader is the default value for batchsize (-1 and 1, respectively). The -1 default seems more sensible, but I don't want this PR to be breaking.

@@ -11,7 +11,7 @@
 for (i,x) in enumerate(eachobs(X, buffer=b))
     @test x == X[:,i]
 end
-@test b == X[:,end]
+@test_broken b == X[:,end]
Member

This is the only test that is now failing with this PR. Why is that? It seems we don't use the buffer at all...

Member

@lorenzoh what changed here?

Contributor Author

See #82 (comment); I wasn't sure what the right semantics here are, so I didn't implement the preallocated buffer handling. See also #82 (comment).

Contributor Author

Is this change okay?

I wouldn't consider this breaking in any case, since ret = getobs!(buf, data, i) will fall back to getobs, so one should always be using the return value and not rely on buf being mutated.
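The fallback behavior can be sketched with stand-in functions (demo names; they only mirror the getobs!/getobs contract described above):

```julia
# When a data type does not support in-place loading, getobs! falls back to
# getobs and leaves the buffer untouched; callers must use the return value.
getobs_demo(data, i) = data[i]
getobs_demo!(buf, data, i) = getobs_demo(data, i)  # fallback: buf is ignored

data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
buf = zeros(3)
ret = getobs_demo!(buf, data, 1)

ret  # [1.0, 2.0, 3.0] -- the loaded observation
buf  # still [0.0, 0.0, 0.0]: relying on mutation would be a bug here
```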

Member

Maybe we can reinstate the old behavior only in the parallel = false branch. But it's also fine to keep this PR as it is and consider it non-breaking.

@CarloLucibello CarloLucibello changed the title Add parallel and shuffle support to eachobs Add parallel and shuffle support to eachobs and DataLoader May 22, 2022
@lorenzoh
Contributor Author

lorenzoh commented May 27, 2022

Should I merge? This + #88 + a collate kwarg to DataLoader and I can start transitioning FastAI.jl to MLUtils.jl 🥳

@CarloLucibello CarloLucibello merged commit be64c14 into main May 27, 2022
@CarloLucibello CarloLucibello deleted the lo/unify-eachobs-reshuffle branch June 28, 2022 03:39
Successfully merging this pull request may close these issues: "API for iterator/view variants" and "Add multitasking to DataLoader".