Incremental PCA #289
Conversation
Two doubts.
lib/scholar/shared.ex (Outdated)
```elixir
if leftover_size > min_batch_size do
  Nx.slice_along_axis(tensor, num_batches * batch_size, leftover_size, axis: 0)
else
  nil
```
I guess this won't work outside `defn`. Is it fine though?
This is currently undefined behaviour. I would maybe have a function name specific to streams... if I have more ideas I will drop as comment when I review the PR.
It is fine to add `scidata` as a test dependency only.
```elixir
first_batch = Enum.at(batches, 0)
model = fit_first_n(first_batch, opts)
batches = Stream.drop(batches, 1)
```
This will iterate some lazy streams (such as a file) twice. For example, we will open it once to get the first batch and then again to traverse it. The correct way is to have a single reduce, which starts with an accumulator of `{:head, ...}` and then you swap it to `{:tail, ...}` for the remaining iterations.

Although I wonder if we should have this function here... perhaps it is best to have `fit_head` and `fit_tail` as public functions and then later on we provide facilities for streaming on any model that has `fit_head` and `fit_tail`?
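The single-pass shape being suggested could be sketched like this (a hedged sketch; `fit_head/2` and `fit_tail/3` stand for the hypothetical public functions mentioned above):

```elixir
# Single reduce over the enumerable: the accumulator starts as :head and
# is swapped to {:tail, model} after the first batch, so a lazy stream
# (such as a file) is traversed exactly once.
{:tail, model} =
  Enum.reduce(batches, :head, fn
    batch, :head -> {:tail, fit_head(batch, opts)}
    batch, {:tail, model} -> {:tail, fit_tail(model, batch, opts)}
  end)
```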
I am still thinking about it. Perhaps there is a possibility to combine `fit_first_n` and `fit_partial_n` into a single function. I find it cleaner to have two separate functions and it also avoids the overhead of checking whether the function is called for the first time on every element of the stream.

One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.
> One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.

This can probably be achieved by setting `num_samples_seen` to `0` and filling tensor fields with `nan`.
There is no need to have an initial accumulator, you could have the acc be `:first` and then `model`, and pattern matching distinguishes between them.

But you could also have a model with a single `:initialized` field as a boolean (or even a u8) and then we could easily fold it all into a single `defn`.
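A hedged sketch of the flag-based alternative (the helper names `init_model/1` and `update_model/2` are hypothetical, not from the PR):

```elixir
# A u8 :initialized field lets a single defn branch between the
# first-batch initialisation and the incremental update, with no
# Elixir-side accumulator state.
defn fit_step(model, batch) do
  if model.initialized == 0 do
    # First batch: build the initial statistics.
    init_model(batch)
  else
    # Later batches: incremental update.
    update_model(model, batch)
  end
end
```

Note that inside `defn` both branches must return containers of matching shape and type, which is where the type/num_features question below comes from.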
> There is no need to have an initial accumulator, you could have the acc be `:first` and then `model`, and pattern matching distinguishes between them.

I see, that would work.

> But you could also have a model with a single `:initialized` field as a boolean (or even a u8) and then we could easily fold it all into a single `defn`.

Checking if `num_samples_seen` is equal to `0` can tell us if the model has been initialised or not. The only issue would be the floating-point type of tensor fields, but then we can ask the user to provide it.
Alright, it is possible to have a single-clause reducer, but then the problem is that the user needs to specify the type and the number of features. Those are necessary for initialising the container fields of the model (e.g. `components` is a floating-point tensor of shape `{num_components, num_features}`) and they cannot be inferred without running the stream.

I suppose it is a cleaner approach to have a reducer with two clauses: one for head and one for tail.
I've decided to remove the […]. The code should be much cleaner now, though some problems remain.
```elixir
  )
end

defp fit_batch(nil, batch, opts), do: fit_head_n(batch, opts)
```
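With `nil` standing for "no model yet", the two-clause reducer could then be driven in a single pass, roughly like this (a sketch; `fit_tail_n/3` is an assumed counterpart to `fit_head_n/2`):

```elixir
# nil marks the uninitialised state: the first batch hits the
# fit_head_n clause, every subsequent batch the fit_tail_n clause.
defp fit_batch(nil, batch, opts), do: fit_head_n(batch, opts)
defp fit_batch(model, batch, opts), do: fit_tail_n(model, batch, opts)

def fit(batches, opts) do
  # One pass over the batches, no separate handling of the first one.
  Enum.reduce(batches, nil, &fit_batch(&2, &1, opts))
end
```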
Using `fit_head/2` instead of `fit_head_n/2` over here won't work. I think it has to do with the compilation. For some reason the options won't be passed to `fit_head/2` and `opts[:num_components]` will be `nil`. I don't know how to fix this.
There is a typo in the option name down below, that's probably why: `num_components` vs `num_componenets`.
Damn, good catch. :)
```elixir
      iex> ipca.singular_values
      Nx.tensor([77.05782028025969, 10.137848854064941])
  """
  def fit(batches = %Stream{}, opts) do
```
We should not match on a stream because `%Stream{}` is only one of the possible streams. Instead we should just treat it as an enumerable:

```diff
- def fit(batches = %Stream{}, opts) do
+ def fit(batches, opts) do
```
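For illustration (a sketch, not from the PR): `%Stream{}` is only one of many implementations of the `Enumerable` protocol, which is why the unmatched clause is preferable:

```elixir
# None of these are %Stream{} structs, yet all are valid batch sources:
File.stream!("data.csv")  # %File.Stream{}
IO.stream(:stdio, :line)  # %IO.Stream{}
[batch1, batch2]          # plain list
1..10                     # Range
# `def fit(batches, opts)` with no struct match accepts them all.
```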
I wonder though if we should instead merge this into PCA and add a new function called `fit_stream`.
Yes, I was about to recommend that! Do we need to call it `fit_stream` or `fit_incremental`, or can we just add another clause? Two clauses were working over here; the one that was taking a tensor as the first argument was defined with `deftransform`, the other that takes a stream as the first argument was defined with `def`.
I would do a separate function. Either `fit_stream` or `fit_incremental`.
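A hedged sketch of that API shape (signatures assumed, bodies elided):

```elixir
defmodule Scholar.Decomposition.PCA do
  # Tensor input keeps the existing defn-friendly compilation path.
  deftransform fit(x, opts \\ []) do
    # ... existing eager fit ...
  end

  # Enumerable/stream input gets its own entry point, so neither
  # clause has to pattern-match on a particular stream struct.
  def fit_stream(batches, opts \\ []) do
    # ... incremental fit via a single reduce over the batches ...
  end
end
```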
Closing in favor of #291.
Adds Incremental PCA. As usual, there are several things I am not sure about:

- … (`scikit-learn`). There are two `fit/2` clauses: one that takes a tensor and another that takes a stream as a first argument. The one that takes a tensor is defined with `deftransform` while the one that takes a stream is defined with `def`. Is this fine?
- Is it possible to add a popular dataset in `incremental_pca_test.exs` without adding an `explorer` or `scidata` dependency to `mix.exs`?
- `transform/2` is the same as for regular PCA.