Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental PCA #289

Closed
wants to merge 15 commits into from
Closed

Incremental PCA #289

wants to merge 15 commits into from

Conversation

krstopro
Copy link
Member

@krstopro krstopro commented Jul 18, 2024

Adds Incremental PCA. As usual, there are several things I am not sure about:

  • Basically the struct is almost the same as for regular PCA. These should probably be a single module, but it isn't that trivial to merge them (they are separate classes in scikit-learn).
  • There are two fit/2 clauses: one that takes a tensor and another that takes a stream as a first argument. The one that takes a tensor is defined with deftransform while the one that takes a stream is defined with def. Is this fine?
  • Is it possible to add a popular dataset in incremental_pca_test.exs without adding explorer or scidata dependency to mix.exs?
  • transform/2 is the same as for regular PCA.

Copy link
Member Author

@krstopro krstopro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two doubts.

if leftover_size > min_batch_size do
Nx.slice_along_axis(tensor, num_batches * batch_size, leftover_size, axis: 0)
else
nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this won't work outside defn. Is it fine though?

lib/scholar/decomposition/incremental_pca.ex Outdated Show resolved Hide resolved
@josevalim
Copy link
Contributor

There are two fit/2 clauses: one that takes a tensor and another that takes a stream as a first argument. The one that takes a tensor is defined with deftransform while the one that takes a stream is defined with def. Is this fine?

This is currently undefined behaviour. I would maybe have a function name specific to streams... if I have more ideas I will drop as comment when I review the PR.

Is it possible to add a popular dataset in incremental_pca_test.exs without adding explorer or scidata dependency to mix.exs?

It is fine to add scidata as a test dependency only.

Comment on lines 167 to 169
first_batch = Enum.at(batches, 0)
model = fit_first_n(first_batch, opts)
batches = Stream.drop(batches, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will iterate some lazy streams (such as file) twice. For eample, we will open it once to get the first batch and then again to traverse it. The correct way is to have a single reduce, which it starts with an accumulator of {:head, ...} and then you swap it to {:tail, ...} for the remaining iterations.

Although I wonder if we should have this function here... perhaps it is best to have fit_head and fit_tail as public functions and then later on we provide facilities for streaming on any model that has fit_head and fit_tail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still thinking about it. Perhaps there is a possibility to combine fit_first_n and fit_partial_n into a single function. I find it cleaner to have two separate functions and it also avoids the overhead of checking whether the function is called for the first time on every element of the stream.

One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.

This can probably be achieved by setting num_samples_seen to 0 and filling tensor fields with nan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to have an initial accumulator, you could have the acc be :first and then model, and pattern matching distinguighes between them.

But you could also have a model with a single :initialized field as a boolean (or even a u8) and then we could easily fold it all into a single defn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to have an initial accumulator, you could have the acc be :first and then model, and pattern matching distinguighes between them.

I see, that would work.

But you could also have a model with a single :initialized field as a boolean (or even a u8) and then we could easily fold it all into a single defn.

Checking if num_samples_seen is equal to 0 can tell us if the model has been initialised or not. The only issue would be the floating point type of tensor fields, but then we can ask the user to provide it.

Copy link
Member Author

@krstopro krstopro Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, it is possible to have a single-clause reducer, but then the problem is that the user needs to specify the type and the number of features. Those are necessary for initialising the container fields of the model (e.g. components is a floating-point tensor of shape {num_components, num_features}) and they cannot be inferred without running the stream.

I suppose it is a cleaner approach to have a reducer with two clauses: one for head and one for tail.

@krstopro
Copy link
Member Author

krstopro commented Jul 19, 2024

I've decided to remove the fit/2 clause that takes tensor as the first argument as I don't think it's needed. :)

The code should be much cleaner now, though some problems remain.

)
end

defp fit_batch(nil, batch, opts), do: fit_head_n(batch, opts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using fit_head/2 instead of fit_head_n/2 over here won't work. I think it has to do with the compilation. For some reason the options won't be passed to fit_head/2 and opts[:num_components] will be nil. I don't know how to fix this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the option name down below, that's probably why: num_components vs num_componenets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, good catch. :)

@krstopro krstopro requested a review from josevalim July 22, 2024 09:44
iex> ipca.singular_values
Nx.tensor([77.05782028025969, 10.137848854064941])
"""
def fit(batches = %Stream{}, opts) do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not match on a stream because %Stream{} is only one of the possible streams. Instead we should just treat it as an enumerable:

Suggested change
def fit(batches = %Stream{}, opts) do
def fit(batches, opts) do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder though if we should instead merge this into PCA and add a new function called fit_stream.

Copy link
Member Author

@krstopro krstopro Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was about to recommend that! Do we need to call it fit_stream or fit_incremental or we can just add another clause? Two clauses were working over here; one that was taking a tensor as the first argument was defined with deftransform, the other that takes stream as the first argument is defined with def.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do a separate function. Either fit_stream or fit_incremental.

@krstopro
Copy link
Member Author

Closing in favor of #291.

@krstopro krstopro closed this Jul 29, 2024
@krstopro krstopro deleted the incremental-pca branch July 31, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants