Incremental PCA #289

krstopro · 2024-07-18T19:27:11Z

Adds Incremental PCA. As usual, there are several things I am not sure about:

Basically the struct is almost the same as for regular PCA. These should probably be a single module, but it isn't that trivial to merge them (they are separate classes in scikit-learn).
There are two fit/2 clauses: one that takes a tensor and another that takes a stream as a first argument. The one that takes a tensor is defined with deftransform while the one that takes a stream is defined with def. Is this fine?
~~Is it possible to add a popular dataset in incremental_pca_test.exs without adding explorer or scidata dependency to mix.exs?~~
transform/2 is the same as for regular PCA.

krstopro

Two doubts.

krstopro · 2024-07-18T19:54:08Z

lib/scholar/shared.ex

+      if leftover_size > min_batch_size do
+        Nx.slice_along_axis(tensor, num_batches * batch_size, leftover_size, axis: 0)
+      else
+        nil


I guess this won't work outside defn. Is it fine though?

lib/scholar/decomposition/incremental_pca.ex

josevalim · 2024-07-18T20:48:39Z

There are two fit/2 clauses: one that takes a tensor and another that takes a stream as a first argument. The one that takes a tensor is defined with deftransform while the one that takes a stream is defined with def. Is this fine?

This is currently undefined behaviour. I would maybe have a function name specific to streams... if I have more ideas I will drop as comment when I review the PR.

Is it possible to add a popular dataset in incremental_pca_test.exs without adding explorer or scidata dependency to mix.exs?

It is fine to add scidata as a test dependency only.

josevalim · 2024-07-18T20:52:37Z

lib/scholar/decomposition/incremental_pca.ex

+    first_batch = Enum.at(batches, 0)
+    model = fit_first_n(first_batch, opts)
+    batches = Stream.drop(batches, 1)


This will iterate some lazy streams (such as file) twice. For eample, we will open it once to get the first batch and then again to traverse it. The correct way is to have a single reduce, which it starts with an accumulator of {:head, ...} and then you swap it to {:tail, ...} for the remaining iterations.

Although I wonder if we should have this function here... perhaps it is best to have fit_head and fit_tail as public functions and then later on we provide facilities for streaming on any model that has fit_head and fit_tail?

I am still thinking about it. Perhaps there is a possibility to combine fit_first_n and fit_partial_n into a single function. I find it cleaner to have two separate functions and it also avoids the overhead of checking whether the function is called for the first time on every element of the stream.

One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.

One problem with doing this using a single function is that there needs to be some starting model as initial accumulator.

This can probably be achieved by setting num_samples_seen to 0 and filling tensor fields with nan.

There is no need to have an initial accumulator, you could have the acc be :first and then model, and pattern matching distinguighes between them.

But you could also have a model with a single :initialized field as a boolean (or even a u8) and then we could easily fold it all into a single defn.

There is no need to have an initial accumulator, you could have the acc be :first and then model, and pattern matching distinguighes between them.

I see, that would work.

But you could also have a model with a single :initialized field as a boolean (or even a u8) and then we could easily fold it all into a single defn.

Checking if num_samples_seen is equal to 0 can tell us if the model has been initialised or not. The only issue would be the floating point type of tensor fields, but then we can ask the user to provide it.

Alright, it is possible to have a single-clause reducer, but then the problem is that the user needs to specify the type and the number of features. Those are necessary for initialising the container fields of the model (e.g. components is a floating-point tensor of shape {num_components, num_features}) and they cannot be inferred without running the stream.

I suppose it is a cleaner approach to have a reducer with two clauses: one for head and one for tail.

krstopro · 2024-07-19T18:50:24Z

I've decided to remove the fit/2 clause that takes tensor as the first argument as I don't think it's needed. :)

The code should be much cleaner now, though some problems remain.

krstopro · 2024-07-19T18:53:43Z

lib/scholar/decomposition/incremental_pca.ex

+    )
+  end
+
+  defp fit_batch(nil, batch, opts), do: fit_head_n(batch, opts)


Using fit_head/2 instead of fit_head_n/2 over here won't work. I think it has to do with the compilation. For some reason the options won't be passed to fit_head/2 and opts[:num_components] will be nil. I don't know how to fix this.

There is a typo in the option name down below, that's probably why: num_components vs num_componenets.

Damn, good catch. :)

josevalim · 2024-07-25T09:45:48Z

lib/scholar/decomposition/incremental_pca.ex

+      iex> ipca.singular_values
+      Nx.tensor([77.05782028025969, 10.137848854064941])
+  """
+  def fit(batches = %Stream{}, opts) do


We should not match on a stream because %Stream{} is only one of the possible streams. Instead we should just treat it as an enumerable:

Suggested change

def fit(batches = %Stream{}, opts) do

def fit(batches, opts) do

I wonder though if we should instead merge this into PCA and add a new function called fit_stream.

Yes, I was about to recommend that! Do we need to call it fit_stream or fit_incremental or we can just add another clause? Two clauses were working over here; one that was taking a tensor as the first argument was defined with deftransform, the other that takes stream as the first argument is defined with def.

I would do a separate function. Either fit_stream or fit_incremental.

krstopro · 2024-07-29T15:13:32Z

Closing in favor of #291.

Krsto Proroković and others added 7 commits May 21, 2024 20:10

Add incremental PCA

af2135e

Update PCA unit-tests

643622a

Merge branch 'elixir-nx:main' into incremental-pca

f8c93cb

Merge branch 'elixir-nx:main' into incremental-pca

fb2b254

Add Incremental PCA

bed9afd

Fix PCA and its tests.

a8adff4

Bug fix.

d7dfde1

krstopro commented Jul 18, 2024

View reviewed changes

josevalim reviewed Jul 18, 2024

View reviewed changes

Krsto Proroković added 2 commits July 19, 2024 15:07

Single stream run

1a4ba4a

Remove fit clauses that works with tensor, add docs

7e56f07

Remove unused import.

b773a94

krstopro commented Jul 19, 2024

View reviewed changes

Krsto Proroković added 3 commits July 22, 2024 10:31

Update docstrings and add scidata to deps

a50d30c

Fix get_batches and docstrings

0ef3a3b

mix format

c59d8b0

krstopro requested a review from josevalim July 22, 2024 09:44

josevalim reviewed Jul 25, 2024

View reviewed changes

Krsto Proroković added 2 commits July 25, 2024 13:34

Fix type: componenets -> components

3b5bc33

mix format

70a3c78

krstopro closed this Jul 29, 2024

krstopro deleted the incremental-pca branch July 31, 2024 17:45

josevalim mentioned this pull request Sep 4, 2024

Non-terminating PCA decomposition #298

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incremental PCA #289

Incremental PCA #289

krstopro commented Jul 18, 2024 •

edited

Loading

krstopro left a comment

krstopro Jul 18, 2024

josevalim commented Jul 18, 2024

josevalim Jul 18, 2024

krstopro Jul 18, 2024

krstopro Jul 18, 2024

josevalim Jul 18, 2024

krstopro Jul 18, 2024

krstopro Jul 19, 2024 •

edited

Loading

krstopro commented Jul 19, 2024 •

edited

Loading

krstopro Jul 19, 2024

josevalim Jul 25, 2024

krstopro Jul 25, 2024

josevalim Jul 25, 2024

josevalim Jul 25, 2024

krstopro Jul 25, 2024 •

edited

Loading

josevalim Jul 25, 2024

krstopro commented Jul 29, 2024

	def fit(batches = %Stream{}, opts) do
	def fit(batches, opts) do

Incremental PCA #289

Incremental PCA #289

Conversation

krstopro commented Jul 18, 2024 • edited Loading

krstopro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

josevalim commented Jul 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krstopro Jul 19, 2024 • edited Loading

Choose a reason for hiding this comment

krstopro commented Jul 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krstopro Jul 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krstopro commented Jul 29, 2024

krstopro commented Jul 18, 2024 •

edited

Loading

krstopro Jul 19, 2024 •

edited

Loading

krstopro commented Jul 19, 2024 •

edited

Loading

krstopro Jul 25, 2024 •

edited

Loading