Incremental PCA #289 (Closed)
Commits (15)
* af2135e: Add incremental PCA
* 643622a: Update PCA unit-tests
* f8c93cb: Merge branch 'elixir-nx:main' into incremental-pca
* fb2b254 (krstopro): Merge branch 'elixir-nx:main' into incremental-pca
* bed9afd (krstopro): Add Incremental PCA
* a8adff4: Fix PCA and its tests.
* d7dfde1: Bug fix.
* 1a4ba4a: Single stream run
* 7e56f07: Remove fit clauses that works with tensor, add docs
* b773a94: Remove unused import.
* a50d30c: Update docstrings and add scidata to deps
* 0ef3a3b: Fix get_batches and docstrings
* c59d8b0: mix format
* 3b5bc33: Fix type: componenets -> components
* 70a3c78: mix format
@@ -0,0 +1,314 @@
defmodule Scholar.Decomposition.IncrementalPCA do
  @moduledoc """
  Incremental Principal Component Analysis.

  Performs linear dimensionality reduction by processing the input data
  in batches and incrementally updating the principal components.
  This iterative approach is particularly suitable for datasets too large to fit in memory,
  as its memory complexity is independent of the number of data samples.

  References:

  * [1] [Incremental Learning for Robust Visual Tracking](https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf)
  """
  import Nx.Defn
  require Nx

  @derive {Nx.Container,
           keep: [:whiten?],
           containers: [
             :components,
             :singular_values,
             :num_samples_seen,
             :mean,
             :variance,
             :explained_variance,
             :explained_variance_ratio
           ]}
  defstruct [
    :components,
    :singular_values,
    :num_samples_seen,
    :mean,
    :variance,
    :explained_variance,
    :explained_variance_ratio,
    :whiten?
  ]

  opts = [
    num_components: [
      required: true,
      type: :pos_integer,
      doc: "The number of principal components."
    ],
    whiten?: [
      type: :boolean,
      default: false,
      doc: """
      When true, the transformed output is divided by the square root of `:explained_variance`,
      so that the outputs are uncorrelated and have unit component-wise variances.

      Whitening removes some information from the transformed signal (the relative variance scales of the components),
      but can sometimes improve the predictive accuracy of downstream estimators by making the data respect some hard-wired assumptions.
      """
    ]
  ]

  @opts_schema NimbleOptions.new!(opts)

  @doc """
  Fits an Incremental PCA model on a stream of batches.

  ## Options

  #{NimbleOptions.docs(@opts_schema)}

  ## Return values

  The function returns a struct with the following fields:

    * `:components` - Principal axes in feature space, representing the directions of maximum variance in the data.
      Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors.
      The components are sorted by decreasing `:explained_variance`.

    * `:singular_values` - The singular values corresponding to each of the selected components.
      The singular values are equal to the 2-norms of the `:num_components` variables in the lower-dimensional space.

    * `:num_samples_seen` - The number of data samples processed.

    * `:mean` - Per-feature empirical mean.

    * `:variance` - Per-feature empirical variance.

    * `:explained_variance` - Variance explained by each of the selected components.

    * `:explained_variance_ratio` - Percentage of variance explained by each of the selected components.

    * `:whiten?` - Whether whitening is applied.

  ## Examples

      iex> {x, _} = Scidata.Iris.download()
      iex> batches = x |> Nx.tensor() |> Nx.to_batched(10)
      iex> ipca = Scholar.Decomposition.IncrementalPCA.fit(batches, num_components: 2)
      iex> ipca.components
      Nx.tensor(
        [
          [-0.33354005217552185, 0.1048964187502861, -0.8618107080105579, -0.3674643635749817],
          [-0.5862125754356384, -0.7916879057884216, 0.15874788165092468, -0.06621300429105759]
        ]
      )
      iex> ipca.singular_values
      Nx.tensor([77.05782028025969, 10.137848854064941])
  """
  def fit(batches = %Stream{}, opts) do
    opts = NimbleOptions.validate!(opts, @opts_schema)

    Enum.reduce(
      batches,
      nil,
      fn batch, model -> fit_batch(model, batch, opts) end
    )
  end

  defp fit_batch(nil, batch, opts), do: fit_head(batch, opts)
  defp fit_batch(%__MODULE__{} = model, batch, _opts), do: partial_fit(model, batch)

  deftransformp fit_head(x, opts) do
    {num_samples, num_features} = Nx.shape(x)
    num_components = opts[:num_components]

    cond do
      num_components > num_samples ->
        raise ArgumentError,
              """
              num_components must be less than or equal to \
              batch_size = #{num_samples}, got #{num_components}
              """

      num_components > num_features ->
        raise ArgumentError,
              """
              num_components must be less than or equal to \
              num_features = #{num_features}, got #{num_components}
              """

      true ->
        nil
    end

    fit_head_n(x, opts)
  end

  defnp fit_head_n(x, opts) do
    # This is similar to Scholar.Decomposition.PCA.fit_n
    num_components = opts[:num_components]
    num_samples = Nx.u64(Nx.axis_size(x, 0))
    mean = Nx.mean(x, axes: [0])
    variance = Nx.variance(x, axes: [0])
    x_centered = x - mean
    {u, s, vt} = Nx.LinAlg.svd(x_centered, full_matrices?: false)
    {_, vt} = Scholar.Decomposition.Utils.flip_svd(u, vt)
    components = vt[[0..(num_components - 1)]]
    singular_values = s[[0..(num_components - 1)]]
    explained_variance = s * s / (num_samples - 1)

    explained_variance_ratio =
      (explained_variance / Nx.sum(explained_variance))[[0..(num_components - 1)]]

    %__MODULE__{
      components: components,
      singular_values: singular_values,
      num_samples_seen: num_samples,
      mean: mean,
      variance: variance,
      explained_variance: explained_variance[[0..(num_components - 1)]],
      explained_variance_ratio: explained_variance_ratio,
      whiten?: opts[:whiten?]
    }
  end

  @doc false
  deftransform partial_fit(model, x) do
    {num_components, num_features_seen} = Nx.shape(model.components)
    {num_samples, num_features} = Nx.shape(x)

    cond do
      num_features_seen != num_features ->
        raise ArgumentError,
              """
              each batch must have the same number of features, \
              got #{num_features_seen} and #{num_features}
              """

      num_components > num_samples ->
        raise ArgumentError,
              """
              num_components must be less than or equal to \
              batch_size = #{num_samples}, got #{num_components}
              """

      true ->
        nil
    end

    partial_fit_n(model, x)
  end

  defnp partial_fit_n(model, x) do
    components = model.components
    num_components = Nx.axis_size(components, 0)
    singular_values = model.singular_values
    num_samples_seen = model.num_samples_seen
    mean = model.mean
    variance = model.variance
    {num_samples, _} = Nx.shape(x)

    {x_mean, x_centered, new_num_samples_seen, new_mean, new_variance} =
      incremental_mean_and_variance(x, num_samples_seen, mean, variance)

    # Correction term accounting for the shift between the previous mean and
    # the mean of the current batch (see [1]).
    mean_correction =
      Nx.sqrt(num_samples_seen / new_num_samples_seen) * num_samples * (mean - x_mean)

    mean_correction = Nx.new_axis(mean_correction, 0)

    # Stack the previous basis (scaled by its singular values), the centered
    # batch, and the mean correction, then recompute the SVD of the stacked
    # matrix to update the principal components.
    matrix =
      Nx.concatenate(
        [
          Nx.new_axis(singular_values, 1) * components,
          x_centered,
          mean_correction
        ],
        axis: 0
      )

    {u, s, vt} = Nx.LinAlg.svd(matrix, full_matrices?: false)
    {_, vt} = Scholar.Decomposition.Utils.flip_svd(u, vt)
    new_components = vt[[0..(num_components - 1)]]
    new_singular_values = s[[0..(num_components - 1)]]

    # The explained variance is derived from the updated singular values,
    # not the singular values carried over from the previous step.
    new_explained_variance =
      new_singular_values * new_singular_values / (new_num_samples_seen - 1)

    new_explained_variance_ratio =
      new_singular_values * new_singular_values / Nx.sum(new_variance * new_num_samples_seen)

    %__MODULE__{
      components: new_components,
      singular_values: new_singular_values,
      num_samples_seen: new_num_samples_seen,
      mean: new_mean,
      variance: new_variance,
      explained_variance: new_explained_variance,
      explained_variance_ratio: new_explained_variance_ratio,
      whiten?: model.whiten?
    }
  end

  defnp incremental_mean_and_variance(x, num_samples_seen, mean, variance) do
    num_samples = Nx.axis_size(x, 0)
    new_num_samples_seen = num_samples_seen + num_samples
    sum = num_samples_seen * mean
    x_sum = Nx.sum(x, axes: [0])
    new_mean = (sum + x_sum) / new_num_samples_seen
    x_mean = x_sum / num_samples
    x_centered = x - x_mean
    # x_centered sums to zero in exact arithmetic; the correction term
    # compensates for floating-point error in the centering.
    correction = Nx.sum(x_centered, axes: [0])

    x_unnormalized_variance =
      Nx.sum(x_centered * x_centered, axes: [0]) - correction * correction / num_samples

    # Combine the variance of the samples seen so far with the variance of
    # the current batch using the parallel (Chan et al.) update formula.
    unnormalized_variance = num_samples_seen * variance
    last_over_new_count = num_samples_seen / num_samples

    new_unnormalized_variance =
      unnormalized_variance +
        x_unnormalized_variance +
        last_over_new_count / new_num_samples_seen *
          (sum / last_over_new_count - x_sum) ** 2

    new_variance = new_unnormalized_variance / new_num_samples_seen
    {x_mean, x_centered, new_num_samples_seen, new_mean, new_variance}
  end

  @doc """
  Applies dimensionality reduction to the data `x` using the Incremental PCA `model`.

  ## Examples

      iex> {x, _} = Scidata.Iris.download()
      iex> batches = x |> Nx.tensor() |> Nx.to_batched(10)
      iex> ipca = Scholar.Decomposition.IncrementalPCA.fit(batches, num_components: 2)
      iex> x = Nx.tensor([[5.2, 2.6, 2.475, 0.7], [6.1, 3.2, 3.95, 1.3], [7.0, 3.8, 5.425, 1.9]])
      iex> Scholar.Decomposition.IncrementalPCA.transform(ipca, x)
      Nx.tensor(
        [
          [1.4564743041992188, 0.5657951235771179],
          [-0.27242332696914673, -0.24238374829292297],
          [-2.0013210773468018, -1.0505625009536743]
        ]
      )
  """
  deftransform transform(model, x) do
    transform_n(model, x)
  end

  defnp transform_n(
          %__MODULE__{
            components: components,
            explained_variance: explained_variance,
            mean: mean,
            whiten?: whiten?
          } = _model,
          x
        ) do
    # This is the same as Scholar.Decomposition.PCA.transform_n
    z = Nx.dot(x - mean, [1], components, [1])

    if whiten? do
      z / Nx.sqrt(explained_variance)
    else
      z
    end
  end
end
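For reference, a minimal end-to-end sketch of the API this file introduces, mirroring the doctests above (it assumes the `Scidata` dependency and the batch size of 10 used there; any stream of equally shaped batches would do):

```elixir
# Build a lazy stream of 10-sample batches from the Iris dataset.
{x, _} = Scidata.Iris.download()
batches = x |> Nx.tensor() |> Nx.to_batched(10)

# Each batch incrementally updates the mean, variance, and principal components.
ipca = Scholar.Decomposition.IncrementalPCA.fit(batches, num_components: 2)

# Project new observations onto the two learned components.
Scholar.Decomposition.IncrementalPCA.transform(ipca, Nx.tensor([[5.2, 2.6, 2.475, 0.7]]))
```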
Review comments

> We should not match on a stream, because `%Stream{}` is only one of the possible streams. Instead we should just treat it as an enumerable.
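A sketch of what that suggestion could look like (the body is unchanged from this diff; only the `%Stream{}` match is dropped, so lists of tensors and other enumerables are accepted too):

```elixir
# Suggested revision: accept any enumerable of batches, not only %Stream{}.
def fit(batches, opts) do
  opts = NimbleOptions.validate!(opts, @opts_schema)
  Enum.reduce(batches, nil, fn batch, model -> fit_batch(model, batch, opts) end)
end
```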
> I wonder though if we should instead merge this into PCA and add a new function called `fit_stream`.
> Yes, I was about to recommend that! Do we need to call it `fit_stream` or `fit_incremental`, or can we just add another clause? Two clauses were working over here: the one that takes a tensor as the first argument was defined with `deftransform`, and the one that takes a stream as the first argument was defined with `def`.
> I would do a separate function. Either `fit_stream` or `fit_incremental`.
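Under that proposal, the public API might read roughly as follows (a hypothetical sketch; neither the placement nor the final name is settled in this thread):

```elixir
# In-memory tensor: the existing one-shot fit.
model = Scholar.Decomposition.PCA.fit(x, num_components: 2)

# Out-of-core data: a separate entry point that folds the incremental
# update over an enumerable of batches.
model = Scholar.Decomposition.PCA.fit_stream(batches, num_components: 2)
```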