-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional show
info for PCA
#90
Conversation
This arose from https://discourse.julialang.org/t/pca-output/22687/10. Main contribution is pattern matrix with features x components. This should probably be refined but not sure what to do without DataFrames dependency.
Thanks. Could you do a quick survey of what other packages/languages print? I suspect there are differences between disciplinary fields so better check that we don't choose an approach/terminology which is only familiar to psychologists. Additional points:
|
Here's some broad coverage I looked for Python output but I didn't find anything consistent, but I didn't look that hard because I've always found Python's support for stats to be underwhelming. I don't think the proposed addition is unique to the field of psychology, but I think the minitab example provides a lot of additional output that we might consider. |
OK, thanks. Do all of these use the same definition of loadings (notably regarding standardization)? Regarding row and column names, we had a discussion with @andreasnoack and an interesting approach would be to use wider types in method signatures so that |
I haven't read through all of the docs for each of these but I know that the exact method of obtaining loadings varies in the R functions when choosing different methods of rotation. I think the most sustainable way to manage this is to have an API that supports specification of different rotation methods. We don't need to implement anything other than what we have here and then other contributors would have a clear place to introduce more methods in the future. I might be able to make some time in the future to add to other rotation methods but I'm basing a lot of this off what I can learn from R docs. I'll probably just go without column/row labels for now and provide some documentation for what it means, if that sounds alright to you. |
Yes, we should have a way to choose the definition at some point. But here we just have to decide what to print by default. To make that decision I really think we need to have a rough idea of what other software prints, and how they call it. |
Why not to put it all as an |
|
Perhaps overloading |
I'd prefer FWIW, R has both a semi-compact output by default and a somewhat longer |
I'm not a fan of a default full output, but it looks like the only possible option to have separate short and full output comes with having full output by default in a console. So, let's put a longer output in |
Any updates? |
Sorry, this got lost in some other stuff I was working on. I just pushed what I had sitting around and forgot about but I can look at putting the output in a |
src/pca.jl
Outdated
ldgs_signs[ldgs_signs .== 0] .= 1 | ||
ldgs = ldgs * diagm(0 => ldgs_signs[:]) | ||
print(io, "\n\nPattern matrix\n") | ||
display(ldgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this came out with columns and rows labeled but without access to the variable names that's not possible. In the current state the columns are the components in the same order as those in the following coeftable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's annoying. For now maybe print a CoefTable
object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.
Also, how about printing the second table first? That's the most important result I think.
One note that may be out of the scope of this PR, it would be nice if the returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I wonder whether there should be a way for users to extract the printed information in an object. It's frustrating to get printed results, but have to redo the computations by hand. That's a bit of an orthogonal issue, but still. Maybe coeftable
could return the pattern matrix? Or could we find another name for that operation, which would return something meaningful also for other algorithms?
src/MultivariateStats.jl
Outdated
@@ -3,7 +3,7 @@ module MultivariateStats | |||
using StatsBase: SimpleCovariance, CovarianceEstimator | |||
import Statistics: mean, var, cov, covm | |||
import Base: length, size, show, dump | |||
import StatsBase: fit, predict, ConvergenceException | |||
import StatsBase: fit, predict, ConvergenceException, CoefTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import
is for when you overload a function with a new method. It shouldn't be needed here.
src/pca.jl
Outdated
ldgs_signs = sign.(sum(ldgs, dims=1)) | ||
ldgs_signs[ldgs_signs .== 0] .= 1 | ||
ldgs = ldgs * diagm(0 => ldgs_signs[:]) | ||
print(io, "\n\nPattern matrix\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print(io, "\n\nPattern matrix\n") | |
print(io, "\n\nPattern matrix:\n") |
(for consistency with below)
Also, better be specific about what this matrix is: psych says "standardized loadings".
src/pca.jl
Outdated
ldgs_signs[ldgs_signs .== 0] .= 1 | ||
ldgs = ldgs * diagm(0 => ldgs_signs[:]) | ||
print(io, "\n\nPattern matrix\n") | ||
display(ldgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's annoying. For now maybe print a CoefTable
object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.
Also, how about printing the second table first? That's the most important result I think.
src/pca.jl
Outdated
print(io, "Importance of components:\n") | ||
print(io, CoefTable(vcat(principalvars(M)', (principalvars(M) ./ tvar(M))', (cumsum(principalvars(M) ./tvar(M)))'), | ||
string.("PC", 1:length(principalvars(M))), # components in order | ||
["Loadings", "Proportion explained", "Cumulative proportion"])) # row names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Loadings" is "SS loadings" from psych, right? Shouldn't that be specified? Also "proportion" should say of what (the variance). And how about adding "proportion of variance" and "cumulative variance" which are in base R and psych?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not "Loadings", they are "Eigenvalues" or "SS loadings" as R calls them.
Loadings would be
projection(M).*sqrt.(principalvars(M))'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wildart You're definitely right. I apologize for the oversight.
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Would these make sense?
Perhaps this is an API thing that needs to be resolved in #94 first. |
I'm not sure these are the most needed, as they are relatively simple to compute using |
Now, it looks like this: julia> X = RDatasets.dataset("datasets", "iris")[:,1:4] |> Matrix;
julia> M = fit(PCA, X', maxoutdim=3)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)
Pattern matrix:
────────────────────────────────────
PC1 PC2 PC3
────────────────────────────────────
1 0.743108 0.323446 -0.16277
2 -0.173801 0.359689 0.167212
3 1.76155 -0.0854062 0.0213202
4 0.736739 -0.0371832 0.152647
────────────────────────────────────
Importance of components:
─────────────────────────────────────────────────────────
PC1 PC2 PC3
─────────────────────────────────────────────────────────
SS Loadings (Eigenvalues) 4.22824 0.242671 0.0782095
Variance explained 0.924619 0.0530665 0.0171026
Cumulative variance 0.924619 0.977685 0.994788
Proportion explained 0.929463 0.0533445 0.0171922
Cumulative proportion 0.929463 0.982808 1.0
───────────────────────────────────────────────────────── which is close to R
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This arose from a discussion on discourse https://discourse.julialang.org/t/pca-output/22687/10.
Output is inspired by psych package from R. The pattern matrix included here is features by components. It would be more clear with row and column names but I didn't want to introduce new dependencies.
It's not polished by any means but I wanted to atleast get it started and get some feedback.