Fix description of Kmax in fit_gbcd #6

Open
pcarbo opened this issue Jan 3, 2025 · 0 comments
pcarbo commented Jan 3, 2025

From @YushaLiu:

To be precise, Kmax is the maximum number of factors added during the initialization step, where we fit flash with a point-Laplace prior. We then apply a nonnegative transform that splits each factor into two. However, we always get a nonnegative intercept/baseline factor at this stage, so at most 2*Kmax - 1 factors enter the next step, where we fit flash with a GB prior and improve the fit by backfitting.
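The arithmetic behind this bound can be sketched as follows. This is only an illustration, not code from the package: the helper name `max_factors_after_split` is hypothetical, and the comments assume one reading of the paragraph above, namely that the baseline factor is already nonnegative and therefore contributes one factor rather than two.

```python
def max_factors_after_split(kmax: int) -> int:
    """Upper bound on the factors entering the GB-prior fit.

    Hypothetical helper for illustration; not part of the package.
    """
    if kmax < 1:
        raise ValueError("kmax must be at least 1")
    # Assumption: of the (at most) kmax initial factors, one is the
    # nonnegative baseline factor and is counted once; each of the
    # remaining kmax - 1 factors is split into two nonnegative parts.
    baseline = 1
    split_factors = 2 * (kmax - 1)
    return baseline + split_factors  # equals 2 * kmax - 1


for kmax in (1, 5, 10):
    print(kmax, max_factors_after_split(kmax))
```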

After fitting flash with the GB prior, there is an additional step that filters out those k for which l_k and \tilde{l}_k are not consistent (by default, those with a correlation < 0.8). As a result, the function output can contain fewer than 2*Kmax - 1 factors.
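A minimal sketch of such a consistency filter is shown below. It is not the package's implementation: the function names, the list-of-columns data layout, and the use of Pearson correlation are assumptions made for illustration; only the 0.8 default threshold comes from the description above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def consistent_factors(L, L_tilde, min_corr=0.8):
    """Return the indices k whose columns l_k and l~_k have correlation >= min_corr.

    L and L_tilde are lists of columns (hypothetical layout for this sketch).
    """
    return [k for k, (l, lt) in enumerate(zip(L, L_tilde))
            if pearson(l, lt) >= min_corr]


# Toy data: factors 0 and 2 agree closely, factor 1 does not.
L = [[1, 2, 3, 4], [1, 0, 1, 0], [0, 0, 1, 1]]
L_tilde = [[1.1, 2.0, 2.9, 4.2], [0, 1, 0, 1], [0, 0.1, 0.9, 1]]
kept = consistent_factors(L, L_tilde)
print(kept)
```

In this toy example the second factor is dropped, so the output has fewer factors than entered the filter, which is the effect described above.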

If the underlying structure in the data requires far fewer than 2*Kmax - 1 factors to explain, we will find many k for which l_k and \tilde{l}_k are not consistent, and these are removed in this step. On the other hand, if the specified Kmax is smaller than needed, this final step filters out few or no factors, leaving almost all of the 2*Kmax - 1 factors in the function output.

So the output contains up to 2*Kmax - 1 factors, but the number can be much smaller than that. The difference depends on the relationship between the specified Kmax and the underlying "true" number of factors needed to explain the structure in the data (again, this is based entirely on my empirical experience with real datasets). I think I made it clear that Kmax should be interpreted as an approximation of the final K we will get; maybe we can also say that the final K is at most 2*Kmax - 1?
