Optimization improvements #156
Thanks, @wsdewitt! I'm slowly working my way through this, and will probably have more comments/questions, but here are a few I'd like to start with:
Convergence results
I'm trying to test this now, but it seems the results you posted were using the reference-equivariant
Typos in
Thanks for the questions, @jgallowa07. If and the penalty is

For the second question, the problem is that it's impossible to find a sparsity pattern that has condition 1 shifted wrt conditions 2 and 3 while 2 and 3 are not shifted wrt each other, unless we lasso conditions 2 and 3 to each other the same way we lasso each to condition 1. We currently aren't doing that, so we can't find this pattern, and if we change the reference this limitation changes with it.

I didn't follow the last point about the suggested difference matrix. If you multiply your matrix by None of the elements of the result seem to correspond to any shift we'd want to penalize (each should be a difference
Oops, my CS is showing. I was 0-indexing, which confused a few things. I was also assuming we needed the resulting vector of
Thanks for the updates! Taking a closer look now.
Lots of discussion and results from my investigation into this approach have been communicated via Slack or Zoom, but to summarize:
Here are the summary figures from the working #4 obj approach, for documentation, @wsdewitt. The notebook that produced all the results discussed so far can be found here. This can be executed directly, or invoked across a grid of papermill parameters (in parallel) using the nb_papermill.py script, paired with the desired params.json, something like the sketch below.
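As a rough sketch of the same idea (the exact nb_papermill.py interface isn't shown here), the grid run could also be driven directly with papermill's Python API; the notebook name and parameter names below are hypothetical placeholders:

```python
import itertools
import papermill as pm

# Hypothetical parameter grid; the real values would come from params.json.
grid = {"scale_coeff_lasso_shift": [1e-5, 1e-4, 1e-3], "maxiter": [5000, 20000]}

for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    tag = "_".join(f"{k}-{v}" for k, v in params.items())
    pm.execute_notebook(
        "convergence_testing.ipynb",        # hypothetical input notebook
        f"convergence_{tag}.ipynb",         # one executed notebook per grid point
        parameters=params,
    )
```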
This would output a directory with all the individual resulting notebooks and their respective outputs, to be collected and analyzed using code like that in papermill_results.ipynb. It would be great if you could:
In the meantime, I'm going to start implementing this into the

Let me know if you have any questions, and thanks!
This issue proposes fixes for a few interconnected technical problems with optimization:
A notebook prototyping the ideas below: https://github.com/matsengrp/multidms/blob/149_custom_optimization_loop/notebooks/convergence_testing_2024-03-13/wsd.ipynb
Modeling and transformations for better-conditioned gradients
Inadequate convergence is likely caused by data/parameter scaling pathologies in gradient moves. Bundle mutations in non-reference conditions have most variants in the hot state, whereas non-bundle sites are mostly cold. This scaling discrepancy can be seen by plotting the row-sums of the encodings (left plot):
This creates a ridge in parameter space defined by the subspace of non-bundle mutations: small changes in the bundle-associated parameters result in huge changes in loss, whereas this isn't so for the non-bundle parameters.
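As a toy illustration of the resulting ill-conditioning (a sketch, not code from the notebook): with a linear latent phenotype and squared loss, the coordinate-wise curvature is just the column sum of the encoding, so mostly-hot bundle columns see roughly 100x the curvature of mostly-cold non-bundle columns in a simulation like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_bundle, n_other = 10_000, 20, 200

# Bundle columns are hot (1) for nearly every variant in a non-reference
# condition; non-bundle mutation columns are cold (0) for most variants.
X_bundle = (rng.random((n_variants, n_bundle)) < 0.99).astype(float)
X_other = (rng.random((n_variants, n_other)) < 0.01).astype(float)
X = np.hstack([X_bundle, X_other])

# For squared loss on a linear latent phenotype, the curvature wrt each
# coefficient is sum_i x_im^2, i.e. the column sum (entries are 0/1).
curvature = X.sum(axis=0)
print("mean bundle curvature:    ", curvature[:n_bundle].mean())   # ~9900
print("mean non-bundle curvature:", curvature[n_bundle:].mean())   # ~100
# A single gradient step size is then either too timid for the non-bundle
# coordinates or unstable for the bundle coordinates: the "ridge" above.
```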
Equivalent parameterization
Given any 1-hot encoding $X_d\in\{0,1\}^{n_d\times M}$, $d=1,\dots,D$, we pose the regression problem with condition-specific intercepts $\beta_0^d$ and mutation-effect vectors $\beta^d$ (one plausible explicit form of the objective is sketched at the end of this subsection).
The shifts of mutation effects between conditions $d$ and $d'$ are $\beta_m^d - \beta_m^{d'}$,
and the shifts between WTs are the corresponding differences of condition intercepts, $\beta_0^d - \beta_0^{d'}$.
We will usually use a 1-hot encoding referenced to the WT sequence from one of the $D$ conditions, so that the mutation effects and shifts are interpreted wrt that condition. This introduces parameter scaling problems in the non-reference conditions that confound gradient methods, arising from the set of "bundle" sites that define the divergence from the reference encoding (more on this below). For now, note that the choice of reference condition for interpreting shifts can be made a posteriori by fixing $d'$, recovering the parameterization in the main text.
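For concreteness, one plausible way to write this class of problem is the sketch below; the loss $\ell$ and link $g$ are placeholders for whatever smooth data-fit term is actually used, not necessarily the exact multidms objective:

$$\min_{\{\beta_0^d,\,\beta^d\}}\ \sum_{d=1}^{D} \ell\Bigl(y_d,\ g\bigl(\beta_0^d\mathbf{1} + X_d\beta^d\bigr)\Bigr) \;+\; \lambda\, P\bigl(\{\beta^d\}\bigr),$$

where $P$ is an $\ell_1$ penalty on the shifts; which pairs of conditions it couples is exactly the reference-equivariance question taken up below.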
Data scale transformation
To appropriately scale the regression parameters, we need condition-specific 1-hot encodings, each wrt the WT in that condition. We get this by bit-flipping the "bundle" mutations in all conditions. Each condition then has data referenced to its own WT, and the scaling across conditions is comparable:
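A minimal sketch of this re-referencing step, assuming a dense 0/1 encoding matrix and a boolean mask marking condition $d$'s bundle columns (names are illustrative, not the multidms API):

```python
import numpy as np

def flip_bundle(X_d: np.ndarray, bundle_d: np.ndarray) -> np.ndarray:
    """Re-reference a 0/1 variant encoding to condition d's own WT.

    X_d: (n_d, M) one-hot variant encoding wrt the reference WT.
    bundle_d: boolean mask of length M, True at the bundle columns
        (sites where condition d's WT differs from the reference WT).
    """
    X_tilde = X_d.copy()
    # Bit-flip the bundle columns: hot <-> cold relative to condition d's WT.
    X_tilde[:, bundle_d] = 1 - X_tilde[:, bundle_d]
    return X_tilde
```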
Parameter transformation
Now we need to clarify how parameters for the rescaled data relate to those for the original data. Consider just one condition, so that we suppress the index $d$ below for simplicity. Let $\mathcal{B}$ be the set of indices of the columns of $X$ corresponding to the bundle. We want to work with rescaled data $1-x_m$ at sites $m\in\mathcal{B}$ in the bundle, and $x_m$ at sites $m\notin\mathcal{B}$ not in the bundle. Let $\tilde\beta_0$ and $\tilde\beta_m$ be the intercept and coefficients in the rescaled problem, and $\beta_0$ and $\beta_m$ be the intercept and coefficients in the original data. The equivalence of latent phenotype predictions between scales requires

$$\tilde\beta_0 + \sum_{m\in\mathcal{B}}\tilde\beta_m(1 - x_m) + \sum_{m\notin\mathcal{B}}\tilde\beta_m x_m = \beta_0 + \sum_m \beta_m x_m.$$

Matching like terms, we find

$$\tilde\beta_m = -\beta_m \quad (m\in\mathcal{B}), \qquad \tilde\beta_m = \beta_m \quad (m\notin\mathcal{B}),$$

and

$$\tilde\beta_0 = \beta_0 + \sum_{m\in\mathcal{B}}\beta_m.$$
Note that $\tilde\beta_m^{\mathrm{ref}} = \beta_m^{\mathrm{ref}}$, since $\mathcal{B}^{\mathrm{ref}}=\emptyset$. Note also that the transform operation is its own inverse.
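As a sanity check, here is a small sketch of the corresponding parameter transform (illustrative names, not the multidms API), including a numerical check that it is its own inverse:

```python
import numpy as np

def transform_params(beta0: float, beta: np.ndarray, bundle: np.ndarray):
    """Map (beta0, beta) between the original and rescaled parameterizations.

    bundle is a boolean mask of length M marking the bundle columns.
    Applying the transform twice returns the original parameters.
    """
    beta_t = beta.copy()
    beta_t[bundle] = -beta_t[bundle]      # tilde(beta_m) = -beta_m on the bundle
    beta0_t = beta0 + beta[bundle].sum()  # tilde(beta_0) = beta_0 + sum over bundle
    return beta0_t, beta_t

rng = np.random.default_rng(0)
beta, bundle, beta0 = rng.normal(size=10), rng.random(10) < 0.3, 1.5
b0_t, b_t = transform_params(beta0, beta, bundle)
b0_back, b_back = transform_params(b0_t, b_t, bundle)
assert np.isclose(b0_back, beta0) and np.allclose(b_back, beta)
```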
With the above, we can transform into the original parameterization, perform proximal updates, and then transform back into the scaled parameterization. (EDIT: no, we can't cavalierly treat the gradient and prox with different parameters; see later comment.)

In the prototype notebook, using this scaling transformation for data and parameters improves convergence behavior dramatically.
Here are the resulting correlations with observed functional scores:
Reference-experiment equivariance during optimization
We currently choose a reference experiment a priori and penalize only the differences between each non-reference experiment and the reference experiment. For this section, consider one mutation, so we suppress the index $m$ below for simplicity. Our mutation effects across conditions are the vector $\beta\in\mathbb{R}^D$, and our shift penalty is

$$\lambda\lVert\mathfrak{D}\beta\rVert_1,$$

where, assuming $d=1$ is the reference experiment and $D=3$, we use the difference matrix

$$\mathfrak{D} = \begin{pmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}.$$
The above formulation is equivalent to the current model, but is expressed in the form of the generalized lasso problem (which considers an arbitrary $\mathfrak{D}$). In our $\mathfrak{D}$, the row of zeros captures the fact that we don't penalize shifts between non-reference experiments. This lack of symmetry limits the potential sparsity structure of the model. In particular, it's not possible for two non-reference experiments to have the same mutation effect that is different from the reference experiment. If an epistatic interaction has arisen only on the branch leading to the reference experiment, we can't discover that shift pattern. If we change the reference experiment, we get a different limitation on shift sparsity pattern discovery.
It would be more natural to infer a reference-equivariant model that fuses across all conditions symmetrically, after which an a posteriori choice of reference experiment can be made for mutation effect interpretation. We would use the difference matrix

$$\mathfrak{D} = \begin{pmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{pmatrix},$$
where all shifts are penalized symmetrically.
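For a general number of conditions $D$, such a symmetric difference matrix could be constructed as in the following sketch (an illustrative helper, not part of multidms):

```python
import itertools
import numpy as np

def pairwise_difference_matrix(D: int) -> np.ndarray:
    """Rows are all pairwise differences beta[d'] - beta[d] for d < d'."""
    rows = []
    for d, d_prime in itertools.combinations(range(D), 2):
        row = np.zeros(D)
        row[d], row[d_prime] = -1.0, 1.0
        rows.append(row)
    return np.stack(rows)

# For D = 3 this reproduces the symmetric matrix above (up to row order), and
# the shift penalty on one mutation's effects beta in R^D is
# lam * np.abs(pairwise_difference_matrix(3) @ beta).sum().
```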
Finally, we can combine this with the parameter transformation above by transforming parameters back to the original scale, applying proximal steps in $\beta$, then retransforming to $\tilde\beta$ (taking care to update the intercepts correctly). (EDIT: no, we can't cavalierly treat the gradient and prox with different parameters; see later comment on what we actually need to do.)

ADMM approach
We next derive the proximal operator for the symmetric fusion problem above, using standard augmented Lagrangian methods for the generalized lasso (see citations below); our problem is a special case of the generalized lasso. The penalized problem is

$$\min_\beta\ f(\beta) + \lambda\lVert\mathfrak{D}\beta\rVert_1,$$

where $f$ is the smooth piece of the objective. To use FISTA, we'll need the gradient $\nabla f$ and the proximal operator $\mathrm{prox}_{\lambda}$, defined as

$$\mathrm{prox}_{\lambda}(y) = \arg\min_{\beta}\ \tfrac{1}{2}\lVert y-\beta\rVert_2^2 + \lambda\lVert\mathfrak{D}\beta\rVert_1,$$

which is equivalent to the constrained problem

$$\min_{\beta,\,z}\ \tfrac{1}{2}\lVert y-\beta\rVert_2^2 + \lambda\lVert z\rVert_1 \quad \text{subject to} \quad \mathfrak{D}\beta = z.$$

This can be evaluated by ADMM iterations with dual variable $u$:

$$\begin{aligned}
\beta &\leftarrow \bigl(I + \rho\,\mathfrak{D}^\top\mathfrak{D}\bigr)^{-1}\bigl(y + \rho\,\mathfrak{D}^\top(z - u)\bigr),\\
z &\leftarrow S_{\lambda/\rho}\bigl(\mathfrak{D}\beta + u\bigr),\\
u &\leftarrow u + \mathfrak{D}\beta - z,
\end{aligned}$$
where $S_{\lambda/\rho}$ is the soft-thresholding operator and $\rho>0$ is the ADMM step size. These iterations rapidly converge in terms of the primal and dual residuals:
To parallelize across all mutations, we simply stack in the column dimension of $y, z, u$. Thus, we have the proximal operator necessary for the symmetrically penalized model. A prototype ADMM prox function is implemented in the notebook linked at the top.
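The following is a minimal numpy sketch of this prox, written here for illustration (it is not the prototype from the linked notebook, and the names are illustrative). Mutations are stacked as columns of Y, so all of them are updated in parallel:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding S_t(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_generalized_lasso(Y, Dmat, lam, rho=1.0, n_iter=100):
    """ADMM evaluation of prox_lambda(Y) for the penalty lam * ||Dmat @ beta||_1.

    Y: (D, M) array; each column is one mutation's effects across conditions.
    Dmat: (K, D) difference matrix (e.g. pairwise_difference_matrix(D)).
    """
    D_cond, M = Y.shape
    K = Dmat.shape[0]
    # The beta-update solves (I + rho D^T D) beta = Y + rho D^T (z - u);
    # the system matrix is fixed, so precompute its inverse once.
    A_inv = np.linalg.inv(np.eye(D_cond) + rho * Dmat.T @ Dmat)
    z = np.zeros((K, M))
    u = np.zeros((K, M))
    for _ in range(n_iter):
        beta = A_inv @ (Y + rho * Dmat.T @ (z - u))
        z = soft_threshold(Dmat @ beta + u, lam / rho)
        u = u + Dmat @ beta - z  # scaled dual update
    return beta
```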
See: