Update the documentation (#52)

* Change the docs * Add model description * Add API for numeric and quasimultinomial submodules. * Finish API description * Mark as uncompleted
cbg-ethz · Feb 12, 2025 · 245bf79 · 245bf79
1 parent 0009d38
commit 245bf79
Show file tree

Hide file tree

Showing 13 changed files with 297 additions and 19 deletions.
diff --git a/docs/api.md b/docs/api.md
diff --git a/docs/api/dynamics.md b/docs/api/dynamics.md
@@ -0,0 +1,12 @@
+# Dynamics
+
+Models predicting the changes in variant abundances over time.
+
+Import as
+
+```python
+
+from covvfit import dynamics
+```
+
+::: covvfit.dynamics.JointLogisticGrowthParams
diff --git a/docs/api/index.md b/docs/api/index.md
@@ -0,0 +1,32 @@
+# API Reference
+
+The package is divided into several modules, corresponding to different utilities.
+
+## Quasimultinomial model
+
+The quasimultinomial noise model, which corresponds to the high-level *Covvfit* utilities.  
+The API is available [here](quasimultinomial.md).
+
+**Caution:** There are public functions in this submodule, which are being refactored into other modules.
+
+## Dynamics
+
+The growth models, linking model parameters to outcomes of idealised abundance.
+The API is available [here](dynamics.md).
+
+
+## Preprocessing
+
+Data preprocessing utilities, used to load deconvoluted data into the model and reshape them into appropriate data formats.
+The API is available [here](preprocessing.md).
+
+
+## Plotting
+
+The plotting utilities. 
+The API is available [here](plotting.md).
+
+## Numeric
+
+General numerical programming utilities.
+The API is available [here](numeric.md).
diff --git a/docs/api/numeric.md b/docs/api/numeric.md
@@ -0,0 +1,30 @@
+# Numeric
+
+General numerical programming utilities.
+
+It can be imported as:
+```python
+from covvfit import numeric
+```
+
+or
+
+```python
+import covvfit
+
+covvfit.numeric.some_function()
+```
+
+
+## Optimization
+
+::: covvfit.numeric.OptimizeMultiResult
+
+::: covvfit.numeric.jax_multistart_minimize
+
+
+## Matrix operations
+
+::: covvfit.numeric.log_matrix
+
+::: covvfit.numeric.log1mexp
diff --git a/docs/api/plotting.md b/docs/api/plotting.md
@@ -0,0 +1,35 @@
+# Plotting
+
+Plotting utilities.
+
+Import as:
+
+```python
+
+from covvfit import plot
+```
+
+::: covvfit.plot.ArrangedGrid
+
+
+::: covvfit.plot.arrange_into_grid
+
+
+::: covvfit.plot.set_axis_off
+
+
+## Timeseries submodule
+
+Import as 
+
+```python
+
+from covvfit import plot
+plot_ts = plot.timeseries
+```
+
+::: covvfit.plot.timeseries.plot_fit
+
+::: covvfit.plot.timeseries.plot_complement
+
+::: covvfit.plot.timeseries.plot_data
diff --git a/docs/api/preprocessing.md b/docs/api/preprocessing.md
@@ -0,0 +1,13 @@
+# Preprocessing
+
+Data preprocessing utilities, used to load deconvoluted data into the model and reshape them into appropriate data formats.
+
+Import as:
+
+```python
+
+from covvfit import preprocess
+```
+
+::: covvfit.preprocess.TimeScaler
+
diff --git a/docs/api/quasimultinomial.md b/docs/api/quasimultinomial.md
@@ -0,0 +1,14 @@
+# Quasimultinomial
+
+The quasimultinomial noise model, which corresponds to the high-level *Covvfit* utilities.  
+It can be imported as:
+
+```python
+from covvfit import quasimultinomial as qm
+```
+
+It has the following capabilities:
+
+::: covvfit.quasimultinomial.construct_total_loss
+
+::: covvfit.quasimultinomial.construct_model
diff --git a/docs/cli.md b/docs/cli.md
@@ -0,0 +1,3 @@
+# Command line workflow
+
+Tutorial to be added in February 2025.
diff --git a/docs/index.md b/docs/index.md
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,35 @@
+# Covvfit: Variant fitness estimates from wastewater data
+
+*Covvfit* is a framework for estimating relative growth advantages of different variants from deconvolved wastewater samples.
+It consists of command line tools, which can be included in the existing workflows and a Python package, which can be used to quickly develop custom solutions and extensions.
+
+
+
+
+## FAQ
+
+**How do I run *Covvfit* on my data?**
+
+We recommend to start using *Covvfit* as a command line tool, with the tutorial available [here](cli.md). 
+
+**What data does *Covvfit* use?**
+
+*Covvfit* uses deconvolved wastewater data, accepting relative abundances of different variants measured at different locations and times.
+Tools such as [LolliPop](https://github.com/cbg-ethz/LolliPop) or [Freyja](https://github.com/andersen-lab/Freyja/) can be used to deconvolve wastewater data. 
+
+**Can *Covvfit* predict emergence of new variants?**
+
+No, *Covvfit* explicitly assumes that no new variants emerge in the considered timeframe, so its predictions are unlikely to hold on longer timescales.
+The underlying model also cannot take into account changes in the transmission dynamics or immune response, so that it cannot predict the effects of vaccination programs or lockdowns.
+
+**How can I contact the developers?**
+
+In case you find a bug, want to ask about integrating *Covvfit* into your pipeline, or have any other feedback, we would love to hear it via our [issue tracker](https://github.com/cbg-ethz/covvfit/issues)!
+In this manner, other users can also benefit from your insights.
+
+**Is there a manuscript associated with the tool?**
+
+The manuscript is being finalised. We hope to release the preprint describing the method in February 2025.
+In case you would like to cite *Covvfit* in your work, we would recommend the following for now:
+
+D. Dreifuss, P. Czyż, N. Beerenwinkel, *Learning and forecasting selection dynamics of SARS-CoV-2 variants from wastewater sequencing data using Covvfit* (2025; in preparation). URL: https://github.com/cbg-ethz/covvfit
diff --git a/docs/models.md b/docs/models.md
@@ -0,0 +1,56 @@
+# Statistical models
+
+Here, we explain the statistical models used.
+
+The models are composed of two components:
+
+  - The growth model used to describe how the abundances increase or decrease, basing on competition between different variants. This model links some interpretable parameters (such as the growth advantages) with idealised (unobserved) abundances, $p(t)$.
+  - The noise model connecting idealised abundances $p(t)$ with deconvolved values $y(t)$, which are subject to some noise.
+
+## Growth model
+
+### Single-location model
+
+We consider V competing variants, numbered between 1, ..., V present in the population. At time $t$ the relative abundances of the variants are represented by a vector $p (t) = (p_1(t), p_2(t), ..., p_V(t))$.
+We assume survival-of-the fittest type of [selection dynamics](https://en.wikipedia.org/wiki/Replicator_equation#Equation), where the $i$-th variant has a fitness value $f_i$, which is fixed in time, and the relative abundances change at a rate proportional to their fitness advantage over the average value:
+
+$$ \frac{\mathrm d}{\mathrm dt}p_i(t)= p_i(t)\left(f_i - \sum_{j=1}^V p_j(t) f_j \right).$$
+
+In this model, the selection advantage of variant $X_i$ over variant $X_j$ is given by $s_{ij}=f_i-f_j$.
+Note that the model dynamics is determined only by the relative selection advantages, rather than the fitness values $f_i$, making the problem non-identifiable: adding a constant to all fitness values does not change the dynamics of the model.
+Hence, without loss of generality, we set $f_1=0$, to avoid this identifiability issue.
+
+This set of ordinary differential equations has the analytical solution 
+
+$$ p_i(t)=\frac{\exp(f_i\cdot t+b_i)}{ \sum_{j=1}^V \exp(f_j\cdot t + b_j)} $$
+
+Where $b_1, b_2, ..., b_V$ are constants given by the initial conditions. Similarly as with fitness values, adding a constant to all $b_i$ does not change the functions $p_i(t)$. We fix $b_1 = 0$.
+
+### Multiple-location model
+
+The above model can be used to describe a collection of location-specific relative abundance vectors $p_k(t)$, where the index $k\in \{1,..., K\}$ represents a spatial location (e.g., the city or a district connected to one data collection system).
+
+We expect that the introduction times of different variants to different locations may be different, which we accommodate by defining location-specific parameters $b_{vk}$. However, if the wastewater sampling locations are located at the same country subject to the same vaccine program and immunization, we suspect that the processes $p_k(t)$ are not entirely independent.
+We therefore *make an assumption that the fitness value does not change across the locations*.
+
+Note that while we find this assumption plausible in the analysis of the data from the same country, it may not hold when analysing locations subject to different vaccine programs.
+
+To summarize, we infer parameters $b_{vk}$ for all variants $v\in \{1, ..., V\}$ and locations $k\in \{1,..., K\}$  together with fitness values $f_1, ..., f_V$, which are shared between the sampling locations.
+We use the identifiability constraints $f_1 = 0$ and $b_{1k} = 0$ for all $k$.
+
+## Noise model
+
+We deconvolute a wastewater sample collected at time point $t$ and location $k$ to obtain the observed relative abundances vector
+
+$y_k(t) =(y_{1k}(t), ..., y_{Vk}(t))$, where $y_{vk}(t)$ represents the relative abundance of variant $v\in \{1, ..., V\}$ as obtained in the deconvolution procedure.
+
+Due to a small load of viral signal, amplification through next generation sequencing method, and deconvolution procedure, we do not have an explicit generative model linking the ideal abundance value $p_k(t)$, to the deconvolved value $y_k(t)$.
+Instead, we use the quasi-likelihood approach, where we assume that $\mathbb E[y_k(t)] = p_k(t),$
+and use a covariance function in a generalized linear model corresponding to the scaled multinomial distribution.
+In this manner, the quasi-likelihood inference allows one to correct the obtained confidence intervals by using the dispersion parameter, which is adapted to capture the variance observed in the data.
+
+As both $y_k(t)$ and $p_k(t)$ are probability vectors, we use the quasi-multinomial model, in which the quasi-loglikelihood function is given by
+
+$$ q(f,b)= \sum_{k=1}^K \sum_{t=1}^T\sum_{v=1}^V y_{vk}(t) \log p_{vk}(t). $$ 
+
+We numerically optimize it to find the maximum quasi-likelihood estimate $\hat \theta = (\hat f, \hat b)$.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -17,14 +17,14 @@ theme:
       toggle:
         icon: material/toggle-switch-off-outline 
         name: Switch to dark mode
-      primary: teal
-      accent: purple 
+      primary: indigo
+      accent: indigo
     - scheme: slate 
       toggle:
         icon: material/toggle-switch
         name: Switch to light mode    
-      primary: teal
-      accent: lime
+      primary: grey
+      accent: grey
 
 plugins:
   - search
@@ -37,8 +37,10 @@ plugins:
             show_root_heading: true
 
 nav:
-  - Documentation: index.md
-  - API Reference: api.md
+  - Overview: index.md
+  - Command line workflow: cli.md
+  - API Reference: api/index.md
+  - Statistical models: models.md
 
 repo_name: covvfit
 repo_url: https://github.com/cbg-ethz/covvfit

diff --git a/src/covvfit/_dynamics.py b/src/covvfit/_dynamics.py
@@ -18,21 +18,29 @@ class JointLogisticGrowthParams(NamedTuple):
 
     This model has `V-1` relative growth rate parameters
     and `K*(V-1)` offsets.
+
+    Attrs:
+
+        relative_growths: relative growth rates, shape `(V-1,)`
+        relative_offsets: relative offsets, shape `(K, V-1)`
     """
 
     relative_growths: Float[Array, " variants-1"]
     relative_offsets: Float[Array, "cities variants-1"]
 
     @property
     def n_cities(self) -> int:
+        """Number of cities."""
         return self.relative_offsets.shape[0]
 
     @property
     def n_variants(self) -> int:
+        """Number of variants."""
         return 1 + self.relative_offsets.shape[1]
 
     @property
     def n_params(self) -> int:
+        """Number of all parameters in the model."""
         return (self.n_variants - 1) * (self.n_cities + 1)
 
     @staticmethod
@@ -51,6 +59,7 @@ def predict_log_abundance(
         self,
         timepoints: Float[Array, "cities timepoints"],
     ) -> Float[Array, "cities timepoints variants"]:
+        """Predicts the abundances at the specified time points."""
         _new_shape = (self.n_cities, self.n_variants - 1)
         tiled_growths = jnp.broadcast_to(self.relative_growths[None, :], _new_shape)
 
@@ -60,12 +69,21 @@ def predict_log_abundance(
 
     @classmethod
     def from_vector(cls, theta, n_variants: int) -> "JointLogisticGrowthParams":
+        """Wraps a vector with parameters of shape `(dim,)` to the model.
+        Note that `dim` should match the number of parameters.
+        """
         return JointLogisticGrowthParams(
             relative_growths=qm.get_relative_growths(theta, n_variants=n_variants),
             relative_offsets=qm.get_relative_midpoints(theta, n_variants=n_variants),
         )
 
     def to_vector(self) -> Float[Array, " *batch"]:
+        """Wraps all the parameter into a single vector.
+
+        Note:
+            This function is useful for optimization purposes, as many optimizers accept vectors,
+            rather than tuples.
+        """
         return qm.construct_theta(
             relative_growths=self.relative_growths,
             relative_midpoints=self.relative_offsets,
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		# Command line workflow

		Tutorial to be added in February 2025.