Skip to content

Commit

Permalink
Update the documentation (#52)
Browse files Browse the repository at this point in the history
* Change the docs

* Add model description

* Add API for numeric and quasimultinomial submodules.

* Finish API description

* Mark as uncompleted
  • Loading branch information
pawel-czyz authored Feb 12, 2025
1 parent 0009d38 commit 245bf79
Show file tree
Hide file tree
Showing 13 changed files with 297 additions and 19 deletions.
10 changes: 0 additions & 10 deletions docs/api.md

This file was deleted.

12 changes: 12 additions & 0 deletions docs/api/dynamics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Dynamics

Models predicting the changes in variant abundances over time.

Import as

```python

from covvfit import dynamics
```

::: covvfit.dynamics.JointLogisticGrowthParams
32 changes: 32 additions & 0 deletions docs/api/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# API Reference

The package is divided into several modules, corresponding to different utilities.

## Quasimultinomial model

The quasimultinomial noise model, which corresponds to the high-level *Covvfit* utilities.
The API is available [here](quasimultinomial.md).

**Caution:** There are public functions in this submodule, which are being refactored into other modules.

## Dynamics

The growth models, linking model parameters to outcomes of idealised abundance.
The API is available [here](dynamics.md).


## Preprocessing

Data preprocessing utilities, used to load deconvoluted data into the model and reshape them into appropriate data formats.
The API is available [here](preprocessing.md).


## Plotting

The plotting utilities.
The API is available [here](plotting.md).

## Numeric

General numerical programming utilities.
The API is available [here](numeric.md).
30 changes: 30 additions & 0 deletions docs/api/numeric.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Numeric

General numerical programming utilities.

It can be imported as:
```python
from covvfit import numeric
```

or

```python
import covvfit

covvfit.numeric.some_function()
```


## Optimization

::: covvfit.numeric.OptimizeMultiResult

::: covvfit.numeric.jax_multistart_minimize


## Matrix operations

::: covvfit.numeric.log_matrix

::: covvfit.numeric.log1mexp
35 changes: 35 additions & 0 deletions docs/api/plotting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Plotting

Plotting utilities.

Import as:

```python

from covvfit import plot
```

::: covvfit.plot.ArrangedGrid


::: covvfit.plot.arrange_into_grid


::: covvfit.plot.set_axis_off


## Timeseries submodule

Import as

```python

from covvfit import plot
plot_ts = plot.timeseries
```

::: covvfit.plot.timeseries.plot_fit

::: covvfit.plot.timeseries.plot_complement

::: covvfit.plot.timeseries.plot_data
13 changes: 13 additions & 0 deletions docs/api/preprocessing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Preprocessing

Data preprocessing utilities, used to load deconvoluted data into the model and reshape them into appropriate data formats.

Import as:

```python

from covvfit import preprocess
```

::: covvfit.preprocess.TimeScaler

14 changes: 14 additions & 0 deletions docs/api/quasimultinomial.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Quasimultinomial

The quasimultinomial noise model, which corresponds to the high-level *Covvfit* utilities.
It can be imported as:

```python
from covvfit import quasimultinomial as qm
```

It has the following capabilities:

::: covvfit.quasimultinomial.construct_total_loss

::: covvfit.quasimultinomial.construct_model
3 changes: 3 additions & 0 deletions docs/cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Command line workflow

Tutorial to be added in February 2025.
1 change: 0 additions & 1 deletion docs/index.md

This file was deleted.

35 changes: 35 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Covvfit: Variant fitness estimates from wastewater data

*Covvfit* is a framework for estimating relative growth advantages of different variants from deconvolved wastewater samples.
It consists of command line tools, which can be included in the existing workflows and a Python package, which can be used to quickly develop custom solutions and extensions.




## FAQ

**How do I run *Covvfit* on my data?**

We recommend to start using *Covvfit* as a command line tool, with the tutorial available [here](cli.md).

**What data does *Covvfit* use?**

*Covvfit* uses deconvolved wastewater data, accepting relative abundances of different variants measured at different locations and times.
Tools such as [LolliPop](https://github.com/cbg-ethz/LolliPop) or [Freyja](https://github.com/andersen-lab/Freyja/) can be used to deconvolve wastewater data.

**Can *Covvfit* predict emergence of new variants?**

No, *Covvfit* explicitly assumes that no new variants emerge in the considered timeframe, so its predictions are unlikely to hold on longer timescales.
The underlying model also cannot take into account changes in the transmission dynamics or immune response, so that it cannot predict the effects of vaccination programs or lockdowns.

**How can I contact the developers?**

In case you find a bug, want to ask about integrating *Covvfit* into your pipeline, or have any other feedback, we would love to hear it via our [issue tracker](https://github.com/cbg-ethz/covvfit/issues)!
In this manner, other users can also benefit from your insights.

**Is there a manuscript associated with the tool?**

The manuscript is being finalised. We hope to release the preprint describing the method in February 2025.
In case you would like to cite *Covvfit* in your work, we would recommend the following for now:

D. Dreifuss, P. Czyż, N. Beerenwinkel, *Learning and forecasting selection dynamics of SARS-CoV-2 variants from wastewater sequencing data using Covvfit* (2025; in preparation). URL: https://github.com/cbg-ethz/covvfit
56 changes: 56 additions & 0 deletions docs/models.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Statistical models

Here, we explain the statistical models used.

The models are composed of two components:

- The growth model used to describe how the abundances increase or decrease, basing on competition between different variants. This model links some interpretable parameters (such as the growth advantages) with idealised (unobserved) abundances, $p(t)$.
- The noise model connecting idealised abundances $p(t)$ with deconvolved values $y(t)$, which are subject to some noise.

## Growth model

### Single-location model

We consider V competing variants, numbered between 1, ..., V present in the population. At time $t$ the relative abundances of the variants are represented by a vector $p (t) = (p_1(t), p_2(t), ..., p_V(t))$.
We assume survival-of-the fittest type of [selection dynamics](https://en.wikipedia.org/wiki/Replicator_equation#Equation), where the $i$-th variant has a fitness value $f_i$, which is fixed in time, and the relative abundances change at a rate proportional to their fitness advantage over the average value:

$$ \frac{\mathrm d}{\mathrm dt}p_i(t)= p_i(t)\left(f_i - \sum_{j=1}^V p_j(t) f_j \right).$$

In this model, the selection advantage of variant $X_i$ over variant $X_j$ is given by $s_{ij}=f_i-f_j$.
Note that the model dynamics is determined only by the relative selection advantages, rather than the fitness values $f_i$, making the problem non-identifiable: adding a constant to all fitness values does not change the dynamics of the model.
Hence, without loss of generality, we set $f_1=0$, to avoid this identifiability issue.

This set of ordinary differential equations has the analytical solution

$$ p_i(t)=\frac{\exp(f_i\cdot t+b_i)}{ \sum_{j=1}^V \exp(f_j\cdot t + b_j)} $$

Where $b_1, b_2, ..., b_V$ are constants given by the initial conditions. Similarly as with fitness values, adding a constant to all $b_i$ does not change the functions $p_i(t)$. We fix $b_1 = 0$.

### Multiple-location model

The above model can be used to describe a collection of location-specific relative abundance vectors $p_k(t)$, where the index $k\in \{1,..., K\}$ represents a spatial location (e.g., the city or a district connected to one data collection system).

We expect that the introduction times of different variants to different locations may be different, which we accommodate by defining location-specific parameters $b_{vk}$. However, if the wastewater sampling locations are located at the same country subject to the same vaccine program and immunization, we suspect that the processes $p_k(t)$ are not entirely independent.
We therefore *make an assumption that the fitness value does not change across the locations*.

Note that while we find this assumption plausible in the analysis of the data from the same country, it may not hold when analysing locations subject to different vaccine programs.

To summarize, we infer parameters $b_{vk}$ for all variants $v\in \{1, ..., V\}$ and locations $k\in \{1,..., K\}$ together with fitness values $f_1, ..., f_V$, which are shared between the sampling locations.
We use the identifiability constraints $f_1 = 0$ and $b_{1k} = 0$ for all $k$.

## Noise model

We deconvolute a wastewater sample collected at time point $t$ and location $k$ to obtain the observed relative abundances vector

$y_k(t) =(y_{1k}(t), ..., y_{Vk}(t))$, where $y_{vk}(t)$ represents the relative abundance of variant $v\in \{1, ..., V\}$ as obtained in the deconvolution procedure.

Due to a small load of viral signal, amplification through next generation sequencing method, and deconvolution procedure, we do not have an explicit generative model linking the ideal abundance value $p_k(t)$, to the deconvolved value $y_k(t)$.
Instead, we use the quasi-likelihood approach, where we assume that $\mathbb E[y_k(t)] = p_k(t),$
and use a covariance function in a generalized linear model corresponding to the scaled multinomial distribution.
In this manner, the quasi-likelihood inference allows one to correct the obtained confidence intervals by using the dispersion parameter, which is adapted to capture the variance observed in the data.

As both $y_k(t)$ and $p_k(t)$ are probability vectors, we use the quasi-multinomial model, in which the quasi-loglikelihood function is given by

$$ q(f,b)= \sum_{k=1}^K \sum_{t=1}^T\sum_{v=1}^V y_{vk}(t) \log p_{vk}(t). $$

We numerically optimize it to find the maximum quasi-likelihood estimate $\hat \theta = (\hat f, \hat b)$.
14 changes: 8 additions & 6 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,14 +17,14 @@ theme:
toggle:
icon: material/toggle-switch-off-outline
name: Switch to dark mode
primary: teal
accent: purple
primary: indigo
accent: indigo
- scheme: slate
toggle:
icon: material/toggle-switch
name: Switch to light mode
primary: teal
accent: lime
primary: grey
accent: grey

plugins:
- search
Expand All @@ -37,8 +37,10 @@ plugins:
show_root_heading: true

nav:
- Documentation: index.md
- API Reference: api.md
- Overview: index.md
- Command line workflow: cli.md
- API Reference: api/index.md
- Statistical models: models.md

repo_name: covvfit
repo_url: https://github.com/cbg-ethz/covvfit
Expand Down
18 changes: 18 additions & 0 deletions src/covvfit/_dynamics.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,21 +18,29 @@ class JointLogisticGrowthParams(NamedTuple):
This model has `V-1` relative growth rate parameters
and `K*(V-1)` offsets.
Attrs:
relative_growths: relative growth rates, shape `(V-1,)`
relative_offsets: relative offsets, shape `(K, V-1)`
"""

relative_growths: Float[Array, " variants-1"]
relative_offsets: Float[Array, "cities variants-1"]

@property
def n_cities(self) -> int:
"""Number of cities."""
return self.relative_offsets.shape[0]

@property
def n_variants(self) -> int:
"""Number of variants."""
return 1 + self.relative_offsets.shape[1]

@property
def n_params(self) -> int:
"""Number of all parameters in the model."""
return (self.n_variants - 1) * (self.n_cities + 1)

@staticmethod
Expand All @@ -51,6 +59,7 @@ def predict_log_abundance(
self,
timepoints: Float[Array, "cities timepoints"],
) -> Float[Array, "cities timepoints variants"]:
"""Predicts the abundances at the specified time points."""
_new_shape = (self.n_cities, self.n_variants - 1)
tiled_growths = jnp.broadcast_to(self.relative_growths[None, :], _new_shape)

Expand All @@ -60,12 +69,21 @@ def predict_log_abundance(

@classmethod
def from_vector(cls, theta, n_variants: int) -> "JointLogisticGrowthParams":
"""Wraps a vector with parameters of shape `(dim,)` to the model.
Note that `dim` should match the number of parameters.
"""
return JointLogisticGrowthParams(
relative_growths=qm.get_relative_growths(theta, n_variants=n_variants),
relative_offsets=qm.get_relative_midpoints(theta, n_variants=n_variants),
)

def to_vector(self) -> Float[Array, " *batch"]:
"""Wraps all the parameter into a single vector.
Note:
This function is useful for optimization purposes, as many optimizers accept vectors,
rather than tuples.
"""
return qm.construct_theta(
relative_growths=self.relative_growths,
relative_midpoints=self.relative_offsets,
Expand Down
Loading

0 comments on commit 245bf79

Please sign in to comment.