template.qmd

---
title: "VIPERA report: `r params$name`" 
subtitle: "Workflow version: `r params$workflow_version`"
date: last-modified
date-format: "YYYY-MM-DD"
title-block-banner: true
format: 
  html: 
    page-layout: article
    embed-resources: true
    smooth-scroll: true
    theme: cosmo
    toc: true
    toc-expand: true
    toc-location: left
    toc-title: Summary
    number-sections: true
css: config/report.styles.css
editor: visual
params:
  ufboot_reps: ""
  shalrt_reps: ""
  min_ivar_freq: ""
  workflow_version: ""
  div: ""
  freyja: ""
  tree: ""
  tempest: ""
  SNV: ""
  SNV_s: ""
  evo: ""
  div_value: ""
  panel: ""
  volcano: ""
  tree_ml: ""
  fig_cor_snp: ""
  stats_lm: ""
  table: ""
  sum_nv: ""
  heat_tab: ""
  omega_plot: ""
  name: ""
output-file: report.html
---
<head>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=Montserrat&display=swap" rel="stylesheet">
</head>


```{r read, echo = F, message = F, include = F}
library(jsonlite)
library(gt)
library(tidyr)
library(dplyr)
library(heatmaply)
library(readr)

# Diversity
div_values <- fromJSON(params$div_value)

# Temporal signal
stats <- fromJSON(params$stats_lm)
correlation <- stats[["r2"]]
sub_rate <- stats[["sub_rate"]]
sub_rate <- round(sub_rate, digits = 2)
p_value_lm <- stats[["pvalue"]]

# NV counts
nv.counts <- fromJSON(params$sum_nv)
n_SNV <- nv.counts[["SNV"]]
n_INDELS <- nv.counts[["INDELS"]]

# Summary table
table <- read.csv(params$table)

# Heatmap
vcf <- read_csv(params$heat_tab) 
row.names(vcf) <- vcf$`...1`
vcf <- vcf[-1]
cor.mat <- cor(vcf)
cor.mat[is.na(cor.mat)] <- 0
```

## Summary of the target samples dataset

```{r summary_table, echo = F,message = F}
#| label: tbl-summary
#| tbl-cap: Target dataset summary

table %>%
  mutate(
    DeltaTime = difftime(
      as.Date(Collection_Date),
      min(as.Date(Collection_Date), na.rm = TRUE), units = "days"
    )
  ) %>%
  gt %>%
  cols_label(
    Sample = "Sample",
    Collection_Date = "Collection date",
    Lineage = "Lineage",
    DeltaTime = "Days after first sampling"
  ) %>%
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    location = cells_column_labels()
  ) %>%
  cols_align(
    align = "center",
    columns = everything()
  ) %>% 
  opt_table_font(
    font = google_font("Montserrat")
  )
```

## Evidence for single, serially-sampled infection

### Lineage admixture

The estimated lineage admixture for each sample has been calculated
using [Freyja](https://github.com/andersen-lab/Freyja).

![Estimated lineage admixture of each sample.
Samples in the X-axis are ordered chronologically, from more ancient to newer.](`r params$freyja`){#fig-freyja}

### Phylogenetic reconstruction

A maximum likelihood tree of the target and context samples has been
built using [IQTREE](http://www.iqtree.org/).
The target samples `r stats[["monophyly"]]` monophyletic. The clade
that contains all the target samples is supported by a UFBoot score of
$`r stats[["boot"]]`$% and a SH-aLRT score of $`r stats[["alrt"]]`$% (@fig-tree_ml).

![Maximum-likelihood phylogeny with $`r params$ufboot_reps`$ UFBoot
and $`r params$shalrt_reps`$ SH-aLRT support replicates of the
target dataset and its context samples. The clade that contains the target
samples is squared in red.](`r params$tree_ml`){#fig-tree_ml}

### Nucleotide diversity comparison

Nucleotide diversity (π) has been calculated for $`r div_values[["boot.reps"]]`$ random
sample subsets of size $`r div_values[["sample.size"]]`$, extracted
from the context dataset. The distribution of the nuclotide diversity is assumed to
`r div_values[["norm.text"]]` be normal after performing a Shapiro-Wilk test
(p-value of $`r div_values[["normal.pvalue"]]`$).

The nucleotide diversity of the target samples is $`r div_values[["diversity"]] `$ (red line in @fig-div).
Assuming the independence of the context samples, the `r div_values[["type.test"]]`
p-value of the context samples having a nucleotide diversity (in orange in @fig-div)
as low as that of the target dataset is $`r div_values[["p.value"]]`$.

![Analysis of the nucleotide diversity (π). The orange line describes
a normal distribution with the same mean and standard deviation as the distribution
of π from $`r div_values[["boot.reps"]]`$ subsets of $`r div_values[["sample.size"]]`$
sequences from the context. The red vertical line indicates the π value of the
target samples.](`r params$div`){#fig-div}

## Evolutionary trajectory of the serially-sampled SARS-CoV-2 infection

### Number of polymorphic sites

Sites with minor allele frequency $> `r params$min_ivar_freq`$ are considered polymorphic.
The linear association between the collection date of the samples and the number of
polymorphic sites has an $R^2$ of $`r nv.counts[["r2"]]`$ and a p-value of
$`r nv.counts[["value"]]`$ (@fig-fig_cor_snp).

![Number of polymorphic sites along time. The
blue line shows the linear model fit.](`r params$fig_cor_snp`){#fig-fig_cor_snp}

### Description of intra-host nucleotide variants

A total of $`r n_SNV`$ different single nucleotide variants (SNV) and $`r n_INDELS`$
insertions and deletions (indels) have been detected along the genome (@fig-SNV).

::: {.panel-tabset}

## Whole genome

![Summary of the intra-host accumulation of nucleotide variants,
using the reconstructed dataset ancestor as reference. A) Nucleotide
variants per site along the SARS-CoV-2 genome. Relative abundance of NVs is calculated
with a sliding window of width $`r nv.counts[["window"]]`$ nucleotides and a step of
$`r nv.counts[["step"]]`$. Labels indicate the coding regions of the non structural
proteins (NSP) within ORF1ab. B) Genome variation along the genome for each sample.
The Y-axis displays samples in chronological order, with the earliest collection date
at the bottom, and the latest, at the top.](`r params$SNV`){#fig-SNV}

## Spike ORF

![Summary of the intra-host accumulation of nucleotide variants
in the spike sequence, using the reconstructed dataset ancestor as reference. A) Nucleotide
variants per site along the S gene. Relative abundance of NVs is calculated
with a sliding window of width $`r nv.counts[["window"]]`$ nucleotides and a step of
$`r nv.counts[["step"]]`$. B) Genome variation along the S gene for each sample.
The Y-axis displays samples in chronological order, with the earliest collection date
at the bottom, and the latest, at the top.](`r params$SNV_s`){#fig-SNV_s}

:::

### Temporal signal of the intra-host mutations

The correlation of the allele frequency of each NV with the time since the
initial sampling has been calculated (@fig-volcano).

![Pearson’s correlation coefficients and adjusted p-values of
allele frequencies with time. Red dashed line indicates adjusted $p = 0.05$.
Labeled dots represent nucleotide variants correlated with time
(adjusted $p < 0.05$).](`r params$volcano`){#fig-volcano}

Significantly correlated nucleotide variants are described in more detail in @fig-panel.

![Time series of relative allele frequencies. The shown positions include
nucleotide variants with a significant correlation with time and sites with more
than two possible states. Each subplot depicts the progression of the allele
frequencies in time for a given genome position.](`r params$panel`){#fig-panel}

A neighbor-joining tree has been constructed using pairwise distances
between target samples (@fig-tree), based on the allele frequencies measured from
read mappings.

![Neighbor-joining tree based on the pairwise allele
frequency-weighted distances.](`r params$tree`){#fig-tree}

To estimate the evolutionary rate, root-to-tip distances measured on the previous
tree (@fig-tree) have been correlated with time, obtaining a $R^2$ of
$`r correlation`$ and a p-value of $`r p_value_lm`$. The estimated evolutionary
rate is $`r sub_rate`$ number of changes per year (@fig-tempest).

![Scatterplot depicting the relationship between root-to-tip
distances and the number of days passed since the first sample. The red
line shows the linear model fit.](`r params$tempest`){#fig-tempest}

### Correlation between alternative alleles

To detect possible interactions between mutations, pairwise correlation between allele
frequencies are calculated (@fig-heatmap). The heatmap is an interactive figure that allows
zooming in on specific regions.

```{r heatmap, echo = F, message = F, fig.align = 'center'}
#| label: fig-heatmap
#| fig-cap: "Interactive hierarchically clustered heatmap of the pairwise Pearson’s correlation coefficients between the time series of allele frequencies in the case study."

heatmaply_cor(
  cor.mat,
  fontsize_row = 8, 
  fontsize_col = 8,
  column_text_angle = 90
)
```

### Non-synonymous and synonymous substitution rate over time

To track selection footprints, the substitutions per synonymous site ($dS$) and
per non-synonymous site ($dN$) for each sample with respect to the reconstructed
ancestral sequence have been calculated (@fig-evo), as well as their ratio ($\omega = dN/dS$; @fig-omega).

::: {.panel-tabset}

## $dN$ and $dS$
![Time series of $dN$ and $dS$. Each point corresponds to a different sample, sorted in chronological order.](`r params$evo`){#fig-evo}

## $\omega$ ($dN/dS$)

![Time series of $\omega$ ($dN/dS$). Each point corresponds to a different sample, sorted in chronological order.](`r params$omega_plot`){#fig-omega}

:::