Matt Selensky edited this page Oct 26, 2024 · 39 revisions

Welcome to the bngal wiki!

IMPORTANT NOTE: The bngal documentation has moved and these Wiki pages are no longer maintained! Please visit https://mselensky.github.io/bngal/ for the most up-to-date documentation.

What is bngal?

Biological Network Graph Analysis and Learning (bngal) is a package primarily written in R to create high-quality, complex correlation networks from biological data.

bngal can create correlation networks at each level of taxonomic classification (phylum to ASV) from a taxonomic count table to visualize complex co-occurrence substructures in the data via edge betweenness clustering. Numeric variables from a corresponding metadata table can be optionally included to explore environmental-taxonomic correlations. "Subcommunity networks" can also be created in parallel to explore different correlation patterns within a dataset in addition to a global comparison. For example, one may want to examine separate networks for the human skin, oral, and gut microbiomes from the same dataset, while also examining microbial co-occurrence patterns across the entire body. Another may want to do the same thing for subsurface environments that span distinct geological contexts. As such, microbial ecologists from a wide range of backgrounds may be interested in applying bngal to model microbial niche space in the habitats they study with network analysis!

Installation

Although bngal is released as a standalone R package and can be used interactively in an IDE such as RStudio, I strongly recommend running the command-line utility wrapper (bngal-cli) to simplify its use, especially for first-time users. bngal-cli currently only works on macOS and Linux.

You can quickly install both the bngal R package and its command-line utility wrapper via the following instructions:

Command line utility (recommended)

The command line utility can be installed from a container image hosted on DockerHub, which you can run with either Docker or Singularity, or into an Anaconda environment. I recommend using one of the images hosted on DockerHub (do note your chip architecture - discussed further below).

Docker (and Singularity)

Before pulling from DockerHub, please note your chip architecture - arm64 (e.g., Apple Silicon) or amd64 / x86_64 (e.g., Intel). You will want to pull the right image matched to your chip type. The bngal images are bootstrapped with micromamba-docker:

| Image Tag | Version | Architecture |
| --- | --- | --- |
| mjsel/bngal:1.0.0 | 1.0.0 | amd64 / x86_64 |
| mjsel/bngal:1.0.0-arm64 | 1.0.0 | arm64 |

bngal-cli is easily installable if you use Docker. You will only need to install one of these, depending on your architecture:

# amd64 / x86_64
docker pull mjsel/bngal:1.0.0
# arm64
docker pull mjsel/bngal:1.0.0-arm64

Alternatively, you can pull the same image if you use Singularity:

# amd64 / x86_64
singularity pull docker://mjsel/bngal:1.0.0
# arm64
singularity pull docker://mjsel/bngal:1.0.0-arm64
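If you are unsure which image matches your machine, a small helper can pick the tag for you. This is only a sketch: the function name bngal_image_for_arch is made up for illustration, and it assumes that the architecture string printed by uname -m is a reliable guide (Linux on ARM typically reports aarch64, which is treated here as arm64).

```shell
# Map a machine architecture string (as printed by `uname -m`) to a bngal image tag.
# The function name is hypothetical; the tags are the ones listed in the table above.
bngal_image_for_arch() {
  case "$1" in
    x86_64|amd64)  echo "mjsel/bngal:1.0.0" ;;
    arm64|aarch64) echo "mjsel/bngal:1.0.0-arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Pick the image for the current machine and print it.
BNGAL_IMAGE=$(bngal_image_for_arch "$(uname -m)") && echo "$BNGAL_IMAGE"
```

You could then run docker pull "$BNGAL_IMAGE" (or singularity pull "docker://$BNGAL_IMAGE") instead of typing the tag by hand.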

Anaconda virtual environment

If you prefer to use conda instead, follow these steps:

  1. Install the appropriate Anaconda version for your operating system if you don't have it already.
  2. Clone the bngal-cli GitHub repository into your directory of choice (my-directory) and run the setup script in a bash or zsh shell session. This will install the bngal R package within a new conda environment called "bngal":
cd my-directory
git clone https://github.com/mselensky/bngal-cli
cd bngal-cli
bash bngal-setup.sh

And that's it! Sit tight and grab a coffee while bngal-cli installs. It may take a few minutes.

Once you successfully install and activate the bngal environment, you can remove the bngal-cli folder. When the bngal environment is active, you will have access to two bngal functions:

| Function | Application |
| --- | --- |
| bngal-build-nets | Build network model(s) according to defined cutoffs |
| bngal-summarize-nets | Summarize and visualize network statistics from bngal-build-nets |
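A quick way to confirm that both functions are reachable is to look them up on your PATH. The helper below is a minimal sketch (missing_commands is a hypothetical name, not part of bngal) that prints whichever commands it cannot find:

```shell
# Print the names of any given commands that are not found on the PATH.
# Returns 0 if all commands are available, 1 otherwise.
missing_commands() {
  n_missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      n_missing=1
    fi
  done
  return "$n_missing"
}

# With the bngal environment active, this should print nothing.
missing_commands bngal-build-nets bngal-summarize-nets || echo "Is the bngal environment active?"
```

If either name is printed, try running conda activate bngal first and check again.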

R package only

If you only want to use the bngal R package interactively, you can install it and its dependencies within an active R session via:

source("https://raw.githubusercontent.com/mselensky/bngal-cli/main/R/install-R-pkgs.R")

Please refer to the internal documentation when using the standalone R package.

Quick Start

To quickly begin using bngal, adapt the example code template below. You should only have to modify the paths to your input files and output directory. Note: if you are using one of the bngal containers and need to access files outside your working directory, remember to bind the appropriate path with another -v (Docker) or -B (Singularity) option!

# define input files
ASV_TABLE=example-asv-table.csv
META_DATA=example-metadata.csv
# define output directory
OUT_DR=`pwd`/all-communities
mkdir -p $OUT_DR

# using Docker
docker run -v `pwd`:/home/mambauser -it mjsel/bngal:1.0.0 \
  bngal-build-nets \
  --asv_table=${ASV_TABLE} \
  --metadata=${META_DATA} \
  --output=${OUT_DR}

docker run -v `pwd`:/home/mambauser -it mjsel/bngal:1.0.0 \
  bngal-summarize-nets \
  --asv_table=${ASV_TABLE} \
  --metadata=${META_DATA} \
  --network_dir=${OUT_DR} \
  --fill_ebc_by="metadata_var_you_want_to_visualize"
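To illustrate the note about binding extra paths, here is a sketch that prints (rather than runs) a Docker command with a second -v option. The host directory /path/to/data and the container-side mount point /data are placeholders I made up for this example, as is the helper name print_bngal_docker_cmd; substitute your own paths. For Singularity, the analogous option is -B.

```shell
# Hypothetical host directory holding the input files, and a hypothetical
# mount point for it inside the container.
DATA_DIR="/path/to/data"
MOUNT_POINT="/data"

# Print (rather than run) a docker command that binds both the working
# directory and a second directory, then reads the inputs from the
# container-side path of that second bind.
print_bngal_docker_cmd() {
  printf '%s\n' \
    "docker run -v $(pwd):/home/mambauser -v $1:$2 -it mjsel/bngal:1.0.0 \\" \
    "  bngal-build-nets \\" \
    "  --asv_table=$2/example-asv-table.csv \\" \
    "  --metadata=$2/example-metadata.csv \\" \
    "  --output=\${OUT_DR}"
}

print_bngal_docker_cmd "$DATA_DIR" "$MOUNT_POINT"
```

Printing the command first lets you check that the paths passed to bngal-build-nets are the ones visible inside the container, not the ones on the host.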

For more details, consult the helpful links below, or keep reading for a couple of common use cases.

Helpful links:

Example use case 1: global network

Step 1: bngal-build-nets

The first step in the bngal pipeline, bngal-build-nets, creates co-occurrence networks at a specified level of taxonomic classification (phylum to ASV) and exports the output data for downstream processing. Critically, the first column of the ASV/OTU table must be named "sample-id", while the remaining columns are taxonomic IDs. One of the metadata file's columns must also be named "sample-id" (position does not matter). Both files must be in CSV format and contain unique "sample-id" values.
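Since these formatting rules are easy to get wrong, it can help to check an ASV/OTU table before building networks. The two helpers below are a rough sketch with hypothetical names; they assume a plain comma-separated file whose first column contains no embedded commas, and they are not part of bngal itself.

```shell
# Succeeds if the first column of a CSV file is named "sample-id".
first_column_is_sample_id() {
  head -n 1 "$1" | tr -d '"\r' | cut -d, -f1 | grep -qx 'sample-id'
}

# Prints any values that occur more than once in the first column
# (header excluded). Prints nothing if all values are unique.
duplicated_first_column_values() {
  tail -n +2 "$1" | tr -d '"\r' | cut -d, -f1 | sort | uniq -d
}

# Tiny made-up table to demonstrate the checks.
demo=$(mktemp)
printf 'sample-id,taxonA,taxonB\ns1,10,0\ns2,3,7\n' > "$demo"
first_column_is_sample_id "$demo" && echo "header OK"
[ -z "$(duplicated_first_column_values "$demo")" ] && echo "sample IDs are unique"
rm -f "$demo"
```

Because the metadata file may have "sample-id" in any position, the header check here only applies to the ASV/OTU table.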

If you use qiime2 to process your sequencing data like many microbial ecologists do, I recommend using the read_qza() function from the qiime2R package to import a collapsed ASV-level table into R and export it as a CSV file for use in bngal:

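# import a collapsed table from qiime2 and export it as a CSV for bngal
library(tidyverse)
library(qiime2R)
read_qza("collapsed-table-l7.qza") %>%
  .[["data"]] %>%                      # feature-by-sample count matrix
  t() %>%                              # transpose so that samples are rows
  as.data.frame() %>%
  rownames_to_column("sample-id") %>%  # write_csv() drops row names, so store sample IDs in the first column
  write_csv("example-asv-table.csv")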

There are only three required options for bngal-build-nets: --asv_table, a rarefied ASV/OTU table; --metadata, sample metadata corresponding to the ASV table; and --output, a directory path that must exist. By default, bngal will only create networks from pairwise associations that have at least 5 observations across the dataset and an absolute correlation coefficient of at least 0.6 (p <= 0.05). bngal also assumes by default that co-occurrences will be analyzed at the ASV level from an ASV-level taxonomic count table, but users may tweak this and many other parameters to their liking - see the bngal-build-nets Wiki page for more details.
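To make the default cutoffs concrete, the sketch below applies them to a small made-up table of pairwise associations. It only illustrates the filtering logic described above and is not how bngal implements it internally; the helper name filter_edges and the column layout (taxon1, taxon2, n_obs, r, p) are assumptions for this example.

```shell
# Keep only pairwise associations that pass the cutoffs described above.
# Usage: filter_edges FILE [MIN_OBS] [MIN_ABS_R] [MAX_P]
# Input columns (hypothetical): taxon1,taxon2,n_obs,r,p
filter_edges() {
  awk -F, -v min_obs="${2:-5}" -v min_r="${3:-0.6}" -v max_p="${4:-0.05}" '
    NR == 1 { print; next }                  # pass the header through
    {
      abs_r = ($4 < 0) ? -$4 : $4            # absolute correlation coefficient
      if ($3 >= min_obs && abs_r >= min_r && $5 <= max_p) print
    }
  ' "$1"
}

# Made-up associations: only A-B and B-C pass all three default cutoffs.
edges=$(mktemp)
printf '%s\n' 'taxon1,taxon2,n_obs,r,p' \
  'A,B,12,0.81,0.001' \
  'A,C,4,0.90,0.010' \
  'B,C,9,-0.72,0.020' \
  'C,D,20,0.35,0.001' > "$edges"
filter_edges "$edges"
rm -f "$edges"
```

Note that the negative correlation between B and C is kept because the cutoff applies to the absolute value of the coefficient. Passing 3 as the second argument would loosen the observation cutoff, similar in spirit to the --obs_threshold=3 setting used in the second use case.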

The simplest use case is to create a global network of the entire input ASV table without including any metadata variables. By default, the "observational threshold", or number of unique observations required per pairwise relationship to be included in the network, is set to 5. Building such a network looks like:

cd data-directory
OUT_DR=`pwd`/all-communities
mkdir -p $OUT_DR

# using conda
conda activate bngal
bngal-build-nets \
  --asv_table="example-asv-table.csv" \
  --metadata="example-metadata.csv" \
  --output=$OUT_DR

# using Docker
docker run -v `pwd`:/home/mambauser -it mjsel/bngal:1.0.0 \
  bngal-build-nets \
  --asv_table="example-asv-table.csv" \
  --metadata="example-metadata.csv" \
  --output=$OUT_DR

The above command produces several output subfolders. The network-plots/pdfs subfolder contains publication-ready network visualizations with nodes colored by phylum and by edge betweenness cluster (EBC):

Nodes colored by phylum:

[Figure: global network with nodes colored by phylum]

Nodes colored by EBC:

[Figure: global network with nodes colored by EBC]

Nodes can also be colored by "functional groupings" from a curated list of family-level functions defined in the literature. Note: be very careful with any conclusions you might draw from this! Remember that phylogeny != function. Functional categories are based on the nearest cultured relative. When multiple major biogeochemical functions are represented within a given family, the grouping is marked as "multiple". Refer to this key for Grouping legend names. This feature is only available at the taxonomic level of "family" or below:

[Figure: network with nodes colored by functional grouping]

To facilitate network structure exploration, network-plots/html contains the same plots as interactive HTML figures that users can manually manipulate and re-save as PDFs:

[Figure: interactive HTML network plot]

The pairwise-summaries output subfolder contains a list of pairwise node statistics for each sample included in network analysis.
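After a run finishes, you may want to confirm that the subfolders described above were actually created. The following is a minimal sketch with a hypothetical helper name; it assumes the subfolders sit directly under the output directory, and bngal may well create others that are not checked here.

```shell
# Print any expected output subfolders that are missing from an output directory.
# Subfolder names are taken from the text above.
missing_build_outputs() {
  for sub in network-plots/pdfs network-plots/html pairwise-summaries; do
    [ -d "$1/$sub" ] || echo "$sub"
  done
}

# Check the output directory from the example (falls back to ./all-communities).
missing_build_outputs "${OUT_DR:-all-communities}"
```

An empty result means all three subfolders were found.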

Step 2: bngal-summarize-nets

The second step in the bngal pipeline, bngal-summarize-nets, outputs more useful network summary data and plots. bngal-summarize-nets takes the output directory path of bngal-build-nets as its input. While bngal-build-nets constructs the networks and identifies edge betweenness clusters (EBC) in the data, bngal-summarize-nets calculates the relative abundance of each EBC per sample in the dataset. These summary data, alongside the distribution of each EBC and taxon in the dataset, are exported to the network-summary-tables subfolder. Notably, the "*_tax_spread.csv" output file reports the EBC assigned to a given taxon along with its abundance distribution in the data set.
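To take a quick look at the "*_tax_spread.csv" files without opening them one by one, something like the following may help. It is just a sketch: the helper name is made up, and it searches the whole directory tree rather than assuming an exact location for network-summary-tables.

```shell
# List every "*_tax_spread.csv" file found under a directory, sorted by path.
list_tax_spread_files() {
  find "$1" -type f -name '*_tax_spread.csv' | sort
}

# Show the first few rows of each file under the output directory
# (falls back to the current directory if OUT_DR is not set).
list_tax_spread_files "${OUT_DR:-.}" | while IFS= read -r f; do
  echo "== $f"
  head -n 3 "$f"
done
```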

bngal-summarize-nets is also useful for visualizing biogeographic patterns of taxonomic and EBC distributions. For example, imagine that your samples are categorized by the metadata column sample_type and you want to examine whether certain EBCs are associated with certain types of samples. By including the --fill_ebc_by option below, bngal-summarize-nets will produce "EBC composition" plots that summarize which sample_type the majority of the taxa comprising each EBC originate from:

 bngal-summarize-nets \
  --asv_table="example-asv-table.csv" \
  --metadata="example-metadata.csv" \
  --network_dir=$OUT_DR \
  --fill_ebc_by="sample_type"

By examining the contents of ebc-composition-plots at the ASV level, we see that EBCs 2 and 12 are both highly central clusters in the network, but tend to be most abundant in different sample types:

[Figure: ASV-level EBC composition plots filled by sample_type]

To visualize the distribution of EBCs across each sample, bngal-summarize-nets also produces clustered taxa barplots. Examining the contents of taxa-barplots/ebc reveals more biogeographic patterns; EBC 1 is almost exclusive to the biofilm sample type, while EBCs 2 and 3 appear to be fairly well-distributed throughout the dataset:

[Figure: clustered taxa barplot filled by EBC]

Similar clustered barplots filled by taxonomic phylum and by family-level functional groupings are saved to taxa-barplots/phylum and taxa-barplots/groupings, respectively. For example, this is the same clustered barplot filled by phylum:

[Figure: clustered taxa barplot filled by phylum]

Example use case 2: subnetworks defined by metadata column

We see from the first use case that some EBCs, such as EBC 1, are most abundant within certain "sample types" as defined by the sample metadata column sample_type. By examining the clustered taxa barplots and the contents of the network-summary-tables output subfolder, you notice that the abundance of some EBCs also appears to vary within a given sample type. As such, you have reason to believe that taxonomic co-occurrence patterns may differ significantly from type to type. In other words, the global network described in the first example use case is likely "smoothing" the pairwise relationships you see as an average across the dataset.

To explore separate networks for each type of sample, you can pass the --subnetworks option to bngal-build-nets and bngal-summarize-nets. As both commands require the same option, I recommend saving the metadata column name as a shell variable in your script (SUBNETS in the example below). Additionally, we can add four numeric environmental variables from our metadata into the network with the --corr_columns option. bngal-build-nets can build subnetworks in parallel if the --cores option is defined.

Assuming each sample is also classified by the categorical cave variable as defined in the metadata, we can pass the --fill_ebc_by="cave" option to bngal-summarize-nets. This will produce "EBC composition" plots that summarize, for each sample-type network, which cave the majority of the taxa comprising each EBC originate from:

conda activate bngal

cd data-directory
OUT_DR=`pwd`/sample_type
SUBNETS="sample_type"
mkdir -p $OUT_DR

bngal-build-nets \
  --asv_table="example-asv-table.csv" \
  --metadata="example-metadata.csv" \
  --subnetworks=$SUBNETS \
  --obs_threshold=3 \
  --corr_columns='env_var1,env_var2,env_var3,env_var4' \
  --cores=4 \
  --output=$OUT_DR

bngal-summarize-nets \
  --asv_table="example-asv-table.csv" \
  --metadata="example-metadata.csv" \
  --network_dir=$OUT_DR \
  --output=$OUT_DR \
  --taxonomic_level="asv" \
  --cores=4 \
  --subnetworks=$SUBNETS \
  --fill_ebc_by="cave" \
  --interactive=FALSE
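The commands above refer to several metadata columns by name (sample_type, cave, and the env_var columns), so a misspelled column name is an easy mistake to make. As a rough safeguard, the sketch below checks a metadata header for the columns you plan to use. The helper names are hypothetical, the demo values are made up, and the check assumes that column names contain no embedded commas.

```shell
# Succeeds if a CSV file's header contains a column with exactly the given name.
has_column() {
  head -n 1 "$1" | tr -d '"\r' | tr ',' '\n' | grep -qxF -- "$2"
}

# Print every requested column that is absent from the file's header.
# Usage: missing_columns FILE COLUMN [COLUMN ...]
missing_columns() {
  file=$1; shift
  for col in "$@"; do
    has_column "$file" "$col" || echo "$col"
  done
}

# Made-up metadata that lacks two of the four environmental variables.
meta=$(mktemp)
printf 'sample-id,sample_type,cave,env_var1,env_var2\ns1,biofilm,caveA,7.1,0.2\n' > "$meta"
missing_columns "$meta" sample-id sample_type cave env_var1 env_var2 env_var3 env_var4
rm -f "$meta"
```

Any names printed would need to be fixed in the metadata file or in the --subnetworks, --fill_ebc_by, and --corr_columns options before running bngal.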

Visually, we can see clear differences in the ASV-level network structures between biofilm and spheroid sample types. Note the differences in how the environmental metadata (squares) fit into the networks:

[Figures: ASV-level networks for the biofilm and spheroid sample types]

Finally, you can also pass the optional --query argument to compare co-occurrence patterns among specific taxa across the groups defined by the --subnetworks option. Add --skip_plotting=TRUE if outputting to the same directory as a previous run to save computational power and time.

  bngal-summarize-nets  \
    --asv_table="example-asv-table.csv" \
    --metadata="example-metadata.csv" \
    --network_dir=$OUT_DR \
    --output=$OUT_DR \
    --taxonomic_level="asv" \
    --cores=4 \
    --subnetworks="sample_type" \
    --fill_ebc_by="cave" \
    --interactive=FALSE \
    --query "Bacteria;Actinobacteriota;Actinobacteria;Euzebyales;Euzebyaceae;uncultured;uncultured_bacterium" \
    --skip_plotting=TRUE

This will produce a plot summarizing the co-occurrence relationships of the queries across subnetworks. Multiple queries can be supplied as a space-delimited list in the --query argument - a plot for each query is automatically saved as a separate page in the output:

[Figure: co-occurrence summary plot for the queried taxon across subnetworks]