Data analysis workflow for T- and B-cell receptor repertoire sequencing. The workflow identifies clones and their frequency from next generation sequencing of repertoires and includes steps for quality control and bias correction.
Barbera DC van Schaik - [email protected]
Prof. dr. Antoine HC van Kampen - [email protected]
The software packages below are included in this repository for convenience. Please visit the websites for more recent versions and information about the licenses.
The older scripts only work in Python 2 (see execute-all.sh)
- Python 3.5 (or higher)
  - sys
  - os
  - subprocess
  - gzip
  - re
  - math
  - random
  - biopython
  - regex
  - sqlite3
  - matplotlib
  - numpy
  - scipy
  - json
  - pandas
  - shutil
  - argparse
  - csv
- R
  - plyr
- Bash
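The Ansible playbooks used for the analysis machines normally take care of these dependencies. If you set up an environment by hand instead, a minimal sketch follows; only the third-party packages need installing, the other Python modules listed above are part of the standard library:

```bash
# third-party Python packages used by the scripts
pip3 install biopython regex matplotlib numpy scipy pandas

# R package used by the reporting scripts
Rscript -e 'install.packages("plyr")'
```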
These scripts assume that the data is on the SurfSara ResearchDrive (webdav server) and that the drive is mounted at /mnt/immunogenomics
- copy-from-beehub.sh
- copy-basespace-data-to-beehub.py
- execute-all.sh
- report-ALL.sh
- ConcatenateCloneFilesBatch.py
- MakeSamplesFiles.py
- VerifyBasespaceCopy.py
Create a .netrc file in your home directory:
- machine researchdrive.surfsara.nl
- login YOUR_RESEARCHDRIVE_USER
- password YOUR_PASSWORD
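The .netrc file contains a plaintext password, so it is good practice to make it readable only by your own user:

```bash
chmod 600 ~/.netrc
```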
# change "bioinfo" with your own user name on the machine you are working on
sudo mount -t davfs -o uid=bioinfo,gid=bioinfo,rw https://researchdrive.surfsara.nl/remote.php/webdav/amc-immunogenomics /mnt/immunogenomics
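To check that the share is actually mounted before starting any scripts, and to detach it when you are done:

```bash
# verify that the WebDAV share is mounted
mount | grep /mnt/immunogenomics

# unmount when finished
sudo umount /mnt/immunogenomics
```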
You can download the deb package, or fetch it from the ansible-playbooks git repository
sudo apt install ./rclone-v1.61.1-linux-amd64.deb
Run: rclone config
name> remote
Storage> webdav
url> https://researchdrive.surfsara.nl/remote.php/nonshib-webdav
vendor> owncloud
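Once the remote is configured you can verify the connection and transfer data with rclone. The run path below is only an illustration; adjust it to where the run directories live on your drive:

```bash
# list the top-level directories on the ResearchDrive
rclone lsd remote:

# copy a local run directory to the ResearchDrive (example paths)
rclone copy ./run99-20240101-miseq remote:amc-immunogenomics/RUNS/run99-20240101-miseq --progress
```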
You have received the raw data (fastq files) and sample information from the Immunogenomics group. These are stored on the ResearchDrive. Follow the steps below to store them in the right location.
You might need to convert the sample information into the MiSeq Datasheet format first.
This can be done with the notebook MakeDatasheetFromPT.ipynb
Use the script from the ENCORE_AUTOMATION repository:
Transfer the fastq files to the appropriate directory on the ResearchDrive:
/mnt/immunogenomics/RUNS/runNN-yyyymmdd-miseq/Data/NameOfDataset_1/Raw/
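With the ResearchDrive mounted, the transfer can be a plain copy. The paths below are placeholders; substitute the actual run and dataset names:

```bash
# copy the received fastq files to the Raw directory of the dataset
cp /path/to/received/fastq/*.fastq.gz \
   /mnt/immunogenomics/RUNS/runNN-yyyymmdd-miseq/Data/NameOfDataset_1/Raw/
```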
- Start virtual machines for the analysis (in the SurfSara cloud webinterface)
- An Ansible script is used to install the machines
Note: the following steps can run on a Linux laptop or on a virtual machine. In the latter case you need to install the software on the VM first (with Ansible)
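The exact playbooks live in the ansible-playbooks repository mentioned above; the invocation is sketched below with hypothetical file names only:

```bash
# hypothetical inventory and playbook names; use the ones from the
# ansible-playbooks repository
ansible-playbook -i inventory.yml reseda-vm.yml
```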
- Convert the MiSeq sample sheet with `MetaData.py` (creates a json file)
  - An example of a datasheet can be found in the run directories on the ResearchDrive (Data/NameOfDataset_1/Meta/)
- Mount the ResearchDrive webdav server, if you haven't done so already
- Add extra information to the json file with `MakeSamplesFiles.py` (this will also make the SAMPLE-* files)
- Sort and split the SAMPLE-* files with `./SortAndSplit.sh SAMPLE-*`. It does the following (see the sketch after this list):
  - Sorts the SAMPLE-* files: `sort SAMPLE-blah > SAMPLE-blah.sort`
  - Makes manageable jobs by splitting the SAMPLE-*.sort files, e.g.: `split -l 20 SAMPLES-run13-human-BCRh.sort SAMPLES-run13-human-BCRh-`
- Create jobs with `ToposCreateTokens.py` (run with the `-h` option to see the arguments)
- Upload the jobs to the VMs using the script `TransferTokens.py` (run with `-h` to see the options)
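The two commands performed by SortAndSplit.sh boil down to the following sketch (based on the description above; use the actual script for production runs):

```bash
# for every SAMPLE-* file: sort it, then split it into jobs of 20 lines each
for f in SAMPLE-*; do
    sort "$f" > "$f.sort"
    split -l 20 "$f.sort" "$f-"
done
```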
- Login to each virtual machine and do the following:
  - Copy/move the relevant (or all) database files from the directories "reference", "reftables" and "mids" to the root directory of the repository: `git/reseda/`
  - Start the script: `RUN-RESEDA.py tokens/`
- In the jobs the results are automatically transferred to the ResearchDrive webdav server
- Check with the notebook `CompareProcessedUnprocessed.ipynb` if all result files are on the ResearchDrive
- Execute `ConcatenateCloneFilesBatch.py` to generate a bash script for concatenating clone files per project+organism+cell_type, then run the generated script (see the sketch after this list)
- Run `report-ALL.sh` to generate reports about the sequence run (help is available if you do not provide arguments to the script)
- Check for contamination with the notebooks `SampleSimilarity.ipynb` and `SharedClonesDirection.ipynb`
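A rough outline of these post-processing commands; the arguments of ConcatenateCloneFilesBatch.py and the name of the generated script are placeholders, so check the script's usage first:

```bash
# generate the concatenation script(s) per project+organism+cell_type
python ConcatenateCloneFilesBatch.py

# run the generated bash script (hypothetical name)
bash concatenate-clones-run13-human-BCRh.sh

# report-ALL.sh prints its usage when called without arguments
./report-ALL.sh
```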
- Specify the files that were created by ConcatenateCloneFilesBatch.py
- Specify the Datasheet table that you received from the immunogenomics group
- Check by hand if the column names in the Datasheet are correct
- Run the script
- Usually I make reports for all samples per cell_type
- `MakePTtableFromAAreads.R` - Creates a pt.table (sample description) from an AA.reads file
- `SplitAAreads.py` - Splits the AA.reads table per sample (check the column names that you want to include in the file name!)
- `SeqToFastq.py` - Converts the tab-delimited files to fastq format
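A hedged sketch of how these three steps could be chained; every file name below is a placeholder, so check each script's usage before running:

```bash
# create a pt.table (sample description) from the AA.reads file
Rscript MakePTtableFromAAreads.R all.AA.reads

# split the AA.reads table into one file per sample
python SplitAAreads.py all.AA.reads

# convert a per-sample tab-delimited table to fastq
python SeqToFastq.py sample1.AA.reads
```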
Barbera D. C. van Schaik, Paul L. Klarenbeek, Marieke E. Doorenspleet, Sabrina Pollastro, Anne Musters, Giulia Balzaretti, Rebecca E. Esveldt, Frank Baas, Niek de Vries and Antoine H. C. van Kampen (2016) T- and B-cell Receptor Repertoire Sequencing: Quality Control and Clone Identification. In prep.
RESEDA - REpertoire SEquencing Data Analysis
Copyright (C) 2016-2024 Barbera DC van Schaik
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.