Skip to content

Latest commit

 

History

History
158 lines (138 loc) · 7.92 KB

README.md

File metadata and controls

158 lines (138 loc) · 7.92 KB

INTRODUCTION
INSTALLATION
USAGE
EXAMPLES
ALGORITHM
CITATION
CONTACT

Cdbgtricks

Cdbgtricks is a C++11 modular tool, mainly designed to update a compacted de Bruijn graph, when adding new sequences. In addition it indexes the graph, updates the index when adding sequences and so it enables querying sequences on the graph.

Publication

A preprint is available on bioRxiv:

  • Hannoush, Khodor, Camille Marchet, and Pierre Peterlongo. "Cdbgtricks: strategies to update a compacted de bruijn graph." bioRxiv (2024): 2024-05.
  • Accepted and presented at the Prague Stringology Conference PSC

Experiments

Please refer to this repository to access the experiments of the paper.

Requirements

  • C++11 compiler: GCC >= 4.8.5

  • kmtricks (to find the set of absent k-mers which are in the genome to be added but not in the graph)

Installation

  • Installing kmtricks: First Install kmtricks [version v1.2.1]

    conda install -c conda-forge -c tlemane kmtricks
  • Installing ccdbupdater: Currently we have only the source version

    • From sources
    git clone --recursive https://github.com/khodor14/Cdbgtricks.git
    cd Cdbgtricks && mkdir build && cd build
    cmake -S ../ -B .
    make

Binary usage:

Setup the ulimit

First change the number of files that can be oppened

ulimit -n 2000

Running Cdbgtricks

./cdbgtricks

displays the command line interface:

Usage: ./cdbgtricks [COMMAND] [PARAMETERS]

[COMMAND]:
	update 		 update a compacted de Bruijn graph with a new genome
	index 		 index k-mers of a compacted de Bruijn graph
	convert		 convert a graph from GFA/FASTA to binary or binary to fasta
	query 		 Query reads in Fasta/Fastq format (output presence/absence of reads)
	map		 Map reads in Fasta/Fastq format (output uni-MEMs)

[PARAMETERS]: update
	> Obligatory with required arguments:
	--input_graph[-ig]	 the path to the graph in gfa or fasta format
	--input_genome	 the path to the genome used to augment the input graph
	> Optional with required arguments:
	--load_graph_binary[-lgb] load the graph from binary file
	--k_mer_size[-k] the size of the k-mer.
			 It must be the same value used when constructing the input graph [default: 31]
	--minimizer_size[-m] the size of the minimizer (m < k) [default: 11]
	--k_mer_file	 the file of absent k-mers from the graph if already computed
	--smallest_merge[-s]	 the threshold for creating buckets (rho in the paper) [default: 5000]
	--multiplier_super_bucket[-msb]	 size of super-bucket in terms of small bucket (gamma in the paper) [default: 4]
	--load_index[-li] Input the index file of the graph
	--threads[-t] the number of threads to be used during the update [default: 1]
	> Optional with no arguments:
	--update_index[-u] update the index of the graph [default: no update]
	--output_index[-oi] write the index to a binary file [default: no output]
	--output_graph_binary[-ogb] write the graph in binary format [default: output in fasta]

[PARAMETERS]: index
	> Obligatory with required arguments:
	--input_graph[-ig]	 the path to the graph in gfa/fasta/binary format
	> Optional with required arguments:
	--k_mer_size[-k] the size of the k-mer [default: 31]
	--minimizer_size[-m] the size of the minimizer (m < k) [default: 11]
	--smallest_merge[-s]	 the threshold for creating buckets (rho in the paper) [default: 5000]
	--multiplier_super_bucket[-msb]	 size of super-bucket in terms of small bucket (gamma in the paper) [default: 4]
	> Optional with no arguments:
	 -it	 the input is in text format (fasta/GFA)
	 -ib	 the input is in binary format

[PARAMETERS]: convert
	> Obligatory with required arguments:
	--input_graph[-ig]	 the path to the graph in gfa/fasta/binary format
	> Obligatory with no arguments:
	 -it	 the input is in text format (fasta/GFA)
	 -ib	 the input is in binary format (Cdbtricks fromat)
	 -of	 output in fasta format
	 -ob	 output in binary format (Cdbtricks fromat)

[PARAMETERS]: map or query
	> Obligatory with required arguments:
	--input_graph[-ig]	 the path to the pangenome graph in gfa/fasta/binary format
	--query_reads[-qr]	 the path to the to the read query set in fasta format
	> Optional with arguments:
	--ratio[-r]	 the desired ratio of shared k-mers between a read and the graph [default: 1.0]
	--k_mer_size[-k] the size of the k-mer [default: 31]
	--minimizer_size[-m] the size of the minimizer (m < k) [default: 11]
	--smallest_merge[-s]	 the threshold for creating buckets (rho in the paper)
	--multiplier_super_bucket[-msb]	 size of super-bucket in terms of small bucket (gamma in the paper)
	> Optional with no arguments:
	 -it	 the input graph is in text format (fasta/GFA)
	 -ib	 the input graph is in binary format (Cdbgtricks format)
	--load_index[-li] Input the index file of the graph

[PARAMETERS] Common to all COMMANDS
	--output_file_name[-o]	 the name of the output file
	--help[-h]	 prints this help message
	-v verbose

Examples

  1. Indexing a compacted de Bruijn Graph

    ./cdbgtricks index -ig ../test/graph.gfa -k 31 -s 5000 -msb 1 -m 11 -o graph_index -it
    

    The 31-mers (-k 31) of the compacted de Bruijn graph graph.gfa is indexed using a minimizer size 11 (-m 11). We set here rho to be 5000 (-s 5000). It means that we should have at least 5000 31-mers in a bucket sharing the same 11-mer minimizer to create an MPHF f. We set gamma to be 1 (-msb 1). It means that the size of the super-bucket is 5000. (-it) means that the graph is in text format. The output is written to graph_index_index.bin

  2. Update a compacted de Bruijn graph from a reference genome file with loading the graph index

    ./cdbgtricks update -ig ../test/graph.gfa --input_genome ../test/genome.fa -k 31 -li graph_index_index.bin -o graph_update -u -oi -t 32
    

    The compacted de Bruijn graph is updated with the absent 31-mers (-k 31) of file genome.fa (--input_genome genome.fa). Here kmtricks is used to find the absent k-mers. The index is loaded (-li) and used to update the graph. The graph is written to file graph_update.fa (-o graph_update). The index is updated (-u) and written to graph_update_index.bin.

  3. Update a compacted de Bruijn graph without loading its index

    ./cdbgtricks update -ig ../test/graph.gfa --input_genome ../test/genome.fa -k 31 -o graph_update -u -oi -t 32
    

    The compacted de Bruijn graph is updated with the absent 31-mers (-k 31) of file genome.fa (--input_genome genome.fa). Here kmtricks is used to find the absent k-mers. The graph is written to file graph_update.fa (-o graph_update). The index is created (-u) and written to graph_update_index.bin.

  4. Query a set of reads (output presence/absence) ./cdbgtricks query -ig ../test/graph.gfa -qr queries.fa -k 31 -r 1 -li graph_update_index.bin -o results_query This tries to find if the reads are present or absent. A read is present if it shares a ratio of k-mers equal to r -r 1. The output is written to results_query.tsv

  5. Map a set of reads (output uni-MEMs)

./cdbgtricks map -ig ../test/graph.gfa -qr unitigs_fasta.fa -k 31 -r 1 -li graph_update_index.bin -o results_mapping

Instead of outputing if a read is present or absent, it outputs the uni-MEMs of the read. A uni-MEM is a 5-tuple (unitig id, unitig start, unitig end,read start, read end). The output is written to results_mapping.txt

Contact

For any question, feedback or problem, please feel free to write an issue on this GitHub repository and we will get back to you as soon as possible.