``pbh5tools`` is a collection of tools that can manipulate the content of, or
extract data from, two types of h5 files:

- ``cmp.h5``: files that contain alignment information.
- ``bas.h5`` and ``pls.h5``: files that contain base-call information.

``pbh5tools`` consists of two executables: ``cmph5tools.py`` and
``bash5tools.py``. At the moment, ``cmph5tools.py`` provides a rich set of
tools to manipulate and analyze the data in a ``cmp.h5`` file, while
``bash5tools.py`` provides mechanisms to extract basecall information from
``bas.h5`` files.

To install ``pbh5tools``, run the following command from the ``pbh5tools``
root directory::

    python setup.py install

If you do not have root or sudo permissions, you can install locally by:

- Installing pysam, numpy, Cython, and h5py to your home directory::

      pip install --user --upgrade numpy h5py pysam cython

- Running::

      python setup.py install --user
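
Before running the installer, it can help to confirm that the dependencies listed above are importable from the current Python environment. The following is a minimal sketch rather than part of ``pbh5tools``; the helper name ``missing_modules`` is hypothetical, and the check only looks for the modules without importing them.

```python
import importlib.util

# Import names of the dependencies listed above. The Cython package is
# imported as "Cython" (capital C), even though pip accepts "cython".
REQUIRED_MODULES = ("numpy", "h5py", "pysam", "Cython")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported missing, rerun the ``pip install --user`` step for those packages before installing ``pbh5tools``.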

``bash5tools.py`` can extract read sequences and quality values for both raw
and circular consensus sequencing (CCS) read types and use them to create
``fastq`` and ``fasta`` files.

::

    usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug]
                         [--outFilePrefix OUTFILEPREFIX]
                         [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE]
                         [--minLength MINLENGTH] [--minReadScore MINREADSCORE]
                         [--minPasses MINPASSES]
                         input.bas.h5

    Tool for extracting data from .bas.h5 files

    positional arguments:
      input.bas.h5          input .bas.h5 filename

    optional arguments:
      -h, --help            show this help message and exit
      --verbose, -v         Set the verbosity level (default: None)
      --version             show program's version number and exit
      --profile             Print runtime profile at exit (default: False)
      --debug               Run within a debugger session (default: False)
      --outFilePrefix OUTFILEPREFIX
                            output filename prefix [None]
      --readType {ccs,subreads,unrolled}
                            read type (ccs, subreads, or unrolled) []
      --outType OUTTYPE     output file type (fasta, fastq) [fasta]

    Read filtering arguments:
      --minLength MINLENGTH
                            min read length [0]
      --minReadScore MINREADSCORE
                            min read score, valid only with
                            --readType={unrolled,subreads} [0]
      --minPasses MINPASSES
                            min number of CCS passes, valid only with
                            --readType=ccs [0]
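
To make the usage text easier to reason about, the sketch below rebuilds a comparable ``argparse`` parser from the help output alone. It is not the actual ``bash5tools.py`` source: the argument types (``int``, ``float``) are assumptions, and the ``--verbose``, ``--version``, ``--profile``, and ``--debug`` options are left out. It does illustrate two points visible in the usage text: the input file is a positional argument, and ``--readType`` accepts only the lower-case values ``ccs``, ``subreads``, and ``unrolled``.

```python
import argparse


def build_parser():
    """Approximate the bash5tools.py interface, based only on its help text."""
    parser = argparse.ArgumentParser(
        prog="bash5tools.py",
        description="Tool for extracting data from .bas.h5 files",
    )
    # The input file is positional: it is given without a flag.
    parser.add_argument("input", metavar="input.bas.h5",
                        help="input .bas.h5 filename")
    parser.add_argument("--outFilePrefix", default=None,
                        help="output filename prefix")
    # argparse compares choices exactly, so "CCS" would be rejected.
    parser.add_argument("--readType", choices=["ccs", "subreads", "unrolled"],
                        help="read type")
    parser.add_argument("--outType", default="fasta",
                        help="output file type (fasta, fastq)")
    # The numeric types below are assumptions; the help text only shows
    # the defaults.
    parser.add_argument("--minLength", type=int, default=0)
    parser.add_argument("--minReadScore", type=float, default=0)
    parser.add_argument("--minPasses", type=int, default=0)
    return parser


args = build_parser().parse_args(
    ["--readType", "subreads", "--outFilePrefix", "myreads", "input.bas.h5"]
)
print(args.readType, args.outType, args.minLength)  # prints: subreads fasta 0
```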

Extracting all subreads from ``input.bas.h5`` without any filtering and
exporting them to a FASTA file named ``myreads.fasta``::

    python bash5tools.py --outFilePrefix myreads --outType fasta --readType subreads input.bas.h5

Extracting all CCS reads from ``input.bas.h5`` that have read lengths larger
than 100 and exporting them to FASTQ (``myreads.fastq``)::

    python bash5tools.py --outFilePrefix myreads --outType fastq --readType ccs --minLength 100 input.bas.h5
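
When extractions like these are scripted, for example across many ``bas.h5`` files, it can be convenient to assemble the command in Python. The sketch below is illustrative only: ``bash5tools_command`` and ``format_command`` are hypothetical helpers, and the code assumes ``bash5tools.py`` is on the ``PATH`` after installation. Following the usage text, the input file is passed positionally and the read type is given in lower case.

```python
import shlex

READ_TYPES = ("ccs", "subreads", "unrolled")
OUT_TYPES = ("fasta", "fastq")


def bash5tools_command(input_file, prefix, read_type, out_type="fasta",
                       min_length=None):
    """Build an argument list for bash5tools.py from the documented flags."""
    if read_type not in READ_TYPES:
        raise ValueError("unsupported read type: %r" % (read_type,))
    if out_type not in OUT_TYPES:
        raise ValueError("unsupported output type: %r" % (out_type,))
    argv = ["bash5tools.py",
            "--outFilePrefix", prefix,
            "--outType", out_type,
            "--readType", read_type]
    if min_length is not None:
        argv += ["--minLength", str(min_length)]
    # The input file is a positional argument, so it goes last without a flag.
    argv.append(input_file)
    return argv


def format_command(argv):
    """Render an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(arg) for arg in argv)


cmd = bash5tools_command("input.bas.h5", "myreads", "ccs",
                         out_type="fastq", min_length=100)
print(format_command(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Building a list instead of a single string avoids shell quoting problems when file names contain spaces.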

``cmph5tools.py`` is a multi-command tool that provides access to the
following subtools:

1. merge: Merge multiple ``cmp.h5`` files into a single file.
2. sort: Sort a ``cmp.h5`` file.
3. select: Create a new file from a ``cmp.h5`` file by specifying which reads
   to include.
4. equal: Compare the contents of two ``cmp.h5`` files for equivalence.
5. summarize: Summarize the contents of a ``cmp.h5`` file in a verbose,
   human-readable format.
6. stats: Extract summary metrics from a ``cmp.h5`` file into a ``csv`` file.
7. valid: Determine whether a ``cmp.h5`` file is valid.
8. listMetrics: Emit the available metrics and statistics for use in the
   ``select`` and ``stats`` subcommands.

To list all available subtools provided by ``cmph5tools.py``, simply run::

    cmph5tools.py --help

Each subtool has its own usage information, which can be displayed by running::

    cmph5tools.py <toolname> --help

When running any subtool, it is suggested to use the ``--info`` command-line
argument, since it prints progress information to stdout while the script is
running::

    cmph5tools.py <toolname> --info <other arguments>
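
For scripted pipelines, the same invocation pattern can be assembled programmatically. This is a small sketch under stated assumptions: ``cmph5tools_command`` is a hypothetical helper, ``cmph5tools.py`` is assumed to be on the ``PATH``, and the file name ``aligned.cmp.h5`` and its positional placement are placeholders. Check ``cmph5tools.py <toolname> --help`` for the arguments each subtool actually accepts.

```python
# Subtool names as listed in this document.
SUBTOOLS = ("merge", "sort", "select", "equal", "summarize", "stats",
            "valid", "listMetrics")


def cmph5tools_command(subtool, *args, info=True):
    """Build an argument list of the form: cmph5tools.py <toolname> --info <args>."""
    if subtool not in SUBTOOLS:
        raise ValueError("unknown subtool: %r" % (subtool,))
    argv = ["cmph5tools.py", subtool]
    if info:
        # --info asks the tool to print progress information while it runs.
        argv.append("--info")
    argv.extend(args)
    return argv


# "aligned.cmp.h5" is a placeholder file name used only for illustration.
print(cmph5tools_command("summarize", "aligned.cmp.h5"))
# To execute the command: subprocess.run(argv, check=True)
```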

.. toctree::
   :maxdepth: 2

   cmph5tools-examples