Smarter unexpected sequence report
* Rework unexpected sequence writing

Rather than sample unexpected sequences by % of each column barcode's
unexpected sequences, we set a maximum number of row barcodes for which
we are willing to track unexpected sequences. When reading the
unexpected sequence cache, we take 1 barcode from each shard in turn,
until we have read all of them or established a sample of that size. All
further unexpected sequences that are not found in that set are not
tracked.

This lets us bound the memory that PoolQ uses for its unexpected
sequence reporting, without needing to individually track the sizes of
individual shards. It should also perform better than the previous
strategy since it reads the cache only once.

* Do not track shard sizes

* Set version to 3.10.0-SNAPSHOT

* Update changelog, manual, and readme

* Fix makefile

* Use `.size`

* Simpler breadth-first iterator
mtomko authored Feb 13, 2024
1 parent a3d7a52 commit 31a4bed
Showing 10 changed files with 166 additions and 130 deletions.
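
The sampling strategy described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the object name `BreadthFirstSampleSketch`, the helpers `breadthFirst` and `sample`, and the choice to model each cache shard as an `Iterator[String]` of row barcodes are all assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical sketch; names and signatures are illustrative assumptions,
// not PoolQ's actual implementation.
object BreadthFirstSampleSketch {

  /** Round-robin ("breadth-first") iterator: yields one element from each
    * source in turn and drops a source once it is exhausted.
    */
  def breadthFirst[A](sources: Seq[Iterator[A]]): Iterator[A] =
    new Iterator[A] {
      // Invariant: every iterator held in the queue has at least one element left.
      private val queue: mutable.Queue[Iterator[A]] =
        mutable.Queue.from(sources.filter(_.hasNext))

      def hasNext: Boolean = queue.nonEmpty

      def next(): A = {
        val it = queue.dequeue()
        val a = it.next()
        if (it.hasNext) queue.enqueue(it) // rotate to the back for its next turn
        a
      }
    }

  /** Counts row barcodes in a single pass over all shards. Once
    * `maxSampleSize` distinct barcodes are tracked, barcodes not already
    * in the sample are ignored, which bounds the size of the map.
    */
  def sample(shards: Seq[Iterator[String]], maxSampleSize: Int): Map[String, Int] = {
    val counts = mutable.HashMap.empty[String, Int]
    breadthFirst(shards).foreach { bc =>
      if (counts.contains(bc)) counts(bc) += 1
      else if (counts.size < maxSampleSize) counts(bc) = 1
      // else: the sample is full and this barcode is not tracked
    }
    counts.toMap
  }
}
```

Interleaving the shards means the sample is seeded from every column barcode's shard rather than filled entirely by whichever shard happens to be read first, and the map never grows beyond `maxSampleSize` entries.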
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog

+## 3.10.0
+* More efficient and memory-safe sampling technique for unexpected sequence reporting

## 3.9.0
* Use sampling technique for generating unexpected sequence reports

2 changes: 1 addition & 1 deletion Makefile
@@ -1,4 +1,4 @@
-fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]\.[0-9]+).*$$/$$1/g')
+fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9]+\.[0-9]+).*$$/$$1/g')

version := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]).*$$/$$1/g')
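
The Makefile fix turns on the difference between `[0-9+]` (a single character that is a digit or a literal `+`) and `[0-9]+` (one or more digits). A small sketch of how the two patterns behave on a two-digit minor version follows. The sample line and the names `VersionRegexSketch`, `oldFull`, `newFull`, `shortVersion`, and `extract` are assumptions for illustration; the patterns are the Makefile's with the Perl and make escaping removed.

```scala
import scala.util.matching.Regex

// Hypothetical sketch; not part of the PoolQ build.
object VersionRegexSketch {
  // Sample version.sbt lines, assumed for illustration.
  val line310: String = "ThisBuild / version := \"3.10.0-SNAPSHOT\""
  val line39: String = "ThisBuild / version := \"3.9.0\""

  // Pattern before the fix: the minor component is exactly one character.
  val oldFull: Regex = """^ThisBuild / version := "([0-9]+\.[0-9+]\.[0-9]+).*$""".r
  // Pattern after the fix: the minor component is one or more digits.
  val newFull: Regex = """^ThisBuild / version := "([0-9]+\.[0-9]+\.[0-9]+).*$""".r
  // Pattern of the `version` variable, which this hunk leaves unchanged.
  val shortVersion: Regex = """^ThisBuild / version := "([0-9]+\.[0-9+]).*$""".r

  /** Returns the captured version, or None when the pattern does not match. */
  def extract(re: Regex, s: String): Option[String] =
    re.findFirstMatchIn(s).map(_.group(1))
}
```

Under these assumptions the old pattern matches `3.9.0` but fails on `3.10.0-SNAPSHOT`, while the fixed pattern captures `3.10.0`. The unchanged `version` pattern still contains `[0-9+]` and would capture `3.1` from the same line, which may or may not be intended.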

46 changes: 23 additions & 23 deletions README.md
@@ -1,68 +1,68 @@
# PoolQ 3.0

-Copyright (c) 2022 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.
+Copyright (c) 2024 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.

[![Build Status](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml/badge.svg)](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml)

## Overview

PoolQ is a counter for indexed samples from next-generation sequencing of pooled DNA. Given a set
of sequencing data files (FASTQ, SAM, or BAM), and a pair of reference files mapping DNA barcodes
-to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the
+to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the
co-occurrence of each pair of barcodes from the two files, yielding a two-dimensional histogram.
-The barcodes in one reference file are treated as rows in the histogram; the other correspond to
+The barcodes in one reference file are treated as rows in the histogram; the other correspond to
columns.

PoolQ is capable of locating barcodes within reads using a variety of techniques:
-* Fixed location
-* Known DNA prefix
-* Template matching

+- Fixed location
+- Known DNA prefix
+- Template matching

It matches barcodes to reference data either exactly or allowing up to one base of mismatch. Currently,
PoolQ does not support matching with gaps or deletions.

In addition to producing a histogram, PoolQ generates a number of reports, which contain statistics and
-other information that can be used to troubleshoot experiments. These include match percentages, barcode
+other information that can be used to troubleshoot experiments. These include match percentages, barcode
locations, matching correlations between barcodes, and lists of frequently-occurring unknown barcodes.

## Documentation
-For information on how to run PoolQ and its various modes and options, please see the

+For information on how to run PoolQ and its various modes and options, please see the
[manual](docs/MANUAL.md). We also maintain a [changelog](CHANGELOG.md) listing updates made to PoolQ.

-As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We
+As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We
welcome contributions to PoolQ and have created a [contributor guide](CONTRIBUTING.md). Additionally,
-we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to
+we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to
associated licenses.

## Changes in PoolQ 3

PoolQ was completely rewritten for version 3. The new code is faster and the codebase is much cleaner
and more maintainable. We have taken the opportunity to make other changes to PoolQ as well.

-* There are substantial changes to the command-line interface for the program.
-* The default counts file format has changed slightly, although there is a command-line
-argument that indicates that PoolQ 3 should write a backwards-compatible counts file. The differences
-are in headers only; file parsers should be able to adapt easily.
-* The quality file has changed somewhat. Importantly, the definition of certain statistics has changed
-slightly, so quality metrics cannot be directly compared between the the new and old versions. In addition,
-we no longer provide normalized match counts.
+- There are substantial changes to the command-line interface for the program.
+- The default counts file format has changed slightly, although there is a command-line
+argument that indicates that PoolQ 3 should write a backwards-compatible counts file. The differences
+are in headers only; file parsers should be able to adapt easily.
+- The quality file has changed somewhat. Importantly, the definition of certain statistics has changed
+slightly, so quality metrics cannot be directly compared between the the new and old versions. In addition,
+we no longer provide normalized match counts.

See the [manual](docs/MANUAL.md) for complete details on the differences versions 2 and 3.

## PoolQ 2 support

-We will continue to make the PoolQ 2.4 artifacts available for download on the
+We will continue to make the PoolQ 2.4 artifacts available for download on the
[GPP portal](https://portals.broadinstitute.org/gpp/public/software/poolq). We have no plans to add
-features to the code. We will address bugs on a case-by-case basis; in general only critical
+features to the code. We will address bugs on a case-by-case basis; in general only critical
bugfixes will be ported to versions prior to 2.4, effective immediately.

-## Maintainers
+## Maintainers

-PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It
+PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It
is maintained by Mark Tomko of the Broad Institute Genetic Perturbation Platform.

## Contact Us

Your feedback of any kind is much appreciated. Please email us at [email protected].


11 changes: 6 additions & 5 deletions docs/MANUAL.md
@@ -2,7 +2,7 @@

PoolQ is a counter for indexed samples from next-gen sequencing of pooled DNA.

-_This documentation covers PoolQ version 3.7.0 (last updated 09/05/2023)._
+_This documentation covers PoolQ version 3.10.0 (last updated 02/12/2024)._

## Background

@@ -171,7 +171,7 @@ prefix, it is assumed to contain unmatched reads.

For paired-end sequencing mode, use the same column barcode-file pair scheme with the arguments to
`--row-reads` and `--rev-row-reads`. It is essential that the comma-delimited lists of files passed to
-`--row-reads` and `-rev-row-reads` are in corresponding order.
+`--row-reads` and `-rev-row-reads` are in corresponding order.

### Reference Files

@@ -559,7 +559,7 @@ PoolQ you will need a Java 8 JDK. You can download an appropriate JRE or JDK fro
You can download PoolQ from an as yet undetermined location. The file you download is a ZIP file
that you will need to unzip. In most cases, this is as simple as right-clicking on the zip file, and
selecting something like "extract contents" from the popup menu. This will create a new folder on
-your computer named `poolq-3.7.0`, with the following contents:
+your computer named `poolq-3.10.0`, with the following contents:

- `poolq3.jar`
- `poolq3.bat`
@@ -610,7 +610,7 @@ You can run PoolQ from any Windows, Mac, or Linux machine, but it requires some
how to launch programs from the command line on your given operating system.

1. Open a terminal window for your operating system
-2. Change directories to the `poolq-3.7.0` directory
+2. Change directories to the `poolq-3.10.0` directory

- On Windows, run:

@@ -627,7 +627,7 @@ how to launch programs from the command line on your given operating system.
If you successfully launched PoolQ, you should see a usage message explaining all of the
command-line options:

-poolq3 3.7.0
+poolq3 3.10.0
Usage: poolq [options]

--row-reference <file> reference file for row barcodes (i.e., constructs)
@@ -661,6 +661,7 @@ command-line options:
--correlation <file>
--run-info <file>
--unexpected-sequence-threshold <number>
+--unexpected-sequence-max-sample-size <number>
--unexpected-sequences <file>
--umi-quality <file>
--unexpected-sequence-cache <cache-dir>
3 changes: 1 addition & 2 deletions src/main/scala/org/broadinstitute/gpp/poolq3/PoolQ.scala
@@ -199,11 +199,10 @@ object PoolQ {
.write(
config.output.unexpectedSequencesFile,
dir,
-unexpectedSequenceTrackerOpt.map(_.unexpectedBarcodeCounts).getOrElse(Map.empty),
config.unexpectedSequencesToReport,
colReference,
globalReference,
-config.unexpectedSequenceSamplePct
+config.unexpectedSequenceMaxSampleSize
)
.as(UnexpectedSequencesFileType.some)
if (config.removeUnexpectedSequenceCache) {
@@ -96,7 +96,7 @@ final case class PoolQConfig(
unexpectedSequenceCacheDir: Option[Path] = None,
removeUnexpectedSequenceCache: Boolean = true,
unexpectedSequencesToReport: Int = 100,
-unexpectedSequenceSamplePct: Double = 0.02,
+unexpectedSequenceMaxSampleSize: Int = 10_000_000,
skipShortReads: Boolean = false,
reportsDialect: ReportsDialect = PoolQ3Dialect,
alwaysCountColumnBarcodes: Boolean = false,
@@ -288,8 +288,8 @@ object PoolQConfig {
else failure(s"Unexpected sequence threshold must be greater than 0, got: $n")
}

-val _ = opt[Double]("unexpected-sequence-sample-pct").valueName("<pct>").action { (f, c) =>
-c.copy(unexpectedSequenceSamplePct = f)
+val _ = opt[Int]("unexpected-sequence-max-sample-size").valueName("<number>").action { (n, c) =>
+c.copy(unexpectedSequenceMaxSampleSize = n)
}

val _ = opt[Path]("unexpected-sequences").valueName("<file>").action { (f, c) =>
@@ -8,7 +8,6 @@ package org.broadinstitute.gpp.poolq3.process
import java.io.{BufferedWriter, Closeable, OutputStreamWriter}
import java.nio.file.{Files, Path}

-import scala.collection.mutable
import scala.util.control.NonFatal

import org.broadinstitute.gpp.poolq3.process.UnexpectedSequenceTracker.nameFor
@@ -17,8 +16,6 @@ import org.log4s.{Logger, getLogger}

final class UnexpectedSequenceTracker(cacheDir: Path, colReference: Reference) extends Closeable {

-private[this] val unexpectedCountsByColBarcode: mutable.Map[String, Int] = mutable.HashMap()

private[this] val log: Logger = getLogger

// prep the directory and create file writers
@@ -36,14 +33,8 @@ final class UnexpectedSequenceTracker(cacheDir: Path, colReference: Reference) e
val writer = outputFileWriters(colBc)
writer.write(rowBc)
writer.write("\n")
-val _ = unexpectedCountsByColBarcode.updateWith(colBc) {
-case None => Some(1)
-case Some(pc) => Some(pc + 1)
-}
}

-def unexpectedBarcodeCounts: Map[String, Int] = unexpectedCountsByColBarcode.toMap

override def close(): Unit =
outputFileWriters.values.foreach { writer =>
try {