Smarter unexpected sequence report

* Rework unexpected sequence writing

  Rather than sample unexpected sequences by % of each column barcode's unexpected sequences, we set a maximum number of row barcodes for which we are willing to track unexpected sequences. When reading the unexpected sequence cache, we take 1 barcode from each shard in turn, until we have read all of them or established a sample of that size. All further unexpected sequences that are not found in that set are not tracked. This lets us bound the memory that PoolQ uses for its unexpected sequence reporting, without needing to individually track the sizes of individual shards. It should also perform better than the previous strategy since it reads the cache only once.

* Do not track shard sizes
* Set version to 3.10.0-SNAPSHOT
* Update changelog, manual, and readme
* Fix makefile
* Use `.size`
* Simpler breadth-first iterator
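
The sampling strategy described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the object name `BreadthFirstSampling` and the functions `breadthFirst` and `sampleCounts` are hypothetical, and each shard is modeled as a plain `Iterator[String]` of row barcodes rather than a cache file on disk.

```scala
import scala.collection.mutable

// Hypothetical sketch; names and signatures are assumptions, not PoolQ's API.
object BreadthFirstSampling {

  /** Visits the shards round-robin: one element from each non-empty shard per pass. */
  def breadthFirst[A](shards: Seq[Iterator[A]]): Iterator[A] =
    Iterator
      // Re-evaluated on every pass, so exhausted shards drop out.
      .continually(shards.iterator.filter(_.hasNext).toVector)
      .takeWhile(_.nonEmpty)
      .flatMap(live => live.iterator.map(_.next()))

  /** Counts barcodes in a single pass, tracking at most `maxSampleSize` distinct barcodes. */
  def sampleCounts(shards: Seq[Iterator[String]], maxSampleSize: Int): Map[String, Int] = {
    val counts = mutable.HashMap.empty[String, Int]
    breadthFirst(shards).foreach { bc =>
      // Already-sampled barcodes keep counting; new ones are admitted only
      // while the sample is below its cap. Everything else is ignored.
      if (counts.contains(bc) || counts.size < maxSampleSize)
        counts(bc) = counts.getOrElse(bc, 0) + 1
    }
    counts.toMap
  }
}
```

Because membership in the sample is decided while the shards are being read, the number of tracked barcodes never exceeds the cap and no per-shard size bookkeeping is required, which matches the motivation given in the commit message.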
broadinstitute · Feb 13, 2024 · 31a4bed
1 parent a3d7a52
Showing 10 changed files with 166 additions and 130 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,8 @@
 # Changelog
 
+## 3.10.0
+* More efficient and memory-safe sampling technique for unexpected sequence reporting
+
 ## 3.9.0
 * Use sampling technique for generating unexpected sequence reports
 

diff --git a/Makefile b/Makefile
@@ -1,4 +1,4 @@
-fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]\.[0-9]+).*$$/$$1/g')
+fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9]+\.[0-9]+).*$$/$$1/g')
 
 version := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]).*$$/$$1/g')
 

diff --git a/README.md b/README.md
@@ -1,68 +1,68 @@
 # PoolQ 3.0
 
-Copyright (c) 2022 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.
+Copyright (c) 2024 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.
 
 [![Build Status](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml/badge.svg)](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml)
 
 ## Overview
 
 PoolQ is a counter for indexed samples from next-generation sequencing of pooled DNA. Given a set
 of sequencing data files (FASTQ, SAM, or BAM), and a pair of reference files mapping DNA barcodes
-to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the 
+to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the
 co-occurrence of each pair of barcodes from the two files, yielding a two-dimensional histogram.
-The barcodes in one reference file are treated as rows in the histogram; the other correspond to 
+The barcodes in one reference file are treated as rows in the histogram; the other correspond to
 columns.
 
 PoolQ is capable of locating barcodes within reads using a variety of techniques:
-* Fixed location
-* Known DNA prefix
-* Template matching
+
+- Fixed location
+- Known DNA prefix
+- Template matching
 
 It matches barcodes to reference data either exactly or allowing up to one base of mismatch. Currently,
 PoolQ does not support matching with gaps or deletions.
 
 In addition to producing a histogram, PoolQ generates a number of reports, which contain statistics and
-other information that can be used to troubleshoot experiments. These include match percentages, barcode 
+other information that can be used to troubleshoot experiments. These include match percentages, barcode
 locations, matching correlations between barcodes, and lists of frequently-occurring unknown barcodes.
 
 ## Documentation
-For information on how to run PoolQ and its various modes and options, please see the 
+
+For information on how to run PoolQ and its various modes and options, please see the
 [manual](docs/MANUAL.md). We also maintain a [changelog](CHANGELOG.md) listing updates made to PoolQ.
 
-As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We 
+As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We
 welcome contributions to PoolQ and have created a [contributor guide](CONTRIBUTING.md). Additionally,
-we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to 
+we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to
 associated licenses.
 
 ## Changes in PoolQ 3
 
 PoolQ was completely rewritten for version 3. The new code is faster and the codebase is much cleaner
 and more maintainable. We have taken the opportunity to make other changes to PoolQ as well.
 
-* There are substantial changes to the command-line interface for the program.
-* The default counts file format has changed slightly, although there is a command-line 
-argument that indicates that PoolQ 3 should write a backwards-compatible counts file. The differences
-are in headers only; file parsers should be able to adapt easily.
-* The quality file has changed somewhat. Importantly, the definition of certain statistics has changed
-slightly, so quality metrics cannot be directly compared between the the new and old versions. In addition,
-we no longer provide normalized match counts.
+- There are substantial changes to the command-line interface for the program.
+- The default counts file format has changed slightly, although there is a command-line
+  argument that indicates that PoolQ 3 should write a backwards-compatible counts file. The differences
+  are in headers only; file parsers should be able to adapt easily.
+- The quality file has changed somewhat. Importantly, the definition of certain statistics has changed
+  slightly, so quality metrics cannot be directly compared between the the new and old versions. In addition,
+  we no longer provide normalized match counts.
 
 See the [manual](docs/MANUAL.md) for complete details on the differences versions 2 and 3.
 
 ## PoolQ 2 support
 
-We will continue to make the PoolQ 2.4 artifacts available for download on the 
+We will continue to make the PoolQ 2.4 artifacts available for download on the
 [GPP portal](https://portals.broadinstitute.org/gpp/public/software/poolq). We have no plans to add
-features to the code. We will address bugs on a case-by-case basis; in general only critical 
+features to the code. We will address bugs on a case-by-case basis; in general only critical
 bugfixes will be ported to versions prior to 2.4, effective immediately.
 
-## Maintainers 
+## Maintainers
 
-PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It 
+PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It
 is maintained by Mark Tomko of the Broad Institute Genetic Perturbation Platform.
 
 ## Contact Us
 
 Your feedback of any kind is much appreciated. Please email us at [email protected].
-
-
diff --git a/docs/MANUAL.md b/docs/MANUAL.md
@@ -2,7 +2,7 @@
 
 PoolQ is a counter for indexed samples from next-gen sequencing of pooled DNA.
 
-_This documentation covers PoolQ version 3.7.0 (last updated 09/05/2023)._
+_This documentation covers PoolQ version 3.10.0 (last updated 02/12/2024)._
 
 ## Background
 
@@ -171,7 +171,7 @@ prefix, it is assumed to contain unmatched reads.
 
 For paired-end sequencing mode, use the same column barcode-file pair scheme with the arguments to
 `--row-reads` and `--rev-row-reads`. It is essential that the comma-delimited lists of files passed to
- `--row-reads` and `-rev-row-reads` are in corresponding order.
+`--row-reads` and `-rev-row-reads` are in corresponding order.
 
 ### Reference Files
 
@@ -559,7 +559,7 @@ PoolQ you will need a Java 8 JDK. You can download an appropriate JRE or JDK fro
 You can download PoolQ from an as yet undetermined location. The file you download is a ZIP file
 that you will need to unzip. In most cases, this is as simple as right-clicking on the zip file, and
 selecting something like "extract contents" from the popup menu. This will create a new folder on
-your computer named `poolq-3.7.0`, with the following contents:
+your computer named `poolq-3.10.0`, with the following contents:
 
 - `poolq3.jar`
 - `poolq3.bat`
@@ -610,7 +610,7 @@ You can run PoolQ from any Windows, Mac, or Linux machine, but it requires some
 how to launch programs from the command line on your given operating system.
 
 1. Open a terminal window for your operating system
-2. Change directories to the `poolq-3.7.0` directory
+2. Change directories to the `poolq-3.10.0` directory
 
 - On Windows, run:
 
@@ -627,7 +627,7 @@ how to launch programs from the command line on your given operating system.
 If you successfully launched PoolQ, you should see a usage message explaining all of the
 command-line options:
 
-    poolq3 3.7.0
+    poolq3 3.10.0
     Usage: poolq [options]
 
       --row-reference <file>   reference file for row barcodes (i.e., constructs)
@@ -661,6 +661,7 @@ command-line options:
       --correlation <file>
       --run-info <file>
       --unexpected-sequence-threshold <number>
+      --unexpected-sequence-max-sample-size <number>
       --unexpected-sequences <file>
       --umi-quality <file>
       --unexpected-sequence-cache <cache-dir>

diff --git a/src/main/scala/org/broadinstitute/gpp/poolq3/PoolQ.scala b/src/main/scala/org/broadinstitute/gpp/poolq3/PoolQ.scala
@@ -199,11 +199,10 @@ object PoolQ {
             .write(
               config.output.unexpectedSequencesFile,
               dir,
-              unexpectedSequenceTrackerOpt.map(_.unexpectedBarcodeCounts).getOrElse(Map.empty),
               config.unexpectedSequencesToReport,
               colReference,
               globalReference,
-              config.unexpectedSequenceSamplePct
+              config.unexpectedSequenceMaxSampleSize
             )
             .as(UnexpectedSequencesFileType.some)
         if (config.removeUnexpectedSequenceCache) {

diff --git a/src/main/scala/org/broadinstitute/gpp/poolq3/PoolQConfig.scala b/src/main/scala/org/broadinstitute/gpp/poolq3/PoolQConfig.scala
@@ -96,7 +96,7 @@ final case class PoolQConfig(
   unexpectedSequenceCacheDir: Option[Path] = None,
   removeUnexpectedSequenceCache: Boolean = true,
   unexpectedSequencesToReport: Int = 100,
-  unexpectedSequenceSamplePct: Double = 0.02,
+  unexpectedSequenceMaxSampleSize: Int = 10_000_000,
   skipShortReads: Boolean = false,
   reportsDialect: ReportsDialect = PoolQ3Dialect,
   alwaysCountColumnBarcodes: Boolean = false,
@@ -288,8 +288,8 @@ object PoolQConfig {
             else failure(s"Unexpected sequence threshold must be greater than 0, got: $n")
           }
 
-        val _ = opt[Double]("unexpected-sequence-sample-pct").valueName("<pct>").action { (f, c) =>
-          c.copy(unexpectedSequenceSamplePct = f)
+        val _ = opt[Int]("unexpected-sequence-max-sample-size").valueName("<number>").action { (n, c) =>
+          c.copy(unexpectedSequenceMaxSampleSize = n)
         }
 
         val _ = opt[Path]("unexpected-sequences").valueName("<file>").action { (f, c) =>

diff --git a/src/main/scala/org/broadinstitute/gpp/poolq3/process/UnexpectedSequenceTracker.scala b/src/main/scala/org/broadinstitute/gpp/poolq3/process/UnexpectedSequenceTracker.scala
@@ -8,7 +8,6 @@ package org.broadinstitute.gpp.poolq3.process
 import java.io.{BufferedWriter, Closeable, OutputStreamWriter}
 import java.nio.file.{Files, Path}
 
-import scala.collection.mutable
 import scala.util.control.NonFatal
 
 import org.broadinstitute.gpp.poolq3.process.UnexpectedSequenceTracker.nameFor
@@ -17,8 +16,6 @@ import org.log4s.{Logger, getLogger}
 
 final class UnexpectedSequenceTracker(cacheDir: Path, colReference: Reference) extends Closeable {
 
-  private[this] val unexpectedCountsByColBarcode: mutable.Map[String, Int] = mutable.HashMap()
-
   private[this] val log: Logger = getLogger
 
   // prep the directory and create file writers
@@ -36,14 +33,8 @@ final class UnexpectedSequenceTracker(cacheDir: Path, colReference: Reference) e
     val writer = outputFileWriters(colBc)
     writer.write(rowBc)
     writer.write("\n")
-    val _ = unexpectedCountsByColBarcode.updateWith(colBc) {
-      case None     => Some(1)
-      case Some(pc) => Some(pc + 1)
-    }
   }
 
-  def unexpectedBarcodeCounts: Map[String, Int] = unexpectedCountsByColBarcode.toMap
-
   override def close(): Unit =
     outputFileWriters.values.foreach { writer =>
       try {