Add forcecells flag to ATAC, CITE, GEX, and MULTI pipelines

OpenOmics · Aug 5, 2024 · 877dc34
1 parent 30b44ae
commit 877dc34

Showing 9 changed files with 357 additions and 34 deletions.
diff --git a/cell-seek b/cell-seek
@@ -314,7 +314,7 @@ def parsed_arguments(name, description):
                 [--aggregate {{mapped, none}}][--libraries LIBRARIES] \\
                 [--features FEATURES] [--cmo-reference CMOREFERENCE] \\
                 [--cmo-sample CMOSAMPLE] [--exclude-introns] [--filter FILTER] \\
-                [--create-bam] [--rename RENAMEFILE] \\
+                [--create-bam] [--rename RENAMEFILE] [--forcecells FORCECELLS] \\
                 --input INPUT [INPUT ...] \\
                 --output OUTPUT \\
                 --pipeline {{gex, ...}} \\
@@ -372,9 +372,11 @@ def parsed_arguments(name, description):
                                 from higher depth samples until each library type has an
                                 equal number of reads per cell that are confidently mapped.
                                 None means to not normalize at all. If this flag is not
-                                used then aggregate will not be run. To run Cell Ranger
-                                aggregate, please select one of the following options:
-                                mapped, none.
+                                used then aggregate will not be run. Aggregate analysis
+                                is generally not needed, but it can be used to generate a
+                                Loupe Browser file for interactive exploration of the data.
+                                To run Cell Ranger aggregate, please select one of the 
+                                following options: mapped, none.
                                   Example: --aggregate mapped
           --libraries LIBRARIES
                                 Libraries file. A CSV file containing information about
@@ -556,16 +558,67 @@ def parsed_arguments(name, description):
                                   Here is an example rename.csv file:
                                     FASTQ,Name
                                     original_name1,new_name1
-                                    original_name2,new_name1
-                                    original_name3,new_name2
-                                    original_name4,new_name3
+                                    original_name2,new_name2
+                                    original_name3,new_name3
+                                    original_name3-2,new_name3
+                                    original_name4,original_name4
+                                  where:
+                                    • FASTQ: The name that is used in the FASTQ file
+                                    • Name: Unique sample ID that is the sample name used for
+                                      Cell Ranger count.
                                 In this example, new_name3 has FASTQ files with two different
                                 names. With this input, both sets of FASTQ files will be used
                                 when processing the sample as new_name3. original_name4 will not
                                 be renamed. Any FASTQ file that does not have the name
                                 original_name1, original_name2, original_name3, or original_name4
                                 will not be run.
                                   Example: --rename rename.csv
+          --forcecells FORCECELLS
+                                Force cells file. A CSV file containing the name of each sample
+                                (the name used for the Cell Ranger output) and the number of
+                                cells to force that sample to. This flag is applicable when using
+                                the GEX, CITE, MULTI, and ATAC pipelines. It is generally used
+                                when the first analysis run estimates the number of cells poorly
+                                and a re-run is needed to adjust the number of cells in the
+                                sample.
+
+                                This file can be created in two different formats. The first
+                                can be used for the GEX, CITE, MULTI, and ATAC pipelines. It
+                                contains the name of the sample and the number of cells to
+                                force it to.
+                                  Here is an example forcecells.csv file:
+                                    Sample,Cells
+                                    Sample1,3000
+                                    Sample2,5000
+                                  where:
+                                    • Sample: The sample name used for the Cell Ranger output
+                                    • Cells: The number of cells the sample should be forced to
+                                In this example, Sample1 and Sample2 will be run while being forced
+                                to have 3000 and 5000 cells respectively. Any other samples that
+                                are processed will be run without using the force cells flag and
+                                will use the default cell calling algorithm.
+
+                                The second format is only compatible with the MULTI pipeline and
+                                would be used when hashtag multiplexing is used and the number of
+                                cells needs to be forced for a specific hashtagged sample.
+                                  Here is an example forcecells.csv file:
+                                    Name,Sample,Cells
+                                    Library1,HTO_1,3000
+                                    Library1,HTO_2,5000
+                                  where:
+                                    • Name: The name of the library that is provided to Cell
+                                      Ranger when running multi analysis. This should match the
+                                      name that is given in the libraries.csv file.
+                                    • Sample: The sample ID used for the associated hashtag. This
+                                      must match the value used in the CMO sample file or the
+                                      CMO reference file that is provided as input. If only a
+                                      CMO reference file is provided, the pipeline by default
+                                      assigns the hashtags the IDs HTO_1, HTO_2, etc.
+                                    • Cells: The number of cells the sample should be forced to
+                                  In this example, the hashtags HTO_1 and HTO_2 in Library1 will
+                                  be run while being forced to 3000 and 5000 cells respectively.
+                                  Any other libraries or samples that are processed will be run
+                                  without using the force cells flag.
 
         {3}{4}Orchestration options:{5}
           --mode {{slurm,local}}
@@ -840,6 +893,15 @@ def parsed_arguments(name, description):
         help = argparse.SUPPRESS
     )
 
+    # Number of cells to force samples to when running Cell Ranger analysis
+    subparser_run.add_argument(
+        '--forcecells',
+        # Check if the file exists and if it is readable
+        type = lambda file: permissions(parser, file, os.R_OK),
+        required = False,
+        help = argparse.SUPPRESS
+    )
+
     # Orchestration Options
     # Execution Method, run locally
     # on a compute node or submit to
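
The help text above describes two layouts for the force cells file: `Sample,Cells` for all four pipelines and `Name,Sample,Cells` for hashtagged MULTI runs. As a rough illustration only, and not the parser that cell-seek itself uses, the sketch below reads either layout with Python's `csv` module. The function name `read_forcecells` and the shape of the returned dictionary are hypothetical, chosen for this example.

```python
import csv
import io


def read_forcecells(handle):
    """Parse a force cells CSV into a dictionary of forced cell counts.

    Hypothetical helper for illustration. Keys are sample names for the
    two-column layout and (library, sample) tuples for the three-column
    MULTI layout. Values are the requested cell counts as integers.
    """
    reader = csv.DictReader(handle)
    fields = list(reader.fieldnames or [])

    # Pick how to build the dictionary key from the header that was found.
    if fields == ["Sample", "Cells"]:
        def make_key(row):
            return row["Sample"].strip()
    elif fields == ["Name", "Sample", "Cells"]:
        def make_key(row):
            return (row["Name"].strip(), row["Sample"].strip())
    else:
        raise ValueError(f"Unrecognized forcecells header: {fields}")

    forced = {}
    for row in reader:
        cells = int(row["Cells"])  # raises ValueError on non-numeric input
        if cells <= 0:
            raise ValueError(f"Cell count must be positive, got {cells}")
        forced[make_key(row)] = cells
    return forced


if __name__ == "__main__":
    # The two example files from the help text, supplied as in-memory strings.
    simple = io.StringIO("Sample,Cells\nSample1,3000\nSample2,5000\n")
    hashtag = io.StringIO("Name,Sample,Cells\nLibrary1,HTO_1,3000\nLibrary1,HTO_2,5000\n")
    print(read_forcecells(simple))
    print(read_forcecells(hashtag))
```

Samples missing from the returned dictionary would simply run with the default cell calling algorithm, which matches the behavior the help text describes for unlisted samples.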

diff --git a/docs/usage/run.md b/docs/usage/run.md
@@ -94,7 +94,7 @@ Each of the following arguments are optional, and do not need to be provided.
 > **Cell Ranger aggregate normalization.**   
 > *type: string*
 >  
-> This option defines the normalization mode that should be used. Mapped is what Cell Ranger would run by default, which subsamples reads from higher depth samples until each library type has an equal number of reads per cell that are confidently mapped.  None means to not normalize at all. If this flag is not used then aggregate will not be run. To run Cell Ranger aggregate, please select one of the following options: mapped, none.
+> This option defines the normalization mode that should be used. Mapped is what Cell Ranger would run by default, which subsamples reads from higher depth samples until each library type has an equal number of reads per cell that are confidently mapped.  None means to not normalize at all. If this flag is not used then aggregate will not be run. Aggregate analysis is generally not needed, but it can be used to generate a Loupe Browser file for interactive exploration of the data. To run Cell Ranger aggregate, please select one of the following options: mapped, none.
 >
 > ***Example:*** `--aggregate mapped`
 
@@ -164,7 +164,7 @@ Each of the following arguments are optional, and do not need to be provided.
 
 ---
   `--rename RENAME`
-> **Rename sample file.**
+> **Rename sample file.**   
 > *type: file*
 >
 > Rename sample file. A CSV file containing the name of the FASTQ file and the new name of the sample. Only the samples listed in the CSV files will be run.
@@ -183,11 +183,35 @@ Each of the following arguments are optional, and do not need to be provided.
 >
 > - *FASTQ:* The name that is used in the FASTQ file
 > - *Name:* Unique sample ID that is the sample name used for Cell Ranger count.
-> 
+>
 > In this example, new_name3 has FASTQ files with two different names. With this input, both sets of FASTQ files will be used when processing the sample as new_name3. original_name4 will not be renamed. Any FASTQ file that does not have the name original_name1, original_name2, original_name3, or original_name4 will not be run.
 >
 > ***Example:*** `--rename rename.csv`
 
+---
+  `--forcecells FORCECELLS`
+> **Force cells file.**  
+> *type: file*
+>
+> Force cells file. A CSV file containing the name of each sample (the name used for the Cell Ranger output) and the number of cells to force that sample to. It is generally used when the first analysis run estimates the number of cells poorly and a re-run is needed to adjust the number of cells in the sample.
+>
+> *Here is an example forcecells.csv file:*
+> ```
+> Sample,Cells
+> Sample1,3000
+> Sample2,5000
+> ```
+>
+> *Where:*
+>
+> - *Sample:* The sample name used for the Cell Ranger output
+> - *Cells:* The number of cells the sample should be forced to
+>
+> In this example, Sample1 and Sample2 will be run while being forced to have 3000 and 5000 cells respectively. Any other samples that are processed will be run without using the force cells flag and will use the default cell calling algorithm.
+>
+> ***Example:*** `--forcecells forcecells.csv`
+
+
 ### 2.2 VDJ
 
 #### 2.2.1 Required Arguments
@@ -245,7 +269,7 @@ Each of the following arguments are required. Failure to provide a required argu
 #### 2.2.2 Analysis Options
 
   `--rename RENAME`
-> **Rename sample file.**
+> **Rename sample file.**  
 > *type: file*
 >
 > Rename sample file. A CSV file containing the name of the FASTQ file and the new name of the sample. Only the samples listed in the CSV files will be run.
@@ -403,6 +427,29 @@ Each of the following arguments are required. Failure to provide a required argu
 >
 > ***Example:*** `--create-bam`
 
+---
+`--forcecells FORCECELLS`
+> **Force cells file.**  
+> *type: file*
+>
+> Force cells file. A CSV file containing the name of each sample (the name used for the Cell Ranger output) and the number of cells to force that sample to. It is generally used when the first analysis run estimates the number of cells poorly and a re-run is needed to adjust the number of cells in the sample.
+>
+> *Here is an example forcecells.csv file:*
+> ```
+> Sample,Cells
+> Sample1,3000
+> Sample2,5000
+> ```
+>
+> *Where:*
+>
+> - *Sample:* The sample name used for the Cell Ranger output
+> - *Cells:* The number of cells the sample should be forced to
+>
+> In this example, Sample1 and Sample2 will be run while being forced to have 3000 and 5000 cells respectively. Any other samples that are processed will be run without using the force cells flag and will use the default cell calling algorithm.
+>
+> ***Example:*** `--forcecells forcecells.csv`
+
 ### 2.4 MULTI
 
 There are multiple different combinations of library types that may result in the use of Cell Ranger `multi` analysis. Any combination that combines GEX and VDJ data for cell calls, or the use of HTO with the Cell Ranger hashtag caller would need `multi` analysis.
@@ -540,7 +587,7 @@ Each of the following arguments are optional, and do not need to be provided.
 > - *id:* Unique ID for this feature. Must not contain whitespace, quote or comma characters. Each ID must be unique and must not collide with a gene identifier from the transcriptome.
 > - *name:* Human-readable name for this feature. Must not contain whitespace.
 > - *sequence:* Nucleotide barcode sequence associated with this hashtag
-> - *feature_type: Type of the feature. This should always be multiplexing capture.
+> - *feature_type:* Type of the feature. This should always be multiplexing capture.
 > - *read:* Specifies which RNA sequencing read contains the Feature Barcode sequence. Must be R1 or R2, but in most cases R2 is the correct read.
 > - *pattern:* Specifies how to extract the sequence of the feature barcode from the read.
 >
@@ -586,6 +633,47 @@ Each of the following arguments are optional, and do not need to be provided.
 >
 > ***Example:*** `--create-bam`
 
+---
+`--forcecells FORCECELLS`
+> **Force cells file.**  
+> *type: file*
+>
+> Force cells file. A CSV file containing the name of each sample (the name used for the Cell Ranger output) and the number of cells to force that sample to. It is generally used when the first analysis run estimates the number of cells poorly and a re-run is needed to adjust the number of cells in the sample.
+>
+> This file can be created in two different formats. The first contains the name of the sample and the number of cells to force it to.
+>
+> *Here is an example forcecells.csv file:*
+> ```
+> Sample,Cells
+> Sample1,3000
+> Sample2,5000
+> ```
+>
+> *Where:*
+>
+> - *Sample:* The sample name used for the Cell Ranger output
+> - *Cells:* The number of cells the sample should be forced to
+>
+> In this example, Sample1 and Sample2 will be run while being forced to have 3000 and 5000 cells respectively. Any other samples that are processed will be run without using the force cells flag and will use the default cell calling algorithm.
+>
+> The second format applies only when hashtag multiplexing is used and the number of cells needs to be forced for a specific hashtagged sample.
+>
+> *Here is an example forcecells.csv file:*
+> ```
+> Name,Sample,Cells
+> Library1,HTO_1,3000
+> Library1,HTO_2,5000
+> ```
+>
+> *Where:*
+>
+> - *Name:* The name of the library that is provided to Cell Ranger when running multi analysis. This should match the name that is given in the libraries.csv file.
+> - *Sample:* The sample ID used for the associated hashtag. This must match the value used in the CMO sample file or the CMO reference file that is provided as input. If only a CMO reference file is provided, the pipeline by default assigns the hashtags the IDs HTO_1, HTO_2, etc.
+> - *Cells:* The number of cells the sample should be forced to
+>
+> In this example, the hashtags HTO_1 and HTO_2 in Library1 will be run while being forced to 3000 and 5000 cells respectively. Any other libraries or samples that are processed will be run without using the force cells flag.
+>
+> ***Example:*** `--forcecells forcecells.csv`
 
 ### 2.5 ATAC
 
@@ -634,7 +722,7 @@ Each of the following arguments are required. Failure to provide a required argu
 
 
 #### 2.5.2 Analysis Options
-
+`--rename RENAME`
 > **Rename sample file.**
 > *type: file*
 >
@@ -659,6 +747,29 @@ Each of the following arguments are required. Failure to provide a required argu
 >
 > ***Example:*** `--rename rename.csv`
 
+---
+  `--forcecells FORCECELLS`
+> **Force cells file.**  
+> *type: file*
+>
+> Force cells file. A CSV file containing the name of each sample (the name used for the Cell Ranger output) and the number of cells to force that sample to. It is generally used when the first analysis run estimates the number of cells poorly and a re-run is needed to adjust the number of cells in the sample.
+>
+> *Here is an example forcecells.csv file:*
+> ```
+> Sample,Cells
+> Sample1,3000
+> Sample2,5000
+> ```
+>
+> *Where:*
+>
+> - *Sample:* The sample name used for the Cell Ranger output
+> - *Cells:* The number of cells the sample should be forced to
+>
+> In this example, Sample1 and Sample2 will be run while being forced to have 3000 and 5000 cells respectively. Any other samples that are processed will be run without using the force cells flag and will use the default cell calling algorithm.
+>
+> ***Example:*** `--forcecells forcecells.csv`
+
 ### 2.6 Multiome
 
 #### 2.6.1 Required Arguments