analysis for multiple runs per sample #410

shishuo16 · 2024-04-05T04:34:23Z

Hi,

Thank you for this great tool! I have a question about how to merge multiple runs per sample. I have two fastq1 and fastq2 files per sample (four fastq files), I wonder in which step should I merge them together ? "java -jar picard.jar FastqToSam" or "Drop-seq_alignment.sh" ? And can you provide a usage example ?

Thanks in advance.

alecw · 2024-04-10T14:41:09Z

Hi @shishuo16 ,

It depends. Are these multiple 10X reactions, or reads from the same 10X reaction? If they are from multiple 10X reactions, then you shouldn't merge them without some thought, because you might get the same cell barcode in different reactions and you don't want to combine those.

If it's all the same 10X reaction, then you have choices that are all fine, but depend on your computing situation.

You could simply concatenate (using cat) all your fastq1s, and do the same for all your fastq2s. Just make sure you cat them in the same order for both read1 and read2. Note that even if the fastqs are gzipped, you can use cat.
You could run the Drop-seq_alignment.sh workflow independently on each fastq pair, and then merge the aligned BAMs, using Picard MergeSamFiles or samtools merge.

The first approach is simpler. The second approach allows you to do a lot of processing in parallel, which depending on your compute environment might be faster.

Regards, Alec

mschilli87 · 2024-04-10T15:11:47Z

The 2nd approach also has the advantage that when you add more samples later you don't have to re-run all the processing for the previous samples but just for the new data and merge the results.

shishuo16 · 2024-04-11T00:56:22Z

Got it, thank you!

shishuo16 · 2024-04-12T01:11:23Z

@alecw
Hi, I used your first suggestion (cat fastqs), however when I run Drop-seq_alignment.sh, it went to wrong
`
INFO 2024-04-11 13:11:54 TagBamWithReadSequenceExtended

********** NOTE: Picard's command line syntax is changing.

********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)

********** The command line looks like this in the new syntax:

********** TagBamWithReadSequenceExtended -SUMMARY /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align//unaligned_tagged_Cellular.bam_summary.txt -BASE_RANGE 1-12 -BASE_QUALITY 10 -BARCODED_READ 1 -DISCARD_READ false -TAG_NAME XC -NUM_BASES_BELOW_QUALITY 1 -INPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam -OUTPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp//unaligned_tagged_Cell.bam

13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
[Thu Apr 11 13:11:54 CST 2024] TagBamWithReadSequenceExtended INPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam OUTPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp/unaligned_tagged_Cell.bam SUMMARY=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/unaligned_tagged_Cellular.bam_summary.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=10 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false TAG_QUALITY=XQ VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 11 13:11:54 CST 2024] Executing as dell@dell-PowerEdge-T550 on Linux 6.5.0-18-generic amd64; OpenJDK 64-Bit Server VM 11.0.13+7-b1751.21; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.5.3(e21ba24_1679595148)
13:11:54 [main] WARN com.intel.gkl.compression.IntelInflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
13:11:54 [main] WARN com.intel.gkl.compression.IntelDeflaterFactory - Intel Deflater not supported, using Java.util.zip.Deflater
ERROR 2024-04-11 13:11:54 TagBamWithReadSequenceExtended Reads not properly paired! R1: 1 R2: 1
`

I wonder is it because the two fastq1 have same reads name ?

head of the first fastq1 file:
@1
CTCCCNAGAGGAATCCTTCG
+
AAAA/#/EE6AEEE<E/E/E
@2
CGGGTNTGGCGCCGGGCAAG
+
AAA/A#EEEEEEEEEEEEEE
@3
TGCCGNACGCATCTTGGGTG

head of the second fastq1 file:
@1
NAAGGACCTATCGGGCGCTACTGGG
+
#FFFFFFFFFFFFFF:FFFFF::::
@2
NTTCGTACACTTGCGGCGGAGTGAC
+
#FFFFFFFFFFFFFFFFFFFFF:,,
@3
NTGCCTGTTCACCGCAAGATGGTTG

alecw · 2024-04-12T01:29:54Z

Hi @shishuo16 ,
The integer read names are suspicious. These are not normal Illumina FASTQs. Where did the reads come from. Did pipseeker produce these?

Regards, Alec

shishuo16 · 2024-04-12T01:34:23Z

Hi @alecw
I use prefetch command download the sra file from NCBI (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA612966&o=acc_s%3Aa), And then use "fastq-dump.3 ./fastq/${line}/${line}.sra --split-files --origfmt --defline-qual '+' --gzip -O ./fastq/" command to transform them from sra to fastq, and I got the fastq files

alecw · 2024-04-12T01:41:14Z

If your multiple runs have the same read names, you can't cat them like I suggested originally. If these are from the same 10X reaction, which I'm assuming they are since you chose to cat them, you'll need to do something to disambiguate the read names across pairs of fastqs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analysis for multiple runs per sample #410

analysis for multiple runs per sample #410

shishuo16 commented Apr 5, 2024

alecw commented Apr 10, 2024

mschilli87 commented Apr 10, 2024

shishuo16 commented Apr 11, 2024

shishuo16 commented Apr 12, 2024 •

edited

Loading

alecw commented Apr 12, 2024

shishuo16 commented Apr 12, 2024

alecw commented Apr 12, 2024

analysis for multiple runs per sample #410

analysis for multiple runs per sample #410

Comments

shishuo16 commented Apr 5, 2024

alecw commented Apr 10, 2024

mschilli87 commented Apr 10, 2024

shishuo16 commented Apr 11, 2024

shishuo16 commented Apr 12, 2024 • edited Loading

alecw commented Apr 12, 2024

shishuo16 commented Apr 12, 2024

alecw commented Apr 12, 2024

shishuo16 commented Apr 12, 2024 •

edited

Loading