Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysis for multiple runs per sample #410

Open
shishuo16 opened this issue Apr 5, 2024 · 7 comments
Open

analysis for multiple runs per sample #410

shishuo16 opened this issue Apr 5, 2024 · 7 comments

Comments

@shishuo16
Copy link

Hi,

Thank you for this great tool! I have a question about how to merge multiple runs per sample. I have two fastq1 and fastq2 files per sample (four fastq files), I wonder in which step should I merge them together ? "java -jar picard.jar FastqToSam" or "Drop-seq_alignment.sh" ? And can you provide a usage example ?

Thanks in advance.

@alecw
Copy link
Collaborator

alecw commented Apr 10, 2024

Hi @shishuo16 ,

It depends. Are these multiple 10X reactions, or reads from the same 10X reaction? If they are from multiple 10X reactions, then you shouldn't merge them without some thought, because you might get the same cell barcode in different reactions and you don't want to combine those.

If it's all the same 10X reaction, then you have choices that are all fine, but depend on your computing situation.

  • You could simply concatenate (using cat) all your fastq1s, and do the same for all your fastq2s. Just make sure you cat them in the same order for both read1 and read2. Note that even if the fastqs are gzipped, you can use cat.
  • You could run the Drop-seq_alignment.sh workflow independently on each fastq pair, and then merge the aligned BAMs, using Picard MergeSamFiles or samtools merge.

The first approach is simpler. The second approach allows you to do a lot of processing in parallel, which depending on your compute environment might be faster.

Regards, Alec

@mschilli87
Copy link
Contributor

The 2nd approach also has the advantage that when you add more samples later you don't have to re-run all the processing for the previous samples but just for the new data and merge the results.

@shishuo16
Copy link
Author

Got it, thank you!

@shishuo16
Copy link
Author

shishuo16 commented Apr 12, 2024

@alecw
Hi, I used your first suggestion (cat fastqs), however when I run Drop-seq_alignment.sh, it went to wrong
`
INFO 2024-04-11 13:11:54 TagBamWithReadSequenceExtended

********** NOTE: Picard's command line syntax is changing.


********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


********** The command line looks like this in the new syntax:


********** TagBamWithReadSequenceExtended -SUMMARY /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align//unaligned_tagged_Cellular.bam_summary.txt -BASE_RANGE 1-12 -BASE_QUALITY 10 -BARCODED_READ 1 -DISCARD_READ false -TAG_NAME XC -NUM_BASES_BELOW_QUALITY 1 -INPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam -OUTPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp//unaligned_tagged_Cell.bam


13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
[Thu Apr 11 13:11:54 CST 2024] TagBamWithReadSequenceExtended INPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam OUTPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp/unaligned_tagged_Cell.bam SUMMARY=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/unaligned_tagged_Cellular.bam_summary.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=10 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false TAG_QUALITY=XQ VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 11 13:11:54 CST 2024] Executing as dell@dell-PowerEdge-T550 on Linux 6.5.0-18-generic amd64; OpenJDK 64-Bit Server VM 11.0.13+7-b1751.21; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.5.3(e21ba24_1679595148)
13:11:54 [main] WARN com.intel.gkl.compression.IntelInflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
13:11:54 [main] WARN com.intel.gkl.compression.IntelDeflaterFactory - Intel Deflater not supported, using Java.util.zip.Deflater
ERROR 2024-04-11 13:11:54 TagBamWithReadSequenceExtended Reads not properly paired! R1: 1 R2: 1
`

I wonder is it because the two fastq1 have same reads name ?

head of the first fastq1 file:
@1
CTCCCNAGAGGAATCCTTCG
+
AAAA/#/EE6AEEE<E/E/E
@2
CGGGTNTGGCGCCGGGCAAG
+
AAA/A#EEEEEEEEEEEEEE
@3
TGCCGNACGCATCTTGGGTG

head of the second fastq1 file:
@1
NAAGGACCTATCGGGCGCTACTGGG
+
#FFFFFFFFFFFFFF:FFFFF::::
@2
NTTCGTACACTTGCGGCGGAGTGAC
+
#FFFFFFFFFFFFFFFFFFFFF:,,
@3
NTGCCTGTTCACCGCAAGATGGTTG

@alecw
Copy link
Collaborator

alecw commented Apr 12, 2024

Hi @shishuo16 ,
The integer read names are suspicious. These are not normal Illumina FASTQs. Where did the reads come from. Did pipseeker produce these?

Regards, Alec

@shishuo16
Copy link
Author

Hi @alecw
I use prefetch command download the sra file from NCBI (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA612966&o=acc_s%3Aa), And then use "fastq-dump.3 ./fastq/${line}/${line}.sra --split-files --origfmt --defline-qual '+' --gzip -O ./fastq/" command to transform them from sra to fastq, and I got the fastq files

@alecw
Copy link
Collaborator

alecw commented Apr 12, 2024

If your multiple runs have the same read names, you can't cat them like I suggested originally. If these are from the same 10X reaction, which I'm assuming they are since you chose to cat them, you'll need to do something to disambiguate the read names across pairs of fastqs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants