CCS Header File #21

Open
ddubocan opened this issue Apr 2, 2020 · 6 comments

ddubocan commented Apr 2, 2020

Hi,

Is there documentation on how the CCS header file required for scallop-lr needs to be formatted?

@shaomingfu
Collaborator

Hi Danilo,

We did not define our own CCS header file format. The header file used by Scallop-LR is obtained by concatenating the header lines (i.e., the lines starting with >) of the full-length and non-full-length .fasta files. (Please see an example of such a header file here: https://github.com/Kingsford-Group/scallop/tree/isoseq). These full-length and non-full-length .fasta files (usually named isoseq_flnc.fasta and isoseq_nfl.fasta) are produced by running the PacBio SMRT Link software.
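As a rough illustration of the concatenation described above, the sketch below builds a header file directly from the two classified files. The two tiny FASTA files and their header text are made-up stand-ins created only so the example runs on its own; real SMRT Link headers look different. The output name ccsread_info follows the name used later in this thread.

```shell
#!/bin/sh
# Sketch only: collect the header lines of both classified FASTA files
# into a single CCS header file. The input files below are made-up
# stand-ins, not real SMRT Link output.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data (headers are placeholders).
printf '>read1 made-up-header\nACGT\n>read2 made-up-header\nGGCC\n' > isoseq_flnc.fasta
printf '>read3 made-up-header\nTTAA\n' > isoseq_nfl.fasta

# Concatenate both files and keep only the lines that start with ">".
cat isoseq_flnc.fasta isoseq_nfl.fasta | grep '^>' > ccsread_info

# Show the resulting header file.
cat ccsread_info
```

Anchoring the pattern with ^ is a small precaution so that only lines beginning with > are kept; for a well-formed FASTA file it should give the same result as an unanchored match.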

Best,
Mingfu


lauraht commented Apr 8, 2020

Hi Danilo,

In our GitHub repository “Long-Read Transcript Assembly Analysis” (https://github.com/Kingsford-Group/lrassemblyanalysis), we provide scripts to generate classified CCS reads (full-length and non-full-length CCS reads) from PacBio raw reads, as well as a script that automatically generates the CCS header file and runs Scallop-LR. Please refer to the “Analyze a BioSample-based Dataset with Iso-Seq Analysis, Scallop-LR, and StringTie” section of that repository's README for detailed descriptions of how to run these scripts.

Among these scripts, biosample_isoseq.sh performs the full Iso-Seq analysis and generates the classified CCS reads (full-length and non-full-length CCS reads) along the way. The script post_isoseq_analysis.sh performs post-analysis and evaluation for Iso-Seq; one of its outputs is flnc_and_nfl.fasta, which contains the classified CCS reads. The script minimap2_scallop_isoseq_allreads_pipeline.sh aligns the classified CCS reads to the reference genome using Minimap2 and runs Scallop-LR, generating the CCS header file (ccsread_info) along the way.
If you run our provided scripts to perform the full Iso-Seq and Scallop-LR analyses, you do not need to worry about these details (e.g., flnc_and_nfl.fasta, ccsread_info); the whole process is automated.

However, if you start with existing classified CCS reads and would like to run Scallop-LR on them, you can use the following command to create the CCS header file (and then pass the CCS header file and the alignments of the classified CCS reads to Scallop-LR):
cat flnc_and_nfl.fasta | grep ">" > ccsread_info
where flnc_and_nfl.fasta is your classified CCS reads FASTA file, and ccsread_info is the output CCS header file, which contains the header lines of all reads in flnc_and_nfl.fasta.
Note that your CCS reads need to be classified in this case. In other words, your CCS reads need to be run through the Classify tool in PacBio SMRT Link, and the classified CCS reads FASTA file (flnc_and_nfl.fasta) should be the concatenation of the full-length CCS reads (isoseq_flnc.fasta) and the non-full-length CCS reads (isoseq_nfl.fasta) output by the Classify tool. A detailed description of how to run the Classify tool can be found by typing the following (after installing SMRT Tools v5.1.0):
pbtranscript classify --help
Here, you only need the --flnc=isoseq_flnc.fasta and --nfl=isoseq_nfl.fasta options, which specify the output files for full-length CCS reads and non-full-length CCS reads, respectively.
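One possible sanity check after creating the header file, sketched here under assumptions: the number of lines in ccsread_info should equal the number of records in flnc_and_nfl.fasta. The FASTA content is a made-up stand-in so the example is self-contained, and the variable names (n_records, n_headers, status) are illustrative only.

```shell
#!/bin/sh
# Sketch only: verify that the CCS header file has one line per FASTA record.
# The FASTA content below is a made-up stand-in for real classified reads.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data (headers are placeholders).
printf '>read1 made-up-header\nACGT\n>read2 made-up-header\nGGCC\n>read3 made-up-header\nTTAA\n' > flnc_and_nfl.fasta

# Header-file command as given in this thread.
cat flnc_and_nfl.fasta | grep ">" > ccsread_info

# Count records (lines starting with ">") and lines in the header file.
n_records=$(grep -c '^>' flnc_and_nfl.fasta)
n_headers=$(wc -l < ccsread_info | tr -d ' ')

# The two counts should agree if every header line was captured exactly once.
if [ "$n_records" -eq "$n_headers" ]; then
    status="ok"
else
    status="mismatch"
fi
echo "records=$n_records headers=$n_headers status=$status"
```

A mismatch would suggest that some non-header line contains a > character or that the FASTA file is malformed; in that case, inspecting the offending lines before running Scallop-LR is probably worthwhile.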

Best,
Laura


fh-zju commented Jul 24, 2020

Hi Mingfu and Laura,
I am trying to run Scallop-LR following the Iso-Seq analysis of SMRT Link v9, but I am having trouble locating the files that contain the CCS header information. Any ideas? Should I try an earlier version of SMRT Link, or is there any way to generate a CCS header file from the v9 output files?

Best,
Feng


lauraht commented Jul 25, 2020

Hi Feng,

I would recommend using SMRT Link v5.1.0.
I think SMRT Link v9 no longer saves non-full-length reads in its output.

Best,
Laura

@Huangyizhong

Hi there,
I also want to use Scallop-LR to assemble PacBio data. As you mentioned above, I would use SMRT Link v5.1.0 for the preceding steps, but I cannot find a release of SMRT Link v5.1.0. Would you mind providing a download site for it? Another question: how does Scallop-LR deal with primers and concatemer reads to obtain the final isoforms?
Sincerely,
Yizhong Huang


lauraht commented Sep 25, 2021

Hi Yizhong,

I looked at the link where we originally downloaded SMRT Link v5.1.0, but it looks like PacBio has removed it. Unfortunately, I could not find a current download site for SMRT Link v5.1.0 by searching the internet. I found the PacBio SMRT Link v5.1.0 Archives (https://www.pacb.com/asset_tags/smrt-link-v5-1-0/), which do not seem to contain a download link; however, there is a “Contact Us” (or “Ask a Question”) button leading to a short form that you can submit to ask PacBio where to download SMRT Link v5.1.0. I also think SMRT Link v6.0.0 saves non-full-length reads in its output.

Regarding primers: we use the Classify tool in SMRT Link (v5.1.0) to obtain full-length and non-full-length CCS reads, and the Classify tool also removes primers from the reads during classification. So the full-length and non-full-length CCS reads given as input to Scallop-LR no longer contain primers. Regarding concatemers: the Classify tool further classifies full-length reads into artificial-concatemer chimeric reads and non-chimeric reads, and it outputs only the full-length non-artificial-concatemer reads. So the CCS reads given as input to Scallop-LR are non-artificial-concatemer reads.

Best,
Laura
