-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bed file option for Ordered-Histgrowth #55
Comments
Hi Taek, for this plot all 24 chromosomes where taken into account. As described in the example, only the fraction of the graph were quantified that are not touched by the reference genome GRCh38. In other words, each node through which the reference genome traverses, is ignored. The subset option was used to quantify only HPRC haplotypes (this excludes CHM13). If you want to know what genomic regions contributed to the plot, take the graph, subtract all GRCh38 paths and extract all remaining nodes that are covered by HG* and NA* paths. |
Thank you for your reply. I followed these steps. Step 2: Establish the exclude of samples in the growth statistics Step 3: Exclude paths from reference genome CHM13 and GRCH38 Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes Step 5: Visualize coverage histogram and pangenome growth curve with estimated growth parameters In Step 4, the file *pggb.ng0.ordhstgrw.bp.tsv will be used to generate the graph. Attached is the TSV file from chr1 in my trial. As you can see, even a single chromosome (chr1) already shows a size comparable to the total chromosome size described in the Panacus example. I’ve cross-checked the steps and parameters with the manual in the provided link. Did I miss anything? Regarding genomic region contributions, I’m not fully clear on your explanation. When you say "take the graph," do you mean the PGGB graph GFA file? And how exactly should I subtract or extract regions from it? Could you please share any relevant steps, scripts, or manuals I could follow? Looking forward to your guidance. Kind regards, Taek |
Hi again, I have also tried using the same approach explained in the manual (https://github.com/marschall-lab/panacus/blob/main/examples/ordered_pangenome_growth_statistics.md). Step 4 (followed manual): panacus histgrowth to calculate coverage and pangenome growth for nodes Eventually, the -e ${OUTPUT_DIR}/${BASEFILE}.pggb_hprc.exculde.samples.txt and -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt generated the same TSV file and PDF from chr1 in my trial. Both approaches generated an abnormally large chromosome size even with a single chromosome (chr1). I’ve tried both v0.2.4 from Bioconda and the binary release from GitHub. It seems likely that Step 4 didn’t generate the correct values for the TSV file. I have also tried the -s parameter using a bed file from CHM13v2. Step 4 (bed file) Looking forward to your insights. Kind regards, Take |
Hi Taek, the differing base pair counts are explained by the differing graphs. The example was generated using the Minigraph-Cactus graph, while your results are from the PGGB graph. The Minigraph-Cactus graph does not contain any heterochromatic regions, so it already starts with fewer bases than the PGGB graph. The PGGB graph does include these regions but the sequences in these regions are seriously under-aligned meaning that we will have lots of bases in the graph where there would be only one when using a single sequence. The size in bps in the last line is not the size of the chromosome but instead the size of the graph of the chromosome, which can be in reality quite different from each other. For example, the human chr1 has a length of 248,387,328 bp (CHM13) while the PGGB graph for it contains 1,117,392,094 bps. With regards of extracting the non-GRCh38 parts of the graph: Unfortunately, there are no tools available (at least to my knowledge) that can do this for you. My approach would be to take all the nodes in the GRCh38 path and removing them from all the other paths manually (e.g. using a bash/python script). Best regards |
Thank you for your reply all the time! So, if the difference between Minigraph-Cactus and PGGB is accepted, your explanation confirms that the TSV and graph generated from PGGB are correct. Is that right? Additionally, regarding the chr1 example: A GFA file from PGGB with 56 samples (112 haplotypes) excluding CHM13 and GRCh38 was used in Panacus. If I want to retrieve or calculate the added sequence length for chr1 from the PGGB graph, which file or information should I refer to? Kind regards, Taek |
Yes, the tsv and the plots for the PGGB graph should be correct. About the second question: What do you mean with added sequence length? The number of base pairs that the non-GRCh38/CHM13 sequences add to the graph? Best regards |
That’s correct. In Steps 2 and 3, it should reflect the number of base pairs added to the graph by non-GRCh38/CHM13 sequences. |
This should be in the last line of the .tsv file generated in Step 4. Using the Best regards |
Thanks. Based on your explanation and my current PGGB trial, the total accumulated base pairs added by non-GRCh38/CHM13 sequences (with the -l 1,2,4,49 option) would be 1,376,115,016 from HG005: Or is it just 1,086,959,399 from the first -l 1 information? Cheers, Taek |
It is just the first number, e.g. 1,086,959,399. The other numbers apply to the other coverages/quora respectively. So for example, if you would like to only take into account nodes that appear at least in two paths (because they might be errors, etc.), you would take the second value (in this case) 224,525,351 (since it is for a coverage of 2). Best regards |
Thank you for your reply and clarification. Based on the manual and my trials, I can generate the Panacus growth plot for specific assemblies using Steps 1, 2, and 3. However, I’m wondering if there’s a way to extract only specific assemblies from the PGGB-generated GFA file. For example, is it possible to extract just the paths covered by ^HH* and ^NA* to make a subset of a GFA file? Kind regards, Taek |
Yes, it is possible. This could be done manually or by creating a list of all the paths one wants to keep and using trim-graph on it. This will remove all node/edges not covered by the paths in the list. After that you might want to run odgi unchop on the graph to merge all the nodes that can now be merged since they are not separated anymore. Best regards |
Thank you so much for your help! I have another question regarding the use of bed files in Panacus. FYI, Step 4 (bed file) However, the generated TSV file contains only “0” values. Looking forward to your insights. Kind regards, Taek |
Yes, there is definitely an issue there. Best regards |
This is the bed file I used. chr1 6046 13941 transcript_id . - cheers, Taek |
Thank you for the file. Okay, so the problem is that panacus cannot match the "chr1" in the first field of the bed file to any path in the graph. If you replace the "chr1" everywhere in the bed file with "chm13#chr1" (name of the corresponding path in the graph) it should work. However, this will give you a hist/growth file having only two lines, since the graph is then subsetted to only have the chm13 path. If you wanted to subset all paths correspondingly you would have to liftover the annotations to other paths and include all of them in the bed file. Best regards |
Thank you for your reply. For the single merged subset bed file, I created a dummy bed file (see below) to test the -s parameter, and it worked. chm13#chr1 6046 13941 transcript_id . - Hope this is the correct approach. cheers, Taek |
Yes, this should be the correct approach. Best regards |
Happy New Year! I have another burning question. |
Happy new year to you too! I'm not 100% sure I understand the question completely, but let me try to explain the ordered growth. Panacus itself has no concept of insertions/deletions, it just works on the graph itself. So everything that is not included in the first path in a ordered growth plot counts towards the growth of the graph. Since deletions are represented not by additional nodes, but by additional edges, they are only represented in growth curves that look at the growth in edges. Panacus also has no concept of a reference, so it can be used with reference-free graphs. In the ordered histgrowth plot only the order is relevant. Best regards |
Hi again,
I have a few more burning questions:
-s, --subset Produce counts by subsetting the graph to a given list of paths (1-column list) or path coordinates (3- or 12-column BED file). If the "order" option is not used, the subset list will also indicate the order of
paths/groups in the histogram. [default: ]
-e, --exclude Exclude bp/node/edge in growth count that intersect with paths (1-column list) or path coordinates (3- or 12-column BED-file) provided by the given file [default: ]
If I want to identify which genomic regions contribute to graph growth, is there a way to investigate this via Panacus using CHM13’s BED file with the -s and -e options?
Looking forward to your insights!
Cheers,
Taek
The text was updated successfully, but these errors were encountered: