Make parser more comprehensive #93

pjvandehaar · 2017-06-09T20:34:15Z

Parsers to look at:

LDSC
Daniel's Harmonize
GWASS
- checks for strand-ambiguous alleles
Py/R LZ here?
METAL docs

It'd be great to make this a stand-alone tool, parse-assoc --num-samples=100 --chr=CONTIG --pos=BP ... <assoc_file>
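A rough sketch of what that command line could look like with `argparse`. Only the flag names `--num-samples`, `--chr`, `--pos` and the positional `<assoc_file>` come from the line above; the `dest` names, help text, and the function name `build_parser` are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical stand-alone `parse-assoc` tool."""
    parser = argparse.ArgumentParser(
        prog="parse-assoc",
        description="Parse one association file (sketch only).",
    )
    # Constant per-analysis value supplied by the user.
    parser.add_argument("--num-samples", type=int, default=None,
                        help="number of samples in the analysis")
    # Names of the input columns that hold chromosome and position.
    parser.add_argument("--chr", dest="chrom_col", default=None,
                        help="input column holding the chromosome")
    parser.add_argument("--pos", dest="pos_col", default=None,
                        help="input column holding the position")
    parser.add_argument("assoc_file", help="path to the association file")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Any column flag left unset would presumably fall back to auto-detection, which is the "figure out columns" step below.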

Steps:

Figure out columns, but consider ref/alt to just be two allele columns.
If there are two allele columns, check them against hg18, hg19, and hg38. If one is consistently ref, make it ref. If neither is consistent, then what? Make ref, alt, risk_allele? That'll make PheWAS a pain; it'd be nicer to just invert OR/beta right away to be ref-relative. If it's on a build we don't like, liftover to whatever the standard is. Watch out for negative-strand SNPs and indels!
- if there's an ambiguous allele, options:
  - drop the allele
  - drop the effect size (but keep pval, &c)
  - tag it as "strand-ambiguous" (how? where does this data go? "notes"/"warning" column?)
If (rsid, chr, pos, ref, alt) all exist, just drop rsid and redo it? If rsid exists but something else is missing, use dbSNP to remake whatever's missing and check that it matches? But with which build? This seems like a pain...
sort by (chrom, pos, ref, alt)
parse other fields, checking their types &c. maybe auto-compute MAF (from AC+num_samples, genoct, AF) and beta (from OR)?
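A minimal sketch of a few of the steps above, for SNPs only. It assumes that `a2` is the effect allele, that `beta` is reported relative to `a2`, and that the reference base at the position has already been looked up; none of that is settled in this issue. All function names are hypothetical.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, where the alleles are each other's complement."""
    return len(a1) == 1 and len(a2) == 1 and COMPLEMENT.get(a1) == a2


def orient_to_ref(a1, a2, ref_base, beta):
    """Return (ref, alt, beta) with beta expressed for the alt allele.

    Assumes a2 is the effect allele. Returns None when neither allele
    matches the reference base (wrong build, negative strand, ...).
    """
    if a1 == ref_base:
        return (a1, a2, beta)   # a2 is already alt, so beta is unchanged
    if a2 == ref_base:
        return (a2, a1, -beta)  # effect allele was ref, so negate beta
    return None


def beta_from_or(odds_ratio):
    """beta is the natural log of the odds ratio."""
    return math.log(odds_ratio)


def maf_from_af(af):
    """Minor allele frequency from an allele frequency."""
    return min(af, 1.0 - af)
```

Note that for a strand-ambiguous SNP, matching the reference does not prove which strand the file is on, so `orient_to_ref` alone can't be trusted there; that is where the drop/tag options above come in.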

pjvandehaar · 2017-07-13T02:51:38Z

(see #77)

Make pheweb parse-one -c parse-config.json <assoc-files...>
- parse-config.json specifies everything needed. eg,
```
{
 format: {type:'csv', delimiter: '\t'},
 header: {ignore_leading: '#'},
 ignore_lines: {starting_with: '#'},
 build: 'hg19',
 check_against_reference: true,
 conversions: [
  {in: 'CHR', convert: 'chrom', out: 'chrom'},
  {in: 'BP', convert: 'int', out: 'pos'},
  {in: 'MARKER_ID', regex: '^([0-9XYMT]+)_([1-9][0-9]*)_([ACGT]+)_([ACGT]+)', out: [{out:'chrom', convert:'chrom'}, {out:'pos', convert:'int'}, {out:'ref'}, {out:'alt'}]},
  {in: 'P.VALUE', convert: 'float', sigfigs: 2, out: 'pval', skip_if: ['NA', '.', '']},
  {in: 'freq', convert: 'float', sigfigs: 2, out: 'af'},
  ...
 ],
 constraints: {
  pos: {ge: 0},
  pval: {ge: 0, le: 1},
  af: {gt: 0, lt: 1},
  or: {gt: 0}
 }
}
```
  - if ref vs alt aren't known, use a1 and a2.
    - if we know that we're on the + strand (ie, that either a1 or a2 matches the reference), set {strand: '+'}, which allows converting from [a1, a2] -> [ref, alt]. And maybe it has to take the reciprocal of the odds ratio and negate beta. Rather than being assumed, we should encode what happens when a1, or a2, or neither, or we-don't-know-which matches ref.
  - allow {convert_rsid_to: ['chrom', 'pos', 'ref', 'alt']} (allow any subset) which downloads dbSNP. (see Add a script to add chr-pos-ref-alt to input files with only rsid #66)
  - if one out is produced multiple times, assert that they agree.
  - maybe allow specifying {reader: 'pandas'}?
  - maybe make {sort_variants: true}, which will force sorting. if false or missing, sort-order will still be checked, and unsorted input will throw PheWebUnsortedAssocFile. (see pheweb parse should automatically sort variants if they aren't sorted. #71)
  - should we make ref/alt optional? if it's missing, chrom-pos will have to be unique, and lots of code will need changes. worse yet would be a1/a2. worry about that later.
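Roughly how `conversions` and `constraints` from a config like the one above might be applied to a single row, including the "assert that they agree" rule for an `out` produced more than once. This is a sketch under assumptions: the config is already loaded as a Python dict, the `chrom` converter only strips a leading `chr`, `sigfigs` is ignored, and every function name here is hypothetical.

```python
import operator
import re

# Hypothetical converters for the `convert` names used in the config.
CONVERTERS = {
    "int": int,
    "float": float,
    "chrom": lambda s: s[3:] if s.lower().startswith("chr") else s,
}
OPS = {"ge": operator.ge, "gt": operator.gt, "le": operator.le, "lt": operator.lt}


def set_field(out, name, value):
    """Store a value, asserting agreement if the field was already produced."""
    if name in out and out[name] != value:
        raise ValueError(f"{name}: {out[name]!r} disagrees with {value!r}")
    out[name] = value


def apply_conversions(row, conversions):
    """Turn one input row (dict of strings) into a dict of parsed fields."""
    out = {}
    for conv in conversions:
        raw = row.get(conv["in"])
        if raw is None or raw in conv.get("skip_if", ()):
            continue
        if "regex" in conv:
            match = re.match(conv["regex"], raw)
            if match is None:
                raise ValueError(f"{conv['in']}: {raw!r} does not match regex")
            # Each capture group feeds the matching entry of the `out` list.
            for target, text in zip(conv["out"], match.groups()):
                convert = CONVERTERS.get(target.get("convert"), str)
                set_field(out, target["out"], convert(text))
        else:
            convert = CONVERTERS.get(conv.get("convert"), str)
            set_field(out, conv["out"], convert(raw))
    return out


def check_constraints(out, constraints):
    """Raise ValueError if any parsed field violates its bounds."""
    for field, rules in constraints.items():
        if field not in out:
            continue
        for op_name, bound in rules.items():
            if not OPS[op_name](out[field], bound):
                raise ValueError(f"{field}={out[field]!r} fails {op_name} {bound}")
```

With this shape, a row whose `BP` column and `MARKER_ID` give different positions fails loudly instead of silently keeping whichever came last.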
make tests for pheweb parse-one
make pheweb guess-format-one -f fields-config.json <assoc-files...>, which produces a parse-config.json.
- fields-config.json is somewhat like conf.parse currently is.
- this should also try to extract constant per-pheno (ie, per-analysis) fields like num_samples.
- I'd really prefer if parse-config.json allowed comments. maybe switch to json5.
once pheweb guess-format-one works well, use it to make an interactive (either through the terminal, a text file, or a web browser) pheweb quickstart.
- figure out how to integrate fields from categories.xlsx-style files.
- figure out how to deal with [per-analysis, per-variant, per-association] fields. ie, the parser needs to understand per-analysis fields to pull them out, but it can treat [per-variant, per-association] fields the same. then downstream I need to check that per-variant fields (eg, [rsid, nearest-gene, indel/snp]) really do agree between different analyses. can that be done automatically somehow? or, could we treat all fields in association files as either per-analysis or per-association, and only treat annotations as per-variant?

pjvandehaar mentioned this issue Jul 14, 2017

Improve the data-loading experience #77

Closed

pjvandehaar closed this as completed Feb 16, 2021
