Make parser more comprehensive #93

Closed
pjvandehaar opened this issue Jun 9, 2017 · 1 comment

pjvandehaar commented Jun 9, 2017

Parsers to look at:

It'd be great to make this a stand-alone tool, parse-assoc --num-samples=100 --chr=CONTIG --pos=BP ... <assoc_file>

Steps:

  1. Figure out columns, but consider ref/alt to just be two allele columns.
  2. If there are two allele columns, check them against hg18, hg19, and hg38. If one is consistently ref, make it ref. If neither is consistent, then what? Make ref, alt, and risk_allele columns? That would make PheWAS a pain; it'd be nicer to flip OR/beta right away (1/OR, -beta) so they're ref-relative. If it's on a build we don't like, liftover to whatever the standard is. Watch out for negative-strand SNPs and indels!
    • if there's an ambiguous allele, options:
      • drop the allele
      • drop the effect size (but keep pval, &c)
      • tag it as "strand-ambiguous" (how? where does this data go? "notes"/"warning" column?)
  3. If (rsid, chr, pos, ref, alt) all exist, just drop rsid and redo it? If rsid exists but something else is missing, use dbSNP to remake whatever's missing and check that it matches? But with which build? This seems like a pain...
  4. sort by (chrom, pos, ref, alt)
  5. parse other fields, checking their types &c. maybe auto-compute MAF (from AC+num_samples, genoct, AF) and beta (from OR)?
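A rough sketch of steps 2 and 5 for SNPs, to make the cases concrete. Everything here is an assumption for illustration: the function names, the note strings, the convention that the input beta is the effect of a2, and the choice to drop the effect size for strand-ambiguous alleles (one of the three options listed above). `ref_base` stands for whatever base a reference lookup returns at that position; indels and liftover are not handled.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_strand_ambiguous(a1, a2):
    """A/T and C/G SNPs look the same on both strands, so strand can't be inferred."""
    return COMPLEMENT.get(a1) == a2


def orient_to_ref(a1, a2, ref_base, beta):
    """Return (ref, alt, beta, note), with beta expressed relative to alt.

    Assumptions: single-base alleles only, and the input beta is the effect of a2.
    """
    if a1 not in COMPLEMENT or a2 not in COMPLEMENT:
        return a1, a2, None, "not-a-snp"
    if is_strand_ambiguous(a1, a2):
        # We can name ref/alt but not the effect direction, so drop the effect size.
        if ref_base in (a1, a2):
            return ref_base, COMPLEMENT[ref_base], None, "strand-ambiguous"
        return a1, a2, None, "no-ref-match"
    if a1 == ref_base:
        return a1, a2, beta, ""
    if a2 == ref_base:
        return a2, a1, -beta, "flipped"
    # Neither allele matches: try the negative strand.
    c1, c2 = COMPLEMENT[a1], COMPLEMENT[a2]
    if c1 == ref_base:
        return c1, c2, beta, "complemented"
    if c2 == ref_base:
        return c2, c1, -beta, "complemented+flipped"
    return a1, a2, None, "no-ref-match"


def beta_from_or(odds_ratio):
    """beta is the natural log of the odds ratio."""
    return math.log(odds_ratio)


def maf_from_ac(allele_count, num_samples):
    """Minor allele frequency from an allele count, assuming diploid samples."""
    af = allele_count / (2 * num_samples)
    return min(af, 1 - af)
```

The note string is one possible answer to the "where does this data go?" question: it could be written to a "notes"/"warning" column.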

pjvandehaar commented Jul 13, 2017

(see #77)

  • Make pheweb parse-one -c parse-config.json <assoc-files...>

    • parse-config.json specifies everything needed. eg,
      {
       format: {type:'csv', delimiter: '\t'},
       header: {ignore_leading: '#'},
       ignore_lines: {starting_with: '#'},
       build: 'hg19',
       check_against_reference: true,
       conversions: [
        {in: 'CHR', convert: 'chrom', out: 'chrom'},
        {in: 'BP', convert: 'int', out: 'pos'},
        {in: 'MARKER_ID', regex: '^([0-9XYMT]+)_([1-9][0-9]*)_([ACGT]+)_([ACGT]+)', out: [{out:'chrom', convert:'chrom'}, {out:'pos', convert:'int'}, {out:'ref'}, {out:'alt'}]},
        {in: 'P.VALUE', convert: 'float', sigfigs: 2, out: 'pval', skip_if: ['NA', '.', '']},
        {in: 'freq', convert: 'float', sigfigs: 2, out: 'af'},
        ...
       ],
       constraints: {
        pos: {ge: 0},
        pval: {ge: 0, le: 1},
        af: {gt: 0, lt: 1},
        or: {gt: 0}
       }
      }
      
      • if ref vs alt aren't known, use a1 and a2.
        • if we know that we're on the + strand (ie, that either a1 or a2 matches the reference), set {strand: '+'}, which allows converting from [a1, a2] -> [ref, alt]. That conversion may also have to take the reciprocal of the odds ratio and negate beta. Rather than assuming, the config should encode what happens when a1, a2, neither, or we-don't-know-which matches ref.
      • allow {convert_rsid_to: ['chrom', 'pos', 'ref', 'alt']} (allow any subset) which downloads dbSNP. (see Add a script to add chr-pos-ref-alt to input files with only rsid #66)
      • if one out is produced multiple times, assert that they agree.
      • maybe allow specifying {reader: 'pandas'}?
      • maybe make {sort_variants: true}, which will force sorting. if false or missing, sort-order will still be checked, and unsorted input will throw PheWebUnsortedAssocFile. (see pheweb parse should automatically sort variants if they aren't sorted. #71)
      • should we make ref/alt optional? if it's missing, chrom-pos will have to be unique, and lots of code will need changes. worse yet would be a1/a2. worry about that later.
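To make the config semantics concrete, here is a minimal sketch of how the `conversions` and `constraints` sections might be applied to one row. None of this is existing pheweb code: `apply_conversions`, `check_constraints`, and the `CONVERTERS` table are hypothetical, the `chrom` converter just strips a leading "chr", and `sigfigs` is ignored. It does implement the "if one out is produced multiple times, assert that they agree" rule.

```python
import operator
import re

# Hypothetical converter table; 'chrom' only strips a leading "chr".
CONVERTERS = {
    "int": int,
    "float": float,
    "chrom": lambda s: s[3:] if s.lower().startswith("chr") else s,
}
COMPARATORS = {"ge": operator.ge, "gt": operator.gt,
               "le": operator.le, "lt": operator.lt}


def apply_conversions(row, conversions):
    """Turn one raw row (dict of strings) into a dict of converted output fields."""
    out = {}

    def emit(name, value):
        # If an output is produced more than once, the values must agree.
        if name in out and out[name] != value:
            raise ValueError(f"conflicting values for {name!r}: {out[name]!r} vs {value!r}")
        out[name] = value

    for conv in conversions:
        raw = row.get(conv["in"])
        if raw is None or raw in conv.get("skip_if", []):
            continue
        if "regex" in conv:
            match = re.match(conv["regex"], raw)
            if match is None:
                raise ValueError(f"{conv['in']}={raw!r} does not match {conv['regex']!r}")
            # Each capture group feeds one entry of the 'out' list, in order.
            for target, piece in zip(conv["out"], match.groups()):
                convert = CONVERTERS.get(target.get("convert"), str)
                emit(target["out"], convert(piece))
        else:
            convert = CONVERTERS.get(conv.get("convert"), str)
            emit(conv["out"], convert(raw))
    return out


def check_constraints(variant, constraints):
    """Raise ValueError if any present field violates its ge/gt/le/lt bounds."""
    for field, rules in constraints.items():
        if field not in variant:
            continue
        for op_name, bound in rules.items():
            if not COMPARATORS[op_name](variant[field], bound):
                raise ValueError(f"{field}={variant[field]!r} violates {op_name} {bound}")
```

With a row like `{"CHR": "chr1", "BP": "12345", "MARKER_ID": "1_12345_A_G"}`, both the CHR/BP columns and the MARKER_ID regex produce `chrom` and `pos`, and a mismatch between them raises.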
  • make tests for pheweb parse-one
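One behavior those tests could cover is the `sort_variants` rule: sort when asked, otherwise check the order and fail on unsorted input. A sketch under assumptions: the exception class below is a stand-in named after `PheWebUnsortedAssocFile` (not the real definition), `ordered_variants` is a hypothetical helper, and the chromosome order (1-22, X, Y, MT) is a guess at what the standard order would be.

```python
class PheWebUnsortedAssocFile(Exception):
    """Stand-in for the exception named above; the real class may differ."""


# Assumed chromosome ordering: 1..22, then X, Y, MT.
CHROM_ORDER = {c: i for i, c in enumerate([str(n) for n in range(1, 23)] + ["X", "Y", "MT"])}


def sort_key(variant):
    return (CHROM_ORDER[variant["chrom"]], variant["pos"], variant["ref"], variant["alt"])


def ordered_variants(variants, sort_variants=False):
    """Yield variants in (chrom, pos, ref, alt) order.

    With sort_variants=True the input is sorted in memory; otherwise the
    order is only checked, and unsorted input raises.
    """
    if sort_variants:
        yield from sorted(variants, key=sort_key)
        return
    prev = None
    for variant in variants:
        key = sort_key(variant)
        if prev is not None and key < prev:
            raise PheWebUnsortedAssocFile(f"{key} comes after {prev}")
        prev = key
        yield variant
```

Checking is streaming and cheap, while sorting has to hold the whole file in memory here, which is one reason to keep it opt-in.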

  • make pheweb guess-format-one -f fields-config.json <assoc-files...>, which produces a parse-config.json.

    • fields-config.json is somewhat like conf.parse currently is.
    • this should also try to extract constant per-pheno (ie, per-analysis) fields like num_samples.
    • I'd really prefer if parse-config.json allowed comments. maybe switch to json5.
  • once pheweb guess-format-one works well, use it to make an interactive (either through the terminal, a text file, or a web browser) pheweb quickstart.

    • figure out how to integrate fields from categories.xlsx-style files.
    • figure out how to deal with [per-analysis, per-variant, per-association] fields. ie, the parser needs to understand per-analysis fields to pull them out, but it can treat [per-variant, per-association] fields the same. then downstream I need to check that per-variant fields (eg, [rsid, nearest-gene, indel/snp]) really do agree between different analyses. can that be done automatically somehow? or, could we treat all fields in association files as either per-analysis or per-association, and only treat annotations as per-variant?
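On the "can that be done automatically?" question, one possible check is sketched below: collect each per-variant field by (chrom, pos, ref, alt) across analyses and report any disagreement. The function name, the input shape (a mapping from analysis name to parsed rows), and the conflict tuple layout are all assumptions for illustration.

```python
from collections import defaultdict


def find_per_variant_conflicts(analyses, per_variant_fields):
    """Report per-variant fields whose values differ between analyses.

    analyses: mapping of analysis name -> iterable of row dicts that each
    have chrom, pos, ref, alt plus any other fields.
    Returns a list of (variant_key, field, (first_value, first_analysis),
    (other_value, other_analysis)) tuples.
    """
    seen = defaultdict(dict)  # variant key -> field -> (value, analysis name)
    conflicts = []
    for name, rows in analyses.items():
        for row in rows:
            key = (row["chrom"], row["pos"], row["ref"], row["alt"])
            for field in per_variant_fields:
                if field not in row:
                    continue
                first = seen[key].setdefault(field, (row[field], name))
                if first[0] != row[field]:
                    conflicts.append((key, field, first, (row[field], name)))
    return conflicts
```

This keeps every variant in memory, so for real data it would probably need to run as a merge over files that are already sorted.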
