Implement OWL parsing with iterparse #2

Open
bgyori opened this issue Jun 15, 2020 · 3 comments

Comments

@bgyori
Member

bgyori commented Jun 15, 2020

Parsing with iterparse would work better for some very large OWL files, since the whole document would not have to be held in memory at once.
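A minimal sketch of what streaming parsing could look like, using only the standard library's `xml.etree.ElementTree.iterparse`. This is not pybiopax's implementation; the function names `iter_biopax_elements` and `count_classes` are hypothetical, and the sketch assumes each BioPAX object is a direct child of the `rdf:RDF` root.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iter_biopax_elements(fileobj):
    """Yield (class name, rdf:ID or rdf:about) for each child of the root.

    Hypothetical helper: assumes every BioPAX object is a direct child
    of the rdf:RDF root element.
    """
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    depth = 0
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:  # a direct child of the root has been fully read
            class_name = elem.tag.rsplit("}", 1)[-1]
            ident = elem.get(f"{{{RDF_NS}}}ID") or elem.get(f"{{{RDF_NS}}}about")
            yield class_name, ident
            root.clear()  # drop processed children so memory stays flat


def count_classes(path):
    """Count objects per BioPAX class in a gzipped OWL file without loading it whole."""
    with gzip.open(path, "rb") as fh:
        return Counter(name for name, _ in iter_biopax_elements(fh))
```

The point of `root.clear()` is that already-processed elements are released as the parse advances, so peak memory depends on the largest single object rather than on the size of the file.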

@cthoyt
Member

cthoyt commented Nov 30, 2021

Not exactly sure what you had in mind, but parsing the PC12 dump on my laptop takes like 30 minutes and melts my lap, so it would be great :)

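from typing import Optional

import pybiopax
import pystow


def ensure_pc_detailed(version: Optional[str] = None, force: bool = False):
    """Download (if needed) and parse the Pathway Commons detailed BioPAX dump."""
    if version is None:
        import bioversions

        # Look up the current Pathway Commons version
        version = bioversions.get_version("pathwaycommons")

    url = f"https://www.pathwaycommons.org/archives/PC2/v{version}/PathwayCommons{version}.Detailed.BIOPAX.owl.gz"
    # pystow caches the file locally; force=True downloads it again
    path = pystow.ensure("bio", "pathwaycommons", version, url=url, force=force)
    return pybiopax.model_from_owl_gz(path)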

@cmungall

One approach here is to load the RDF/XML into a SQLite database, and then operate over triples in the relational database.

There are various ways to do a fast load of RDF/XML into SQLite. For semantic-sql we use rdftab.rs, but we have plans to wrap the Rust in Python (INCATools/semantic-sql#41).

Of course, RDF triples are still quite a low-level way of working with a higher-level representation like OWL or BioPAX instances, but this can be abstracted in a number of ways (YMMV), e.g. views, SQLAlchemy code, or basic Python routines.
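To illustrate the "views" idea, here is a rough sketch using Python's built-in `sqlite3`. The schema is a simplified assumption (a single `statements` table with `subject`, `predicate`, `object`, `value`), not necessarily what rdftab.rs produces; the `biochemical_reaction` view, the `biopax:displayName` predicate, and the helper names are hypothetical. The two demo reaction names are taken from the output quoted later in this thread.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory triple store with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, value TEXT);

        -- Low-level view: one row per rdf:type triple
        CREATE VIEW rdf_type_statement AS
        SELECT subject, object FROM statements WHERE predicate = 'rdf:type';

        -- Higher-level (hypothetical) view: reactions together with their names
        CREATE VIEW biochemical_reaction AS
        SELECT t.subject AS id, n.value AS name
        FROM rdf_type_statement AS t
        JOIN statements AS n
          ON n.subject = t.subject AND n.predicate = 'biopax:displayName'
        WHERE t.object = 'biopax:BiochemicalReaction';
        """
    )
    rows = [
        ("reactome.biopax:BiochemicalReaction1327", "rdf:type", "biopax:BiochemicalReaction", None),
        ("reactome.biopax:BiochemicalReaction1327", "biopax:displayName", None, "Calcium binds calmodulin"),
        ("reactome.biopax:BiochemicalReaction6204", "rdf:type", "biopax:BiochemicalReaction", None),
        ("reactome.biopax:BiochemicalReaction6204", "biopax:displayName", None, "Calmodulin binds CAMK4"),
        # Made-up non-reaction row, to show the view filters by type
        ("example:Protein1", "rdf:type", "biopax:Protein", None),
        ("example:Protein1", "biopax:displayName", None, "calmodulin"),
    ]
    conn.executemany("INSERT INTO statements VALUES (?, ?, ?, ?)", rows)
    return conn


def reactions_matching(conn, text):
    """Return (id, name) for reactions whose name contains `text`."""
    query = "SELECT id, name FROM biochemical_reaction WHERE name LIKE ? ORDER BY id"
    return conn.execute(query, (f"%{text}%",)).fetchall()
```

With a view like this, callers ask for reactions by name and never touch the triple layout directly, which is roughly what the abstraction would buy.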

IMHO, having a set of SQLite downloads of all available pathway BioPAX files would be appealing to a lot of people.

@cmungall

As a demonstrator, I put up a version of Reactome here: https://s3.amazonaws.com/bbop-sqlite/reactome-Homo-sapiens.db.gz

I haven't tried PC yet, but once you have the initial download, SQLite obviously bypasses the need for any start-up parse, making it quite nice for interactive exploration.

You can query things at a (very low-level) RDF level:

sqlite> select subject from rdf_type_statement where object = 'biopax:BiochemicalReaction' limit 5;
reactome.biopax:BiochemicalReaction1
reactome.biopax:BiochemicalReaction10
reactome.biopax:BiochemicalReaction100
reactome.biopax:BiochemicalReaction1000
reactome.biopax:BiochemicalReaction10000
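The same kind of exploration can be done from Python with the built-in `sqlite3` module. The sketch below assumes the file linked above has been downloaded and decompressed locally, and that it exposes `rdf_type_statement` with `subject` and `object` columns as in the query above; the function name and the local path are hypothetical.

```python
import sqlite3


def count_instances_by_type(conn, limit=10):
    """Return (type, count) pairs, most frequent types first."""
    query = (
        "SELECT object, COUNT(*) AS n FROM rdf_type_statement "
        "GROUP BY object ORDER BY n DESC, object LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Usage, assuming a local decompressed copy (path is an assumption):
# conn = sqlite3.connect("reactome-Homo-sapiens.db")
# for rdf_type, n in count_instances_by_type(conn):
#     print(rdf_type, n)
```

Because nothing is parsed at start-up, a query like this only touches the rows it needs.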

It looks like you are parsing BioPAX as XML rather than as OWL, so I am not sure this would be a simple drop-in replacement:

https://github.com/indralab/pybiopax/blob/7a90a177a8a08274931b8f9df52f916751ab5e37/pybiopax/biopax/model.py#L109-L123

FWIW, you can even use OAK to treat it as a (strangely behaved) OWL "ontology":

runoak -i sqlite:obo:reactome-Homo-sapiens descendants .desc//p=t biopax:BiochemicalReaction .and t~Calmodulin
reactome.biopax:BiochemicalReaction11326 ! IQGAPs bind F-actin, which is inhibited by calmodulin
reactome.biopax:BiochemicalReaction10284 ! Calmodulin activates Cam-PDE 1
reactome.biopax:BiochemicalReaction10300 ! Inactive catalytic PP2B is activated by the binding of calmodulin
reactome.biopax:BiochemicalReaction1327 ! Calcium binds calmodulin
reactome.biopax:BiochemicalReaction10662 ! Active calmodulin binds CAMK2
reactome.biopax:BiochemicalReaction8747 ! Sepiapterin reductase (SPR) is phosphorylated by Ca2+/calmodulin-dependent protein kinase II
reactome.biopax:BiochemicalReaction6200 ! CaMKK binds activated calmodulin in the nucleus
reactome.biopax:BiochemicalReaction6208 ! CaMKK binds activated calmodulin in the cytosol
reactome.biopax:BiochemicalReaction6204 ! Calmodulin binds CAMK4
reactome.biopax:BiochemicalReaction6210 ! CAMK1 binds calmodulin
reactome.biopax:BiochemicalReaction6196 ! Activated calmodulin binds ADCY1,ADCY8
reactome.biopax:BiochemicalReaction6199 ! Activated calmodulin dissociates from CaMKII-gamma
reactome.biopax:BiochemicalReaction6197 ! Calmodulin-activated adenylate cyclases ADCY1 and ADCY8 generate cAMP
reactome.biopax:BiochemicalReaction6184 ! CaMKII binds activated calmodulin
reactome.biopax:BiochemicalReaction6183 ! Calcium binds calmodulin at the synapse
reactome.biopax:BiochemicalReaction11015 ! S-Farn-Me KRAS4B binds calmodulin
reactome.biopax:BiochemicalReaction11016 ! Calmodulin dissociates KRAS4B from the plasma membrane
reactome.biopax:BiochemicalReaction11291 ! MYLK (MLCK) Active Calmodulin Binding
