PIPP Knowledge Graph

This project contains source code for the ETL supporting the ingestion of data into Neo4j, and all data and analysis for the paper below. Developed on Python 3.9 and Neo4j 5.10. Please note that this repository makes use of data from the WAHIS platform which requires the following statement: “WOAH bears no responsibility for the integrity or accuracy of the data contained herein, in particular due, but not limited to, any deletion, manipulation, or reformatting of data that may have occurred beyond its control.”

Getting started

This repository is organized into several folders: archive, cache, data, database_dump, manuscript_figures, network, src, and tests. Some of these folders contain subfolders and zipped files. All zipped files must be unzipped before proceeding.
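
The unzipping step can be scripted rather than done by hand. The following is a minimal sketch, assuming the archives are ordinary .zip files and that each one should be extracted into the folder that contains it. The helper name unzip_all is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_all(root: Path) -> list[Path]:
    """Extract every .zip archive found under root, next to the archive itself.

    Returns the list of archives that were extracted.
    """
    extracted = []
    # Collect the archives first so files created during extraction
    # do not change the directory walk.
    for archive in sorted(root.rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Calling unzip_all(Path(".")) from the repository root would extract every archive in place. Existing files with the same names are overwritten.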

Building the knowledge graph and running the queries is not necessary to reproduce visualized results, as the results from the queries (and the full knowledge graph) are provided as .csv files. All data and figures are provided in the manuscript_figures directory. A separate README is available in that directory.

If you wish to build the graph and run the queries, you may proceed in one of two ways: restore from a database dump or build the knowledge graph locally. For either method, you will need to download and install Neo4j (here).

Restore from dump

Download the hpai_kg_backup.dump file from the database_dump directory.
Open Neo4j, creating a new project if needed.
Click into the project and then click "Reveal files" (in your directory or Finder).
Move the downloaded database dump into the project folder you just opened.
Navigate back to Neo4j, where the .dump file should now appear under files.
Hover over the file, click the three dots to the right, and select "Create new DBMS from dump."
Name your new database and set a password for it.
The database should then load. You can start the DBMS and use the blue "Open" button to open the Neo4j Browser, where you can view the graph's structure by clicking the node and relationship labels in the left pane or by writing Cypher queries.

Building the knowledge graph locally

Deployment

Create a virtual environment

python3 -m venv env

Install required Python modules

source env/bin/activate
pip3 install -r requirements.txt

Set the environment variables in a .env file. Some folders may need to be unzipped, and you may need to install and set up a Neo4j database (https://neo4j.com/docs/operations-manual/current/installation/). You will also need to create an NCBI account (https://account.ncbi.nlm.nih.gov/) with an API key and a GeoNames account (http://www.geonames.org/).

NEO4J_URI=<neo4j_uri>
NEO4J_USER=<neo4j_user>
NEO4J_PASSWORD=<neo4j_password>
NEO4J_DATABASE=<neo4j_database>
GEO_USER=<geonames_user>
NCBI_API_KEY=<api_key>
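
Before running the ETL, it can help to confirm that every variable listed above has been filled in. The following is a minimal standard-library sketch, not the loading mechanism used by this repository, which may rely on a dedicated package instead. The names parse_env and missing_keys are hypothetical, and treating a value that still looks like <placeholder> as unset is an assumption of this sketch.

```python
REQUIRED_KEYS = (
    "NEO4J_URI",
    "NEO4J_USER",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "GEO_USER",
    "NCBI_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing
```

For example, missing_keys(parse_env(open(".env").read())) returns an empty list once all six variables are set.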

Create taxa and geographical constraints

CREATE CONSTRAINT taxId_UQ FOR (taxon:Taxon) REQUIRE taxon.taxId IS UNIQUE;
CREATE CONSTRAINT geonameId_UQ FOR (geography:Geography) REQUIRE geography.geonameId IS UNIQUE;
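
If you prefer to create the constraints from a script, the statements can be generated from a label-to-property mapping. This is a sketch only: the helper unique_constraint and the UNIQUE_KEYS mapping are hypothetical, and the optional IF NOT EXISTS clause (valid Neo4j 5 syntax) is an addition that makes the statement safe to re-run; it does not appear in the statements above.

```python
# Node label -> property that must be unique, taken from the two statements above.
UNIQUE_KEYS = {"Taxon": "taxId", "Geography": "geonameId"}


def unique_constraint(label: str, prop: str, if_not_exists: bool = False) -> str:
    """Build a Neo4j 5 uniqueness-constraint statement named <prop>_UQ."""
    var = label.lower()  # e.g. "Taxon" -> "taxon", matching the variable names above
    guard = " IF NOT EXISTS" if if_not_exists else ""
    return (
        f"CREATE CONSTRAINT {prop}_UQ{guard} "
        f"FOR ({var}:{label}) REQUIRE {var}.{prop} IS UNIQUE"
    )


def constraint_statements(if_not_exists: bool = False) -> list[str]:
    """Return one statement per entry in UNIQUE_KEYS."""
    return [
        unique_constraint(label, prop, if_not_exists)
        for label, prop in UNIQUE_KEYS.items()
    ]
```

Each returned string can be pasted into the Neo4j Browser or passed to session.run in the official Neo4j Python driver.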

Create knowledge graph locally

python main.py

Development and testing

All source code is designed to stop after hitting an error. The most common errors are API related, usually triggered by reaching credit limits, malformed API responses, or excessive throttling. On rare occasions, batch sizes can trigger errors in Neo4j; adjust the batch size to suit your hardware.
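
To illustrate what adjusting the batch size means in practice, here is a minimal sketch of fail-fast batched loading. It is not the repository's implementation: the function names batches and load_in_batches, and the write_batch callback standing in for a Neo4j write, are all hypothetical.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of rows, each holding at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_in_batches(
    rows: Sequence[T],
    size: int,
    write_batch: Callable[[Sequence[T]], None],
) -> int:
    """Pass rows to write_batch in chunks and return the number of rows written.

    No exception is caught here, so the first failing batch stops the run,
    in line with the stop-on-error behaviour described above.
    """
    written = 0
    for batch in batches(rows, size):
        write_batch(batch)
        written += len(batch)
    return written
```

A smaller size lowers the memory needed per write at the cost of more round trips, so it is the first setting to reduce if Neo4j reports errors during ingestion.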

Unit test coverage

coverage run -m pytest -vm unit

Integrity test coverage

coverage run -m pytest -vm integrity

Timing execution

Set the logging level to DEBUG in main.py

logger.add(sys.stderr, level="DEBUG")
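
How the timings themselves are recorded is not described here. As an illustration of why the DEBUG level matters, the sketch below times a function and reports the result only as a DEBUG message. It uses the standard library logging module as a stand-in for the logger configured in main.py, and the decorator name timed is hypothetical.

```python
import functools
import logging
import time

log = logging.getLogger("timing")


def timed(func):
    """Log how long func takes, as a DEBUG message, every time it is called."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned or raised, so failures are timed too.
            elapsed = time.perf_counter() - start
            log.debug("%s took %.3f s", func.__name__, elapsed)

    return wrapper
```

With the logger level above DEBUG the decorated function behaves as before and nothing is printed; lowering the level to DEBUG makes the timing lines appear.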
