
Get started

This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. A generic RDF graph based on the input data structure is generated, and the user designs SPARQL queries to map this generic RDF to a specific model.

  • Only Docker is required to run the pipeline. Check out the Wiki if you have issues with the Docker installation.
  • The following documentation focuses on Linux and MacOS.
  • Windows documentation can be found here.
  • Modules are from the Data2Services ecosystem.

The Data2Services philosophy

Each container runs with a few parameters (input file path, SPARQL endpoint, credentials, mapping file path):

  • Build the Docker images
  • Start services that need to be running
  • Execute the containers you want, providing the proper parameters

Clone

git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git

cd data2services-pipeline

# To update all submodules
git submodule update --recursive --remote

Build

build.sh is a convenience script that builds all the Docker images, but they can also be built separately.

# Download Apache Drill
curl http://apache.40b.nl/drill/drill-1.15.0/apache-drill-1.15.0.tar.gz -o apache-drill/apache-drill-1.15.0.tar.gz
# Build docker images (don't forget to get GraphDB zip file)
./build.sh

Start services

A production deployment assumes that both the Apache Drill and GraphDB services are running. Other RDF triple stores should also work, but have not been tested yet.

# Build and start apache-drill
docker build -t apache-drill ./apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
  --name drill -v /data:/data:ro \
  apache-drill
# Build and start graphdb
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
  -v /data/graphdb:/opt/graphdb/home \
  -v /data/graphdb-import:/root/graphdb-import \
  graphdb
  • For MacOS, make sure that access to the /data directory has been granted in the Docker configuration.
  • Check the Wiki to learn how to run the two containers with docker-compose.
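
The two services can also be started together with docker-compose, as mentioned in the Wiki. A minimal sketch mirroring the docker run commands above (service names, ports, and volumes are taken from those commands; adapt as needed):

```yaml
version: "3"
services:
  drill:
    build: ./apache-drill
    container_name: drill
    tty: true
    ports:
      - "8047:8047"
      - "31010:31010"
    volumes:
      - /data:/data:ro
  graphdb:
    build: ./graphdb
    container_name: graphdb
    ports:
      - "7200:7200"
    volumes:
      - /data/graphdb:/opt/graphdb/home
      - /data/graphdb-import:/root/graphdb-import
```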

Run using Docker commands

  • Check the Wiki for more details on how to run the Docker containers (sharing volumes, linking containers).
  • The directory containing the files to convert must be under /data (to comply with the Apache Drill shared volume).
  • In these examples we use /data/data2services as the working directory containing all the files; note that it is usually mounted as /data inside the Docker containers.

Download datasets

Source files can be downloaded automatically using shell scripts. See the data2services-download module for more details.

docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
  vemonet/data2services-download \
  --download-datasets drugbank,hgnc,date \
  --username my_login --password my_password \
  --clean # to delete all files in /data/data2services

Convert XML

Use xml2rdf to convert XML files to a generic RDF based on the file structure.

docker build -t xml2rdf ./xml2rdf
docker run --rm -it -v /data:/data \
  xml2rdf  \
  -i "/data/data2services/myfile.xml.gz" \
  -o "/data/data2services/myfile.nq.gz" \
  -g "https://w3id.org/data2services/graph/xml2rdf"
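
To sanity-check the result, you can peek at the gzipped N-Quads without fully decompressing them. The quad below is a hypothetical stand-in for xml2rdf output; the actual URIs depend on your XML structure:

```shell
# Create a small hypothetical N-Quads sample (stand-in for the xml2rdf output)
printf '%s .\n' \
  '<https://w3id.org/data2services/data/myfile/1> <https://w3id.org/data2services/model/value> "example" <https://w3id.org/data2services/graph/xml2rdf>' \
  > sample.nq
gzip -f sample.nq
# Peek at the first lines of the gzipped output (works on Linux and macOS)
gunzip -c sample.nq.gz | head -n 5
```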

Generate R2RML mapping file for TSV & RDB

We use AutoR2RML to generate the R2RML mapping file that converts relational databases (Postgres, SQLite, MariaDB) and CSV, TSV, and PSV files to a generic RDF representing the input data structure. See the Wiki for other DBMSs and how to deploy the databases.

docker build -t autor2rml ./AutoR2RML
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
  autor2rml \
  -j "jdbc:drill:drillbit=drill:31010" -r \
  -o "/data/data2services/mapping.trig" \
  -d "/data/data2services" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container 
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
  autor2rml \
  -j "jdbc:postgresql://postgres:5432/my_database" -r \
  -o "/data/data2services/mapping.trig" \
  -u "postgres" -p "pwd" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"
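
For reference, an R2RML mapping file such as the generated mapping.trig describes how a logical table becomes triples. A minimal hand-written sketch using the standard W3C R2RML vocabulary (the table and column names are hypothetical, not actual AutoR2RML output):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#MyTableMapping> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "mytable" ] ;
  rr:subjectMap [
    rr:template "https://w3id.org/data2services/mytable/{id}" ] ;
  rr:predicateObjectMap [
    rr:predicate <https://w3id.org/data2services/model/name> ;
    rr:objectMap [ rr:column "name" ] ] .
```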

Use R2RML mapping file to generate RDF

Generate the generic RDF using R2RML and the previously generated mapping.trig file.

docker build -t r2rml ./r2rml
# Create the config.properties file for R2RML in /data/data2services
cat <<EOF > /data/data2services/config.properties
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
EOF
# Run R2RML for Drill (use --link postgres:postgres for Postgres)
docker run -it --rm --link drill:drill \
  -v /data/data2services:/data \
  r2rml /data/config.properties

Upload RDF

Finally, use RdfUpload to upload the generated RDF to GraphDB. This can also be done manually using GraphDB server imports, which is more efficient for large files.

docker build -t rdf-upload ./RdfUpload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
  rdf-upload \
  -m "HTTP" -if "/data" \
  -url "http://graphdb:7200" \
  -rep "test" \
  -un "import_user" -pw "PASSWORD"

Transform generic RDF to target model

The last step is to transform the generated generic RDF into a particular data model. See the data2services-transform-repository project for examples of transformation to the BioLink model, using the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository.

docker pull vemonet/data2services-sparql-operations

# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
  vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
  -ep "http://graphdb:7200/repositories/test/statements" \
  -un MYUSERNAME -pw MYPASSWORD \
  -var outputGraph:https://w3id.org/data2services/graph/biolink/uniprot

# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
  -ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
  -un USERNAME -pw PASSWORD \
  -var serviceUrl:http://localhost:7200/repositories/test inputGraph:http://data2services/graph/xml2rdf/drugbank outputGraph:https://w3id.org/data2services/graph/biolink/drugbank
  • Examples of the SPARQL queries used for conversion to the BioLink model can be found in the data2services-transform-repository.
  • It is recommended to write multiple SPARQL queries with simple goals (get all drug information, get all drug-drug interactions, get gene information) rather than one complex query addressing everything.
  • For Windows PowerShell, remove the \ characters and put each docker run command on a single line.
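
Such a transformation query is typically a SPARQL INSERT that reads from the generic graph and writes to the target graph. A minimal illustrative sketch (the source predicate is hypothetical; the actual queries live in the data2services-transform-repository, with the graph IRIs passed via -var):

```sparql
INSERT {
  GRAPH <https://w3id.org/data2services/graph/biolink/drugbank> {
    ?drug a <https://w3id.org/biolink/vocab/Drug> ;
          <https://w3id.org/biolink/vocab/name> ?name .
  }
} WHERE {
  GRAPH <https://w3id.org/data2services/graph/xml2rdf> {
    # hasName is a placeholder for whatever predicate xml2rdf
    # derived from your input file's structure
    ?drug <https://w3id.org/data2services/model/hasName> ?name .
  }
}
```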

Further documentation in the Wiki


Citing this work

If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:

Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.

Bibtex entry:

@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}
