
Get started

This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. A generic RDF graph based on the input data structure is generated, and the user designs SPARQL queries to map this generic RDF to a specific model.

  • Only Docker is required to run the pipeline. Check out the Wiki if you have issues with the Docker installation.
  • The following documentation focuses on Linux and MacOS.
  • Windows documentation can be found here.
  • Modules are from the Data2Services ecosystem.

The Data2Services philosophy

Each container runs with a few parameters (input file path, SPARQL endpoint, credentials, mapping file path):

  • Build the Docker images
  • Start services that need to be running
  • Execute the containers you want, providing the proper parameters

Clone

git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git

cd data2services-pipeline

# To update all submodules
git submodule update --recursive --remote

Build

build.sh is a convenience script that builds all the Docker images, but they can also be built separately.

# Download Apache Drill
curl http://apache.40b.nl/drill/drill-1.15.0/apache-drill-1.15.0.tar.gz -o apache-drill/apache-drill-1.15.0.tar.gz
# Build docker images (don't forget to get GraphDB zip file)
./build.sh

Start services

A production deployment assumes that both the Apache Drill and GraphDB services are running. Other RDF triple stores should also work, but have not been tested yet.

# Build and start apache-drill
docker build -t apache-drill ./apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
  --name drill -v /data:/data:ro \
  apache-drill
# Build and start graphdb
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
  -v /data/graphdb:/opt/graphdb/home \
  -v /data/graphdb-import:/root/graphdb-import \
  graphdb
  • For MacOS, make sure that access to the /data directory has been granted in the Docker configuration.
  • Check the Wiki to learn how to run the two containers with docker-compose.
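
The two services can also be started together with docker-compose, as mentioned in the Wiki. A minimal sketch mirroring the docker run commands above (service names, ports, and volumes are taken from those commands; adapt as needed):

```yaml
version: "3"
services:
  drill:
    build: ./apache-drill
    container_name: drill
    tty: true
    ports:
      - "8047:8047"
      - "31010:31010"
    volumes:
      - /data:/data:ro
  graphdb:
    build: ./graphdb
    container_name: graphdb
    ports:
      - "7200:7200"
    volumes:
      - /data/graphdb:/opt/graphdb/home
      - /data/graphdb-import:/root/graphdb-import
```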

Run using Docker commands

  • Check the Wiki for more details on how to run the Docker containers (sharing volumes, linking containers).
  • The directory containing the files to convert must be under /data (to comply with the Apache Drill shared volume).
  • In these examples we use /data/data2services as the working directory containing all the files; note that it is usually mounted as /data inside the Docker containers.

Download datasets

Source files can be downloaded automatically using shell scripts. See the data2services-download module for more details.

docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
  vemonet/data2services-download \
  --download-datasets drugbank,hgnc,date \
  --username my_login --password my_password \
  --clean # to delete all files in /data/data2services

Convert XML

Use xml2rdf to convert XML files to a generic RDF based on the file structure.

docker build -t xml2rdf ./xml2rdf
docker run --rm -it -v /data:/data \
  xml2rdf  \
  -i "/data/data2services/myfile.xml.gz" \
  -o "/data/data2services/myfile.nq.gz" \
  -g "https://w3id.org/data2services/graph/xml2rdf"
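
To sanity-check the result, you can peek at the gzipped N-Quads without fully decompressing them. The quad below is a hypothetical stand-in for xml2rdf output; the actual URIs depend on your XML structure:

```shell
# Create a small hypothetical N-Quads sample (stand-in for the xml2rdf output)
printf '%s .\n' \
  '<https://w3id.org/data2services/data/myfile/1> <https://w3id.org/data2services/model/value> "example" <https://w3id.org/data2services/graph/xml2rdf>' \
  > sample.nq
gzip -f sample.nq
# Peek at the first lines of the gzipped output (works on Linux and macOS)
gunzip -c sample.nq.gz | head -n 5
```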

Generate R2RML mapping file for TSV & RDB

We use AutoR2RML to generate the R2RML mapping file that converts relational databases (Postgres, SQLite, MariaDB) and CSV, TSV, and PSV files to a generic RDF representing the input data structure. See the Wiki for other DBMSs and how to deploy the databases.

docker build -t autor2rml ./AutoR2RML
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
  autor2rml \
  -j "jdbc:drill:drillbit=drill:31010" -r \
  -o "/data/data2services/mapping.trig" \
  -d "/data/data2services" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container 
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
  autor2rml \
  -j "jdbc:postgresql://postgres:5432/my_database" -r \
  -o "/data/data2services/mapping.trig" \
  -u "postgres" -p "pwd" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"
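
For reference, an R2RML mapping file such as the generated mapping.trig describes how a logical table becomes triples. A minimal hand-written sketch using the standard W3C R2RML vocabulary (the table and column names are hypothetical, not actual AutoR2RML output):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#MyTableMapping> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "mytable" ] ;
  rr:subjectMap [
    rr:template "https://w3id.org/data2services/mytable/{id}" ] ;
  rr:predicateObjectMap [
    rr:predicate <https://w3id.org/data2services/model/name> ;
    rr:objectMap [ rr:column "name" ] ] .
```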

Use R2RML mapping file to generate RDF

Generate the generic RDF using R2RML and the previously generated mapping.trig file.

docker build -t r2rml ./r2rml
# Create the config.properties file for R2RML in /data/data2services
cat <<EOF > /data/data2services/config.properties
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
EOF
# Run R2RML for Drill (use --link postgres:postgres for Postgres)
docker run -it --rm --link drill:drill \
  -v /data/data2services:/data \
  r2rml /data/config.properties

Upload RDF

Finally, use RdfUpload to upload the generated RDF to GraphDB. This can also be done manually using GraphDB server imports, which is more efficient for large files.

docker build -t rdf-upload ./RdfUpload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
  rdf-upload \
  -m "HTTP" -if "/data" \
  -url "http://graphdb:7200" \
  -rep "test" \
  -un "import_user" -pw "PASSWORD"

Transform generic RDF to target model

The last step is to transform the generated generic RDF into a particular data model. See the data2services-transform-repository project for examples of transformation to the BioLink model, using the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository.

docker pull vemonet/data2services-sparql-operations

# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
  vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
  -ep "http://graphdb:7200/repositories/test/statements" \
  -un MYUSERNAME -pw MYPASSWORD \
  -var outputGraph:https://w3id.org/data2services/graph/biolink/uniprot

# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
  -ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
  -un USERNAME -pw PASSWORD \
  -var serviceUrl:http://localhost:7200/repositories/test inputGraph:http://data2services/graph/xml2rdf/drugbank outputGraph:https://w3id.org/data2services/graph/biolink/drugbank
  • Examples of the SPARQL queries used for conversion to the BioLink model can be found in the data2services-transform-repository.
  • It is recommended to write multiple SPARQL queries with simple goals (get all drug information, get all drug-drug interactions, get gene information) rather than one complex query addressing everything.
  • For Windows PowerShell, remove the \ characters and put each docker run command on a single line.
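
Such a transformation query is typically a SPARQL INSERT that reads from the generic graph and writes to the target graph. A minimal illustrative sketch (the source predicate is hypothetical; the actual queries live in the data2services-transform-repository, with the graph IRIs passed via -var):

```sparql
INSERT {
  GRAPH <https://w3id.org/data2services/graph/biolink/drugbank> {
    ?drug a <https://w3id.org/biolink/vocab/Drug> ;
          <https://w3id.org/biolink/vocab/name> ?name .
  }
} WHERE {
  GRAPH <https://w3id.org/data2services/graph/xml2rdf> {
    # hasName is a placeholder for whatever predicate xml2rdf
    # derived from your input file's structure
    ?drug <https://w3id.org/data2services/model/hasName> ?name .
  }
}
```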

Further documentation in the Wiki


Citing this work

If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:

Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.

Bibtex entry:

@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}
