This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. A generic RDF, reflecting the structure of the input data, is generated first; the user then designs SPARQL queries to map this generic RDF to a specific model.
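For example, a user-designed mapping query could look like the sketch below. The generic predicate IRI and the FOAF target vocabulary are illustrative placeholders, not the pipeline's actual output; only the autor2rml named graph matches the one used later in this README.

# Hypothetical mapping query: lift generic RDF into a target model
cat > map-person.rq <<'EOF'
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
INSERT {
  GRAPH <https://w3id.org/data2services/graph/mymodel> {
    ?person a foaf:Person ;
            foaf:name ?name .
  }
}
WHERE {
  GRAPH <https://w3id.org/data2services/graph/autor2rml> {
    # Generic RDF: one predicate per source column (predicate IRI is a placeholder)
    ?person <https://w3id.org/data2services/model/person/name> ?name .
  }
}
EOF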
- Only Docker is required to run the pipeline. Check out the Wiki if you have issues with the Docker installation.
- The following documentation focuses on Linux and macOS.
- Windows documentation can be found here.
- Modules are from the Data2Services ecosystem.
Containers run with a few parameters (input file path, SPARQL endpoint, credentials, mapping file path). The typical workflow is:
- Build the Docker images
- Start services that need to be running
- Execute the containers you want, providing the proper parameters
git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git
cd data2services-pipeline
# To update all submodules
git submodule update --recursive --remote
`build.sh` is a convenience script to build all Docker images, but they can also be built separately.
- You need to download the Apache Drill installation bundle and the GraphDB standalone zip (register to get an email with the download URL).
- Then put the `.tar.gz` and `.zip` files in the `./apache-drill` and `./graphdb` directories.
# Download Apache Drill
curl http://apache.40b.nl/drill/drill-1.15.0/apache-drill-1.15.0.tar.gz -o apache-drill/apache-drill-1.15.0.tar.gz
# Build docker images (don't forget to get GraphDB zip file)
./build.sh
In a production environment, we assume that both the Apache Drill and GraphDB services are already running. Other RDF triple stores should also work, but have not been tested yet.
# Build and start apache-drill
docker build -t apache-drill ./apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
--name drill -v /data:/data:ro \
apache-drill
# Build and start graphdb
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
-v /data/graphdb:/opt/graphdb/home \
-v /data/graphdb-import:/root/graphdb-import \
graphdb
- For macOS, make sure that access to the `/data` directory has been granted in the Docker configuration.
- Check the Wiki to use `docker-compose` to run the two containers.
- Check the Wiki for more details on how to run Docker containers (sharing volumes, linking containers).
- The directory containing the files to convert needs to be under `/data` (to comply with the Apache Drill shared volume).
- In these examples we use `/data/data2services` as the working directory (containing all the files); note that it is usually shared as `/data` inside the Docker containers.
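With these conventions, preparing the working directory could look like the following sketch (the file names are placeholders):

# Create the working directory shared with the containers
sudo mkdir -p /data/data2services
# Copy the files to convert into it
cp myfile.xml.gz mytable.csv /data/data2services/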
Source files can be downloaded automatically using shell scripts. See the data2services-download module for more details.
docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
vemonet/data2services-download \
--download-datasets drugbank,hgnc,date \
--username my_login --password my_password \
--clean # to delete all files in /data/data2services
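If you prefer to script a download yourself, a minimal shell script in the same spirit might look like this sketch (the dataset name, URL, and layout are hypothetical; the module defines its own conventions):

#!/bin/bash
# Hypothetical download script: fetch a dataset into the shared volume
DATASET=mydataset
mkdir -p /data/data2services/$DATASET
wget -N ftp://ftp.example.org/pub/$DATASET.csv.gz -P /data/data2services/$DATASET/
gunzip -f /data/data2services/$DATASET/$DATASET.csv.gz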
Use xml2rdf to convert XML files to a generic RDF based on the file structure.
docker build -t xml2rdf ./xml2rdf
docker run --rm -it -v /data:/data \
xml2rdf \
-i "/data/data2services/myfile.xml.gz" \
-o "/data/data2services/myfile.nq.gz" \
-g "https://w3id.org/data2services/graph/xml2rdf"
We use AutoR2RML to generate the R2RML mapping file to convert relational databases (Postgres, SQLite, MariaDB) and CSV, TSV, and PSV files to a generic RDF representing the input data structure. See the Wiki for other DBMS and how to deploy databases.
docker build -t autor2rml ./AutoR2RML
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
autor2rml \
-j "jdbc:drill:drillbit=drill:31010" -r \
-o "/data/data2services/mapping.trig" \
-d "/data/data2services" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
autor2rml \
-j "jdbc:postgresql://postgres:5432/my_database" -r \
-o "/data/data2services/mapping.trig" \
-u "postgres" -p "pwd" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
Generate the generic RDF using R2RML and the previously generated `mapping.trig` file.
docker build -t r2rml ./r2rml
# Add config.properties file for R2RML in /data/data2services
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
# Run R2RML for Drill
# (for Postgres, use --link postgres:postgres instead)
docker run -it --rm --link drill:drill \
-v /data/data2services:/data \
r2rml /data/config.properties
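The `outputFile` configured above is written inside the container's `/data`, which is the host's `/data/data2services`. A quick sanity check on the generated N-Quads:

# Peek at the first statements of the generic RDF
head -n 5 /data/data2services/rdf_output.nq
wc -l /data/data2services/rdf_output.nq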
Finally, use RdfUpload to upload the generated RDF to GraphDB. This can also be done manually through GraphDB server imports, which is more efficient for large files; see the sketch after the example below.
docker build -t rdf-upload ./RdfUpload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
rdf-upload \
-m "HTTP" -if "/data" \
-url "http://graphdb:7200" \
-rep "test" \
-un "import_user" -pw "PASSWORD"
The last step is to transform the generic RDF to a particular data model. See the data2services-transform-repository project for examples of transformation to the BioLink model, using the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository.
docker pull vemonet/data2services-sparql-operations
# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
-ep "http://graphdb:7200/repositories/test/statements" \
-un MYUSERNAME -pw MYPASSWORD \
-var outputGraph:https://w3id.org/data2services/graph/biolink/uniprot
# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
-ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
-un USERNAME -pw PASSWORD \
-var serviceUrl:http://localhost:7200/repositories/test \
     inputGraph:http://data2services/graph/xml2rdf/drugbank \
     outputGraph:https://w3id.org/data2services/graph/biolink/drugbank
- You can find examples of the SPARQL queries used for conversion to RDF BioLink in the data2services-transform-repository.
- It is recommended to write multiple SPARQL queries with simple goals (get all drug information, get all drug-drug interactions, get gene information) rather than one complex query addressing everything.
- For Windows PowerShell, remove the `\` line continuations and put each `docker run` command on a single line.
- Docker documentation (fix known issues, run, share volumes, link containers, network)
- Run using docker-compose
- Run AutoR2RML with various DBMS
- Fix CSV, TSV, PSV files without columns
- Run on Windows
- Run using convenience scripts
- Run Postgres
- Run MariaDB
- Secure GraphDB
- BETA: RDF validation using ShEx
If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:
Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.
BibTeX entry:
@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}