Blend, grind, and enjoy LOD – fresh from the mill!
This software implements the backend of http://lobid.org, the LOD service offered by hbz.
See the `.travis.yml` file for details on the CI config used by Travis.
Raw data is transformed with Metafacture and Hadoop, indexed in Elasticsearch, and exposed via an HTTP API implemented with the Play framework:
The `lodmill-rd` folder contains code for working with raw data and its transformation to LOD, based on the Culturegraph Metafacture toolkit. The `lodmill-ld` folder contains code for processing RDF with Hadoop and for indexing it in Elasticsearch. The `hbz/lobid` repo contains a web app based on the Play framework that interacts with the resulting Elasticsearch index. It provides an HTTP API to the index and a UI for documentation and basic sample usage. See below for details on the data workflow.
Prerequisites: Java 8 and Maven 3 (check `mvn -version` to make sure Maven is using Java 8).
To set up a local build of the lodmill components follow these steps:
```sh
git clone https://github.com/lobid/lodmill.git
cd lodmill
umask 0022 # required for MRUnit tests
mvn clean install -DdescriptorId=jar-with-dependencies -DskipIntegrationTests --settings settings.xml
cd lodmill-ld
mvn clean assembly:assembly --settings ../settings.xml
```
See the `convert.sh` and `index.sh` scripts in `lodmill-ld/doc/scripts` for how to use the resulting jar with your Hadoop and Elasticsearch clusters.
For information on how we have set up our Hadoop and Elasticsearch backend clusters, see `README.textile` in `lodmill-ld`.
The complete data workflow transforms the original raw data to Linked Data served via an HTTP API:
The raw data is transformed to linked data triples in the N-Triples RDF serialization. As a line-based format, N-Triples is well suited for batch processing with Hadoop. All triples, both those generated from the raw data and the enrichment data from external sources (like GND and Dewey), are stored in the Hadoop distributed file system (HDFS) and made available to the two Hadoop jobs that process and convert the data.
The first Hadoop job (implemented in `CollectSubjects.java`) selects triples that need to be resolved to build meaningful records for later indexing. One example is the `creator` triples for lobid-resources, e.g.:

```
<http://lobid.org/resource/HT002189125> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/118580604> .
```
The creator of the resource is identified by its GND ID. To allow searching by the actual name of the creator, we want to resolve the name literals, so we declare in the `resolve.properties` file that the `creator` property needs to be resolved:

```
resolve = \
http://purl.org/dc/elements/1.1/creator; \
[...]
```
With the property declared for resolution, when the first Hadoop job encounters the triple above, it maps the GND ID to the resource ID:

```
<http://d-nb.info/gnd/118580604> : <http://lobid.org/resource/HT002189125>
```

This information is written to a zipped Hadoop map file (a persistent lookup mechanism) after the first job completes.
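The core of this step can be pictured with the following Hadoop-free sketch, which parses N-Triples lines and builds the object-to-subject lookup. The class, method, and constant names (`CollectSubjectsSketch`, `collect`, `RESOLVE`) are illustrative assumptions, not the actual `CollectSubjects.java` API, and the parser only handles simple, well-formed lines:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hadoop-free sketch: for every N-Triples line whose predicate is
// declared for resolution, map the object URI (e.g. a GND ID) to the
// subject URI (the resource that references it).
public class CollectSubjectsSketch {

	private static final Set<String> RESOLVE = new HashSet<>(Arrays.asList(
			"<http://purl.org/dc/elements/1.1/creator>"));

	// Split a simple N-Triples line into subject, predicate, object.
	static String[] parse(String line) {
		String[] parts = line.trim().split("\\s+", 3);
		parts[2] = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop final dot
		return parts;
	}

	// Build the object-to-subject lookup for the declared predicates.
	static Map<String, String> collect(List<String> lines) {
		Map<String, String> lookup = new HashMap<>();
		for (String line : lines) {
			String[] triple = parse(line);
			if (RESOLVE.contains(triple[1])) {
				lookup.put(triple[2], triple[0]);
			}
		}
		return lookup;
	}

	public static void main(String[] args) {
		// Prints the GND-ID-to-resource mapping for the creator triple:
		System.out.println(collect(Arrays.asList(
				"<http://lobid.org/resource/HT002189125> "
						+ "<http://purl.org/dc/elements/1.1/creator> "
						+ "<http://d-nb.info/gnd/118580604> .")));
	}
}
```

In the real job this lookup is what ends up in the zipped Hadoop map file.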
The second Hadoop job (implemented in `NTriplesToJsonLd.java`) collects all triples with the same subject (i.e. all statements about a resource) and converts them to a JSON-LD record. In addition to the selected triples, we also need details about the entities marked for resolution in the first job. For instance, we want the `preferredNameForThePerson` of the `creator` in our records, so we declare it as a resolution property in `resolve.properties`:

```
predicates = \
http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson; \
[...]
```
This causes the second Hadoop job to perform a lookup on the subject of any triple containing that property, e.g.:

```
<http://d-nb.info/gnd/118580604> <http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson> "Melville, Herman" .
```
Since the subject is mapped to the `<http://lobid.org/resource/HT002189125>` resource ID, we add that triple to the triples of that resource. This yields a record for `<http://lobid.org/resource/HT002189125>` that contains not only the triples with that subject, but also the enrichment triples defined in the `resolve.properties` file. In effect, our records are subgraphs of the complete triple set. The same mechanism is used to resolve information modeled with blank nodes; see our current `resolve.properties` file for details.
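The record-building step can be sketched as follows: triples are grouped by subject, and enrichment triples whose subject appears in the lookup from the first job are filed under the resource that references them. The names are illustrative, not the actual `NTriplesToJsonLd.java` API, and this sketch redirects every triple whose subject is in the lookup, whereas the real job also checks the declared predicates:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: group triples into records, resolving enrichment subjects
// (e.g. GND IDs) to the resource IDs from the first job's lookup.
public class BuildRecordsSketch {

	static Map<String, List<String>> build(List<String> triples,
			Map<String, String> lookup) {
		Map<String, List<String>> records = new HashMap<>();
		for (String triple : triples) {
			String subject = triple.split("\\s+", 2)[0];
			// File enrichment triples under the referencing resource:
			String key = lookup.containsKey(subject) ? lookup.get(subject)
					: subject;
			if (!records.containsKey(key)) {
				records.put(key, new ArrayList<String>());
			}
			records.get(key).add(triple);
		}
		return records;
	}

	public static void main(String[] args) {
		Map<String, String> lookup = new HashMap<>();
		lookup.put("<http://d-nb.info/gnd/118580604>",
				"<http://lobid.org/resource/HT002189125>");
		List<String> triples = Arrays.asList(
				"<http://lobid.org/resource/HT002189125> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/118580604> .",
				"<http://d-nb.info/gnd/118580604> <http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson> \"Melville, Herman\" .");
		// Both triples end up in the record for HT002189125:
		System.out.println(build(triples, lookup));
	}
}
```

The resulting per-resource triple groups are what get serialized as JSON-LD records in the next step.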
The second Hadoop job writes the records as expanded JSON-LD in the Elasticsearch bulk import format, which is then indexed using the Elasticsearch bulk Java API. We use expanded JSON-LD in the index to get consistent types for each field: in compact JSON-LD, a field with a single value has the type of that value, while the same field with multiple values becomes an array. Since Elasticsearch learns the index schema from the data, every field needs a consistent type. The expanded JSON-LD serialization provides exactly that, e.g. it always uses arrays, even for a single value.
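To illustrate the difference with the name value from the example above (the surrounding record structure is simplified), compact JSON-LD might serialize a single name as a plain string:

```json
{ "preferredNameForThePerson": "Melville, Herman" }
```

while the expanded form always uses full URI keys and arrays, regardless of the number of values:

```json
[{
  "http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson": [
    { "@value": "Melville, Herman" }
  ]
}]
```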
Finally, our Play frontend accesses the Elasticsearch index and serves the records as JSON-LD. There are multiple query options and supported result formats; see the documentation at http://lobid.org/api (implemented in the `app/views/index.scala.html` template in the `hbz/lobid` repo). Since the expanded JSON-LD described above is cumbersome, the API serves compact JSON-LD. It also uses an external JSON-LD context document to allow shorter keys instead of full URI properties, and to encapsulate the actual properties used in the expanded form. That way, we can change the properties without requiring API clients to change how they process the JSON responses.
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html