Merge commit '89cb8ab2623eb3e7fad06c1eda45ffa5a0d950ba' as 'semantic_standardization'
seralf committed Nov 9, 2017
2 parents bb3c0ea + 89cb8ab commit 0efaa22
Showing 55 changed files with 136,506 additions and 0 deletions.
50 changes: 50 additions & 0 deletions semantic_standardization/.gitignore
@@ -0,0 +1,50 @@

.idea
.DS_Store
.history

# Eclipse #
.classpath
.project
.settings/
target


# Play! #
logs
.swagger-codegen-ignore
client/.gitignore
client/.swagger-codegen-ignore
client/build.gradle
client/build.sbt
client/git_push.sh
client/gradle.properties
client/gradle/
client/gradlew
client/gradlew.bat
client/pom.xml
client/settings.gradle
client/src/

# Play! #
bin/
/db
.eclipse
/lib/
/logs/
/modules
/project/project
/project/target
/target
tmp/
test-result
server.pid
*.eml
#/dist/
.cache
.cache-main
.cache-tests

# extra #
NO__lib

275 changes: 275 additions & 0 deletions semantic_standardization/README.md
@@ -0,0 +1,275 @@

semantic_standardization
==========================

This project is currently a proof of concept (POC) exploring the standardization of terms using the vocabulary `Istat-Classificazione-08-Territorio` and the ontology `CLV-AP_IT`.
The components are currently designed to use in-memory storage for this ontology and this vocabulary only, but they can be extended to work in the same way for other use cases and other ontology/vocabulary pairs.

Two endpoints are provided:

1. the first one retrieves a flat representation of a vocabulary (conceptually similar to a CSV, but in JSON), using an ad-hoc SPARQL query.
2. the second one exposes the list of ontology properties actually used in a vocabulary, returning the "local" hierarchy for each property.

Each endpoint (and its configured queries) serves a very specific domain, so later versions could introduce new vocabularies and ontologies, but each of them would need ad-hoc SPARQL queries to retrieve the required information.

## semantic annotation in DAF ingestion

The [DAF](https://github.com/italia/daf) `semantic_annotation` currently has the following structure: `{ontology}.{concept}.{property}`.
During the ingestion of datasets into the DAF platform, a `semantic_annotation` is used to relate a column of a dataset to the most appropriate property of an existing concept from the controlled vocabularies.

**Note** that while the annotation is used to relate cells to vocabularies, it does not explicitly store a reference to the vocabularies used. A reference to a concept from an ontology is used instead.
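
As a rough illustration of the `{ontology}.{concept}.{property}` structure, a client could split a tag as in the following Python sketch. The function and field names are hypothetical (the field names simply mirror the `ontology_id`, `concept_id` and `property_id` keys returned by the lookup endpoint), and the sketch assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class SemanticAnnotation(NamedTuple):
    """The three parts of a semantic_annotation tag (hypothetical helper type)."""
    ontology_id: str
    concept_id: str
    property_id: str


def parse_semantic_annotation(tag: str) -> SemanticAnnotation:
    """Split an '{ontology}.{concept}.{property}' tag into its three parts."""
    parts = tag.split(".")
    # Assumption: exactly three non-empty, dot-separated parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected '{{ontology}}.{{concept}}.{{property}}', got {tag!r}")
    return SemanticAnnotation(*parts)


annotation = parse_semantic_annotation(
    "POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier"
)
print(annotation.ontology_id, annotation.concept_id, annotation.property_id)
```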


## examples


### example: sequence of calls

1. retrieve the (vocabulary, ontology) reference from the `semantic_annotation` tag
```
curl -X GET 'http://localhost:9000/kb/v1/daf/annotation/lookup?semantic_annotation=POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier' -H "accept: application/json" -H "content-type: application/json"
```

2. retrieve the hierarchies for a given property
```
curl -X GET 'http://localhost:9000/kb/v1/hierarchies/properties?vocabulary_name=POICategoryClassification&ontology_name=poiapit&lang=it' -H "accept: application/json" -H "content-type: application/json"
```

3. retrieve the dataset values for a certain vocabulary
```
curl -X GET 'http://localhost:9000/kb/v1/vocabularies/POICategoryClassification?lang=it' -H "accept: application/json" -H "content-type: application/json"
```
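
The three request URLs above can also be built programmatically, which avoids shell quoting problems with `?` and `&`. The following Python sketch only assembles the URLs and performs no network call; the helper names are hypothetical, and the base URL is taken from the examples above.

```python
from urllib.parse import quote, urlencode

# Host, port and path prefix as used in the curl examples above.
BASE_URL = "http://localhost:9000/kb/v1"


def lookup_url(semantic_annotation: str) -> str:
    """URL for step 1: look up the (vocabulary, ontology) pair for a tag."""
    query = urlencode({"semantic_annotation": semantic_annotation})
    return f"{BASE_URL}/daf/annotation/lookup?{query}"


def hierarchies_url(vocabulary_name: str, ontology_name: str, lang: str = "it") -> str:
    """URL for step 2: hierarchies of the properties used by a vocabulary."""
    query = urlencode(
        {"vocabulary_name": vocabulary_name, "ontology_name": ontology_name, "lang": lang}
    )
    return f"{BASE_URL}/hierarchies/properties?{query}"


def vocabulary_url(vocabulary_name: str, lang: str = "it") -> str:
    """URL for step 3: flat dataset of a vocabulary."""
    query = urlencode({"lang": lang})
    return f"{BASE_URL}/vocabularies/{quote(vocabulary_name, safe='')}?{query}"


print(lookup_url("POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier"))
print(hierarchies_url("POICategoryClassification", "poiapit"))
print(vocabulary_url("POICategoryClassification"))
```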

----

### example: retrieving information from the semantic_annotation tag
With this endpoint we can retrieve information about the vocabulary/ontology pair related to a given `semantic_annotation` tag:

```
curl -X GET 'http://localhost:9000/kb/v1/daf/annotation/lookup?semantic_annotation={semantic_annotation}' \
-H "accept: application/json" -H "content-type: application/json"
```

For example, for the Point Of Interest vocabulary:

```
curl -X GET 'http://localhost:9000/kb/v1/daf/annotation/lookup?semantic_annotation=POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier' \
-H "accept: application/json" -H "content-type: application/json"
```

This will return a data structure similar to the following one for each tag:

```
[
  {
    "vocabulary_id": "POICategoryClassification",
    "vocabulary": "http://dati.gov.it/onto/controlledvocabulary/POICategoryClassification",
    "ontology": "http://dati.gov.it/onto/poiapit",
    "semantic_annotation": "POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier",
    "property_id": "POIcategoryIdentifier",
    "concept_id": "PointOfInterestCategory",
    "ontology_prefix": "poiapit",
    "ontology_id": "POI-AP_IT",
    "concept": "http://dati.gov.it/onto/poiapit#PointOfInterestCategory",
    "property": "http://dati.gov.it/onto/poiapit#POIcategoryIdentifier"
  }
]
```

The idea is to provide as much information as possible, so that the annotation can eventually be related to ontologies and vocabularies.
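
In the examples in this README, the `vocabulary_id` and `ontology_prefix` fields of a lookup record match the `vocabulary_name` and `ontology_name` parameters of the other two endpoints. Assuming that this mapping holds in general (it is inferred from the examples, not stated by the API), a client could derive the follow-up parameters as in this Python sketch; the function name is hypothetical.

```python
def follow_up_params(record: dict, lang: str = "it") -> dict:
    """Derive query parameters for the hierarchies and vocabularies calls.

    Assumption: 'vocabulary_id' maps to 'vocabulary_name' and
    'ontology_prefix' maps to 'ontology_name', as in the README examples.
    """
    vocabulary_name = record["vocabulary_id"]
    return {
        "hierarchies": {
            "vocabulary_name": vocabulary_name,
            "ontology_name": record["ontology_prefix"],
            "lang": lang,
        },
        "vocabulary": {"vocabulary_name": vocabulary_name, "lang": lang},
    }


# Record abridged from the example response above.
record = {
    "vocabulary_id": "POICategoryClassification",
    "ontology_prefix": "poiapit",
    "ontology_id": "POI-AP_IT",
    "semantic_annotation": "POI-AP_IT.PointOfInterestCategory.POIcategoryIdentifier",
}

params = follow_up_params(record)
print(params["hierarchies"])
print(params["vocabulary"])
```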


### example: retrieving a vocabulary dataset

We can obtain a de-normalized, tabular version of the vocabulary `Istat-Classificazione-08-Territorio` using the curl call:

```
curl -X GET 'http://localhost:9000/kb/v1/vocabularies/{vocabulary_name}?lang={lang}' \
-H "accept: application/json" -H "content-type: application/json"
```

A `SPARQL` query is used to create a proper tabular representation of the data.

#### example: PointOfInterest / POI-AP_IT

```
curl -X GET 'http://localhost:9000/kb/v1/hierarchies/properties?vocabulary_name=POICategoryClassification&ontology_name=poiapit&lang=it' -H "accept: application/json" -H "content-type: application/json"
```

This will return a data structure:

```
[
  {
    "vocabulary": "POI-AP_IT",
    "path": "POI-AP_IT.PointOfInterestCategory.definition",
    "hierarchy_flat": "PointOfInterestCategory",
    "hierarchy": [
      {
        "class": "PointOfInterestCategory",
        "level": 0
      }
    ]
  },
  ...
]
```


#### example: Luoghi Istat / CLV-AP_IT
```
$ curl -X GET "http://localhost:9000/kb/v1/vocabularies/Istat-Classificazione-08-Territorio?lang=it" -H "accept: application/json" -H "content-type: application/json"
```
This will return a result structure similar to the following one:

```
[
  [
    {"key": "CLV-AP_IT_Country_name", "value": "Italia"},
    {"key": "CLV-AP_IT_City_name", "value": "Abano Terme"},
    {"key": "CLV-AP_IT_Province_name", "value": "Padova"},
    {"key": "CLV-AP_IT_Region_name", "value": "Veneto"}
  ],
  [
    {"key": "CLV-AP_IT_Province_name", "value": "Lodi"},
    {"key": "CLV-AP_IT_City_name", "value": "Abbadia Cerreto"},
    {"key": "CLV-AP_IT_Country_name", "value": "Italia"},
    {"key": "CLV-AP_IT_Region_name", "value": "Lombardia"}
  ]
  ...
]
```

For technical reasons, a key such as `CLV-AP_IT_Region_name` is currently used in place of `CLV-AP_IT.Region.name`.
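
A client that prefers the dotted form can convert the keys back. The following Python sketch turns each row of the response into a dictionary with dotted keys; the helper names are hypothetical, and it assumes that the concept and property names contain no underscore (the ontology identifier may, as `CLV-AP_IT` does).

```python
def dotted_key(key: str) -> str:
    """Turn 'CLV-AP_IT_Region_name' into 'CLV-AP_IT.Region.name'.

    Splitting from the right keeps the underscore inside the ontology
    identifier intact. Assumption: concept and property names contain
    no underscore.
    """
    ontology, concept, prop = key.rsplit("_", 2)
    return f"{ontology}.{concept}.{prop}"


def row_to_dict(row: list) -> dict:
    """Collapse one row (a list of key/value cells) into a single dict."""
    return {dotted_key(cell["key"]): cell["value"] for cell in row}


# First row of the example response above.
row = [
    {"key": "CLV-AP_IT_Country_name", "value": "Italia"},
    {"key": "CLV-AP_IT_City_name", "value": "Abano Terme"},
    {"key": "CLV-AP_IT_Province_name", "value": "Padova"},
    {"key": "CLV-AP_IT_Region_name", "value": "Veneto"},
]

print(row_to_dict(row))
```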

### example: retrieve the hierarchies for the properties used

If we have the example vocabulary `Istat-Classificazione-08-Territorio`, which uses terms from the ontology `clvapit`, we can retrieve the local hierarchy associated with each property with the curl command:

```
$ curl -X GET 'http://localhost:9000/kb/v1/hierarchies/properties?vocabulary_name={vocabulary_name}&ontology_name={ontology_prefix}&lang={lang}' \
-H "accept: application/json" -H "content-type: application/json"
```

#### example: POI / POI-AP_IT

```
curl -X GET 'http://localhost:9000/kb/v1/vocabularies/POICategoryClassification?lang=it' \
-H "accept: application/json" -H "content-type: application/json"
```

This will return results such as:

```
[
  [
    {
      "key": "POI-AP_IT_PointOfInterestCategory_definition",
      "value": "Rientrano in questa categoria tutti i punti di interesse connessi all'intrattenimento come zoo, discoteche, pub, teatri, acquari, stadi, casino, parchi divertimenti, ecc."
    },
    {
      "key": "POI-AP_IT_PointOfInterestCategory_POICategoryName",
      "value": "Settore intrattenimento"
    },
    {
      "key": "POI-AP_IT_PointOfInterestCategory_POICategoryIdentifier",
      "value": "cat_1"
    }
  ],
  ...
]
```
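
Since the vocabulary response is conceptually a CSV expressed in JSON, it can be written out as an actual CSV. This is a minimal Python sketch using only the standard library; the function name is hypothetical, and the column order is simply the order in which keys are first seen, since the rows in the examples do not list their keys in a fixed order.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialize rows of key/value cells as CSV text with a header line."""
    # Collect column names in order of first appearance across all rows.
    columns = []
    for row in rows:
        for cell in row:
            if cell["key"] not in columns:
                columns.append(cell["key"])

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keys missing from a row are written as empty fields.
        writer.writerow({cell["key"]: cell["value"] for cell in row})
    return buffer.getvalue()


# Row abridged from the example response above.
rows = [
    [
        {"key": "POI-AP_IT_PointOfInterestCategory_POICategoryName",
         "value": "Settore intrattenimento"},
        {"key": "POI-AP_IT_PointOfInterestCategory_POICategoryIdentifier",
         "value": "cat_1"},
    ],
]

print(rows_to_csv(rows))
```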


#### example: Luoghi Istat / CLV-AP_IT

```
$ curl -X GET 'http://localhost:9000/kb/v1/hierarchies/properties?vocabulary_name=Istat-Classificazione-08-Territorio&ontology_name=clvapit&lang=it' \
-H "accept: application/json" -H "content-type: application/json"
```

This will return the following results:

```
[
  {
    "vocabulary": "CLV-AP_IT",
    "path": "CLV-AP_IT.Country.name",
    "hierarchy_flat": "Country",
    "hierarchy": "hierarchy"
  },
  {
    "vocabulary": "CLV-AP_IT",
    "path": "CLV-AP_IT.City.name",
    "hierarchy_flat": "Country.Region.Province.City",
    "hierarchy": "hierarchy"
  }
  ...
]
```
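
Each entry of this response can be unpacked on the client side: `path` follows the `{ontology}.{concept}.{property}` structure, and `hierarchy_flat` is a dot-separated chain of classes. The following Python sketch does this; the function name and the names of the output keys are hypothetical, and it assumes that no individual name contains a dot.

```python
def describe_entry(entry: dict) -> dict:
    """Split 'path' and 'hierarchy_flat' of one hierarchy entry into parts."""
    ontology_id, concept_id, property_id = entry["path"].split(".")
    class_chain = entry["hierarchy_flat"].split(".")
    return {
        "ontology_id": ontology_id,
        "concept_id": concept_id,
        "property_id": property_id,
        # Chain of classes exactly as listed in 'hierarchy_flat'.
        "class_chain": class_chain,
        # Number of steps between the first and the last class of the chain.
        "depth": len(class_chain) - 1,
    }


# Entry abridged from the example response above.
entry = {
    "vocabulary": "CLV-AP_IT",
    "path": "CLV-AP_IT.City.name",
    "hierarchy_flat": "Country.Region.Province.City",
}

print(describe_entry(entry))
```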


### example configurations

An example configuration for working with a vocabulary (VocabularyAPI):

```
"data_dir": "./data"
"Istat-Classificazione-08-Territorio" {
vocabulary.name: "Istat-Classificazione-08-Territorio"
vocabulary.ontology.name: "CLV-AP_IT"
vocabulary.ontology.prefix: "clvapit"
vocabulary.file: ${data_dir}"/vocabularies/Istat-Classificazione-08-Territorio.ttl"
vocabulary.contexts: [ "http://dati.gov.it/onto/clvapit#" ]
vocabulary.query.csv: ${data_dir}"/vocabularies/Istat-Classificazione-08-Territorio#dataset.csv.sparql"
}
```

The `vocabulary.query.csv` setting is a reference to a SPARQL query designed to produce a flat representation of the vocabulary information.


An example configuration for working with an ontology (OntologyAPI) could be similar to the following one:

```
clvapit {
ontology.name: "CLV-AP_IT"
ontology.prefix: "clvapit"
ontology.file: ${data_dir}"/ontologies/agid/CLV-AP_IT/CLV-AP_IT.ttl"
ontology.contexts: [ "http://dati.gov.it/onto/clvapit#" ]
ontology.query.hierarchy: ${data_dir}"/ontologies/agid/CLV-AP_IT/CLV-AP_IT.hierarchy.sparql"
}
```
The `ontology.query.hierarchy` setting is a reference to a SPARQL query designed to extract the hierarchy information from the ontology.


* * *

**Note** that `${data_dir}` can be replaced with a specific root path on disk: at this stage of development it is a relative folder (for example: `/dist/data` for the sbt project).
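
To make the effect of the substitution concrete, the following Python sketch mimics it with `string.Template`. It is only an illustration of the resulting paths: the application's configuration library performs the real substitution, and the dictionary and function names here are hypothetical.

```python
from string import Template

# Two of the path settings from the example configurations above,
# rewritten as plain templates for the purpose of this illustration.
settings = {
    "vocabulary.file": "${data_dir}/vocabularies/Istat-Classificazione-08-Territorio.ttl",
    "ontology.file": "${data_dir}/ontologies/agid/CLV-AP_IT/CLV-AP_IT.ttl",
}


def resolve_paths(templates: dict, data_dir: str) -> dict:
    """Replace ${data_dir} in every setting with the given root path."""
    return {
        key: Template(value).substitute(data_dir=data_dir)
        for key, value in templates.items()
    }


resolved = resolve_paths(settings, "./data")
for key, path in resolved.items():
    print(key, "->", path)
```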

Eventually, pre-loading ontologies and vocabularies from disk can be replaced by importing them from a central datastore (dedicated to maintaining the latest version of the ontologies), where they are already loaded under conventional paths/names. This way we will be able to switch from a tiny in-memory repository (one for each ontology/vocabulary) to a central RDF/SPARQL repository containing all the pre-loaded ontologies and vocabularies.


----

## TODO

+ more documentation / comments
+ more proper tests
+ remove the redundant RDFRepository classes and import the external kb-core dependency instead


## known ISSUES

...
58 changes: 58 additions & 0 deletions semantic_standardization/app/ErrorHandler.scala
@@ -0,0 +1,58 @@
/*
* Copyright 2017 TEAM PER LA TRASFORMAZIONE DIGITALE
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import javax.inject._

import play.api.http.DefaultHttpErrorHandler
import play.api._
import play.api.mvc._
import play.api.mvc.Results._
import play.api.routing.Router

import scala.concurrent.Future

import de.zalando.play.controllers.PlayBodyParsing

/**
* The purpose of this ErrorHandler is to override default play's error reporting with application/json content type.
*/
class ErrorHandler @Inject() (
    env: Environment,
    config: Configuration,
    sourceMapper: OptionalSourceMapper,
    router: Provider[Router]
) extends DefaultHttpErrorHandler(env, config, sourceMapper, router) {

  // first accepted media type that is not text/html, falling back to application/json
  private def contentType(request: RequestHeader): String =
    request.acceptedTypes.map(_.toString).filterNot(_ == "text/html").headOption.getOrElse("application/json")

  // 500 - internal server error (production mode)
  override def onProdServerError(request: RequestHeader, exception: UsefulException): Future[Result] = {
    implicit val writer = PlayBodyParsing.anyToWritable[Throwable](contentType(request))
    Future.successful(InternalServerError(exception))
  }

  // called when a route is found, but it was not possible to bind the request parameters
  override def onBadRequest(request: RequestHeader, error: String): Future[Result] = {
    implicit val writer = PlayBodyParsing.anyToWritable[String](contentType(request))
    Future.successful(BadRequest("Bad Request: " + error))
  }

  // 404 - page not found error
  override def onNotFound(request: RequestHeader, message: String): Future[Result] = {
    implicit val writer = PlayBodyParsing.anyToWritable[String](contentType(request))
    Future.successful(NotFound(request.path))
  }
}
10 changes: 10 additions & 0 deletions semantic_standardization/app/Filters.scala
@@ -0,0 +1,10 @@
/**
* Created by ale on 06/06/17.
*/
import javax.inject.Inject

import play.api.http.DefaultHttpFilters
import play.filters.cors.CORSFilter

class Filters @Inject() (corsFilter: CORSFilter)
extends DefaultHttpFilters(corsFilter)