Skip to content
This repository has been archived by the owner on May 22, 2021. It is now read-only.

GlobalNamesArchitecture/gnindex

Repository files navigation

IMPORTANT

This project is archived and will not be developed further. Use GNames and GNverifier for the same functionality.

https://api.codacy.com/project/badge/Grade/2ad5b690419b4f869add628d7b022a17

gnindex

Brief Info

gnindex is a search engine over collection of strings (combinations of characters) that have been used as names for organisms. gnindex contains many examples of names spelled in slightly different ways. It indexes occurances of biological scientific names in the world, normalizes and reconsiles lexical and taxonomic variants of the name strings.

gnindex handles two type of queries over indexed name strings: resolution and faceted search.

Faceted search accepts part of name string (canonical part, author word(s), genus word(s), species word(s), year) possibly wildcarded or their logically combined composition. In response it returns all results that satisfy given request.

Resolution accepts a set of whole name strings with resolution parameters. Parameters includes:

  • raw string supplied to name input ("a label")
  • data source IDs where gnindex should perform resolution
  • should names be resolved in respect to context
  • should gnindex resolve a name with vernacular

Installation

gnindex is dockerized:

docker-compose up

docker-compose pulls all necessary Docker containers (Postgres database engine, gnindex matcher, nameResolver, facetedSearcher, api and front services), initializes them, and listens for HTTP connections over 80 port with paths as follows:

  • / path returns user-friendly HTML interface
  • /api returns GraphiQL interactive in-browser environment to interact with gnindex API.

Using API

gnindex uses GraphQL interface to make requests to API. gnindex accepts GraphQL requests at /api/graphql URL. It's simply a POST request with JSON body of predefined format:

{
  "query": "BODY OF QUERY",
  "variables": "SUPPLIED VARIABLES",
  "operation": "SELECTED OPERATION"
}

gnindex API exposes GraphQL methods as follows:

  • nameResolver for name resolution
  • nameStrings for faceted search

Detailed documentation on input parameters and output could also be found in documentation fetched from API and rendered by GraphiQL at /api.

A simple GraphQL request might be:

query {
  nameResolver(names: [{value: "Homo sapiens"}], dataSourceIds: [1]) {
    results {
      name {
        id
        value
      }
    }
  }
}

That returns result:

{
  "data": {
    "nameResolver": [
      {
        "results": [
          {
            "name": {
              "id": "7db4f8a2-aafe-56b6-8838-89522c67d9f0",
              "value": "Homo sapiens Linnaeus, 1758"
            }
          }
        ]
      }
    ]
  }
}

The request could be composed with cURL:

curl 'http://index.globalnames.org/api/graphql' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  --data-binary '{"query":"query {\n  nameResolver(names: [{value: \"Homo sapiens\"}], dataSourceIds: [1]) {\n    results {\n      name {\n        id\n        value\n      }\n    }\n  }\n}\n","variables":null}'

Forturately, it isn't necessary to encode GraphQL query string and to nest it to parent JSON. Some languages provide GraphQL clients that do all dirty work. Consider simple Ruby client to request gnindex API:

Project Structure

The projects consists of 6 subprojects:

  • common
  • matcher
  • nameResolver
  • facetedSearcher
  • api
  • front

common

Contains Thrift data structures that are shared between microservices.

matcher

The Thrift microservice expects canonical names of type Seq[String] and data sources IDs of type Seq[Int]. It tries to fuzzy match through all known canonical names and those stems (according to Latin stemming) with Levenstein algorithm of edit distance 1.

It returns list of lists of found fuzzy matches UUIDs: one list per provided canonical name. Note that it returns UUIDs only as it has no connection to database.

nameResolver

The Thrift microservice expects complex requests of type thrift.nameresolver.Request. It passes through stages for every provided name request:

  1. Two UUIDv5 values are computed: first one is for provided name string, second one is for parsed canonical name. Then microservice tries to find records according to those two UUIDs
  2. If nothing is found for exact match and name is not parsable, then empty result is returned
  3. If provided name is parsable, then canonical form of the name goes to matcher microservice. Final results are formed based on database results matched by those UUIDs.

facetedSearcher

Performs faceted search.

api

The api microservice is connected with nameresolver and facetedSearcher microservices. It provides GraphQL interface to the user. GraphQL requests are then translated to microservices requests.

Development

Database schema creation

Database is created and migrated using ActiveRecord gem from RubyOnRails package. Database is created in sql format, rather than ruby, because it includes configuration of external Postgresql packages.

To create schema at ./db_migration/db/structure.sql file run bundle exec rake db:migration.

Testing

TODO