Elasticsearch

jezekp edited this page Apr 3, 2015 · 39 revisions
  • Detailed insight on how the Elasticsearch database is used in eegportal

# Installation

Elasticsearch is not in the official Debian repositories, so its own package repository has to be added first.

Run the following steps as a user with root privileges:

Add the repository signing key:

wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | apt-key add -

Add the repository:

echo "deb http://packages.elasticsearch.org/elasticsearch/1.3/debian stable main" >> /etc/apt/sources.list

Elasticsearch can now be installed like any other Linux package:

apt-get update && apt-get install elasticsearch

The configuration lives in /etc/elasticsearch/elasticsearch.yml and /etc/default/elasticsearch.

The following settings need to be changed:

cluster.name: eeg-dev - The name of a cluster instance

node.name: es-eeg-dev - The name of a node in the cluster (if there are more nodes, each one needs a different name).

Config paths (here the home directory is used for data and logs):

path.conf: /etc/elasticsearch

path.data: /home/elasticsearch/data

path.logs: /home/elasticsearch/logs

path.plugins: /usr/share/elasticsearch/plugins

discovery.zen.ping.multicast.enabled: false - Disable multicast discovery, so the node does not scan the network for other nodes

Elasticsearch can now be started with the init.d script:

/etc/init.d/elasticsearch start

Integration with Spring

  • Accomplished by https://github.com/spring-projects/spring-data-elasticsearch.

  • Now running on elasticsearch v0.90.9. Hopefully, before my job is done, there will be an elasticsearch 1.0-Final release and both the database and the driver will run on that version. Right now there is a maven dependency on 1.0-SNAPSHOT (for spring-data-elasticsearch), which gets updated daily, and updates can break things. If a stable maven release is not published before the end of my work, the jar will be placed in a local repo so we can freeze the driver version.

  • Connection strings are stored in WEB-INF/project.properties

  • Beans related to elasticsearch are defined in WEB-INF/persistence.xml

<elasticsearch:transport-client id="client" cluster-name="${elasticsearch.clusterName}" cluster-nodes="${elasticsearch.url}" />
<bean name="elasticsearchTemplate" class="org.springframework.data.elasticsearch.core.ElasticsearchTemplate">
  <constructor-arg name="client" ref="client"/>
</bean>	
<elasticsearch:repositories base-package="cz.zcu.kiv.eegdatabase.data.nosql.repositories" />
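
The two placeholders in the bean definition above, `${elasticsearch.clusterName}` and `${elasticsearch.url}`, have to be resolved from the properties file. The sketch below shows what such entries might look like and how they parse. The values are assumptions for illustration only (the cluster name is taken from the installation section, the host is a placeholder), and `EsPropertiesSketch` is a hypothetical helper, not a class from the project. Note that the transport client talks to the transport port (9300 by default), not to the HTTP port 9200 used by curl.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class EsPropertiesSketch {
    // Assumed example content; only the two keys come from persistence.xml.
    static final String EXAMPLE =
            "elasticsearch.clusterName=eeg-dev\n"
          + "elasticsearch.url=localhost:9300\n";

    // Parses properties text the same way a .properties file would be read.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read properties", e);
        }
        return props;
    }

    // Returns the host part of a "host:port" entry.
    static String host(String hostAndPort) {
        return hostAndPort.substring(0, hostAndPort.lastIndexOf(':'));
    }

    // Returns the port part of a "host:port" entry.
    static int port(String hostAndPort) {
        return Integer.parseInt(hostAndPort.substring(hostAndPort.lastIndexOf(':') + 1));
    }

    public static void main(String[] args) {
        Properties props = load(EXAMPLE);
        String url = props.getProperty("elasticsearch.url");
        System.out.println(props.getProperty("elasticsearch.clusterName"));
        System.out.println(host(url) + " on transport port " + port(url));
    }
}
```

If several nodes are used, the `cluster-nodes` attribute accepts a comma-separated list of such `host:port` pairs.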

How Elasticsearch and Hibernate entities are connected

todo (interceptor, Experiment+ExperimentElastic classes...)

## Database setup

  • Development database now runs on eeg2.kiv.zcu.cz with HEAD plugin installed (https://github.com/mobz/elasticsearch-head)

  • We use elasticsearch version 1.0.1 (http://www.elasticsearch.org/downloads/1-0-1/). This is essential: the driver in the nexus repository is built for this exact version, and using any other version can lead to unpredictable results.

  • One non-replicated node runs on this server with the standard five shards. The index is created with custom settings and analyzers like this:

curl -XPUT 'http://eeg2.kiv.zcu.cz:9200/eegdatabase' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "standard_snowball_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "eeg_synonym",
            "eng_snowball"
          ]
        }
      },
      "filter": {
        "eng_snowball": {
          "type": "snowball",
          "language": "English"
        },
        "eeg_synonym": {
          "type": "synonym",
          "synonyms_path": "/home/elasticsearch/synonyms.txt"
        }
      }
    }
  }
}'

and then the mapping for the experiment type:

curl -XPUT 'http://eeg2.kiv.zcu.cz:9200/eegdatabase/experiment/_mapping' -d '
{
  "experiment": {
    "dynamic": "strict",
    "properties": {
      "experimentId": {
        "type": "string"
      },
      "groupId": {
        "type": "integer"
      },
      "userId": {
        "type": "long"
      },
      "params": {
        "type": "nested",
        "properties": {
          "attributes": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "string"
              },
              "value": {
                "type": "string",
                "analyzer": "standard_snowball_analyzer"
              }
            }
          },
          "name": {
            "type": "string",
            "index": "not_analyzed"
          },
          "valueInteger": {
            "type": "double"
          },
          "valueString": {
            "type": "string",
            "analyzer": "standard_snowball_analyzer"
          }
        }
      }
    }
  }
}'

These mappings and custom analyzers MUST BE SET BEFORE anything is inserted into ES (into the specific index). Changing index mappings or adding custom analyzers on a running cluster is a problem: usually all data has to be reindexed (= DELETED, and the application must somehow insert it again). Elasticsearch doesn't handle this internally. It is one of the most annoying things, but it is what it is, just get over it :)
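
The `nested` type used for `params` in the mapping matters for the parameter queries. As far as Elasticsearch semantics go, an array of plain objects is flattened into per-field value lists, so a `name` from one parameter can match together with a `valueString` from a different one. The plain-Java sketch below involves no Elasticsearch at all; it only simulates the two behaviours on two example parameters. `NestedMatchSketch` and its methods are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NestedMatchSketch {
    // Hypothetical stand-in for one entry of the "params" array.
    static final class Param {
        final String name;
        final String valueString;

        Param(String name, String valueString) {
            this.name = name;
            this.valueString = valueString;
        }
    }

    // Flattened (plain object array) semantics: names and values are pooled,
    // so the two conditions may be satisfied by different parameters.
    static boolean matchesFlattened(List<Param> params, String name, String value) {
        boolean nameFound = false;
        boolean valueFound = false;
        for (Param p : params) {
            nameFound |= p.name.equals(name);
            valueFound |= p.valueString.equals(value);
        }
        return nameFound && valueFound;
    }

    // Nested semantics: both conditions must hold within the same parameter.
    static boolean matchesNested(List<Param> params, String name, String value) {
        for (Param p : params) {
            if (p.name.equals(name) && p.valueString.equals(value)) {
                return true;
            }
        }
        return false;
    }

    static List<Param> sampleParams() {
        List<Param> params = new ArrayList<Param>();
        params.add(new Param("hardware", "Intel Xeon III"));
        params.add(new Param("tested subject", "Mouse 128.2"));
        return params;
    }

    public static void main(String[] args) {
        List<Param> params = sampleParams();
        // "hardware" and "Mouse 128.2" belong to different parameters.
        System.out.println(matchesFlattened(params, "hardware", "Mouse 128.2")); // true
        System.out.println(matchesNested(params, "hardware", "Mouse 128.2"));    // false
    }
}
```

With the flattened behaviour, a search for experiments where "hardware" is "Mouse 128.2" would wrongly hit this experiment; with `nested` it does not.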

Elasticsearch migration (synchronization)

I've heard there are some teams working on some kind of offline sync/backup/whatever. This might come in handy:

Queries and DAO interface

Sample experiment (so we know what we are talking about):

{
  "experimentId": "123",
  "params": [
    {
      "name": "temperature",
      "valueString": null,
      "valueInteger": 30,
      "attributes": []
    },
    
    {
      "name": "hardware",
      "valueString": "Intel Xeon III",
      "valueInteger": null,
      "attributes": [
        { "name": "year", "value": "2013"   },
        { "name": "frequency", "value": "2,3GHz"   },
        { "name": "model", "value": "xx1yy2.9"   },
        { "name": "cacheSize", "value": "2MB"   },
        { "name": "hyperthreading", "value": "yes"   }
      ]
    },
    
    {
      "name": "tested subject",
      "valueString": "Mouse 128.2",
      "valueInteger": null,
      "attributes": [
        { "name": "color", "value": "red"   },
        { "name": "length", "value": "150mm"   },
        { "name": "weight", "value": "350g"   }
      ]
    }  
  ]
}

Methods:

//searches in params.name AND params.valueString 
//use case: Give me all experiments, where "hardware" is "wooden chair"
List<Experiment> searchByParameter (String paramName, String paramValue)


//searches in params.name AND params.valueInteger
//use case: Give me all experiments, where "temperature" is "30"
List<Experiment> searchByParameter (String paramName, int paramValue)


//match on params.name AND range query on params.valueInteger
//use case: Give me all experiments, where "temperature" is between "20" and "80"
List<Experiment> searchByParameterRange (String paramName, int min, int max)


// AND match on several params.name AND params.valueString/Integer
//use case: Give me all experiments, where "software" is "corel" AND "hardware" is "wooden chair" AND "disease" is "blindness"
List<Experiment> searchByParameters (GenericParam[] params)

// AND match on the "contains" params AND NOT match on the "notContains" params
//use case: Give me all experiments, where "software" is "corel" AND "hardware" is "wooden chair" AND "disease" is not "blindness"
List<Experiment> searchByParameters (GenericParam[] contains, GenericParam[] notContains)


//searches in params.valueString OR params.attributes.value
//use case: Give me all experiments, where "RJ45" is involved (either in a parameter value OR in an additional parameter attribute)
List<Experiment> search(String value)


/*
  These queries are more complicated because they target a nested.nested object, which is why only one basic query is written for now
*/
//searches in params.attributes.name AND params.attributes.value
List<Experiment> searchByAttribute(String attrName, String attrValue)
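
How these methods translate into Elasticsearch queries is not shown on this page. As a rough sketch, assuming the DAO issues nested queries against the `params` path from the mapping, the query bodies for `searchByParameter` (string variant) and `searchByParameterRange` could look like the JSON built below. `ParamQuerySketch` and its helpers are hypothetical and use plain string concatenation only; the real implementation presumably goes through spring-data-elasticsearch. Treating the range as inclusive on both ends is also an assumption.

```java
class ParamQuerySketch {
    // Minimal JSON string escaping for the illustration (quotes and backslashes only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Wraps two clauses in a nested query so both must match the same "params" entry.
    static String nestedParams(String nameClause, String valueClause) {
        return "{\"nested\":{\"path\":\"params\",\"query\":{\"bool\":{\"must\":["
                + nameClause + "," + valueClause + "]}}}}";
    }

    // params.name is not_analyzed in the mapping, so an exact term clause fits.
    static String nameTerm(String paramName) {
        return "{\"term\":{\"params.name\":" + quote(paramName) + "}}";
    }

    // Sketch for searchByParameter(String, String): analyzed match on params.valueString.
    static String byParameter(String paramName, String paramValue) {
        String value = "{\"match\":{\"params.valueString\":" + quote(paramValue) + "}}";
        return nestedParams(nameTerm(paramName), value);
    }

    // Sketch for searchByParameterRange: range on params.valueInteger (assumed inclusive).
    static String byParameterRange(String paramName, int min, int max) {
        String range = "{\"range\":{\"params.valueInteger\":{\"gte\":" + min
                + ",\"lte\":" + max + "}}}";
        return nestedParams(nameTerm(paramName), range);
    }

    public static void main(String[] args) {
        // Each line is a complete search request body.
        System.out.println("{\"query\":" + byParameter("hardware", "wooden chair") + "}");
        System.out.println("{\"query\":" + byParameterRange("temperature", 20, 80) + "}");
    }
}
```

The multi-parameter variants could then combine several such nested clauses inside an outer `bool` query, with the "notContains" parameters placed under `must_not`.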