ELASTICSEARCH
What’s new since 0.90?
techtalk @ ferret
• Latest stable release: Elasticsearch 1.1.0	

• Released: 25.03.2014	

• Based on Lucene 4.6.1
BREAKING CHANGES
in versions 1.x
CONFIGURATION
• The cluster.routing.allocation settings (disable_allocation,
disable_new_allocation and disable_replica_allocation) have
been replaced by the single setting:
cluster.routing.allocation.enable: all|primaries|new_primaries|none

• Elasticsearch on 64-bit Linux now uses mmapfs by default. Make
sure that you set MAX_MAP_COUNT to a sufficiently high
number. The RPM and Debian packages default this value to
262144.
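
The allocation setting is dynamic, so it can also be changed at runtime via the cluster
settings API. A minimal sketch using the elasticsearch-py client covered at the end of this
talk (the default host and the maintenance scenario are illustrative, not from the slides):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to localhost:9200 by default

# Temporarily stop shard allocation (e.g. before a rolling restart),
# then re-enable it; "transient" settings are lost on a full cluster restart.
es.cluster.put_settings(body={
    "transient": {"cluster.routing.allocation.enable": "none"}
})
# ... perform maintenance ...
es.cluster.put_settings(body={
    "transient": {"cluster.routing.allocation.enable": "all"}
})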
MULTI-FIELDS
Existing multi-fields will be upgraded to the new format automatically.

Old format (multi_field type):

"title": {
    "type": "multi_field",
    "fields": {
        "title": { "type": "string" },
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}

New format (fields on the core type):

"title": {
    "type": "string",
    "fields": {
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}
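
With the new format the not_analyzed sub-field is addressed as title.raw, which is handy for
exact sorting or aggregations. A minimal sketch using the elasticsearch-py client covered at
the end of this talk (the index name "articles" is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Full-text search on the analyzed "title" field,
# exact (unanalyzed) sorting on the "title.raw" sub-field.
es.search(index="articles", body={
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [{"title.raw": "asc"}]
})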
STOPWORDS
• Previously, the standard and pattern analyzers used
the list of English stopwords by default, which
caused some hard-to-debug indexing issues.

• Now they are set to use the empty stopwords list
(i.e. _none_) instead.
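
If you relied on the old behaviour, stopwords can still be enabled explicitly in the index
settings. A minimal sketch (elasticsearch-py; the index and analyzer names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Recreate the pre-1.0 behaviour: a standard analyzer with English stopwords.
es.indices.create(index="articles", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "standard_with_stopwords": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    }
})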
RETURN VALUES
• The ok return value has been removed from all response bodies
as it added no useful information.	

• The found, not_found and exists return values have been
unified as found on all relevant APIs.	

• Field values, in response to the fields parameter, are now always
returned as arrays. Metadata fields are always returned as scalars.	

• The analyze API no longer supports the text response format,
but does support JSON and YAML.
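
Client code that checked ok or exists therefore needs a small change. A hedged sketch of
what a get with the fields parameter now returns (index, type, id and field name are
illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = es.get(index="articles", doc_type="article", id=1, fields="title")
doc["found"]            # True/False -- replaces the old exists/not_found values
doc["fields"]["title"]  # field values are now always arrays, e.g. ["My title"]
doc["_version"]         # metadata fields stay scalar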
DEPRECATIONS
• Per-document boosting with the _boost field has been
removed. You can use the function_score query instead.

• The custom_score and custom_boost_score queries are no longer
supported. Use function_score instead (see the sketch after this list).

• The field query has been removed. Use the query_string
query instead.	

• The path parameter in mappings has been deprecated. Use
the copy_to parameter instead.
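
For example, a query that used custom_score to boost selected documents could be rewritten
with function_score roughly like this. A minimal sketch (elasticsearch-py; the index, the
"featured" flag and the boost value are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Boost documents flagged as "featured", replacing the removed custom_score query.
es.search(index="articles", body={
    "query": {
        "function_score": {
            "query": {"match": {"title": "elasticsearch"}},
            "functions": [
                {"filter": {"term": {"featured": True}}, "boost_factor": 2}
            ]
        }
    }
})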
AGGREGATIONS
since version 1.0.0
AGGREGATION TYPES
• Bucketing aggregations

Aggregations that build buckets, where each bucket is associated with a key and a
document criterion.

Examples: range, terms, histogram

Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations
are computed for the buckets that their parent aggregation generates.

• Metrics aggregations

Aggregations that keep track of and compute metrics over a set of documents.

Examples: min, max, stats
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 50 },
                    { "from" : 100 }
                ]
            },
            "aggs" : {
                "price_stats" : {
                    "stats" : { "field" : "price" }
                }
            }
        }
    }
}
{
    "aggregations": {
        "price_ranges" : {
            "buckets": [
                {
                    "to": 50,
                    "doc_count": 2,
                    "price_stats": {
                        "count": 2,
                        "min": 20,
                        "max": 47,
                        "avg": 33.5,
                        "sum": 67
                    }
                }, …
            ]
        }
    }
}
CARDINALITY
The cardinality aggregation is a metric aggregation that computes approximate unique counts
based on the HyperLogLog++ algorithm, which has two nice properties: it is close to accurate
on low cardinalities, and its memory usage is fixed, so estimating high cardinalities doesn't
blow up memory.
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}
PERCENTILES
since version 1.1.0
The percentiles aggregation computes (approximate) values of arbitrary percentiles based on
the t-digest algorithm. Computing exact percentiles is not reasonably feasible, as it would
require shards to stream all values to the node that coordinates search execution, which could
be gigabytes on a high-cardinality field.
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time"
            }
        }
    }
}
{
    ...
    "aggregations": {
        "load_time_outlier": {
            "1.0": 15,
            "5.0": 20,
            "25.0": 23,
            "50.0": 25,
            "75.0": 29,
            "95.0": 60,
            "99.0": 150
        }
    }
}
SIGNIFICANT_TERMS
since version 1.1.0
An aggregation that identifies terms that are significant, rather than merely popular, in a result set.
Significance is related to the difference between the document frequency observed in everyday use in
the corpus and the frequency observed in the result set.

{
    "query" : {
        "terms" : {
            "force" : [ "British Transport Police" ]
        }
    },
    "aggregations" : {
        "significantCrimeTypes" : {
            "significant_terms" : { "field" : "crime_type" }
        }
    }
}
{
    "aggregations" : {
        "significantCrimeTypes" : {
            "doc_count": 47347,
            "buckets" : [
                {
                    "key": "Bicycle theft",
                    "doc_count": 3640,
                    "score": 0.371235374214817,
                    "bg_count": 66799
                }, …
            ]
        }
    }
}
IMPROVEMENTS
in version 1.1.0
TERMS AGGREGATION
• Before 1.1.0, a terms aggregation returned up to size terms, so the way
to get all matching terms back was to set size to an arbitrarily high
number that would be larger than the number of unique terms.

• Since version 1.1.0, to get ALL terms just set size=0 (see the sketch below).
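
A minimal sketch of such a request (elasticsearch-py; the "articles" index and "tags" field
are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# size=0 (since 1.1.0) returns ALL distinct terms of the "tags" field;
# search_type="count" skips fetching hits, since only the aggregation is needed.
es.search(index="articles", search_type="count", body={
    "aggs": {
        "all_tags": {
            "terms": {"field": "tags", "size": 0}
        }
    }
})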
MULTI-FIELD SEARCH
• The multi_match query now supports three types of execution:

• best_fields (field-centric, default) Find the field that best matches the
query string. Useful for finding a single concept like “full text search” in
either the title or the body field.

• most_fields (field-centric) Find all matching fields and add up their
scores. Useful for matching against multi-fields, where the same text
has been analyzed in different ways to improve the relevance score:
with/without stemming, shingles, edge-ngrams etc.

• cross_fields (term-centric) A new execution mode which looks for
each term in any of the listed fields. Useful for documents whose
identifying features are spread across multiple fields, such as
first_name and last_name, and it supports the minimum_should_match
operator in a more natural way than the other two modes (see the sketch below).
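
A minimal cross_fields sketch (elasticsearch-py; the "users" index, the name fields and the
query text are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Treat first_name and last_name as if they were one big field when matching terms.
es.search(index="users", body={
    "query": {
        "multi_match": {
            "query": "Will Smith",
            "type": "cross_fields",
            "fields": ["first_name", "last_name"],
            "operator": "and"
        }
    }
})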
CAT API
since version 1.0.0
JSON is great… for computers. Human eyes, especially when looking at an ssh terminal, need
compact and aligned text. The cat API aims to meet this need.
$ curl 'localhost:9200/_cat/nodes?h=ip,port,heapPercent,name'

192.168.56.40 9300 40.3 Captain Universe
192.168.56.20 9300 15.3 Kaluu
192.168.56.50 9300 17.0 Yellowjacket
192.168.56.10 9300 12.3 Remy LeBeau
192.168.56.30 9300 43.9 Ramsey, Doug
TRIBE NODES
since version 1.0.0
The tribes feature allows a tribe node to act as a federated client across multiple clusters.
tribe:
    t1:
        cluster.name: cluster_one
    t2:
        cluster.name: cluster_two

elasticsearch.yml
The merged global cluster state means that almost all operations work in the same
way as in a single cluster: distributed search, suggest, percolation, indexing, etc.

However, there are a few exceptions:

• The merged view cannot handle indices with the same name in multiple clusters.

• Master-level read operations (eg Cluster State, Cluster Health) will automatically
execute with a local flag set to true since there is no master.

• Master-level write operations (eg Create Index) are not allowed. These should be
performed on a single cluster.
BACKUP & RESTORE
since version 1.0.0
REPOSITORIES
$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
    "type": "fs",
    "settings": {
        "location": "/mount/backups/my_backup",
        "compress": true
    }
}'
Before any snapshot or restore operation can be performed, a snapshot
repository has to be registered in Elasticsearch.
Supported repository types:	

• fs (filesystem)	

• S3	

• HDFS (Hadoop)	

• Azure
SNAPSHOTS
$ curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1" -d '{
    "indices": "index_1,index_2"
}'
A repository can contain multiple snapshots of the same cluster. Snapshots are
identified by unique names within the cluster.

• The index snapshot process is incremental.

• Only one snapshot process can be executed in the cluster at any
time.

• The snapshot process is executed in a non-blocking fashion.
RESTORE
A snapshot can be restored using the following command:

$ curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -d '{
    "indices": "index_1,index_2",
    "rename_pattern": "index_(.+)",
    "rename_replacement": "restored_index_$1"
}'
• The restore operation can be performed on a functioning cluster.

• An existing index can only be restored if it is closed.

• The restored persistent settings are added to the existing
persistent settings.
ELASTICSEARCH-PY
Official low-level client for Elasticsearch
Features:	

• translating basic Python data types to and from JSON (datetimes are not
decoded for performance reasons)

• configurable automatic discovery of cluster nodes	

• persistent connections	

• load balancing (with pluggable selection strategy) across all available nodes	

• failed connection penalization (time based - failed connections won’t be
retried until a timeout is reached)	

• thread safety	

• pluggable architecture	

Versioning:
• There are two branches, master and 0.4. The master branch tracks all changes
for Elasticsearch 1.0 and beyond, whereas 0.4 tracks Elasticsearch 0.90.
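
A minimal usage sketch (host, index name and document content are illustrative, not from
the slides):

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# datetime is encoded to JSON on indexing, but not decoded back on retrieval
es.index(index="articles", doc_type="article", id=1, body={
    "title": "What's new since 0.90?",
    "posted": datetime.utcnow()
})

result = es.search(index="articles", body={
    "query": {"match": {"title": "new"}}
})
print(result["hits"]["total"])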
