Disclaimer: Although we don't use Elasticsearch for Algolia's hosted full-text, numerical & faceted search engine, we do use it for internal analytics (faceting over billions of log lines generated by our engine, with no full-text search).
This plugin extends Elasticsearch with a fast & memory-efficient aggregation that statistically retrieves the top-k elements of a field. The field can be a string, a number or a boolean. The plugin registers a new type of aggregation (topk).
This plugin is a temporary replacement of #6697.
We love pull-requests!
- Elasticsearch 1.3.0+
- Compiled versions of the plugin are stored in the dist directory.
The default terms aggregation implementations use an amount of memory that is linear in the cardinality of the value source they run on. Things get even worse when using sub-aggregations, especially memory-intensive ones such as percentiles, cardinality, top_hits or bucket aggregations. This plugin is based on the Space-Saving algorithm, which tries to detect the most frequent terms with a fixed (configurable) number of counters.
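To make the idea concrete, here is a minimal Python sketch of the Space-Saving technique with a fixed number of counters. It is an illustration only, not the plugin's code; the class name SpaceSaving and its offer and top_k methods are hypothetical names chosen for this example.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Approximate top-k counter that never holds more than `capacity` counters."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}   # estimated frequency per monitored term
        self.errors: Dict[Hashable, int] = {}   # maximum overestimation per monitored term

    def offer(self, item: Hashable) -> None:
        if item in self.counts:
            # Term already monitored: just increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free counter available: start monitoring the term.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # No free counter: evict the term with the smallest count and
            # let the new term inherit that count (recorded as its error).
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top_k(self, k: int) -> List[Tuple[Hashable, int]]:
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Hypothetical stream: 11 values, tracked with only 3 counters.
stream = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
summary = SpaceSaving(capacity=3)
for value in stream:
    summary.offer(value)
```

Memory use depends only on capacity, not on the number of distinct terms. The price is that a stored count may overestimate the true frequency by up to the recorded error for that term.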
This plugin uses the StreamSummary data structure provided by the Stream-lib library to compute the top-k values of a field. Basically, it retrieves the most frequent terms of a field without loading all of them (and their associated sub-aggregations) into RAM. Merging results across shards and across indices is supported but might reduce accuracy: this is the general trade-off of this algorithm.
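The accuracy trade-off of merging can be seen with a small Python sketch. It assumes the simplest possible merge strategy (summing the counts each shard reports and keeping the largest ones), which is not necessarily what the plugin or Stream-lib does; merge_summaries and the per-shard numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_summaries(shard_summaries: List[Dict[str, int]], capacity: int) -> List[Tuple[str, int]]:
    """Sum the per-shard counts and keep the `capacity` largest totals."""
    merged: Counter = Counter()
    for shard_counts in shard_summaries:
        merged.update(shard_counts)  # adds counts for terms seen on several shards
    return merged.most_common(capacity)


# Hypothetical summaries, each shard keeping only 2 counters.
# Assume "5.6.7.8" also occurred 20 times on shard B but was evicted there.
shard_a = {"1.2.3.4": 40, "5.6.7.8": 25}
shard_b = {"1.2.3.4": 30, "9.9.9.9": 28}

merged = merge_summaries([shard_a, shard_b], capacity=2)
print(merged)
```

Under that assumption the true total for "5.6.7.8" would be 45, yet the merged result ranks "9.9.9.9" (28) above it, because a term dropped by one shard contributes nothing from that shard. Keeping more counters per shard than the requested size reduces this effect at the cost of memory.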
To build an aggregation keeping the top-k elements of a field, use the following code:
{
  "aggregations": {
    "<aggregation_name>": {
      "topk": {
        "field": "<field_name>",
        "size": 10
      }
    }
  }
}
For example, to keep the 100 most frequent values of your "ip" field, use:
{
  "aggregations": {
    "top_ips": {
      "topk": {
        "field": "ip",
        "size": 100
      }
    }
  }
}
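If you build requests programmatically, the same body can be generated from a small helper. This is a sketch in Python using only the standard library; the function name topk_request is hypothetical, and sending the body to your index's search endpoint is left out.

```python
import json
from typing import Any, Dict


def topk_request(name: str, field: str, size: int) -> Dict[str, Any]:
    """Build a request body holding a single topk aggregation."""
    return {
        "aggregations": {
            name: {
                "topk": {
                    "field": field,
                    "size": size,
                }
            }
        }
    }


# Same request as the "ip" example: keep the 100 most frequent values.
body = topk_request("top_ips", "ip", 100)
print(json.dumps(body, indent=2))
```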
The response looks like this:
{
  "aggregations": {
    "top_ips": {
      "buckets": [
        { "key": "1.2.3.4", "doc_count": 62718 },
        { "key": "5.6.7.8", "doc_count": 54233 },
        [...]
        { "key": "1.6.3.8", "doc_count": 12123 }
      ]
    }
  }
}
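Reading the buckets back is plain JSON handling. The Python sketch below parses a response shaped like the one above (with the elided buckets simply left out) and turns it into (key, doc_count) pairs; top_terms is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List, Tuple

# Sample response: the three buckets shown in this README, elided ones omitted.
response_text = """
{
  "aggregations": {
    "top_ips": {
      "buckets": [
        { "key": "1.2.3.4", "doc_count": 62718 },
        { "key": "5.6.7.8", "doc_count": 54233 },
        { "key": "1.6.3.8", "doc_count": 12123 }
      ]
    }
  }
}
"""


def top_terms(response: Dict[str, Any], name: str) -> List[Tuple[str, int]]:
    """Extract (key, doc_count) pairs from the named aggregation, in response order."""
    buckets = response["aggregations"][name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


terms = top_terms(json.loads(response_text), "top_ips")
for key, doc_count in terms:
    print(f"{key}\t{doc_count}")
```

Keep in mind that each doc_count is an estimate produced by the top-k algorithm, not an exact count.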
To install the plugin:
./plugin --url file:///absolute/path/to/elasticsearch-topk-plugin-LATEST.zip --install topk-aggregation
To remove it:
./plugin --remove topk-aggregation