logstash-plugins/logstash-filter-http

Implement native caching for higher scale lookups

acchen97 opened this issue · 4 comments

There has already been some demand for native caching of HTTP lookups in this plugin. This would enable higher throughput without having to pair the plugin with third-party caching systems like Memcached.

Please feel free to +1 if you are interested in this feature.

I envision a two-part solution:

  1. Support for proxies (including https) would be trivial to add, and would allow users to configure a local caching proxy (e.g., Squid Cache) that obeyed all of the semantics and standards of the web and kept that complexity out of our maintenance domain.
  2. A naïve LRU in-memory cache (perhaps around LogStash::Filters::Http#request_http(verb, url, options)) is also possible, if a little more complex. It would spare users of this plugin the overhead of configuring and running the above-mentioned caching proxy, at the cost of breaking some of the semantics (e.g., no upstream cache invalidation) and making the plugin's memory consumption less predictable.
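
To make the second option concrete, here is a minimal sketch of what such a naïve LRU could look like in Ruby. It is not code from this plugin: the class name NaiveLruCache, its fetch method, and the max_size parameter are all hypothetical, and the lambda below only stands in for a call to request_http, whose return value is not described in this thread.

```ruby
# Hypothetical sketch, not part of logstash-filter-http.
# A size-bounded LRU cache built on the fact that Ruby hashes
# preserve insertion order: the first entry is the least recently used.
class NaiveLruCache
  def initialize(max_size)
    raise ArgumentError, "max_size must be positive" unless max_size.positive?
    @max_size = max_size
    @store = {}
    @lock = Mutex.new # assumed necessary: several pipeline workers may share one filter
  end

  # Returns the cached value for key; on a miss, runs the block,
  # stores its result, and evicts the oldest entries beyond max_size.
  def fetch(key)
    @lock.synchronize do
      if @store.key?(key)
        value = @store.delete(key) # delete and re-insert to mark as most recently used
        @store[key] = value
        return value
      end
    end

    value = yield # do the real lookup outside the lock

    @lock.synchronize do
      @store.delete(key)
      @store[key] = value
      @store.shift while @store.size > @max_size # Hash#shift drops the oldest entry
    end
    value
  end

  def size
    @lock.synchronize { @store.size }
  end
end

# Usage with a stand-in for the real HTTP request (assumption: the
# cache key is built from the verb and the URL only).
cache = NaiveLruCache.new(1000)
fake_request = ->(verb, url) { "response for #{verb} #{url}" }
body = cache.fetch([:get, "http://localhost/api/users/42"]) do
  fake_request.call(:get, "http://localhost/api/users/42")
end
```

Note that two workers missing on the same key at the same time would both perform the lookup, and nothing here expires entries, so upstream changes are never seen until an entry is evicted. Both are instances of the trade-offs described above.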

I add my vote on this one, it would be ideal for our data enrichment use case. We are now using the jdbc_streaming filter, but it's a less-than-ideal choice. The perfect choice would be the http filter with caching capabilities, just like the aforementioned jdbc_streaming, only making HTTP calls instead of SQL queries.

+1
Just came to add my interest in this. I haven't gotten any method other than hammering my REST source with the exact same request to work.

vjt commented

-1

I don't think that LogStash should have a caching layer, as there is already external software (nginx, memcached) that does that well and it's easy to integrate them with LogStash.

I have two use cases for which I am using external caches:

  • Querying an internal API service over HTTP to enrich logs coming from different sources. The information on the API service seldom changes, and logstash processes hundreds of events per second. I set up an nginx instance listening on localhost that proxies my API service, configured its disk cache, and pointed logstash to it (see [1] below).
  • Keeping a mapping from client IP to user name. If an incoming log event has both a clientip and a user field, I store the mapping in memcached. If it has a clientip but no user field, I query memcached to enrich the log event (see [2] below).

That said, I find the following pluses in having the caching layer external:

  • being able to tune, change the behaviour or replace altogether the caching layers without involving LogStash or having to reconfigure it
  • being sure of not losing the cache contents if I need to restart LogStash; otherwise being able to flush the cache without involving LogStash
  • being able to scale out the cache separately from LogStash

Sorry for the verbosity, I hope this is also useful for your use cases.

[1] local caching proxy

proxy_cache_path /srv/cache/foobar levels=1:2 keys_zone=foobar:40m inactive=24h max_size=1g;

server {
  listen localhost:8084;

  access_log off;

  location / {
    proxy_pass            https://foobar;

    proxy_ignore_headers  Cache-Control;

    proxy_set_header      Host foobar.example.org;
    proxy_buffering       on;
    proxy_cache           foobar;
    proxy_cache_key       $uri$is_args$args;
    proxy_cache_valid     200 404 1h;
    proxy_cache_valid     any 5m;
    proxy_cache_lock      on;
    proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;

    add_header X-Cache-Status $upstream_cache_status;
  }
}

upstream foobar {
  server foobar.example.org:443;
}

[2] memcached enrichment

# We have a mapping from the event, store it in the cache for use by future events.
#
if [clientip] and [user] and [user] !~ '(?:^(?:unauthenticated|_?system|anonymous|\[?unknown\]?)$)' {
  memcached {
    hosts => ["cache-01"]
    namespace => "logstash-ip"
    set => { "[user]" => "%{clientip}" }
    ttl => 86400 # Avoid stale lookups
  }
}

# We don't have a mapping from the event, try to look it up from the cache.
#
if [clientip] and ! [user] {
  # Check the cache
  #
  memcached {
    hosts => ["cache-01"]
    namespace => "logstash-ip"
    get => { "%{clientip}" => "[user]" }
    add_tag => ["user_from_cache"]
  }
}