logstash-plugins/logstash-filter-geoip

Skip the lookup and tag_on_failure when the IP is private

jsvd opened this issue · 17 comments

jsvd commented

Since version 3.0.0, this plugin tags a failed lookup with tag_on_failure.

However, this tagging does not differentiate between a failed lookup for a private IP (which will always fail, and that's expected) and one for a public IP (which may require further action).

We could add an option called skip_private_ips => false, which would skip the lookup for any IP in a private network range and not tag the event as a failure.
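Roughly, usage would look like this (the option name is just a proposal here, not an existing setting):

```
filter {
  geoip {
    source => "[source][ip]"
    skip_private_ips => true
  }
}
```

With the option enabled, events carrying an RFC 1918 (or otherwise reserved) address would pass through untouched, with no _geoip_lookup_failure tag added.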

This would be a great feature to have. For the record, I'm currently working around this with the CIDR plugin, doing the following:

  1. Using CIDR, check if the address is in a private block (currently defined as RFC 1918, link-local, and multicast, for both IPv4 and IPv6). If so, add the tag skip_geoip_lookup
  2. If skip_geoip_lookup is not in tags, then do the lookup
  3. Else, remove skip_geoip_lookup

Works like a charm but is a little ugly - a built-in solution would do wonders.
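In pipeline terms, my workaround looks roughly like this (the field path and the network list are illustrative - adjust both to your own events):

```
filter {
  cidr {
    address => [ "%{[source][ip]}" ]
    network => [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
                 "169.254.0.0/16", "fe80::/10", "224.0.0.0/4", "ff00::/8" ]
    add_tag => [ "skip_geoip_lookup" ]
  }
  if "skip_geoip_lookup" not in [tags] {
    geoip {
      source => "[source][ip]"
    }
  } else {
    mutate {
      remove_tag => [ "skip_geoip_lookup" ]
    }
  }
}
```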

An even better enhancement would be to support GeoIP lookups of private IPs from a CSV or JSON file.

I had looked into customizing the GeoIP2 database, but I can't find any decent tools for working with the new MMDB format.

we could add an option called skip_private_ips => false

Private may not be the correct word to use. We could have a setting to skip over address ranges we know would never appear in MaxMind data, such as RFC 1918, Class D and E ranges, 0.0.0.0/8, broadcast (255.255.255.255), etc.

Alternately, we could make a new setting that allows users to specify CIDR notation for address ranges they want to skip?

@bufordt13 I have heard of folks making their own GeoIP2 database files, but I have no advice for you on how to achieve this.

If you want custom lookups from csv or json, check out the translate filter which lets you do this.
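A minimal sketch of that approach (the dictionary path and field names are invented for illustration, and older versions of the translate filter use field/destination instead of source/target):

```
filter {
  translate {
    source          => "[source][ip]"
    target          => "[geoip][site]"
    dictionary_path => "/etc/logstash/private_ip_sites.yml"
    fallback        => "unknown"
  }
}
```

Note that translate does exact-string matching by default, so this suits per-host mappings; matching whole CIDR ranges would need its regex support or the cidr filter.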

I realize that I can build my own lookup with the translate filter, and that's probably the route I'm going down, but custom mapping of private IPs seems like a natural feature for geoip to have.

It would be nicer to have a single plugin that handles all the geoip tasks instead of one for public IPs and a completely different one for private IPs.

@bufordt13 if you want to provide custom data, you are certainly welcome to do so! This filter only supports Maxmind's GeoIP data file formats. Adding other formats is outside of the scope of the filter and another filter could be created (or reused, as in translate filter) to serve your needs.

Any progress on this? I assume skipping the lookup entirely for IPs not in the MaxMind DB would add performance gains.

I assume skipping the lookup entirely for IPs not in the MaxMind DB would add performance gains.

The way you phrased this is a bit confusing. To skip a lookup for IPs not in the MaxMind DB, we have to first determine whether the IP is in the DB, and answering that question itself requires a lookup.

For this idea, we also need to consider IPv6, not just IPv4.

I currently handle this using the cidr filter with something like...

cidr {
    address => [ "%{[conn][dst_addr]}" ]
    network => [ "0.0.0.0/32", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7", "127.0.0.0/8", "::1/128","169.254.0.0/16", "fe80::/10","224.0.0.0/4", "ff00::/8","255.255.255.255/32" ]
    add_field => { "[conn][dst_locality]" => "private" }
}

I can then use this result to decide whether to perform the geoip lookup. There is no waste in this in my use case, as I also leverage the resulting locality field in dashboards to allow users to filter on public or private network traffic data.
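The conditional that consumes this field is then straightforward - when the cidr filter did not match, the locality field is absent, so the address is treated as public (geoip options trimmed for brevity):

```
if ![conn][dst_locality] {
  mutate {
    add_field => { "[conn][dst_locality]" => "public" }
  }
  geoip {
    source => "[conn][dst_addr]"
    target => "[conn][geoip_dst]"
  }
}
```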

I assume the ask here is to do the equivalent check inside the geoip filter code prior to the actual geoip lookup. What would this gain us? If the cidr check is considerably faster than the geoip lookup itself, there could be an efficiency boost. But would it be faster than using the cidr filter and a conditional in the Logstash pipeline, as I do today?

Either way, in my case this wins me nothing, as I still have to do the cidr matching for other reasons. Additionally, by leveraging the cidr filter I have control over the exact IPs that are looked up in the geoip DB (for example, I may want to ignore some public IPs too). If this logic lived inside the geoip filter, everyone would have to be aware of it or would get unexpected results (it would also need to be overridable, or risk a breaking change for those who depend on a failure tag in their pipeline logic).

IMO, this is an example of a "best practice" or "pattern": using the cidr and geoip filters together. There is definitely a need for better documentation of such patterns. I can still remember the pains I went through until it all started to click. The foo => bar examples in the docs are a hurdle here - no real-world context. But that is a different topic, to be debated (and contributed to) elsewhere.

Rather than add additional logic to the geoip filter, I would rather see the cidr filter made part of the out-of-the-box Logstash distribution, as it is currently only a community-supported plugin which must be manually installed. I think this would be a wise move, since these two are so often used together.

Additionally, I would be happy to contribute a blog post that explains the patterns for using these two plugins together, including plenty of real-world context.

Just my 2 cents.

Rob

@robcowart Thank you for your detailed example and response. I agree with you that a good approach to solving this is to use the existing cidr filter.

For building a custom MMDB, have a look at this informative post from MaxMind: https://blog.maxmind.com/2015/09/29/building-your-own-mmdb-database-for-fun-and-profit/

It does require writing Perl, but it's quite doable (if you add remove_reserved_networks => 0 when constructing the $tree).

... although for simple things the csv filter would be easier

The biggest problem with the geoip filter (with a custom database) is that you need to use the same basic structure as the City or ASN database, as those are the only ones that geoip supports.

Just to add a data point, I have about a hundred different subnets in my custom MMDB, which gets created from source data contained in an Excel workbook (and will likely incorporate other sources too).

The initial major use-case is to make it easier to filter and aggregate logs coming from network switches.

There are some nice tie-ins with machine-learning too.

Cheers,
Cameron

Documenting this better would be good, but skipping lookups for reserved IPs when someone supplies a custom database that allows lookups of those IPs would be bad. The assumption that a lookup of a reserved IP will always fail is false.

I have roughed out some code in Go to merge our RFC 1918 addresses into the MMDB files. I did this with a plan to use them in @robcowart's ElastiFlow project. Once I had built the MMDB, I assumed I was over the hump.

Is there a flag/option I can use with the geoip plugin so that it will NOT skip on these addresses?

@hkelley did you test your files only with ElastiFlow, or with a small test pipeline? I ask because ElastiFlow will only query the GeoIP City and ASN DBs for public IPs. Depending on where you added the lookups for your DBs, the other ElastiFlow logic may be preventing the lookup from happening.

Yes, I'm using a test with only a snippet lifted from 20_filter_90_post_process.logstash.conf.

		geoip {
		  source => "[source][ip]"
		  database => "${ELASTIFLOW_GEOIP_DB_PATH:/etc/logstash/elastiflow/geoipdbs}/GeoLite2-City.mmdb"
		  cache_size => "${ELASTIFLOW_GEOIP_CACHE_SIZE:8192}"
		  target => "[metadata][geoip_src]"
		  fields => [ "city_name", "location" ]
		}

I discovered that even though I was only asking for city name and location, the plugin's retrieveCityGeoData() method makes MaxMind API calls for several other attributes, including Subdivisions.

From org.logstash.filters.GeoIPFilter

  private Map<String,Object> retrieveCityGeoData(InetAddress ipAddress) throws GeoIp2Exception, IOException {
    CityResponse response = databaseReader.city(ipAddress);
    Country country = response.getCountry();
    City city = response.getCity();
    Location location = response.getLocation();
    Continent continent = response.getContinent();
    Postal postal = response.getPostal();
    Subdivision subdivision = response.getMostSpecificSubdivision();
    Map<String, Object> geoData = new HashMap<>();

I saw this error when I cranked up the logging level for that class.

[DEBUG][org.logstash.filters.GeoIPFilter][main] GeoIP2 Exception. exception=com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.util.ArrayList` out of START_OBJECT token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: com.maxmind.geoip2.model.CityResponse["subdivisions"]), field=[source][ip],

I haven't been able to get the formatting quite right for the array of subdivision elements in the MMDB, so I simply removed it (for now). Without that bad data, the geoip filter works as I would expect: it looks up private IPs just like any other.

So, to bring this full circle: the default behavior of the plugin (looking up all IPs) is good. Rather than change the options, perhaps the plugin could surface the underlying error from the handler. Instead of logging these errors only at debug level, they could be returned in the tags.

    } catch (UnknownHostException e) {
      logger.debug("IP Field contained invalid IP address or hostname. exception={}, field={}, event={}", e, sourceField, event);
    } catch (AddressNotFoundException e) {
      logger.debug("IP not found! exception={}, field={}, event={}", e, sourceField, event);
    } catch (GeoIp2Exception | IOException e) {
      logger.debug("GeoIP2 Exception. exception={}, field={}, event={}", e, sourceField, event);
    }