ankane/logstop

Filter IPv6 addresses

bjeanes opened this issue · 8 comments

I am looking through the code and noticed the regex only seems to match IPv4 addresses.

Should IPv6 also be matched?

I'm hoping to do this so that I can add a custom scrubber which used IpAnonymizer and lean on the gem-maintained Regex.

Thoughts?

Thinking on this more, could the IP regex be delegated to stdlib's Resolv::AddressRegex?

[7] pry(main)> Resolv::AddressRegex
=> /(?:(?-mix:\A((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\z))|(?:(?x-mi:
      (?:(?x-mi:\A
      (?:[0-9A-Fa-f]{1,4}:){7}
         [0-9A-Fa-f]{1,4}
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}:){6,6})
      (\d+)\.(\d+)\.(\d+)\.(\d+)
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::
      ((?:[0-9A-Fa-f]{1,4}:)*)
      (\d+)\.(\d+)\.(\d+)\.(\d+)
      \z))))/

EDIT: Yes, there is a good reason. The regex is anchored so wouldn't match anything in a logging context.

Hey @bjeanes, great suggestion. I created an ipv6 branch for this. Unfortunately, there's a pretty big performance hit (slows down log throughput by around 33%), so need to decide what to do. Also need to make it work with url encoding. Anyways, should should be able to use the regexp from the branch for your use case for now.

https://github.com/ankane/logstop/compare/ipv6

Hah!

Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

^^ that is exactly what I have just deployed to production already!

slows down log throughput by around 33%

I've deployed this and the slowdown is noticeable in web RPMs too. 🤔

so need to decide what to do

I run on Heroku so my logs go via STDOUT. I do wonder if I'm better off with a solution that filters the logs outside of the core Ruby process, even if backed by this gem. That should, at least in theory, allow for some better use of multiple cores...

I'm surprised it's noticeable at the RPM level. What difference are you seeing there?

Just ran benchmarks with Ruby 2.7 and latest code:

Warming up --------------------------------------
             no ipv6     3.126k i/100ms
                ipv6     1.938k i/100ms
Calculating -------------------------------------
             no ipv6     34.599k (± 3.5%) i/s -    175.056k in   5.065833s
                ipv6     19.874k (± 3.4%) i/s -    100.776k in   5.076570s

It still appears to reduce throughput significantly, but 20k iterations per second is still pretty fast and most of the time spent in an application is not in logging. Will plan to merge once IPv6 has more adoption.

Edit: another approach could be to use a less complex regex if common sources of IPs (like Rack::Request) use a specific format.

Edit: another approach could be to use a less complex regex if common sources of IPs (like Rack::Request) use a specific format.

Yeah, or use a less complex regex regardless. It probably is better (from GDPR etc standpoint) to accidentally filter non-IPs but catch all actual IPs than to only catch IPs but take a large hit in performance. This could even be a configuration option, that swaps in a more accurate regex when the user opts into the trade-off of slower perf.