logstash-plugins/logstash-filter-fingerprint

Multiple sources without "concatenate_sources" only fingerprints last source

Opened this issue · 10 comments

If you put multiple sources in the "source" key with "concatenate_sources" set to false, this filter only fingerprints the value of the last source. Additionally, the test "Concatenate multiple values into 1" is incorrect, this is only fingerprinting "field2". Example:

filter {
key => "secret"
method => "SHA256"
source => ["@timestamp", "host"]
}

Outputs the same fingerprint every time. If I switch it so timestamp is last, then it changes. If this is intended, then the docs don't really make that clear.

I can make a PR for this, but I'm not sure what you'd like the intended behavior to be - perhaps if you pass multiple sources, it turns concatenate_sources to true? Or it creates multiple fingerprints?

I agree with Tom. Buddy P pointed this out to me on the mailing list and I forked to debug this issue and its pretty easy to find the problem as line #91 (not the is_a Array call below is only if the individual item in the source array points to an array, not is the source an array itself):

86 @source.each do |field| 
87             next unless event.include?(field) 
88             if event[field].is_a?(Array) 
89               event[@target] = event[field].collect { |v| anonymize(v) } 
90             else 
91               event[@target] = anonymize(event[field]) 
92             end 
93           end # @source.each 

but I couldn't also figure out the best behavior. I originally left concatenate off as I figured it was a performance hit to concat the strings but I can't figure out a better way to build the hash. Obviously you could hash the previous hash + the new field but that seems worse perf than just concatting together. The current code does hash every field in the source, it just overrides the result of the previous hash with the next one (so only the last field matters as Tom mentioned). The concat option does add the field name to the string also, so removing that may be a better option? I would say concat_sources should be true by default, and instead an additional option of include_field_name or similar should be added to allow adding that to the string. At a minimum if no behavior changes are desired the existing code should be modified to only hash the last time in a source array as right now we are just wasting CPU cycles. The documentation should also make it clear that using a source array is useless if concat is not enabled.

I second the previous opinion. I just hit the issue yesterday and after some debugging found the reason for repetitive fingerprints with different messages and same timestamps (otherwise we wouldn't have ever found it).
My thoughts:

  • indeed, behaviour change "just like that" is not a good idea
  • current operation mode is pointlessly wasting CPU for counting hashes which are later discarded, so even if the operation of the fingerprint function is not changed, it would be better to find a field that is to be the source of a hash and only then calculate a single hash.
  • Please update the documentation, so it reflects the real functionality. In current state a user can expect fingerprint to create combined fingerprint taking into account all source fields, not just last one. That's the most important thing, because the "proper" behaviour can be forced with concatenate_sources but you have to be aware of the issue to walk-around it. I'd expect that majority of fingerprint users are completely unaware that their fingerprints are not created the way they expect them to be.
jsvd commented

+1 this is a bug indeed. looking at the spec suite:

  describe "Concatenate multiple values into 1" do
    config <<-CONFIG
      filter {
        fingerprint {
          source => ['field1', 'field2']
          key => "longencryptionkey"
          method => 'MD5'
        }
      }
    CONFIG

    sample("field1" => "test1", "field2" => "test2") do
      insist { subject["fingerprint"]} == "872da745e45192c2a1d4bf7c1ff8a370"
    end
  end

The test isn't actually correct, since it tests that the filter will "concatenate multiple values into 1" but then checks that the result is 872da745e45192c2a1d4bf7c1ff8a370. checking how that is computed:

> require 'openssl'
 => true 
jruby-1.7.23 :002 > OpenSSL::HMAC.hexdigest(OpenSSL::Digest::MD5.new, "longencryptionkey", "test2".to_s).force_encoding(Encoding::UTF_8)
 => "872da745e45192c2a1d4bf7c1ff8a370" 

It's clearly confirming that only the last value is used for checksumming.

That said, I'm also not sure what the behaviour should be when multiple source fields are used and concatenate_sources is set to false.
We could change the concatenate_sources option to become something like:
config :merge_sources, :validate => ["concatenate", "first", "last"], :default => "last" so the user can choose which values to checksum, out of the array of sources..
However, at that point, choosing first or last is the same as setting only 1 source.

So I'm wondering if there's any use case where a user wants to disable concatenate_sources and user multiple source fields. If there's none, I vote for removing the option and always concatenate.

The question remains how else should the fingerprinting work with multiple sources and concatenate_sources option set to false.
The only sane option IMHO would be to do "the same thing as concatenate_sources but the other way around". Which means, instead of joining the sources and calculating fingerprint, it should calculate individual fingerprints for separate source fields and then join (concatenate? concatenate and then digest?) them together. Or put into target field an array of fingerprints (as it should be done with source field itself being an array if I understand the filter code correctly; I'm not a ruby-wizard ;->).

jsvd commented

Well, I'm questioning the use of any of those methods. If no one has a use for non concatenated multi-field sources then I'd suggest removing the option and always concatenate.

@jsvd yes but when concatenate is true this plugin sorts the keys. When you use it in 2 indices with different fields name, you will have a big bug.

Just found this bug also.

The docs say this:

concatenate_sources: ... If false and multiple source fields are given, the target field will be an array with fingerprints of the source fields given.

The code doesn't do this, though. Should be an easy fix assuming what is documented is the desired behavior

As it stands and with the incorrect documentation, the default behavior likely leads people to believe that multiple source fields are being fingerprinted in aggregate rather than simply the last field being fingerprinted. I'd suggest removing the concatenate_sources option as I don't see a use case for returning an array of hashes for an array of source fields.

I got bit by this! (my fault for not doing thorough enough testing)

I agree that it is unlikely to see a use case for returning an array of hashes. However, having said that, I don't think I would change the functionality -- I would just make it do what the documentation says it should do.

If fingerprint were to return an array as documented, that would be a very good indicator that something unexpected was being done, and would lead to fewer unexpected bugs. I would worry that removing concatenate_sources might break existing code and lead to backwards incompatibility.