logstash-plugins/logstash-filter-fingerprint

fingerprint punctuation doesn't just leave punctuation

Closed this issue · 2 comments

The PUNCTUATION method of the fingerprint filter isn't described in the fingerprint docs, but its behaviour matches the old punct filter (which is understandable).

That filter is described as "Strip everything but punctuation from a field and store the remainder in the a separate field", but that's not accurate: the logic used actually only strips out US-ASCII letters and digits, plus space and tab.

So it doesn't do what it describes for non-ASCII inputs such as letters with accents.
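
A minimal Ruby sketch of the behaviour described above. It assumes the filter's logic is equivalent to deleting the characters A-Z, a-z, 0-9, space and tab; the helper name `strip_ascii_alnum` is made up for illustration and is not the plugin's actual code:

```ruby
# Hypothetical stand-in for the current PUNCTUATION logic (an assumption,
# not copied from the plugin): delete only US-ASCII letters and digits,
# plus space and tab, and keep everything else.
def strip_ascii_alnum(text)
  text.gsub(/[A-Za-z0-9 \t]/, '')
end

puts strip_ascii_alnum("Hello, world!")   # prints ",!"
puts strip_ascii_alnum("Café, déjà vu!")  # prints "é,éà!" (accented letters are left in)
```

With pure ASCII input the result is only punctuation, but any character outside that ASCII set, accented letters included, passes straight through.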

I'm not familiar with Ruby, but the fix might be to use the [^[:punct:]] regex, assuming that it behaves properly for Unicode strings and uses the Unicode punctuation property.
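
A rough sketch of that idea, not tested against the plugin itself. `keep_punct` is a hypothetical helper name, and the input is assumed to be a UTF-8 string. Exactly which characters `[[:punct:]]` matches (ASCII symbols such as `$` or `+` in particular) may differ between Ruby versions, so this would need checking on the Ruby/JRuby that Logstash runs on:

```ruby
# Hypothetical helper: remove every character that the POSIX bracket
# expression [[:punct:]] does not match, so only punctuation remains.
# Assumes a UTF-8 encoded string.
def keep_punct(text)
  text.gsub(/[^[:punct:]]/, '')
end

puts keep_punct("Hello, world!")   # prints ",!"
puts keep_punct("Café, déjà vu!")  # prints ",!" (accented letters are removed too)
```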

You'd need to give this a new method name, perhaps "PUNCT".

Would also be nice to give an example output in the docs for each of the methods.

moved from https://logstash.jira.com/browse/LOGSTASH-2251

+1 on this bug

This is fixed in #5, feel free to review.