logstash-plugins/logstash-patterns-core

Incorrect pattern for AWS CLOUDFRONT_ACCESS_LOG

jpleger opened this issue · 7 comments

Not sure if it's a version change or a log format change on the AWS side, but several fields are currently captured incorrectly by the CLOUDFRONT_ACCESS_LOG pattern. The cause is the use of GREEDYDATA, which can match across the tab delimiters and shift the captured fields. To address this, I will probably use [^\t\r\n] to delimit the fields.
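To illustrate the offset, here is a minimal sketch in Python rather than grok. It assumes a shortened, hypothetical five-field line and a pattern that only knows about four fields; `.*` stands in for GREEDYDATA. It is not the actual CLOUDFRONT_ACCESS_LOG pattern, only a demonstration of the mechanism.

```python
import re

# Hypothetical, shortened log line: five tab-separated fields,
# one more than the patterns below know about.
line = "301\t-\tMozilla/5.0\t-\t-"

# GREEDYDATA expands to ".*", which also matches tab characters.
greedy = re.compile(
    r"(?P<status>\d+)\t(?P<referrer>.*)\t(?P<agent>.*)\t(?P<query>.*)"
)

# A negated character class cannot cross a field boundary.
nottab = r"[^\t\r\n]+"
strict = re.compile(
    rf"(?P<status>\d+)\t(?P<referrer>{nottab})\t(?P<agent>{nottab})\t(?P<query>{nottab})"
)

greedy_fields = greedy.match(line).groupdict()
strict_fields = strict.match(line).groupdict()

print(repr(greedy_fields["referrer"]))  # '-\tMozilla/5.0' (swallowed a tab)
print(repr(strict_fields["referrer"]))  # '-'
```

With the greedy version, the extra trailing field pushes the user agent into `referrer`; with the negated class each capture stays inside its own column.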

  • Version: 4.1.2
  • Operating System: All (Docker)
  • Config File (if you have sensitive info, please remove it): N/A
  • Sample Data:

2018-07-24	22:22:47	SEA19	557	196.52.43.106	GET	d2lv5my8ejglq4.cloudfront.net	/	301	-	Mozilla/5.0%2520(compatible;%2520nsrbot/1.0;%2520&%2343;http://netsystemsresearch.com)	-	-	Redirect	IKaDjLqf5T8ptafxZk_HNJ49zZ1N4SuI8f_kdivoUvPNZFnzpuKhKA==	jamespleger.com	http	127	0.000	-	-	-	Redirect	HTTP/1.1	-	-

  • Steps to Reproduce: parse CloudFront access logs with the CLOUDFRONT_ACCESS_LOG pattern.

Reference:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

Will submit a PR shortly to fix this.

Looking at the patterns in common use in grok-patterns, I think we should add a NOTTAB pattern, which would also help avoid this problem if AWS adds new fields in the future.

This should solve it:

# patterns/grok-patterns
NOTTAB [^\t\r\n]+
# patterns/aws
CLOUDFRONT_ACCESS_LOG (?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY}\t%{TIME})\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:clientip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NOTTAB:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}(?:\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_fields})?
jsvd commented

@jpleger I'm happy to merge such a PR if you get around to creating it. There are plenty of examples of how to write a test for a pattern, so please include one in the PR as well.

I've also started seeing x_edge_location fail to match WORD, because the value can contain hyphens.
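A small sketch of the hyphen problem, again in Python rather than grok. The sample value is hypothetical, and the second expression borrows the `[\w\-]+` class from the pattern posted further down in this thread.

```python
import re

# %{WORD} expands to \b\w+\b, and \w does not include "-".
word_field = re.compile(r"^(?P<x_edge_location>\b\w+\b)\t")
# Character class that also accepts hyphens.
hyphen_field = re.compile(r"^(?P<x_edge_location>\b[\w\-]+\b)\t")

line = "SEA19-C1\t557"  # hypothetical edge location containing a hyphen

word_match = word_field.match(line)      # None: "-" follows "SEA19", not a tab
hyphen_match = hyphen_field.match(line)  # captures the whole value

print(word_match)
print(hyphen_match.group("x_edge_location"))
```

Because the field must be followed by a tab, the WORD-style expression fails outright instead of capturing a partial value, which is consistent with the whole event failing to parse.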

Yes, I can confirm that this is broken: events end up tagged with _grokparsefailure.

Also, the AWS guide on how to plug CloudFront logs into Logstash isn't correct either: https://aws.amazon.com/premiumsupport/knowledge-center/cloudfront-logs-elasticsearch/

(fails with new log fields)

Hey,

I have been working on parsing the new data fields.

For reference, the AWS announcement of the new fields is here: https://aws.amazon.com/about-aws/whats-new/2019/12/cloudfront-detailed-logs/

I'm using this pattern:

%{DATE_EU:date}\t%{TIME:time}\t(?<x_edge_location>\b[\w\-]+\b)\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:user_agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookie}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NUMBER:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_field}(\t%{INT:c_port:int}\t%{NUMBER:time_to_first_byte:float}\t%{NOTTAB:x_edge_detailed_result_type}\t%{NOTTAB:sc_content_type}\t(?:%{NUMBER:sc_content_len:int}|-)\t(?:%{NUMBER:sc_content_start:int}|-)\t(?:%{NUMBER:sc_content_end:int}|-))?

It relies on the NOTTAB pattern suggested by @jpleger:

NOTTAB [^\t\r\n]+
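The optional trailing group is what lets one pattern accept both the older and the newer log layout. A minimal Python sketch of that idea follows, assuming a heavily shortened stand-in for the real pattern (two mandatory fields, two optional newer ones) and made-up sample values.

```python
import re

NOTTAB = r"[^\t\r\n]+"

# Shortened stand-in for the full pattern: two mandatory fields,
# then an optional group holding two of the newer fields.
pattern = re.compile(
    rf"^(?P<fle_status>{NOTTAB})\t(?P<fle_encrypted_field>{NOTTAB})"
    rf"(?:\t(?P<c_port>\d+)\t(?P<time_to_first_byte>\d+(?:\.\d+)?))?$"
)

old_line = "-\t-"                # layout without the newer fields
new_line = "-\t-\t49152\t0.002"  # hypothetical values for the newer fields

old = pattern.match(old_line).groupdict()
new = pattern.match(new_line).groupdict()

print(old["c_port"])  # None
print(new["c_port"])  # 49152
```

Since NOTTAB cannot cross a tab, the mandatory fields never absorb the newer columns; the optional group either matches all of them or none.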
kares commented

This is expected to be addressed by the updated ECS-compliant AWS pattern set from #287.
The incorrect behaviour of the legacy CLOUDFRONT_ACCESS_LOG pattern is now captured in the specs.