Amazon CloudFront Log Parser

Changelog
---------
1.4
	* Fixed Issue #11 - IllegalArgumentException while parsing newer CF log format.
	* Fixed Issue #12 - Supporting new CF log fields.
	
1.3
	* Improved performance by using a new feature in tbm-common-parser-lib that
	allows a tokenizer to re-use the same IToken for every token event instead
	of creating a new one.
	
1.2	
	* More robust exception handling inside of parse ensures cleanup of any
	temporary input streams.
	
	* Improved the exception semantics of the parse method. (Issue #2)
	
	* Parser is more verbose with MalformedContentException errors when the
	content of the log file doesn't match the Amazon-defined DOWNLOAD or
	STREAMING CloudFront log formats. (Issue #2)
	
	* Fixed incorrect bounds check for insertion of fields beyond what
	ILogEntry will allow. (Issue #5)
	
	* Fixed a handful of "tightening" bugs that didn't break the parser previously
	but could lead to leaks or bugs down the road.
	 
	* Additional Javadoc was added to the source to make it clearer what
	certain constructs are for.
	
	* Added a new file to the benchmark that matches a worst-case-scenario log
	file from CloudFront (50MB uncompressed, ~174k entries).
	
	* Added library to The Buzz Media's Maven repo.

1.1
	* Initial public release.


License
-------
This library is released under the Apache 2 License. See LICENSE.


Description
-----------
CloudFront Log Parser is a Java library offering a low-complexity, 
high-performance, adaptive CloudFront log parser for both the Download and 
Streaming CloudFront log formats.

The library is "adaptive" in the sense that it determines the format of the
log files at parse time; you don't need to tell it. It is additionally
resilient in that it can skip unknown Field names and values while parsing
instead of throwing an exception. This is important if the library is ever
deployed in a production setting and Amazon changes the CloudFront log format
before a new version of the library is released.

CloudFront Log Parser's API was designed with the existing AWS Java SDK in
mind, ensuring that integration would fit naturally. More specifically, the
API can directly consume the raw InputStream for the stored .gz log files on
S3, as returned by the S3Object.getObjectContent() method.

That being said, the API is coded to accept ANY InputStream, whether you are
processing local copies and passing FileInputStreams or processing remote
copies; it just happened to be particularly easy to integrate with the AWS
Java SDK.
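
As a concrete sketch, wiring the parser to the AWS SDK might look like the
following. The LogParser.parse(InputStream, ILogParserCallback) signature and
the logEntryParsed callback method name are assumptions based on the class
names mentioned in this README (check the Javadoc for the real forms); the
S3 calls are standard AWS SDK for Java v1 API:

	import java.io.InputStream;

	import com.amazonaws.services.s3.AmazonS3;
	import com.amazonaws.services.s3.AmazonS3ClientBuilder;
	import com.amazonaws.services.s3.model.S3Object;

	// (library imports omitted; package name not shown in this README)
	public class S3ParseExample {
		public static void main(String[] args) throws Exception {
			AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
			S3Object log = s3.getObject("my-log-bucket",
					"E123EXAMPLE.2011-01-01-00.abcd1234.gz");

			// getObjectContent() hands back the raw InputStream for the
			// stored .gz log file; the parser consumes it directly.
			// NOTE: parse(...) and logEntryParsed(...) are assumed names.
			InputStream in = log.getObjectContent();
			new LogParser().parse(in, new ILogParserCallback() {
				public void logEntryParsed(ILogEntry entry) {
					// Copy any values you need out of the entry here;
					// see the Design section below for why.
				}
			});
		}
	}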

CloudFront Log Parser is intended to be used in any deployment scenario from a
desktop analysis app to a long-running server process environment.


Design
------
CloudFront Log Parser was designed, first and foremost, to be as fast as 
possible with as little memory allocation as possible so as to work smoothly
in a long-running server log-processing usage scenario.

Object creation is kept to a minimum during parsing (no matter how big the
parse job) by re-using a single ILogEntry instance, per LogParser, to wrap
parsed line values and report those to the given ILogParserCallback.

The ILogEntry instance received by the callback is ephemeral, in that it is
only valid for the scope of the callback's method. Once that method returns,
the ILogEntry instance is reused and the values it holds are swapped.

*** CALLBACKS SHOULD NEVER HOLD ONTO ILOGENTRY INSTANCES! ***

However, the values reported by the callbacks are copies and can be safely
stored.
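
For example, a safe callback copies the values it needs out of the ephemeral
entry rather than keeping a reference to it. The callback method name and the
getValue(Field) accessor below are illustrative assumptions, not the
library's documented API:

	import java.util.ArrayList;
	import java.util.List;

	// (library imports omitted; package name not shown in this README)
	public class UriCollectingCallback implements ILogParserCallback {
		private final List<String> uris = new ArrayList<String>();

		// Method name assumed for illustration.
		public void logEntryParsed(ILogEntry entry) {
			// SAFE: the String value reported by the entry is a copy
			// and can be stored after this method returns.
			uris.add(entry.getValue(Field.CS_URI_STEM));

			// UNSAFE: never store the entry itself -- it is reused and
			// its values are swapped before the next callback.
			// entries.add(entry);
		}
	}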

This design was chosen because over large parse jobs for busy sites, where
millions of log entries would not be uncommon, the heap memory savings and
performance improvement are noticeable. The VM is not thrown into longer GC
cycles as it attempts to clean up millions of useless, short-lived objects.


Performance
-----------
Benchmarks can be found in the /src/test/java folder and can be run directly
from the command line (no need to set up JUnit).

[Platform]
* Java 1.6.0_24 on Windows 7 64-bit 
* Dual Core Intel E6850 processor
* 8 GB of RAM

[Benchmark Results]
Parsed 100 Log Entries in 27ms (3703 entries / sec)
Parsed 100,000 Log Entries in 864ms (115740 entries / sec)
Parsed 174,200 Log Entries in 1341ms (129903 entries / sec)
Parsed 1,000,000 Log Entries in 7520ms (132978 entries / sec)

The Amazon CloudFront docs say log files are truncated at a maximum size of
50MB (uncompressed) before they are written out to the log directory. The 3rd
test, parsing 174k log entries, is exactly 50MB uncompressed and matches this
worst-case scenario.

This means that when using CloudFront Log Parser on the largest log files
CloudFront will write out to your S3 bucket, you can parse a file in a little
over a second on equivalent hardware.

If you are running on beefier server hardware, you can increase that rate,
and if you are parsing logs in a multi-threaded environment (1 thread per
LogParser, as sketched below) you can increase it by orders of magnitude.
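
A minimal sketch of that threading model, assuming the same hypothetical
parse signature as above. The key point is one LogParser instance per task,
since each parser reuses a single ILogEntry and should never be shared
across threads:

	import java.io.FileInputStream;
	import java.util.concurrent.ExecutorService;
	import java.util.concurrent.Executors;

	// (library imports omitted; package name not shown in this README)
	public class ParallelParseExample {
		public static void main(String[] args) throws Exception {
			ExecutorService pool = Executors.newFixedThreadPool(
					Runtime.getRuntime().availableProcessors());

			for (final String path : args) {
				pool.submit(new Runnable() {
					public void run() {
						try {
							// One LogParser per thread; never share a
							// parser (it reuses its ILogEntry).
							new LogParser().parse(
									new FileInputStream(path),
									new UriCollectingCallback());
						} catch (Exception e) {
							e.printStackTrace();
						}
					}
				});
			}
			pool.shutdown();
		}
	}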

CloudFront Log Parser is fast.  


Runtime Requirements
--------------------

1.	The Buzz Media common-lib (tbm-common-lib-<VER>.jar)

2.	The Buzz Media common-parser-lib (tbm-common-parser-lib-<VER>.jar)
	

History
-------
After deploying apps that utilized CloudFront for content delivery, I had the
need to parse the resulting access logs to get an idea of what kind of traffic,
bandwidth and access patterns the content was receiving.

Amazon promotes the use of their Map/Reduce Hadoop-based log parser multiple
times on their site, but running it incurs additional EC2 charges.

After about a week of prototyping and engineering I had initial versions of the
CloudFront Log Parser written and running. Initial "does it work" prototypes took
an afternoon, but I toyed with a multitude of different API designs and 
approaches trying to best balance an easy-to-use API with runtime performance.

Eventually settling on what was by far the cleanest and simplest API, I used
HPROF to tighten up the runtime performance and cut object creation down to
the bare minimum.

I hope this helps folks out there.