The Buzz Media Common Parser Library

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

The Buzz Media common-parser-lib

Changelog
---------
3.0
	* Refactored library under base "parser" package to keep integration with
	future Buzz Media "common" libraries cleaner.
	
	* Decoupled source type, input type, delimiter type and Token value types
	all from each other. Allows for much more flexible API definitions.
	
	* Generified types were renamed to follow the naming scheme:
	  <IT>: input-type, the type of the input that is processed by the parser.
	  <ST>: source-type, type of the "source" that a Token gets its value from.
	  <TT>: token-type, if needed, the type returned by IToken.getType()
	  <VT>: value-type, the return type of IToken.getValue()
	  <DT>: delimiter-type, the type of the delims used by IDelimitedTokenizer.
	  <ET>: event-type, the type of the event pull parser's return.
	
	* AbstractParser.refillBuffer was refactored and generalized so code is no 
	longer duplicated into different base parser implementations. 
	
	* AbstractParser.createBuffer hook was added to simplify parser initialization
	in subclasses.
	
	* AbstractParser.parseToken was added as the universal logic-loop used to
	parse the next token from the given bIndex/bEndIndex range of chars in the
	buffer. This method also handles refilling and re-trying the parse operation
	on first-failure before automatically stopping the parser and returning
	null. 
	
	This logic is universal and doesn't need to be duplicated in any other sub
	classes; the only thing subclasses need is the actual scanning/marking logic
	for the types of tokens they parse.
	
	* Added AbstractReusableToken to make creating/using reusable tokens simpler.
	
	* Moved ByteArrayToken and CharArrayToken into their respective ITokenizer
	implementations. These are not general-use tokens except for these specific
	ITokenizer implementations, so it made more sense to have them defined as
	part of the tokenizer itself.
	
	* IContainerToken definition was fixed to extend IToken.
	
	* IContainerToken added the ability to dictate a bounds-growth mode based
	on the child tokens added to it.
	
	* Javadoc added to all the core interfaces to help make understanding the API
	(starting with core classes) easier.
	
2.0
	* Added parser package containing the fundamentals of a callback-based parser.
	
	* Added IStreamParser to spec an InputStream parser processing bytes.
	
	* Added IReaderParser to spec a Reader parser processing chars.
	
	* Added base abstract implementations for all parser types that include
	boiler-plate logic like refilling the underlying read buffer from the stream
	and passing generated tokens to the callback. Implementors need only add the
	actual parsing logic.
	
	* IContainerToken was added and can contain 0 or more child IToken instances.
	
	* AbstractContainerToken is a base implementation for IContainerToken and
	provides the following behavior: 
		* Parent's length automatically expands to contain child tokens as they
		are added.
		* Child token bounds are vetted to ensure they are allowed within the
		parent's bounds when added.
	
	* ITokenizer.isReuseToken/setReuseToken was added for performance-minded
	use; it instructs the tokenizer impl to update and return the same IToken
	instance on every call to nextToken instead of creating a new object each
	time. This offers a performance boost and a smaller memory footprint for
	implementors that know the token instance will be ephemeral and do not try to
	hold on to it.
	
	* ITokenizer was generalized to be a more flexible base-interface for ANY
	kind of tokenizer, not just the delimiter-based tokenizer it was
	previously.
	
	* IDelimitedTokenizer now contains the delimiter-specific specification.
	
	* Added IScanner interface used to define a base scanner implementation.
	
	* IToken was moved to the common base package as it applies to all the
	parsers.
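
The isReuseToken/setReuseToken entry above can be illustrated with a minimal
sketch. Every name in it (ReusableSpan, ReuseSketchTokenizer) is hypothetical
and only mirrors the idea of returning one updated token instance per call; it
is not the library's actual ITokenizer or IToken API.

```java
/** Hypothetical mutable token: an (index, length) view over a shared char[]. */
final class ReusableSpan {
    char[] source;
    int index;
    int length;

    String text() {
        return new String(source, index, length);
    }
}

/** Hypothetical comma tokenizer with a reuse switch; not the library's ITokenizer. */
final class ReuseSketchTokenizer {
    private final char[] source;
    private final boolean reuseToken;
    private final ReusableSpan shared = new ReusableSpan();
    private int pos = 0;

    ReuseSketchTokenizer(String input, boolean reuseToken) {
        this.source = input.toCharArray();
        this.reuseToken = reuseToken;
    }

    /** Returns the next comma-separated token, or null when the input is exhausted. */
    ReusableSpan nextToken() {
        if (pos > source.length) {
            return null;
        }
        int start = pos;
        while (pos < source.length && source[pos] != ',') {
            pos++;
        }
        // Reuse mode updates one shared instance; otherwise allocate per call.
        ReusableSpan token = reuseToken ? shared : new ReusableSpan();
        token.source = source;
        token.index = start;
        token.length = pos - start;
        pos++; // step over the delimiter, or past the end after the last token
        return token;
    }
}

final class ReuseSketchDemo {
    public static void main(String[] args) {
        ReuseSketchTokenizer tokenizer = new ReuseSketchTokenizer("a,b,c", true);
        ReusableSpan token;
        while ((token = tokenizer.nextToken()) != null) {
            // Consume the value immediately; the next call overwrites this instance.
            System.out.println(token.text());
        }
    }
}
```

With reuse enabled, a caller that stores the returned token and reads it after
the next call sees the newer token's bounds, which is why the option only suits
callers that treat tokens as ephemeral.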

1.1
	* Initial public release.


License
-------
This library is released under the Apache 2 License. See LICENSE.


Description
-----------
A collection of interfaces and base implementations for classes that deal with
parsing. 

The goal of this library is to provide a clean API with well-defined behaviors
for different types of parsers to make custom parser implementation quick and
easy. Base abstract implementations are provided, where applicable, to make
extension and implementation straightforward.

The 3 types of parsers and their intended uses are:
	* Scanner: A stateless set of logic used to scan a portion of data, 
	extracting a series of IToken<T> objects describing the content and returning
	them to the caller. Scanners are meant to be thread-safe as they are
	stateless.
	
	* Tokenizer: A stateful class used to wrap a source of data and be invoked
	by the caller in a while-loop, pulling nextToken() from the tokenizer until
	it returns null, indicating the data has been exhausted. Tokenizers are
	meant to be reusable, but NOT thread-safe as they maintain state relative to 
	the content they are parsing.
	
	* Parser: A stateful class used to parse IToken<T> instances from the given
	source and notify a callback every time a new token is parsed. Sources can 
	be InputStreams (byte[]) or Readers (char[]). Parsers are meant to be 
	reusable, like tokenizers, but NOT thread-safe as they maintain state about
	their source content as well.
	
There are no base implementations for scanners as they are simple yet
highly-specialized; any implementor can implement IScanner and fill in their
own logic.
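
As a rough illustration of the scanner style, the sketch below scans a range of
chars and returns the word tokens found in it. The SketchScanner interface,
the WordToken class and the scan(char[], int, int) signature are all
assumptions made for this example; the real IScanner and IToken definitions may
look different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical immutable token: a span of [index, index + length) over the scanned chars. */
final class WordToken {
    final char[] source;
    final int index;
    final int length;

    WordToken(char[] source, int index, int length) {
        this.source = source;
        this.index = index;
        this.length = length;
    }

    String getValue() {
        return new String(source, index, length);
    }
}

/** Hypothetical scanner contract; the real IScanner signature may differ. */
interface SketchScanner<T> {
    List<T> scan(char[] data, int index, int length);
}

/** Stateless: it has no fields, so one instance can be shared across threads. */
final class WordScanner implements SketchScanner<WordToken> {
    @Override
    public List<WordToken> scan(char[] data, int index, int length) {
        if (data == null || index < 0 || length < 0 || length > data.length - index) {
            throw new IllegalArgumentException("invalid scan range");
        }
        List<WordToken> tokens = new ArrayList<>();
        int end = index + length;
        int i = index;
        while (i < end) {
            while (i < end && Character.isWhitespace(data[i])) {
                i++; // skip the gap between words
            }
            int start = i;
            while (i < end && !Character.isWhitespace(data[i])) {
                i++; // mark the extent of the word
            }
            if (i > start) {
                tokens.add(new WordToken(data, start, i - start));
            }
        }
        return tokens;
    }
}

final class ScannerSketchDemo {
    public static void main(String[] args) {
        char[] data = "  foo bar  baz".toCharArray();
        for (WordToken token : new WordScanner().scan(data, 0, data.length)) {
            System.out.println(token.index + ": " + token.getValue());
        }
    }
}
```

All working state lives in local variables of scan(), which is what lets the
same scanner instance be called from several threads at once.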

There are base abstract and concrete implementations for ITokenizer in the form
of a simple delimiter-based tokenizer. Base abstract implementations provide
all the boilerplate necessary to run through the given data source, so
implementors only need to provide the parsing logic.
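
The split between base boilerplate and subclass logic might look like the
sketch below. AbstractSketchTokenizer, CharDelimitedTokenizer and the
isDelimiter hook are hypothetical names chosen for illustration, and tokens are
returned as plain Strings to keep the example short; this is not the library's
ITokenizer or IDelimitedTokenizer API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class: owns the source, the cursor and the nextToken() loop step. */
abstract class AbstractSketchTokenizer {
    private char[] source;
    private int pos;

    /** Resets all state so the same instance can be reused for new input. */
    void setSource(char[] source) {
        if (source == null) {
            throw new IllegalArgumentException("source cannot be null");
        }
        this.source = source;
        this.pos = 0;
    }

    /** Returns the next non-empty token, or null once the source is exhausted. */
    String nextToken() {
        if (source == null) {
            throw new IllegalStateException("setSource must be called first");
        }
        while (pos < source.length && isDelimiter(source[pos])) {
            pos++; // skip leading delimiters
        }
        if (pos >= source.length) {
            return null;
        }
        int start = pos;
        while (pos < source.length && !isDelimiter(source[pos])) {
            pos++;
        }
        return new String(source, start, pos - start);
    }

    /** The only logic a subclass has to supply. */
    protected abstract boolean isDelimiter(char c);
}

/** Concrete tokenizer that splits on a single delimiter char. */
final class CharDelimitedTokenizer extends AbstractSketchTokenizer {
    private final char delimiter;

    CharDelimitedTokenizer(char delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    protected boolean isDelimiter(char c) {
        return c == delimiter;
    }
}

final class TokenizerSketchDemo {
    /** The caller-side while-loop: pull tokens until null is returned. */
    static List<String> drain(AbstractSketchTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        String token;
        while ((token = tokenizer.nextToken()) != null) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        CharDelimitedTokenizer tokenizer = new CharDelimitedTokenizer(',');
        tokenizer.setSource("a,b,,c".toCharArray());
        System.out.println(drain(tokenizer));
    }
}
```

Because the cursor lives in instance fields, an instance can be reused by
calling setSource again, but it should not be shared between threads.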

Base abstract implementations for IParser are available as well, providing
optimized boilerplate for tasks like refilling read buffers from the underlying
streams, vetting arguments and providing hooks for the actual parse logic to 
implementors.
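
A minimal sketch of that division of labor for a Reader-based parser follows.
The base class vets arguments, refills the read buffer (keeping any partial
token), retries the scan after a refill and notifies the callback; the subclass
only locates token boundaries. All names here (AbstractSketchParser,
SketchTokenCallback, findTokenEnd, LineSketchParser) are assumptions for
illustration and do not reproduce the library's AbstractParser or IParser API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical callback notified once per parsed token. */
interface SketchTokenCallback {
    void tokenParsed(String value);
}

/** Hypothetical base parser: buffering, argument checks and callback dispatch. */
abstract class AbstractSketchParser {
    private char[] buffer;
    private int bIndex;    // next unread char in the buffer
    private int bEndIndex; // one past the last valid char in the buffer
    private boolean eof;

    AbstractSketchParser(int bufferSize) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be >= 1");
        }
        buffer = new char[bufferSize];
    }

    void parse(Reader reader, SketchTokenCallback callback) throws IOException {
        if (reader == null || callback == null) {
            throw new IllegalArgumentException("reader and callback cannot be null");
        }
        bIndex = 0;
        bEndIndex = 0;
        eof = false; // reset state so the instance is reusable
        while (true) {
            int end = findTokenEnd(buffer, bIndex, bEndIndex);
            if (end >= 0) {
                callback.tokenParsed(new String(buffer, bIndex, end - bIndex));
                bIndex = end + 1; // step over the terminator
            } else if (!eof) {
                refillBuffer(reader); // token may continue; refill and retry
            } else {
                if (bIndex < bEndIndex) { // unterminated trailing token
                    callback.tokenParsed(new String(buffer, bIndex, bEndIndex - bIndex));
                }
                return;
            }
        }
    }

    private void refillBuffer(Reader reader) throws IOException {
        int remaining = bEndIndex - bIndex;
        if (remaining == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2); // token outgrew the buffer
        } else if (remaining > 0) {
            System.arraycopy(buffer, bIndex, buffer, 0, remaining); // keep the partial token
        }
        bIndex = 0;
        bEndIndex = remaining;
        int read = reader.read(buffer, bEndIndex, buffer.length - bEndIndex);
        if (read < 0) {
            eof = true;
        } else {
            bEndIndex += read;
        }
    }

    /** Hook: index of the token terminator in [index, endIndex), or -1 if none was found. */
    protected abstract int findTokenEnd(char[] buffer, int index, int endIndex);
}

/** Concrete parser that treats each '\n'-terminated line as one token. */
final class LineSketchParser extends AbstractSketchParser {
    LineSketchParser(int bufferSize) {
        super(bufferSize);
    }

    @Override
    protected int findTokenEnd(char[] buffer, int index, int endIndex) {
        for (int i = index; i < endIndex; i++) {
            if (buffer[i] == '\n') {
                return i;
            }
        }
        return -1;
    }
}

final class ParserSketchDemo {
    static List<String> parseAll(String input, int bufferSize) {
        List<String> tokens = new ArrayList<>();
        try {
            new LineSketchParser(bufferSize).parse(new StringReader(input), tokens::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer forces several refills and one buffer growth.
        System.out.println(parseAll("alpha\nbe\ngamma", 4));
    }
}
```

The sketch rescans the retained partial token after each refill, which keeps
the code short; a tuned implementation would likely resume from where the
previous scan stopped.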


Runtime Requirements
--------------------
1.	The Buzz Media common-lib (tbm-common-lib-<VER>.jar)


History
-------
While developing the CloudFront Log Parser, High Performance XML Parser and
Redis Client Driver I began to see a lot of duplication in
parser/scanner/lexer logic emerge.

It took about a week to normalize all the use-cases into what looked like 3
different approaches to parsing: scanning, tokenizing and parsing to a callback.

In an attempt to normalize my efforts across all these projects and come up with
a clean API that I can follow now and in the future, this general-use project
was born.

My goals for this project were not so much complete parser implementations as
they were:

	* A well-defined structure to follow.
	* Solid, performant base implementations for common boilerplate.

So I set out to define this library, providing a clean API that can be easily
extended to build custom parser implementations quickly.