The Buzz Media common-parser-lib http://www.thebuzzmedia.com/software/common-parser-lib-common-parser-java-utility-library/ Changelog --------- 3.0 * Refactored library under base "parser" package to keep integration with future Buzz Media "common" libraries cleaner. * Decoupled source type, input type, delimiter type and Token value types all from each other. Allows for much more flexible API definitions. * Generified types were renamed to follow the naming scheme: <IT>: input-type, the type of the input that is processed by the parser. <ST>: source-type, type of the "source" that a Token gets its value from. <TT>: token-type, if needed, the type returned by IToken.getType() <VT>: value-type, the return type of IToken.getValue() <DT>: delimiter-type, the type of the delims used by IDelimitedTokenizer. <ET>: event-type, the type of the event pull parser's return. * AbstractParser.refillBuffer was refactored and generalized so code is no longer duplicated into different base parser implementations. * AbstractParser.createBuffer hook was added to simplify parser initialized in subclasses. * AbstractParser.parseToken was added as the universal logic-loop used to parse the next token from the given bIndex/bEndIndex range of chars in the buffer. This method also handles refilling and re-trying the parse operation on first-failure before automatically stopping the parser and returning null. This logic is universal and doesn't need to be duplicated in any other sub classes; the only thing subclasses need is the actual scanning/marking logic for the types of tokens they parse. * Added AbstractReusableToken to make creating/using reusable tokens simpler. * Moved ByteArrayToken and CharArrayToken into their respective ITokenizer implementations. These are not general-use tokens except for these specific ITokenizer implementation, so it made more sense to have them defined as part of the tokenizer itself. * IContainerToken definition was fixed to extend IToken. * IContainerToken added the ability to dictate a bounds-growth mode based on the child tokens added to it. * Javadoc added to all the core interfaces to help make understanding the API (starting with core classes) easier. 2.0 * Added parser package containing the fundamentals of a callback-based parser. * Added IStreamParser to spec an InputStream parser processing bytes * Added IReaderParser to spec a Reader parser processing chars. * Added base abstract implementations for all parser types that include boiler-plate logic like refilling the underlying read buffer from the stream and passing generated tokens to the callback. Implementors need only add the actual parsing logic. * IContainerToken was added and contain 0 or more child IToken instances. * AbstractContainerToken is a base implementation for IContainerToken and provides the following behavior: * Parent's length automatically expands to contain child tokens as they are added. * Child token bounds are vetted to ensure they are allowed within the parent's bounds when added. * ITokenizer.isReuseToken/setReuseToken was added for performance-minded use; it instructs the tokenizer impl to update and return the same IToken instance every call to nextToken instead of creating a new object each time. This offers a performance boost and smaller memory footprint for implementors that know the token instance will be ephemeral and don't try and hold on to it. * ITokenizer was generalized to be a more flexible base-interface for ANY kind of tokenizer; not just a delimiter-based tokenizer which it was previously. * IDelimitedTokenizer now contained specific delimiter-based specification. * Added IScanner interface used to define a base scanner implementation. * IToken was moved to the common base package as it applies to all the parsers. 1.1 * Initial public release. License ------- This library is released under the Apache 2 License. See LICENSE. Description ----------- A collection of interfaces and base implementations for classes that deal with parsing. The goal of this library is provide a clean API with well-defined behaviors for different types of parsers to make custom parser implementation quick and easy. Base abstract implementations are provided, where applicable, to make extension and implementation straight forward. The 3 types of parsers and intended use are: * Scanner: A stateless set of logic used to scan a portion of data, extracting a series of IToken<T> objects describing the content and returning them to the caller. Scanners are meant to be thread-safe as they are stateless. * Tokenizer: A stateful class used to wrap a source of data and be invoked by the caller in a while-loop, pulling nextToken() from the tokenizer until it returns null; indicating the data has been exhausted. Tokenizers are meant to be reusable, but NOT thread-safe as they maintain state relative to the content they are parsing. * Parser: A stateful class used to parse IToken<T> instances from the given source and notify a callback every time a new token is parsed. Sources can be InputStreams (byte[]) or Readers (char[]). Parsers are meant to be reusable, like tokenizers, but NOT thread-safe as they maintain state about their source content as well. There are no base implementations for scanners are they are simple yet highly-specialized; any implementor can implement IScanner and fill in his own logic. There are base abstract and concrete implementations for ITokenizer in the form of a simple delimiter-based tokenizer. Base abstract implementations provide all the boiler plate necessary to run through the given data source and implementors only need to provide the parsing logic. Base abstract implementations for IParser are available as well, providing optimized boilerplate for tasks like refilling read buffers from the underlying streams, vetting arguments and providing hooks for the actual parse logic to implementors. Runtime Requirements -------------------- 1. The Buzz Media common-lib (tbm-common-lib-<VER>.jar) http://www.thebuzzmedia.com/software/common-lib-common-java-utility-library/ History ------- While developing the CloudFront Log Parser (http://www.thebuzzmedia.com/software/cloudfront-log-parser/), High Performance XML Parser (http://www.thebuzzmedia.com/software/high-performance-java-xml-parser-hpjxp/) and Redis Client Driver (http://www.thebuzzmedia.com/software/redis-java-client-db-driver/) I began to see a lot of duplication in parser/scanner/lexer logic emerge. It took about a week to normalize all the use-cases into what looked like 3 different approaches to parsing: Scanning, tokenizing and parsing to a callback. In an attempt to normalize my efforts across all these projects and come up with a clean API that I can follow now and in the future, this general-use project was born. My goals for this project were not so much base implementations as they were: * Well-defined structure to follow * Solid, performant base implementations for common boilerplate. So I set out to define this library providing a clean API that can be easily extended to make custom parser implementations quickly. Contact ------- If you have questions, comments or bug reports for this software please contact us at: software@thebuzzmedia.com