/tag-statistic

"Top N Tags" application reads and parses in parallel multiple text files and maintains the Top N frequently occurring words (tags), user can query the real time version of the Top N list thorough a multiplexed socket connection or an API.

Primary LanguageJava

"Top N Tags" application reads and parses in parallel multiple text files and maintains the Top N frequently occurring words (tags), user can query the real time version of the Top N list thorough a multiplexed socket connection or an API.

1. Usage (requires at least Java 7 JRE or SDK)

java -cp {tagstats.jar} org.encalmo.tagstats.TagStatsApp -p {serverSocketPort} -d {base baseDirectory} [-t {numberOfThreads} -s {topListSize} -a {topListUpdateAccuracy (1,100)}]

2. Options:

-p Telnet port number on which the application would serve real time tag query requests.
-d Relative or absolute path to base directory to scan for text files.
-t (optional) Number of parallel threads dedicated to parsing files.
-s (optional) Size of maintained top list, default 10.
-a (optional) Top list updates accuracy, default 1.

3. Features:

The application processes multiple files at the same time.
Tags are composed by a mix of letters and numbers and dash “-“. (Eg: 12313, abc123, 66-route, 66route,).
Tags can also be numbers or a mix of letters and numbers. (Eg: 12313, abc123, 66route).
The plural version of a word are counted as a separate tag. Eg: Bottle and Bottles are two different tags.
Symbols, Space or punctuation marks are counted as tags.
New lines, carriage returns, tabs and any white space are excluded.
When the application completes processing it writes to standard output the top 10 tag list (neatly formatted) 
sorted in descending order based on number of occurrences.
The application accepts multiple connections on the port specified (as a command line parameter, -p).
When the connection is made, the application prints the top 10 tag list (line by line) sorted in descending 
order based on the number of occurrences and closes the connection immediately.

4. Architecture:

TagStatsApp aggregates several components linked by few interfaces and working together in a multi-threaded, non-blocking, asynchronous way achieved using "Actor" pattern.

Interfaces:
- TagSet<T> is an interface of components gathering tags and providing top list.
- TagParserStrategy is an interface of tags parsing strategies.
- TagParser is an interface of components parsing streams into tags.
- FileParser is an interface of components parsing files.
- TagStatsService aggregates TagParser, FileParser and TagStats

Abstract classes:
- Actor is a base abstract class of all asynchronous components, encapsulates message queue and multi-threaded execution details
- SingleThreadService is a base abstract class of manageable services running in the single dedicated thread, encapsulates asynchronous stop/start execution details

Components:
- TagStatsApp acts as a TagStatsService launcher, uses TagStatsServiceFactory and TagStatsServiceConfig.
- GenericTagStatsService is main aggregating component.
- GenericTagSet maintains in-memory doubly-linked ordered list of tags and current TopN stats.
- TagSetActor encapsulates GenericTagSet and process all inserts using long queue and single thread to avoid synchronization of concurrent inserts inside GenericTagSet.
- GenericTagParserStrategy encapsulates default parsing rules
- GenericTagParser parses stream of characters into tags and updates TagSet on the go.
- TagParserActor encapsulates GenericTagParser and processes streams using multiple threads and short queue to avoid keeping them open too long.
- FileParserActor encapsulates TagParserActor and acts as a proxy responsible of queueing and opening files before they reach the TagParserActor.
- DirectoryWatchAndScan observes baseDirectory and notifies FileParserActor about new files to parse, uses TagStatsFileEventListener.
- MultiplexedServerSocket serves at dedicated network socket TopN tags provided by TagStatsSet, uses TagStatsServerSocketEventListener.

5. Repository and build

Code source available at https://github.com/arturopala/tag-statistic.git

Application is build and tested using standard maven commands: mvn clean install
More info: http://maven.apache.org/guides/getting-started/index.html