rcp3

Word frequency count for URLs

Primary language: Java. License: GNU General Public License v3.0 (GPL-3.0).

A simple command-line program that implements the core of the word count problem: it counts word frequencies for the page at a given URL.
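The counting step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name WordCountSketch, the method countWords, and the rule that a word is a case-insensitive run of letters or digits are all assumptions, and fetching the URL and stripping HTML are left out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names and tokenization rule are assumptions, not taken from rcp3.
class WordCountSketch {

    // Counts how often each word occurs in the given text.
    // Assumption: a word is a maximal run of letters or digits, compared case-insensitively.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give stable output
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split can yield a leading empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Tango in the Bay, tango in the city."));
    }
}
```

A TreeMap is used here only so that the printed result is ordered by word; a HashMap would serve equally well if ordering does not matter.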

BUILD

To build, use Maven:

    mvn clean package

RUN

Run the program with java -jar <target-jar> <URL-to-count>, for example:

    java -jar target/rcp3-0.0.1-SNAPSHOT-jar-with-dependencies.jar http://bayareatango.org

DATABASE

The database is HSQLDB, stored in the directory data/. To inspect the database after a run, start the HSQLDB database manager:

    java -cp target/rcp3-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.hsqldb.util.DatabaseManager

Use the URL jdbc:hsqldb:data/wordcount to access the database.

TODO

Depending on customer requirements, some changes would be needed:

  • duplicate page detection
  • a query interface
  • moving storage to MongoDB, since the SQL schema can get unwieldy as the application grows
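
One possible approach to the duplicate page detection item is to fingerprint each page's text and skip pages whose fingerprint has already been seen. The sketch below is only an illustration under stated assumptions: the class DuplicatePageSketch and its methods are hypothetical, pages count as duplicates only when their whitespace-normalized text is identical, and the set of fingerprints is kept in memory rather than in the database.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not part of rcp3. Detects exact duplicates only.
class DuplicatePageSketch {

    // Fingerprints of pages seen so far (assumption: held in memory).
    private final Set<String> seen = new HashSet<>();

    // Returns the SHA-256 hash, as lowercase hex, of the whitespace-normalized text.
    static String fingerprint(String pageText) {
        String normalized = pageText.replaceAll("\\s+", " ").trim();
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Records the page and reports whether an identical page was seen before.
    boolean isDuplicate(String pageText) {
        return !seen.add(fingerprint(pageText));
    }

    public static void main(String[] args) {
        DuplicatePageSketch detector = new DuplicatePageSketch();
        System.out.println(detector.isDuplicate("hello world"));
        System.out.println(detector.isDuplicate("hello   world\n"));
    }
}
```

Near-duplicate pages (same content with small differences such as timestamps) would need a different technique, for example shingling, which this sketch does not attempt.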