This is a solution for the prompt in this gist
Given two arguments:
- Path for a text file containing a list of common words
- Path for a text file containing the text to analyze
This program will compute the count of each of the words contained in the file at path #2, excluding the words defined in the file at path #1.
I set up the project with Gradle, so the easiest way to run it is the following:
./gradlew run --args='"src/test/resources/common_words.txt" "src/test/resources/alice_in_wonderland.txt"'
Alternatively you can build a jar and run the program that way:
./gradlew assemble
java -jar build/libs/words.jar "src/test/resources/common_words.txt" "src/test/resources/alice_in_wonderland.txt"
I used streams to break each line of the text file down into a stream of Strings, and then accumulated the words into a Map<String, Integer> while incrementing the count on duplicates. I did try to normalize by shifting everything to lower case and stripping out punctuation/special characters.
Sanitizing the text was done pretty naively; I didn't think it was worth the effort right now to do things like balancing quotes or preserving contractions.
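To illustrate the approach described above, here is a minimal sketch rather than the project's actual code. The class name `WordCountSketch`, the `sanitize` helper, and the regex are assumptions made for the example. It lower-cases each line, drops every character that is not an ASCII letter or whitespace (so "cat's" becomes "cats"), splits on whitespace, skips the common words, and merges duplicates by summing counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical class name; not taken from the project.
final class WordCountSketch {

    // Assumed normalization: lower-case, then remove anything that is not a-z or whitespace.
    // This is deliberately naive: contractions collapse ("don't" -> "dont")
    // and non-ASCII letters are dropped.
    static String sanitize(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", "");
    }

    static Map<String, Integer> countWords(List<String> lines, Set<String> commonWords) {
        return lines.stream()
                .map(WordCountSketch::sanitize)
                // Break each line into a stream of words.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // split() can yield empty strings for blank or indented lines.
                .filter(word -> !word.isEmpty())
                .filter(word -> !commonWords.contains(word))
                // Each word starts at 1; duplicates are merged with Integer::sum.
                // TreeMap is used here only to get a stable, sorted printout.
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The Cat and the hat!", "A cat's hat.");
        Set<String> commonWords = Set.of("the", "a", "and");
        System.out.println(countWords(lines, commonWords)); // prints {cat=1, cats=1, hat=2}
    }
}
```

In a real run the `lines` list would come from the text file (for example via `Files.lines`) and `commonWords` from the common words file; both are hard-coded here to keep the sketch self-contained.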
The speed could potentially be improved by using parallel streams and collecting to a ConcurrentHashMap.
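A sketch of what that parallel variant might look like, under the same assumed normalization as before (lower-case, strip everything but ASCII letters and whitespace). The class name `ParallelWordCountSketch` is hypothetical. It uses `parallelStream()` with `Collectors.toConcurrentMap`, backed by a `ConcurrentHashMap`, so worker threads merge into a single shared map. Whether this is actually faster depends on the input size and would need to be measured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

// Hypothetical class name; not taken from the project.
final class ParallelWordCountSketch {

    static ConcurrentMap<String, Integer> countWords(List<String> lines, Set<String> commonWords) {
        return lines.parallelStream()
                // Assumed naive normalization: lower-case, keep only a-z and whitespace.
                .map(line -> line.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", ""))
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty() && !commonWords.contains(word))
                // Concurrent collector: all threads accumulate into one ConcurrentHashMap,
                // and Integer::sum resolves duplicate keys atomically per entry.
                .collect(Collectors.toConcurrentMap(
                        word -> word, word -> 1, Integer::sum, ConcurrentHashMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The Cat and the hat!", "A cat's hat.");
        Set<String> commonWords = Set.of("the", "a", "and");
        // Iteration order of a ConcurrentHashMap is unspecified, so the printed order may vary.
        System.out.println(countWords(lines, commonWords));
    }
}
```

The common words set is only read, never modified, so a plain immutable `Set` is safe to share across threads here.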