Stackery.io Coding Exercise - adlio
This program calculates the most frequently occurring n-grams in text files supplied as command line arguments or via standard input. By default it looks for 3-word sequences and outputs the 100 most common ones found.
Command line arguments:
ngramfreq file1.txt file2.txt
Standard input and command line args combined:
cat first.txt | ngramfreq second.txt
Search for bigrams instead of trigrams:
ngramfreq -s 2 file1.txt
Output the top 10 instead of the top 100:
ngramfreq -n 10 file1.txt
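One way the -s and -n options shown above could be handled with the standard library's flag package is sketched below. The flag names and the defaults (3 and 100) come from this README; the options type, the parseArgs function, the help strings and the validation rule are assumptions and may differ from the real program.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the settings from the usage examples.
// (Hypothetical type, introduced for this sketch only.)
type options struct {
	size  int      // words per n-gram (-s)
	limit int      // number of results to print (-n)
	files []string // remaining positional arguments
}

// parseArgs parses command line arguments (without the program name).
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("ngramfreq", flag.ContinueOnError)
	fs.IntVar(&o.size, "s", 3, "number of words per n-gram")
	fs.IntVar(&o.limit, "n", 100, "number of results to output")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Rejecting non-positive values is an assumption of this sketch.
	if o.size < 1 || o.limit < 1 {
		return o, fmt.Errorf("-s and -n must be positive")
	}
	o.files = fs.Args()
	return o, nil
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("size=%d limit=%d files=%v\n", o.size, o.limit, o.files)
}
```

Using a dedicated FlagSet rather than the package-level flags keeps the parsing function easy to call from tests.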
This application is written in Go, using only the Go standard library. Follow the Go install guide appropriate for your OS.
Prerequisites:
- Go (1.8+)
You'll want to check the project out in your $GOPATH.
mkdir -p $GOPATH/src/github.com/samg
cd $GOPATH/src/github.com/samg
git clone git@github.com:samg/adlio-coding-exercise
cd adlio-coding-exercise
Tests should be run from the cmd/ngramfreq directory:
cd cmd/ngramfreq
go test
Compile and run the app from the cmd/ngramfreq directory:
cd cmd/ngramfreq
go build
./ngramfreq ./testdata/sonnets.txt
This project's architectural decisions are being tracked in Markdown files in the adr directory.
Below are the original requirements for this project:
- The program can be written in any language but should be easy to run.
- It should accept as arguments a list of one or more file paths (e.g. ./solution file1.txt file2.txt ...). The program also accepts input on stdin (e.g. cat file1.txt | ./solution).
- The program outputs a list of the 100 most common three word sequences in the text, along with a count of how many times each occurred in the text. For example:
  231 - i will not
  116 - i do not
  105 - there is no
  54 - i know not
  37 - i am not
- The program ignores punctuation, line endings, and is case insensitive (e.g. "I love\nsandwiches." should be treated the same as "(I LOVE SANDWICHES!!)").
- The program is capable of processing large files and runs as fast as possible.