wordCount

Exercise for a FunClub meetup

You'll be given a simple text file written in English (like this one: http://www.textlibrary.com/download/moby-dic.txt).

Your task is to write a little program that counts the occurrences of words and prints the 10 most frequent words, with their number of occurrences, to stdout, like so (the numbers are not correct!):

$ mysolution < moby-dic.txt
the: 50123
of: 10236
and: 9999
to: 4024
a: 3901
in: 2561
that: 2400
i: 2331
was: 2114
he: 1738

What is a word?

We'll assume that a word consists only of the characters a-z.
We don't distinguish between uppercase and lowercase, so it's okay to convert everything to lowercase.
Everything that is not in a-z can be considered a word boundary, which makes it easier for you to deal with commas, colons and the like.
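
As a rough illustration of these rules, here is a small Python sketch. The function name `tokenize` is made up for this example, and using a regular expression is just one of several ways to do it.

```python
import re

def tokenize(text):
    """Split text into words: lowercase first, then take every maximal run of a-z."""
    # Anything outside a-z (spaces, punctuation, digits, apostrophes)
    # acts as a word boundary and is simply skipped.
    return re.findall(r"[a-z]+", text.lower())

if __name__ == "__main__":
    print(tokenize("Call me Ishmael; it's 1851."))
    # prints ['call', 'me', 'ishmael', 'it', 's']
```

Note that under these rules "it's" becomes the two words "it" and "s", and numbers disappear entirely.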

What is the minimal requirement?

Write a minimal solution in your language that solves the task for moby-dic.txt
Make your solution presentable (comment your source or prepare a little slide)
Be able to explain in a few sentences:

how your solution works
what dependencies it has (non-standard libraries etc.)
in what way your solution benefits from something special about your language
and what its drawbacks are (if there are any)
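
To give an idea of the size of a minimal solution, here is one possible sketch in Python. It is not a reference solution; the names `top_words` and `main` are invented for this example, and it only uses the standard library.

```python
import re
import sys
from collections import Counter

WORD = re.compile(r"[a-z]+")  # a word is a maximal run of the characters a-z

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    return Counter(WORD.findall(text.lower())).most_common(n)

def main(stream=sys.stdin, out=sys.stdout):
    """Read the whole input at once and print one 'word: count' line per word."""
    for word, count in top_words(stream.read()):
        print(f"{word}: {count}", file=out)

# To use this as the `mysolution` program, call main() at the end of the file.
```

In this sketch the language-specific benefit is `collections.Counter`, which does both the counting and the top-10 selection. The drawbacks are that the whole input is held in memory and that the counts live in a mutable dictionary.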

What else can be done (optional)?

Performance:

Benchmark the running time of your solution. Either use the time command (man time) or, even better, show us how to benchmark in your language.
Where does your solution spend its time? IO? Garbage collection?
Improve that.
Benchmark again...
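
In Python, for instance, in-language benchmarking could look roughly like the sketch below. The helpers `count_words`, `time_it` and `profile_it` are hypothetical names for this example, and the input is a synthetic string rather than moby-dic.txt. `time.perf_counter` and `cProfile` are part of the standard library.

```python
import cProfile
import io
import pstats
import re
import time
from collections import Counter

def count_words(text):
    """The code under test: count every run of a-z in the lowercased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def time_it(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def profile_it(func, *args, limit=5):
    """Run func under cProfile and return a report of the costliest calls."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(limit)
    return report.getvalue()

if __name__ == "__main__":
    # Synthetic input, only for demonstration purposes.
    sample = "the quick brown fox jumps over the lazy dog " * 20_000
    print(f"best of 5 runs: {time_it(count_words, sample):.3f} s")
    print(profile_it(count_words, sample))
```

Taking the best of several runs reduces noise from other processes. The profile report then shows which calls the time goes to, which is where to start improving.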

Memory consumption:

It is fine to read moby-dic.txt into memory all at once, but what if we give you a corpus that exceeds your machine's memory? Fix this and tell us about it.
Since this is functional programming: What data structure did you use? Is it functional? Does it trigger heavy allocation and garbage collection? Find out and tell us about it.
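
One way to handle a corpus larger than memory is to read it in fixed-size chunks and carry a word that is cut off at a chunk boundary over to the next chunk. The following Python sketch shows the idea; `count_stream` and the chunk size are assumptions made for this example.

```python
import io
import re
from collections import Counter

WORD = re.compile(r"[a-z]+")

def count_stream(stream, chunk_size=1 << 16):
    """Count words from a text stream without loading it all at once."""
    counts = Counter()
    tail = ""  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = tail + chunk.lower()
        words = WORD.findall(text)
        if "a" <= text[-1] <= "z":
            # The chunk ends inside a word, so hold the last word back
            # until we know where it ends.
            tail = words.pop()
        else:
            tail = ""
        counts.update(words)
    if tail:
        counts[tail] += 1  # the input ended in the middle of a word
    return counts

if __name__ == "__main__":
    demo = io.StringIO("The whale, the WHALE! the sea")
    print(count_stream(demo, chunk_size=3).most_common(2))
    # prints [('the', 3), ('whale', 2)]
```

With this approach memory use grows with the number of distinct words rather than with the size of the corpus (apart from the pathological case of one enormous word). On the data structure question: `Counter` is a mutable dictionary that is updated in place, so it is not a functional (persistent) data structure.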
