wordCount
Exercise for a FunClub meetup
You'll be given a simple textfile written in English language (like this one: http://www.textlibrary.com/download/moby-dic.txt).
Your task is to write a litte program that counts the occurences of words and print the 10 most frequent words with their number of occurences to stdout, like so (numbers are not correct!):
$ mysolution < moby-dic.txt the: 50123 of: 10236 and: 9999 to: 4024 a: 3901 in: 2561 that: 2400 i: 2331 was: 2114 he: 1738
What is a word?
- we'll assume that a word consists just of the the characters from a-z
- we don't distinguish uppercase and lowercase, so it's okay to convert everything to lowercase
- everything that is not in a-z can be considered a word boundary, so it's easiert for you to deal with commas, colons and the like.
What is the minimal requirement?
- Write a minimal solution in your language that solves the task for the moby-dic.txt
- Make your solution presentable (comment your source or prepare a little slide)
- Be able to explain in a few sentences
- how your solution works
- what dependencies it has (non standard libraries etc)
- in what way your solution benefits from something special about your language
- and what its drawbacks are (if there are any)
What else can be done (optional)?
Performance:
- Benchmark your solution in regards of time consumption. Either use the time command (man time) or even show us how to benchmark in your language.
- What is your solution spending time with? IO? Garbage collection?
- Improve that.
- Benchmark again...
Memory consumption:
- It is fine to read moby-dic.txt all at one into memory, but what if we give you a corpus that does exceed your machines memory? Fix this. Tell us about.
- Since this is functional programming: What datastructure did you use? Is it functional? Does it trigger heavy allocation and garbage collection? Find out and tell us about.