Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author.
We can analyze one book by running the wordcount.py script with the name of the book we want to analyze:
$ ./wordcount.py Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt
We want to run this script on all the books we have copies of.
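To picture what "all the books" means in practice, here is a minimal sketch that lists one command per book. It assumes the books are stored as .txt files in a single directory and that wordcount.py takes the file name as its only argument, as in the example above. The function name build_commands is hypothetical and only used for illustration.

```python
from pathlib import Path


def build_commands(book_dir, script="./wordcount.py"):
    """Return one command (as an argument list) per .txt file in book_dir.

    Assumption: every book is a .txt file directly inside book_dir.
    """
    books = sorted(Path(book_dir).glob("*.txt"))
    # Each command is independent of the others: one book in, one run of the script.
    return [[script, book.name] for book in books]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is executed by accident.
    for command in build_commands("."):
        print(" ".join(command))
```

Each printed line corresponds to one independent run of the script, which is a useful way to think about the workload when answering the questions below.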
- What is the input set for this HTC workload?
- What is the output set?
- What other files might we need to think about?
Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in folders?