Organizing Files

Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author.

Our Workload

We can analyze one book by running the wordcount.py script with the name of that book's text file as its argument:

$ ./wordcount.py Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt

We want to run this script on every book in our collection.
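Running the script on the whole collection amounts to one command per book. The following is only a minimal sketch of that idea, not part of the exercise materials: it assumes the books are plain-text files ending in .txt that sit together in one directory, and the helper name build_commands is hypothetical.

```python
import shlex
from pathlib import Path


def build_commands(book_dir, script="./wordcount.py"):
    """Return one argument list per .txt file in book_dir, sorted by file name.

    Assumption: every book is a .txt file directly inside book_dir.
    """
    books = sorted(Path(book_dir).glob("*.txt"))
    return [[script, book.name] for book in books]


if __name__ == "__main__":
    # "." assumes the books are in the current directory (an assumption).
    for command in build_commands("."):
        # Print each command instead of running it, so the list can be checked first.
        print(shlex.join(command))
```

Each printed line has the same shape as the single-book command above. Because no command depends on any other, each one can be treated as a separate, independent job.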

  1. What is the input set for this high-throughput computing (HTC) workload?
  2. What is the output set?
  3. What other files might we need to think about?

Get Organized

Based on what you know about the script, its inputs, and its outputs, how would you organize this HTC workload into folders?