aws-emr-word-frequency

This example helps you get started with AWS Elastic MapReduce (EMR). It shows how to count word frequencies and produce a list of words sorted in ascending order, from the least to the most frequently used word. The application runs in two steps, so it needs two mapper functions and two reducer functions.
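The two-step flow described above can be sketched in plain Python, without EMR or mrjob. This is only an in-memory illustration of the idea, not the contents of word_frequency_sorted.py: the function names (map_words, reduce_counts, map_swap, reduce_sorted, shuffle) and the word-splitting regular expression are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed tokenizer for this sketch: lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def map_words(line):
    """Step 1 mapper: emit (word, 1) for every word on the line."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def reduce_counts(word, ones):
    """Step 1 reducer: sum the ones to get the word's total count."""
    yield word, sum(ones)


def map_swap(word, count):
    """Step 2 mapper: re-key each pair by its count."""
    yield count, word


def reduce_sorted(count, words):
    """Step 2 reducer: emit every word that shares this count."""
    for word in sorted(words):
        yield count, word


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key,
    then return the groups in ascending key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def word_frequency_sorted(lines):
    """Run both steps in memory and return (count, word) pairs,
    least frequent first."""
    mapped1 = (pair for line in lines for pair in map_words(line))
    counted = [out for word, ones in shuffle(mapped1)
               for out in reduce_counts(word, ones)]
    mapped2 = (pair for word, count in counted
               for pair in map_swap(word, count))
    return [out for count, words in shuffle(mapped2)
            for out in reduce_sorted(count, words)]


if __name__ == "__main__":
    for count, word in word_frequency_sorted(["the cat and the dog", "the cat"]):
        print(count, word)
```

The second step exists only to re-key the pairs by count, so that the sort between mapper and reducer orders the words by frequency. In this sketch the counts are compared as integers; a job running on a real cluster may compare keys as text, in which case the counts would need a sortable encoding such as zero-padding.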

Running the Application

  1. Locally, in the editor console, execute this command:
    !python word_frequency_sorted.py word_frequency_book.txt > wfs.txt
  2. Remotely on AWS, in the Canopy command terminal, execute this command:
    python word_frequency_sorted.py -r emr --conf-path=C:\Users\[user name]\.mrjob.config word_frequency_book.txt > wfs.txt
    Note that you can store the .mrjob.config file in any directory you choose, as long as --conf-path points to it.
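Because the path passed to --conf-path depends on the user and the operating system, it can help to build the remote command in Python rather than typing the path by hand. The following is a minimal sketch: build_emr_command is a hypothetical helper, and it assumes the configuration file is stored in the user's home directory unless another path is given.

```python
import os


def build_emr_command(script, input_file, conf_path=None):
    """Return the argument list for running the job on EMR.

    Hypothetical helper: if conf_path is not given, it assumes
    .mrjob.config lives in the current user's home directory.
    """
    if conf_path is None:
        conf_path = os.path.join(os.path.expanduser("~"), ".mrjob.config")
    return ["python", script, "-r", "emr",
            "--conf-path=" + conf_path, input_file]


if __name__ == "__main__":
    cmd = build_emr_command("word_frequency_sorted.py",
                            "word_frequency_book.txt")
    # Print the command instead of running it, so nothing is launched on AWS.
    print(" ".join(cmd))
```

To actually launch the job, the list could be passed to subprocess.run with stdout redirected to wfs.txt, which mirrors the shell redirection in the command above.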

Minimal .mrjob.config Example

The following is an example of a minimal configuration file.

runners:
  emr:
    ec2_key_pair: [keypairfile] # Name of your EC2 key pair
    ec2_key_pair_file: [C:\\dir\\keypairfile.pem] # Path of your key pair file
    aws_region: us-west-2
    ec2_instance_type: m1.small
    num_ec2_instances: 2
    ssh_tunnel_to_job_tracker: true

References