This repo provides scripts that help manage AWS instances. Specifically, if you have a large corpus to process, you can use AWS to create many instances at the same time, at a price you specify, and process the corpus in parallel. All instances are automatically stopped once your processing finishes, to avoid extra charges.
- Write `init.sh` (e.g., init_script.sh) and `main.sh` (e.g., runIllinoisTemporal.sh). `main.sh` must take the partition number as its first argument; other input arguments should be passed to it via the `--main_script_args` option of `runCluster.py`.
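The argument contract above can be sketched as follows. This is a minimal, hypothetical `main.sh` — the s3 path and the processing step are placeholders, not the actual runIllinoisTemporal.sh:

```shell
#!/bin/bash
# Minimal sketch of a main.sh (hypothetical; adapt the processing to your task).
# runCluster.py is expected to call: main.sh <partition-number> [values from --main_script_args]
main() {
  local partition=$1   # the partition number is always the first argument
  shift                # the remaining arguments come from --main_script_args
  echo "Processing partition ${partition} with extra args: $*"
  # A real script would now fetch the partition, process it, and upload results, e.g.:
  # aws s3 cp "s3://cogcomp-public-data/results/illinois-temporal/${partition}.ser.tgz" .
}

main 42 --lang en   # example invocation; prints: Processing partition 42 with extra args: --lang en
```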
- Copy or upload `main.sh` to s3: `aws s3 cp [path2main.sh] s3://cogcomp-public-data/scripts/main-[uniqueId].sh`, where `uniqueId` distinguishes different versions of the same script; today's date is a fine choice.
- Split the corpus into many partitions (a multiple of the number of instances you want to use). For example, if you want to use 100 instances, split the large corpus into 100*n partitions. Then save these partitions somewhere on s3, say `s3://cogcomp-public-data/results/illinois-temporal/[num].ser.tgz`. The script will evenly distribute the partitions across the instances.
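The splitting step can be sketched as follows. This is a hypothetical example: the corpus file, `n=2`, and the commented-out upload command are assumptions for illustration, and `split -n l/K` requires GNU coreutils:

```shell
#!/bin/bash
# Sketch: split a corpus into 100*n partitions for 100 instances (n=2 here).
seq 1 1000 > corpus.txt              # stand-in corpus for this demo

N_INSTANCES=100
N=2
N_PARTITIONS=$((N_INSTANCES * N))    # 200, a multiple of 100

mkdir -p partitions
# -n l/K: split into K chunks without breaking lines; -d -a 3: suffixes 000..199
split -d -a 3 -n l/${N_PARTITIONS} corpus.txt partitions/part-

# Package each partition and (in real use) upload it to s3:
for f in partitions/part-*; do
  num=${f##*-}
  tar czf "partitions/${num}.ser.tgz" -C partitions "part-${num}"
  # aws s3 cp "partitions/${num}.ser.tgz" "s3://cogcomp-public-data/results/illinois-temporal/${num}.ser.tgz"
done
```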
- Run `python runCluster.py` with arguments from the local computer. For a full list of arguments, see the `run()` function in `runCluster.py`.
  - `--key_name` is the name of your AWS key pair.
  - `--private_key_file` should match the `key_name` above; this file is typically `~/.ssh/keyname.pem`.
  - `--count` is the number of AWS instances you want to use.
  - `--instance_type` is the type of AWS instance you want to use.
  - `--price` is the price you are willing to pay per hour per instance of the type specified above. It should be slightly higher than the current spot price for that type.
  - `--init_script_path` is the init script that runs on all instances before any real processing starts (e.g., creating directories and downloading program files such as .jar files to each instance).
  - `--main_script_path` is the path of main.sh on each instance, which is actually determined by `init.sh`. Remember, in `init.sh`, `main-[uniqueId].sh` is downloaded to each instance (see this line).
  - `--input_s3_dir` and `--input_suffix` specify where the input files are stored on `s3://cogcomp-public-data/`. For example, if `s3://cogcomp-public-data/results/illinois-temporal/` contains many `[num].ser.tgz` files, then `input_s3_dir=results/illinois-temporal/` and `input_suffix=ser.tgz`.
  - `--output_s3_dir` and `--output_suffix` specify where the output files are saved on `s3://cogcomp-public-data/`. Note that `--output_suffix` can be multiple suffixes separated by spaces, e.g., `arg1 arg2`, which is useful when you want to save several types of output.
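Putting it together, an invocation might look like this. Every value below is a hypothetical placeholder; consult the `run()` function in `runCluster.py` for the authoritative argument list:

```shell
python runCluster.py \
    --key_name mykey \
    --private_key_file ~/.ssh/mykey.pem \
    --count 100 \
    --instance_type m4.large \
    --price 0.05 \
    --init_script_path init_script.sh \
    --main_script_path /home/ubuntu/main-20180101.sh \
    --main_script_args "--lang en" \
    --input_s3_dir results/illinois-temporal/ \
    --input_suffix ser.tgz \
    --output_s3_dir results/illinois-temporal-output/ \
    --output_suffix "arg1 arg2"
```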