In development: make Rake run on AWS EMR.
When running tasks on EMR, you have two choices.
One is the official AWS "add step" mechanism, but it cannot be applied to complex tasks.
The other is to log in to the cluster's master node, copy your scripts there, and run the tasks on the master. With this approach, you have to keep track of the cluster status yourself.
Cluster initialization costs a lot of time and money, so we want to test the whole task flow on a local Hadoop installation and then run it on AWS without changing any code.
This library takes care of all of that for you.
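To make the "same task, two targets" idea concrete, here is a minimal sketch of how a command could be routed either to the local machine or to the cluster master over ssh. It is not rake_emr's actual implementation: `command_for`, its keyword arguments, the host name, and the `hadoop` login user are all assumptions made for illustration.

```ruby
require "shellwords"

# Hypothetical helper, NOT part of rake_emr: builds the shell command line
# that runs `cmd` either on this machine or on a remote master node via ssh.
def command_for(cmd, on_local:, master_host: nil, key_file: nil)
  # Local mode: the command is run as-is against the local Hadoop.
  return cmd if on_local

  # Remote mode needs to know where to connect and which key to use.
  raise ArgumentError, "master_host and key_file are required" unless master_host && key_file

  # Escape every argument so the remote command travels as a single word.
  # The "hadoop" login user is an assumption for this sketch.
  ["ssh", "-i", key_file, "hadoop@#{master_host}", cmd].shelljoin
end

# Hypothetical values, for illustration only.
puts command_for("hadoop fs -ls /", on_local: true)
puts command_for("hadoop fs -ls /", on_local: false,
                 master_host: "master.example.com", key_file: "/tmp/key.pem")
```

The point of the design is that the task body only ever names the command; where it runs is decided in one place, so the Rakefile stays identical for local tests and for AWS.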
Add this line to your application's Gemfile:
gem 'rake_emr'
And then execute:
$ bundle
Or install it yourself as:
$ gem install rake_emr
# The cluster configuration is only needed when running on AWS.
unless ENV["on_local"]
  RakeEmr.default_config("chenkovsky.chen", "ime_user_profile", "s3://chenkovsky.chen/log")
end
RakeEmr.set_ssl_ca_file "/home/chenkovsky.chen/.ssh/chenkovsky.chen.pem"
# Directories whose scripts the tasks use.
RakeEmr.script_dirs << "scripts"

task :A do
  rsh "hadoop fs -ls .."
  adistcp "s3://....", "hdfs://....."
end
After configuring the parameters, run:
rake A # run on aws
rake A on_local=true # run on local hdfs
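Rake copies `NAME=value` arguments from the command line into `ENV`, which is how `on_local=true` reaches the `ENV["on_local"]` check in the Rakefile. The sketch below mirrors that check in a small helper; `run_mode` is a hypothetical name used only for illustration and is not part of rake_emr.

```ruby
# Hypothetical helper, NOT part of rake_emr: mirrors the ENV["on_local"]
# check from the Rakefile. Any value that is set selects local mode.
def run_mode(env = ENV)
  env["on_local"] ? :local : :emr
end

p run_mode({})                       # variable not set
p run_mode({ "on_local" => "true" })
p run_mode({ "on_local" => "false" }) # still a non-nil string
```

Because the check only tests whether the variable is set, `rake A on_local=false` would also select local mode: the string "false" is truthy in Ruby. To run on AWS, leave `on_local` out entirely.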
- Fork it ( https://github.com/chenkovsky/rake_emr/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request