Strata 2016 NYC - Hadoop in the Cloud Tutorial
==============================================

Introduction
------------

This repository contains everything needed to run through a two-step pipeline that ingests data via Spark and makes it available for interactive queries via Impala. The Spark step is expected to run on a transient cluster that is up for only a couple of hours each day (potentially triggered via cron). The Impala step is expected to run on a long-running, elastic cluster with multiple users who connect through Hue or the JDBC interface.

Common Settings
---------------

Start by going through the AWS Quickstart flow:

https://aws.amazon.com/quickstart/

http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html

Relevant identifiers will be available as CloudFormation outputs.

Go to the AWS console and make note of the EC2 instance called ClusterLauncher. This is your Director server instance.

SSH to the Director server instance:

    $ ssh ec2-user@<Director server IP> -i <PEM file whose keyname was provided to AWS Quickstart>

Download the conf files from this GitHub repo to your Director server instance.

Modify common.conf.sample by providing details specific to your AWS account. Use the information from the CloudFormation output. Use the ClusterLauncher security group, not the NAT security group. Alternatively, look at aws.sample.conf to see values that should go into common.conf.

Save this file as common.conf.

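Before moving on, it can help to confirm that the configuration files named in this tutorial are actually in place. The following is a minimal sketch, not part of the repository: the function name `check_conf_files` is hypothetical, and it assumes common.conf, spark.conf and impala.conf all sit in the same directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): report which of the
# tutorial's conf files are missing from a directory.
# Assumption: all three files live side by side in that directory.
check_conf_files() {
    local dir="${1:-.}"   # directory to inspect, defaults to the current one
    local missing=0
    local f
    for f in common.conf spark.conf impala.conf; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when every file is present, 1 otherwise.
    return "$missing"
}
```

Running `check_conf_files .` in the directory that holds the downloaded files prints nothing and returns success when all three are present, so it can be chained in front of other commands with `&&`.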
Run validation for both configuration files to ensure everything is configured properly:

    $ cloudera-director validate spark.conf
    $ cloudera-director validate impala.conf

Create a tunnel to Director from your local machine:

    $ ssh -C -L 7189:localhost:7189 ec2-user@<Cluster Launcher IP>
    # Use your browser to go to http://localhost:7189/

Data ingest via Spark
---------------------

Ask Director to set up the Spark cluster for ETL:

    $ cloudera-director bootstrap-remote spark.conf --lp.remote.username=admin
    # Director will ask for the admin password

Progress information is also available in the Director UI.

Establish a tunnel to Cloudera Manager:

    $ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP>

SSH into the master node, run the ingest script in the Spark shell, and download the Hive scripts:

    $ sudo -u hdfs -i bash
    $ curl -o ingest.scala https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/ingest.scala
    $ spark-shell -i ingest.scala
    $ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql
    $ curl -o copy_to_s3.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/copy_to_s3.hql

Modify the schema.hql file to point to a new S3 bucket created via the AWS console, then run both scripts:

    $ hive -f schema.hql
    $ hive -f copy_to_s3.hql

SQL via Impala on ingested data
-------------------------------

Ask Director to set up the Impala cluster for interactive queries:

    $ cloudera-director bootstrap-remote impala.conf --lp.remote.username=admin
    # Director will ask for the admin password

Establish a tunnel to Cloudera Manager:

    $ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP>

SSH into the master node and create the schema with Hive:

    $ sudo -u hdfs -i bash
    $ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql
    $ hive -f schema.hql

Start the impala-shell from a worker node. Identify the Impala worker node from the Cloudera Director UI.

    $ impala-shell -i <IP address of Impala worker node>
    # Run some interesting queries reading from S3
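
As a starting point for those queries, here is a hedged sketch that runs a few statements non-interactively using impala-shell's `-q` option. The function names `run_impala_query` and `first_look` and the `IMPALA_SHELL` override variable are hypothetical, and the table name must be one that schema.hql actually creates (this README does not list them). Because the schema is created through Hive, Impala may need `INVALIDATE METADATA` before the tables become visible, so the sketch issues it first.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repo) around impala-shell.
# IMPALA_SHELL may be overridden, e.g. for a dry run; it defaults to the real client.
IMPALA_SHELL="${IMPALA_SHELL:-impala-shell}"

# Run a single SQL statement against one Impala daemon, then exit.
#   $1: address of the Impala worker node
#   $2: SQL statement
run_impala_query() {
    "$IMPALA_SHELL" -i "$1" -q "$2"
}

# Take a first look at a table that was defined through Hive.
#   $1: address of the Impala worker node
#   $2: table name (assumption: must match a table created by schema.hql)
first_look() {
    local host="$1" table="$2"
    # Tables created via Hive may not be visible to Impala until its
    # metadata cache is refreshed.
    run_impala_query "$host" "INVALIDATE METADATA" &&
    run_impala_query "$host" "SHOW TABLES" &&
    run_impala_query "$host" "DESCRIBE $table" &&
    run_impala_query "$host" "SELECT COUNT(*) FROM $table"
}
```

For example, `first_look <IP address of Impala worker node> <table name>` lists the tables, shows the column layout of one of them, and counts its rows; each step runs only if the previous one succeeded.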