touristDB


Udacity Data Engineering Capstone

Introduction

Millions of international tourists travel to the United States every year on routes connecting their home countries with major U.S. cities. The board of a large airline wants to know whether the existing routes satisfy current demand and whether new routes should be added. To answer these questions, a dataset needs to be built and analyzed. This project builds a demo of such a dataset with Spark.

The dataset includes data from three sources: I94 immigration records, city temperature data, and airline information. The corresponding tables are described below.

Chosen tools and data models

Apache Spark

Spark is fast compared to other frameworks because it runs in cluster mode and distributes processing and computation across nodes. Processing the immigration data took only 36 minutes on the Spark cluster, while my personal computer needed several hours. Spark can also read sas7bdat files, which is the file type of the immigration data.
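As a minimal sketch of loading the raw data, assuming the third-party spark-sas7bdat package is available on the cluster (the package version and file path below are assumptions):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("touristDB")
             # spark-sas7bdat is a third-party package; the version is an assumption
             .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
             .getOrCreate())

    # Load one month of raw I94 immigration data (placeholder path)
    immigration_raw = (spark.read
                       .format("com.github.saurfang.sas.spark")
                       .load("i94_apr16_sub.sas7bdat"))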

AWS EMR

Amazon EMR offers an expandable, low-configuration service as an easier alternative to running an in-house computing cluster. With Amazon EMR, I don't need to maintain a Hadoop cluster myself, and Apache Spark and Apache Zeppelin come pre-configured on the cluster.

Star Schema

The data in this dataset is well structured. With a star schema, the database can serve many different queries without changing the schema structure.

Table preprocessing steps

Immigration table

  1. Replace the reference numbers in the raw immigration table with information from the I94_SAS_Labels_Descriptions.SAS file.
  2. Split the port column into city_port and state_port columns.
  3. Reformat the arrival date and departure date.
  4. Calculate the number of days each immigrant stayed.
  5. Some immigrants' address information is missing (49 in 1,000 records); replace it with the state_port column.
  6. Assume the address is the state the immigrant most wants to visit; this assumption is important in my analysis.
  7. Extract useful columns and write the table ordered by address and partitioned by month (a PySpark sketch of these steps follows the column list below).
  8. Include 15 columns: id(primary key), year, month, resident, city_port, state_port, model, address, age, visa, gender, airline, day_stayed, arrival_date, depardure_date
  • id: immigration record id.
  • arrival_date: date of entry.
  • depardure_date: date of departure.
  • year, month: year and month of entry.
  • resident: the country where the immigrant lived.
  • city_port, state_port: the city and state of the port of entry in the USA.
  • model: the mode of travel to America: Air, Sea, Land, or Not reported.
  • address: the address in the USA where the immigrant can be contacted.
  • age: the age of the immigrant.
  • visa: the visa type of the immigrant: Business, Pleasure, or Student.
  • gender: the gender of the immigrant.
  • airline: the airline code of the flight the immigrant took, if they arrived by air.
  • day_stayed: the number of days the immigrant stayed in the USA.
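
A minimal sketch of steps 2, 4, 5, and 7, assuming the raw data has already been loaded into a dataframe named immigration_raw and that the raw port column has the form "CITY, STATE" (both assumptions):

    from pyspark.sql.functions import col, split, trim, datediff, coalesce

    immigration = (immigration_raw
        # step 2: split the port column into city_port and state_port
        .withColumn("city_port", trim(split(col("port"), ",").getItem(0)))
        .withColumn("state_port", trim(split(col("port"), ",").getItem(1)))
        # step 4: days stayed (assumes the dates were converted to date type in step 3)
        .withColumn("day_stayed", datediff(col("depardure_date"), col("arrival_date")))
        # step 5: fall back to state_port when the address is missing
        .withColumn("address", coalesce(col("address"), col("state_port"))))

    # step 7: write the result ordered by address and partitioned by month
    (immigration.orderBy("address")
        .write.mode("overwrite")
        .partitionBy("month")
        .option("header", True)
        .csv("hdfs:///immigration_data/immigration.csv"))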

Temperature table

  1. Keep only rows where the country is "United States".
  2. Keep only records after 2000-01-01; because the climate changes over time, statistics from too long ago are not meaningful.
  3. Order the data by month and city.
  4. Extract useful columns.
  5. Include 4 columns: year, month, city, average_temperature. Together, year, month, and city form the primary key. (A sketch of these steps follows this list.)
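
A minimal sketch of the temperature preprocessing, assuming the raw city-temperature file has been read into a dataframe named temperature_raw with columns dt, City, Country, and AverageTemperature (the raw column names are an assumption):

    from pyspark.sql.functions import col, to_date, year, month

    temperature = (temperature_raw
        .withColumn("dt", to_date(col("dt")))
        .filter(col("Country") == "United States")               # step 1
        .filter(col("dt") >= "2000-01-01")                        # step 2
        .withColumn("year", year(col("dt")))
        .withColumn("month", month(col("dt")))
        .select("year", "month", "City", "AverageTemperature")    # steps 4-5
        .orderBy("month", "City"))                                # step 3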

Airline table

Include 5 columns: id(primary key), full_name, code, country, finantial

The code column is used to join with the immigration table.
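
For illustration, assuming the immigration and airline tables are loaded as dataframes named record and airline (as in the query examples later in this README), the join looks like this:

    # join the immigration fact table to the airline dimension on the airline code
    record.join(airline, record.airline == airline.code, how="inner")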

(Figure: Star_schema — diagram of the star schema described above)

Getting Started

Prerequisites

  1. Python 2.7 or above.
  2. An AWS account.
  3. The AWS Command Line Interface (CLI).

Procedures

  1. Upload the data to S3.
  2. Set up an EMR cluster:

     aws emr create-cluster --name spark-cluster --use-default-roles --release-label emr-5.28.0 --applications Name=Spark --ec2-attributes KeyName=spark-cluster --instance-type m5.xlarge --instance-count 3

  3. Move the data from S3 to HDFS:

     aws emr add-steps --cluster-id j-XXXXXXXX --steps file://./Desktop/myStep.json

     myStep.json:

     [
       {
         "Name": "S3DistCp step",
         "Args": ["s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3://XXXX/airline_data", "--dest=hdfs:///airline_result", "--srcPattern=.*[a-zA-Z,]+"],
         "ActionOnFailure": "CONTINUE",
         "Type": "CUSTOM_JAR",
         "Jar": "command-runner.jar"
       }
     ]

  4. Upload the scripts (the immigration table and temperature table scripts) to the EMR master node and run them:

     scp -i XXX.pem XXX/immigration_table.py hadoop@XXXXXXXXXX.us-west-2.compute.amazonaws.com:/home/hadoop

  5. Move the results from HDFS to S3:

     s3-dist-cp --src hdfs:///immigration_data/immigration.csv --dest s3://XXX/immigration_result/immigration.csv

  6. Use Apache Zeppelin to visualize the dataset.

Example usage

Many casinos in Las Vegas are making plans to attract international tourists to gamble and stay in their hotels. They first want to analyze statistics about their customers by mining the US I94 immigration data.

Their plan includes

  1. Offering discounts on international flight tickets landing in Las Vegas. For this, they need to know which airlines tourists take to fly to Las Vegas.

  2. Opening potentially popular international flight routes. For this, they need information about tourists whose landing port is not Las Vegas but whose address is Las Vegas.

  3. Analyzing the influence of temperature on the number of tourists. For this, they need tourist counts and temperatures for each month.

Dataset Mining
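
The query snippets below assume that the processed tables have been loaded back into Spark dataframes named record, airline, and tempereture and that the aggregate functions are imported. A minimal setup sketch (the airline and temperature output locations are assumptions):

    from pyspark.sql.functions import count, avg, desc

    record = spark.read.csv("s3://XXX/immigration_result/immigration.csv",
                            header=True, inferSchema=True)
    # the following output locations are assumptions
    airline = spark.read.csv("s3://XXX/airline_result/airline.csv",
                             header=True, inferSchema=True)
    tempereture = spark.read.csv("s3://XXX/temperature_result/temperature.csv",
                                 header=True, inferSchema=True)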

  1. Count tourist groups by the airlines they took:

     record.filter(record.state_port == "NEVADA") \
         .filter(record.visa == "Pleasure") \
         .groupBy(record.airline) \
         .agg(count("id").alias("count")) \
         .sort(desc("count")).limit(20) \
         .join(airline, record.airline == airline.code, how="inner") \
         .select("airline", "count", "full_name", "country") \
         .sort(desc("count"))

    (Figure: example1 — query result)

    With the figure above, it is easy to find a business partner.

  2. Count and compare tourists whose port state is Nevada and tourists whose address is Nevada:

     record.filter(record.address == "NEVADA") \
         .filter(record.visa == "Pleasure") \
         .groupBy(record.resident) \
         .agg(count("id").alias("count")) \
         .sort(desc("count")).limit(10)

     record.filter(record.state_port == "NEVADA") \
         .filter(record.visa == "Pleasure") \
         .groupBy(record.resident) \
         .agg(count("id").alias("count")) \
         .sort(desc("count")).limit(10)

    (Figure: example2 — query results)

    Comparing the two tables above shows that many Japanese tourists came to Nevada in 2016, while only a very small number of them entered the United States through a port in Nevada. There may not be enough air routes connecting Japan and Nevada.

  3. Plot the number of tourists against temperature:

     tempereture1 = tempereture.filter(tempereture.City == "Las Vegas") \
         .groupBy(tempereture.month) \
         .agg(avg("AverageTemperature").alias("AverageTempereture"))

     record.filter(record.address == "NEVADA") \
         .filter(record.visa == "Pleasure") \
         .groupBy(record.month) \
         .agg(count("id").alias("count")) \
         .join(tempereture1, "month", how="inner") \
         .select("AverageTempereture", "count") \
         .orderBy("AverageTempereture")

    (Figure: example3 — query result)

    Comparing the two tables above suggests that tourists prefer to travel when Nevada has mild temperatures.

This dataset can also be used by other American cities to analyze their international tourists.

Other Scenarios

  1. The data is increased by 100x: by splitting the data into several parts before running it on EMR, the cluster can handle this amount of data. My script ran for 36 minutes on an EMR cluster with three nodes, and a 100x larger dataset can be handled by adding more nodes.

  2. The pipelines must run daily by 7 am: an Airflow pipeline can be built for this scenario (see the sketch after this list).

  3. The database must be accessed by 100+ people: a dedicated Amazon Redshift cluster can host the database so that authorized users can access the data easily.
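
A minimal Airflow sketch for the daily 7 am run, assuming the Spark scripts live on the master node at the paths used above (the DAG id, start date, script names, and task layout are assumptions):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="touristdb_daily",
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 7 * * *",   # every day at 7 am
        catchup=False,
    ) as dag:
        # hypothetical tasks that launch the two Spark scripts with spark-submit
        immigration_task = BashOperator(
            task_id="immigration_table",
            bash_command="spark-submit /home/hadoop/immigration_table.py",
        )
        temperature_task = BashOperator(
            task_id="temperature_table",
            bash_command="spark-submit /home/hadoop/temperature_table.py",
        )
        immigration_task >> temperature_task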