Millions of international tourists travel to the United States every year on routes connecting their home countries with major US cities. The board of one large airline wants to know whether the existing routes satisfy present demand and whether new routes should be added. To answer these questions, a dataset needs to be built and analyzed. This project builds a demo of that pipeline with Spark.
This data comes from the US National Tourism and Trade Office. https://travel.trade.gov/research/reports/i94/historical/2016.html
International tourist information comes from this data.
This data comes from Kaggle. https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
Temperatures of major cities in the United States are extracted from this data.
This data comes from OpenFlights. https://openflights.org/data.html
Airline information, such as each airline's country, comes from this data.
Spark is very fast compared to other frameworks because it runs in cluster mode and distributes processing and computation internally. Processing the immigration data took only 36 minutes on the Spark cluster, while my personal computer took several hours. Spark can also read sas7bdat files, which is the immigration data's file format.
Amazon EMR offers an expandable, low-configuration service as an easier alternative to running an in-house computing cluster. With Amazon EMR, I don't need to maintain a Hadoop cluster myself, and Apache Spark and Apache Zeppelin come pre-configured on the cluster.
The data in this dataset is well structured. With a star schema, the database can serve many different queries without changing the schema.
- Replace the reference numbers in the raw immigration table with the corresponding values from the I94_SAS_Labels_Descriptions.SAS file.
- Split the port column into city_port and state_port columns.
- Reformat the arrival and departure dates.
- Calculate the number of days each immigrant stayed.
- Some immigrants' address information is missing (49 in 1,000); replace it with the state_port column.
- Assume the address is the state the immigrant most wants to visit; this assumption is important in my analysis.
- Extract the useful columns and write the table ordered by address and partitioned by month.
- Include 15 columns: id (primary key), year, month, resident, city_port, state_port, model, address, age, visa, gender, airline, day_stayed, arrival_date, depardure_date
- id: immigration record id.
- arrival_date: date of entry.
- depardure_date: date of departure.
- year, month: year and month of entry.
- resident: the country where the immigrant lived.
- city_port, state_port: the city and state of the port of entry in the USA.
- model: the way the immigrant came to America: Air, Sea, Land, or Not reported.
- address: the address in the USA where the immigrant can be contacted.
- age: the age of the immigrant.
- visa: the immigrant's visa type: Business, Pleasure, or Student.
- gender: the gender of the immigrant.
- airline: the airline code of the flight the immigrant took, if they came to America by air.
- day_stayed: the number of days the immigrant stayed in the USA.
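The date handling in the steps above can be sketched in plain Python. This assumes the raw arrival and departure dates are SAS numeric dates (days since 1960-01-01), which is how the I94 data encodes them; the sample values below are made up:

```python
from datetime import date, timedelta

# SAS stores dates as the number of days since 1960-01-01
SAS_EPOCH = date(1960, 1, 1)

def sas_to_iso(days):
    """Convert a SAS numeric date to an ISO yyyy-mm-dd string (None stays None)."""
    if days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(days))).isoformat()

def days_stayed(arrdate, depdate):
    """Days between arrival and departure; None if either date is missing."""
    if arrdate is None or depdate is None:
        return None
    return int(depdate) - int(arrdate)
```

In the real pipeline the same conversion would be applied per row as a Spark UDF or column expression.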
- Filter rows where the country is "United States".
- Keep only records dated after 2000-01-01; because the climate changes over time, statistics from too long ago are not meaningful.
- Order the data by month and city.
- Extract the useful columns.
- Include 4 columns: year, month, city, average_temperature. year, month, and city together form the primary key.
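A minimal stdlib-Python sketch of the filtering and aggregation above (the project itself does this in Spark; the sample rows here are invented):

```python
from collections import defaultdict

# Made-up sample rows mimicking the Kaggle temperature data layout
rows = [
    {"dt": "2013-07-01", "AverageTemperature": 33.1, "City": "Las Vegas", "Country": "United States"},
    {"dt": "2013-07-15", "AverageTemperature": 35.3, "City": "Las Vegas", "Country": "United States"},
    {"dt": "1999-07-01", "AverageTemperature": 32.0, "City": "Las Vegas", "Country": "United States"},
    {"dt": "2013-07-01", "AverageTemperature": 25.0, "City": "Toronto",   "Country": "Canada"},
]

# Keep US rows after 2000-01-01, then average temperature per (year, month, city)
groups = defaultdict(list)
for r in rows:
    if r["Country"] == "United States" and r["dt"] >= "2000-01-01":
        year, month = int(r["dt"][:4]), int(r["dt"][5:7])
        groups[(year, month, r["City"])].append(r["AverageTemperature"])

table = {key: sum(temps) / len(temps) for key, temps in groups.items()}
```

The (year, month, city) dictionary key corresponds to the composite primary key of the temperature table.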
Include 5 columns: id (primary key), full_name, code, country, finantial
The code column is used to join with the immigration table.
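The join on the code column can be illustrated with a stdlib-Python sketch (made-up rows; the project performs this join in Spark):

```python
# Made-up airline dimension rows keyed by code, and immigration fact rows
airlines = {
    "JL": {"full_name": "Japan Airlines", "country": "Japan"},
    "DL": {"full_name": "Delta Air Lines", "country": "United States"},
}
records = [
    {"id": 1, "airline": "JL"},
    {"id": 2, "airline": "DL"},
    {"id": 3, "airline": "ZZ"},  # a code with no match is dropped (inner join)
]

# Inner join: keep only records whose airline code exists in the dimension table
joined = [
    {**r, **airlines[r["airline"]]}
    for r in records
    if r["airline"] in airlines
]
```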
- Python 2.7 or above.
- AWS Account.
- AWS Command Line Interface (CLI).
- Upload the data to S3.
- Set up an EMR cluster.
- Move the data from S3 to HDFS.
- Upload the scripts (immigration_table and temperature_table) to the EMR master node and run them.
- Move the results from HDFS to S3.
- Use Apache Zeppelin to visualize the dataset.
aws emr create-cluster --name spark-cluster --use-default-roles --release-label emr-5.28.0 --instance-count 3 --applications Name=Spark --ec2-attributes KeyName=spark-cluster --instance-type m5.xlarge
aws emr add-steps --cluster-id j-XXXXXXXX --steps file://./Desktop/myStep.json
JSON file (myStep.json):
[
  {
    "Name": "S3DistCp step",
    "Args": [
      "s3-dist-cp",
      "--s3Endpoint=s3.amazonaws.com",
      "--src=s3://XXXX/airline_data",
      "--dest=hdfs:///airline_result",
      "--srcPattern=.*[a-zA-Z,]+"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar"
  }
]
scp -i XXX.pem XXX/immigration_table.py hadoop@XXXXXXXXXX.us-west-2.compute.amazonaws.com:/home/hadoop
s3-dist-cp --src hdfs:///immigration_data/immigration.csv --dest s3://XXX/immigration_result/immigration.csv
Many casinos in Las Vegas are making plans to attract international tourists to play and stay in their hotels. They first want to analyze statistics about their customers by mining the US I94 immigration data.
- Offering discounts on international flight tickets landing in Las Vegas.
- Opening potentially popular international flight routes.
- Analyzing the influence of temperature on the number of tourists.
- They need to know which airlines tourists take to fly to Las Vegas.
- They need information about tourists whose landing port is not Las Vegas but whose address is Las Vegas.
- They need to know the number of tourists and the temperature for each month.
- Count tourists grouped by the airline they took.
- Count and compare tourists whose port state is Nevada with tourists whose address is Nevada.
- Plot the number of tourists against temperature.
(record.filter(record.port_state == "NEVADA")
    .filter(record.visa == "Pleasure")
    .groupBy(record.airline)
    .agg(count("id").alias("count"))
    .sort(desc("count"))
    .limit(20)
    .join(airline, record.airline == airline.code, how="inner")
    .select("airline", "count", "full name", "country")
    .sort(desc("count")))
With the figure above, it is easy to identify potential business partners.
record.filter(record.address=="NEVADA").filter(record.visa=="Pleasure").groupBy(record.resident).agg(count("id").alias("count")).sort(desc("count")).limit(10)
record.filter(record.port_state=="NEVADA").filter(record.visa=="Pleasure").groupBy(record.resident).agg(count("id").alias("count")).sort(desc("count")).limit(10)
Comparing the two tables above, many Japanese tourists came to Nevada in 2016, while only a very small number of them entered the United States through a port in Nevada. There may not be enough air routes connecting Japan and Nevada.
tempereture1 = tempereture.filter(tempereture.City=="Las Vegas").groupBy(tempereture.month).agg(avg("AverageTemperature").alias("AverageTempereture"))
(record.filter(record.address == "NEVADA")
    .filter(record.visa == "Pleasure")
    .groupBy(record.month)
    .agg(count("id").alias("count"))
    .join(tempereture1, "month", how="inner")
    .select("AverageTempereture", "count")
    .orderBy("AverageTempereture"))
From the result above, it seems that tourists prefer to travel when Nevada has mild temperatures.
- The data is increased by 100x.
- The pipeline must run daily by 7 am.
- The database needs to be accessed by 100+ people.
By splitting the data into several parts before running it on EMR, the cluster can handle this amount of data. My script ran for 36 minutes on an EMR cluster with three nodes; a 100x larger dataset can be handled by adding more nodes.
An Airflow pipeline can be built to handle this scenario.
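A minimal sketch of such a DAG, assuming Airflow 2.x; the task names, bucket names, and script paths are hypothetical placeholders, not part of the original project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily ETL DAG; the cron expression "0 7 * * *" triggers
# a run every day at 7:00 am.
with DAG(
    dag_id="immigration_etl",
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 7 * * *",
    catchup=False,
) as dag:
    s3_to_hdfs = BashOperator(
        task_id="s3_to_hdfs",
        bash_command="s3-dist-cp --src s3://XXX/immigration_data --dest hdfs:///immigration_data",
    )
    run_etl = BashOperator(
        task_id="run_immigration_table",
        bash_command="spark-submit /home/hadoop/immigration_table.py",
    )
    hdfs_to_s3 = BashOperator(
        task_id="hdfs_to_s3",
        bash_command="s3-dist-cp --src hdfs:///immigration_result --dest s3://XXX/immigration_result",
    )

    # Run the three steps in order: stage in, transform, stage out
    s3_to_hdfs >> run_etl >> hdfs_to_s3
```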
A provisioned Redshift cluster can be built to host the database, so authorized users can easily access the data.