
Performed business operations using Big Data technologies: AWS EMR, AWS RDS (MySQL), Hadoop, Apache Sqoop, Apache HBase, MapReduce

Primary Language: Python

Unlocking Insights with NYC Yellow Taxi Data using Big Data Technologies 🚖


In this repository, we leverage the power of Big Data technologies to perform data-driven business operations on the NYC Yellow Taxi dataset. Our toolkit includes industry-standard tools and services such as:

AWS EMR: Harness the scalability of Amazon Elastic MapReduce for efficient data processing and analysis.

AWS RDS (MySQL): Store and manage structured data seamlessly with Amazon RDS, a reliable and high-performance database service.

Hadoop: Utilize the robust Hadoop ecosystem for distributed data storage and processing.

Apache Sqoop: Streamline data ingestion between Hadoop and relational databases effortlessly.

Apache HBase: Leverage the NoSQL capabilities of Apache HBase for high-speed, random read/write access to your data.

MapReduce: Implement MapReduce algorithms to extract valuable insights from massive datasets.



  • The project was broken down into the following 4 tasks

  • Please refer to attached files for detailed explanations with code samples and screenshots

Task 1: Setting up the environment and loading data

  • I created an RDS (Relational Database Service) instance on my AWS account and uploaded data to the RDS instance

    • I created an appropriate schema for the data sets to upload them to RDS
  • I created an AWS EMR cluster with the services listed above

    • I used m4.xlarge instances with ample storage, since we are working with a huge data set
    • I used a single master node instead of a multi-node cluster to limit my AWS credit consumption
  • I then proceeded to connect RDS with the EMR instance

  • I then logged into RDS through the EMR instance

  • I created the "yellow_taxi" database followed by the table "trip_records"

  • I then downloaded the data files onto the EMR cluster using the wget "url" command

  • To load the data into the MySQL table, I logged in and ran the appropriate SQL commands

  • I confirmed the data was loaded into the table by running simple SQL queries and observing the outputs
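The load and verification steps above could look roughly like the following. This is a minimal sketch, not the SQL actually used in this project: it assumes the files are comma-separated with a header row and that MySQL's `LOAD DATA LOCAL INFILE` statement is used. The helper names and the file path in the usage line are hypothetical.

```python
def build_load_statement(csv_path: str, table: str = "trip_records") -> str:
    """Build a MySQL LOAD DATA statement for one CSV file (assumed to have a header row)."""
    if "'" in csv_path:
        # Keep the sketch simple: refuse paths that would break the quoted literal.
        raise ValueError("csv_path must not contain single quotes")
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES;"
    )


def build_count_query(table: str = "trip_records") -> str:
    """Build a simple row-count query to confirm that the load produced rows."""
    return f"SELECT COUNT(*) FROM {table};"


if __name__ == "__main__":
    # Hypothetical path; substitute the file downloaded with wget.
    print(build_load_statement("/home/hadoop/trips.csv"))
    print(build_count_query())
```

The printed statements can be pasted into the MySQL client session opened from the EMR instance, after selecting the "yellow_taxi" database.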

Task 2: Ingesting data from RDS into the HBase table using Sqoop

  • First, I logged into the EMR instance and completed the initial setup steps

    • Next, I installed the MySQL connector JAR file, running the appropriate step to extract the MySQL connector tar file

    • I then went to the MySQL connector directory and copied the connector to the Sqoop library to complete the installation

  • Having installed the MySQL connector, I set up MySQL on the EMR cluster and proceeded

  • I ran the appropriate commands to ingest data from MySQL RDS into the HBase table
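As an illustration of the ingestion step above, the sketch below assembles a `sqoop import` command that writes a MySQL table into HBase. It is a sketch under stated assumptions, not the exact command used here: the RDS endpoint placeholder, user name, HBase table name, column family, and row-key column are all hypothetical. The flags themselves (`--connect`, `--username`, `-P`, `--table`, `--hbase-table`, `--column-family`, `--hbase-row-key`, `--hbase-create-table`, `-m`) are standard Sqoop import options.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, hbase_table,
                       column_family, row_key, num_mappers=1):
    """Return the argument list for a Sqoop import from MySQL into an HBase table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                          # prompt for the password instead of embedding it
        "--table", table,              # source table in MySQL
        "--hbase-table", hbase_table,  # destination table in HBase
        "--column-family", column_family,
        "--hbase-row-key", row_key,    # source column used as the HBase row key
        "--hbase-create-table",        # create the HBase table if it does not exist
        "-m", str(num_mappers),        # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders (assumptions), not the project's settings.
    command = build_sqoop_import(
        jdbc_url="jdbc:mysql://<rds-endpoint>:3306/yellow_taxi",
        username="admin",
        table="trip_records",
        hbase_table="trip_records_hbase",
        column_family="cf",
        row_key="trip_id",
    )
    print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting problems, or printed with `shlex.join` and run by hand on the EMR master node.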

Task 3: Bulk importing subsequent files into the HBase table

  • I bulk imported data from the subsequent files in the dataset, stored on the EMR cluster, into the HBase table using Python code

  • See the Python code (batch_ingest.py) used
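For readers who want a feel for what a batch import can look like, here is a minimal sketch. It is not the contents of batch_ingest.py: it assumes the third-party `happybase` client and a running HBase Thrift server, and the column family name `cf`, the integer row-key scheme, and the function names are hypothetical choices made for illustration.

```python
import csv


def row_to_put(row_number, record, column_family="cf"):
    """Convert one CSV record (a dict) into an HBase row key and a cell mapping."""
    row_key = str(row_number).encode("utf-8")
    data = {
        f"{column_family}:{name}".encode("utf-8"): value.encode("utf-8")
        for name, value in record.items()
        if name is not None and value is not None  # skip malformed or short rows' gaps
    }
    return row_key, data


def ingest_csv(path, table_name, host="localhost", batch_size=1000, start_key=0):
    """Write every row of a CSV file to an HBase table in batches; return the row count."""
    import happybase  # third-party; imported here so row_to_put works without it

    connection = happybase.Connection(host)
    count = 0
    try:
        table = connection.table(table_name)
        # The batch sends its buffered puts every batch_size rows and again on exit.
        with open(path, newline="") as handle, table.batch(batch_size=batch_size) as batch:
            for offset, record in enumerate(csv.DictReader(handle)):
                row_key, data = row_to_put(start_key + offset, record)
                batch.put(row_key, data)
                count += 1
    finally:
        connection.close()
    return count
```

Batching matters here because sending one put per row would mean one round trip to the Thrift server per trip record. When loading several files, passing a `start_key` that continues from the previous file's row count avoids overwriting earlier rows.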

Task 4: Using MapReduce to perform data analysis on files downloaded to the EMR instance

  • Please refer to the MapReduceTasks pdf file for a detailed approach with screenshots

  • Please refer to the corresponding mrtask_#.py files for the Python code used

  • The following business questions were explored:

    • mrtask_a) Which vendor has the most trips, and what is the total revenue generated by that vendor?

    • mrtask_b) Which pickup location generates the most revenue?

    • mrtask_c) What are the different payment types used by customers and their count?

    • mrtask_d) What is the average trip time for different pickup locations?

    • mrtask_e) Calculate the average tip-to-revenue ratio of the drivers for different pickup locations, in sorted order

    • mrtask_f) How does revenue vary over time? Calculate the average trip revenue per month, analyzed by hour of the day (day vs night) and day of the week (weekday vs weekend)
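To make the map and reduce steps concrete, the sketch below answers a question like mrtask_c (payment types and their counts) in plain Python, simulating the shuffle phase in memory. It is an illustration only, not the code in the mrtask_#.py files: the column name `payment_type`, the sample rows, and the function names are assumptions.

```python
import csv
from collections import defaultdict
from io import StringIO


def mapper(record):
    """Emit (payment_type, 1) for each trip record."""
    yield record["payment_type"], 1


def reducer(key, values):
    """Sum the counts emitted for one payment type."""
    return key, sum(values)


def run_job(lines):
    """Run mapper and reducer over CSV lines, grouping by key in between."""
    groups = defaultdict(list)
    for record in csv.DictReader(lines):
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle/sort phase
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up sample rows for illustration only.
sample = StringIO(
    "VendorID,payment_type,total_amount\n"
    "1,1,12.5\n"
    "2,2,8.0\n"
    "1,1,20.3\n"
)
counts = run_job(sample)
print(counts)  # {'1': 2, '2': 1}
```

The same shape carries over to the averaging questions: the mapper would emit a (sum, count) pair per key and the reducer would add the pairs before dividing, since averages of averages are not correct when groups differ in size.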