This repository contains the final project for the Big Data course of a Computer Science degree. To answer a set of questions and extract insights by processing, summarizing, and grouping a large dataset, PySpark was used, and two equivalent solutions are presented: one using RDDs (resilient distributed datasets) and the other using SparkSQL.
You and your team were hired to process data using Apache Spark. Your company has access to a dataset with commercial transactions between countries during the past 30 years. For each transaction, the dataset contains the following variables:
| Variable (column) | Description |
|---|---|
| Country | Country involved in the commercial transaction |
| Year | Year in which the transaction took place |
| Commodity code | Commodity identifier |
| Commodity | Commodity description |
| Flow | Flow type, e.g. Export or Import |
| Price | Price, in USD |
| Weight | Commodity weight |
| Unit | Unit in which the commodity is measured, e.g. number of items |
| Amount | Commodity amount, given in the aforementioned unit |
| Category | Commodity category, e.g. Live animals |
The dataset has over 8 million instances (rows, i.e. commercial transactions) and is made available in CSV format, with columns separated by semicolons (";").
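A minimal loading sketch is shown below. The file name `transactions.csv` and the column layout are assumptions taken from the table above; the real path and header may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trade-transactions").getOrCreate()

# DataFrame path: read the semicolon-separated CSV, using the first row as header.
df = spark.read.csv("transactions.csv", sep=";", header=True, inferSchema=True)

# RDD path: read the raw text, drop the header line, and split each row on ";".
lines = spark.sparkContext.textFile("transactions.csv")
header = lines.first()
rdd = lines.filter(lambda line: line != header).map(lambda line: line.split(";"))
```

Note that `inferSchema=True` makes Spark scan the file to guess column types; for a file with over 8 million rows it may be preferable to declare an explicit schema.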
Given the aforementioned context, you are in charge of developing a set of solutions that allow the company to answer the following demands (a minimal RDD sketch for the first two demands is given after the list):
- 1. The number of transactions involving Brazil;
- 2. The number of transactions per year;
- 3. The most commercialized commodity (summing the quantities) in 2016, per flow type;
- 4. The average commodity value per year;
- 5. The average price of commodities per unit type, year, and category in the export flow in Brazil;
- 6. The commodity with the highest price per unit type and year;
- 7. The number of transactions per flow type and year.
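As referenced above, here is a minimal PairRDD sketch for demands 1 and 2. It assumes the `rdd` built in the loading sketch, with Country at index 0 and Year at index 1 (the indices are assumptions based on the column order in the table above).

```python
# Demand 1: number of transactions involving Brazil.
brazil_count = rdd.filter(lambda row: row[0] == "Brazil").count()

# Demand 2: number of transactions per year, as a PairRDD of (year, count).
transactions_per_year = (
    rdd.map(lambda row: (row[1], 1))
       .reduceByKey(lambda a, b: a + b)
       .sortByKey()
       .collect()
)
```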
BONUS ACTIVITY:
- Answer the same questions given above using SparkSQL and check that you reach the same results obtained with traditional RDD and PairRDD programming (a SparkSQL sketch follows below).
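A SparkSQL sketch for the same two demands is shown below, assuming the DataFrame `df` from the loading sketch; the view and column names follow the table in this README and are assumptions that may differ from the real file.

```python
# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("transactions")

# Demand 1: number of transactions involving Brazil.
brazil_count = spark.sql(
    "SELECT COUNT(*) AS n FROM transactions WHERE Country = 'Brazil'"
)

# Demand 2: number of transactions per year.
per_year = spark.sql(
    "SELECT Year, COUNT(*) AS n FROM transactions GROUP BY Year ORDER BY Year"
)

brazil_count.show()
per_year.show()
```

The results of these queries can then be compared with the counts produced by the RDD/PairRDD versions to confirm that the two approaches agree.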