/HDIP_CSDA_COMP08050_PROJECT

Higher Diploma in Science in Computing (Data Analytics) - Programme Module: Fundamentals of Data Analysis (COMP08050)

Higher Diploma in Science in Computing (Data Analytics)

Programming for Data Analysis (COMP08050) - Project 2020

The aim of this module is to develop programming skills towards the effective use of data analysis libraries and software. Students learn how to select efficient data structures for numerical programming, and to use these data structures to transform data into useful and actionable information.

1. Repository Structure

  1. Readme: README.md

  2. Jupyter Notebook: project2020.ipynb

  3. Excel version of simulated data: simulated_data.xlsx

  4. Images folder: images

If you have any issues viewing project2020.ipynb on GitHub, you can use Jupyter NBViewer, the web application behind The Jupyter Notebook Viewer, at https://nbviewer.jupyter.org/

2. Software used

Jupyter: Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. (Source: downloads.com)

Cmder: Cmder provides you with an alternative to the Windows default command prompt utility through a more capable console emulator that also comes packing a good-looking graphical user interface. (Source: downloads.com)

SciPy: SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python. It provides more utility functions for optimization, stats and signal processing. (Source: w3schools.com)


3. Project

Problem statement: For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:

  • Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

  • Investigate the types of variables involved, their likely distributions, and their relationships with each other.

  • Synthesise/simulate a data set as closely matching their properties as possible.

  • Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

The project is divided into nine sections outlined below.

1.0 Introduction

2.0 Libraries

3.0 Dataset

4.0 Overview

5.0 Detailed Analysis

6.0 Flight Status

7.0 Simulation

8.0 Simulation Analysis

9.0 Comparison


1.0 Introduction: In this section I provide an overview of the real-world phenomenon on which I based my simulated data. It also includes a brief introduction to simulation and synthetic data and an outline of where I sourced the dataset. The real-world phenomenon I chose relates to aviation, more specifically flight delays at US airports and the types of delay experienced. I wanted to understand the relationship between flight delays and flight volume. I should point out now that if I had the chance to choose a dataset again, I would not choose this one. I explain this in more detail later, but suffice it to say the data had already been cleansed, was aggregated and had few categorical variables. In retrospect I should have chosen a date series or time series of flights at one or more individual airports that had not been aggregated to a monthly view. I would also have liked more categorical variables, perhaps with a status for individual flights.

Learnings: When researching an interesting dataset to analyse, take your time. I also learned to download and add additional functionality in Jupyter Notebooks, including the add-ons "Autopep8", "spellchecker" and "Table of Contents(2)". Autopep8 is excellent for formatting your code correctly, while spellchecker comes in handy for correcting those pesky spelling mistakes.

Topics: research, simulation, synthetic data.


2.0 Libraries: This section covers importing the libraries used for this project. I used Pandas for importing and manipulating the dataset, NumPy for generating the synthetic dataset, and Matplotlib and Seaborn for data visualisation. I also used SciPy to identify the distributions of the dataset variables.

Learnings: I learned two things in this section. I got an understanding of the SciPy stats module, and I found a way of making plots clearer by using the magic command %config InlineBackend.figure_format = 'retina'.

Topics: libraries, Pandas, NumPy, SciPy, retina display magic command.


3.0 Dataset: This section covers importing the data from the CORGIS website, renaming and ordering the variables and splitting some variables into additional information. I also only select the top 5 US airports from the dataset and provide a brief description of them. Because the dataset only provides partial information for 2003 and 2016 I dropped those years. A feature description table of the dataset variables is also provided.

Learnings: As I worked on this section, I got a better understanding of manipulating and viewing datasets. I found out how to display all columns in Jupyter by using “pd.set_option('display.max_columns', None)” and how to select only certain rows from a dataset by using conditional arguments. I also learned about feature engineering, particularly splitting variables into usable data, which opens up further analysis.

Topics: feature engineering, tables in Jupyter, dataframes, conditional arguments.
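
As a minimal sketch of the techniques listed for this section, the snippet below sets the display option, selects rows with a condition, splits one column into two and drops a partial year. The column names (airport_code, period, flights) and all values are made up for illustration and are not taken from the CORGIS dataset or the notebook.

```python
import pandas as pd

# Show every column when a dataframe is displayed in Jupyter.
pd.set_option("display.max_columns", None)

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "ORD", "XYZ", "ATL"],
    "period": ["2003/06", "2004/01", "2004/01", "2004/01", "2004/02"],
    "flights": [50, 100, 90, 10, 110],
})

# Conditional selection: keep only the airports of interest.
airports_of_interest = ["ATL", "ORD"]
selected = df[df["airport_code"].isin(airports_of_interest)].copy()

# Feature engineering: split one text column into two integer columns.
selected[["year", "month"]] = (
    selected["period"].str.split("/", expand=True).astype(int)
)

# Drop years with only partial data (2003 and 2016 in the text above).
selected = selected[selected["year"].between(2004, 2015)]
```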


4.0 Overview: In section 4 I provide a quick overview of flight volumes by year and month. I also look at airport capacity and the factors that limit it, describe the relationship between capacity, demand and delay, and outline the busiest days experienced by airports in the US. I included a plot of delays against flight volumes, which indicates a positive correlation between the two. This holds at an individual airport level as well.

Learnings: This section taught me how to display data using bar plots and how to apply the “estimator=sum” argument. I also learned about the relationship between flight volumes and flight delays.

Topics: barplots, lmplots, airport capacity, col_wrap, sharex and sharey in Seaborn.
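
To illustrate what “estimator=sum” changes, the sketch below computes with pandas the two aggregations a Seaborn bar plot can show: the per-category mean (the default) and the per-category total. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical monthly records; names and values are illustrative only.
df = pd.DataFrame({
    "year": [2004, 2004, 2005, 2005],
    "flights": [100, 120, 130, 150],
})

# Seaborn's barplot draws the mean of y per category by default;
# passing estimator=sum makes each bar the category total instead, e.g.
#   sns.barplot(data=df, x="year", y="flights", estimator=sum)
# The same two aggregations computed directly with pandas:
mean_per_year = df.groupby("year")["flights"].mean()
total_per_year = df.groupby("year")["flights"].sum()

print(total_per_year.to_dict())
```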


5.0 Detailed Analysis: Section 5 looks at the variables in the dataset at a granular level. A correlation heatmap using Pearson’s method shows graphically how the numerical variables are related. The section also provides information on each of the Bureau of Statistics delay categories. For each category a distribution plot, boxplot and stripplot are produced, as well as descriptive statistics. I also use the SciPy stats module to identify a possible distribution, which is then used in NumPy to generate the data for the synthetic dataset.

Learnings: Where do I start? This section was both interesting and frustrating. I also began to doubt my choice of dataset. I was frustrated trying to identify a possible distribution fit to apply to the data and possibly spent too much time on this at the expense of using my time more productively on generating a synthetic dataset. I did however gain valuable insight into distplot formatting using “kws” and using SciPy stats. I also learned how to combine Seaborn boxplots and stripplots.

Topics: heatmaps, Pearson’s correlation method, stripplots, Gumbel distribution, SciPy.
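
One way the distribution-fitting step described in this section might look is sketched below. It is not the notebook's code: the sample is randomly generated as a stand-in for a dataset column, and the location and scale values are arbitrary.

```python
import numpy as np
from scipy import stats

# Stand-in sample: the notebook would use a column of the dataset here.
rng = np.random.default_rng(seed=42)
sample = rng.gumbel(loc=500.0, scale=100.0, size=2000)

# Maximum-likelihood fit of a right-skewed Gumbel distribution.
loc_hat, scale_hat = stats.gumbel_r.fit(sample)

# Kolmogorov-Smirnov test as a rough check of the fit. The p-value is
# optimistic because the parameters were estimated from the same sample.
ks_stat, p_value = stats.kstest(sample, "gumbel_r", args=(loc_hat, scale_hat))

# Feed the fitted parameters back into NumPy to draw synthetic values.
synthetic = rng.gumbel(loc=loc_hat, scale=scale_hat, size=len(sample))
```

NumPy's gumbel draws from the same right-skewed form that SciPy calls gumbel_r, so the fitted loc and scale can be passed across unchanged.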


6.0 Flight Status: This section follows the same approach as section 5 but looks at flight status instead of the reasons for delay. By flight status I mean flights diverted, canceled and on time.

Learnings: Same as section 5.

Topics: distplots, stripplots, Gumbel distribution, SciPy.


7.0 Simulation: This is where I generate the synthetic data using the distributions identified in the previous sections. We could only do this using the numpy.random package. As well as generating the random data for the variables, I also had to create columns for totals.

Learnings: I learned a lot from this section, especially about NumPy. Because I was trying to produce data closely aligned to the properties of my real-world phenomenon, I ran into issues. The NumPy Gumbel distribution does not allow a ceiling to be set, which meant it would generate more records than I wanted but would create the distribution I needed. It also generated negative numbers, which I didn’t want but overcame using the “abs” function. I also learned how to output a dataframe to an Excel file.

Topics: NumPy, Gumbel distribution, concatenation, index reset, export dataframe to Excel.
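
The sketch below shows one possible way to handle the two Gumbel issues mentioned in this section. Negative draws are folded with abs, as in the text; the ceiling is enforced by rejecting draws above it and topping up, which is an assumption about how it could be done rather than the notebook's method. The helper draw_capped_gumbel, the column names and all parameter values are hypothetical.

```python
import numpy as np
import pandas as pd


def draw_capped_gumbel(rng, loc, scale, ceiling, size):
    """Return `size` non-negative Gumbel draws, none above `ceiling`."""
    kept = np.empty(0)
    while kept.size < size:
        # abs() folds negative draws back to positive values.
        batch = np.abs(rng.gumbel(loc, scale, size=size))
        # Reject draws above the ceiling and top up on the next pass.
        kept = np.concatenate([kept, batch[batch <= ceiling]])
    return kept[:size]


rng = np.random.default_rng(seed=1)
n_rows = 720  # hypothetical number of records

# Hypothetical (loc, scale, ceiling) values, not taken from the notebook.
params = {
    "delay_weather": (60.0, 40.0, 400.0),
    "delay_carrier": (900.0, 250.0, 2500.0),
}

simulated = pd.DataFrame({
    name: draw_capped_gumbel(rng, loc, scale, ceiling, n_rows).round().astype(int)
    for name, (loc, scale, ceiling) in params.items()
})

# Totals are derived from the simulated parts rather than drawn separately.
simulated["delay_total"] = simulated[list(params)].sum(axis=1)

# Export to Excel (requires an Excel writer such as openpyxl):
# simulated.to_excel("simulated_data.xlsx", index=False)
```

Folding with abs and rejecting values above a ceiling both alter the shape of the distribution slightly, so the result is only approximately Gumbel.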


8.0 Simulation Analysis: In this section I created statistical summaries of the original and synthetic datasets and used df.info() to check that the data types were the same and that I had no null values.
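
The checks described in section 8.0 could be sketched as follows. The two dataframes are random stand-ins with hypothetical column names; the dtype and null checks are programmatic versions of what df.info() lets you confirm by eye.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Stand-in frames; column names and values are illustrative only.
original = pd.DataFrame({
    "flights": rng.integers(5000, 9000, 50),
    "delayed": rng.integers(500, 2000, 50),
})
synthetic = pd.DataFrame({
    "flights": rng.integers(5000, 9000, 50),
    "delayed": rng.integers(500, 2000, 50),
})

# Side-by-side summary statistics, one column group per dataset.
summary = pd.concat(
    {"original": original.describe(), "synthetic": synthetic.describe()},
    axis=1,
)

# Programmatic versions of the checks done by eye with df.info().
same_dtypes = original.dtypes.equals(synthetic.dtypes)
null_count = int(synthetic.isna().sum().sum())
```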


9.0 Comparison: Finally I produced scatterplots of each of the numeric variables to see how aligned they are and bar plots of total flights by year and month. I also displayed the simulated dataset in an output cell within the Jupyter notebook.

Learnings: Here I learned how to concatenate the two dataframes using "assign" so that I could compare the original and synthetic data against each other.
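
A minimal sketch of the "assign" technique mentioned here, using two tiny stand-in dataframes with hypothetical columns: each frame gets a label column before the two are stacked, so plots can distinguish original from synthetic rows.

```python
import pandas as pd

# Tiny stand-in frames; names and values are illustrative only.
original = pd.DataFrame({"flights": [100, 120], "delayed": [10, 15]})
synthetic = pd.DataFrame({"flights": [105, 118], "delayed": [12, 14]})

# assign() returns a copy with the extra column, leaving the inputs untouched.
combined = pd.concat(
    [original.assign(source="original"), synthetic.assign(source="synthetic")],
    ignore_index=True,
)

# The label column can then drive the colour in a Seaborn plot, e.g.
#   sns.scatterplot(data=combined, x="flights", y="delayed", hue="source")
```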


4. Conclusion

Overall I think I made a good attempt at creating a synthetic dataset for this project; however, I did make some errors along the way. I chose a real-world phenomenon that could be measured and for which I could collect data points across at least four variables, but in hindsight I chose the wrong dataset. To simulate the data I should have chosen a dataset that recorded flights as a time series and had classification variables for each individual flight. Had that been the case, I could have used more functionality from the numpy.random package. I could also have applied more suitable distributions, such as Poisson, binomial or exponential, to determine the outcome of each flight and then aggregated the results. All in all it was an interesting project to work on and I definitely learned a lot in developing the solution.
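
As a rough illustration of the alternative approach suggested in this conclusion, the sketch below simulates daily flight counts with a Poisson distribution, the number of delayed flights with a binomial distribution and delay lengths with an exponential distribution, then aggregates per day. It was not part of the project, and every rate and probability in it is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# All rates below are made-up illustrative values.
n_days = 30
mean_flights_per_day = 1000   # Poisson rate for daily flight volume
p_delay = 0.2                 # probability that a single flight is delayed
mean_delay_minutes = 45.0     # mean of the exponential delay length

# Poisson: number of flights scheduled on each day.
flights = rng.poisson(mean_flights_per_day, size=n_days)

# Binomial: how many of each day's flights are delayed.
delayed = rng.binomial(flights, p_delay)

# Exponential: one delay length per delayed flight, summed per day.
delay_minutes = np.array(
    [rng.exponential(mean_delay_minutes, size=k).sum() for k in delayed]
)
```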