Dahel Consultant and Techies Internship: This is a project that entails the analysis of a Tour de France (Tourism) (The picture below is gotten from Chicagoland Epic Game Website).
This is an analysis of a bike racing event. It is done by analyzing data from 5 different tables namely tdf_finishers, tdf_winners, tdf_stages, tdf_tours, including an attached data dictionary. The project was carried out entirely using Microsoft Excel.
The goal of this analysis is to:
-
Calculate the average speed of the riders in a specific year or across multiple years.
-
Determine which rider has won the most stages in a given year or across multiple years.
-
Calculate the time margin between the winner and other riders for each stage or overall in a race.
-
Analyze the performance of different teams by calculating the total time taken by all riders from the same team in a race.
-
Determine the birthplaces of the riders and their frequency.
- Microsoft Excel concepts like:
1 Data Importing and Exporting:
- Importing data from external sources such as text files, CSV files, databases, or other Excel workbooks.
- Exporting data to different formats for sharing or further analysis.
2 Data Cleaning and Transformation:
- Removing duplicate records.
- Handling missing values (e.g., filling in missing values, deleting rows with missing data).
- Converting data types (e.g., converting text to numbers, dates).
- Splitting, merging, or rearranging data across columns.
3 Formulas and Functions:
- Utilizing built-in functions such as SUM, AVERAGE, COUNT, MAX, MIN for basic calculations.
- Performing more complex calculations using functions like IF, VLOOKUP, INDEX-MATCH, SUMIFS, COUNTIFS, etc.
- Creating custom functions using Excel's built-in programming language, Visual Basic for Applications (VBA).
4 Data Analysis Tools:
- PivotTables: Summarizing and analyzing large datasets by creating dynamic tables.
- Data Tables: Conducting what-if analysis by exploring different scenarios.
- Solver: Optimizing solutions by finding the best possible outcome based on defined constraints.
5 Statistical Analysis:
- Descriptive statistics: Calculating measures such as mean, median, mode, standard deviation, variance, etc.
- Correlation analysis: Determining relationships between variables using correlation coefficients.
- Regression analysis: Analyzing the relationship between dependent and independent variables.
6 Data Validation and Error Checking:
- Implementing data validation rules to ensure data accuracy and consistency.
- Using Excel's auditing tools to trace precedents and dependents, detect errors, and troubleshoot formulas.
7 Data Presentation and Reporting:
-
Formatting worksheets and cells for improved readability.
-
Creating dashboards and reports summarizing key insights.
-
Adding data labels, titles, and annotations to charts and graphs.
-
Inserting images, shapes, and text boxes to enhance presentation.
The dataset for the work is gotten from Dahel Consultant Techies. It consist of 5 different tables namely tdf_finishers, tdf_winners, tdf_stages, tdf_tours, including an attached data dictionary. I studied the dataset well and its attached dictionary to gain proper insight into the dataset. You can find a link to download the dataset here:
After downloading the dataset, I opened the dataset as CSV files in Microsoft Excel.
- I removed duplicate records.
- I handled missing values (e.g., filling in missing values, deleting rows with missing data).
- I converted data types (e.g., converting text to numbers, dates).
- I splitted, merged, and rearranged data across columns.
The transformation view and results are displayed below:
- Several expressions and built-in functions such as SUM, AVERAGE, COUNT, MAX, MIN and many more were made to arrive at the desired results.
- I performed more complex calculations using functions like IF, VLOOKUP, INDEX-MATCH, SUMIFS, COUNTIFS, etc.
The analysis conveys information about the following key areas:
-
average speed of the riders in a specific year or across multiple years.
-
rider who has won the most stages in a given year or across multiple years.
-
time margin between the winner and other riders for each stage or overall in a race.
-
performance of different teams by calculating the total time taken by all riders from the same team in a race.
-
birthplaces of the riders and their frequency of wins.
The overall analysis of the dataset can be checked out here
Summary of the insights gained into the bike race performance:
-
The average speed of the riders in the year 1903 is 24.2km/hr
-
The Average Speed of riders across all the years is 3247.8km/hr
-
The Average Speed of riders for multiple years(1903,1922and1935) is 80.6km/hr
-
Eddy Merckx, Bernard Hinault, Jacques Anquetil had the highest wins with the sum of wins of 32, 21, 16 respectively.
-
Rider who won the most stages in a given year(1948) happens to be Gino Bartali while the rider who won the most stages across multiple year happens to be Eddy Merckx.
-
There is no correlation between Riders place and year of birth Frequency of Wins
I am open for entry-level to mid-level data anlalyst role.
Let us have discussion about your company and industry now!