Data Analysis Projects

This repository contains a collection of data analysis projects showcasing SQL and Python skills for data exploration, cleaning, and visualization.

Projects

1. COVID-19 Data Analysis

Overview

This project analyzes COVID-19 data to explore various aspects of the pandemic, including infection rates, death rates, vaccination progress, and correlations with socio-economic factors.

Key Features

Joins data from multiple tables (CovidDeaths and CovidVaccinations)
Calculates infection and death percentages
Analyzes data by country and continent
Explores correlations between COVID-19 impact and factors like poverty and healthcare infrastructure
Creates views for later visualization

Technologies Used

SQL (Microsoft SQL Server)

Notable Queries

Comparison of extreme poverty levels with access to handwashing facilities
Analysis of hospital bed availability vs. weekly hospital admissions
Calculation of infection rates and death percentages
Global and country-specific COVID-19 statistics

2. National Housing Data Cleaning

Overview

This project focuses on cleaning and standardizing a dataset of national housing information.

Key Features

Removes duplicates
Standardizes date formats
Populates missing property addresses
Breaks out address components (Address, City, State)
Simplifies "Sold as Vacant" field values
Removes unused columns

Technologies Used

SQL (Microsoft SQL Server)

Notable Operations

Using self-joins to populate missing data
Leveraging PARSENAME for string splitting
Employing CTEs and window functions for duplicate removal

Setup and Usage

Ensure you have access to a SQL Server instance.
Import the provided datasets into your SQL Server.
Run the SQL scripts in your preferred SQL client (e.g., SQL Server Management Studio).

3. Movie Data Analysis Project

This project analyzes a dataset of movies to explore relationships between various features and their impact on movie gross earnings.

Project Overview

The project uses Python to perform data analysis and visualization on a movies dataset. It explores correlations between different movie attributes, with a focus on how factors like budget relate to gross earnings.

Technologies Used

Python
Pandas for data manipulation
Matplotlib and Seaborn for data visualization
NumPy for numerical operations

Key Features

Data cleaning and preprocessing
Exploratory Data Analysis (EDA)
Correlation analysis
Visualization of relationships between variables

Setup and Installation

Ensure you have Python installed on your system.
Install required libraries:
Download the 'movies.csv' dataset and place it in the appropriate directory.

Usage

Run the Jupyter Notebook 'CorrelationProject1.ipynb' to see the analysis and visualizations.

Findings

The project explores various hypotheses, including:

The relationship between movie budget and gross revenue
The impact of production company on gross revenue

Detailed findings and visualizations are available in the notebook.

Future Work

Deeper analysis of other factors influencing movie success
Machine learning models to predict movie performance
Develop visualizations based on the created views
Perform more advanced statistical analyses
Integrate machine learning models for predictive analytics

nyunja/Portfolio-Projects

Data Analysis Projects

Projects

1. COVID-19 Data Analysis

Overview

Key Features

Technologies Used

Notable Queries

2. National Housing Data Cleaning

Overview

Key Features

Technologies Used

Notable Operations

Setup and Usage

3. Movie Data Analysis Project

Project Overview

Technologies Used

Key Features

Setup and Installation

Usage

Findings

Future Work