This repository contains a collection of data analysis projects showcasing SQL and Python skills for data exploration, cleaning, and visualization.
This project analyzes COVID-19 data to explore various aspects of the pandemic, including infection rates, death rates, vaccination progress, and correlations with socio-economic factors.
- Joins data from multiple tables (CovidDeaths and CovidVaccinations)
- Calculates infection and death percentages
- Analyzes data by country and continent
- Explores correlations between COVID-19 impact and factors like poverty and healthcare infrastructure
- Creates views for later visualization
- SQL (Microsoft SQL Server)
- Comparison of extreme poverty levels with access to handwashing facilities
- Analysis of hospital bed availability vs. weekly hospital admissions
- Calculation of infection rates and death percentages
- Global and country-specific COVID-19 statistics
This project focuses on cleaning and standardizing a dataset of national housing information.
- Removes duplicates
- Standardizes date formats
- Populates missing property addresses
- Breaks out address components (Address, City, State)
- Simplifies "Sold as Vacant" field values
- Removes unused columns
- SQL (Microsoft SQL Server)
- Using self-joins to populate missing data
- Leveraging PARSENAME for string splitting
- Employing CTEs and window functions for duplicate removal
- Ensure you have access to a SQL Server instance.
- Import the provided datasets into your SQL Server.
- Run the SQL scripts in your preferred SQL client (e.g., SQL Server Management Studio).
This project analyzes a dataset of movies to explore relationships between various features and their impact on movie gross earnings.
The project uses Python to perform data analysis and visualization on a movies dataset. It explores correlations between different movie attributes, with a focus on how factors like budget relate to gross earnings.
- Python
- Pandas for data manipulation
- Matplotlib and Seaborn for data visualization
- NumPy for numerical operations
- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)
- Correlation analysis
- Visualization of relationships between variables
- Ensure you have Python installed on your system.
- Install required libraries:
- Download the 'movies.csv' dataset and place it in the appropriate directory.
Run the Jupyter Notebook 'CorrelationProject1.ipynb' to see the analysis and visualizations.
The project explores various hypotheses, including:
- The relationship between movie budget and gross revenue
- The impact of production company on gross revenue
Detailed findings and visualizations are available in the notebook.
- Deeper analysis of other factors influencing movie success
- Machine learning models to predict movie performance
- Develop visualizations based on the created views
- Perform more advanced statistical analyses
- Integrate machine learning models for predictive analytics