/personal_projects

This repo contains a portfolio of my data analytics and ML projects completed for personal learning, hobby, and research purposes.


ML and Data Analytics Portfolio

UPDATE: Please view my new portfolio website published on GitHub Pages: Portfolio website

This repository contains a portfolio of my ML and data analytics projects completed for professional, research, and learning purposes. Projects are presented as Jupyter notebooks, R Markdown files (published on RPubs), online dashboard apps, written reports, PowerPoint slides, and video lectures.

Data files for all projects are available in their respective folders or, in the case of large data files, via a URL within the Jupyter notebooks. Jupyter notebooks are best viewed using the nbviewer links provided.

Contents

  • Machine Learning

    • The Battle of the Neighborhoods: K-means clustering was used to cluster all neighborhoods in Helsinki based on a selection of neighborhood features. The "optimal" neighborhood cluster for opening a new café was then chosen using a set of business and mathematical assumptions. Neighborhood census and venue data were collected from Statistics Finland and via the Foursquare API, and then cleaned using SQL queries and pandas. View [ notebook | report | slides ]

    • Predicting House Sale Prices Using Regression: A series of regression models predicting house sale prices were developed and compared. Models with varying polynomial degrees and regularization types (ridge, lasso, and elastic net), along with extreme gradient boosting, were compared using R² and RMSE scores. All hyperparameters were tuned using 5-fold cross-validation and Bayesian optimization. View [ notebook ]

    • Determining the Best Classification Model for Loan Default Prediction: A series of classification models predicting loan status were developed and compared. Models using k-nearest neighbor, decision tree, support vector machine, logistic regression, and extreme gradient boosting algorithms were compared using F1 scores, Jaccard indices, and log loss. Hyperparameters were tuned via cross-validation and Bayesian optimization. View [ notebook ]

    • Movie Recommendation Systems: Two simple movie recommendation systems were created using (1) content-based filtering, which recommends movies similar in genre to those rated highly by the user, and (2) collaborative filtering, which recommends movies rated highly by other users with similar inputs, i.e., movies watched and rated. View [ notebook ]

    Tools: SQL, Pandas, NumPy, Matplotlib, Seaborn, Folium, Scikit-learn, XGBoost, Hyperopt
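
For readers unfamiliar with the evaluation metrics named in the regression and classification projects above, the following is a minimal standard-library sketch of how RMSE, R², the F1 score, the Jaccard index, and log loss can be computed. It is not code from the notebooks: the function names and the toy labels are illustrative assumptions, and in practice a library such as Scikit-learn would normally be used instead.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def _counts(y_true, y_pred, positive=1):
    """True positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def jaccard_index(y_true, y_pred, positive=1):
    """Intersection over union of the positive labels: TP / (TP + FP + FN)."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def log_loss(y_true, prob_positive, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, prob_positive):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Toy labels, made up purely for illustration.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
print(f1_score(labels, predictions), jaccard_index(labels, predictions))
```

F1 and the Jaccard index are both built from the same three confusion counts, so they rank models similarly, while log loss additionally penalizes confident but wrong probability estimates.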

  • Data Analysis and Visualization

    • Gender Research Productivity Gap: ggplot2 was used to visualize the gender research productivity gap in STEM and other scientific fields. I collected publication data by individual researchers in the Mathematics, Genetics, Applied Psychology, and Mathematical Psychology fields, then compared the productivity distributions by gender using histograms and kernel density plots. Note that these data were collected as part of my PhD dissertation, and a study based on my dissertation research was published in the Journal of Applied Psychology. View [ R markdown | published article | slides ]

    • Descriptive Analysis of the 2019 Stack Overflow Developer Survey Data: Data collected as part of the 2019 Stack Overflow Annual Developer Survey were analyzed and visualized using SQL, Python, and IBM Cognos. Findings yielded numerous insights into developer technology usage, trends, and demographics. View [ notebook | cognos dashboard | slides ]

    • Automobile Sales, Recalls, and Sentiment: Data on the profits, quantity sold, units recalled, and customer sentiment regarding 5 automobiles (distributed by 10 dealers) were visualized using Tableau. The data files used here were obtained from the IBM Accelerator Catalog. View [ tableau dashboard ]

    • Airline Performance Dashboard Using Plotly Dash: A Plotly Dash app visualizing yearly airline performance and delay data from 2005 to 2020 was deployed using Heroku. Data were visualized using a collection of bar, line, pie, choropleth map, and treemap charts. View [ dash app | python script ]

    • Visualizing Tesla and GameStop Stock Data: I used yfinance to extract Tesla (TSLA) and GameStop (GME) stock data. I also used Beautiful Soup to scrape revenue data for the two companies. View [ notebook ]

    Tools: SQL, ggplot2, Tableau, Cognos, Plotly, Dash, BeautifulSoup
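
The histograms and kernel density plots mentioned in the research productivity project summarize the same distribution in two ways: by counting values per bin and by smoothing each value into a small bell curve. The sketch below shows both ideas using only the Python standard library. It is not the R code from that project: the function names, the fixed `bandwidth` parameter, and the sample values are assumptions chosen for illustration.

```python
import math


def histogram(samples, bin_edges):
    """Count samples per bin. Bins are half-open [lo, hi), except the last,
    which also includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for s in samples:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            if lo <= s < hi or (i == last and s == hi):
                counts[i] += 1
                break
    return counts


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x by averaging
    Gaussian kernels of width `bandwidth` centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up values, for illustration only.
values = [1, 2, 2, 3, 5]
print(histogram(values, [0, 2, 4, 6]))
density = gaussian_kde(values, bandwidth=0.8)
print(density(2.0))
```

A smaller bandwidth follows the individual values more closely, while a larger one gives a smoother curve, so the choice affects how two groups' distributions appear to differ.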

  • Teaching About Data

    Since 2019, I have been the responsible teacher for the master's course Doing Quantitative Research at Aalto University School of Business. In 2020, due to the pandemic, the course was conducted fully online via pre-recorded lectures and live Zoom sessions.

    Below are five pre-recorded lectures from 2020:

I've also been involved in numerous academic research projects in which I played a leading role in data collection, data analysis, paper writing, and presenting. In my research, I apply statistical techniques from the social, natural, and formal sciences, such as structural equation modeling, time series analysis, allometric modeling, and social network analysis, to survey, experimental, and archival data. You can view these outputs in a separate repository: Academic Publications and Presentations

Thank you for your interest in my work. If you would like to chat about my portfolio, professional opportunities, or collaboration, please message me on LinkedIn: https://www.linkedin.com/in/younghunji/.