DataScienceProjects: A Jupyter Notebook repository from franchiseBoyz

Overview

In this repository, you will find the source code for various projects that I have completed or that are still works in progress. Most of the projects are accompanied by a Medium blog post at tuannguyen-doan.medium.com. I have published almost exclusively in the Towards Data Science publication through Medium's Partner Program, so please check out these articles as a way to support me and my future projects. Alternatively, you can find my blog posts on my personal website here.

My interests lie at the intersection of statistical techniques, data visualization, and sports (especially football). All the code is written in Python or R. I don't have a strong preference between the two, nor do I make a concerted effort to code in a specific language or platform. The choice mostly depends on how well each one supports the functionality a project needs (for example, scraping in Python and data processing with dplyr piping in R).

I. Statistical application:

The statistics of modern football:

A collection of projects that explore the intricate statistical aspects of the Beautiful Game.

Empirical Bayes and penalty taking ability - Using Bayesian statistics to make meaningful comparisons between players across Europe.
Poisson process and match prediction - Here we learn about the Poisson process and how a random model outperforms football experts with its predictions.
The mathematics of football betting strategies - With the Poisson model and some additional help from mathematical research, can we beat the bookies?
Fisher vs Neyman-Pearson debate and Paul the Octopus - We go over the theory (or many theories) of hypothesis testing and see how it applies to the psychic ability of Paul the Octopus.
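
As a rough illustration of the Poisson idea mentioned above, here is a minimal sketch that turns two goal-scoring rates into win/draw/loss probabilities. It assumes each team's goals follow independent Poisson distributions. The function names and the example rates (1.5 and 1.1 goals per match) are hypothetical, and the linked article may model matches differently.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)


def match_outcome_probs(home_rate, away_rate, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    Assumes independent Poisson goal counts for each team and truncates
    the scoreline grid at max_goals goals per side.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical scoring rates, for illustration only.
home_win, draw, away_win = match_outcome_probs(1.5, 1.1)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

Truncating at 10 goals per side leaves out only a negligible amount of probability for realistic scoring rates, so the three numbers sum to almost exactly 1.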

Statistical theory and its application:

Bayes theorem and a probabilistic argument for God - Bayes' theorem and how people have used it to argue for the necessary existence of God.
Dating with probability theory - Here we explore what probability theory has to say about the optimal strategy to find the love of your life.
Bayes theorem and why it matters to my workout routine - A lightweight introduction to Bayes' theorem and how it helps convince me to hit the gym.
The Rule of Three and its application - A short introduction to the Rule of Three and how we can apply it to calculate the probability of events that have yet to happen. Applications include voting, vaccine development, product quality monitoring, etc.
Lindy's effect - A (slightly) mathematical description of the Lindy effect and how one can use it as a guide for life.
Normal Distribution with High Dimensionality - A statistical investigation into the myth of the "average Joe."
Mark-Recapture method - An intro to the statistics behind sampling theory and how you can use it to count almost everything.
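
Two of the ideas listed above fit in a few lines each, so here is a minimal sketch of both. The Rule of Three says that if an event has not occurred in n independent trials, an approximate 95% upper bound on its probability is 3/n. For mark-recapture, the sketch uses the classic Lincoln-Petersen estimator, which is an assumption on my part; the linked article may use a different estimator. All function names and numbers are hypothetical.

```python
def rule_of_three(n):
    """Approximate 95% upper bound on an event's probability after
    observing zero occurrences in n independent trials."""
    return 3.0 / n


def exact_upper_bound(n, confidence=0.95):
    """Exact bound: the largest p satisfying (1 - p)**n >= 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def lincoln_petersen(marked, caught, recaptured):
    """Estimate population size from a mark-recapture experiment.

    marked:     animals marked and released in the first visit
    caught:     animals caught in the second visit
    recaptured: animals in the second catch that carry a mark
    """
    if recaptured == 0:
        raise ValueError("need at least one recapture to form an estimate")
    return marked * caught / recaptured


# Hypothetical numbers, for illustration only.
print(rule_of_three(100))        # approximation
print(exact_upper_bound(100))    # slightly smaller than the approximation
print(lincoln_petersen(100, 50, 10))
```

The 3/n shortcut works because -ln(0.05) is roughly 3, so the approximation and the exact bound agree closely once n is moderately large.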

II. External Collaborations:

Published papers:

A robust and scalable method to compare Percentile metrics in online experiments (Quora Data Blog, 2022) - Conducting statistical tests for Percentile metrics can be tricky, as they have fewer neat mathematical properties than more common metrics such as averages or ratios. I discuss Quora's method to A/B test these metrics in a statistically valid and scalable manner.
How social learning amplifies moral outrage expression in online social networks (Science Advances, 2021) - Moral outrage shapes fundamental aspects of social life and is now widespread in online social networks. Here, we show how social learning processes amplify online moral outrage expressions over time.
Application of machine learning models in predicting length of stay among healthcare workers in underserved communities in South Africa (Human Resources for Health, 2018) - We aim to use machine learning methods to predict health professionals' length of practice in the rural public healthcare sector based on their demographic information.

III. General tutorials with Python and R:

Data visualization:

NetworkX and Basemap - Here is a comprehensive tutorial on how we can visualize geographical data with powerful tools that support Python.
Tkinter and Python - Building your own firework shows with Tkinter (and some math chops).
Data visualization with Matplotlib and Seaborn - Learn how to construct publication-worthy visualizations with the Matplotlib and Seaborn packages.

Machine Learning practicals:

End-to-end Machine Learning project with R - Here is a full data science project that covers data collection, cleaning, visualization, machine learning and validation.
Unsupervised Learning - Clustering method with R - An introduction to an array of unsupervised learning algorithms: Hierarchical clustering, k-means, and Factor Analysis.
Collaborative Filtering with Python - A comprehensive guide to the mathematical details and implementation of popular Matrix Factorization methods.
