Continuous improvement is key! Not sure where I first read about this, but I just found this post by lifehack.org that highlighted the philosophy of Kaizen, the practice of continuous improvement. This resonated with me a lot, so I decided to set a goal to code every day of October!
After going through almost the entire month (Oct 2021), I realized I really liked the visual aspect of this, so I initially decided to keep this going indefinitely. A bit down the line, I realized that as much as I loved this, I have other things I want to dedicate my time to, and have decided to stop adding new work here... Maybe I will come back in the future :)
Topics Explored: Scraping, Multiprocessing, Gradient Boosted Trees (GBT), Content Creation, Dimensionality Reduction, Data Cleaning, Data Visualization, Data Exploration/Exploratory Data Analysis (EDA), Object Oriented Programming (OOP), Data Wrangling, Databases, Statistics, Automation, Data Versioning, Documentation
Tools I used so far:
- (Python) concurrent.futures, bs4, requests, multiprocessing, threading, numpy, matplotlib, plotly, seaborn, sqlite3, ebooklib, collections, sklearn, pandas; (SQL); (C++); (git); (Medium)
- Re-evaluated priorities to start up my learning again!
- Watched Risk at Scale - Running a large investment risk system and how risk analysis techniques can help you - fascinating watch about working with risk at large scale and the software choices behind it.
- Watched:
- Highly-Scalable NLP to Answer Questions on South Africa’s COVID-19 WhatsApp Hotline - Impressive use of NLP to help with COVID-19 Q&A
- Computations as Assets - a New Approach to Reproducibility and Transparency - Introduction to ExAx and some visualizations it allows us to create. I really liked the COVID-19 visualization they did with taxi cars. Added ExAx/Accelerator to my list of things to learn.
- Darts for Time Series Forecasting - Introduction to the Darts library. Seems like a very versatile tool for forecasting, I added it to my list of things to check out.
- Looked for resources to learn some more theoretical topics and found Complexity Explorer
- (Docker) Watched What is Docker in 5 Minutes
- (Data Pipeline) Watched How to quickly build Data Pipelines for Data Scientists - Some nice tips for data pipelining and a tutorial for delta using Python
- (Random Walks) Watched What is a Random Walk? | Infinite Series - Introduction to random walks to remember what they are all about
- None (Weekend)
- None (Weekend)
- (Random Walks) Began Complexity Explorer Random Walk tutorial (1/9)
- (Random Walks) Continued Complexity Explorer Random Walk tutorial (4/9)
- git [Oct 6]
- Kaggle Competitions
- Parallelization
- Multithreading [Oct 2]
- Dask
- Cloud (AWS/GCP/Azure)
- Bread and Butter
- PCA [Oct 9]
- A/B Testing [Oct 18]
- SQL
- SQLite [Oct 7]
- Project
- Feature engineering
- Data cleaning [Oct 20]
- Data wrangling [Oct 10]
- Quality of Life
- Docker
- Documentation - Read the Docs
- Pytest
- Web
- Streamlit/Flask/FastAPI
- Data Visualization
- Tableau
- Seaborn [Oct 8]
- Statistics
- Theory
- scipy.stats (more in depth)
- statsmodels
- ML
- xgboost [Oct 5]
- AutoML
- Auto-sklearn
- TPOT
- sklearn (more in depth)
- Scraping [Oct 10]
- requests [Oct 1]
- bs4 (HTML) [Oct 1]
- ebooklib - Epubs [Oct 10]
- imbalanced-learn (sampling)
- Specific ML tools
- lightgbm
- Graph ML
- Time Series
- prophet
- greykite
- sktime
- Darts
- More General Purpose Tools
- Kubernetes
- PySpark
- ExAx/Accelerator (eBay)
- Understanding the low level
- C++ - Review [Oct 15]
- CUDA
- GPU Programming
- Numba
- Cython
- More Viz tools
- Plotly
- More DB
- MongoDB
- Snowflake
- Oct 1: (requests, bs4, re, concurrent.futures, nltk, and pandas) Scraped readlightnovel.me to create a light-novels dataset
- Oct 2: (concurrent.futures, Threading, Multiprocessing) A comparison of multi-core and single-core processing for matrix multiplication in Python
- Oct 3: (xgboost) Implemented xgboost from scratch! (xgboost part 1)
- Oct 4: (xgboost, boosting) Implemented boosting and added to previously created xgboost trees (xgboost part 2)
- Oct 5: (xgboost, boosting, plotly) Finished xgboost project! Added multi-dim input feature and approximate splitting (xgboost part 3)
- Oct 6: (git, PyTest, CircleCI) Set up git on my PC! (I ran into problems with this before, so I opted to use the desktop app/web interface locally and git for remote server work). I also studied unit testing using PyTest and CircleCI.
- Oct 7: (SQL, sqlite) Tested out sqlite3 for running SQLite
- Oct 8: (Seaborn) Added visualization in seaborn to my multiprocessing project
- Oct 9: (PCA, DevOps, Blogging) Watched a couple of videos on PCA (which I found similar to SVD, a procedure I love), started going through a DevOps course on YouTube, and began writing a Medium post on SPPPACY (I have been meaning to do this last one for a long time and finally got to it!)
- Oct 10: (SQL, sqlite, ebooklib, bs4, re, collections) I made a dataset for ingredient pairings
- Oct 11: (SQL, streamlit, flask) Took some time to dig in deeper on SQL and web development using Python so I can make the ingredient pairings project into an app
- Oct 12: (PCA, NumPy, Sklearn) Coded up PCA in Numpy and compared results with sklearn
- Oct 13: (Spark, PySpark) Watched and read tutorials on PySpark and Spark
- Oct 14: (Medium) Went back and edited the Medium post I wrote on Oct 9... hopefully I get it out soon
- Oct 15: (C++) I coded Othello in C++ from scratch!
- Oct 16: (hugo, portfolio) Watched some tutorials on making a portfolio website
- Oct 17: (hugo, portfolio) Put some more work into the portfolio
- Oct 18: (A/B testing) Read about A/B testing
- Oct 19: (Statistics) Started 365 Data Science statistics course
- Oct 20: (Data cleaning) Went and cleaned the data I generated from the light novels site
- Oct 21: (Statistics, PySpark) Continued statistics course and read more about PySpark (on Tutorialspoint)
- Oct 22: (Seaborn, Pandas) Basic data exploration on the scraped novel data
- Oct 23: (Seaborn, Pandas) Continued the data exploration and visualization for the light novel dataset
- Oct 24: (PySpark) Figured out how to run PySpark on Google Colab
- Oct 25: (Rasterio, concurrent.futures) Created a tool to match tif files between 2 directories
- Oct 26: (Rasterio, concurrent.futures) More work on the tif matching tool
- Oct 27: (Rasterio, concurrent.futures) Finished the tif matching tool
- Oct 28: (Data Versioning: DVC, DagsHub, FastDS; Documentation: Sphinx, Read the Docs; Exploratory Analysis: Missingno, Sidetable, Pandas; GPU Programming: Numba, CuPy, CuDF, CuML; Databases: Snowflake, Tecton) Joined PyData Global 2021 and went to:
- Oct 29: (Bayesian Ordered Logistic Regression: jax, numpyro; Graphs: neo4j, optuna, sklearn, pandas) More webinars:
- Oct 30: (Open Source: Contributed to NumPy) Last day of PyData Global:
- Participated in the NumPy + SciPy Sprint and made my first open source contribution!
- Oct 31: (Compressive sensing; Data Pipelines: Apache Kafka; Causal Inference: Simpson's Paradox) Catching up on PyData webinars I missed: