Journey to Machine Learning
General
This is a record of my learning path in machine learning. It is mainly a personal log, but I hope it helps others too.
Next step: gain depth in the field (Phase 5, in 2022)
Update
Phase 4b
Jul 2021 to Dec 2021, 6 months
[Business]
- Ranking module: CTR model experiments
  - feature engineering
  - real-time training
[CS fundamentals]
Phase 4a
Jan 2021 to Jun 2021, 6 months
[Business]
- Basic understanding of the online advertising flow
- Ranking module:
  - System design
  - Model training and CTR/CVR predictions
[CS fundamentals]
[Programming]
- Code reading: C++
Phase 3c
Jul 2020 to Dec 2020, 6 months
[Business]
- Online advertising modules: recall, rank
[CS fundamentals] Data Structure
[Programming]
- Data application: Scala/Spark/Hadoop Streaming/Shell
- Code reading: Java/C++
Phase 3b
Jan 2020 to Jun 2020, 6 months
[Business]
- Video Content Understanding: more insights in tags
- Transfer to Advertising Department from Jun 2020
[CS fundamentals] Data Structure
- Data Structure II, by 邓金辉@THU:
  - Chapters 7-10 are done
  - Chapters 11-12 are pending
- leetcode-python-java: 21
[Programming] Scala/Spark
[Project] tencent-ad-2020
- final score: 1.401474
- rank: 242
Phase 3a
Jul 2019 to Dec 2019, 6 months
[Work at BILIBILI] Algorithm Engineer
[Algorithms] Video Content Understanding
- Classification with XGBoost, BERT (binary/multi-label)
- Fine-tune large BERT models with multi-GPU support
- Extensive paper reading and summary sharing: Pre-trained Language Models
[Engineering]
- Model: publishing and online serving
- HiveQL: training data preparation and statistical reporting
- gRPC
[Business] Recommender System
- Gradually became familiar with the Recall/Rank/Strategy modules
[CS fundamentals] Data Structure
- Data Structure I, by 邓金辉@THU
- leetcode-python-java:
  - updated continuously
  - interactive as Jupyter notebooks, and also implemented in Java
[Programming] Java
- Java fundamentals: introductory notes (Java基础入门笔记)
- LeetCode practice
Phase 2
Jan 2019 to Feb 2019, 2 months
[Project] Ad click-through rate prediction (广告点击率预测): Criteo CTR
- Code on GitHub, by lambdaji
- a good introduction to the CTR field: mainstream models are covered, including FM, FFM, DeepFM and Wide&Deep
- a good tutorial on the TensorFlow framework: the code is implemented in TensorFlow and uses high-level APIs such as tf.Estimator and tf.Serving
- Paper reading: to understand CTR models systematically
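
As a small illustration of the simplest model in that list, here is a minimal sketch of the second-order factorization machine (FM) score in plain Python. It is not taken from the lambdaji repository; the function names (`fm_score`, `fm_score_pairwise`, `sigmoid`) and the toy weights are made up for this example. It shows the standard identity that reduces the pairwise interaction term from O(n^2 k) to O(n k).

```python
import math


def fm_score(x, w0, w, V):
    """Second-order FM score for one dense feature vector.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear weights
    V  : n x k nested list, one latent vector per feature
    """
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        # 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_score_pairwise(x, w0, w, V):
    """Same score via the explicit double loop, useful as a sanity check."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


def sigmoid(z):
    """Squash a raw score into a probability, e.g. a predicted CTR."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Toy, hand-picked parameters (illustrative only).
    x = [1.0, 0.0, 2.0]
    w0, w = 0.1, [0.2, 0.3, -0.1]
    V = [[1.0, 0.5], [0.2, 0.1], [-0.5, 1.0]]
    print(sigmoid(fm_score(x, w0, w, V)))
```

Both functions should agree on any input; the fast form is what makes FM practical on wide, sparse CTR features.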
[Project] Payment risk identification (支付风险识别): ATEC Payrisk
- Feature selection: avoid overfitting on the test dataset
- Practice with XGBoost
[Course] TensorFlow for Deep Learning Research, by Chip Huyen@Stanford
[Intern Hunting]
Mar 2019 to Jun 2019, 4 months
[Project] Captcha Recognition
- python-captcha-recognition: generate labels automatically with OCR and train an MLP classifier to recognize captchas (accuracy: 60%)
- pytorch-captcha-recognition: generate captcha-label pairs automatically and train a ResNet model to recognize captchas (accuracy: 90%)
[Project] Anti-crawling Fontlib in Website
- anti-crawl-fontlib-img:
  - process the WOFF file used for anti-crawling and dynamically reconstruct the fontlib (key-char pairs) for further text parsing
  - characters are obtained via OCR; accuracy can reach 90%
- anti-crawl-fontlib-svg:
  - convert the fontlib into SVG files and use each glyph's unique d-path as the retrieval feature
  - accuracy reaches almost 100% within 1 s, 20x faster than the previous OCR method
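
The d-path idea can be sketched as follows. This is a minimal illustration, not the code from anti-crawl-fontlib-svg: it assumes the font has already been converted to an SVG font by some external tool, that the site only reshuffles glyph names between requests while the outline string in each `d` attribute stays byte-identical, and that one reference font has been labelled by hand. The function names and the sample glyph names are hypothetical.

```python
import xml.etree.ElementTree as ET


def glyph_paths(svg_text):
    """Return {glyph_name: d_path} for every <glyph> element in an SVG font."""
    root = ET.fromstring(svg_text)
    paths = {}
    for el in root.iter():
        # Strip an optional XML namespace prefix such as "{http://...}glyph".
        if el.tag.split("}")[-1] == "glyph" and el.get("d"):
            paths[el.get("glyph-name")] = el.get("d")
    return paths


def build_lookup(reference_svg, known_chars):
    """Map d-path -> real character, using a hand-labelled reference font.

    known_chars: {glyph_name: character} for the reference font (assumed given).
    """
    return {
        d: known_chars[name]
        for name, d in glyph_paths(reference_svg).items()
        if name in known_chars
    }


def decode_font(new_svg, lookup):
    """Map each glyph name of a newly served font to its real character.

    Glyphs whose outline is not in the lookup map to None.
    """
    return {name: lookup.get(d) for name, d in glyph_paths(new_svg).items()}
```

Because decoding is a dictionary lookup on the outline string rather than an OCR pass per glyph, it is exact whenever the outlines really are reused, which is consistent with the speed and accuracy gap noted above.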
[Project] tensorflow-nlp-sentiment-analysis
- Sentiment analysis of restaurant reviews from Dianping, implemented in TensorFlow (including estimator and serving)
- Also a good tutorial on basic NLP models (e.g. LSTM and its variants) and on multi-head attention from the Transformer
[Course] Natural Language Processing with Deep Learning, by Manning@Stanford
- Classic 2017 course, with both English and Chinese subtitles
- Latest 2019 course, English subtitles only
[Job Hunting]
Phase 1
Jul 2018 to Aug 2018, 2 months
[Course] Machine Learning, by Andrew Ng@Stanford
- Online videos
  - both English and Chinese subtitles are included
  - 20 hours in total
- Coursera enrollment
  - take the quizzes after each unit
  - finish 8 programming tasks in MATLAB
[Programming] Python Tutorial (Python教程), by 廖雪峰
[Programming] NumPy (in Python)
Sep 2018 to Oct 2018, 2 months
[Book] Statistical Learning Methods (统计学习方法), by 李航
- Basic theory of machine learning
- Studying and deriving the formulas yourself is strongly recommended
[Book] Machine Learning in Action (机器学习实战)
- Python code is included
- Select chapters and practice the code corresponding to Statistical Learning Methods (统计学习方法)
[Programming] Pandas, scikit-learn (in Python)
Nov 2018 to Dec 2018, 2 months
[Project] Kaggle mini-projects practice (5+)
- Start with Titanic: Machine Learning from Disaster
- Target: go through the entire process, including feature engineering, modelling and evaluation
[Course] Data Structure I, by 邓金辉@THU
- Online videos
- Data Structure II should be continued
[Book] Practical Statistics for Data Scientists (面向数据科学家的实用统计学)
Next refresh: Jul 2022.
More
- Let's play with data
- Machine learning offers a new perspective
- Big thanks to Tao Li, Haha Hu, Jiajia Cong, Jianxiong Ma