Personal practice on machine learning. Communications and questions are welcome.

Journey to Machine Learning

General

I share my learning path here; it is meaningful to me as a personal record. I hope it helps.

Next step: build deeper expertise (Phase 5, in 2022)

Update


Phase 4b

Jul 2021 to Dec 2021, 6 months

[Business]
  • Ranking module: CTR model experiments
    • feature engineering
    • realtime training
[CS fundamentals]

Phase 4a

Jan 2021 to Jun 2021, 6 months

[Business]
  • Basic understanding of the online advertising flow
  • Ranking module:
    • System design
    • Model training and CTR/CVR predictions
[CS fundamentals]
[Programming]
  • Code reading: C++

Phase 3c

Jul 2020 to Dec 2020, 6 months

[Business]
  • Online advertising modules: recall, rank
[CS fundamentals] Data Structure
[Programming]
  • Data application: Scala/Spark/Hadoop Streaming/Shell
  • Code reading: Java/C++

Phase 3b

Jan 2020 to Jun 2020, 6 months

[Business]
  • Video Content Understanding: more insights into tags
  • Transferred to the Advertising Department in Jun 2020
[CS fundamentals] Data Structure
[Programming] Scala/Spark
[Project] tencent-ad-2020
  • final score: 1.401474
  • rank: 242

Phase 3a

Jul 2019 to Dec 2019, 6 months

[Work at BILIBILI] Algorithm Engineer
[Algorithms] Video Content Understanding
  • Classification with XGBoost, BERT (binary/multi-label)
  • Fine-tune large BERT models with multi-GPU support
  • Extensive paper reading and summary sharing: Pre-trained Language Models
[Engineering]
  • Model: publishing and online serving
  • HiveQL: training data preparation and statistical reports
  • gRPC
[Business] Recommender System
  • Gradually became familiar with the Recall/Rank/Strategy modules
[CS fundamentals] Data Structure
[Programming] Java

Phase 2

Jan 2019 to Feb 2019, 2 months

[Project] Ad Click-Through Rate Prediction: Criteo CTR
  • Code on GitHub, by lambdaji
    • a good introduction to the CTR field: mainstream models are covered, including FM, FFM, DeepFM, Wide&Deep, etc.
    • a good tutorial on the TensorFlow framework: the code is implemented in TensorFlow and uses the high-level tf.estimator API together with TensorFlow Serving
  • Paper reading: to understand CTR models systematically
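
For readers new to the CTR models listed above, here is a minimal NumPy sketch of the second-order factorization machine (FM) score, which DeepFM and related models build on. It is not taken from the lambdaji repository; the function name `fm_predict` and the parameter names `w0`, `w`, `V` are illustrative assumptions. It uses the standard identity that reduces the pairwise interaction term from O(n²k) to O(nk).

```python
import numpy as np


def fm_predict(x, w0, w, V):
    """Second-order FM click probability for one sample (illustrative sketch).

    x  : (n,)   feature vector
    w0 : float  global bias
    w  : (n,)   first-order weights
    V  : (n, k) latent factors, one k-dimensional vector per feature
    """
    linear = w0 + x @ w
    # Pairwise term via the O(n*k) identity:
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = x @ V                    # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)    # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    logit = linear + pairwise
    # Sigmoid turns the raw score into a probability.
    return 1.0 / (1.0 + np.exp(-logit))
```

In practice `x` is a very sparse one-hot encoding, so real implementations work with embedding lookups rather than dense matrix products; the dense form is kept here only for readability.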
[Project] Payment Risk Identification: ATEC Payrisk
  • Feature selection: avoid overfitting on the test dataset
  • Practice with XGBoost
[Course] TensorFlow for Deep Learning Research, by Chip Huyen@Stanford
[Intern Hunting]

Mar 2019 to Jun 2019, 4 months

[Project] Captcha Recognition
[Project] Anti-crawling Fontlib in Website
  • anti-crawl-fontlib-img:
    • process the WOFF file used for anti-crawling and dynamically reconstruct the fontlib (key-char pairs) for further text parsing
    • characters are obtained via OCR; accuracy can reach 90%
  • anti-crawl-fontlib-svg:
    • convert the fontlib into SVG files and use each glyph's unique d-path as the retrieval feature
    • accuracy can reach almost 100% within 1s, 20x faster than the previous OCR method
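
The d-path lookup idea can be sketched as follows. This is not the project's actual code: the helper names (`glyph_paths`, `build_path_index`, `rebuild_fontlib`, `decode`) are hypothetical, and the sketch assumes that the WOFF file has already been exported to an SVG font by an external tool and that the obfuscated font reuses identical glyph outlines while only remapping code points.

```python
import xml.etree.ElementTree as ET


def glyph_paths(svg_text):
    """Return {code point string: normalized d-path} for every <glyph> in an SVG font."""
    paths = {}
    for elem in ET.fromstring(svg_text).iter():
        # Strip a possible XML namespace prefix such as "{http://www.w3.org/2000/svg}".
        local_name = elem.tag.rsplit("}", 1)[-1]
        key, d = elem.get("unicode"), elem.get("d")
        if local_name == "glyph" and key and d:
            paths[key] = " ".join(d.split())  # collapse whitespace differences
    return paths


def build_path_index(reference_glyphs):
    """Invert a labeled reference font: d-path -> real character."""
    return {d: char for char, d in reference_glyphs.items()}


def rebuild_fontlib(new_glyphs, path_index):
    """Map each obfuscated key to its real character by exact d-path match."""
    return {key: path_index[d] for key, d in new_glyphs.items() if d in path_index}


def decode(text, fontlib):
    """Replace obfuscated characters in scraped text; leave everything else untouched."""
    return "".join(fontlib.get(ch, ch) for ch in text)
```

Because the lookup is a plain dictionary hit instead of an image recognition step, it is cheap per glyph, which is consistent with the speedup over OCR noted above. Glyphs whose outline is not in the reference index are simply left undecoded.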
[Project] tensorflow-nlp-sentiment-analysis
  • Sentiment analysis of restaurant reviews from Dianping, implemented in TensorFlow (including Estimator and Serving)
  • It is also a good tutorial on basic NLP models (e.g. LSTM and its variants) and multi-head attention from the Transformer
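
As a quick reference for the multi-head attention mentioned above, here is a minimal NumPy sketch of self-attention for a single sequence. It is not the repository's TensorFlow implementation; the function and weight names (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are illustrative assumptions, and masking, dropout, and biases are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head scaled dot-product self-attention.

    x              : (seq_len, d_model) input sequence
    Wq, Wk, Wv, Wo : (d_model, d_model) projection matrices
    Returns the (seq_len, d_model) output and the
    (num_heads, seq_len, seq_len) attention weights.
    """
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    # Each head compares every query position with every key position.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)
    # Concatenate the heads back to d_model, then mix them with Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights
```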
[Course] Natural Language Processing with Deep Learning, by Manning@Stanford
[Job Hunting]

Phase 1

Jul 2018 to Aug 2018, 2 months

[Course] Machine Learning, by Andrew Ng@Stanford
  • Online videos
    • both English and Chinese subtitles are included
    • total 20 hours
  • Coursera enrollment
    • take the quizzes after each unit
    • finish 8 programming tasks in MATLAB
[Programming] Python Tutorial (Python教程), by Liao Xuefeng (廖雪峰)
[Programming] Numpy (in Python)

Sep 2018 to Oct 2018, 2 months

[Book] Statistical Learning Methods (统计学习方法), by Li Hang (李航)
  • Basic theory of Machine Learning
  • Studying and deriving the formulas is strongly recommended
[Book] Machine Learning in Action (机器学习实战)
  • Python code is included
  • Select chapters and practice the code corresponding to Statistical Learning Methods (统计学习方法)
[Programming] Pandas, scikit-learn (in Python)

Nov 2018 to Dec 2018, 2 months

[Project] Kaggle mini-projects practice (5+)
[Course] Data Structure I, by Deng Junhui (邓俊辉)@THU
[Book] Practical Statistics for Data Scientists (面向数据科学家的实用统计学)

Next refresh: Jul 2022.

More

  • Let's play with data
  • Machine learning will provide new perspectives
  • Big thanks to Tao Li, Haha Hu, Jiajia Cong, Jianxiong Ma