Machine Learning for Finance (FN 570) 2019-20 Module 3 (Spring 2020)

Announcements

Email is the preferred method of communication. A class mailing list will be created at PHBS.MLF@allmail.net, but announcements will be made in the DingTalk group chat.

Course Project

Important Dates:
- 3.29 (Sun): Team formation
- 4.07 (Tues): Dataset selection
- 4.14 (Tues) (2~3 teams) / 4.17 (Fri) (the rest): Presentation
- 4.26 (Sun): Submission (Github) deadline
Project page
Previous Years: 2017 | 2018

Lectures:

01 (2.18 Tue): Course overview (Syllabus), Python, Github, Etc.
02 (2.21 Fri): HSBC Guest Lecture [1/4] Model management cycle in banking industry, Tool setup (GCP/Ali Cloud).
03 (2.25 Tue): Brief Python crash course (Basic | Numpy, Notebook Shortcut Keys) | Intro (Slides, Reading: PML Ch. 1) | Notations, Regression (Slides)
04 (2.28 Fri): Regression weight update (Slides, Reading: PML Ch. 2: Perceptron, Adaline, Gradient descent, SGD)
05 (3.03 Tue): Logistic Regression (Slides, Reading: PML Ch. 3)
06 (3.06 Fri): LR (continued) | SVM (Slides, Reading: PML Ch. 3)
07 (3.10 Tue): KNN and Decision Tree (Slides, Reading: PML Ch. 3)
08 (3.13 Fri): Data Preprocessing (Reading: PML Ch. 4), SVD/PCA (Slides, Reading: PML Ch. 5)
09 (3.17 Tue): LDA (Slides, Reading: PML Ch. 5), Hyperparameters (Slides, Reading: PML Ch. 6)
10 (3.20 Fri): HSBC Guest Lecture [2/4] Data mining, profiling, visualization, and conclusion.
11 (3.24 Tue): Bias-Variance, Cross-validation (Slides, Reading: PML Ch. 6)
12 (3.27 Fri): HSBC Guest Lecture [3/4] Model sharings.
13 (3.31 Tue): Evaluation Metric (Slides, Reading: PML Ch. 6), Ensemble (Reading: PML Ch. 7)
14 (4.03 Fri): HSBC Guest Lecture [4/4] Practical issues of applying ML to the real world.
15 (4.07 Tue): Midterm Exam (Solution)
16 (4.10 Fri): Neural Network, Deep Learning, CNN (Reading: PML Ch. 12-15)
17 (4.14 Tue): Midterm exam review, More on deep learning (TensorFlow), Course Project Presentation (2~3 teams)
18 (4.17 Fri): Course Project Presentation (the rest)

Course Resources

Course slides: Intro | Regression | SVM/KNN/Tree | SVD/PCA/LDA | Hyperparameter | Neural Network | Graphical Model
Past Exam: 2017 | 2018 | 2019
Exams from Tom Mitchell's ML course (Carnegie Mellon University)

Homeworks:

Set 0: [Required Software] [Due by 2.22 Sat]
- Register on Github.com and let the TA know your ID (by DingTalk). Make sure to use your full real name in your profile. Accept the invitation to the PHBS organization from the TA.
  - Create a designated repository GITHUB_ID/PHBS_MLF_2019 for your HW and project. Tick Initialize this repository with a README and select python under .gitignore
  - Fork PML repository to your repository.
- Install Github Desktop (available on CMS). Then clone the two repositories to your local storage.
- Install Anaconda Python distribution (3.X version, not 2.X version). Anaconda distribution is core Python + useful scientific computation libraries (e.g., numpy, scipy, pandas) + package management system (pip or conda)
- Install PyCharm Community version. (Or Professional version after applying for free student license)
- Save the screenshot of (1) Github Desktop (showing 2 repositories) (2) Jupyter Notebook (Anaconda) (3) PyCharm (See my example) and make sure to press Push Origin to sync with the online repository in github.com.
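
After installing Anaconda, a quick check like the following can confirm that the libraries named above are visible to Python. This is only a minimal sketch: the helper name `check_environment` is hypothetical, and the package list simply mirrors the examples mentioned in the Anaconda item (numpy, scipy, pandas).

```python
import importlib.util
import sys


def check_environment(packages=("numpy", "scipy", "pandas")):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # The course asks for Python 3.X, not 2.X.
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'installed' if found else 'MISSING'}")
```

Running this inside a Jupyter Notebook cell or from PyCharm should report every package as installed if the Anaconda interpreter is the one being used.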
Set 1: [Playing with Pandas dataframe] [Due by 3.11 Wed]
- The goal of this HW is to become familiar with the pandas package and dataframes. Due to limited time, I cannot cover pandas in class, so you need to teach yourself. Remember that there are many ways to do the tasks I am asking below. Use your own way.
- For this HW, we will use Polish companies bankruptcy data Data Set from UCI Machine Learning Repository. Download the dataset and put the 4th year file (4year.arff) in your YOUR_GITHUB_ID/PHBS_MLF_2019/HW1/
- I did some basic processing of the data (loading it into a dataframe and creating the bankruptcy column). See my github
- We are going to use the following 4 features: X1 net profit / total assets, X2 total liabilities / total assets, X7 EBIT / total assets, X10 equity / total assets, and class
- Create a new dataframe with only the 4 features (and Bankruptcy). Properly rename the columns to X1, X2, X7, and X10
- Fill in the missing values (NA) with the mean. (See Ch 4 of PML)
- Find the mean and std of the 4 features among all, bankrupt and still-operating companies (3 groups).
- How many companies satisfy the condition X1 < mean(X1) - std(X1) AND X10 < mean(X10) - std(X10)?
- What is the ratio of the bankrupted companies among the sub-groups above?
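
The pandas operations involved in Set 1 can be sketched as follows. This is an illustration only, not the instructor's solution: it runs on synthetic random data rather than 4year.arff, and the raw column names (`Attr1`, `Attr2`, ...) and the 0/1 `Bankruptcy` column are assumptions about what the preprocessed dataframe looks like.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the preprocessed dataframe (assumed column names).
rng = np.random.default_rng(0)
n = 1000
raw = pd.DataFrame({
    "Attr1": rng.normal(0.05, 0.2, n),
    "Attr2": rng.normal(0.5, 0.3, n),
    "Attr7": rng.normal(0.06, 0.2, n),
    "Attr10": rng.normal(0.4, 0.3, n),
    "Attr64": rng.normal(1.0, 0.5, n),      # an unused feature, dropped below
    "Bankruptcy": rng.integers(0, 2, n),    # 1 = bankrupt, 0 = still operating
})
raw.loc[raw.index[::25], "Attr1"] = np.nan  # inject some missing values

# Keep only the 4 features plus the label, and rename the columns.
features = ["X1", "X2", "X7", "X10"]
df = raw[["Attr1", "Attr2", "Attr7", "Attr10", "Bankruptcy"]].rename(
    columns={"Attr1": "X1", "Attr2": "X2", "Attr7": "X7", "Attr10": "X10"}
)

# Fill missing values with each column's mean.
df[features] = df[features].fillna(df[features].mean())

# Mean and std over all companies, and separately for each Bankruptcy group.
# Note: pandas computes the sample std (ddof=1) by default.
stats_all = df[features].agg(["mean", "std"])
stats_by_group = df.groupby("Bankruptcy")[features].agg(["mean", "std"])

# Companies with both X1 and X10 more than one std below their means.
mask = (df["X1"] < df["X1"].mean() - df["X1"].std()) & (
    df["X10"] < df["X10"].mean() - df["X10"].std()
)
n_selected = int(mask.sum())

# Share of bankrupt companies in that subgroup (NaN if the subgroup is empty).
ratio = df.loc[mask, "Bankruptcy"].mean() if n_selected > 0 else float("nan")

print(stats_all)
print(stats_by_group)
print(n_selected, ratio)
```

Because the mean of a 0/1 column equals the fraction of ones, `ratio` is the bankruptcy rate within the selected subgroup.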
Set 2: [Classifiers] [Due by 3.19 Thurs]
- The goal of this HW is to become familiar with the basic classifiers in PML Ch. 3.
- For this HW, we continue to use Polish companies bankruptcy data Data Set from UCI Machine Learning Repository. Download the dataset and put the 4th year file (4year.arff) in your YOUR_GITHUB_ID/PHBS_MLF_2019/HW2/
- I did some basic processing of the data (loading it into a dataframe, creating the bankruptcy column, changing column names, filling in NA values, training-vs-test split, standardization, etc). See my github
- Select the 2 most important features using LogisticRegression with L1 penalty. (Adjust C until you see 2 features)
- Using the 2 selected features, apply LR / SVM / decision tree. Try your own hyperparameters (C, gamma, tree depth, etc) to maximize the prediction accuracy. (Just try several values. You don't need to show your answer is the maximum.)
- Visualize your classifiers using the plot_decision_regions function from PML Ch. 3
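
The feature-selection and classification steps of Set 2 could look roughly like the sketch below. It is hedged throughout: the data are synthetic (`make_classification`), the helper `select_two_features` is a hypothetical name, and the hyperparameter values are arbitrary starting points rather than tuned answers. The helper scans C upward from strong regularization and stops at the first value that leaves at least two nonzero weights, keeping the two largest in magnitude. Plotting with `plot_decision_regions` is omitted because that function lives in the PML repository.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bankruptcy data (assumption, not the real file).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)


def select_two_features(X, y, C_grid):
    """Increase C until the L1-penalized model keeps at least two weights."""
    for C in sorted(C_grid):
        lr = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        lr.fit(X, y)
        weights = np.abs(lr.coef_[0])
        if np.count_nonzero(weights) >= 2:
            top_two = np.argsort(weights)[-2:]  # two largest |weights|
            return np.sort(top_two), C
    raise ValueError("No C in the grid left two nonzero weights")


selected, C_used = select_two_features(X_train_std, y_train,
                                       np.logspace(-3, 1, 40))

# Fit the three classifiers on the two selected features only.
models = {
    "LR": LogisticRegression(C=1.0),
    "SVM": SVC(kernel="rbf", C=1.0, gamma=0.5),
    "Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
accuracy = {}
for name, model in models.items():
    model.fit(X_train_std[:, selected], y_train)
    accuracy[name] = model.score(X_test_std[:, selected], y_test)

print("selected feature indices:", selected, "at C =", C_used)
print(accuracy)
```

Smaller C means stronger L1 regularization and therefore fewer nonzero weights, which is why adjusting C controls how many features survive.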
Set 3: [PCA/Hyperparameter/CV] [Due by 4.4 Sat]
- The goal of this HW is to be familiar with PCA (feature extraction), grid search, pipeline, etc.
- For this HW, we continue to use Polish companies bankruptcy data Data Set from UCI Machine Learning Repository. Download the dataset and put the 4th year file (4year.arff) in your YOUR_GITHUB_ID/PHBS_MLF_2019/HW3/
- Use the same pre-processing provided in Set 2 (loading into a dataframe, creating the bankruptcy column, changing column names, filling in NA values, training-vs-test split, standardization, etc). See my github
- Extract 3 features using PCA method.
- Using the selected features from above, we are going to apply LR / SVM / decision tree.
- Implement the methods using pipeline. (PML p185)
- Use grid search for finding optimal hyperparameters. (PML p199). In the search, apply 10-fold cross-validation.
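
A pipeline that combines standardization, PCA, and a classifier, tuned by grid search with 10-fold cross-validation, might be sketched as follows. The data are synthetic and the parameter grid is an arbitrary example; only the SVM case is shown, and LR or a decision tree could be swapped in as the last step with their own grids.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bankruptcy data (assumption, not the real file).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Scaling and PCA sit inside the pipeline, so they are re-fit on the
# training part of every cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),   # extract 3 features
    ("clf", SVC(kernel="rbf")),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__gamma": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=10)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
print("test accuracy:", test_accuracy)
```

Keeping the preprocessing inside the pipeline avoids leaking information from the validation folds into the scaler and the PCA projection.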

Syllabus

Classes:

Lectures: Tuesday & Friday 1:30 – 3:20 PM
Venue: Online/DingTalk ~~PHBS Building, Room 229~~

Instructor: Jaehyuk Choi

Office: PHBS Building, Room 755
Phone: 86-755-2603-0568
Email: jaehyuk@phbs.pku.edu.cn
Office Hour: Online/DingTalk (TBA)

Teaching Assistant: Shiqi Zhang (张诗琪)

Email: 1701213153@sz.pku.edu.cn
TA Office Hour: Online/DingTalk ~~(Room 213/214)~~

Course overview

With advances in computing power and big data, machine learning (ML) has recently become one of the most prominent research fields in industry and academia. This course provides a broad introduction to ML from theoretical and practical perspectives. Through this course, students will learn the intuition and implementation behind popular ML methods and gain hands-on experience with ML software packages such as scikit-learn and TensorFlow. This course will also explore the possibility of applying ML to finance and business. Each student is required to complete a final course project. This year, the compliance analytics team at HSBC will give 4 guest lectures throughout the course to demonstrate how ML is developed and shared in the banking industry. In the guest lectures, students will also learn how to use cloud computing (Google Cloud Platform/Ali Cloud).

Prerequisites

This course assumes prior knowledge of probability/statistics and experience in Python. It is recommended for those who have taken introductory ML/AI courses in an undergraduate program.

Textbooks and Reading Materials

Primary textbook

PML (primary textbook): Python Machine Learning by Sebastian Raschka

ML

ISLR: An Introduction to Statistical Learning (with Applications in R) by James, Witten, Hastie, and Tibshirani
Bishop: Pattern Recognition and Machine Learning by Bishop (Microsoft)
ESL: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
CML: Coursera Machine Learning by Andrew Ng
DL: Deep Learning by Goodfellow, Bengio, and Courville

ML in Finance

AFML: Advances in Financial Machine Learning by López de Prado

Useful Github Repositories

PML: PHBS/python-machine-learning-book-2nd-edition (forked)
ISLR-Python: PHBS/ISLR-python (forked) ISLR implemented in Python

Assessment / Grading Details

Attendance 20%, Mid-term exam 30%, Assignments 20%, Course Project 30%
Attendance: Checked randomly. The score is calculated as 20 – 2 × (number of absences). A leave request should be made 24 hours in advance with supporting documents, except in an emergency. Job interviews and internships are not valid reasons for leave.
Mid-term exam: 4.7 Tues. In-class open-book without computer/phone/calculator
Course project: Data Proposal and Presentation. Group of up to ?? people.
Grades are given in letters (e.g., A+, A-, ..., D+, D, F). Fewer than 30% of students receive A- or above, and more than 10% receive B- or below.
