The main goals of this lecture:
- introduce new students to the class (yes, new students join every class...)
- all students report progress on their research proposal writing (due next week)
- Correlation Heatmap
- PCA
- LDA (linear discriminant analysis, not the text-mining LDA....)
- Introduction to F1, recall, and precision: common metrics for machine learning
- build your data analytics web app!
Reference: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
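A minimal sketch of this week's three topics (correlation heatmap, PCA, LDA) on the built-in iris dataset; assumes pandas, seaborn, matplotlib, and scikit-learn are installed (via `pip install`):

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# 1. Correlation heatmap: pairwise Pearson correlations between the features
sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap.png")

# 2. PCA: unsupervised, finds the directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# 3. LDA: supervised, finds the directions that best separate the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("LDA projected shape:", X_lda.shape)
```

Note the PCA/LDA contrast: PCA ignores the labels `y`, while LDA uses them, which is why `fit_transform` takes both arguments for LDA.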
(a 30-minute makeup session after class, or make an appointment with me)
- terminal operations: launch Jupyter Notebook, learn about 'pip install XXXXX'
- notebook from week1: intro to Pandas, load a dataset into Jupyter Notebook, exploratory data analysis, data cleaning
- notebook from week1: learn about data structure (Lecture_One_Data_Structure.ipynb)
- notebook from week2: learn about linear regression and the concept of p-values; sklearn and statistical computing packages (Lecture_Two_Linear_Regression)
- every student creates an individual project in ColumbiaPython (we have four now)
- finish your research proposal on GitHub as a README file: a. what is your research question? b. why can your dataset answer the question?
- upload new files to GitHub (reference papers, data, and code)
For Next Week: Introduction to Machine Learning for Classification
- example question, concepts, data analysis
- Do we really understand the log loss calculation in logistic regression? Instead of the mean squared error used for linear regression, logistic regression uses a cost function called cross-entropy, also known as log loss.
- Binary Classification
- One vs. Rest: Multiple Categories Classification
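As a preview of the log loss and the classification metrics listed above, here is a hand computation of binary cross-entropy checked against sklearn, plus precision, recall, and F1 on the same toy predictions (the labels and probabilities are made up for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.6, 0.7, 0.4])  # predicted P(y = 1)

# Cross-entropy (log loss): -mean( y*log(p) + (1-y)*log(1-p) )
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print("manual log loss :", manual)
print("sklearn log loss:", log_loss(y_true, p))  # should match

# Threshold the probabilities at 0.5 to get hard class predictions
y_pred = (p >= 0.5).astype(int)  # -> [1, 1, 1, 0]

# precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.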
Why is Naive Bayes naive?
Explain how SVM works. What is the difference between SVM and Random Forest?
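A starting point for the two discussion questions: Naive Bayes is "naive" because it assumes the features are conditionally independent given the class; an SVM looks for a maximum-margin separating boundary, while a Random Forest averages many decision trees. The sketch below fits all three on iris to compare them empirically (the exact scores depend on the train/test split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for model in [GaussianNB(), SVC(), RandomForestClassifier(random_state=0)]:
    # fit on the training split, score accuracy on the held-out test split
    scores[type(model).__name__] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(model).__name__, round(scores[type(model).__name__], 3))
```

On an easy dataset like iris all three classifiers score highly, so the interesting part of the discussion is their assumptions, not their accuracy.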