The main goals of this lecture:
- introduce new students to the class (yes, new students join every class...)
- all students report progress on their research proposal writing (due next week)
- Correlation Heatmap
- PCA
- LDA (linear discriminant analysis, not the text-mining LDA....)
- Introduction to F1, recall, and precision: common metrics for machine learning
- build your data analytics web app!
Reference: https://python-for-multivariate-analysis.readthedocs.io/a_little_book_of_python_for_multivariate_analysis.html
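A minimal sketch of this week's three topics (correlation heatmap, PCA, LDA) on the built-in iris dataset; assumes pandas, seaborn, matplotlib, and scikit-learn are installed (via `pip install`):

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# 1. Correlation heatmap: pairwise Pearson correlations between the features
sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap.png")

# 2. PCA: unsupervised, finds the directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# 3. LDA: supervised, finds the directions that best separate the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("LDA projected shape:", X_lda.shape)
```

Note the PCA/LDA contrast: PCA ignores the labels `y`, while LDA uses them, which is why `fit_transform` takes both arguments for LDA.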
(a 30-minute makeup session after class, or make an appointment with me)
- terminal operations: launch Jupyter Notebook, learn about 'pip install XXXXX'
- notebook from week1: intro to Pandas, load a dataset into Jupyter Notebook, exploratory data analysis, data cleaning
- notebook from week1: learn about data structure (Lecture_One_Data_Structure.ipynb)
- notebook from week2: learn about linear regression and the concept of p-values; sklearn and statistical computing packages (Lecture_Two_Linear_Regression)
- every student creates an individual project in ColumbiaPython (we have four now)
- finish your research proposal on GitHub as a README file: a. what is your research question? b. why can your dataset answer the question?
- upload new files to GitHub (reference papers, data, and code)
For Next Week: Introduction to Machine Learning for Classification
- example question, concepts, data analysis
- Do we really understand the log loss calculation in logistic regression? Instead of the mean squared error used for linear regression, logistic regression uses a cost function called cross-entropy, also known as log loss.
- Binary Classification
- One vs. Rest: Multiple Categories Classification
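As a preview of the log loss and the classification metrics listed above, here is a hand computation of binary cross-entropy checked against sklearn, plus precision, recall, and F1 on the same toy predictions (the labels and probabilities are made up for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.6, 0.7, 0.4])  # predicted P(y = 1)

# Cross-entropy (log loss): -mean( y*log(p) + (1-y)*log(1-p) )
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print("manual log loss :", manual)
print("sklearn log loss:", log_loss(y_true, p))  # should match

# Threshold the probabilities at 0.5 to get hard class predictions
y_pred = (p >= 0.5).astype(int)  # -> [1, 1, 1, 0]

# precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.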
Why is Naive Bayes naive?
Explain how SVM works. What is the difference between SVM and Random Forest?
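A starting point for the two discussion questions: Naive Bayes is "naive" because it assumes the features are conditionally independent given the class; an SVM looks for a maximum-margin separating boundary, while a Random Forest averages many decision trees. The sketch below fits all three on iris to compare them empirically (the exact scores depend on the train/test split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for model in [GaussianNB(), SVC(), RandomForestClassifier(random_state=0)]:
    # fit on the training split, score accuracy on the held-out test split
    scores[type(model).__name__] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(model).__name__, round(scores[type(model).__name__], 3))
```

On an easy dataset like iris all three classifiers score highly, so the interesting part of the discussion is their assumptions, not their accuracy.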