Starter project code for students taking Udacity ud120
- Naive Bayes
- SVM (SVC, where the C stands for Classifier)
- Decision Tree
- Ensemble methods
  - AdaBoost
  - Random Forest
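All of the above follow sklearn's common fit/predict interface; a minimal sketch using GaussianNB, with made-up toy data:

```python
from sklearn.naive_bayes import GaussianNB

# made-up toy data: two features, two classes
features_train = [[0.0, 0.1], [0.2, 0.3], [0.9, 0.8], [1.0, 0.9]]
labels_train = [0, 0, 1, 1]

clf = GaussianNB()
clf.fit(features_train, labels_train)           # learn from labeled examples
print(clf.predict([[0.1, 0.2], [0.8, 0.9]]))    # -> [0 1]
print(clf.score(features_train, labels_train))  # accuracy (here on training data)
```

Swapping in SVC, DecisionTreeClassifier, AdaBoostClassifier, or RandomForestClassifier only changes the `clf = ...` line.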
More data > a fine-tuned algorithm

Data Types
- Numerical (discrete or continuous?)
- Categories/Enums
- Time series (date/time stamp)
- Text
- Continuous vs. Discrete
  - "Continuous" here means a continuous output range, not a model that learns continuously
Regression's result is often just a simple line fit (y = mx + b)
- reg.predict takes an array of samples
- reg.coef_ and reg.intercept_ hold the fitted slope and intercept
- reg.score returns r^2
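A minimal sketch of those calls; the toy points (lying near y = 2x + 1) are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up points lying near y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[5.0]])))  # predict takes an array of samples
print(reg.coef_, reg.intercept_)       # fitted slope (m) and intercept (b)
print(reg.score(X, y))                 # r^2 on the given data
```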
Classification vs. Regression:
- Output: discrete vs. continuous
- Model: decision boundary vs. best-fit line
- Evaluation: accuracy vs. sum of squared errors / R^2
Unsupervised learning - the data is unlabeled
- Clustering
  - K-means is the most common (a sketch follows this list)
- Dimensionality Reduction
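A minimal k-means sketch; the two-group toy points and n_clusters=2 are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up unlabeled points forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned centroids
```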
Feature Scaling
- https://scikit-learn.org/stable/modules/preprocessing.html
- https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
- Key point: sklearn's scalers expect a numpy array of floats (see the sketch after this list)
- Scaling only affects algorithms that trade one dimension off against another (e.g. SVM with an RBF kernel, k-means); it makes no difference to decision trees or linear regression
- Tip: if only horizontal and vertical lines split the data, each split uses a single dimension, so scaling doesn't matter
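A minimal MinMaxScaler sketch; the weight values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up weights; must be a float numpy array with one column per feature
weights = np.array([[115.0], [140.0], [175.0]])

scaler = MinMaxScaler()
print(scaler.fit_transform(weights))  # maps min -> 0.0 and max -> 1.0
```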
Bag of Words
- A frequency count of words (a sketch follows this list)
- Stop words (a, and, of, ...) are often removed
- Generally just word stems are used (e.g. the stem "love" stands in for "loves")
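This is what sklearn's CountVectorizer builds; a minimal sketch, with a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up two-document corpus
docs = ["hi Katie the machine learning class is great",
        "hi Sebastian the machine learning class is great"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # sparse matrix of word frequencies
print(vectorizer.vocabulary_.get("great"))  # column index assigned to "great"
```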
```python
# Stop words
import nltk
nltk.download('stopwords')       # one-time corpus download
from nltk.corpus import stopwords
len(stopwords.words('english'))  # count of the English stop words

# Stemming
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsiveness")   # -> "respons"
```
TF-IDF: term frequency / inverse document frequency weighting
- SelectPercentile and SelectKBest from sklearn can be used to select the most relevant features
- They follow the usual fit/transform pattern
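A minimal sketch tying the two together; the toy corpus, labels, and the 10% threshold are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# made-up corpus and labels
docs = ["Sara loves machine learning", "Chris hates long meetings",
        "Sara writes short emails", "Chris sends long emails"]
labels = [0, 1, 0, 1]

# weight raw counts by how rare each word is across documents
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# keep only the features most correlated with the labels
selector = SelectPercentile(f_classif, percentile=10)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)
```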
The faces (eigenfaces) example is in sklearn
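A minimal PCA sketch in that spirit, assuming the Labeled Faces in the Wild data via fetch_lfw_people (downloaded on first use) and an illustrative 150 components:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# downloads the Labeled Faces in the Wild data on first use
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = faces.data  # each row is a flattened face image

# project the images onto the top 150 principal components ("eigenfaces")
pca = PCA(n_components=150, whiten=True).fit(X)
X_pca = pca.transform(X)
print(pca.explained_variance_ratio_[:5])  # variance captured by leading components
```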
Note that train_test_split lives in sklearn.model_selection in newer sklearn versions and in sklearn.cross_validation in older ones
Always fit on the training data
Transform and predict on the test data for validation, but DO NOT re-fit
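A minimal sketch of that discipline; the toy data and 30% test size are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB

# toy data, illustrative only
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

scaler = MinMaxScaler().fit(X_train)  # fit on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # transform the test set; never re-fit

clf = GaussianNB().fit(X_train_s, y_train)
print(clf.score(X_test_s, y_test))    # accuracy on the held-out test set
```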
Problems with splitting data into training and test sets:
- Splitting forces you to work with smaller data sets (anything placed in one set shrinks the other)
- K-fold cross-validation:
  - Split the data into K folds
  - Each round uses 1 fold as the test set and the remaining K - 1 folds, combined, as the training set
  - That yields K different train/test splits
  - Run K trainings and average the results (a sketch follows this list)
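A minimal sketch using sklearn's cross_val_score; K = 3 and the toy data are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# toy data, illustrative only: 12 points, two balanced classes
X = np.arange(12, dtype=float).reshape(-1, 1)
y = np.array([0] * 6 + [1] * 6)

# cv=3 -> three folds; each fold serves once as the held-out test set
scores = cross_val_score(GaussianNB(), X, y, cv=3)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged result
```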
- ML is teaching computers to learn from past experiences