Machine Learning
- anaconda
- data analysis
- pandas, numpy
- data visualization - matplotlib
- scikit-learn - machine learning models
- supervised learning (classification, regression)
- neural networks (transfer learning, deep learning, time series)
- TensorFlow and Keras - image classification
- data engineering - Spark / Hadoop
- real-time and batch processing
- storytelling and communication
What is Machine Learning?
Machines learn to do tasks for us.
AI - a machine that acts like a human. Machine learning is a subset of AI - a way of achieving AI by getting computers to act without being explicitly programmed (no hand-coded if/else rules).
Deep learning is a subset of machine learning - one of the techniques used to implement machine learning.
Data science - analysing data and doing something useful with it.
Data Engineering
Teachable machine https://teachablemachine.withgoogle.com/
People traditionally analyse spreadsheets; machine learning is aimed at big data - NoSQL, unstructured data.
Data collection (this is the hardest part). Data modelling is an iterative process: problem definition -> data -> evaluation -> features -> modelling -> experiments -> redefine the problem.
Deployment
K-nearest neighbours, perceptron, support vector machine, artificial neural network, decision tree
Machine learning predicts results based on incoming data. Supervised learning uses labelled data (e.g. a CSV with labels), unsupervised learning uses data without labels, and reinforcement learning uses trial and error with rewards and punishments.
Typical tasks: classification and regression (supervised); clustering and association rule learning (unsupervised); skill acquisition and real-time learning (reinforcement).
Machine learning is using an algorithm / computer program to learn about different patterns in data and make predictions regarding future similar data
Input and output: a normal program takes inputs plus hand-written rules and produces outputs; a machine learning model takes inputs and outputs and figures out the rules itself (like being given the ingredients and a finished chicken dish and working out the recipe).
when not to use machine learning
- will a simple hand-coded instruction based system work?
when to use machine learning
- supervised learning: samples (data) and labels - I know my inputs and outputs. Classification - is it one thing or another? (binary or multi-class). Regression - predicting a number that can go up or down, e.g. the sale price of a house.
- unsupervised learning: I have inputs (samples/data) but no labels - I am not sure of the outputs. Clustering - the machine puts samples into groups, then you provide the labels for the groups.
- transfer learning: I think my problem may be similar to something else - pass knowledge from one model to another.
- reinforcement learning: rewarding and punishing - update a score and try to maximise it.
Matching your problem
The more data the better
- structured -> Excel, CSV
- unstructured -> images, audio files, natural language text
Streaming - data that keeps changing, e.g. stock prices moving on news. Typical toolchain: Jupyter notebooks -> pandas -> matplotlib -> scikit-learn.
What defines success for us? E.g. a machine learning model reaching a certain % accuracy.
Classification
- accuracy
- precision
- recall
Regression
- Mean absolute error
- Mean squared error
- Root mean squared error
Recommendation
- Precision at K
Feature is another word for the different forms of data; feature variables are used to predict the target variable.
Heart disease example: weight, sex, heart rate, chest pain - features can be numerical or categorical.
Derived feature -> e.g. number of hospital visits in the last year.
Feature engineering - creating new features out of existing ones.
Feature coverage -> ideally every sample has the same features; in practice some samples may be missing some features (columns).
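A quick way to check feature coverage is pandas' missing-value helpers; a minimal sketch (the heart-disease.csv filename and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical dataset that may contain missing values
df = pd.read_csv("heart-disease.csv")

# Proportion of non-missing values per feature (1.0 = full coverage)
print(df.notna().mean().sort_values())

# One simple option: keep only rows where every feature is present
df_complete = df.dropna()
```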
Part 1 - splitting the data into 3 sets; based on our problem and data, which machine learning model should we use?
Choosing and training a model, tuning a model, model comparison.
The training, validation (tuning) and test sets
Split the data into 3 different sets:
Course materials (training set) 70-80% | Practice exam (validation set) 10-15% | Final exam (test set) 10-15%
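A minimal sketch of that 70/15/15-style split using scikit-learn's train_test_split called twice (the dummy X and y stand in for real features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for real features and labels
X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)

# First carve off the training set (70%)...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=42)

# ...then split the remainder evenly into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```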
- Choosing a model?
Structured data -> CatBoost, XGBoost (dmlc), Random Forest. Unstructured data -> deep learning, transfer learning.
Training a model happens on the training data.
- Tuning a model: done on the validation data (or the training data).
Random forests and neural networks (e.g. number of layers) have hyperparameters you can adjust - this is tuning.
- Model comparison: done during experimentation on the test data - happens only on the test set. This indicates how models will perform in production.
Underfitting example: Training 64%, Test 47% (poor on both).
Overfitting example: Training 93%, Test 99% (test higher than training is a red flag).
Overfitting can be caused by data leakage; underfitting by data mismatch (training and test data not sharing the same features).
Fixes for underfitting:
- try a more advanced model
- increase model hyperparameters
- reduce the number of features
- train longer
Fixes for overfitting:
- collect more data
- try a less advanced model
Remember that no model is perfect.
Data analysis covers the data/evaluation/features steps; machine learning covers modelling; then experimenting.
Data analysis - pandas, matplotlib, NumPy. Data modelling - TensorFlow, PyTorch, scikit-learn, XGBoost, CatBoost. Environment - Jupyter Notebooks, Anaconda.
Know which tool to use for which kind of problem - you don't need to know everything by heart.
anaconda-navigator
Anaconda and Miniconda - the hardware store and workbench for machine learning. conda - a data scientist's personal assistant; it comes with Miniconda.
conda is like a package manager
conda create --prefix ./anacondaenv pandas numpy matplotlib scikit-learn jupyter
(base) - the main environment in Anaconda
conda activate /Users/emanueldumitru/env (activate an environment by path)
conda deactivate
conda env list
conda install jupyter
jupyter notebook
conda create -n py33
- Jupyter notebook shortcuts: m for a markdown cell, y for a code cell
exporting an environment
conda env export --prefix /Users/daniel/Desktop/project_1/env > environment.yml
conda env create --file environment.yml --name env_from_file
TODO: personal project - maybe predicting the accuracy of raised alarms for a customer?
NumPy operations are a lot faster than using Python lists - you can do large numerical calculations really fast because the optimisations are written in C.
Vectorization via broadcasting (avoiding loops).
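A small sketch of both ideas (vectorized maths and broadcasting) in NumPy:

```python
import numpy as np

# Vectorized: one expression over the whole array, no Python loop
a = np.arange(1_000_000)
b = a * 2 + 1

# Broadcasting: a (3, 1) column combines with a (3,) row into a (3, 3) grid
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
print(col + row)  # shapes (3, 1) and (3,) broadcast to (3, 3)
```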
Jupyter - Shift + Tab shows intellisense/documentation for the object under the cursor
Matplotlib is for visualising data with plots/figures. It is built on NumPy arrays and integrates directly with pandas. You can create basic or advanced plots.
Try to always use the object-oriented interface over the pyplot interface.
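A minimal example of the object-oriented interface (create the figure and axes explicitly, then call methods on the axes object):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Object-oriented interface: fig/ax objects instead of bare plt.plot(...)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x))
ax.set(title="A simple plot", xlabel="x", ylabel="sin(x)")
plt.show()
```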
Outlier - a value that lies far from the mean (several standard deviations away)
Scikit-learn:
- Python machine learning library
- learns patterns from data
- makes predictions
- built on NumPy and matplotlib
Scikit-learn workflow:
- get the data ready
- pick a model to suit your problem
- fit the model to the data and make a prediction (learning and using patterns)
- evaluate the model
- improve through experimentation
- save and reload a trained model
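A minimal end-to-end run of that workflow, using scikit-learn's built-in breast cancer dataset and a RandomForestClassifier as an example model choice:

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get the data ready
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Pick a model to suit the problem
model = RandomForestClassifier(random_state=42)

# 3. Fit the model to the data and make a prediction
model.fit(X_train, y_train)
preds = model.predict(X_test)

# 4. Evaluate the model (mean accuracy on the test set)
print(model.score(X_test, y_test))

# 5. Save and reload the trained model
joblib.dump(model, "model.joblib")
reloaded_model = joblib.load("model.joblib")
```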
Machine learning is simply a function/model - it receives an input and produces an output.
Debugging Jupyter warnings: useful conda (package manager) commands are conda list, conda update, conda search; check the conda documentation.
Restart the notebook from the start: Kernel -> Restart.
%matplotlib inline
conda install python=3.6.9 scikit-learn=0.22 matplotlib
conda list scikit-learn
In short, to fix warnings you may need to update the libraries you are using.
Once your data is all in numerical format, there's one more transformation you'll probably want to do to it.
It's called Feature Scaling.
In other words, making sure all of your numerical data is on the same scale.
Normalization, Standardization
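A minimal sketch of both on a dummy numeric column, using scikit-learn's scalers (in practice, fit the scaler on the training set only and then transform the test set with it):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [1000.0]])

# Normalization (min-max scaling): squeezes values into the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Standardization: removes the mean and scales to unit variance
print(StandardScaler().fit_transform(X))
```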
Use the scikit-learn machine learning map to check which model best fits your problem. RandomForestRegressor is based on decision trees.
Writing an if/else statement program by yourself is basically building a decision tree.
ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr).
- True positive = model predicts 1 when truth is 1
- False positive = model predicts 1 when truth is 0
- True negative = model predicts 0 when truth is 0
- False negative = model predicts 0 when truth is 1
ROC curves and AUC metrics are evaluation metrics for binary classification models (a model which predicts one thing or another, such as heart disease or not).
The ROC curve compares the true positive rate (tpr) versus the false positive rate (fpr) at different classification thresholds.
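A sketch of plotting a ROC curve and computing AUC with scikit-learn, given true labels and predicted probabilities for the positive class (the numbers below are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])

# tpr/fpr at different classification thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_probs)
print("AUC:", roc_auc_score(y_true, y_probs))

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="guessing")  # AUC = 0.5 baseline
ax.set(xlabel="False positive rate", ylabel="True positive rate")
ax.legend()
plt.show()
```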
import sys
!conda install --yes --prefix {sys.prefix} seaborn
Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
Confusion matrix - Compares the predicted values with the true values in a tabular way; if 100% correct, all values sit on the diagonal (top left to bottom right).
Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
Classification report - Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and F1-score.
ROC Curve - Also known as receiver operating characteristic is a plot of true positive rate versus false-positive rate.
Area Under Curve (AUC) Score - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.
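A sketch computing several of these metrics with scikit-learn (y_test and preds below are dummy stand-ins for real model outputs):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

y_test = [1, 0, 1, 1, 0, 1, 0, 0]  # made-up true labels
preds = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

print("Accuracy: ", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("Recall:   ", recall_score(y_test, preds))
print("F1:       ", f1_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))

# Cross-validation: average a model's score over 5 train/test splits, e.g.
# from sklearn.model_selection import cross_val_score
# cross_val_score(model, X, y, cv=5).mean()
```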
Accuracy is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).
Precision and recall become more important when classes are imbalanced.
If false-positive predictions are worse than false-negatives, aim for higher precision.
If false-negative predictions are worse than false-positives, aim for higher recall.
F1-score is a combination of precision and recall.
A confusion matrix is always a good way to visualize how a classification model is doing.
R^2 (pronounced r-squared) or the coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
R^2 is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R^2 value is to 1.0, the better the model. But it doesn't really tell you how far off each prediction is.
MAE gives a better indication of how far off each of your model's predictions is on average.
As for choosing between MAE and MSE: because MSE squares the differences between predicted and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).
Pay more attention to MAE: When being $10,000 off is twice as bad as being $5,000 off.
Pay more attention to MSE: When being $10,000 off is more than twice as bad as being $5,000 off.
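A sketch of the three regression metrics with scikit-learn, on made-up house prices:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([300_000, 250_000, 400_000])  # actual sale prices (made up)
y_pred = np.array([290_000, 255_000, 390_000])  # model predictions (made up)

print("R^2: ", r2_score(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("MSE: ", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```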
First predictions = baseline predictions First model = baseline model
From a data perspective:
- Could we collect more data? (generally, the more data the better)
- Could we improve our data? (e.g. better features)
From a model perspective:
- Is there a better model we can use? ( scikit-learn ML map )
- Could we improve the current model?
Hyperparameters vs Parameters
- Parameters = patterns a model finds in the data
- Hyperparameters = settings on a model you can adjust to potentially improve its ability to find patterns
Three ways to adjust hyperparameters:
- By hand
- Randomly with RandomizedSearchCV
- Exhaustively with GridSearchCV
3 sets: training, validation and test
We're going to try and adjust:
max_depth, max_features, min_samples_leaf, min_samples_split, n_estimators
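A sketch of a randomized search over those hyperparameters (the value ranges are illustrative, and X_train/y_train are assumed to exist from an earlier split):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over the hyperparameters listed above
grid = {
    "n_estimators": [10, 100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions=grid,
                            n_iter=10,  # try 10 random combinations
                            cv=5,       # 5-fold cross-validation
                            verbose=1)
search.fit(X_train, y_train)  # X_train/y_train assumed from an earlier split
print(search.best_params_)
```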
- Excel spreadsheet
- pandas
- Google Sheets
- labels
- rows and columns
- it fits into a table
A good data scientist asks a lot of questions.
conda env export > environment.yml
conda env create --prefix ./env -f <path_to_env_yml_file>
Every project should have its own environment.
Correlation matrix
- Classification and regression metrics: accuracy, precision, recall, F1 (classification); R^2, MAE, MSE, RMSE (regression)
- Confusion matrix: true positives, true negatives, false negatives, false positives
- Classification report
Get data and store it in different data sources
- Structured data (organized; fits neatly into tables)
- Unstructured data (PDFs, emails, etc.)
- Binary data (mp3, images, mp4, etc.)
Data mining - preprocessing and extracting meaning from data. Big data - lots of data, usually hosted with cloud providers like AWS or GCP. Data pipeline - ETL, store, visualize, monitor performance.
Data ingestion - acquiring data from different sources. Data lake - a collection of raw data. Data transformation - converting data into a usable format. Data warehouse - structured, filtered data.
Kafka - handles data streams and feeds them into a data lake. Hadoop / Amazon S3 - data lake storage. Google BigQuery / Amazon Athena / Amazon Redshift - data warehouses.
Types of databases
- Relational Database
- NoSQL
- NewSQL
Using databases for search (Elasticsearch). Using databases for computation (Apache Spark).
OLTP - Online Transaction Processing. OLAP - Online Analytical Processing.
- Postgres
- MongoDB
- Apache Spark
- Apache Flink
- Apache Kafka
TensorFlow - a deep learning / numerical computing framework designed by Google; it is open source.
Unstructured data -> neural networks. Deep learning is best suited to unstructured data like images, audio waves, etc.
- write fast deep learning code in Python able to run on a GPU
- able to access many pre-built deep learning models
- whole stack: preprocess, model, deploy
GPU - Graphics Processing Unit; faster than a CPU for numeric computing
Unstructured data problems - deep learning, TensorFlow, transfer learning, TensorFlow Hub, neural networks
- Classification
- Sequence to sequence - audio wave to text or command
- Object detection
What is Transfer Learning?
- take what you know in one domain and apply it to another
- starting from scratch can be expensive and time-consuming
- why not take advantage of what's already out there?
The TensorFlow workflow is quite similar to the scikit-learn workflow:
- Get the data ready (turn it into tensors, i.e. numerical arrays/matrices)
- Pick a model (to suit your problem, e.g. from TensorFlow Hub)
- Fit the model to the data and make a prediction
- Evaluate the model
- Improve through experimentation
- Save and reload your trained model
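A hedged sketch of that flow with Keras and TensorFlow Hub; the Hub URL, the 224x224 input size and the 120 classes are assumptions based on the MobileNet V2 / dog breed example, not fixed requirements:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Assumed Hub model: a MobileNet V2 feature extractor (swap in what fits you)
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/feature_vector/4"
NUM_CLASSES = 120  # e.g. 120 dog breeds

model = tf.keras.Sequential([
    hub.KerasLayer(MODEL_URL, trainable=False,  # reuse the pre-trained patterns
                   input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # our new output layer
])

model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# model.fit(train_data, epochs=5, validation_data=val_data)  # data assumed
# model.save(...) to save, tf.keras.models.load_model(...) to reload
```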
Dog breed classification
Binary classification - output activation: sigmoid; loss: BinaryCrossentropy.
Multi-class classification - output activation: softmax; loss: CategoricalCrossentropy.
Hill-descending analogy: training is like going down a hill blindfolded, aiming for the bottom - the loss is the height of the hill, and Adam is the friend guiding you on the way down.
Loss function + Adam optimizer (or stochastic gradient descent).
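A minimal Keras sketch contrasting the two setups (the layer sizes and the 10-feature input are arbitrary placeholders):

```python
import tensorflow as tf

# Binary classification: 1 output unit, sigmoid activation, binary crossentropy
binary_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
binary_model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                     optimizer="adam",  # the "friend" guiding the descent
                     metrics=["accuracy"])

# Multi-class: one unit per class, softmax activation, categorical crossentropy
multi_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 classes
])
multi_model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                    optimizer="adam",
                    metrics=["accuracy"])
```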
Transfer learning: MobileNet V2 for Dog Vision.
- The ability to work and communicate with others is essential alongside tech skills
Nice questions to ask for:
- who is it for
- what questions will they have
- what needs do they have
- what concerns can you address before they arise
Deliver the message to be heard and (potentially) understood
Communicating with people on your team:
- How the project is going
- what is in the way
- what you have done
- what you are doing next
- why you are doing something next
- who else could help
- what's not needed
- where you are stuck
- what is not clear
- what questions do we have
- are you still working towards the right thing
- Is there any feedback or advice
What are you working on this week, today, or in the short term? What's working? What's not working? What could be improved? What's your next course of action? Why? What's holding you back?
Reference overlaps
Start the job before you have it by working on and sharing your own projects that relate to the role you are applying for.
- During the week you build knowledge
- During the weekend you work on your own projects ( skill )
- Document on your blog (Medium)
Communicating with people outside your team:
- make sure to make it really functional / business oriented / visual / easy to understand
What story are we trying to tell?
- sharing what worked / what didn't work / who's it for/ what do they need to know
- progress, not perfection
- write it down
- document / blogging
What am I going to learn at this job?
Github / Website / 1 or 2 Big Projects / Blog / Experience
Learning to learn / Start with Why
Statistics course: https://www.youtube.com/watch?v=zouPoc49xbk&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr
Mathematics course: https://www.khanacademy.org/math