THIS IS A WORK IN PROGRESS, TRYING TO EXPORT GOOGLE DOCS DIRECTLY TO GIT. See the Medium article for the Google Doc.

ML-DL-Compendium

Machine & Deep Learning Compendium


The following is my personal compendium, which includes many topics, links, summaries in the fields of statistics, machine learning, deep learning, computer science, data science, deep vision, NLP, cloud computing, product management and others.

I see this compendium as a gateway and as a frequently visited resource for people of all proficiency levels, for industry data scientists as well as academics.

The following link is an automatically generated HTML version of this document, which is updated every 5 minutes. Some of you may prefer it over the editable version that you see here.

Please keep in mind that this is a perpetual work in progress with ~390 pages on many topics. If you feel that something should be changed, please use the comment option and let me know.

I would like to thank the following contributors: Samuel Jefroykin, Sefi Keller

Many Thanks,

Dr. Ori Cohen.

TABLE OF CONTENTS

[TOC]

TYPES OF MACHINE LEARNING

![alt_text](images/image1.png)

A wonderful introduction into machine learning, and how to choose the right algorithm or family of algorithms for the task at hand.

VARIOUS MODEL FAMILIES

Stanford cs221 - reflex, variable, state, logic

WEAKLY SUPERVISED

  1. Text classification with extremely small datasets, relies heavily on feature engineering methods such as number of hashtags, number of punctuations and other insights that are really good for this type of text.
  2. A great review paper on weak supervision, discusses:
    1. Incomplete supervision
    2. Inaccurate
    3. Inexact
    4. Active learning
  3. Stanford on weak supervision
  4. Stanford ai on snorkel
  5. Hazy research on weak and snorkel
  6. Out of distribution generalization using test-time training - Test-time training turns a single unlabeled test instance into a self-supervised learning problem, on which we update the model parameters before making a prediction on this instance.
  7. Learning Deep Networks from Noisy Labels with Dropout Regularization - Large datasets often have unreliable labels—such as those obtained from Amazon’s Mechanical Turk or social media platforms—and classifiers trained on mislabeled datasets often exhibit poor performance. We present a simple, effective technique for accounting for label noise when training deep neural networks. We augment a standard deep network with a softmax layer that models the label noise statistics. Then, we train the deep network and noise model jointly via end-to-end stochastic gradient descent on the (perhaps mislabeled) dataset. The augmented model is overdetermined, so in order to encourage the learning of a non-trivial noise model, we apply dropout regularization to the weights of the noise model during training. Numerical experiments on noisy versions of the CIFAR-10 and MNIST datasets show that the proposed dropout technique outperforms state-of-the-art methods.
  8. Distill to label weakly supervised instance labeling using knowledge distillation - “Weakly supervised instance labeling using only image-level labels, in lieu of expensive fine-grained pixel annotations, is crucial in several applications including medical image analysis. In contrast to conventional instance segmentation scenarios in computer vision, the problems that we consider are characterized by a small number of training images and non-local patterns that lead to the diagnosis. In this paper, we explore the use of multiple instance learning (MIL) to design an instance label generator under this weakly supervised setting. Motivated by the observation that an MIL model can handle bags of varying sizes, we propose to repurpose an MIL model originally trained for bag-level classification to produce reliable predictions for single instances, i.e., bags of size 1. To this end, we introduce a novel regularization strategy based on virtual adversarial training for improving MIL training, and subsequently develop a knowledge distillation technique for repurposing the trained MIL model. Using empirical studies on colon cancer and breast cancer detection from histopathological images, we show that the proposed approach produces high-quality instance-level prediction and significantly outperforms state-of-the-art MIL methods.”
  9. Yet another article summarising FAIR

SEMI SUPERVISED

  1. Paper review

  2. Ruder an overview of proxy labeled for semi supervised (AMAZING)

  3. Self training

    1. Self training and tri training
    2. Confidence regularized self training
    3. Domain adaptation for semantic segmentation using class balanced self-training
    4. Self labeled techniques for semi supervised learning
  4. Tri training
    1. Trinet for semi supervised Deep learning
    2. Tri training exploiting unlabeled data using 3 classifiers, paper
    3. Improving tri training with unlabeled data
    4. Tri training using NN ensemble
    5. Asymmetric tri training for unsupervised domain adaptation, another implementation, another, paper
    6. Tri training git

  5. Fast ai forums

  6. UDA GIT, paper, medium*, medium 2 (has data augmentation articles)

  7. s4l

  8. Google’s UDA and MixMatch dissected - For text classification, the authors used a combination of back translation and a new method called TF-IDF based word replacing.

    Back translation consists of translating a sentence into some other intermediate language (e.g. French) and then translating it back to the original language (English in this case). The authors trained an English-to-French and French-to-English system on the WMT 14 corpus.

    TF-IDF word replacement replaces words in a sentence at random based on the TF-IDF scores of each word (words with a lower TF-IDF have a higher probability of being replaced).

  9. MixMatch, medium, 2, 3, 4 - a method that works by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp. The authors show that MixMatch obtains state-of-the-art results by a large margin across many datasets and labeled data amounts.

  10. ReMixMatch - paper is really good. “We improve the recently-proposed “MixMatch” semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring”

  11. FixMatch - FixMatch is a recent semi-supervised approach by Sohn et al. from Google Brain that improved the state of the art in semi-supervised learning (SSL). It is a simpler combination of previous methods such as UDA and ReMixMatch.

![alt_text](images/image2.png)

  1. Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning
  2. FAIR 2 original, Summarization of FAIR’s student teacher weak/ semi supervision
  3. Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training
  4. Fidelity-Weighted Learning - “fidelity-weighted learning” (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of its label-quality estimated by a teacher (who has access to the high-quality labels). Both student and teacher are learned from the data.
  5. Unproven student teacher git
  6. A really nice student teacher git with examples.

![alt_text](images/image3.png)

  1. Teacher student for tri training for unlabeled data exploitation

![alt_text](images/image4.png)

REGRESSION

Metrics:

  1. R2
  2. Medium 1, 2, 3, 4
  3. Tutorial
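As a quick illustration of the R2 metric listed above, here is a minimal sketch that computes the coefficient of determination by hand. The helper name `r2_score` and the toy numbers are assumptions made for this example; the formula itself is the standard `1 - SS_res / SS_tot`.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical values).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2_model = r2_score(y_true, y_pred)

# A model that always predicts the mean of y_true scores 0 by construction.
mean_value = sum(y_true) / len(y_true)
r2_mean_baseline = r2_score(y_true, [mean_value] * len(y_true))

print(round(r2_model, 4), r2_mean_baseline)
```

A perfect fit gives 1, predicting the mean gives 0, and a model worse than the mean gives a negative value.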

ACTIVE LEARNING

  1. If you need to start somewhere start here - types of AL, the methodology, examples, sample selection functions.
  2. A thorough review paper about AL
  3. The book on AL
  4. Choose your model first, then do AL, from lighttag
    1. The alternative is Query by committee - Importantly, the active learning method we presented above is the most naive form of what is called "uncertainty sampling", where we chose to sample based on how uncertain our model was. An alternative approach, called Query by Committee, maintains a collection of models (the committee) and selects the most "controversial" data point to label next, that is, one on which the models disagree. Using such a committee may allow us to overcome the restricted hypothesis a single model can express, though at the onset of a task we still have no way of knowing what hypothesis we should be using.
    2. Paper: warning against transferring actively sampled datasets to other models
  5. How to increase accuracy with AL
  6. AL with model selection - paper
  7. Using weak and strong oracle in AL, paper.
  8. The pitfalls of AL:
    1. how to choose (cost-effectively) the active learning technique when one starts without the labeled data needed for methods like cross-validation;
    2. how to choose (cost-effectively) the base learning technique when one starts without the labeled data needed for methods like cross-validation, given that we know that learning curves cross, and given possible interactions between active learning technique and base learner;
    3. how to deal with highly skewed class distributions, where active learning strategies find few (or no) instances of rare classes;
    4. how to deal with concepts including very small subconcepts (“disjuncts”), which are hard enough to find with random sampling (because of their rarity), but active learning strategies can actually avoid finding them if they are misclassified strongly to begin with;
    5. how best to address the cold-start problem, and especially
    6. whether and what alternatives exist for using human resources to improve learning, that may be more cost efficient than using humans simply for labeling selected cases, such as guided learning [3], active dual supervision [2], guided feature labeling [1], etc.
  9. Confidence based stopping criteria paper
  10. A great tutorial
  11. An ok video
  12. Active learning framework in python
  13. Active Learning Using Pre-clustering
  14. A literature survey of active machine learning in the context of natural language processing
  15. Mnist competition (unpublished) using AL
  16. Practical Online Active Learning for Classification
  17. Video 2
  18. Active learning in R - code
  19. Deep bayesian active learning with image data
  20. Medium on AL***

Robert munro on active learning - should buy his book:

  1. GIT
  2. Active transfer learning
  3. Uncertainty sampling
    1. Least Confidence: difference between the most confident prediction and 100% confidence
    2. Margin of Confidence: difference between the top two most confident predictions
    3. Ratio of Confidence: ratio between the top two most confident predictions
    4. Entropy: difference between all predictions, as defined by information theory
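The four uncertainty sampling scores above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the referenced repository: the function names, the toy `pool`, and the choice to normalise every score to the 0-1 range (1 = most uncertain) are assumptions made for this example.

```python
import math


def least_confidence(probs):
    # Gap between the top prediction and 100% confidence, scaled to 0-1.
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)


def margin_of_confidence(probs):
    # 1 minus the difference between the two most confident predictions.
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def ratio_of_confidence(probs):
    # Ratio between the two most confident predictions.
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def entropy_score(probs):
    # Shannon entropy over all predictions, divided by the maximum log2(n).
    raw = -sum(p * math.log2(p) for p in probs if p > 0)
    return raw / math.log2(len(probs))


def most_uncertain(pool, score_fn, k):
    # Return the ids of the k items with the highest uncertainty score.
    ranked = sorted(pool, key=lambda item_id: score_fn(pool[item_id]), reverse=True)
    return ranked[:k]


# Hypothetical softmax outputs for three unlabeled items.
pool = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.50, 0.40, 0.10],
    "c": [0.34, 0.33, 0.33],
}
to_label = most_uncertain(pool, entropy_score, k=2)
print(to_label)
```

Swapping `entropy_score` for any of the other three functions changes the ranking criterion without touching the selection loop.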

![alt_text](images/image5.png)

  1. Diversity sampling - you want to make sure that it covers as diverse a set of data and real-world demographics as possible.
    1. Model-based Outliers: sampling for low activation in your logits and hidden layers to find items that are confusing to your model because of lack of information
    2. Cluster-based Sampling: using Unsupervised Machine Learning to sample data from all the meaningful trends in your data’s feature-space
    3. Representative Sampling: sampling items that are the most representative of the target domain for your model, relative to your current training data
    4. Real-world diversity: using sampling strategies that increase fairness when trying to support real-world diversity

![alt_text](images/image6.png)

  1. Combine uncertainty sampling and diversity sampling
    1. Least Confidence Sampling with Clustering-based Sampling: sample items that are confusing to your model and then cluster those items to ensure a diverse sample (see diagram below).
    2. Uncertainty Sampling with Model-based Outliers: sample items that are confusing to your model and within those find items with low activation in the model.
    3. Uncertainty Sampling with Model-based Outliers and Clustering: combine methods 1 and 2.
    4. Representative Cluster-based Sampling: cluster your data to capture multimodal distributions and sample items that are most like your target domain (see diagram below).
    5. Sampling from the Highest Entropy Cluster: cluster your unlabeled data and find the cluster with the highest average confusion for your model.
    6. Uncertainty Sampling and Representative Sampling: sample items that are both confusing to your current model and the most like your target domain.
    7. Model-based Outliers and Representative Sampling: sample items that have low activation in your model but are relatively common in your target domain.
    8. Clustering with itself for hierarchical clusters: recursively cluster to maximize the diversity.
    9. Sampling from the Highest Entropy Cluster with Margin of Confidence Sampling: find the cluster with the most confusion and then sample for the maximum pairwise label confusion within that cluster.
    10. Combining Ensemble Methods and Dropouts with individual strategies: aggregate results that come from multiple models or multiple predictions from one model via Monte-Carlo Dropouts aka Bayesian Deep Learning.

![alt_text](images/image7.png)

  1. Active transfer learning.

![alt_text](images/image8.png)

Machine in the loop
  1. Similar to AL, except that a machine / model / algo adds label suggestions. This is obviously a trade-off between introducing bias and getting a cleaner dataset.

ONLINE LEARNING

  1. If you want to start with OL - start here & here
  2. Shai Shalev-Shwartz - A thesis about online learning
  3. Some answers about what is OL, the first one actually talks about S.Shalev’s other paper.
  4. Online learning - Andrew Ng - coursera
  5. Chip Huyen on online prediction & learning

ONLINE DEEP LEARNING (ODL)

  1. Hedge back propagation (HBP), Autonomous DL, Qactor - online AL for noisy labeled stream data.

N-SHOT LEARNING

  1. Zero shot, one shot, few shot (siamese is one shot)

ZERO SHOT LEARNING

  1. Instead of using class labels, we use some kind of vector representation for the classes, taken from a co-occurrence matrix after SVD or from word2vec - quite clever. This enables us to figure out if a new unseen class is near one of the known supervised classes. KNN or some other distance-based classifier can be used. Can we use word2vec for similarity measurements of new classes?

![alt_text](images/image9.png)

  1. For classification, we can use nearest neighbour or manifold-based labeling propagation.

![alt_text](images/image10.png)

  1. Multiple category vectors? Multilabel zero-shot also in the video
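The nearest-neighbour idea from the notes above can be sketched as follows. Every class is represented by a vector rather than a label id, and an input is assigned to whichever class vector is closest, including classes that had no training examples. The three-dimensional vectors, the class names, and the `predicted_embedding` (standing in for the output of a model that maps inputs into the class-vector space) are all made up for this illustration; in practice the class vectors would come from something like word2vec.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_class(embedding, class_vectors):
    # 1-nearest-neighbour over class vectors, using cosine similarity.
    return max(class_vectors, key=lambda name: cosine(embedding, class_vectors[name]))


# Hypothetical class vectors. "zebra" has no labeled training examples,
# but it still has a vector, so it can be predicted.
class_vectors = {
    "horse": [0.9, 0.1, 0.0],
    "tiger": [0.1, 0.9, 0.3],
    "zebra": [0.7, 0.2, 0.6],
}

# Hypothetical output of a model trained only on "horse" and "tiger" images.
predicted_embedding = [0.75, 0.15, 0.55]

predicted = predict_class(predicted_embedding, class_vectors)
print(predicted)
```

The classifier itself never changes when a new class appears; only a new entry in `class_vectors` is needed.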

GPT3 is ZERO, ONE, FEW

  1. TBC

DATA SCIENCE

LIFE CYCLE

![alt_text](images/image11.png)

Google’s famous MLops

![alt_text](images/image12.png)

![alt_text](images/image13.png)

Fast ai project checklist

When I used to do consulting, I’d always seek to understand an organization’s context for developing data projects, based on these considerations:

  • Strategy: What is the organization trying to do (objective) and what can it change to do it better (levers)?
  • Data: Is the organization capturing necessary data and making it available?
  • Analytics: What kinds of insights would be useful to the organization?
  • Implementation: What organizational capabilities does it have?
  • Maintenance: What systems are in place to track changes in the operational environment?
  • Constraints: What constraints need to be considered in each of the above areas?

WORKFLOWS

  1. kaggle

PLATFORMS

  1. Uber, google, netflix, airbnb, etc

STACK

  1. Medium on canonical stack

Data Science KPI

  1. Comet ml on medium

![alt_text](images/image14.png)

  1. For the Data Driven manager (not ds)
  2. Measuring DS business value
  3. Best KPIS for DS - the best is what not to do

Being a DS / Researcher

  1. A day in a life
  2. Review of deep learning papers and co authorship
  3. Full stack DS Uri Weiss

![alt_text](images/image15.png)

Team Building / Group Cohesion

![alt_text](images/image16.png)

References:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Why data science needs generalists not specialists

  1. (good advice) Building a DS function (team)

Agile for data-science-research

  1. How to manage a data science research team using agile methodology, not scrum and not kanban
  2. Workflow for data science research projects
  3. Tips for data science research management
  4. IMO a really bad implementation of agile for data-science-projects

SOTA AND CURRENT TRENDS SUMMARIES

  1. ICLR 2019
  2. Medium
  3. State of ai, a yearly report

YOUTUBE COURSES

Deep learning Course

[Kadenze - deep learning tensor flow](https://www.kadenze.com/courses/creative-applications-of-deep-learning-with-tensorflow-iv/sessions/introduction-to-tensorflow) - Histograms for (Image distribution - mean distribution) / std dev, are looking quite good.

Machine Learning Course

  1. Recommended: Udacity includes ML and DL
  2. Week1: Introduction Lesson 4: Supervised, unsupervised.
  3. Lesson 6: model regression, cost function
  4. Lesson 71: optimization objective, large margin classification
  5. PCA at coursera #1
  6. PCA at coursera #2
  7. PCA #3
  8. SVM at coursera #1 - simplified

Predictive Analytics Course

[Syllabus](https://www.coursera.org/learn/predictive-analytics)


[Week 2: Lesson 29: supervised learning ](https://www.coursera.org/learn/predictive-analytics/lecture/qzrx8/statistics-vs-machine-learning)


[Lesson 36: From rules to trees](https://www.coursera.org/learn/predictive-analytics/lecture/qTN05/from-rules-to-trees)


[Lesson 43: overfitting, then validation, then accuracy](https://www.coursera.org/learn/predictive-analytics/lecture/cnLwv/overfitting)


[Lesson 46: bootstrap, bagging, boosting, random forests.](https://www.coursera.org/learn/predictive-analytics/lecture/ZUJqG/bootstrap-revisited)


[Lesson 52: NN](https://www.coursera.org/learn/predictive-analytics/lecture/6uyga/nearest-neighbor)


[Lesson 55: Gradient Descent](https://www.coursera.org/learn/predictive-analytics/lecture/68oAE/optimization-by-gradient-descent)


[Lesson 59: Logistic regression, SVM, Regularization, Lasso, Ridge regression](https://www.coursera.org/learn/predictive-analytics/lecture/FecmG/intuition-for-logistic-regression)


[Lesson 64: gradient descent, stochastic, parallel, batch.](https://www.coursera.org/learn/predictive-analytics/lecture/eCynR/stochastic-and-batched-gradient-descent)


[Unsupervised: Lesson X K-means, DBscan](https://www.coursera.org/learn/predictive-analytics/lecture/WWiiy/introduction-to-unsupervised-learning)

BOOKS & NOTEBOOKS

  1. Machine learning design patterns, git notebooks!, medium
    1. DP1 - transform: Moving an ML model to production is much easier if you keep inputs, features, and transforms separate
    2. DP2 - checkpoints: Saving the intermediate weights of your model during training provides resilience, generalization, and tunability
    3. DP3 - virtual epochs: Base machine learning model training and evaluation on total number of examples, not on epochs or steps
    4. DP4 - keyed predictions: Export your model so that it passes through client keys
    5. DP5 - repeatable sampling: Use the hash of a well distributed column to split your data into training, validation, and testing
  2. Gensim notebooks - from w2v, doc2vec to nmf, lda, pca, sklearn api, cosine, topic modeling, tsne, etc.
  3. Deep learning with python - francois chollet, deep learning & vision git notebooks!, official notebooks.
  4. Yandex school, nlp notebooks
  5. Machine learning engineering book (i.e., data science)
  6. Interpretable Machine Learning book
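The repeatable sampling pattern (DP5 in the list above) can be sketched with the standard library alone. This is an illustration under assumptions, not the book's code: the function name `assign_split`, the 80/10/10 proportions, the use of MD5, and the `user_...` keys are all choices made for this example.

```python
import hashlib


def assign_split(key, train_pct=80, valid_pct=10):
    """Map a column value to 'train', 'valid' or 'test' deterministically."""
    # Hash the key so that the split depends only on its value,
    # not on row order or on a random seed.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-uniform bucket in 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"


# Hypothetical well-distributed column, e.g. a user id.
keys = [f"user_{i}" for i in range(1000)]
splits = [assign_split(k) for k in keys]

train_fraction = splits.count("train") / len(splits)
print(round(train_fraction, 2))
```

Because the assignment is a pure function of the key, re-running the split on a later, larger snapshot of the data keeps every existing row in the same partition.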

COST

  1. GPT2/3

Patents

  1. Method Patent Exceptionalism

DATA SCIENCE TOOLS

Python

How to use better OOP in python.

Best practices programming python classes - a great lecture.

How to find the size of installed pip packages - good for deciding what to remove

Python type checking tutorial

Concurrency vs Parallelism (great)

Coroutines vs futures

  1. Coroutines, generators, async/await
  2. Intro to concurrent.futures
  3. Future task event loop

Async io

  1. Intro
  2. complete

Clean code:

Virtual Environments

  1. Just use venv
  2. Summary on all the *envs
  3. A really good primer on virtual environments
  4. Introduction to venv complementary to the above
  5. Pipenv
  6. A great intro to pipenv
  7. A complementary to pipenv above
  8. Comparison between all *env

PYENV

Installing pyenv

Intro to pyenv

Pyenv tutorial and finding where it is

Pyenv override system python on mac

JUPYTER

(how does reshape work?) - a shape of (2,4,6) is like a tree of 2->4 and each one has more leaves 4->6.

As far as I can tell, reshape effectively flattens the tree and divides it again into a new tree, but the total number of elements needs to stay the same, for example 2×4×6 = 4×2×3×2.

code:

    import numpy

    rng = numpy.random.RandomState(234)
    a = rng.randn(2, 3, 10)  # 2*3*10 = 60 elements
    print(a.shape)
    print(a)

    # -1 lets numpy infer the last dimension: 60 / (3*5) = 4
    b = numpy.reshape(a, (3, 5, -1))
    print(b.shape)
    print(b)

A tutorial for Google Colaboratory - free Tesla K80 with a Jupyter notebook

Jupyter on Amazon AWS

How to add extensions to jupyter: extensions

Connecting from COLAB to MS AZURE

Streamlit vs. Dash vs. Shiny vs. Voila vs. Flask vs. Jupyter

SCIPY

  1. Optimization problems, a nice tutorial to finding the minima
  2. Minima / maxima finding it in a 1d numpy array

NUMPY

Using numpy efficiently - explaining why vectors work faster.


Fast vector calculation, a benchmark between list, map, vectorize. Vectorize wins. The idea is to use vectorize and a function that does something that may involve if conditions on a vector, and do it as fast as possible.
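A minimal sketch of the idea above: wrap a scalar function that contains an `if` condition with `numpy.vectorize` and apply it to a whole array. The function `clip_or_square` and the sample values are made up for this illustration. Note that the NumPy documentation describes `vectorize` as a convenience that is implemented essentially as a loop, so the sketch also shows `numpy.where`, which expresses the same condition with native array operations; which one is faster on your data is worth benchmarking rather than assuming.

```python
import numpy as np


def clip_or_square(x):
    # Scalar function with a branch: negatives become 0, the rest are squared.
    if x < 0:
        return 0.0
    return x * x


values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Option 1: numpy.vectorize applies the scalar function element by element.
vectorized = np.vectorize(clip_or_square, otypes=[float])
with_vectorize = vectorized(values)

# Option 2: the same condition written with array operations.
with_where = np.where(values < 0, 0.0, values * values)

print(with_vectorize)
print(with_where)
```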

PANDAS

  1. Great introductory tutorial about using pandas, loading, loading from zip, seeing the table’s features, accessing rows & columns, boolean operations, calculating on a whole row\column with a simple function and on two columns even, dealing with time\date parsing.
  2. Visualizing pandas pivoting and reshaping functions by Jay Alammar - pivot melt stack unstack
  3. How to beautify pandas dataframe using html display
  4. Speeding up pandas
  5. Pandas summary
  6. Pandas html profiling
  7. (good) Pandas time series manipulation
  8. The fastest way to select rows by columns, by using masked values (benchmarked):

         def mask_with_values(df):
             mask = df['A'].values == 'foo'
             return df[mask]
  10. Parallelism, pools, threads, dask
  11. Accessing dataframe rows, columns and cells- by name, by index, by python methods.
  12. Looping through pandas
  13. How to inject headers into a headless CSV file -
  14. Dealing with time series in pandas,
    1. Create a new column based on a (boolean or not) column and calculation:
    2. Using python (map)
    3. Using numpy
    4. using a function (not as pretty)
  15. Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

          df['t'] = [x for x in range(10)]
          df['t-1'] = df['t'].shift(1)
          df['t+1'] = df['t'].shift(-1)
  16. Row and column sum in pandas and numpy
  17. Dataframe Validation In Python - A Practical Introduction - Yotam Perkal - PyCon Israel 2018
  18. In this talk, I will present the problem and give a practical overview (accompanied by Jupyter Notebook code examples) of three libraries that aim to address it: Voluptuous, which uses Schema definitions in order to validate data [https://github.com/alecthomas/voluptuous]; Engarde, a lightweight way to explicitly state your assumptions about the data and check that they're actually true [https://github.com/TomAugspurger/engarde]; and TDDA, Test Driven Data Analysis [https://github.com/tdda/tdda]. By the end of this talk, you will understand the importance of data validation and get a sense of how to integrate data validation principles as part of the ML pipeline.
  19. Stop using iterrows, use apply.
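The three ways of deriving a new column from a boolean column mentioned in the list above (plain Python via map, numpy, and a function) might look like the sketch below. The column names, the 50% discount rule, and the sample rows are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a boolean column and a numeric column.
df = pd.DataFrame(
    {"is_member": [True, False, True], "price": [100.0, 80.0, 50.0]}
)

# 1. Using Python (map): translate each boolean into a label.
df["label"] = df["is_member"].map({True: "member", False: "guest"})

# 2. Using numpy: a vectorized if/else over whole columns.
df["final_np"] = np.where(df["is_member"], df["price"] * 0.5, df["price"])


# 3. Using a function (not as pretty): applied row by row.
def discounted(row):
    return row["price"] * 0.5 if row["is_member"] else row["price"]


df["final_apply"] = df.apply(discounted, axis=1)

print(df)
```

The numpy version works on whole columns at once, while `apply` with `axis=1` calls the Python function once per row, which is usually the slower of the two on large frames.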

SCI-KIT LEARN

  1. Pipeline to json 1, 2
  2. cuML - Multi gpu, multi node-gpu alternative for SKLEARN algorithms
  3. Gpu TSNE ^
  4. Awesome code examples about using svm\knn\naive\log regression in sklearn in python, i.e., “fitting a model onto the data”
  5. Parallelism of numpy, pandas and sklearn using dask and clusters. Webpage, docs, example in jupyter.

Also Insanely fast, see here.

  1. Functional api for sk learn, using pipelines. thank you sk-lego.

![alt_text](images/image17.png)

![alt_text](images/image18.png)

![alt_text](images/image19.png)

FAST.AI

  1. Medium on all fast.ai courses, 14 posts

PYCARET

  1. What is PyCaret? by vidhaya - PyCaret is an open-source machine learning library in Python that helps you from data preparation to model deployment. It is easy to use, and you can do almost every data science project task with just one line of code.

NVIDIA TF CUDA CUDNN

GCP

Resize google disk size, 1, 2,

SQL

  1. Introduction, index, keys, joins, aliases etc., newer
  2. Sql cheat sheet
  3. Primary key
  4. Foreign key - a key constraint whose values must be among the allowed values of the primary key it references
  5. Index, i.e., book index for fast reading
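To make the terms above concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `authors` and `books` tables and their contents are invented for this example. It shows a primary key, a foreign key that rejects a value missing from the referenced table, an index, and a join with aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Primary key: uniquely identifies each row of a table.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Foreign key: books.author_id may only hold values that exist in authors.id.
conn.execute(
    """CREATE TABLE books (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           author_id INTEGER NOT NULL REFERENCES authors(id)
       )"""
)

# Index: like a book index, it speeds up lookups on author_id.
conn.execute("CREATE INDEX idx_books_author ON books(author_id)")

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO books (title, author_id) VALUES ('Notes', 1)")

# Author 99 does not exist, so the foreign key constraint rejects this row.
try:
    conn.execute("INSERT INTO books (title, author_id) VALUES ('Orphan', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Join with aliases (b, a): match each book with its author.
rows = conn.execute(
    "SELECT b.title, a.name FROM books AS b JOIN authors AS a ON a.id = b.author_id"
).fetchall()
print(fk_rejected, rows)
```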

GIT / Bitbucket

  1. Installing git LFS
  2. Use git lfs
  3. Download git-lfs
  4. Git wip (great)

![alt_text](images/image20.png)

DATA SCIENCE MANAGEMENT

INTERVIEW Qs

  1. 40 questions on ensembles
  2. 30 on trees
  3. 30 on knns

EXPERIMENT MANAGEMENT

  1. All the alternatives

  2. Cnvrg.io -

    1. Manage - Easily navigate machine learning with dashboards, reproducible data science, dataset organization, experiment tracking and visualization, a model repository and more
    2. Build - Run and track experiments in hyperspeed with the freedom to use any compute environment, framework, programming language or tool - no configuration required
    3. Automate - Build more models and automate your machine learning from research to production using reusable components and drag-n-drop interface
  3. Comet.ml - Comet lets you track code, experiments, and results on ML projects. It’s fast, simple, and free for open source projects.

  4. Floyd - notebooks on the cloud, similar to colab / kaggle, etc. gpu costs 4$/h

  5. Trains - open source

  6. Missing link - RIP

  7. Spark
    1. RDDs vs datasets vs dataframes
    2. What are RDDs?
    3. Keras, TF, Spark
    4. Repartition vs coalesce
    5. Best practices

  8. Databricks
    1. Koalas - pandas API on Apache Spark
    2. Intro to DB on spark, has some basic sklearn-like tools and other custom operations, such as a single-vector-based aggregator for using features as an input to a model
    3. Pyspark.ml
    4. Keras as a single node (no spark)
    5. Horovod for distributed keras (and more)
    6. Documentation (read me, has all libraries)
    7. Medium tutorial, explains the 3 pros of DB with examples of using it with native and non-native algos
        1. Spark sql
        2. Mlflow
        3. Streaming
        4. SystemML DML using keras models.
    8. SystemML notebooks (didn't read)
    9. Sklearn notebook example
    10. Utilizing spark nodes for grid searching with sklearn
        1. from spark_sklearn import GridSearchCV
    11. How can we leverage our existing experience with modeling libraries like scikit-learn? We'll explore three approaches that make use of existing libraries, but still benefit from the parallelism provided by Spark.

     These approaches are:
    

HIRING / RECRUITING

  1. Data engineer skills on medium
    1. Coding (Typically Python)
    2. SQL
    3. Database design
    4. Data architecture/big data technologies
    5. Soft skills

WRITING DOCS

  1. Design docs at google

LEGAL & CONTRACTS

  1. (FAST) Advisory board saas agreement

CALCULUS

Start here!

PROBABILITY AND STATISTICS

Coursera course on probabilities - for data science, actually quite good at explaining a lot of the basic tools: probability, conditional probability, distributions, sampling, CI, hypothesis testing, etc.

  • **I.e., probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.**
  • The **problems** considered by probability and statistics are inverse to each other.
  • In probability theory we _consider some underlying process which has some randomness or uncertainty modeled by random variables, and we figure out what happens._

=> Underlying process + randomness and random variables -> what happens next?

  • In statistics we observe something that has happened, and try to figure out what underlying process would explain those observations.

=> observe what happened -> what is the underlying process?

  • **Finally, probability** theory is mainly concerned with the **deductive** part, **statistics** with the **inductive** part of modeling processes with uncertainty.
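
The two directions can be illustrated with a small simulation. This is a sketch under made-up assumptions: a coin with `p_heads = 0.3` and 10,000 flips are arbitrary choices.

```python
import random

random.seed(0)

# Probability (deductive): the process is known (a coin with P(heads) = 0.3),
# so we can say what to expect before observing anything.
p_heads = 0.3
n_flips = 10_000
expected_heads = p_heads * n_flips  # about 3000 heads expected

# Statistics (inductive): we only see the outcomes and infer the process.
flips = [random.random() < p_heads for _ in range(n_flips)]
estimated_p = sum(flips) / n_flips  # estimate of the unknown P(heads)
```

With enough observations the estimate lands close to the value that generated the data, which is the "what is the underlying process?" question answered from the observations alone.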

Introduction to statistics

  1. Table of content
  2. Median
  3. Mode - the most frequent value
  4. Weighted mean
  5. Geometric mean
  6. Harmonic mean
  7. Percentiles
  8. Mean deviation
  9. Correlation
  10. Standard deviation, formula
  11. Standard normal distribution
  12. Skewness of distribution
  13. Confidence intervals (using std)
  14. Accuracy vs precision (accuracy: how close measurements are to the true value; precision: how close repeated measurements are to each other)
  15. Probability
  16. Probability complement
  17. Chi-square test, p_value, independent, dependent, significance
  18. Variation vs variance - variance is a special case of variation
  19. Std vs variance - std is the square root of the variance, so it is in the same units as the mean. Squaring the deviations in the formula keeps positive and negative deviations from cancelling each other out, and lets outliers have more influence.
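
Several of the measures in this list are available in Python's `statistics` module. The sketch below uses a small made-up dataset; the weighted mean is computed by hand.

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)                  # the most frequent value
geo_mean = st.geometric_mean(data)
harm_mean = st.harmonic_mean(data)

variance = st.pvariance(data)         # mean of the squared deviations
std = st.pstdev(data)                 # square root of the variance, same units as the data

# Weighted mean: each value counts in proportion to its weight.
values, weights = [80, 90], [1, 3]
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Without squaring, deviations from the mean cancel each other out.
raw_deviation_sum = sum(x - mean for x in data)
```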

Introduction to Probability

  1. Types of events
  2. Independent events
  3. Conditional proba
  4. Proba tree diagrams
  5. Mutually exclusive events
  6. Combination and permutations
  7. Bayes
  8. Least squares regression - it works by making the total of the squares of the errors as small as possible (that is why it is called "least squares")
  9. Random variables
  10. Continuous random variables
  11. Random vars mean, std, variance
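
The least squares item above can be made concrete for a single predictor, where the minimizing slope and intercept have a closed form. A minimal sketch with made-up data points; `fit_line` and `sum_squared_errors` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Total of the squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, slope, intercept)
```

Nudging the slope or intercept away from the fitted values can only increase `sum_squared_errors`, which is what "least squares" means.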

More on Statistics

  1. 25 concepts (part 2), 29 more concepts (part1) & part 3 in statistics.

Wiki

  1. Marginal probability
  2. Joint probability
  3. Conditional probability
  4. Chain rule - derivatives using the chain rule, on khan

Recommended Courses

  1. Another great course on probability, distribution types, conditional, joint, chain, etc.
  2. Khan Academy
  3. A really good intro to probability, conditional, joint, etc.

(another angle) The main difference between probability and statistics has to do with knowledge: what are the known facts?

  • Inherent in both probability and statistics is a population, consisting of every individual we are interested in studying, and a sample, consisting of the individuals that are selected from the population.
  • In probability: we start by knowing everything about the composition of a population, and then ask, “What is the likelihood that a selection, or sample, from the population has certain characteristics?”
  • In statistics: we have no knowledge about the composition of the population (e.g., the types of socks in a drawer); we infer properties of the population on the basis of a random sample.

Some calculations to get you into probability:

  • Finding out the probability of an event
  • Of two consecutive events (multiplication)
  • Of several events (sum)
  • Etc..
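
These calculations can be written out exactly with `fractions.Fraction`. The fair six-sided die is an assumed example; the multiplication rule below applies to independent events and the sum rule to mutually exclusive ones.

```python
from fractions import Fraction

# Probability of a single event: favourable outcomes / all outcomes.
p_six = Fraction(1, 6)                            # rolling a six with a fair die

# Two consecutive independent events: multiply.
p_two_sixes = p_six * p_six

# Several mutually exclusive events: sum.
p_five_or_six = Fraction(1, 6) + Fraction(1, 6)

# Complement: probability that the event does not happen.
p_not_six = 1 - p_six
```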

STATISTICAL SAMPLING AND RESAMPLING

  1. What is? Method for sampling/resampling, and sampling errors explained. (cross validation etc)

PROBABILITY

![alt_text](images/image21.png)

![alt_text](images/image22.png)

![alt_text](images/image23.png)

![alt_text](images/image24.png)

PDF (PROBABILITY DENSITY FUNCTION)

  1. Tutorial in scipy
  2. Array-based tutorial in python with PDF and KDE
  3. Summary of univariate distribution including pdf methods

Kernel Density Estimation

This tutorial explains why we should prefer KDE over a histogram: it covers the drawbacks of histograms and how KDE helps with an issue we usually encounter in ‘sparse’ histograms, where the shape of the distribution is hard to figure out.

How to use KDE? A tutorial about kernel density and how to use it in python. Has several good graphs and shows use cases.
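
To show what a kernel density estimate computes, here is a pure-Python sketch with a Gaussian kernel. The samples and the bandwidth of 0.5 are arbitrary assumptions; libraries such as scipy or sklearn choose or tune the bandwidth for you.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # One smooth bump per sample, summed and normalized.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


samples = [1.0, 1.2, 1.9, 2.1, 2.3, 5.0, 5.2]
density = gaussian_kde(samples, bandwidth=0.5)

# Evaluate on a grid from -4 to 10; the estimate is smooth and integrates to about 1.
step = 0.01
grid = [-4 + i * step for i in range(1401)]
area = sum(density(x) for x in grid) * step
```

Unlike a histogram, the result does not depend on where bin edges happen to fall, only on the bandwidth.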

Video tutorials about Kernel Density:

  1. KDE
  2. Non parametric Kernel Regression Estimation
  3. Non parametric Sieve Estimation
  4. Semi-nonparametric estimation

Udacity Video Tutorial - pretty good

  1. IMPORTANT: Comparison and benchmarks of various KDE algos
  2. Histograms and density plots
  3. SK LEARN
  4. Gaussian KDE in scipy, version 2

FEATURE TYPES

**Discrete**

  • **Numbers**

  • Categorical

  • **Categorical** data are variables that contain label values rather than numeric values.

    The number of possible values is often limited to a fixed set.

  • **Categorical** variables are often called nominal.

  • Labels, usually discrete values such as gender, country of origin, marital status, high-school graduate

Continuous (the opposite of discrete): real-number values, measured on a continuous scale: height, weight.

**In order to compute a regression, categorical predictors must be re-expressed as numeric: some form of indicator variables (0/1) with a separate indicator for each level of the factor.**
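
A minimal sketch of the indicator (0/1) re-expression, in plain Python. The helper name `one_hot` and the example column are made up for illustration.

```python
def one_hot(values):
    """Map a categorical column to 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


marital_status = ["single", "married", "single", "divorced"]
levels, encoded = one_hot(marital_status)
```

In a linear regression with an intercept, one of the indicator columns is usually dropped, since the full set is perfectly collinear with the intercept.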

Discrete variables with many values are often treated as continuous, i.e. zone numbers -> binary

Variable types: nominal (weather), ordinal (ordered values 1, 2, 3), interval (range).

FEATURES

CORRELATION VS COVARIANCE
  1. Correlation lies between -1 and 1, while covariance ranges from -inf to inf. The units of the variables affect the scale of the covariance, so correlation, which is normalized, is usually preferred.
    Correlation is a measure of association used for bivariate analysis: it measures how strongly two variables are related.
    Covariance is also a measure of association: it measures how two random variables vary together.
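
The effect of units can be checked directly. In this sketch the height and weight values are made up; the same heights are expressed once in metres and once in centimetres.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 68.0, 77.0, 90.0]
height_cm = [h * 100 for h in height_m]   # same data, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)      # 100 times larger
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)    # unchanged
```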

CORRELATION BETWEEN FEATURE TYPES
  1. Association vs correlation - correlation is one measure of association; association itself is a yes/no question (are the variables related?) that does not assume linearity
  2. A great article in medium, covering just about everything with great detail and explaining all the methods plus references.
  3. Heat maps for categorical vs target - groupby count per class, normalize by total count to see if you get more grouping in a certain combination of cat/target than others.
  4. Anova/log regression 2*, git, 3, for numeric/cont vs categorical - high F score from anova hints about association between a feature and a target, i.e., the importance of the feature to separating the target.
  5. Anova youtube 1, 2
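
For the numeric-vs-categorical case, the one-way ANOVA F statistic compares the variation between class means with the variation inside each class. A pure-Python sketch with made-up feature values grouped by target class; `anova_f` is an illustrative helper, not a library function.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Feature values split by target class (two classes each).
informative = [[1.0, 1.2, 0.9, 1.1], [3.0, 3.1, 2.9, 3.2]]
uninformative = [[1.0, 3.0, 2.0, 2.9], [1.1, 3.1, 2.1, 3.0]]

f_informative = anova_f(informative)
f_uninformative = anova_f(uninformative)
```

A feature whose values differ clearly between the classes gets a high F score, while one that looks the same in every class scores close to zero.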

![alt_text](images/image25.png)

  1. Cat vs cat, many metrics - on medium

CORRELATION VISUALIZATION

Feature space

![alt_text](images/image26.png)

![alt_text](images/image27.png)

PREDICTIVE POWER SCORE (PPS)

PPS is an asymmetric, data-type-agnostic score for predictive relationships between two columns that ranges from 0 to 1. github

![alt_text](images/image28.png)

Too many scenarios where the correlation is 0. This makes me wonder if I missed something… (Excerpt from the image by Denis Boigelot)

Regression

In case of a regression, the ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible score of the MAE is 0 and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0):

PPS = 1 - (MAE_model / MAE_naive)

Classification

If the task is a classification, we compute the weighted F1 score (wF1) as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The weighted F1 takes into account the precision and recall of all classes weighted by their support as described here. As a baseline score (F1_naive), we calculate the weighted F1 score for a model that always predicts the most common class of the target column (F1_most_common) and a model that predicts random values (F1_random). F1_naive is set to the maximum of F1_most_common and F1_random. The PPS is the result of the following normalization (and never smaller than 0):

PPS = (F1_model - F1_naive) / (1 - F1_naive)
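
The two normalizations can be sketched in plain Python. This is not the ppscore implementation: here the model predictions and the F1 values are simply passed in (in the library they come from a fitted model), and the function names are made up.

```python
import statistics


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pps_regression(y_true, y_pred):
    """PPS = 1 - MAE_model / MAE_naive, never smaller than 0."""
    median = statistics.median(y_true)
    # Naive baseline: always predict the median of the target.
    # Assumes the target is not constant, otherwise mae_naive would be 0.
    mae_naive = mean_absolute_error(y_true, [median] * len(y_true))
    mae_model = mean_absolute_error(y_true, y_pred)
    return max(0.0, 1 - mae_model / mae_naive)


def pps_classification(f1_model, f1_most_common, f1_random):
    """PPS = (F1_model - F1_naive) / (1 - F1_naive), never smaller than 0."""
    f1_naive = max(f1_most_common, f1_random)
    return max(0.0, (f1_model - f1_naive) / (1 - f1_naive))


y_true = [10, 12, 15, 20, 30]
pps_good = pps_regression(y_true, [11, 12, 14, 21, 29])   # predictions close to the target
pps_bad = pps_regression(y_true, [0, 0, 0, 0, 0])         # worse than the median baseline
pps_clf = pps_classification(f1_model=0.9, f1_most_common=0.5, f1_random=0.4)
```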

MUTUAL INFORMATION COEFFICIENT

Paper - we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function.

Computing MIC

(A) For each pair (x,y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. (B) The algorithm normalizes the mutual information scores and compiles a matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on this surface.

In this example, there are many grids that achieve the highest score. The star in (B) marks a sample grid achieving this score, and the star in (C) marks that grid's corresponding location on the surface.

Mutual information classifier - Estimate mutual information for a discrete target variable.

Mutual information (MI) [1] between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances as described in [2] and [3]. Both methods are based on the idea originally proposed in [4].

MI score - Mutual Information between two clusterings.

The Mutual Information is a measure of the similarity between two labels of the same data.

Adjusted MI score - Adjusted Mutual Information between two clusterings.

Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared.

This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

Normalized MI score - Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred), defined by the average_method.
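
For two labelings of the same samples, MI and NMI can be computed directly from counts. The sketch below is a pure-Python illustration, not the sklearn code; it assumes the arithmetic mean of the two entropies as the normalizer, which is one of the possible averaging choices.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """MI (in nats) between two labelings of the same samples."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mi(a, b):
    """MI divided by the arithmetic mean of the two entropies."""
    mean_entropy = (entropy(a) + entropy(b)) / 2
    return mutual_information(a, b) / mean_entropy if mean_entropy > 0 else 1.0


labels_true = [0, 0, 1, 1, 2, 2]
relabeled = ["b", "b", "c", "c", "a", "a"]   # same grouping, different names
unrelated = [0, 1, 0, 1, 0, 1]               # independent of labels_true

nmi_same = normalized_mi(labels_true, relabeled)
nmi_unrelated = normalized_mi(labels_true, unrelated)
```

The score only looks at how samples are grouped, so renaming the clusters leaves it at 1, and swapping the two arguments gives the same value.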

CRAMER’S COEFFICIENT

Calculating

FEATURE SELECTION

A series of good articles that explain about several techniques for feature selection

  1. How to parallelize feature selection on several CPUs, do it per label on each cpu and average the results.

  2. A great notebook about feature correlation and many types of visualization, what to drop and what to keep, using many feature reduction and selection methods (quite a lot, actually). It's a really good intro.

  3. Multi class classification, feature selection, model selection, co-feature analysis

  4. Text analysis for sentiment, a feature selection tutorial with chi2 (IG?), part 2 with bi-gram collocation in nltk

  5. What is collocation? - “the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance.”

  6. Sklearn feature selection methods (4) - youtube

  7. Univariate and independent features

  8. Linear models and regularization, doing feature ranking

  9. Random forests and feature ranking

  10. Random Search for focus and only then grid search for Random Forest, code

  11. Stability selection and recursive feature elimination (RFE) are wrapper methods in sklearn for the purpose of feature selection. RFE in sklearn

  12. Kernel feature selection via conditional covariance minimization (netanel d.)

  13. Github class that does the following:

    1. Features with a high percentage of missing values
    2. Collinear (highly correlated) features
    3. Features with zero importance in a tree-based model
    4. Features with low importance
    5. Features with a single unique value
  14. Machinelearning mastery on FS:
    1. Univariate Selection.
    2. Recursive Feature Elimination.
    3. Principal Component Analysis.
    4. Feature Importance.

  15. Sklearn tutorial on FS:
    1. Low variance
    2. Univariate kbest
    3. RFE
    4. SelectFromModel using coef_ or feature_importances_
    5. Linear models with L1 (svm recommended L2)
    6. Tree based importance

  16. A complete overview of many methods
    1. (reduction) LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
    2. (selection) ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it is operated using one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
    3. (selection) Chi-Square: It is a statistical test applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
    4. Wrapper methods:
        1. Forward Selection: Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.
        2. Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removal of features.
        3. Recursive Feature elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.

  17. Relief - GIT git2 - a family of feature selection methods that try to optimize the distance of two samples from the selected one: one should be closer, the other farther.

    “The weight updation of attributes works on a simple idea (line 6). That if instance Rᵢ and H have different value (i.e the diff value is large), that means that attribute separates two instance with the same class which is not desirable, thus we reduce the attributes weight. On the other hand, if the instance Rᵢ and M have different value, that means the attribute separates the two instance with different class, which is desirable.”

  18. Scikit-feature (includes relief) forked from this (docs)

  19. Scikit-rebate (based on relief)
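
The Relief weight update quoted above can be sketched in a few lines. This is the basic two-class variant with numeric features assumed to be scaled to [0, 1] and Manhattan distance for the neighbour search; the toy data and the function name `relief` are made up, and the linked libraries implement more robust variants (e.g. ReliefF).

```python
import random


def relief(X, y, n_iterations, seed=0):
    """Basic Relief for two classes and numeric features scaled to [0, 1]."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(n_iterations):
        i = rng.randrange(len(X))  # randomly selected instance R
        same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        other = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(same, key=lambda j: distance(X[i], X[j]))    # nearest hit H
        miss = min(other, key=lambda j: distance(X[i], X[j]))  # nearest miss M
        for a in range(n_features):
            # Differing from the hit is penalized, differing from the miss is rewarded.
            weights[a] -= abs(X[i][a] - X[hit][a]) / n_iterations
            weights[a] += abs(X[i][a] - X[miss][a]) / n_iterations
    return weights


# Feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 0.9], [0.2, 0.1], [0.15, 0.5], [0.9, 0.8], [0.8, 0.2], [0.85, 0.45]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y, n_iterations=50)
```

On this toy data the separating feature ends up with a positive weight and the noise feature with a negative one.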

Feature selection using entropy, information gain, mutual information and … in sklearn.

Entropy, mutual information and KL Divergence by AurelienGeron

FEATURE ENGINEERING

  1. Vidhya on FE, anomalies, engineering, imputing
  2. Many types of FE, including log and box cox transform - a very useful explanation.
  3. Categorical Data
  4. Dummy variables and feature hashing - hashing is really cool.
  5. Text data - unigrams, bag of words, N-grams (2,3,..), tfidf matrix, cosine_similarity(tfidf) on top of a tfidf matrix, unsupervised hierarchical clustering with similarity measures on top of (cosine_similarity), LDA for topic modelling in sklearn - pretty awesome, Kmeans(lda).
  6. Deep learning data for FE - Word embedding using keras, continuous BOW - CBOW, SKIPGRAM, word2vec - really good.
  7. Topic Modelling - a fantastic slide show about topic modelling using LDA etc.
  8. Dipanjan on feature engineering 1 - cont numeric 2 - categorical 3 - traditional methods
  9. Target encoding git
  10. Category encoding git

REPRESENTATION LEARNING

  1. paper

TFIDF

  1. Max_features in tf idf - Sometimes it is not effective to transform the whole vocabulary, as the data may have some exceptionally rare words, which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to inputs in the future. One appropriate technique in this case would be to print out word frequencies across documents and then set a certain threshold for them. Imagine you have set a threshold of 50, and your corpus vocabulary consists of 100 words. After looking at the word frequencies, you find that 20 words occur less than 50 times. Thus, you set max_features=80 and you are good to go. If max_features is set to None, then the whole vocabulary is considered during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built from the 5 most frequent words across the text documents.
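
The effect of max_features on the vocabulary can be mimicked in plain Python. This is a stand-in for illustration only, not the sklearn implementation: `build_vocabulary` is a made-up helper that splits on whitespace and keeps the most frequent terms across the corpus.

```python
from collections import Counter


def build_vocabulary(documents, max_features=None):
    """Keep the max_features most frequent terms across the corpus (None keeps all)."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return sorted(kept)


docs = [
    "the cat sat on the mat",
    "the cat sat",
    "the cat",
]
full_vocab = build_vocabulary(docs)                    # every word becomes a dimension
small_vocab = build_vocabulary(docs, max_features=3)   # only the 3 most frequent words
```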

SIMILARITY

  1. Cosine tutorial
  2. Edit distance similarity
  3. Diff lib similarity and soundex
  4. Soft cosine and cosine

FEATURE IMPORTANCE

Note: point 3, about LIME, is used for explainability; please also check that topic further below.

  1. Using RF and other methods, really good

  2. Non parametric feature impact and importance - while there are nonparametric feature selection algorithms, they typically provide feature rankings, rather than measures of impact or importance. In this paper, we give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data.

  3. Paper (pdf, blog post): (GITHUB) how to "explain the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction."

    They want to understand the reasons behind the predictions. It’s a new field that says that many 'feature importance' measures shouldn’t be used. E.g., in a linear regression model a feature can have an importance rank of 50 (for example); in a comparative model where you duplicate that feature 50 times, each copy will have 1/50 of the importance and won’t be selected for the top K, but it will still be one of the most important features. So new methods need to be developed to understand feature importance. This one has git code as well.

    Several github notebook examples: binary case, multi class, cont and cat features, there are many more for images in the github link.

    “Intuitively, an explanation is a local linear approximation of the model's behaviour. While the model may be very complex globally, it is easier to approximate it around the vicinity of a particular instance. While treating the model as a black box, we perturb the instance we want to explain and learn a sparse linear model around it, as an explanation. The figure below illustrates the intuition for this procedure. The model's decision function is represented by the blue/pink background, and is clearly nonlinear. The bright red cross is the instance being explained (let's call it X). We sample instances around X, and weight them according to their proximity to X (weight here is indicated by size). We then learn a linear model (dashed line) that approximates the model well in the vicinity of X, but not necessarily globally. For more information, read our paper, or take a look at this blog post.”

![alt_text](images/image29.png)
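
The "local linear approximation" idea from the quote can be sketched for a model with a single input feature. This is not the LIME library: the black-box model, the perturbation scale and the kernel width are all made-up assumptions, and the real method fits a sparse linear model over many features.

```python
import math
import random


def explain_locally(model, x0, n_samples=500, scale=0.5, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted line to a black-box model around x0 (one feature)."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, scale) for _ in range(n_samples)]           # perturb the instance
    ys = [model(x) for x in xs]                                          # query the black box
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]    # closer samples weigh more

    # Weighted least squares for a line, in closed form.
    w_sum = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def black_box(x):
    """A globally nonlinear 'model' used only for this illustration."""
    return x ** 2


slope_at_1, _ = explain_locally(black_box, x0=1.0)
slope_at_minus_2, _ = explain_locally(black_box, x0=-2.0)
```

The same model gets a different linear explanation at each instance (a slope near 2 around x = 1 and near -4 around x = -2), which is the sense in which the explanation is local rather than global.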

FEATURE IMPUTING

  1. Vidhya on FE, anomalies, engineering, imputing
  2. Fancy impute

FEATURE STORE

  1. The importance of having one - medium
  2. Feast

![alt_text](images/image30.png)

  3. Tecton.ai

![alt_text](images/image31.png)

CALIBRATION

Classic Model Calibration

  1. How do we do isotonic and sigmoid calibration - read this, then this, how to use in sklearn
  2. How to speed up isotonic regression for sklearn
  3. TODO: how to calibrate a DNN (except sklearn wrapper for keras)
  4. Allows us to use the probability as confidence, i.e., well-calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level
  5. (good) Probability Calibration Essentials (with code)
  6. The Brier score is a proper score function that measures the accuracy of probabilistic predictions.
  7. Sk learn example
  8. CalibratedClassifierCV in sklearn - the method to use for calibration can be ‘sigmoid’, which corresponds to Platt’s method, or ‘isotonic’, which is a non-parametric approach. It is not advised to use isotonic calibration with too few calibration samples (<<1000) since it tends to overfit. Use sigmoid (Platt’s calibration) in this case.
    Not all classifiers provide well-calibrated probabilities, some being over-confident while others are under-confident. Thus, a separate calibration of predicted probabilities is often desirable as a postprocessing step. This example illustrates two different methods for this calibration and evaluates the quality of the returned probabilities using Brier’s score.
  9. Example 1 - binary class below, 2 - 3 class moving prob vectors to a well defined location, 3 - comparison of non calibrated models, only logreg is calibrated naturally
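
The Brier score mentioned in this list is simply the mean squared difference between the predicted probability and the 0/1 outcome. A minimal sketch with made-up labels and probabilities:

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


y_true = [1, 0, 1, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.9, 0.2]
overconfident_wrong = [0.9, 0.9, 0.1, 0.9, 0.9]

good = brier_score(y_true, confident_and_right)   # lower is better, 0 is perfect
bad = brier_score(y_true, overconfident_wrong)
```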

![alt_text](images/image32.png)

  1. Mastery on why we need calibration

  2. Why softmax is not good as an uncertainty measure for DNN

  3. If a model doesn't have probabilities use the decision function

    y_pred = clf.predict(X_test)
    if hasattr(clf, "predict_proba"):
        # probability of the positive class
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:
        # no probabilities: min-max scale the decision function into [0, 1]
        prob_pos = clf.decision_function(X_test)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    
    

NEURAL NET CALIBRATION

  1. Temperature in LSTM
  2. Paper: Calibration of modern NN
  3. Calibration post
  4. Change temperature in keras
  5. Calibration can also come in a different flavor: you want to make your algorithm certain. One trick is to use dropout layers when inferring/predicting/classifying, do it 100 times and average the results in some capacity (see this chapter on BNN).
  6. How Can We Know When Language Models Know? This paper is about calibration.
    “Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question “how can we know when language models know, with confidence, the answer to a particular query?” We examine this question from the point of view of calibration, the property of a probabilistic model’s predicted probabilities actually being well correlated with the probability of correctness. We first examine a state-of-the-art generative QA model, T5, and examine whether its probabilities are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs.”
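
Two of the ideas in this section, temperature and averaging repeated stochastic predictions, can be sketched without a deep learning framework. The logits and the temperature of 3.0 are made-up values; in temperature scaling the temperature is fitted on a validation set, and `average_predictions` only stands in for averaging forward passes made with dropout kept on.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / T; T > 1 softens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_predictions(prob_vectors):
    """Average several probability vectors, e.g. from repeated stochastic passes."""
    n = len(prob_vectors)
    return [sum(col) / n for col in zip(*prob_vectors)]


logits = [4.0, 1.0, 0.5]
p_default = softmax(logits)
p_soft = softmax(logits, temperature=3.0)   # same ranking, less extreme confidence

mc_mean = average_predictions([[0.8, 0.2], [0.6, 0.4]])
```

Dividing by a temperature does not change which class is predicted, only how confident the probabilities look.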

MULTI LABEL CLASSIFICATION

(what is?) Multilabel classification is a classification problem where multiple target labels can be assigned to each observation instead of only one like in multiclass classification.

Two different approaches exist for multilabel classification:

  • Problem transformation methods try to transform the multilabel classification into binary or multiclass classification problems.
  • Algorithm adaptation methods adapt multiclass algorithms so they can be applied directly to the problem.

I.e., the two approaches are:

  • Use a classifier that does multi label
  • Use any classifier with a wrapper that compares each two labels

great PDF that explains about multi label classification and especially metrics, part 2 here

An awesome Paper that explains all of these methods in detail, also available here!

PT1: for each sample select one label, remove all others.

PT2: remove every sample which has multi labels.

PT3: for every combo of labels create a single-label, i.e. A&B, A&C etc..

PT4: (most common) create L datasets, for each label learn a binary representation, i.e., is it there or not.

PT5: duplicate each sample with only one of its labels

PT6: read the paper

There are other approaches for doing it within algorithms; they rely on the ideas of PT3/4/5/6 implemented in the algorithms, or on other tricks.

They also introduce **label cardinality** and label density.
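
Two of the problem transformations (PT3 and PT4) and the two dataset measures can be sketched on a toy label set. The samples are made up; label cardinality is taken here as the average number of labels per sample and label density as the cardinality divided by the number of distinct labels.

```python
samples = [
    {"A", "B"},
    {"A"},
    {"B", "C"},
    {"A", "B"},
]
all_labels = sorted(set().union(*samples))

# PT4 (binary relevance): one 0/1 target per label, "is it there or not".
binary_targets = {
    label: [1 if label in s else 0 for s in samples] for label in all_labels
}

# PT3 (label powerset): every label combination becomes a single class.
powerset_targets = ["&".join(sorted(s)) for s in samples]

# Label cardinality: average number of labels per sample.
label_cardinality = sum(len(s) for s in samples) / len(samples)
# Label density: cardinality divided by the number of distinct labels.
label_density = label_cardinality / len(all_labels)
```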

Efficient net, part 2 - EfficientNet is based on a network derived from a neural architecture search; a novel compound scaling method is applied to iteratively build more complex networks, which achieve state-of-the-art accuracy on multiclass classification tasks. Compound scaling refers to increasing the network dimensions in all three scaling formats using a novel strategy.

Multi label confusion matrices with sklearn

DISTRIBUTION

TYPES

(What are?) probabilities in a distribution always add up to 1.

![alt_text](images/image33.png)

![alt_text](images/image34.png)

  1. Poisson and Poisson process

    Comparing distributions:

  2. Kolmogorov-Smirnov, not good for categoricals.

  3. Comparing two

  4. Khan academy

  5. Visually

  6. When they are not normal

  7. Using train / test trick

Gaussian \ Normal Distribution

“If you collect data and it is not normal, you need to collect more data.”

Beautiful graphs

The normal distribution is popular for two reasons:

  1. It is the most common distribution in nature (as distributions go)
  2. An enormous number of statistical relationships become clear and tractable if one assumes the normal.

Sure, nothing in real life exactly matches the Normal. But it is uncanny how many things come close.

This is partly due to the Central Limit Theorem, which says that if you average enough unrelated things, you eventually get the Normal.
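
A quick simulation makes the Central Limit Theorem tangible. The setup is an assumption chosen for illustration: averages of 30 uniform(0, 1) draws, repeated 5000 times. Single draws are flat, not bell-shaped, yet their averages behave like a Normal.

```python
import random
import statistics

random.seed(0)


def sample_mean(n):
    """Average of n independent uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n


means = [sample_mean(30) for _ in range(5000)]

center = statistics.mean(means)      # close to 0.5, the mean of a single draw
spread = statistics.pstdev(means)    # close to sqrt(1/12) / sqrt(30)

# For a Normal distribution, about 68% of the values fall within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
```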

  • The Normal distribution in statistics is a special world in which the math is straightforward and all the parts fit together in a way that is easy to understand and interpret.
  • It may not exactly match the real world, but it is close enough that this one simplifying assumption allows you to predict lots of things, and the predictions are often pretty reasonable.
  • Statistically convenient.
  • Represented by basic statistics
    • average
    • **variance** (or its square root, the standard deviation) - the average of the squared deviations from the average.

In a statistical test that assumes normality, you need the data to be normal to guarantee that your p-values are accurate with your given sample size.

If the data are not normal, your sample size may or may not be adequate, and it may be difficult for you to know which is true.

COMPARING DISTRIBUTIONS

  1. Categorical data can be transformed to a histogram, i.e., #class / total, and then the distance between two histograms (e.g., train and production) can be measured. Using earth mover's distance, a python git wrapper to C; it relies on linear programming, so it's slow.
  2. Earth movers.
  3. EMD paper
  4. Also check KL DIVERGENCE in the information theory section.
  5. Bengio et al, transfer objective for learning to disentangle casual mechanisms - We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes
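
A minimal sketch of the histogram comparison from item 1. It swaps in total variation distance instead of the earth mover's distance, because it needs no ground distance between categories and no linear programming; the label lists are made-up stand-ins for train and production data.

```python
from collections import Counter

def class_frequencies(labels):
    """Turn a list of category labels into {category: #class / total}."""
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two normalized histograms (0 = identical, 1 = disjoint)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical label samples standing in for "train" and "production".
train_labels = ["cat"] * 50 + ["dog"] * 30 + ["bird"] * 20
prod_labels = ["cat"] * 30 + ["dog"] * 30 + ["bird"] * 40

train_hist = class_frequencies(train_labels)
prod_hist = class_frequencies(prod_labels)
drift = total_variation_distance(train_hist, prod_hist)
print(f"total variation distance: {drift:.2f}")
```

A categorical value that appears in only one of the two samples is handled by treating its missing frequency as 0.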

DISTRIBUTION TRANSFORMATION

  1. Top 3 methods for handling skewed data: log, square root, and Box-Cox transformations.

BOX COX

Power transformations

(What is the Box-Cox Power Transformation?)

  • a procedure to **identify an appropriate exponent** (Lambda = λ) to use to transform data into a “normal shape.”
  • The Lambda value indicates the power to which all data should be raised.

[image: images/image35.png]

The Box-Cox transformation is a useful family of transformations.

  • Many statistical tests and intervals are based on the assumption of normality.
  • The assumption of normality often leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make the normality assumption.
  • **Unfortunately, many real data sets are in fact not approximately normal. **
  • However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution.
  • This increases the applicability and usefulness of statistical techniques based on the normality assumption.

**IMPORTANT!! After a transformation (c), we need to measure the normality of the resulting data (d).**

  • One measure is to compute the correlation coefficient of a normal probability plot => (d).
  • The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot
  • In other words: the more linear the probability plot, the better a normal distribution fits the data!
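
The probability-plot correlation described above can be sketched in plain Python. The plotting positions `(i - 0.5) / n` are one common convention and an assumption here, as are the two made-up samples.

```python
import math
from statistics import NormalDist

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def normal_ppcc(data):
    """Correlation between the sorted data and theoretical normal quantiles."""
    n = len(data)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return pearson_r(theoretical, sorted(data))

# Hypothetical data: rescaled normal quantiles versus a right-skewed version.
quantiles = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
normal_like = [10 + 3 * q for q in quantiles]
skewed = [math.exp(q) for q in quantiles]

r_normal = normal_ppcc(normal_like)
r_skewed = normal_ppcc(skewed)
print(f"normal-like: {r_normal:.3f}, skewed: {r_skewed:.3f}")
```

The closer the value is to 1, the more linear the probability plot; `scipy.stats.probplot` reports a similar correlation if SciPy is available.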

*NOTE: another useful link that explains it with figures, but i did not read it.

GUARANTEED NORMALITY?

  • NO!
  • This is because it actually does not really check for normality;
  • the method checks for the smallest standard deviation.
  • The assumption is that among all transformations with Lambda values between -5 and +5, transformed data has the highest likelihood – but not a guarantee – to be normally distributed when standard deviation is the smallest.
  • it is absolutely necessary to always check the transformed data for normality using a probability plot. (d)
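
A rough sketch of the Lambda search over -5 to +5. It assumes the textbook Box-Cox form (x^λ - 1)/λ with log(x) at λ = 0, and scores each Lambda with the usual profile log-likelihood (smallest variance after accounting for the change of scale); the source article may use a simpler power form. The grid step and the sample are arbitrary.

```python
import math

def boxcox(xs, lam):
    """Textbook Box-Cox transform; requires every x > 0."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox needs strictly positive data; shift it first")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]

def boxcox_loglik(xs, lam):
    """Profile log-likelihood of lambda under a normal model for the transformed data."""
    ys = boxcox(xs, lam)
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in xs)

def best_lambda(xs, grid=None):
    """Grid search over lambda in [-5, 5]; the 0.1 step is an arbitrary choice."""
    if grid is None:
        grid = [i / 10 for i in range(-50, 51)]
    return max(grid, key=lambda lam: boxcox_loglik(xs, lam))

# Hypothetical right-skewed sample: exponentials of evenly spaced values,
# so the log transform (lambda = 0) makes it symmetric again.
sample = [math.exp(z / 4) for z in range(-8, 9)]
lam = best_lambda(sample)
transformed = boxcox(sample, lam)
print("chosen lambda:", lam)
```

The chosen Lambda still has to be checked with a probability plot, as noted above; `scipy.stats.boxcox` does the same kind of fit with a proper optimizer.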

**+** Additionally, the Box-Cox power transformation only works if all the data is strictly positive (greater than 0).

+ This is achieved easily by adding a constant ‘c’ to all data such that it all becomes positive before it is transformed. The transformation equation is then:

COMMON TRANSFORMATION FORMULAS (based on the actual formula)

[image: images/image36.png]

Finally: An awesome tutorial in python with code examples, there is also another code example here

[image: images/image37.png]

  • Maybe there is a slight problem in the python vs R code, details here, but needs investigating.

MANN-WHITNEY U TEST

(what is?) - the Mann–Whitney U test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

In other words: This test can be used to determine whether two independent samples were selected from populations having the same distribution.

Unlike the t-test it does not require the assumption of normal distributions. It is nearly as efficient as the t-test on normal distributions.
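
A small sketch of the U statistic itself, by direct pair counting. It stops at the statistic and does not compute a p-value (for that, `scipy.stats.mannwhitneyu` is the usual route); the two samples are made up.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counted as 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical samples.
group_a = [1.1, 2.3, 2.9, 3.4]
group_b = [3.0, 4.2, 5.1, 6.0, 6.3]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)

# Estimated probability that a random value from A exceeds a random value from B.
# Under the null hypothesis this should be close to 0.5.
prob_a_greater = u_a / (len(group_a) * len(group_b))
print(u_a, u_b, prob_a_greater)
```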

NULL HYPOTHESIS

  1. What is chi-square and what is a null hypothesis, and how do we calculate observed vs expected and check if we can reject the null and get significant difference.
  2. Analytics vidhya
    1. What is hypothesis testing
    2. Intro to t-tests analyticsvidhya - always good
    3. Anova analysis of variance, one way, two way, manova
      1. ANOVA tests whether the means of two or more groups are significantly different from each other. It checks the impact of one or more factors by comparing the means of different samples.
      2. A one-way ANOVA tells us that at least two groups are different from each other. But it won’t tell us which groups are different.
      3. For such cases, when the outcome or dependent variable (in our case the test scores) is affected by two independent variables/factors we use a slightly modified technique called two-way ANOVA.
  3. With more than one dependent variable we are in the multivariate case, and the technique used to solve it is known as MANOVA.
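
For item 1 (observed vs expected), a minimal sketch of the chi-square statistic on a contingency table. The 2x2 table is hypothetical; 3.841 is the standard critical value for alpha = 0.05 with 1 degree of freedom.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent (the null hypothesis).
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table: rows = groups, columns = outcome yes / no.
table = [[10, 20],
         [20, 10]]

chi2 = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)
CRITICAL_05_DF1 = 3.841  # chi-square critical value, alpha = 0.05, 1 degree of freedom
reject_null = chi2 > CRITICAL_05_DF1
print(f"chi2 = {chi2:.3f}, dof = {dof}, reject null: {reject_null}")
```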

INFORMATION THEORY

ENTROPY / INFORMATION GAIN

Great tutorial on all of these topics***

Entropy - lack of order or lack of predictability (excellent slide lecture by Aurelien Geron)

[image: images/image38.png]

[image: images/image39.png]

Cross entropy will be equal to entropy if the probability distributions of p (true) and q (predicted) are the same. Otherwise cross entropy is bigger, and the difference is known as relative entropy or Kullback-Leibler divergence.
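
A short sketch of that relationship in bits (log base 2). The two distributions are made up, and q is assumed to be non-zero wherever p is non-zero.

```python
import math

def entropy(p):
    """H(p) = -sum p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log2(q_i); assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Relative entropy: the extra bits paid for using q instead of p."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # "true" distribution (hypothetical)
q = [0.25, 0.25, 0.5]   # "predicted" distribution (hypothetical)

print("H(p)     =", entropy(p))
print("H(p, q)  =", cross_entropy(p, q))
print("KL(p||q) =", kl_divergence(p, q))
print("H(p, p)  =", cross_entropy(p, p))
```

With a one-hot p the cross entropy reduces to -log2 of the probability that q puts on the true class.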

[image: images/image40.png]

[image: images/image41.png]

In this example we want the cross entropy loss to be zero, i.e., when we have a one hot vector and a predicted vector which are identical, i.e., 100% in the same class for predicted and true, we get 0. In all other cases we get some number that gets larger as the predicted probability of the true class drops from one toward zero, as seen here:

[image: images/image42.png]

Formula for 2 classes:

[image: images/image43.png]

NOTE: Entropy can be generalized as a formula for N > 2 classes:

[image: images/image44.png]
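
Assuming the two formula images above show the standard entropy definitions used in decision-tree tutorials (the symbols here are the usual ones, not taken from the images), they can be written as:

```latex
% Two classes, with p_{+} and p_{-} the fractions of positive and negative examples in S
E(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}

% N > 2 classes, with p_i the fraction of examples in S that belong to class i
E(S) = -\sum_{i=1}^{N} p_i \log_2 p_i
```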

**(We want to grow a simple tree) awesome pdf tutorial** → a good heuristic prefers attributes that split the data so that each successor node is as pure as possible

  • i.e., the distribution of examples in each node is such that it mostly contains examples of a single class
  • In other words: we want a measure that prefers attributes that have a high degree of "order":
    • Maximum order: all examples are of the same class
    • Minimum order: all classes are equally likely
  • → Entropy is a measure for (un-)orderedness
  • Another interpretation:
    • Entropy is the amount of information that is contained
    • all examples of the same class → no information

[image: images/image45.png]

**Entropy** is the amount of unorderedness in the class distribution of S

IMAGE above:

  • Maximal value when the classes are equally distributed
  • Minimal value when only one class is in S

So basically if we have the **outlook attribute** and it has **3 categories**, we calculate the entropy E(feature=category) for all 3.

[image: images/image46.png]

**INFORMATION:** The I(S,A) formula below.

What we actually want is the average entropy of the entire split, that corresponds to an entire attribute, i.e., OUTLOOK (sunny & overcast & rainy)

[image: images/image47.png]

Information Gain is what we gain by subtracting the information I(S,A) from the entropy E(S).

In other words, we find the attribute that maximizes that difference, i.e., the attribute that most reduces the unorderedness / lack of order / lack of predictability.

The BIGGER GAIN is selected.

[image: images/image48.png]
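
The entropy, information and gain steps can be sketched as follows. The [yes, no] counts are those of the classic 14-example play-tennis data, which is assumed (not confirmed) to be the example used in the tutorial.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def average_entropy(splits):
    """I(S, A): entropy of each subset, weighted by the subset's share of the examples."""
    total = sum(sum(counts) for counts in splits)
    return sum(sum(counts) / total * entropy(counts) for counts in splits)

# [yes, no] counts per OUTLOOK value (assumed play-tennis data).
outlook = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}

parent_counts = [sum(col) for col in zip(*outlook.values())]
e_s = entropy(parent_counts)                    # E(S)
i_s_a = average_entropy(list(outlook.values())) # I(S, OUTLOOK)
gain = e_s - i_s_a                              # information gain
print(f"E(S) = {e_s:.3f}, I(S,A) = {i_s_a:.3f}, gain = {gain:.3f}")
```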

There are some properties to Entropy that influence INFO GAIN (?):

[image: images/image49.png]

There are some disadvantages with INFO GAIN; don't use it when an attribute has many distinct values, such as “day” (date wise) 05/07, 06/07, 07/07..31/07 etc.

Information gain is biased towards choosing attributes with a large number of values and causes:

  • Overfitting
  • fragmentation

[image: images/image50.png]

We measure Intrinsic information of an attribute, i.e., Attributes with higher intrinsic information are less useful.

We define Gain Ratio as info-gain with less bias toward multi-value attributes, i.e., “days”

NOTE: The Day attribute would still win with the Gain Ratio. Nevertheless, Gain Ratio is more reliable than Information Gain.

[image: images/image51.png]
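
A sketch of intrinsic information and gain ratio, again on assumed play-tennis counts, with a hypothetical DAY attribute that has one example per value. It reproduces the note above: DAY still wins, but by a much smaller margin than with plain information gain.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(splits):
    """E(S) minus the size-weighted entropy of the subsets."""
    parent = [sum(col) for col in zip(*splits)]
    total = sum(parent)
    weighted = sum(sum(c) / total * entropy(c) for c in splits)
    return entropy(parent) - weighted

def intrinsic_information(splits):
    """Entropy of the subset sizes themselves: high when the split is very fragmented."""
    return entropy([sum(c) for c in splits])

def gain_ratio(splits):
    return information_gain(splits) / intrinsic_information(splits)

outlook = [[2, 3], [4, 0], [3, 2]]     # [yes, no] per outlook value (assumed data)
day = [[1, 0]] * 9 + [[0, 1]] * 5      # hypothetical: one example per day

for name, splits in (("outlook", outlook), ("day", day)):
    print(name, round(information_gain(splits), 3), round(gain_ratio(splits), 3))
```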

Therefore, we define the alternative, which is the GINI INDEX. It measures impurity; from it we define the average Gini and the Gini Gain.

[image: images/image52.png]
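
The Gini index, average Gini and Gini gain follow the same pattern as entropy and information gain. A minimal sketch, using the same assumed play-tennis counts:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class fractions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def average_gini(splits):
    """Gini of each subset, weighted by the subset's share of the examples."""
    total = sum(sum(c) for c in splits)
    return sum(sum(c) / total * gini(c) for c in splits)

def gini_gain(splits):
    parent = [sum(col) for col in zip(*splits)]
    return gini(parent) - average_gini(splits)

outlook = [[2, 3], [4, 0], [3, 2]]  # [yes, no] per outlook value (assumed data)

print(f"Gini(S)      = {gini([9, 5]):.3f}")
print(f"average Gini = {average_gini(outlook):.3f}")
print(f"Gini gain    = {gini_gain(outlook):.3f}")
```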

FINALLY, further reading about decision trees and examples of INFOGAIN and GINI here.

Variational bounds on mutual information

CROSS ENTROPY, RELATIVE ENT, KL-D, JS-D, SOFT MAX

  1. A really good explanation on all of them
  2. Another good one on all of them
  3. Mastery on entropy, kullback leibler divergence (asymmetry), jensen-shannon divergence (symmetry) (has code)
  4. Entropy, mutual information and KL Divergence by AurelienGeron
  5. Gensim on divergence metrics such as KL jaccard etc, pros and cons, lda is a mess on small data.
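
To make the asymmetry / symmetry point from item 3 concrete, here is a small sketch with two made-up distributions. It assumes q is non-zero wherever p is non-zero for the plain KL call.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.8, 0.1, 0.1]   # hypothetical
q = [0.4, 0.4, 0.2]   # hypothetical

print("KL(p||q) =", kl(p, q))   # differs from KL(q||p): KL is asymmetric
print("KL(q||p) =", kl(q, p))
print("JS(p, q) =", js(p, q))   # same value in both directions
print("JS(q, p) =", js(q, p))
```

With log base 2 the JS divergence stays between 0 and 1, which makes it easier to threshold than KL.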

SOFTMAX

  1. Understanding softmax

  2. Softmax and negative log-likelihood (NLL)

  3. Softmax vs cross entropy - The terms softmax loss and cross-entropy loss are used interchangeably in industry. Technically, there is no such term as softmax loss; people use "softmax loss" when referring to "cross-entropy loss". The softmax classifier is a linear classifier that uses the cross-entropy loss function. In other words, the gradient of the cross-entropy loss tells a softmax classifier how exactly to update its weights using some optimization like gradient descent.

    The softmax() part simply normalises your network predictions so that they can be interpreted as probabilities. Once your network is predicting a probability distribution over labels for each input, the log loss is equivalent to the cross entropy between the true label distribution and the network predictions. As the name suggests, the softmax function is a “soft” version of the max function. Instead of selecting one maximum value, it splits the whole (1) among all elements, with the maximal element getting the largest portion of the distribution, but other smaller elements getting some of it as well.

    This property of softmax function that it outputs a probability distribution makes it suitable for probabilistic interpretation in classification tasks.

    Cross entropy indicates the distance between what the model believes the output distribution should be, and what the original distribution is. Cross entropy measure is a widely used alternative of squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks which have softmax activations in the output layer.
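
    A minimal sketch of softmax followed by cross-entropy for a single example. The logits are made up, and subtracting the maximum logit is a standard numerical-stability trick rather than something the notes above mention.

```python
import math

def softmax(logits):
    """Normalise raw scores into a probability distribution."""
    shift = max(logits)                      # avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(logits, true_index):
    """Cross entropy against a one-hot target: -log of the true class probability."""
    return -math.log(softmax(logits)[true_index])

logits = [2.0, 1.0, 0.1]                     # hypothetical network outputs
probs = softmax(logits)
loss_if_class0 = cross_entropy_loss(logits, 0)   # true class is the top-scoring one
loss_if_class2 = cross_entropy_loss(logits, 2)   # true class got the lowest score

print([round(p, 3) for p in probs])
print(round(loss_if_class0, 3), round(loss_if_class2, 3))
```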

GAME THEORY

  1. In computer science

DATASETS

Structured / Unstructured data

Unstructured

Structured

BIAS / VARIANCE

Overfitting your test set, a statistician's viewpoint, a great article; bottom line: use the Bonferroni correction.

Understanding what is the next stage in DL (& ML) algorithm development: basic approach - Andrew Ng on youtube

Terms: training, validation, test.

Split: training & validation 70%, test 30%

Procedure: crossfold training and validation, or further split 70% to training and validation.

**BIAS - Situation 1 - doing much worse than human:**

Human expert: 1% error

Training set error: 5% error (test on train)

Validation set error: 6% error (test on validation or CFV)

Conclusion: there is a BIAS between human expert and training set

Solution: 1. Train deeper or bigger\larger networks, 2. train longer, 3. may need more data to get to the human expert level, or 4. new model architecture.

VARIANCE - Situation 2 - validation set not close to training set error:

Human expert: 1% error

Training set error: 2% error

Validation set error: 6% error

Conclusion: there is a VARIANCE problem, i.e. OVERFITTING, between training and validation.

Solution: 1. Early stopping, 2. Regularization or 3. get more data, or 4. New model architecture.

Situation 3 - both:

Human expert: 1% error

Training set error: 5% error

Validation set error: 10% error

Conclusion: both problems occur, i.e., BIAS and VARIANCE.

Solution: do it all.
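
The three situations can be condensed into a tiny helper. The `tolerance` cut-off (in percentage points) is a hypothetical value chosen only so that the three examples above fall into the expected buckets; in practice it is a judgment call.

```python
def diagnose(human_error, train_error, validation_error, tolerance=2.0):
    """Classify the situation from three error rates given in percent.

    `tolerance` is a hypothetical threshold in percentage points.
    """
    has_bias = (train_error - human_error) > tolerance            # far from human level
    has_variance = (validation_error - train_error) > tolerance   # gap train -> validation
    if has_bias and has_variance:
        return "bias and variance"
    if has_bias:
        return "bias"
    if has_variance:
        return "variance"
    return "ok"

print(diagnose(1, 5, 6))    # situation 1
print(diagnose(1, 2, 6))    # situation 2
print(diagnose(1, 5, 10))   # situation 3
```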

Underfitting = Get more data

Overfitting = Early stop, regularization; reason: the model fits detail & noise.

  • Happens more in non parametric (and non linear) algorithms such as decision trees.

Bottom line, bigger model or more data will solve most issues.

+ In practice advice with regularized linear regression.

[image: images/image53.png]

[image: images/image54.png]

[image: images/image55.png]

[image: images/image56.png]

IMPORTANT! For Test Train efficiency when the data is from different distributions:

E.g.: **TRAIN:** 50K hours of voice chatter as the train set for a DLN, **TEST:** 10H for a specific voice-based problem, i.e., taxi chatter.

Best practice: better to take the validation & test sets from the same distribution, i.e., the 10H set.

Reason: improving scores on a validation set which is from a different distribution will not be of the same quality as improving scores on a validation set that originates from the actual distribution of the problem’s data, i.e., 10H.

**NOTE:** This is unlike the usual supervised learning setup, where all the data is from the same distribution and we split the training data into train and validation (cfv).

**Situation 4:** However, when there are 2 distributions it’s possible to extend the division of the training set to validation_training and training, and the test to validation and test.

**Split:** Train, Valid_Train = 48K\2K & Valid, Test = 5H\5H.
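
A sketch of that four-way split. The function name and the "one item per hour" stand-ins are hypothetical; the point is only that train and valid_train come from the big general set, while valid and test both come from the small target set.

```python
import random

def four_way_split(general_set, target_set, valid_train_size, seed=0):
    """Split two differently-distributed datasets into train, valid_train, valid, test."""
    rng = random.Random(seed)
    general, target = list(general_set), list(target_set)
    rng.shuffle(general)
    rng.shuffle(target)
    valid_train = general[:valid_train_size]   # same distribution as train
    train = general[valid_train_size:]
    half = len(target) // 2
    valid, test = target[:half], target[half:]  # both from the target distribution
    return train, valid_train, valid, test

# Hypothetical stand-ins: one item per hour of audio.
general_hours = range(50_000)
taxi_hours = [f"taxi-{i}" for i in range(10)]

train, valid_train, valid, test = four_way_split(general_hours, taxi_hours, valid_train_size=2_000)
print(len(train), len(valid_train), len(valid), len(test))
```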

[image: images/image57.png]

So situation 1 stays the same,

Situation 2 is Valid_Train error (train_dev)

Situation 3 is Valid_Test error - need more data, **data synthesis** - tweak train data to be similar to test data, new architecture as a solution

Situation 4 is now Test set error - get more data

SPARSE DATASETS

Sparse matrices in ML - one hot/tfidf, dictionary/list of lists/ coordinate list.
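
The three storage schemes named above (dictionary, list of lists, coordinate list) can be illustrated in plain Python on a tiny made-up matrix; real code would normally use `scipy.sparse` instead.

```python
dense = [[0, 0, 3],
         [4, 0, 0],
         [0, 0, 0]]

# Dictionary of keys: {(row, col): value}, only non-zero entries are stored.
dok = {(r, c): v for r, row in enumerate(dense) for c, v in enumerate(row) if v != 0}

# List of lists: one list per row holding (col, value) pairs.
lil = [[(c, v) for c, v in enumerate(row) if v != 0] for row in dense]

# Coordinate list: flat list of (row, col, value) triplets.
coo = [(r, c, v) for (r, c), v in sorted(dok.items())]

print(dok)
print(lil)
print(coo)
```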

TRAINING METHODOLOGIES
  1. Train test split
  2. Cross validation
  3. Transfer learning - using a pre existing classifier similar to your domain, usually trained on millions of samples. fine-tuned on new data in order to create a new classifier that utilizes that information in the new domain. Examples such as w2v or classic resnet fine-tuning.
  4. Bootstrapping training- using a similar dataset, such as yelp, with 5 stars to create a pos/neg sentiment classifier based on 1 star and 5 stars. Finally using that to label or sample select from an unlabelled dataset, in order to create a new classifier or just to sample for annotation etc.
  5. Student-teacher paradigm (facebook): use a big labelled dataset to train a teacher classifier, predict on unlabelled data, choose the best classified examples based on probability, use those to train a new student model, and finally fine-tune on the labelled dataset to create a more robust model, which is expected to know the unlabelled dataset and the labelled dataset with higher accuracy with respect to the fully supervised teacher model / baseline.

[image: images/image58.png]

  1. Yoav’s method for transfer learning for languages - train a classifier on labelled data from English and Spanish, fine tune using left out Spanish data, stop before overfitting. This can be generalized to other domains.

TRANSFER LEARNING

  1. In deep learning

[image: images/image59.png]

TRAIN / TEST / CROSS VALIDATION

Scikit-lego on group-based splitting and transformation

Images from here.

[image: images/image60.png]

[image: images/image61.png]

**Train Test methodology -**

“The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis.”

  • Random Split tests 66\33 - problem: variance each time we rerun.
  • Multiple times random split tests - problem: samples may not be included in train\test or selected multiple times.
  • Cross validation - pretty good, diff random seed results in diff mean accuracy, variance due to randomness
  • Multiple cross validation - accounts for the randomness of the CV
  • Statistical significance ( t-test) on multi CV - are two samples drawn from the same population? (no difference). If “yes”, not significant, even if the mean and std deviations differ.

Finally, when in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical significance tests.
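
A bare-bones sketch of repeated k-fold index generation in plain Python, to show what "multiple runs of k-fold" means. The function names are hypothetical, and each run simply uses a different shuffle seed; `sklearn.model_selection.RepeatedKFold` offers the same idea ready-made.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for one shuffled k-fold run."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def repeated_k_fold(n_samples, k=10, repeats=3):
    """Run k-fold several times, reshuffling with a different seed each time."""
    for seed in range(repeats):
        yield from k_fold_indices(n_samples, k, seed)

splits = list(repeated_k_fold(n_samples=25, k=5, repeats=2))
print(len(splits), "train/validation splits")
```

The per-split scores from two models can then be compared with a significance test, as suggested above.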

Out of fold - leave unseen data, do cross fold on that. Good for ensembles.

VARIOUS DATASETS

  1. 26 of them
  2. 24
  3. Eu-es, 2
  4. 50K - ModelDepot alone has over 50,000 freely accessible pre-trained models with search functionality to

IMBALANCED DATASETS

  1. (the BEST resource and a great api for python) with visual samples - it actually works well on clustering.
  2. Mastery on cost sensitive sampling
  3. Smote for imbalance

Systematic Investigation of imbalance effects in CNNs, with several observations. This is crucial when training networks, because in real life you don’t always get a balanced DS.

They recommend the following:

  1. The effect of class imbalance on classification performance is detrimental;
  2. The method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling;
  3. **Oversampling** should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent;
  4. As opposed to some classical machine learning models, oversampling **does not necessarily cause overfitting of CNNs**;
  5. **Thresholding** should be applied to compensate for prior class probabilities when overall number of properly classified cases is of interest.

General Rules:

  1. Many samples - undersampling
  2. Few samples - over sampling
  3. Consider random and non-random schemes
  4. Different sample ratios, instead of 1:1 (proof? papers?)

Balancing data sets (wiki, scikit learn & examples in SKLEARN):

  1. Oversampling the minority class
    1. (Random) duplication of samples
    2. SMOTE (in weka + needs to be installed & paper) - find the k nearest neighbours, pick one of them (ki), then

    New_Sample = current_sample + (random num in [0,1]) * (ki - current_sample)

      • **(in weka)** The nearestNeighbors parameter says how many nearest neighbor instances (surrounding the currently considered instance) are used to build an in between synthetic instance. The default value is 5. Thus the attributes of 5 nearest neighbors of a real existing instance are used to compute a new synthetic one.
      • **(in weka)** The percentage parameter says how many synthetic instances are created based on the number of the class with less instances (by default - you can also use the majority class by setting the -C option). The default value is 100. This means if you have 25 instances in your minority class, again 25 instances are created synthetically from these (using their nearest neighbours' values). With 200% 50 synthetic instances are created and so on.
    3. ADASYN - shifts the classification boundary toward the minority class; more synthetic data is generated for the minority samples that are harder to learn.
  2. Undersampling the majority class
    1. Remove samples
    2. Cluster centroids - replaces a cluster of samples (k-means) with a centroid.
    3. Tomek links - cleans overlapping samples between classes in the majority class.
    4. Penalizing the majority class during training
  3. Combined over and under (hybrid) - i.e., SMOTE and Tomek/ENN
  4. Ensemble sampling
    1. EasyEnsemble
    2. BalanceCascade
  5. Don't balance, try algorithms that perform well with unbalanced DS
    1. Decision trees - C4.5\5\CART\Random Forest
    2. SVM
  6. Penalize models
    1. Added costs for misclassification on the minority class during training, such as penalized-SVM
    2. A CostSensitiveClassifier meta classifier in Weka that wraps classifiers and applies a custom penalty matrix for misclassification.
    3. complex
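
A toy sketch of the SMOTE interpolation step only (pick a minority sample, pick one of its k nearest minority neighbours, place a new point somewhere on the line between them). The function name and points are hypothetical; for real use, the imbalanced-learn package provides a full implementation.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    if rng is None:
        rng = random.Random(0)
    current = rng.choice(minority)
    others = [p for p in minority if p is not current]
    neighbours = sorted(others, key=lambda p: math.dist(p, current))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()   # random number in [0, 1)
    # new = current + gap * (neighbour - current), coordinate by coordinate
    return tuple(c + gap * (n - c) for c, n in zip(current, neighbour))

# Hypothetical 2-D minority-class points.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.5), (3.0, 1.8), (1.2, 2.8)]

rng = random.Random(42)
synthetic = [smote_sample(minority_points, k=3, rng=rng) for _ in range(5)]
print(synthetic)
```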

SAMPLE SELECTION

  1. How to choose your sample size from a population based on confidence interval

[image: images/image62.png]

  1. Data advice: should we get more data? How much?
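
For the confidence-interval based sample size, a sketch using the common formula for estimating a proportion, n0 = z² p (1 - p) / e², with an optional finite population correction. This is assumed to be the kind of calculation the linked resource describes; the default values are illustrative.

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Samples needed to estimate a proportion within +/- margin at the given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-score
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction: fewer samples are needed for small populations.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size())                      # very large population
print(sample_size(population=10_000))     # population of 10,000
```

Using proportion = 0.5 is the conservative choice, since it maximizes p (1 - p).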

Gibbs sampling: - Gibbs Sampling is a MCMC method to draw samples from a potentially really really complicated, high dimensional distribution, where analytically, it’s hard to draw samples from it. The usual suspect would be those nasty integrals when computing the normalizing constant of the distribution, especially in Bayesian inference. Now Gibbs Sampler can draw samples from any distribution, provided you can provide all of the conditional distributions of the joint distribution analytically.
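
A toy Gibbs sampler, as a sketch of the idea: sampling a standard bivariate normal with correlation rho by alternating draws from the two conditionals, each of which is a simple 1-D normal. The target distribution, burn-in length and seed are illustrative choices.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw from a standard bivariate normal using only its conditional distributions."""
    rng = random.Random(seed)
    cond_std = math.sqrt(1 - rho ** 2)
    x, y = 0.0, 0.0
    samples = []
    for step in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_std)   # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_std)   # y | x ~ N(rho * x, 1 - rho^2)
        if step >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
xs, ys = zip(*samples)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((a - mean_x) * (b - mean_y) for a, b in samples) / len(samples)
var_x = sum((a - mean_x) ** 2 for a in xs) / len(xs)
var_y = sum((b - mean_y) ** 2 for b in ys) / len(ys)
corr = cov / math.sqrt(var_x * var_y)
print(f"mean_x = {mean_x:.2f}, mean_y = {mean_y:.2f}, corr = {corr:.2f}")
```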

LEARNING CURVES

  1. Git examples

  2. Sklearn examples

  3. Understanding bias variance via learning curves

  4. Unread - learning curve sampling applied to model based clustering - seemed like active learning, i.e., sample using EM/cluster to achieve nearly as accurate on all data

  5. Predicting sample size required for training

  6. Advice on many things, including learning curves

    This is a really wonderful study with far-reaching implications that could even impact company strategies in some cases. It starts with a simple question: “how can we improve the state of the art in deep learning?” We have three main lines of attack:

    1. We can search for improved model architectures.
    2. We can scale computation.
    3. We can create larger training data sets.

DISTILLING DATA

  1. Medium on this Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. What I found interesting about this paper is that it challenges the common approach of “the more the merrier” when it comes to training data, and shifts the focus from the quantity of the data to the quality of the data.

DATASET SELECTION

  1. Medium

[image: images/image63.png]

DATASET CONFIDENCE

  1. Dataset cartography mapping and diagnosing datasets with training dynamics - Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps---a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example---the model's confidence in the true class, and the variability of this confidence across epochs---obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

NORMALIZATION / SCALING

  1. A comparison of normalization / scaling techniques in sklearn
  2. Another great explanation on sklearn and (general) scaling - normal, min max, etc..
  3. Normalization\standardize features
    • Data has varying scales.
    • Normalize to the range 0 to 1:
      • When the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
    • Standardize to a mean of 0 and a std of 1:
      • When the algorithm assumes a Gaussian distribution, such as linear regression (LR), logistic regression (LogR) and linear discriminant analysis (LDA).

    Generally, it is a good idea to standardize data that has a Gaussian (bell curve) distribution and normalize otherwise.

  4. In general terms, we should test 0,1 or -1,1 empirically and possibly match the range to the NN gates/activation function etc.
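
A minimal sketch of both options on a made-up feature column, including the -1,1 range mentioned above. sklearn's `MinMaxScaler` and `StandardScaler` do the same thing per column and remember the fitted parameters for the test set.

```python
from statistics import mean, pstdev

def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly into [low, high]."""
    smallest, largest = min(values), max(values)
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical feature column

print(min_max_normalize(feature))              # range 0..1
print(min_max_normalize(feature, -1.0, 1.0))   # range -1..1
print(standardize(feature))
```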

REGULARIZATION

Youtube? Watch this. Also explains about iso surfaces, lp norm, sparseness.

**(what is?)** Regularization (in linear regression) - to find the best model we define a loss or cost function that describes how well the model fits the data, and try to minimize it. For a complex model that fits even the noise, i.e., overfitted, we penalize it by adding a complexity term that adds a BIGGER LOSS for more complex models.

  • Bigger lambda -> high complexity models (deg 3) are ruled out, **more punishment.**
  • Smaller lambda -> models with high training error are ruled out, i.e., a linear model on non-linear data (deg 1).
  • Optimal is in between (deg 2)

L1 - for sparse models,

L1 vs L2, some formula

Rehearsal on vector normalization - for l1,l2,l3,l4 etc, what is the norm? (absolute value in certain cases)

[image: images/image64.png]
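
As a quick rehearsal of the l1, l2, l3, l4 norms, a small sketch of the general Lp norm on a made-up vector (the absolute value matters for odd p and negative entries):

```python
def lp_norm(vector, p):
    """(sum |v_i|^p)^(1/p); p=1 is the Manhattan norm, p=2 the Euclidean norm."""
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)

v = [3.0, -4.0]   # hypothetical vector
norms = {p: lp_norm(v, p) for p in (1, 2, 3, 4)}
for p, value in norms.items():
    print(f"L{p} norm: {value:.4f}")
```

As p grows, the norm shrinks toward the largest absolute entry.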

**(Difference between? And features of)** L1 vs L2 as loss function and regularization.

  • **L1 -** moves the regressor faster, does feature selection by sparsifying coefficients (zeroing them); with sparse algorithms it is computationally efficient, with others it is not, so use L2.
  • L2 - moves slower, doesn't sparsify, computationally efficient.

Why does L1 lead to sparsity?

  • Intuition + some mathematical info
  • L1 & L2 regularization add constraints to the optimization problem. The curve H0 is the hypothesis. The solution is a set of points where the H0 meets the constraints.
  • In L2 the hypothesis is tangential to the ||w||_2. The point of intersection has both x1 and x2 components. On the other hand, in L1, due to the nature of ||w||_1, the viable solutions are limited to the corners on the axes, i.e., x1. So that the value of x2 = 0. This means that the solution has eliminated the role of x2, leading to sparsity.
  • This can be extended to higher dimensions and you can see why L1 regularization leads to solutions to the optimization problem where many of the variables have value 0.
  • In other words, L1 regularization leads to sparsity.
  • Also considered feature selection - although with LibSVM the recommendation is to feature select prior to using the SVM and use L2 instead.

L1 sparsity - intuition #2

  • For simplicity, let's just consider the 1-dimensional case.
  • L2:
    • The L2-regularized loss function F(x)=f(x)+λ∥x∥^2 is smooth.
    • This means that the optimum is the stationary point (0-derivative point).
    • The stationary point of F can get very small when you increase λ, but it still **won't be 0 unless f′(0)=0.**
  • L1:
    • The L1-regularized loss function F(x)=f(x)+λ∥x∥ is non-smooth, i.e., it has a kink at 0.
    • It's not differentiable at 0.
    • Optimization theory says that the optimum of a function is either the point with 0-derivative or one of the irregularities (corners, kinks, etc.). So, it's possible that the optimal point of F is 0 even if 0 isn't the stationary point of f.
    • In fact, it would be 0 if λ is large enough (stronger regularization effect). Below is a graphical illustration.

![alt_text](images/image65.png)

In multi-dimensional settings: if a feature is not important, the loss contributed by it is small and hence the (non-differentiable) regularization effect would turn it off.
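The 1-dimensional argument can be checked numerically. The sketch below assumes a specific loss, f(x) = 0.5·(x − a)², chosen only because both regularized problems then have closed-form minimizers; the function names are illustrative.

```python
def ridge_minimizer(a: float, lam: float) -> float:
    """Minimizer of 0.5*(x - a)**2 + lam * x**2 (smooth, so set the derivative to 0)."""
    return a / (1.0 + 2.0 * lam)


def lasso_minimizer(a: float, lam: float) -> float:
    """Minimizer of 0.5*(x - a)**2 + lam * abs(x), known as soft-thresholding."""
    magnitude = max(abs(a) - lam, 0.0)
    return magnitude if a >= 0 else -magnitude


a = 1.0  # the unregularized optimum of f
for lam in (0.1, 0.5, 1.0, 2.0):
    # The L2 solution shrinks but stays non-zero; the L1 solution hits exactly 0 once lam >= |a|.
    print(lam, ridge_minimizer(a, lam), lasso_minimizer(a, lam))
```

The L2 column only approaches 0 as λ grows, while the L1 column becomes exactly 0 as soon as λ reaches |a|, which matches the "kink at 0" intuition.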

Intuition + formulation, which is pretty good:

![alt_text](images/image66.png)

(Did not watch) Andrew Ng talks about cost functions.

L2 regularization equivalent to Gaussian prior

![alt_text](images/image67.png)

L1 regularization equivalent to a Laplacian prior (same link as above) - “Similarly the relationship between L1 norm and the Laplace prior can be understood in the same fashion. Take instead of a Gaussian prior, a Laplace prior combine it with your likelihood and take the logarithm.”
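A short derivation sketch of both equivalences, under the usual MAP setup. The symbols $D$ (data), $\sigma$ (Gaussian scale) and $b$ (Laplace scale) are notation assumed here, and the priors are taken to be zero-mean and independent across weights:

```latex
\begin{aligned}
\hat{w}_{\text{MAP}} &= \arg\max_{w} \; \log p(D \mid w) + \log p(w)
                      = \arg\min_{w} \; -\log p(D \mid w) - \log p(w) \\[4pt]
\text{Gaussian prior: } p(w) \propto \exp\!\Big(-\tfrac{\|w\|_2^2}{2\sigma^2}\Big)
  &\;\Rightarrow\; -\log p(w) = \tfrac{1}{2\sigma^2}\,\|w\|_2^2 + \text{const}
  \quad (\text{L2 penalty, } \lambda = \tfrac{1}{2\sigma^2}) \\[4pt]
\text{Laplace prior: } p(w) \propto \exp\!\Big(-\tfrac{\|w\|_1}{b}\Big)
  &\;\Rightarrow\; -\log p(w) = \tfrac{1}{b}\,\|w\|_1 + \text{const}
  \quad (\text{L1 penalty, } \lambda = \tfrac{1}{b})
\end{aligned}
```

A narrower prior (smaller $\sigma$ or $b$) therefore corresponds to a larger regularization strength $\lambda$.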

What does regularization look like in SVM - controlling ‘C’

VALIDATION

DATASETS RELIABILITY & CORRECTNESS

  1. Clever Hans effect - relates to cues left in the dataset that models exploit, instead of actually solving the defined task!
  • Ablating, i.e. removing, part of a model and observing the impact this has on performance is a common method for verifying that the part in question is useful. If performance doesn't go down, then the part is useless and should be removed. Carrying this method over to datasets, it should become common practice to perform dataset ablations, as well, for example:
  • Provide only incomplete input (as done in the reviewed paper): This verifies that the complete input is required. If not, the dataset contains cues that allow taking shortcuts.
  • Shuffle the input: This verifies the importance of word (or sentence) order. If a bag-of-words/sentences gives similar results, even though the task requires sequential reasoning, then the model has not learned sequential reasoning and the dataset contains cues that allow the model to "solve" the task without it.
  • Assign random labels: How much does performance drop if ten percent of instances are relabeled randomly? How much with all random labels? If scores don't change much, the model probably didn't learn anything interesting about the task.
  • Randomly replace content words: How much does performance drop if all noun phrases and/or verb phrases are replaced with random noun phrases and verbs? If not much, the dataset may provide unintended non-content cues, such as sentence length or distribution of function words.
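Two of the dataset ablations above (shuffling the input and assigning random labels) can be sketched with the standard library only. The helper names, the toy texts and the labels are all hypothetical; a real experiment would retrain and re-evaluate the model on the ablated data.

```python
import random


def shuffle_tokens(text: str, rng: random.Random) -> str:
    """Shuffle-the-input ablation: keep the words, destroy their order."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


def relabel_fraction(labels: list, fraction: float, rng: random.Random) -> list:
    """Random-label ablation: redraw a fraction of the labels uniformly at random.

    A redrawn label may coincide with the original one, so at most
    `fraction` of the labels actually change.
    """
    classes = sorted(set(labels))
    noisy = list(labels)
    n_changed = int(round(fraction * len(labels)))
    for i in rng.sample(range(len(labels)), n_changed):
        noisy[i] = rng.choice(classes)
    return noisy


rng = random.Random(0)
texts = ["the cat sat on the mat", "dogs chase the red ball"]
labels = ["animal", "animal", "object", "object", "animal",
          "object", "animal", "object", "animal", "object"]

shuffled = [shuffle_tokens(t, rng) for t in texts]
noisy_labels = relabel_fraction(labels, 0.1, rng)
print(shuffled)
print(noisy_labels)
```

If a model trained on `shuffled` inputs or on `noisy_labels` scores about the same as the baseline, that is a hint that the dataset leaks cues.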

2. Paper

UNIT TESTS

  1. A great :P unit test and logging post on medium - it’s actually mine :)
  2. A mind-blowing lecture about unit testing your data using Voluptuous, engarde & TDDA
  3. Great expectations, article, “TDDA” for Unit tests and CI
  4. Unit tests in python
  5. Unit tests in python - youtube
  6. Unit tests asserts
  7. Auger - automatic unit tests, has a blog post inside, doesn't work with py 3+
  8. A rather naive unit tests article aimed for DS
  9. A good pytest tutorial
  10. Mock, mock 2

FAIRNESS, ACCOUNTABILITY & TRANSPARENCY ML

  1. FATML website - The past few years have seen growing recognition that machine learning raises novel challenges for ensuring non-discrimination, due process, and understandability in decision-making. In particular, policymakers, regulators, and advocates have expressed fears about the potentially discriminatory impact of machine learning, with many calling for further technical research into the dangers of inadvertently encoding bias into automated decisions.

    At the same time, there is increasing alarm that the complexity of machine learning may reduce the justification for consequential decisions to “the algorithm made me do it.”

    1. Principles and best practices, projects
  2. FAccT - A computer science conference with a cross-disciplinary focus that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems.

  3. Paper - there is no fairness, enforcing fairness can improve accuracy

  4. Google on responsible ai practices see also PAIR

  5. Ethics and regulation in israel, lecture

  6. Bengio on ai

  7. Poisoning attacks on fairness - Research in adversarial machine learning has shown how the performance of machine learning models can be seriously compromised by injecting even a small fraction of poisoning points into the training data. We empirically show that our attack is effective not only in the white-box setting, in which the attacker has full access to the target model, but also in a more challenging black-box scenario in which the attacks are optimized against a substitute model and then transferred to the target model

FAIRNESS TOOLS

  1. PII tools, by gensim
  2. Fairlearn - a Python package to assess and improve the fairness of machine learning models.

![alt_text](images/image68.png)

  3. Sk-lego

![alt_text](images/image69.png)

    1. Regression

![alt_text](images/image70.png)

    2. Classification

![alt_text](images/image71.png)

![alt_text](images/image72.png)

![alt_text](images/image73.png)

![alt_text](images/image74.png)

    3. Information filter

![alt_text](images/image75.png)

M. Zafar et al. (2017), Fairness Constraints: Mechanisms for Fair Classification

M. Hardt, E. Price and N. Srebro (2016), Equality of Opportunity in Supervised Learning

INTERPRETABLE / EXPLAINABLE AI (XAI)

![alt_text](images/image76.png)

  1. A curated document about XAI research resources.
  2. From the above image: Paper: Principles and practice of explainable models - a really good review for everything XAI - “a survey to help industry practitioners (but also data scientists more broadly) understand the field of explainable machine learning better and apply the right tools. Our latter sections build a narrative around a putative data scientist, and discuss how she might go about explaining her models by asking the right questions. From an organization viewpoint, after motivating the area broadly, we discuss the main developments, including the principles that allow us to study transparent models vs opaque models, as well as model-specific or model-agnostic post-hoc explainability approaches. We also briefly reflect on deep learning models, and conclude with a discussion about future research directions.”
  3. Book: Interpretable Machine Learning, Christoph Molnar
  4. (great) Interpretability overview: transparency (simulatability, decomposability, algorithmic transparency), post-hoc interpretability (text explanation, visual, local, explanation by example), evaluation, utility.
  5. Medium: the great debate

![alt_text](images/image77.png)

  1. Paper: pitfalls to avoid when interpreting ML models - “A growing number of techniques provide model interpretations, but can lead to wrong conclusions if applied incorrectly. We illustrate pitfalls of ML model interpretation such as bad model generalization, dependent features, feature interactions or unjustified causal interpretations. Our paper addresses ML practitioners by raising awareness of pitfalls and pointing out solutions for correct model interpretation, as well as ML researchers by discussing open issues for further research.” - Molnar et al.

![alt_text](images/image78.png)

  1. *** whitening a black box. This is very good, includes eli5, lime, shap, many others.
  2. Book: exploratory model analysis
  3. Alibi-explain - White-box and black-box ML model explanation library. Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The focus of the library is to provide high-quality implementations of black-box, white-box, local and global explanation methods for classification and regression models.

![alt_text](images/image79.png)

  1. Hands on explainable ai youtube, git
  2. Explainability methods are not always consistent and do not always agree with each other; this article has a sensible explanation and flow for using SHAP and its many plots.

![alt_text](images/image80.png)

  1. Intro to shap and lime, part 1, part 2
  2. Lime
    1. *** how lime works behind the scenes
    2. LIME to interpret models NLP and IMAGE, github- In the experiments in our research paper, we demonstrate that both machine learning experts and lay users greatly benefit from explanations similar to Figures 5 and 6 and are able to choose which models generalize better, improve models by changing them, and get crucial insights into the models' behavior.
  3. Anchor
    1. Anchor from the authors of Lime - An anchor explanation is a rule that sufficiently “anchors” the prediction locally – such that changes to the rest of the feature values of the instance do not matter. In other words, for instances on which the anchor holds, the prediction is (almost) always the same.
  4. Shap:
    1. Medium Intro to lime and shap
    2. **** In depth SHAP
    3. Github
    4. Country happiness using shap
    5. Stackoverflow example, predicting tags, pandas keras etc
    6. Intro to Shapley and shap
    7. Fiddler on shap
  5. SHAP advanced
    1. Official shap tutorial on their plots, you can never read this too many times.
    2. What are shap values on kaggle - whatever you do start with this
    3. Shap values on kaggle #2 - continue with this
    4. How to calculate Shap values per class based on this graph

![alt_text](images/image81.png)

    5. Shap [intro](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d), [part 2](https://towardsdatascience.com/explain-any-models-with-the-shap-values-use-the-kernelexplainer-79de9464897a) with many algo examples and an explanation about the four plots.
  1. A thorough post about the many ways of explaining a model, from regression, to bayes, to trees, forests, lime, beta, feature selection/elimination
  2. Trusting models
  3. Interpret using uncertainty
  4. Keras-vis for cnns, 3 methods, activation maximization, saliency and class activation maps
  5. The notebook! Blog
  6. More resources!
  7. Visualizing the impact of feature attribution baseline - Path attribution methods are a gradient-based way of explaining deep models. These methods require choosing a hyperparameter known as the baseline input. What does this hyperparameter mean, and how important is it? In this article, we investigate these questions using image classification networks as a case study. We discuss several different ways to choose a baseline input and the assumptions that are implicit in each baseline. Although we focus here on path attribution methods, our discussion of baselines is closely connected with the concept of missingness in the feature space - a concept that is critical to interpretability research.
  8. WHAT IF TOOL - GOOGLE, notebook, walkthrough
  9. Language interpretability tool (LIT) - The Language Interpretability Tool (LIT) is an open-source platform for visualization and understanding of NLP models.
  10. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead - “trying to _explain_ black box models, rather than creating models that are _interpretable_ in the first place, is likely to perpetuate bad practices and can potentially cause catastrophic harm to society. There is a way forward -- it is to design models that are inherently interpretable. This manuscript clarifies the chasm between explaining black boxes and using inherently interpretable models, outlines several key reasons why explainable black boxes should be avoided in high-stakes decisions, identifies challenges to interpretable machine learning, and provides several example applications where interpretable models could potentially replace black box models in criminal justice, healthcare, and computer vision.”
  11. Using genetic algorithms
  12. Google’s what-if tool from PAIR

WHY WE SHOULDN’T TRUST MODELS

  1. Clever Hans effect for NLP
    1. Datasets need more love
    2. Datasets ablation and public beta
    3. Inter-prediction agreement
  2. Behavioral testing and CHECKLIST
    1. Blog, Youtube, paper, git
    2. Yonatan Hadar on the subject in Hebrew

DEBIASING MODELS

  1. Adversarial removal of demographic features - “We show that demographic information of authors is encoded in -- and can be recovered from -- the intermediate representations learned by text-based neural classifiers. The implication is that decisions of classifiers trained on textual data are not agnostic to -- and likely condition on -- demographic attributes.”
    “we explore several techniques to improve the effectiveness of the adversarial component. Our main conclusion is a cautionary one: do not rely on the adversarial training to achieve invariant representation to sensitive features.”

  2. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection (paper) , github, presentation by Shauli et al. - removing biased information such as gender from an embedding space using nullspace projection.
    The objective is this: given a representation of text, for example BERT embeddings of many resumes/CVs, we want to reach a state where a certain quality, for example the gender of the person who wrote the resume, is not encoded in X. They used the light definition of “not encoded”, i.e., you can't predict the quality from the representation with a higher-than-random score using a linear model. In other words, any linear model you train will not be able to predict the person’s gender from the embedding space and will reach about 50% accuracy.
    This is done by an iterative process that includes: 1. Training a linear model to predict the concept from the representation. 2. Performing a ‘projection to the null space’ of the linear classifier. This is a standard linear algebra calculation that zeroes the component of the representation along the separating direction that the linear model represents, making the model useless, i.e., its score will always be zero. This is done iteratively on the neutralized output, i.e., in the second iteration we look for an alternative way to predict the gender from X, until we reach 50% accuracy (or some other metric you want to measure). At this point we have neutralized all the linear directions in the embedding space that were predictive of the gender of the author.

    For a matrix W, the null space is the subspace of all X such that WX=0, i.e., W maps X to the zero vector. Projecting onto the null space is a linear projection that keeps only the part of a vector that W maps to zero. For example, you can take a 3D vector and calculate its projection onto the XY plane.
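A single projection step can be sketched in plain Python for the simplest case, a classifier with one weight vector w. The vectors below are made up, and this is only the projection part of one iteration, not the authors' implementation (which retrains a classifier and projects again on every round).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_to_nullspace(x, w):
    """Remove from x its component along w, so that dot(w, result) == 0."""
    scale = dot(w, x) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]


# Hypothetical weight vector of a trained linear probe: it only looks at the z axis.
w = [0.0, 0.0, 2.0]
x = [1.0, -3.0, 5.0]

x_clean = project_to_nullspace(x, w)
print(x_clean)          # the z component is removed, x and y are kept
print(dot(w, x_clean))  # the probe's score on the cleaned vector
```

Here the null space of w is the XY plane, so the projection simply drops the z coordinate, mirroring the 3D example in the text.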

  3. Can we eliminate predictive samples? It's an open question; maybe we can use influence functions?

    Understanding Black-box Predictions via Influence Functions - How can we explain the predictions of a blackbox model? In this paper, we use influence functions — a classic technique from robust statistics — to trace a model’s prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction.

    We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually indistinguishable training-set attacks.

  4. Removing gender bias using pair mean PCA

PRIVACY

  1. Privacy in DataScience podcast
  2. Fairness in AI

DIFFERENTIAL PRIVACY

  1. Differential privacy has emerged as a major area of research in the effort to prevent the identification of individuals and private data. It is a mathematical definition for the privacy loss that results to individuals when their private information is used to create AI products. It works by injecting noise into a dataset, during a machine learning training process, or into the output of a machine learning model, without introducing significant adverse effects on data analysis or model performance. It achieves this by calibrating the noise level to the sensitivity of the algorithm. The result is a differentially private dataset or model that cannot be reverse engineered by an attacker, while still providing useful information. Uses BOTLON & EPSILON
  2. youtube
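The idea of calibrating noise to the sensitivity of a query can be illustrated with the Laplace mechanism on a counting query. This is a minimal sketch under stated assumptions: the function names and the toy `ages` data are hypothetical, and the noise is sampled as the difference of two exponential draws, which follows a Laplace distribution.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query released with the Laplace mechanism.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is sensitivity / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # toy data
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 4; the released value is 4 plus noise
```

A smaller epsilon means a larger noise scale and therefore stronger privacy at the cost of a less accurate answer.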

ANONYMIZATION

  1. Using NER (omri mendels)

DE-ANONYMIZATION

  1. GPT2 - Of language datasets

![alt_text](images/image82.png)

META-LEARNING

What is? Automated Machine Learning provides methods and processes to make Machine Learning available for non-Machine Learning experts, to improve efficiency of Machine Learning and to accelerate research on Machine Learning.

Personal note: AutoML algorithms in this field will bridge the gap and automate several key processes, but they will not allow a practitioner to do serious research or to solve business or product problems easily. The importance of this field is in advancing each subfield, whether HPO, NAS, etc. These selective novelties can help us solve specific issues. Take HPO, for example: we can use it to save time and money on redundant parameter searches, especially for resource-heavy algorithms such as deep learning (think GPU costs).

Personal thoughts on optimizations: be advised that optimizing a problem will not guarantee a good result. You may overfit your problem in ways you are not aware of, beyond traditional overfitting, and better accuracy doesn't guarantee a better result (for example, if your dataset is unbalanced, needs imputing, cleaning, etc.).

Always examine the data and results in order to see if they are correct.

Automl.org’s github - it has a backup for the following projects.

Automl.org is a joint effort between two universities, Freiburg and Hannover. Their website curates information regarding:

  1. HPO - hyper parameter optimization
  2. NAS - neural architecture search
  3. Meta Learning - learning across datasets, warmstarting of HPO and NAS etc.

Automl aims to automate these processes:

  • Preprocess and clean the data.
  • Select and construct appropriate features.
  • Select an appropriate model family.
  • Optimize model hyperparameters.
  • Postprocess machine learning models.
  • Critically analyze the results obtained.

Historically, AFAIK AutoML’s birth started with several methods to optimize each one of the previous processes in ml. IINM, weka’s paper (2012) was the first step in aggregating these ideas into a first public working solution.

The following is referenced from AutoML.org:

ML Systems

  • AutoWEKA is an approach for the simultaneous selection of a machine learning algorithm and its hyperparameters; combined with the WEKA package it automatically yields good models for a wide variety of data sets.
  • Auto-sklearn is an extension of AutoWEKA using the Python library scikit-learn which is a drop-in replacement for regular scikit-learn classifiers and regressors.
  • TPOT is a data-science assistant which optimizes machine learning pipelines using genetic programming.
  • (google) H2O AutoML provides automated model selection and ensembling for the H2O machine learning and data analytics platform. (git)
  • TransmogrifAI is an AutoML library running on top of Spark.
  • MLBoX is an AutoML library with three components: preprocessing, optimisation and prediction.

Hyper param optimization

Architecture Search

EXAMPLES

The taxi problem is an intro to a well-known machine learning problem. The paper explains feature engineering, analysis, and the use of various regression algorithms to solve the problem; you can use this as a base for many regression and classification problems.

A **second study** (regression, random forest, XGBoost (extreme gradient boosting trees)).

Standard error estimate -- measures the distance from the estimated value to the real value

R^2 error estimate - compares the distance of the estimates from the real values against the distance of the real values from their mean; 1 means no error, 0 means a lot.

**** With regression prediction it's best to create **dummy variables** (i.e., binary variables - exists or doesn't exist) from numeric variables that encode categories, such as grid_number to grid_1, grid_2, etc.
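The dummy-variable idea can be sketched without any library. The `one_hot` helper and the `grid_number` values below are hypothetical; in practice `pandas.get_dummies` does the same job on a DataFrame column.

```python
def one_hot(values, prefix):
    """Turn a categorical column into 0/1 dummy columns named <prefix>_<value>."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


grid_number = [1, 2, 1, 3]  # hypothetical taxi grid cells, numeric but not ordinal
rows = one_hot(grid_number, "grid")
for row in rows:
    print(row)
```

Each row now has exactly one active `grid_*` column, so a regression model no longer treats grid 3 as "three times" grid 1.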

SURVIVAL ANALYSIS

  1. A good introduction for Survival Analysis

PROPAGANDA

  1. medium

EVALUATION METRICS

A metric learning reality check

1. [Medium](https://medium.com/@tkm45/updates-to-a-metric-learning-reality-check-730b6914dfe7)
2. [Git](https://github.com/KevinMusgrave/pytorch-metric-learning?fbclid=IwAR3PmPTDgYFok4p095WmkRWLfWhixyjFXkZgFJzeYXs5Y92pofoNZL_lGTg)
3. [Website](https://kevinmusgrave.github.io/powerful-benchmarker/papers/mlrc/?fbclid=IwAR3jK3-qFphFsO7ocmjeN-zPLkcaQkTAcC78cFUNFVe1BgXzM-iE5PLh3bU)

SUPERVISED

Precision \ Recall \ ROC \ AUC

  • Performance Measures:

    A balanced confusion matrix is better than one that is either one row of numbers and one of zeros, or a column of numbers and a column of zeros. Therefore an algorithm that outputs a lower classification accuracy but has a better confusion matrix wins.

    Precision is the number of true positive predictions divided by the total number of positive class values predicted.

    Precision = True Positives / (True Positives + False Positives)

    Low precision indicates many false positives.

    Recall is the number of true positive predictions divided by the number of positive class values in the test data.

    Recall (sensitivity) = True Positives / (True Positives + False Negatives)

    Low recall indicates many false negatives.

F1 Harmonic Mean Score

F1_Score = 2 * ((Precision * Recall) / (Precision + Recall))


F1 helps select a model based on a balance between precision and recall.
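The three formulas above translate directly into code. A minimal sketch, with made-up confusion-matrix counts and a hypothetical helper name:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```

With these counts precision is high (few false positives) but recall is only one half (many false negatives), and the harmonic mean sits closer to the lower of the two.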


In a multi-class problem, there are many methods to calculate F1, some are more appropriate for balanced data, others are not.
  1. The best link yet - micro macro weighted (macro balanced, micro imbalanced, weighted imbalanced)

  2. Micro vs macro

  3. Micro vs weighted (not a good link)

  4. What is weighted

  5. Micro is accuracy in multi class
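The difference between the averaging schemes is easiest to see on a small imbalanced example. The sketch below is a hand-rolled illustration, not a library API; the helper name and the toy labels are assumptions.

```python
def f1_averages(y_true, y_pred):
    """Micro, macro and support-weighted F1 for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1, supports = [], []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)  # number of true instances of this class
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class_f1) / len(labels)
    weighted = sum(f * s for f, s in zip(per_class_f1, supports)) / sum(supports)
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Imbalanced toy data: 8 samples of class "a", 2 of class "b", one "b" misclassified.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
scores = f1_averages(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, accuracy)
```

Macro treats both classes equally, so the weak minority class pulls it down the most; micro pools all decisions and, in this single-label setting, coincides with accuracy.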


  • Accuracy = (1 – Error) = (TP + TN)/(PP + NP) = Pr(C), the probability of a correct classification, where PP = TP + FN and NP = TN + FP are the numbers of actual positive and actual negative cases.
  • Sensitivity (recall) = TP/(TP + FN) = TP/PP = the ability of the test to detect disease in a population of diseased individuals.
  • Specificity = TN/(TN + FP) = TN / NP = the ability of the test to correctly rule out the disease in a disease-free population.

(What are ?) Sensitivity and specificity against ROC and AUC.

ROC curve and AUC in weka - explains what the curve should look like for the negative or positive predictions, against what is actually plotted.

Mean F1? How do we calculate it.

Multiclass Precision / Recall, part 1

Precision at K, formulas, examples, git 1, git 2, git 3 (suggestive, recommendation application)

RECALL, PRECISION AND F1

  1. Medium on controlling the decision threshold using the probabilities any model gives, code, samples, tutorial
  2. Another good medium explanation on precision / recall / fpr/ tpr etc
  3. Scikit lego on choosing the threshold using grid search
  4. Best explanation ever

Recall

*   One day, your girlfriend asks you: ‘Sweetie, do you remember all birthday surprises from me?’
*   This simple question puts your life in danger. To extend your life, you need to **recall** all **10** surprising events from your memory.
*   So, `recall` is the ratio of the number of **events you can _correctly_ recall** to the number of **all correct events**. If you can recall all **10** events correctly, then your recall ratio is **1.0** (**100%**). If you can recall **7** events correctly, your recall ratio is **0.7** (**70%**).

Precision

*   For example, you answer **15** times: **10** events are correct and **5** events are wrong. This means you can recall all events but it’s not so `precise`.
*   So, `precision` is the ratio of the number of **events you can _correctly_ recall** to the number of **all events you recall** _(mix of correct and wrong recalls)_. In other words, it is how precise your recall is.
*   From the previous example (10 real events, 15 answers: 10 correct answers, 5 wrong answers), you get **100%** recall but your precision is only **66.67%** (10 / 15).

Confusion matrix wise: bottom line is recall (% correct out of positive cases), right column is precision (% of POS predictions) & % accuracy in diagonal

![alt_text](images/image83.png)

F1 score:

  • conveys the balance between the precision and the recall
  • 2*((precision*recall)/(precision+recall))

Yet another(pretty good) source

Another (bad) source for explaining, precision, recall, accuracy, true positive rate etc.

(How to use precision and recall?) answer by aurelien geron:

  • In a binary classifier, the decision function is the function that produces a score for the positive class.
  • In a logistic regression classifier, that decision function is simply a linear combination of the input features.
  • If that score is greater than some threshold that you choose, then the classifier "predicts" the positive class, or else it predicts the negative class.
  • **If you want your model to have high precision (at the cost of a low recall), then you must set the threshold pretty high. This way, the model will only predict the positive class when it is absolutely certain. For example, you may want this if the classifier is selecting videos that are safe for kids: it's better to err on the safe side. **
  • Conversely, if you want high recall (at the cost of a low precision) then you must use a low threshold. For example, if the classifier is used to detect intruders in a nuclear plant, then you probably want to detect all actual intruders, even if it means getting a lot of false alarms (called "false positives").
  • If you make a few assumptions about the distribution of the data (i.e., the positive and negative class are separated by a linear boundary plus Gaussian noise), then computing the logistic of the score gives you the probability that the instance belongs to the positive class. A score of 0 corresponds to the 50% probability. So by default, a LogisticClassifier predicts the positive class if it estimates the probability to be greater than 50%. In general, this sounds like a reasonable default threshold, but really it all depends on what you want to do with the classifier.
  • If the assumptions I mentioned above were perfect, then if the Logistic Classifier outputs a probability of X% for an instance, it means there is exactly X% chance that it's positive. But in practice, the assumptions are imperfect, so I try to always make it clear that we are talking about an "estimated probability", not an actual probability.
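The threshold trade-off described above can be sketched with a few made-up decision scores. Everything here (scores, labels, helper names) is hypothetical; the logistic function turns a score into an estimated probability, and the threshold is then applied to that probability.

```python
import math


def sigmoid(score: float) -> float:
    """Logistic function: a score of 0 maps to an estimated probability of 0.5."""
    return 1.0 / (1.0 + math.exp(-score))


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive iff sigmoid(score) >= threshold."""
    preds = [sigmoid(s) >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions: nothing wrong was flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical decision-function outputs and true labels.
scores = [-2.0, -1.0, -0.5, 0.2, 0.4, 1.0, 1.5, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for threshold in (0.3, 0.5, 0.9):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

On this toy data, raising the threshold moves precision up and recall down, which is the behavior the kids-video and intruder-detection examples rely on.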

ROC CURVES

  1. Diff between precision recall to roc curve
  2. What is ROC AUC and PR AUC and when to use them (i.e., for imbalanced data use PR AUC)
  3. What is AUC (AUROC)
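One way to build intuition for ROC AUC is its ranking interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it by brute force over all pairs; the helper name and toy data are assumptions, and the quadratic loop is only meant for small examples.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = roc_auc(scores, labels)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, ROC AUC does not depend on any particular decision threshold.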

(RMSE - what is?) - it is important to recall that RMSE has the same unit as the dependent variable (DV). It means that there is no absolute good or bad threshold; however, you can define one based on your DV. For data that ranges from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from **0 to 1, it is not that small anymore.** However, although the smaller the RMSE, the better,

(R^2 vs RMSE) - R-squared is conveniently scaled, typically between 0 and 1 (it can go negative when a model does worse than predicting the mean), whereas RMSE is not scaled to any particular values. This can be good or bad; obviously R-squared can be more easily interpreted, but with RMSE we explicitly know how much our predictions deviate, on average, from the actual values in the dataset. So in a way, RMSE tells you more.
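The scale argument can be checked on a toy example: rescaling the target changes RMSE but leaves R-squared untouched. The data and helper names below are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

# The same predictions with the target expressed in units 1000 times larger.
scaled_true = [v / 1000 for v in y_true]
scaled_pred = [v / 1000 for v in y_pred]

print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
print(rmse(scaled_true, scaled_pred), r_squared(scaled_true, scaled_pred))
```

RMSE drops by a factor of 1000 after rescaling even though the fit is identical, which is why it can only be judged relative to the range of the dependent variable.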

I also found this video really helpful.

Kappa - measures accuracy while considering imbalanced datasets
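Cohen's kappa compares the observed agreement with the agreement expected by chance given the class frequencies. A minimal sketch with a hypothetical helper and made-up imbalanced labels (it assumes the chance agreement is below 1, otherwise the ratio is undefined):

```python
def cohen_kappa(y_true, y_pred):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    return (observed - expected) / (1.0 - expected)


# Imbalanced toy data: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100  # a "classifier" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy, cohen_kappa(y_true, always_majority))
```

The majority-class predictor reaches 90% accuracy, yet its kappa is 0 because that accuracy is exactly what chance agreement would give.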

References:

  1. A Survey on Deep Learning in Medical Image Analysis

UNSUPERVISED

  1. Silhouette Analysis vs Elbow Method vs Davies-Bouldin Index: Selecting the optimal number of clusters for KMeans clustering

BENCHMARKING

Numpy Blas:

  1. How do I know which version of BLAS is installed
  2. Benchmark OpenBLAS, Intel MKL vs ATLAS

![alt_text](images/image84.png)

  1. Another comparison

![alt_text](images/image85.png)

GLUE:

  1. Glue / super glue

State of the art in AI:

  1. In terms of domain X datasets

Cloud providers:

Datasets:

Hardware:

Platforms

Algorithms:

Scaling networks and predicting performance of NN:

  • A great overview of NN types, but the idea behind the video is to create a system that can predict train time and possibly accuracy when scaling networks using multiple GPUs, there is also a nice slide about general hardware recommendations.


NLP

HYPER PARAMETER OPTIMIZATION

1. Using HyperOpt -

  • Random Search

  • Tree of Parzen Estimators (TPE)

    Hyperopt has been designed to accommodate Bayesian optimization algorithms based on Gaussian processes and regression trees, but these are not currently implemented.

    All algorithms can be run either serially, or in parallel by communicating via MongoDB.

  • MLflow, HyperparameterHunter, Hyperopt, concept drift, unit tests.

  • Hyperopt for hyper parameter search

  1. HyperparameterHunter -
    provides a wrapper for machine learning algorithms that saves all the important data. Simplify the experimentation and hyperparameter tuning process by letting HyperparameterHunter do the hard work of recording, organizing, and learning from your tests — all while using the same libraries you already do. Don't let any of your experiments go to waste, and start doing hyperparameter optimization the way it was meant to be.

  2. Implementation and comparison - HH slower than HO due to usage of skopt.

Multi CPU Processing

  1. Numpy on multi process, and how to use it.
  2. Pandas on multi process
  3. ***Dask - youtube intros
    1. Diagnostic dashboards
    2. Distributed sklearn (amazing)
  4. Dask vs swifter vs vectorize
    1. Dask is dask
    2. Swifter will attempt to work out whether dask or pandas apply should be used. It looks like it uses multiple CPUs, so it may not just be using dask on the backend?
    3. Vectorize is just another option
  5. Multi process cpu example
  6. Medium on MP, using MP pool, Ray etc.
  7. Async (multi process/thread/coroutines/asyncio)

MONITORING & ALERTS


alibi-explain (see compendium) and alibi-detect (see compendium)

Concept drift

  1. Some advice on Medium: relabel using the latest model (can we even trust it?), then retrain.
  2. Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber - Previous research on concept drift mostly proposed model retraining after observing performance decreases. However, this approach is suboptimal because the system fixes the problem only after suffering from poor performance on new data. Here, we introduce an adversarial validation approach to concept drift problems in user targeting automation systems. With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data.
  3. Drift estimator between data sets using random forest, formula is in the medium article above, code here at mlBOX


Alibi-detect - is an open source Python library focused on outlier, adversarial and drift detection.


CLASSIC MACHINE LEARNING

ASSOCIATION RULES

  1. Association rules slides - apriori, eclat, fp growth - pretty complete
  2. Terms - lift, confidence
  3. Paper - basic concepts and algo

Knoldus

1. [Apriori](https://blog.knoldus.com/machinex-why-no-one-uses-apriori-algorithm-for-association-rule-learning/)
2. [Association rules](https://blog.knoldus.com/machinex-two-parts-of-association-rule-learning/)
3. [Fp-growth](https://blog.knoldus.com/machinex-frequent-itemset-generation-with-the-fp-growth-algorithm/)
4. [Fp-tree construction](https://blog.knoldus.com/machinex-understanding-fp-tree-construction/)

APRIORI

5. [Apyori tut](https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/) [git](https://github.com/ymoch/apyori)
6. [Efficient apriori](https://github.com/tommyod/Efficient-Apriori)
7. [One of the best known association rules algorithm](https://machinelearningmastery.com/market-basket-analysis-with-association-rule-learning/) - apriori in weka
8. [A very good visual example of a transaction DB with the apriori algorithm step by step](http://www.lessons2all.com/Apriori.php)
9. [Python 3.0 code](http://adataanalyst.com/machine-learning/apriori-algorithm-python-3-0/)
10. [Mlxtend](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/) [tutorial](https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/)
    1. Apriori
    2. Rules
    3. fpgrowth
    4. fpmax

FP Growth

11. [How to construct the fp-tree](https://www.youtube.com/watch?v=gq6nKbye648)
12. The same example, but with a graph showing that at lower support thresholds fp-growth costs less in terms of calculation time.
13. [Coursera video](https://www.coursera.org/learn/data-patterns/lecture/ugqCs/2-5-fpgrowth-a-pattern-growth-approach)
14. Another clip video
15. How to validate these algorithms - probably the best way is confidence/support/lift

    It depends on your task. But usually you want all three to be high.

    • high support: should apply to a large number of cases
    • high confidence: should be correct often
    • high lift: indicates it is not just a coincidence
16. Difference between apriori and fp-growth
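
The three measures can be sketched in a few lines. This is a minimal illustration on a hypothetical transaction database, not the API of apyori or mlxtend; the function names are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```

In this toy data the rule bread → milk has decent support and confidence, but its lift is below 1: milk is already so common that buying bread does not make it more likely. This is why lift is checked alongside the other two.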

PROBABILISTIC ALGORITHMS

NAIVE BAYES

  1. Vidhya on NB
  2. Bayesian tree

BAYES, BAYESIAN BELIEF NETWORKS

  1. Mastery on bayes theorem
  2. Introduction to BBN - a very good blog post
  3. A complementing slide presentation that shows how to build the network’s tables
  4. A very nice presentation regarding BBN
  5. Maximum likelihood (log likelihood) - proofs for Bernoulli, normal, Poisson.
  6. Another example

MARKOV MODELS

Random vs Stochastic (here and here):

  • A **variable** is 'random'.
  • A **process** is 'stochastic'.

Apart from this difference, the two words are synonyms.

In other words:

  • A random vector is a generalization of a single random variable to many.
  • A stochastic process is a sequence of random variables, or a sequence of random vectors (and then you have a vector-stochastic process).

(What is a Markov Model?) A Markov Model is a stochastic (random) model which models temporal or sequential data, i.e., data that are ordered.

  • It provides a way to model the dependencies of current information (e.g. weather) with previous information.
  • It is composed of states, a transition scheme between states, and emission of outputs (discrete or continuous).
  • Several **goals** can be accomplished by using Markov models:
    • Learn **statistics** of sequential data.
    • Do **prediction** or estimation.
    • Recognize patterns.

(sunny cloudy explanation) A Markov chain is a probabilistic process that **relies on the current state to predict the next state.**

  • to be effective, the current state has to depend on the previous state in some way
  • if it looks cloudy outside, the next state we expect is rain.
  • if the rain starts to subside into cloudiness, the next state will most likely be sunny.
  • A Markov model is not informative for every process. In a lottery, for example, this week's winning numbers do not depend on the previous week's winning numbers, so the current state says nothing about the next one.
  1. They show how to build an order 1 Markov table of probabilities, predicting the next state given the current.
  2. Then it shows the state diagram built from this table.
  3. Then how to build a transition matrix from the 3 states, i.e., from the probabilities in the table.
  4. Then how to calculate the next state from the “current state vector” using vector*matrix multiplications.
  5. Then it discusses the problem of the prediction always settling into rain; the solution is to use the last two states in a bigger table of order 2. He does not really explain why the probabilities don't change when we add more states; they stay the same as in order 1, just repeated.
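
The vector*matrix step can be sketched as follows. The three states come from the weather example, but the transition probabilities are hypothetical placeholders, not the values used in the video.

```python
STATES = ["sunny", "cloudy", "rainy"]

# Hypothetical order-1 transition matrix: row = current state, column = next state.
# Each row sums to 1.
TRANSITIONS = [
    [0.6, 0.3, 0.1],  # from sunny
    [0.3, 0.3, 0.4],  # from cloudy
    [0.2, 0.5, 0.3],  # from rainy
]

def step(state_vector, transitions):
    """One step of the chain: row vector times transition matrix."""
    n = len(transitions)
    return [sum(state_vector[i] * transitions[i][j] for i in range(n)) for j in range(n)]

def most_likely_next(current_state):
    """Name of the most probable next state given the current one."""
    vec = [1.0 if s == current_state else 0.0 for s in STATES]
    nxt = step(vec, TRANSITIONS)
    return STATES[nxt.index(max(nxt))]

today = [0.0, 1.0, 0.0]                    # it is cloudy now
tomorrow = step(today, TRANSITIONS)        # distribution over tomorrow's weather
day_after = step(tomorrow, TRANSITIONS)    # two steps ahead

print(tomorrow, day_after, most_likely_next("cloudy"))
```

Applying `step` repeatedly gives the forecast further ahead; each call only needs the previous distribution, which is the order-1 Markov assumption in code form.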

MARKOV MODELS / HIDDEN MARKOV MODEL

HMM tutorials

  1. HMM tutorial
    1. Part 1, 2, 3, 4
  2. Medium
    1. Intro to HMM / MM
    2. Paper like example
  3. HMM with sklearn and networkx

HMM variants

  1. Stack exchange on hmm
  2. HMM LEARN (sklearn, still being developed)
  3. Pomegranate (this is good)
    1. General mixture models
    2. Hmm
    3. Bayes classifiers and naive Bayes
    4. Markov chains
    5. Bayesian networks
    6. Markov networks
    7. Factor graphs
  4. GHMM with python wrappers,
  5. Hmms (old)

HMM (what is? And why HIDDEN?) - the idea is that there are things that you CAN OBSERVE and there are things that you CAN'T OBSERVE. From the things you OBSERVE you want to INFER the things you CAN'T OBSERVE (HIDDEN). E.g., you play against someone else in a game: you don't see their choice of action, but you see the result.

  1. Python code, previously part of sklearn
  2. Python seqLearn - supervised multinomial HMM

This youtube video part1 - explains the hidden Markov model. It shows the visual representation of the model and how we go from that to the formula:


It breaks down the formula to:

  • **transition probability formula** - the probability of going from Zk to Zk+1
  • **emission probability formula** - the probability of emitting Xk from state Zk
  • **(Pi) initial distribution** - the probability of Z1=i for i=1..m
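
For reference, the standard joint distribution of an HMM factorises into exactly these three terms. The notation below is an assumption that follows the list above: Z are the hidden states, X the observations, n the sequence length and π the initial distribution. It is the textbook form, not necessarily the exact expression shown in the video.

```latex
p(x_1,\dots,x_n,\,z_1,\dots,z_n)
  = \underbrace{\pi(z_1)}_{\text{initial}}
    \;\underbrace{p(x_1 \mid z_1)}_{\text{emission}}
    \;\prod_{k=2}^{n}
      \underbrace{p(z_k \mid z_{k-1})}_{\text{transition}}
      \;\underbrace{p(x_k \mid z_k)}_{\text{emission}}
```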


In part2 of the video:

  1. Probably the **simplest **explanation of Markov Models and HMM as a “game” - link
  2. This video explains the building blocks of the needed knowledge in HMM: starting probabilities P0, transitions and emissions (state probabilities)
  3. This post, explains HMM and ties our understanding.

A cute explanation on quora:


This is the iconic image of a Hidden Markov Model. There is some state (x) that changes with time (Markov), and you want to estimate or track it. Unfortunately, you cannot directly observe this state. That's the hidden part. But you can observe something correlated with the state (y).

OBSERVED DATA -> INFER -> what you CAN'T OBSERVE (HIDDEN).


**Considering this model:**

  • where P(X0) is the initial state distribution for happy or sad
  • where P(Xt | Xt-1) is the transition model from time t-1 to time t
  • where P(Yt | Xt) is the observation model for happy and sad (X) in 4 situations (w, sad, crying, facebook)
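
A sketch of how these three pieces combine to infer the hidden state from observations (forward filtering). All probabilities are hypothetical placeholders, only two of the observation types are used, and the initial distribution is applied at the first observed step for simplicity.

```python
STATES = ["happy", "sad"]

# All numbers below are hypothetical, chosen only to make the example run.
INITIAL = {"happy": 0.5, "sad": 0.5}                  # P(X0)
TRANSITION = {                                        # P(Xt | Xt-1)
    "happy": {"happy": 0.8, "sad": 0.2},
    "sad": {"happy": 0.4, "sad": 0.6},
}
EMISSION = {                                          # P(Yt | Xt)
    "happy": {"crying": 0.1, "facebook": 0.9},
    "sad": {"crying": 0.7, "facebook": 0.3},
}

def normalise(belief):
    total = sum(belief.values())
    return {state: value / total for state, value in belief.items()}

def forward_filter(observations):
    """Return P(X_t | y_1..y_t) for the last time step t."""
    # First observation: prior times emission.
    belief = normalise({s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES})
    for obs in observations[1:]:
        # Predict: push the current belief through the transition model.
        predicted = {
            s: sum(belief[prev] * TRANSITION[prev][s] for prev in STATES) for s in STATES
        }
        # Update: weight each state by how well it explains the observation.
        belief = normalise({s: predicted[s] * EMISSION[s][obs] for s in STATES})
    return belief

posterior = forward_filter(["crying", "crying"])
print(posterior)
```

After two "crying" observations the belief shifts strongly towards the sad state, even though the state itself is never observed.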


INPUT OUTPUT HMM (IOHMM)

  1. Incomplete python code for unsupervised / semi-supervised / supervised IOHMM - training is there, prediction is missing.
  2. Machine learning - a probabilistic approach, david barber.

CONDITIONAL RANDOM FIELDS (CRF)

  1. Make sense intro to CRF, comparison against HMM
  2. HMM, CRF, MEMM
  3. Another crf article
  4. Neural network CRF NNCRF
  5. Another one
  6. scikit-learn inspired API for CRFsuite

REGRESSION ALGORITHMS

  1. Sk-lego to fit with intervals a linear regressor on top of non linear data


  1. Sk-lego monotonic


  1. Lightning - lightning is a library for large-scale linear classification, regression and ranking in Python.


  1. Linear regression TBC
  2. CART - classification and regression tree, basically the diff between classification and regression trees - instead of IG we use sum squared error
  3. SVR - regression based svm, with kernel only.
  4. NNR- regression based NN, one output node
  5. LOGREG - Logistic regression - is used as a classification algo to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Output is BINARY. E.g., if the predicted probability of killing the bug is > 0.5 it is assumed dead; if it is < 0.5 it is assumed alive.
  • Assumes binary outcome
  • Assumes no outliers
  • Assumes no strong intercorrelation (multicollinearity) among the predictors (the inputs)

Regression Measurements:

  1. R^2 - several reasons it can be too high.
    1. Too many variables
    2. Overfitting
    3. Time series - seasonality trends can cause this

KERNEL REGRESSION

Gaussian kernel regression takes a weighted average of the surrounding points.

  • The kernel's variance, sigma^2, informally controls the smoothness of your approximated function.
  • Smaller values of sigma will cause the function to **overfit** the data points, while larger values will cause it to underfit.
  • There is a proposed method to find sigma in the post!
  • Gaussian kernel regression is equivalent to creating an RBF network with the properties described in the post.
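
A minimal sketch of the weighted average, in its usual Nadaraya-Watson form with a 1-D Gaussian kernel. The data and the two sigma values are hypothetical and only meant to show the effect of the smoothness parameter.

```python
import math

def gaussian_kernel_regression(x_query, xs, ys, sigma):
    """Gaussian-weighted average of the training targets around x_query."""
    weights = [math.exp(-((x_query - x) ** 2) / (2 * sigma ** 2)) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical zig-zag training data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]

# Small sigma: the estimate follows the nearest training point (overfits).
sharp = gaussian_kernel_regression(1.0, xs, ys, sigma=0.1)
# Large sigma: the estimate approaches the global mean of ys (underfits).
smooth = gaussian_kernel_regression(1.0, xs, ys, sigma=10.0)

print(sharp, smooth)
```

With the small sigma the prediction at x=1 is essentially the training value there; with the large sigma it is close to the overall mean of the targets. Note that for queries very far from all training points the weights can underflow to zero, which this sketch does not handle.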


DIMENSIONALITY REDUCTION

PRINCIPAL COMPONENT REGRESSION (PCR) / PARTIAL LEAST SQUARES (PLS)

Principal component regression (PCR) and partial least squares (PLS) - basically PCA followed by linear regression; however, PLS makes use of the response variable in order to identify the new features.

One can describe Principal Components Regression as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is the direction along which the observations vary the most. In other words, the first PC is a line that fits as close as possible to the data. One can fit p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint. The idea is that the principal components capture the most variance in the data using linear combinations of the data in successively orthogonal directions. In this way, we can also combine the effects of correlated variables to get more information out of the available data, whereas in regular least squares we would have to discard one of the correlated variables.

The PCR method described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, so there is no guarantee that the directions that best explain the predictors are also the best for predicting the response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new, smaller set of M features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.

LABEL ALGORITHMS

Label Propagation / Spreading

Note: very much related to weakly and semi supervision, i.e., we have small amounts of labels and we want to generalize the labels to other samples, see also weak supervision methods.

  1. Step 1: build a laplacian graph using KNN, distance metric is minkowski with p=2, i.e. euclidean distance.
  2. Step by step tutorial, part 2
  3. Spreading (a propagation upgrade). Essentially a community graph algorithm that resembles KNN in nature. It uses a semi-supervised data set (i.e., labeled and unlabeled data) to spread or propagate labels to the unlabeled data in small increments. Using a KNN-like methodology, each unlabeled sample is given a label based on its first-order friends; if there is a tie, a random label is chosen. Nodes are connected using euclidean distance.
  4. Difference between propagation and spreading is a laplacian matrix, vs normalized LM
  5. Laplacian matrix on youtube, videos 30-33
  6. Really good example notebook
  7. Spreading vs propagation
  8. https://en.wikipedia.org/wiki/Label_Propagation_Algorithm
  9. Youtube 1, 2, 3,
  10. Medium,
  11. Sklearn, 1, 2, 3, 4, 5,


  1. Git, incremental LP
  2. [Git2](https://github.com/yamaguchiyuto/label_propagation)
    1. Harmonic Function (HMN) [Zhu+, ICML03]
    2. Local and Global Consistency (LGC) [Zhou+, NIPS04]
    3. Partially Absorbing Random Walk (PARW) [Wu+, NIPS12]
    4. OMNI-Prop (OMNIProp) [Yamaguchi+, AAAI15]
    5. Confidence-Aware Modulated Label Propagation (CAMLP) [Yamaguchi+, SDM16]


  1. Presentation 1,2
  2. Neo4j 1, 2, 3,

CLUSTERING ALGORITHMS

  1. Vidhya on clustering and methods
  2. KNN intuition 2, thorough explanation 3 - classify a new sample by looking at the majority vote of its K nearest neighbours. k=1 is a special case. With an even number of classes, choose an odd K that is not a multiple of the number of classes, in order to break ties.
  3. Determining the number of clusters, a comparison of several methods, elbow, silhouette etc
  4. A good visual example of kmeans / gmm
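
The KNN majority vote mentioned in the list above can be sketched with the standard library alone. The points, labels and function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
print(knn_predict(points, labels, (4.0, 4.0), k=1))  # k=1: label of the single nearest point
```

With two classes and an odd k the vote cannot tie. In this sketch a tie would be resolved in favour of the label seen first among the nearest neighbours, which is an implementation detail rather than a rule of the algorithm.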

K-means is sensitive to outliers, which can skew results (because we rely on the mean).

  • K-medoids is basically k-means with the most central object as the center, rather than a virtual center point based on the mean of all points; we keep choosing medoid samples that minimise the SSE.

    • k-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori.
    • It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
    • A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.
    • Does not scale to many samples: it is O(k(n-k)^2) per iteration.
    • Randomized resampling can assure efficiency and quality.
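
A small sketch of the idea, using the simple alternating variant (assign points, then re-pick each medoid) rather than the full PAM swap procedure. The initialisation with the first k points and the data are assumptions made for the example.

```python
import math

def total_dissimilarity(candidate, cluster):
    """Sum of distances from `candidate` to every member of `cluster`."""
    return sum(math.dist(candidate, other) for other in cluster)

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids: assign to nearest medoid, then re-pick each medoid."""
    medoids = list(points[:k])  # naive initialisation (an assumption of this sketch)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid is the member with the smallest total dissimilarity.
        new_medoids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_medoids.append(min(cluster, key=lambda c: total_dissimilarity(c, cluster)))
            else:
                new_medoids.append(medoids[i])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical data: two tight groups plus one extreme outlier.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (100, 100)]
medoids, clusters = k_medoids(points, k=2)
print(medoids)
```

The outlier ends up in the second cluster, but that cluster's medoid stays inside the dense group around (8, 8). A mean-based centroid for the same cluster would sit at roughly (31, 31), far from any real point.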

From youtube (okay video)


X-means

X-means(paper):

  1. Theory behind the BIC calculation, with a formula.
  2. Code: calculate BIC in k-means


G-means

G-means improves on X-means. From the paper: The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution, using the Anderson-Darling test. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center.


GMM - Gaussian Mixture Models

  • What is GMM? In short, it is k-means with mean/variance centroids; a sample can belong to several centroids, each with a certain probability.


Let us briefly talk about a probabilistic generalisation of k-means: the Gaussian Mixture Model(GMM).

In k-means, you carry out the following procedure:

  • specify k centroids, initialising their coordinates randomly

  • calculate the distance of each data point to each centroid

  • assign each data point to its nearest centroid

  • update the coordinates of the centroid to the mean of all points assigned to it

  • iterate until convergence.

In a GMM, you carry out the following procedure:

  • specify k multivariate Gaussians (termed components), initialising their mean and variance randomly

  • calculate the probability of each data point being produced by each component (sometimes termed the responsibility each component takes for the data point)

  • assign each data point to the component it belongs to with the highest probability

  • update the mean and variance of the component to the mean and variance of all data points assigned to it

  • iterate until convergence

You may notice the similarity between these two procedures. In fact, k-means is a GMM with fixed-variance components. Under a GMM, the probabilities (I think) you're looking for are the responsibilities each component takes for each data point.
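
The difference between the two assignment steps can be sketched for a 1-D mixture. The component parameters are hypothetical, and the code only shows the responsibility computation, not the full fitting loop.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def responsibilities(x, components):
    """components: list of (weight, mean, variance). Returns P(component | x) for each."""
    weighted = [w * gaussian_pdf(x, m, v) for w, m, v in components]
    total = sum(weighted)
    return [p / total for p in weighted]

def hard_assignment(x, means):
    """k-means style assignment: index of the nearest centroid."""
    return min(range(len(means)), key=lambda i: abs(x - means[i]))

# Hypothetical mixture: two equally weighted components with unit variance.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

soft = responsibilities(2.0, components)        # point halfway between the means
near_first = responsibilities(0.5, components)  # point close to the first mean

print(soft, near_first, hard_assignment(0.5, [0.0, 4.0]))
```

A point halfway between the two means is shared equally by both components, while k-means would have to pick one of them. Close to a mean, the responsibility is nearly 1 and the two views agree.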

  1. Gmm code on sklearn using ellipsoids
  2. How to select the K using bic
  3. Density estimation for gmm - nice graph

KMEANS++ / Kernel Kmeans

  1. A comparison of kmeans++ vs kernel kmeans
  2. Kernel Kmeans is part of TSLearn
  3. Elbow method,
  4. elbow and mean silhouette,
  5. elbow on medium using mean distance per cluster from the center
  6. Kneed a library to find the knee in a curve

KNN

  1. Nearpy, knn in scale! On github

DBSCAN

a DBSCAN visualization - very good!

DBSCAN for GPS.

A practical guide to dbscan - pretty good

ST-DBSCAN

  1. Paper - st-dbscan an algo for clustering spatio temporal data
  2. Popular git
  3. git

HDBSCAN*

(what is?) HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.

  • Github code
  • (great) Documentation with examples, for clustering, outlier detection, comparison, benchmarking and analysis!
  • (Jupyter example) - take a look and see how to use it; usage examples are also in the docs and on github

What are the algorithm’s steps:

  1. Transform the space according to the density/sparsity.
  2. Build the minimum spanning tree of the distance weighted graph.
  3. Construct a cluster hierarchy of connected components.
  4. Condense the cluster hierarchy based on minimum cluster size.
  5. Extract the stable clusters from the condensed tree.

OPTICS

(What is?) Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data.

  • Its basic idea is similar to DBSCAN,
  • but it addresses one of DBSCAN's major weaknesses: **the problem of detecting meaningful clusters in data of varying density.**
  • (How?) the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering.
  • a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. (This is represented as a dendrogram.)

SVM CLUSTERING

Paper

An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of input classes.

  1. The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labelled, this is repeated until an initial convergence occurs.
  2. Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed.
  3. The lowest confidence data (e.g., the worst of the mislabelled data) then has its labels switched to the other class label.
  4. The SVM is then re-run on the data set (with partly re-labelled data) and is guaranteed to converge in this situation since it converged previously, and now it has fewer data points to carry with mislabelling penalties.
  5. This approach appears to limit exposure to the local minima traps that can occur with other approaches. Thus, the algorithm then improves on its weakly convergent result by SVM re-training after each re-labeling on the worst of the misclassified vectors – i.e., those feature vectors with confidence factor values beyond some threshold.
  6. The repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown.

ANOMALY DETECTION

“whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier).”

=> Often, this ability is used to clean real data sets

Two important distinctions must be made:

  • novelty detection: the training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
  • outlier detection: the training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations.

  1. Medium - good
  2. kdnuggets
  3. Index for Z-score and other moving averages.
  4. A survey
  5. A great tutorial about AD using 20 algos in a single python package.
  6. Mastery on classifying rare events using lstm-autoencoder
  7. A comparison of One-class SVM versus Elliptic Envelope versus Isolation Forest versus LOF in sklearn. (The examples illustrate how the performance of covariance.EllipticEnvelope degrades as the data becomes less and less unimodal. svm.OneClassSVM works better on data with multiple modes, and ensemble.IsolationForest and neighbors.LocalOutlierFactor perform well in every case.)
  8. Using Autoencoders - the information is there, but it's all over the place.
  9. Twitter anomaly -
  10. Microsoft anomaly - a well documented black box; I can't find a description of the algorithm, just hints about what they did
    1. up/down trend, dynamic range, tips and dips
    2. API here
  11. STL and LSTM for anomaly prediction by Microsoft
  12. Medium on AD
  13. Medium on AD using Mahalanobis, AE and

OUTLIER DETECTION

  1. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. The outlier detection methods should allow the user to identify global, contextual and collective outliers.


  1. Pyod


  1. Anomaly detection resources (great)
  2. Novelty and outlier detection in sklearn
  3. SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised outlier detector training and prediction. Notably, anomaly detection is often formulated as an unsupervised problem since the ground truth is expensive to acquire. To compensate for the unstable nature of unsupervised algorithms, practitioners often build a large number of models for further combination and analysis, e.g., taking the average or majority vote. However, this poses scalability challenges in high-dimensional, large datasets, especially for proximity-based models operating in Euclidean space.


SUOD is therefore proposed to address the challenge at three complementary levels: random projection (data level), pseudo-supervised approximation (model level), and balanced parallel scheduling (system level). As mentioned, the key focus is to accelerate training and prediction when a large number of anomaly detectors are present, while preserving the prediction capacity. Since its inception in Jan 2019, SUOD has been successfully used in various academic research and industry applications, including PyOD [[2]](https://github.com/yzhao062/suod#zhao2019pyod) and [IQVIA](https://www.iqvia.com/) medical claim analysis. It could be especially useful for outlier ensembles that rely on a large number of base estimators.
  1. Skyline
  2. Scikit-lego 1, 2, 3

ISOLATION FOREST

The best resource to explain isolation forest - the basic idea is that for an anomaly (in the example) only 4 partitions are needed, for a regular point in the middle of a distribution, you need many many more.

Isolation Forest - isolating observations:

  • randomly selecting a feature
  • randomly selecting a split value between the maximum and minimum values of the selected feature.

Recursive partitioning can be represented by a tree structure; the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies.

=> When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

The paper is pretty good too: in the training stage, iTrees are constructed by recursively partitioning the given training set until instances are isolated or a specific tree height is reached, which results in a partial model.

Note that the tree height limit l is automatically set by the sub-sampling size ψ: l = ceiling(log2 ψ), which is approximately the average tree height [7].

The rationale of growing trees up to the average tree height is that we are only interested in data points that have shorter-than-average path lengths, as those points are more likely to be anomalies.
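A minimal sketch of the path-length idea in plain Python, assuming toy Gaussian data and hypothetical helper names (`path_length`, `mean_path_length`). It only follows the branch that contains the query point and it skips the average-path-length normalisation from the paper, so it is an illustration and not a full isolation forest.

```python
import math
import random


def path_length(x, sample, rng, depth, limit):
    """Count the random splits needed to isolate x, capped at the height limit."""
    if depth >= limit or len(sample) <= 1:
        return depth
    feature = rng.randrange(len(x))              # randomly select a feature
    lo = min(row[feature] for row in sample)
    hi = max(row[feature] for row in sample)
    if lo == hi:                                 # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                  # random split between min and max
    # Follow only the side of the split that x falls on.
    if x[feature] < split:
        side = [row for row in sample if row[feature] < split]
    else:
        side = [row for row in sample if row[feature] >= split]
    return path_length(x, side, rng, depth + 1, limit)


def mean_path_length(x, data, n_trees=200, psi=64, seed=0):
    """Average path length of x over a forest of random trees."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(psi))            # l = ceiling(log2 psi)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, psi)           # sub-sample of size psi
        total += path_length(x, sample, rng, 0, limit)
    return total / n_trees


# Toy data (an assumption for illustration): a Gaussian blob around the origin.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(256)]

inlier_score = mean_path_length((0.0, 0.0), data)
outlier_score = mean_path_length((8.0, 8.0), data)
```

The far-away point should end up with a clearly shorter average path than the point in the middle of the blob, which is the signal the forest uses.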

LOCAL OUTLIER FACTOR

  • LOF computes a score (called local outlier factor) reflecting the degree of abnormality of the observations.
  • It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.
  • In practice the local density is obtained from the **k-nearest neighbors**.
  • The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors to its own local density:
    • a normal instance is expected to have a local density similar to that of its neighbors,
    • while abnormal data are expected to have much smaller local density.
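The bullets above can be sketched in plain Python. This is a brute-force version of the standard LOF definition (k-distance, reachability distance, local reachability density); the function name `lof_scores` and the toy grid are assumptions for illustration, and it does not handle duplicate points.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (excluding itself).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def local_density(i):
        # Reachability distance of i from neighbour j: max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    density = [local_density(i) for i in range(n)]
    # Ratio of the neighbours' average density to the point's own density.
    return [
        sum(density[j] for j in neighbors[i]) / (k * density[i])
        for i in range(n)
    ]


# Toy data: a 3x3 grid of "normal" points plus one far-away point.
cluster = [(float(x), float(y)) for x in range(3) for y in range(3)]
points = cluster + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
```

Scores near 1 mean "as dense as my neighbours"; the isolated point gets a score far above 1.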

ELLIPTIC ENVELOPE

  1. We assume that the regular data come from a known distribution (e.g. data are Gaussian distributed).
  2. From this assumption, we generally try to define the “shape” of the data,
  3. And can define outlying observations as observations which stand far enough from the fit shape.

ONE CLASS SVM

  1. A nice article about OCSVM, with GitHub code; two methods are described.

  2. Resources for OCSVM

  3. It looks like there are two such methods. The 2nd one: the algorithm obtains a spherical boundary, in feature space, around the data. The volume of this hypersphere is minimized, to minimize the effect of incorporating outliers in the solution.

    The resulting hypersphere is characterized by a center and a radius R>0 as distance from the center to (any support vector on) the boundary, of which the volume R^2 will be minimized.

CLUSTERING METRICS

For community detection, text clusters, etc.

Google search for convenience

Silhouette:

  1. TFIDF, PCA, SILHOUETTE for deciding how many clusters to use, the knee/elbow method.
  2. Embedding based silhouette community detection
  3. A notebook, using the SuperHeat package, clustering w2v cosine similarity matrix, measuring using silhouette score.
  4. Topic modelling clustering, can't access this document on GitHub

DECISION TREES

  1. Using Hellinger distance to split supervised datasets, instead of gini and entropy. Claims better results.

Visualize decision trees, forests

Explains the similarities and how to measure them. Which is the best split? Based on SSE and GINI (good info about gini here).
  • For classification the Gini cost function is used, which provides an indication of how “pure” the leaf nodes are (how mixed the training data assigned to each node is).

Gini = sum(pk * (1 - pk))

  • Early stop - 1 sample per node is overfitting, 5-10 are good
  • Pruning - evaluate what happens if the leaf nodes are removed; if there is a big drop, we need them.
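A small sketch of the Gini formula above, assuming class labels are given as a plain list. The helper `split_gini` (a hypothetical name) scores a candidate split as the size-weighted average of the two child impurities, which is the usual way a split is compared.

```python
from collections import Counter


def gini(labels):
    """Gini = sum(pk * (1 - pk)) over the class proportions pk in a node."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


pure = gini(["a", "a", "a", "a"])             # a pure node
mixed = gini(["a", "a", "b", "b"])            # a 50/50 node
perfect = split_gini(["a", "a"], ["b", "b"])  # a split that separates the classes
```

Lower is better: a pure node scores 0 and a 50/50 two-class node scores 0.5.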

KDTREE

  1. Similar to a binary search tree, just by using the median and selecting a feature randomly for each level.
  2. Used to find nearest neighbours.
  3. Many applications use KD trees: reducing color space, database key search, etc.
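A minimal KD-tree sketch with nearest-neighbour search. One assumption to flag: this version cycles through the features level by level (the common textbook variant) instead of picking a feature at random as in the note above; the dict-based node layout and function names are also just illustrative.

```python
import math


def build_kdtree(points, depth=0):
    """Split on the median of one axis per level, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only search the other side if the splitting plane is closer than the best hit.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2))
```

The pruning test on `abs(diff)` is what makes the search faster than scanning every point.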

RANDOM FOREST

Using an ensemble of trees to create a high dimensional and sparse representation of the data and classifying using a linear classifier

How to deal with imbalanced data in Random-forest -

  1. One is based on cost sensitive learning.
  2. Other is based on a sampling technique

EXTRA TREES

  1. A comparison between random forest and extra trees
    Fig. 1: Comparison of random forests and extra trees in presence of irrelevant predictors. In blue are presented the results from the random forest and red for the extra trees. The results are quite striking: Extra Trees perform consistently better when there are a few relevant predictors and many noisy ones

![alt_text](images/image111.png)

  1. Difference between RF and ET
  2. Differences #2

ACTIVE LEARNING ALGORITHMS

PASSIVE AGGRESSIVE CLASSIFIER

  1. The Passive Aggressive (PA) algorithm is perfect for classifying massive streams of data (e.g. Twitter). It's easy to implement and very fast, but does not provide global guarantees like the support-vector machine (SVM).
  2. Youtube, seems like active learning in stream..?
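A sketch of one online update of the classic hinge-loss PA classifier, assuming labels in {-1, +1} and a plain list as the weight vector. The function name `pa_update` is hypothetical; the step size is the loss divided by the squared norm of the sample, with no aggressiveness cap.

```python
def pa_update(w, x, y):
    """One Passive Aggressive step for a sample x with label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)                  # hinge loss
    if loss == 0.0:
        return w                                   # passive: margin already satisfied
    tau = loss / sum(xi * xi for xi in x)          # aggressive: smallest step that fixes it
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 2.0], 1)                  # first sample of the stream
w2 = pa_update(w1, [1.0, 2.0], 1)                  # same sample again: (almost) no change
```

Each sample is seen once and then discarded, which is why the method suits massive streams.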

LINEAR SEPARATOR ALGORITHMS

SEQUENTIAL MINIMAL OPTIMIZATION (SMO)

- What is the SMO (SVM) classifier? - Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.

Differences between libsvm and liblinear & smo vs libsvm

SUPPORT VECTOR MACHINES (SVM)

- Definition, tutorial:

  • For optimal 2-class classification.
  • Extended for **regression** and clustering problems (1 class).
  • Kernel-based
    • maps feature vectors into a higher-dimensional space using a kernel function
    • builds an optimal linear discriminating function in this space (linear?) or an optimal hyper-plane (RBF?) that fits the training data
  • In case of SVM, the kernel is not defined explicitly.
  • A distance needs to be defined between any 2 points in the hyper-space.
  • The solution is optimal: the margin between the separating hyper-plane and the nearest feature vectors **is maximal**.
  • **The feature vectors that are the closest to the hyper-plane are called support vectors, which means that the position of other vectors does not affect the hyper-plane (the decision function).**
  • The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin.

- Multi-class: one against all, one against one, and Directed Acyclic Graph SVM (one against one with DAG). Bottom line: One Against One in LIBSVM.

A few good explanations about SVM, formulas, figures, C, gamma, etc.

Math of SVM on youtube:

  • very good number-based example #1
  • Very good but lengthy and chatty example with make-sense math #2 - udacity
    • Linear - maximize the margin, optimal solution; only a few close points are really needed, the others are zeroed out by the alphas (alpha says “pay attention to this variable”) in the quadratic programming equation. XtX is a similarity function (pairs of points that relate to each other in output labels and how similar they are to one another, i.e. whether the Xi’s point in the same direction); y1y2 are the labels. Therefore further points are not needed. But the similarity is important here(?)
    • Non-linear - e.g. circle inside a circle, needs to map to a higher plane, a measure of similarity as XtX is important. We use this similarity idea to map into a higher plane, but we choose the higher plane for the purpose of a final function that behaves like a known function, such as (A+B)^2. It turns out that (q1^2, q2^2, root(2)q1q2) is engineered with that root(2) thing so that the dot product of two mapped points turns out to be (X^tY)^2. We can substitute this formula (X^tY)^2 instead of the X^tX in the quadratic equation to do that for us. This is the kernel trick that maps the inner circle class to one side and the outer circle class to the other and passes a plane in between them.
    • Similarity is defined intuitively as all the points in one class vs the other.. I think
    • A general kernel K=(X^tY + C)^p is a polynomial kernel that can define the above function and others.
    • Quadratic eq with possible kernels including the polynomial.

![alt_text](images/image112.png)

*   **Most importantly the kernel function is our domain knowledge. (?) IMO we should choose a kernel that fits our feature data.**
*   The output of K is a number(?)
*   Infinite dimensions - possible as well.
*   Mercer condition - the kernel acts like a distance/similarity measure, and that is the “rule” a kernel needs to follow.

Regularization and influence

- (basically **punishment** for overfitting and **raising** the non-linear class points higher and lower)

SUPPORT VECTOR REGRESSION (SVR)

  • The method of SVM can be extended to solve regression problems.
  • Similar to SVM, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
  • using many kernel transforms to turn a non-linear problem into a linear problem beforehand.
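The "ignores any training data close to the model prediction" part is usually written as the epsilon-insensitive loss. A tiny sketch, where the name `epsilon` and its default value are assumptions for illustration:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube around the prediction, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


inside_tube = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05 is ignored
outside_tube = epsilon_insensitive_loss(1.0, 1.5)   # error 0.5 costs about 0.4
```

Points inside the tube contribute nothing to the cost, so only the points on or outside it end up shaping the model.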

From the link above, it seems like liblinear is very much the same thing, without those kernel transforms. So, as they say, in cases where the kernel transforms are not needed (they mention document classification), it will be faster.

  • libsvm (SMO) implementation
    • kernel (n^2)
    • Linear SVM (n^3)
  • liblinear - optimized to deal with linear classification without kernels
    • Complexity O(n)
    • does not support kernel SVMs.
    • Scores higher

n is the number of samples in the training dataset.

Conclusion: In practice libsvm becomes painfully slow at 10k samples. Hence for medium to large scale datasets use liblinear and forget about libsvm (or maybe have a look at approximate kernel SVM solvers such as LaSVM, which saves training time and memory usage for large scale datasets).

Support vector clustering (SVC)

paper, short explanation

KERNELS

What are kernels in SVM - intuition and example

  • allows us to do certain calculations faster which otherwise would involve computations in higher dimensional space.
  • K(x, y) = <f(x), f(y)>. Here K is the kernel function, x, y are n dimensional inputs. f is a map from n-dimension to m-dimension space. < x,y> denotes the dot product. usually m is much larger than n.
  • normally calculating <f(x), f(y)> requires us to calculate f(x), f(y) first, and then do the dot product. These two computation steps can be quite expensive as they involve manipulations in m dimensional space, where m can be a large number.
  • Result is ONLY a scalar, i.e., 1-dim space.
  • We **don’t** need to do that calc if we use a clever kernel.

Example:

Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:

f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)

<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024, i.e., 1×16 + 2×20 + 3×24 + ...

A lot of algebra. Mainly because f is a mapping from 3-dimensional to 9-dimensional space.

With a kernel it's faster.

K(x, y) = (4 + 10 + 18 ) ^2 = 32^2 = 1024

A kernel is a magical shortcut to calculate even infinite dimensions!
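The worked example above can be checked in a few lines of Python. The helper names `feature_map` and `dot` are just for illustration; the numbers are the ones from the example.

```python
def feature_map(v):
    """f(x) = all pairwise products (x1x1, x1x2, ..., x3x3)."""
    return [a * b for a in v for b in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1, 2, 3), (4, 5, 6)
explicit = dot(feature_map(x), feature_map(y))   # work in the 9-dimensional space
kernel = dot(x, y) ** 2                          # stay in the 3-dimensional space
```

Both routes give 1024; the kernel route never builds the 9-dimensional vectors.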

Relation to SVM?:

  • The idea of SVM is that y = w phi(x) +b, where w is the weight, phi is the feature vector, and b is the bias.
  • if y> 0, then we classify datum to class 1, else to class 0.
  • We want to find a set of weight and bias such that the margin is maximized.
  • Previous answers mention that the kernel makes data linearly separable for SVM. I think a more precise way to put this is: kernels do not make the data linearly separable.
  • The feature vector phi(x) makes the data linearly separable. Kernel is to make the calculation process faster and easier, especially when the feature vector phi is of very high dimension (for example, x1, x2, x3, ..., x_D^n, x1^2, x2^2, ...., x_D^2).
  • Why it can also be understood as a measure of similarity: if we put the definition of kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product means the projection of phi(x) onto phi(y). or colloquially, how much overlap do x and y have in their feature space. In other words, how similar they are.

Kernels:

  • **SVM::LINEAR Linear kernel. No mapping is done, linear discrimination (or regression) is done in the original feature space. It is the fastest option.**

    ![alt_text](images/image113.png)

  • **SVM::RBF Radial basis function (RBF), a good choice in most cases.**

    ![alt_text](images/image114.png)

Grid search for SVM Hyper parameters - in openCV. Example in log space

  • I.e., (for example, C = 2^-5 , 2 ^-3 , . . . , 2^15 , γ = 2^-15 , 2 ^-13 , . . . , 2^3 ).
  • There are heuristic methods that skip some search options
  • However, no need for heuristics: computation time is small, the grid can be parallelized and we don't skip parameters.
  • Controlling search complexity using two tier grid, coarse grid and then fine tune.
  • Regularization parameter C - penalty example
  • In non linear kernels:
    • Kernel choice
    • Kernel parameters
  • RBF - gamma, low and high values are far and near influence
    • Great Tutorial at LIBSVM
    • Reasonable first choice
    • when the relation between class labels and attributes is nonlinear.
    • With a suitable choice of parameters it behaves like the linear kernel (only after finding C and gamma!)
    • Certain parameters make it behave like the sigmoid kernel.
    • Fewer hyperparameters than the polynomial kernel.
    • 0 < Kij <= 1, unlike polynomial kernels, whose values may go to infinity or zero when the degree is large.
    • Sigmoid is not valid under some parameters.
    • DON'T USE when the #features is very large, use linear.
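A sketch of the log-space grid from the first bullet, plus a second, finer tier around a coarse winner. The helper `fine_exponents`, its half-width and its step are assumptions for illustration; evaluating each (C, gamma) pair with cross-validation is left out.

```python
# Coarse tier: C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3.
C_exponents = range(-5, 16, 2)
gamma_exponents = range(-15, 4, 2)
coarse_grid = [(2.0 ** c, 2.0 ** g) for c in C_exponents for g in gamma_exponents]


def fine_exponents(best_exponent, half_width=2.0, step=0.25):
    """Second tier: exponents in a narrow window around the coarse winner."""
    count = int(round(2 * half_width / step)) + 1
    return [best_exponent - half_width + i * step for i in range(count)]


# Example: if the coarse search liked C = 2^3, refine between 2^1 and 2^5.
fine_C = [2.0 ** e for e in fine_exponents(3)]
```

Every pair in the grid is independent of the others, which is why the search parallelizes so easily.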

RBF kernel use cases

  • Number of instances << number of features, e.g. 38 instances over 7000 features.

    **RBF=LINEAR** When the number of features is large, we may not need to use RBF over Linear and vice versa (after finding C and gamma).

  • **Number of instances & features is VERY LARGE**, e.g. 20K samples X 20K features.

    Similar performance with libsvm and liblinear, but liblinear is faster by 150 times. Rule of thumb: use liblinear for document classification.

  • **Number of instances >> number of features.** Usually high dimensional mapping using a non-linear kernel. If we insist on liblinear, -s 2 leads to faster training.

Kdnuggets: When to use DL over SVM and other algorithms. Computationally expensive for a very small boost in accuracy.

ENSEMBLES

  1. How to combine several sklearn algorithms into a voting ensemble
  2. Stacking api, MLXTEND
  3. Mastery on stacking neural nets - really good
    1. Stacked Generalization Ensemble
    2. Multi-Class Classification Problem
    3. Multilayer Perceptron Model
    4. Train and Save Sub-Models
    5. Separate Stacking Model
    6. Integrated Stacking Model
  4. Vidhya on trees, bagging boosting, gbm, xgb
  5. Parallel grad boost trees
  6. A comprehensive guide to ensembles read! (samuel jefroykin)
    1. Basic Ensemble Techniques
    2. 2.1 Max Voting
    3. 2.2 Averaging
    4. 2.3 Weighted Average
    5. Advanced Ensemble Techniques
    6. 3.1 Stacking
    7. 3.2 Blending
    8. 3.3 Bagging
    9. 3.4 Boosting
    10. Algorithms based on Bagging and Boosting
    11. 4.1 Bagging meta-estimator
    12. 4.2 Random Forest
    13. 4.3 AdaBoost
    14. 4.4 GBM
    15. 4.5 XGB
    16. 4.6 Light GBM
    17. 4.7 CatBoost
  7. Kaggler guide to stacking
  8. Blending vs stacking
  9. Kaggle ensemble guide
  • bagging (random sample selection, multi-classifier training), random forest (random feature selection for each tree, multi-tree training), boosting (creating stumps, each new stump tries to fix the previous error, at last combining results using new data, each model is assigned a skill weight that is accounted for in the end), voting (majority vote, any set of algorithms within weka, results combined via mean or some other way), stacking (same as voting, but predictions are combined using a meta model).

BAGGING - bootstrap aggregating

Bagging - best example so far: create m bags and put n’ < n samples (60% of n) in each bag, drawn with replacement, which means that the same sample can be selected twice or more. Train a model on each bag, query each of the m models with the test point (x), and calculate the mean; this is the classification.
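A sketch of the bag / train / average loop in plain Python. The base model here is a 1-nearest-neighbour regressor on 1-D data, chosen only because it fits in two lines; the function names, m = 25 and the toy line y = 2x are all assumptions for illustration.

```python
import random


def fit_1nn(bag):
    """Base model: predict the target of the closest training x."""
    def predict(x):
        return min(bag, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagging_fit(data, m=25, frac=0.6, seed=0):
    """Train m models, each on n' = frac * n samples drawn with replacement."""
    rng = random.Random(seed)
    n_prime = int(frac * len(data))
    return [fit_1nn([rng.choice(data) for _ in range(n_prime)]) for _ in range(m)]


def bagging_predict(models, x):
    """Query every model and average the answers."""
    return sum(model(x) for model in models) / len(models)


data = [(float(x), 2.0 * x) for x in range(20)]   # toy data on the line y = 2x
models = bagging_fit(data)
prediction = bagging_predict(models, 10.0)
```

Each individual model can be off because its bag may miss the closest points, but the average lands near 20.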

![alt_text](images/image115.png)

**Overfitting** - not an issue with bagging, as the mean of the models averages, or smooths, the “curves”, even if all of them are overfitted.

![alt_text](images/image116.png)

BOOSTING

Mastery on using all the boosting algorithms: Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

AdaBoost: similar to bagging, but creates a system that preferentially chooses samples that were modelled poorly before.

  1. Create bag_1 with n’ < n samples, drawn with replacement, create model_1, test on ALL train.
  2. Create bag_2 with n’ samples with replacement, but add a bias for selecting from the samples that were wrongly classified by model_1. Create model_2. Average results from model_1 and model_2, i.e., who was classified correctly or not.
  3. Create bag_3 with n’ samples with replacement, but add a bias for selecting from the samples that were wrongly classified by model_1+2. Create model_3. Average results from model_1, 2 & 3, i.e., who was classified correctly or not. Iterate onward.
  4. Create bag_m with n’ samples with replacement, but add a bias for selecting from the samples that were wrongly classified by the previous steps.
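A compact AdaBoost sketch on 1-D data with decision stumps. Note the swap: instead of literally resampling biased bags as in the steps above, this uses the equivalent sample-weight formulation (misclassified samples get heavier weights), and the final answer is a weighted vote rather than a plain average. All names and the toy labels are assumptions for illustration.

```python
import math


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best decision stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # the model's "skill weight"
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x >= threshold else -polarity for x in xs]
        # Wrongly classified samples (y * p == -1) get a larger weight.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy labels: +1 only in the middle, which no single stump can fit.
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this toy set a single stump gets 3 of 10 points wrong, while three boosted stumps classify all of them correctly.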

![alt_text](images/image117.png)

XGBOOST

R Installation in Weka, then XGBOOST in weka through R

Parameters for weka mlr class.xgboost.

  • https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
  • Here is an example configuration for multi-class classification:
  • weka.classifiers.mlr.MLRClassifier -learner “nrounds = 10, max_depth = 2, eta = 0.5, nthread = 2”
  • classif.xgboost -params "nrounds = 1000, max_depth = 4, eta = 0.05, nthread = 5, objective = "multi:softprob"

Copy: nrounds = 10, max_depth = 2, eta = 0.5, nthread = 2

Special case of random forest using XGBOOST:

#Random Forest™ - 1000 trees
bst <- xgboost(data = train$data, label = train$label, max_depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nrounds = 1, objective = "binary:logistic")

#Boosting - 3 rounds
bst <- xgboost(data = train$data, label = train$label, max_depth = 4, nrounds = 3, objective = "binary:logistic")

RF1000: - max_depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nrounds = 1, nthread = 2

XG: nrounds = 10, max_depth = 4, eta = 0.5, nthread = 2

Gradient Boosting Classifier

  1. Loss functions and GBC vs XGB
  2. Why is XGB faster than SK GBC
  3. Good XGB vs GBC tutorial
  4. XGB vs GBC

REINFORCEMENT LEARNING

Q-LEARN

Q-Learning

  • Markov chain problem, (state, action, new state, reward)
  • Lots of Exploration in the beginning, then exploitation
  • Returns optimal policy.
  • Refer to youtube here
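A tabular Q-learning sketch on a made-up 5-state corridor (start on the left, reward 1 for reaching the right end). The environment, the linear epsilon decay and all hyperparameters are assumptions for illustration; the update line is the usual Q-learning rule.

```python
import random

N_STATES = 5          # states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor: returns (new state, reward)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(episodes=400, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for ep in range(episodes):
        # Lots of exploration in the beginning, then mostly exploitation.
        epsilon = max(0.05, 1.0 - ep / (episodes / 2))
        state = 0
        for _ in range(200):                              # step cap per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            nxt, reward = step(state, action)
            # (state, action, new state, reward) -> move Q towards the target.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if state == N_STATES - 1:
                break
    return q


q = q_learning()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

The learned greedy policy is "go right" in every non-terminal state, which is the optimal policy for this corridor.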

RL IN DL

A review paper about RL in DL

INCREMENTAL LEARNING

(wiki) In computer science, incremental learning is a method of machine learning in which input data is continuously used to extend the existing model's knowledge i.e. to further train the model.

![alt_text](images/image118.png)

![alt_text](images/image119.png)

HOEFFDING TREE

*   **IS STATE OF THE ART**

DIMENSIONALITY REDUCTION METHODS

![alt_text](images/image120.png)

![alt_text](images/image121.png)

TSNE

  1. StatQuest - the gist of it is that we assume a t-distribution on distances and remove those that are farther, normalized for density. The t-distribution is used so that clusters are not clumped in the middle.

![alt_text](images/image122.png)

![alt_text](images/image123.png)

Iteratively moving from the left to the right

![alt_text](images/image124.png)

  1. TSNE algorithm
  2. Are there cases where PCA is more suitable than TSNE
  3. PCA preserving pairwise distances over tSNE? How, why, all here.
  4. Another piece of advice about using tsne and the possible misinterpretations

PCA

  1. Machine learning mastery:

    1. Expected value, variance, covariance
    2. PCA (remove the mean from A, calculate cov(A), calculate eig(cov), A*eigK = PCA)
    3. **EigenDecomposition - what is an eigenvector - simply put it's a vector that satisfies Av = lambda v, how to use eig() and how to confirm an eigenvector/eigenvalue and reconstruct the original A matrix.**
    4. SVD
    5. What is missing is how the EigenDecomposition is calculated.
  2. PCA on large matrices!

    1. Randomized svd
    2. Incremental svd
  3. PCA on Iris

  4. **(did not read) **What is PCA?

  5. **(did not read) **What is a covariance matrix?

  6. **(did not read) **Variance covariance matrix

  7. Visualization of the first PCA vectors, it is unclear what he is trying to show.

  8. A very nice introductory tutorial on how to use PCA

  9. ** An in-depth tutorial on PCA (paper)

  10. ** yet another tutorial paper on PCA (looks good)

  11. How to use PCA in Cross validation and for train\test split. (bottom line, do it on the train only.)

  12. Another tutorial paper - looks decent

  13. PCA whitening, Stanford tutorial (pca/zca whitening), Stackoverflow (really good).

    There are two things we are trying to accomplish with whitening:

    1. Make the features less correlated with one another.
    2. Give all of the features the same variance.

    Whitening has two simple steps:

    1. Project the dataset onto the eigenvectors. This rotates the dataset so that there is no correlation between the components.
    2. Normalize the dataset to have a variance of 1 for all components. This is done by simply dividing each component by the square root of its eigenvalue.
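A sketch of PCA whitening for 2-D data in plain Python, following the recipe above (remove the mean, covariance, eigenvectors, project, divide by the square root of the eigenvalue). To stay dependency-free it uses the closed-form rotation angle for a symmetric 2x2 matrix instead of a general eig() call; the function name and the toy data are assumptions for illustration.

```python
import math


def pca_whiten_2d(data):
    """Rotate 2-D data onto its principal axes and scale each axis to variance 1."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    centered = [(x - mx, y - my) for x, y in data]      # remove the mean
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Rotation angle that diagonalizes a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    v1, v2 = (c, s), (-s, c)                            # the two eigenvectors
    lam1 = c * c * sxx + 2 * c * s * sxy + s * s * syy  # eigenvalue of v1
    lam2 = s * s * sxx - 2 * c * s * sxy + c * c * syy  # eigenvalue of v2
    # Step 1: project onto the eigenvectors (removes the correlation).
    projected = [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1]) for x, y in centered]
    # Step 2: divide each component by the square root of its eigenvalue.
    return [(a / math.sqrt(lam1), b / math.sqrt(lam2)) for a, b in projected]


# Toy, strongly correlated 2-D data.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
whitened = pca_whiten_2d(data)
```

After whitening, both components have variance 1 and their covariance is 0 (up to floating-point rounding). This assumes the data is not degenerate, i.e. both eigenvalues are positive.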

SVD

  1. An explanation about SVD’s formulas.

KPCA

  1. First they say that Autoencoder is PCA based on their equation, i.e. minimize the reconstruction error formula.

![alt_text](images/image125.png)

  1. Then they say that PCA can't separate certain non-linear situations (circle within a circle), therefore they introduce kernel-based PCA (using the kernel trick, like SVM), which maps the space to another linearly separable space and performs PCA on it.

![alt_text](images/image126.png)

![alt_text](images/image127.png)

  1. Finally, they show how well KPCA works on noisy images, compared to PCA.

![alt_text](images/image128.png)

LDA - Linear discriminant analysis

A comparison / tutorial with code on pca vs lda - read!

A comprehensive tutorial on LDA - read!

Dim reduction with LDA - nice examples

(Not to be confused with the other LDA) - Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting (“curse of dimensionality”) and also reduce computational costs.

PCA vs LDA:

Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques used for dimensionality reduction.

  • PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset.
  • In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes.

Although it might sound intuitive that LDA is superior to PCA for a multi-class classification task where the class labels are known, this might not always be the case.

For example, comparisons between classification accuracies for image recognition after using PCA or LDA show that :

  • PCA tends to outperform LDA if the number of samples per class is relatively small (PCA vs. LDA, A.M. Martinez et al., 2001).
  • In practice, it is also not uncommon to use both LDA and PCA in combination:

Best Practice: PCA for dimensionality reduction can be followed by an LDA. But before we skip to the results of the respective linear transformations, let us quickly recapitulate the purposes of PCA and LDA: PCA finds the axes with maximum variance for the whole data set, whereas LDA tries to find the axes for best class separability. In practice, a PCA is often done first for dimensionality reduction, followed by an LDA.

![alt_text](images/image129.png)

**** To fully understand the details please follow the LDA link to the original and very informative article**

***** TODO: need some benchmarking for PCA\LDA\LSA\ETC..**

KDA - KERNEL DISCRIMINANT ANALYSIS

  1. pyDML package - has KDA - This package provides the classic algorithms of supervised distance metric learning, together with some of the newest proposals.

![alt_text](images/image130.png)

LSA

LSA is quite simple, you just use SVD to perform dimensionality reduction on the tf-idf vectors; that’s really all there is to it!

And LSA CLUSTERING

Here is a very nice tutorial about LSA, with code, explaining what the three matrices are, word clustering, sentence clustering and vector importance. They say that for sentence space we need to remove the first vector as it is correlated with sentence length.

How to interpret LSA vectors

PCA vs LSA: (intuition1, intuition2)

  • reduction of the dimensionality
  • noise reduction
  • incorporating relations between terms into the representation.
  • SVD and PCA and "total least-squares" (and several other names) are the same thing. It computes the orthogonal transform that decorrelates the variables and keeps the ones with the largest variance. There are two numerical approaches: one by SVD of the (centered) data matrix, and one by Eigen decomposition of this matrix "squared" (covariance).

LSA vs W2V

ICA

  1. PCA is global: it finds global variables (with images we get eigenfaces, good for reconstruction) that maximize variance in orthogonal directions, and it is not influenced by the TRANSPOSE of the data matrix.

  2. On the other hand, ICA is local and finds local variables (with images we get eyes, ears, mouth, basically edges!, etc). Unlike PCA, ICA gives different results on TRANSPOSED matrices; it's also “directional” - consider the “cocktail party” problem. On documents, ICA gives topics.

  3. Similarly to PCA, it helps us analyze our data.

    Sparse info on ICA with security returns.

MANIFOLD

  1. The best tutorial that explains manifold (high to low dim projection/mapping/visualization) (pca, sammon, isomap, tsne)
  2. Many manifold methods used to visualize high dimensional data.
  3. Comparing manifold methods

T-SNE

  1. Code and in-depth tutorial on TSNE, mapping probabilities to distributions
  2. A great example of using PCA and then TSNE to see clusters that aren't visible with PCA only.
  3. Misreading T-SNE, this is a very important read.
  4. Contrary to what it says on sklearn’s website, TSNE is not suited ONLY for visualization, you can also use it for data reduction
  5. “t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.”
  6. Comparing PCA and TSNE, then pushing PCA to TSNE and seeing what happens (as recommended in SKLEARN)
  7. TSNE + AUTOENCODER example

Sammons embedding mapping

  1. In tensorflow

IVIS

  1. Paper:
  2. Git, docs
  3. Ivis animate
  4. Ivis explain

GENETIC ALGORITHMS / PROGRAMMING

What is the difference? Genetic programming and genetic algorithms are very similar. They are both used to evolve the answer to a problem, by comparing the fitness of each candidate in a population of potential candidates over many generations.

Each generation, new candidates are found by randomly changing (mutation) or swapping parts (crossover) of other candidates. The least 'fit' candidates are removed from the population. - peterjwest
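As a rough illustration of the loop described above (fitness comparison, crossover, mutation, removal of the least fit), here is a toy genetic algorithm on the OneMax problem. The problem, the function name `evolve` and all parameter values are assumptions chosen for the sketch.

```python
import random

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    """Toy GA: evolve bit strings towards all ones (fitness = number of 1 bits)."""
    rng = random.Random(seed)
    fitness = sum
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []                                   # best fitness per generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        survivors = pop[: pop_size // 2]           # the least fit half is removed
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]              # crossover: swap parts of two parents
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]             # mutation: random bit flips
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness), history

best, history = evolve()
```

Because survivors are copied unchanged into the next generation, the best fitness never decreases from one generation to the next.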

alt_text

Genetic algorithms (GA) are search algorithms that mimic the process of natural evolution, where each individual is a candidate solution: individuals are generally "raw data" (in whatever encoding format has been defined).

Genetic programming (GP) is considered a special case of GA, where each individual is a computer program (not just "raw data"). GP explores the algorithmic search space and evolves computer programs to perform a defined task.

johnIdol

alt_text

LEARNING CLASSIFIER SYSTEMS

LCS

(WIKI) Learning classifier systems, or LCS, are a paradigm of rule-based machine learning methods that combine a discovery component (e.g. typically a genetic algorithm) with a learning component (performing either supervised learning, reinforcement learning, or unsupervised learning).

XCS

XCS is a type of Learning Classifier System (LCS), a machine learning algorithm that utilizes a genetic algorithm acting on a rule-based system, to solve a reinforcement learning problem.

Scikit-xcs

alt_text

Tutorial

Recommender Algorithms

  1. Collaborative filtering, SVD
  2. Spotlight, item2vec, Neural nets for Recommender systems
  3. A general tutorial, has a nice intro
  4. Medium on Movies
    1. Part 1: matrix factorization in movies, users vs. movies.
    2. Part 2: collaborative filtering using open ai
    3. Part 3: collaborative filtering with neural nets
  5. Medium series on collaborative filtering and embeddings: part 1, part 2
  6. Movie recommender systems on Kaggle, on git
  7. Matrix factorization
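A minimal sketch of the matrix factorization idea behind several of the links above, assuming a tiny made-up users x movies rating matrix where 0 means "unrated". It uses a plain truncated SVD on mean-centered ratings; real recommenders usually fit only the observed entries, so treat this as an illustration rather than the method of any linked article.

```python
import numpy as np

# Toy ratings (assumption): rows = users, columns = movies, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def low_rank_scores(R, k):
    """Predict scores from a rank-k SVD of the user-mean-centered rating matrix."""
    mask = R > 0
    user_mean = R.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)  # mean of observed ratings
    centered = np.where(mask, R - user_mean[:, None], 0.0)       # unrated cells become 0
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]                         # rank-k reconstruction
    return approx + user_mean[:, None]

scores = low_rank_scores(ratings, k=2)
```

The entries of `scores` at the unrated positions are the predictions, and `k` controls how many latent "taste" factors are kept.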

TEMPLATIZATION

alt_text

alt_text

  • 3 things we learned about applying word vectors to logs
    • GloVe consistently identified approximately 50 percent or more of the seeded events in the synthetic data as either exact or as valid sub-sequence matches. GloVe tended to nominate a limited number of template sequences that weren’t related to seeded events and many of those were tied to high frequency templates. When we tested GloVe against a generated data set with multiple SSH sessions in an auditd file, GloVe correctly proposed a single event that included all of the auditd record types defined in the SSH user login lifecycle.
    • GloVe produces sub-sequences that need to be stitched together to create a match

alt_text

    • GloVe is faster than PARIS and FP-Growth
    • Their clustering method misclassified
  • Towards an NLP based log template generation algorithm for system log analysis - CRF for templatization, i.e. ner style. “we can see that the more learning data given, the more accurately CRF produces log templates. Clearly a sufficient number of train data enables us to analyze less frequently appearing log templates. Therefore, it is reasonable that a log template can be analyzed correctly if train data include some of similar templates. However, in terms of log template accuracy, CRF requires 10000 train data to achieve same accuracy as Vaarandi’s algorithm”

alt_text

alt_text

alt_text

Logpai
  1. Logpai
  2. Loghub datasets
  3. Logpai loglizer: An Evaluation Study on Log Parsing and Its Use in Log Mining, git
  4. log3C - paper. Log3C is a general framework that identifies service system problems from system logs. It utilizes both system logs and system KPI metrics to promptly and precisely identify impactful system problems. Log3C consists of four steps: **Log parsing, Sequence vectorization, Cascading Clustering and Correlation analysis**. This is a joint work by CUHK and Microsoft Research. The repository contains the source code of Log3C, including data loading, sequence vectorization, cascading clustering, data saving, etc. The core part is the cascading clustering algorithm, which groups a large number of sequence vectors into clusters by iteratively sampling, clustering, matching.

alt_text

alt_text

alt_text

Selection of KPI: In our experiments, we use failure rate as the KPI for problem identification. Failure rate is an important KPI for evaluating system service availability. There are also other KPIs such as mean time between failures, average request latency, throughput, etc. In our future work, we will experiment with problem identification concerning different KPI metrics.

Noises in labeling: Our experiments are based on three datasets that are collected as a period of logs on three different days. The engineers manually inspected and labeled the log sequences. Noise (false positives/negatives) may be introduced during the manual labeling process. However, as the engineers are experienced professionals of the product team who maintain the service system, we believe the amount of noise is small (if it exists).

Furthermore, we compare our method with two typical methods: PCA [41] and Invariants Mining [23]. All these three methods are unsupervised, log-based problem identification methods. PCA projects the log sequence vectors into a subspace. If the projected vector is far from the majority, it is considered as a problem. Invariants Mining extracts the linear relations (invariants) between log event occurrences, which hypothesizes that log events are often pairwise generated. For example, when processing files, "File A is opened" and "File A is closed" should be printed as a pair. Log sequences that violate the invariants are regarded as problematic. Log3C achieves good recalls (similar to those achieved by two comparative methods) and surpasses the comparative methods concerning precision and F1-measure. 
  1. Logzip paper (personal note: seems to be offline) - Logzip is an efficient compression tool specific for log files. It compresses log files by utilizing the inherent structures of raw log messages, and thereby achieves a high compression ratio. **The results show that logzip can save about half of the storage space on average over traditional compression tools. Meanwhile, the design of logzip is highly parallel and only incurs negligible overhead. In addition, we share our industrial experience of applying logzip to Huawei's real products.**
  2. Logadvisor - paper1, 2 - Our goal, referred to as “learning to log”, is to automatically learn the common logging practice as a machine learning model, and then leverage the model to guide developers to make logging decisions during new development.
    1. Labels: logging method (e.g., Console.Writeline())
    2. Features: we need to extract useful features (e.g., exception type) from the collected code snippets for making logging decisions,
    3. Train / suggest

alt_text

alt_text

  1. **Logging descriptions** - This repository maintains a set of <code, log> pairs extracted from popular open-source projects, which are amenable to logging description generation research.
  2. (REALLY GOOD) Loglizer paper git demo- Loglizer is a machine learning-based log analysis toolkit for automated anomaly detection.

alt_text

  • Feature extraction using fixed window, sliding window and session window
    • Fixed window: Both fixed windows and sliding windows are based on timestamp, which records the occurrence time of each log. Each fixed window has its size, which means the time span or time duration. As shown in Figure 1, the window size is Δt, which is a constant value, such as one hour or one day. Thus, the number of fixed windows depends on the predefined window size. Logs that happened in the same window are regarded as a log sequence.
    • Sliding window: Different from fixed windows, sliding windows consist of two attributes: window size and step size, e.g., hourly windows sliding every five minutes. In general, step size is smaller than window size, therefore causing the overlap of different windows. Figure 1 shows that the window size is ΔT , while the step size is the forwarding distance. The number of sliding windows, which is often larger than fixed windows, mainly depends on both window size and step size. Logs that occurred in the same sliding window are also grouped as a log sequence, though logs may duplicate in multiple sliding windows due to the overlap.
    • Session window: Compared with the above two windowing types, session windows are based on identifiers instead of the timestamp. Identifiers are utilized to mark different execution paths in some log data. For instance, HDFS logs with block_id record the allocation, writing, replication, deletion of certain block. Thus, we can group logs according to the identifiers, where each session window has a unique identifier
  • Many supervised methods and, most importantly, a cool unsupervised method: PCA for anomaly detection, based on the length of the sample vector projected onto the anomaly subspace, which is obtained by splitting the principal components into the first k and the remaining ones:
  • PCA was first applied in log-based anomaly detection by Xu et al. [47]. In their anomaly detection method, each log sequence is vectorized as an event count vector. After that, PCA is employed to find patterns between the dimensions of event count vectors. Employing PCA, two subspaces are generated, namely normal space Sn and anomaly space Sa. Sn is constructed by the first k principal components and Sa is constructed by the remaining (n−k), where n is the original dimension. Then, the projection ya = (I − PPᵀ)y of an event count vector y to Sa is calculated, where P = [v1, v2, ..., vk] is the first k principal components. If the length of ya is larger
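The session-window grouping and the PCA projection onto the anomaly space described above can be sketched together as follows. The toy logs, the helper names and the choice k = 2 are assumptions made for illustration. The quoted passage breaks off before giving its decision rule, so no threshold is applied here; only the length of ya is computed.

```python
import numpy as np

def event_count_matrix(logs, n_events):
    """Session windows: one event count vector per session identifier."""
    sessions = sorted({sid for sid, _ in logs})
    index = {sid: i for i, sid in enumerate(sessions)}
    Y = np.zeros((len(sessions), n_events))
    for sid, event in logs:
        Y[index[sid], event] += 1
    return sessions, Y

def fit_normal_subspace(Y, k):
    """Return the mean and P, whose columns are the first k principal components."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    return mean, Vt[:k].T

def anomaly_length(y, mean, P):
    """Length of y_a = (I - P P^T) y for a mean-centered event count vector y."""
    centered = y - mean
    return float(np.linalg.norm(centered - P @ (P.T @ centered)))

# Toy logs (assumption): event 0 = "opened", event 1 = "closed" always come in
# pairs within a session, event 2 is unrelated.
rng = np.random.default_rng(0)
logs = []
for sid in range(100):
    pairs, other = int(rng.integers(1, 10)), int(rng.integers(0, 5))
    logs += [(sid, 0)] * pairs + [(sid, 1)] * pairs + [(sid, 2)] * other

_, Y_train = event_count_matrix(logs, n_events=3)
mean, P = fit_normal_subspace(Y_train, k=2)

normal_score = anomaly_length(np.array([4.0, 4.0, 1.0]), mean, P)  # paired open/close
broken_score = anomaly_length(np.array([5.0, 0.0, 2.0]), mean, P)  # opened, never closed
```

In this toy setup the paired session has a length of essentially zero, while the session that breaks the open/close pattern gets a clearly positive length.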
  1. LogParser - a benchmark for log parsers using 13 models on 16 datasets
    Important insights:
    1. Drain is the fastest and the best performing on most datasets (9/16)
    2. Parameters should be tuned per dataset, which is what makes Drain the best performing
    3. More demanding metrics.
    4. Papers:

alt_text

Metrics

Unsupervised metrics:

Using silhouette score to measure clusters, with a distance metric of your choice (mean w2v, edit distance, mean edit distance, token matching score etc..)
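A small sketch of the idea above: score a clustering of log messages with the silhouette score over a custom distance, here a token-level edit distance passed to scikit-learn as a precomputed matrix. The messages and the cluster labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical log messages and the template ids some parser assigned to them.
messages = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.7 closed",
    "Connection from 10.0.0.9 closed",
    "Failed password for root",
    "Failed password for admin",
    "Failed password for guest",
]
labels = [0, 0, 0, 1, 1, 1]

tokens = [m.split() for m in messages]
D = np.array([[edit_distance(a, b) for b in tokens] for a in tokens], dtype=float)

# metric="precomputed" tells scikit-learn that D already holds pairwise distances.
score = silhouette_score(D, labels, metric="precomputed")
```

Any of the other distances mentioned (mean w2v, token matching score) can be swapped in by changing how `D` is built.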

Drain paper:

alt_text

  1. Accuracy measures the ability of a log parser in distinguishing constant parts and variable parts. Accuracy is one main focus of existing log parsing studies because an inaccurate log parser could greatly limit the effectiveness of the downstream log mining tasks [9].

    **Supervised Accuracy Metric: PA metric is the ratio of correctly parsed log messages over the total number of log messages:**

  • For fairness of comparison, we apply the same preprocessing rules (e.g., IP or number replacement) to each log parser.

  • After parsing, each log message has an event template, which in turn corresponds to a group of messages of the same template.

  • A log message is considered correctly parsed if and only if its event template corresponds to the same group of log messages as the ground truth does. For example, if a log sequence [E1, E2, E2] is parsed to [E1, E4, E5], we get PA=1/3, since the 2nd and 3rd messages are not grouped together.

  • In contrast to standard evaluation metrics that are used in previous studies, such as precision, recall, and F1-measure [9], [22], [28], PA is a more rigorous metric. In PA, partially matched events are considered incorrect.

  • The parameters of all the log parsers are fine-tuned through over 10 runs and the best results are reported to avoid bias from randomization.

  2. Robustness of a log parser measures the consistency of its accuracy under log datasets of different sizes or from different systems. A robust log parser should perform consistently across different datasets, and thus can be used in the versatile production environment.

    Robustness is crucial to the practical use of a log parser in production environments. In this part, we evaluate the robustness of log parsers from two aspects: 1) robustness across different types of logs and 2) robustness on different volumes of logs.

  3. Efficiency measures the processing speed of a log parser. We evaluate the efficiency by recording the time that a parser takes to parse a specific dataset. The less time a log parser consumes, the higher efficiency it provides.

    Efficiency is an important aspect of log parsers to consider in order to handle log data in large scale. To measure the efficiency of a log parser, we record the running time it needs to finish the entire parsing process. Similar to the setting of the previous experiment, we evaluate six log parsers on three log datasets.
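The PA (parsing accuracy) definition quoted under Accuracy can be sketched in a few lines. The function name `parsing_accuracy` is made up, and the example reproduces the [E1, E2, E2] vs. [E1, E4, E5] case from the text.

```python
from collections import defaultdict

def parsing_accuracy(truth, parsed):
    """Fraction of messages whose parsed group equals their ground-truth group."""
    def groups(labels):
        g = defaultdict(set)
        for i, label in enumerate(labels):
            g[label].add(i)                 # template id -> set of message indices
        return g

    truth_groups, parsed_groups = groups(truth), groups(parsed)
    correct = sum(truth_groups[t] == parsed_groups[p] for t, p in zip(truth, parsed))
    return correct / len(truth)

pa = parsing_accuracy(["E1", "E2", "E2"], ["E1", "E4", "E5"])   # 1/3, as in the text
```

Template names do not have to match between the parser and the ground truth, only the grouping of messages does, which is why a partially matched group counts as fully wrong.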

[9] - An Evaluation Study on Log Parsing and Its Use in Log Mining

... a commonly used evaluation metric for clustering algorithms, to evaluate the parsing accuracy of log parsing methods. To calculate F-measure, we manually obtain the ground truths for all logs of these dataset. It is possible because we iteratively filter out logs with confirmed event using regular expression. Experiments about LKE and LogSig are run 10 times to avoid bias of clustering algorithms, while others are run once because they are deterministic. We note here that only the parts of free-text log message contents are used in evaluating the log parsing methods

[22] - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.7668&rep=rep1&type=pdf

... These cluster descriptions then became our gold standard, against which to measure the performance of the other algorithms as an information retrieval (IR) task. As in classic IR, our performance metrics are Recall, Precision and F-Measure, which are described in [17]. The True Positive(TP), False Positive(FP) and False Negative(FN) values were derived by comparing the set of manually produced line formats to the set of retrieved formats produced by each algorithm. In our evaluation a line format is still considered a FP even if matches a manually produced line format to some degree, the match has to be exact for it to be considered a TP. The next section gives more details about the results of our experiments.

[28] -A search-based approach for accurate identification of log message

alt_text

In [22]: [17] - Wikipedia.org. Precision and Recall - Wikipedia, the free encyclopedia. Published to the web, http://en.wikipedia.org/wiki/Precision and Recall. Last checked April 23, 2009.

In [22]: ref [30] & [31] - evaluation of clustering by stanford

This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors.

High purity is easy to achieve when the number of clusters is large - in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.

MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.

Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters.

The Rand index gives equal weight to false positives and false negatives. Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster.

alt_text


We can use the F measure to penalize false negatives more strongly than false positives by selecting a value β > 1, thus giving more weight to recall.
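The external criteria discussed above (purity, the Rand index and the F measure with a weight β) can be computed by counting pairs of items. This is a sketch on a made-up labelling of 17 items in three clusters. The function names are illustrative and NMI is left out for brevity.

```python
from collections import Counter
from itertools import combinations

def pair_counts(classes, clusters):
    """Count pairs: TP = same class & same cluster, FP = same cluster only, etc."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_beta(tp, fp, fn, beta=1.0):
    """beta > 1 weights recall (false negatives) more than precision."""
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def purity(classes, clusters):
    by_cluster = {}
    for c, k in zip(classes, clusters):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(classes)

# Toy labelling (assumption): true classes x / o / d spread over 3 clusters.
classes = list("xxxxxo" "xooood" "xxddd")
clusters = [1] * 6 + [2] * 6 + [3] * 5

tp, fp, fn, tn = pair_counts(classes, clusters)
ri = rand_index(tp, fp, fn, tn)
f1 = f_beta(tp, fp, fn, beta=1)
f5 = f_beta(tp, fp, fn, beta=5)
pur = purity(classes, clusters)
```

Here recall is lower than precision, so raising β from 1 to 5 lowers the F score: the false negatives are penalized more strongly.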

alt_text

TIME SERIES

  1. Random walk - what is?

alt_text

  1. Time series decomposition book - stl x11 seats
  2. Mastery on ts decomposition
  3. SKtime - is a sk-based api, medium, integrates algos from tsfresh and tslearn
  4. (really good) A LightGBM Autoregressor — Using Sktime, explains about the basics in time series prediction, splitting, next step, delayed step, multi step, deseason.
  5. SKtime-DL - using keras and DL
  6. TSFresh - extracts 1200 features, filters them using FDR for time series classification etc
  7. TSlearn - DTW, shapes, shapelets (keras layer), time series kmeans/clustering/svm/svr/KNN/bary centers/PAA/SAX

alt_text

A great introduction into time series - **“The approach is to come up with a list of features that captures the temporal aspects so that the auto correlation information is not lost.”** It basically tells us to take sequence features and create (auto)correlated new variables using a time window, i.e., **“Time series forecasts as regression that factor in autocorrelation as well.”** We can transform raw features into other types of features that explain the relationship in time between features. We measure success using loss functions: MAE, RMSE, MAPE, RMSEP, AC-ERROR-RATE.

Interesting idea on how to define ‘time series’ dummy variables that utilize beginning\end of certain holiday events, including important information on what NOT to filter even if it seems insignificant, such as zero sales that may indicate some relationship to many sales the following day.

alt_text

Time series patterns:

  • A **trend (a,b,c)** exists when there is a long-term increase or decrease in the data.
  • A **seasonal (a - big waves)** pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week, e.g., the monthly sales induced by the change in cost at the end of the calendar year.
  • A **cycle (a)** occurs when the data exhibit rises and falls that are not of a fixed period - sometimes years.

Some statistical measures (mean, median, percentiles, iqr, std dev, bivariate statistics - correlation between variables)

Bivariate formula: the correlation coefficient r measures the extent of a linear relationship between two variables; a high absolute value means a strong linear relationship between the two variables. The value of r always lies between -1 and 1, with negative values indicating a negative relationship and positive values indicating a positive relationship. Negative = decreasing, positive = increasing.
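For reference, the standard sample correlation coefficient can be written as below. The symbols x_t, y_t for the paired observations and the bars for their means are notation chosen here, not taken from the source.

```latex
r = \frac{\sum_{t} (x_t - \bar{x})(y_t - \bar{y})}
         {\sqrt{\sum_{t} (x_t - \bar{x})^2}\;\sqrt{\sum_{t} (y_t - \bar{y})^2}}
```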

alt_text

But correlation can LIE: all of the following graphs have a correlation of about 0.8:

alt_text

Autocorrelation measures the linear relationship between lagged values of a time series.

L8 is correlated, and has a high measure of 0.83

  • White noise is expected to have an autocorrelation close to 0 at every lag.
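A minimal sketch of the sample autocorrelation at a given lag, assuming NumPy and two synthetic series (a noisy period-8 wave and white noise). The function name `autocorr` and the data are illustrative only.

```python
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation at a positive lag: lagged covariance / variance."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d * d))

rng = np.random.default_rng(0)
t = np.arange(200)
seasonal = np.sin(2 * np.pi * t / 8) + 0.1 * rng.normal(size=t.size)  # period-8 pattern
noise = rng.normal(size=500)                                          # white noise

r8_seasonal = autocorr(seasonal, 8)   # one full period apart: strongly positive
r4_seasonal = autocorr(seasonal, 4)   # half a period apart: strongly negative
r1_noise = autocorr(noise, 1)         # white noise: close to zero
```

Plotting `autocorr` over a range of lags gives the usual ACF plot, where seasonal lags stand out as peaks.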

alt_text

  • Average: Forecasts of all future values are equal to the mean of the historical data.
  • Naive: Forecasts are simply set to be the value of the last observation.
  • Seasonal Naive: forecast to be equal to the last observed value from the same season of the year
  • **Drift:** A variation on the naïve method is to allow the forecasts to increase or decrease over time; the drift is set to be the average change seen in the historical data.
  • Log
  • Box cox
  • Back transform
  • Calendrical adjustments
  • Inflation adjustment

Transforming time series data to tabular (in order to use tabular based approach)

SPLITTING TIME SERIES DATA

  1. SK-lego With a gap
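A sketch of splitting with a gap between train and test. Note that it swaps in scikit-learn's `TimeSeriesSplit` (which has a `gap` parameter) rather than the scikit-lego splitter linked above; the sizes used are arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # 20 time-ordered samples (toy data)

# gap=2 leaves two samples unused between the end of train and the start of test,
# which helps when features leak information from neighbouring time steps.
splitter = TimeSeriesSplit(n_splits=3, test_size=4, gap=2)
folds = [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]
```

Each fold trains only on the past, and the training window grows from one fold to the next.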

alt_text

  • Dummy variables: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday. NO SUNDAY!
  • Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is specified when the dummy variables are all set to zero. Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the "dummy variable trap" because it will cause the regression to fail.
  • Outliers: If there is an outlier in the data, rather than omit it, you can use a dummy variable to remove its effect. In this case, the dummy variable takes value one for that observation and zero everywhere else.
  • **Public holidays:** For daily data, the effect of public holidays can be accounted for by including a dummy variable predictor taking value one on public holidays and zero elsewhere.
  • Easter: Easter is different from most holidays because it is not held on the same date each year and the effect can last for several days. In this case, a dummy variable can be used with value one where any part of the holiday falls in the particular time period and zero otherwise.
  • **Trading days:** The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor. An alternative that allows for the effects of different days of the week has the following predictors: # Mondays in month; # Tuesdays in month; ...; # Sundays in month.
  • Advertising: $ advertising for previous month; $ advertising for two months previously
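The day-of-week dummy coding described at the top of the list above, with six dummies and Sunday as the all-zeros baseline, could look like this in pandas. The date range is a made-up example.

```python
import pandas as pd

days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
dates = pd.date_range("2024-01-01", periods=14, freq="D")   # toy range, starts on a Monday

# An ordered categorical with Sunday first, so drop_first removes exactly Sunday.
day_name = pd.Series(pd.Categorical(dates.day_name(), categories=days), index=dates)

# Six columns for seven categories: a Sunday row is all zeros (avoids the dummy trap).
dummies = pd.get_dummies(day_name, drop_first=True)
```

The outlier and holiday dummies work the same way: a single column that is one on the affected dates and zero elsewhere.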

“compute parameter estimates over a rolling window of a fixed size through the sample. If the parameters are truly constant over the entire sample, then the estimates over the rolling windows should not be too different. If the parameters change at some point during the sample, then the rolling estimates should capture this instability”

Estimate the trend-cycle

  • Moving-average window of 3, 5, 7 or 9? If it is too large it will flatten the curve; if it is too small the result will be similar to the actual curve.
  • Two-tier moving average: first a 4-point moving average, then a 2-point moving average on the result.
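The two-tier moving average from the last bullet (a 4-point average followed by a 2-point average of the result) can be sketched with NumPy. The helper names are illustrative.

```python
import numpy as np

def moving_average(y, m):
    """Plain m-point moving average; output is shorter than the input by m - 1."""
    return np.convolve(y, np.ones(m) / m, mode="valid")

def two_by_four_ma(y):
    """4-point moving average followed by a 2-point moving average."""
    return moving_average(moving_average(y, 4), 2)

y = np.arange(10, dtype=float)
smoothed = two_by_four_ma(y)

# The two passes are equivalent to one centered, weighted 5-point average.
weights = np.array([1, 2, 2, 2, 1]) / 8
equivalent = np.convolve(y, weights, mode="valid")
```

Averaging twice is what makes an even-order window symmetric around each time point.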

Visual example of ARIMA algorithm - captures the time series trend or forecast.

Decomposition

  1. Creating curves to explain a complex seasonal fit.

alt_text

alt_text

Weighted “window”

  1. scikit-lego with a decay estimator

alt_text

  1. Level. The baseline value for the series if it were a straight line.
  2. Trend. The optional and often linear increasing or decreasing behavior of the series over time.
  3. Seasonality. The optional repeating patterns or cycles of behavior over time.
  4. Noise. The optional variability in the observations that cannot be explained by the model.

All time series have a level, most have noise, and the trend and seasonality are optional.

One-step forecast using a window of “1” and a typical sample **“time, measure1, measure2”**:

  • Linear/nonlinear classifiers: predict a single output value using the previous (t-1) row, i.e., **“measure1 t, measure2 t, measure1 t+1, measure2 t+1 (as the class)”**
  • Neural networks: predict multiple output values, i.e., **“measure1 t, measure2 t, measure1 t+1 (class1), measure2 t+1 (class2)”**

**One-Step Forecast:** This is where the next time step (t+1) is predicted.

**Multi-Step Forecast:** This is where two or more future time steps are to be predicted.

Multi-step forecast using a window of “1” and a typical sample “time, measure1”, i.e., the current value is the input and the next two values are its labels:

  • **“measure1 t, measure1 t+1 (class1), measure1 t+2 (class2)”**
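The one-step and multi-step framings above amount to sliding a window over the series to build input/label rows. This is a small sketch with a made-up helper `make_supervised` and a toy series.

```python
import numpy as np

def make_supervised(series, n_in=1, n_out=1):
    """Return (X, y): each X row holds n_in past values, each y row the next n_out."""
    series = np.asarray(series)
    n = len(series) - n_in - n_out + 1
    X = np.array([series[i : i + n_in] for i in range(n)])
    y = np.array([series[i + n_in : i + n_in + n_out] for i in range(n)])
    return X, y

series = [10, 20, 30, 40, 50]

X1, y1 = make_supervised(series, n_in=1, n_out=1)   # one-step: t -> t+1
X2, y2 = make_supervised(series, n_in=1, n_out=2)   # multi-step: t -> (t+1, t+2)
```

Once the series is in this tabular form, any regressor or classifier can be trained on `X` and `y`.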

This article explains ML methods for sequential supervised learning. The following methods have been applied to solve sequential supervised learning problems:

  1. sliding-window methods - converts a sequential supervised problem into a classical supervised problem
  2. recurrent sliding windows
  3. hidden Markov models
  4. maximum entropy Markov models
  5. input-output Markov models
  6. conditional random fields
  7. graph transformer networks

STATIONARY TIME SERIES

What is it? A stationary time series has no trend or seasonality; in other words, a non-stationary series has a trend or seasonality.

There are ways to remove the trend and seasonality, e.g., taking the difference between time points.

  1. First difference: y(t+1) - y(t)
  2. A bigger lag to support seasonal changes
  3. pandas Series.diff() / DataFrame.diff()
  4. Plot a histogram, plot log(X) as well.
  5. Test for the unit root null hypothesis, i.e., use the Augmented Dickey-Fuller test to determine whether a series is stationary or non-stationary (seasonal/trend)
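A sketch of the differencing steps above with pandas, on a synthetic series (linear trend plus a period-12 season). The series and variable names are assumptions.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12))   # trend + yearly-style season

first_diff = series.diff().dropna()        # y(t) - y(t-1): removes the linear trend
seasonal_diff = series.diff(12).dropna()   # y(t) - y(t-12): removes the period-12 season
```

In this toy case the seasonal difference is constant (12 steps of trend, 0.5 each), so nothing is left to model. For the unit root test in the last step, `statsmodels.tsa.stattools.adfuller` is one commonly used implementation.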

Shay on stationary time series, AR, ARMA

(amazing) STL and more.

SHORT TIME SERIES

  1. Short time series
  2. Min sample size for short seasonal time series
  3. More mastery on short time series.
    1. Autoregression (AR)
    2. Moving Average (MA)
    3. Autoregressive Moving Average (ARMA)
    4. Autoregressive Integrated Moving Average (ARIMA)
    5. Seasonal Autoregressive Integrated Moving-Average (SARIMA)
    6. Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
    7. Vector Autoregression (VAR)
    8. Vector Autoregression Moving-Average (VARMA)
    9. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    10. Simple Exponential Smoothing (SES)
    11. Holt Winter’s Exponential Smoothing (HWES)

Predicting actual Values of time series using observations

  1. Using kalman filters - explains the concept etc, 1 out of 55 videos.

There are three types of gates within a unit:

  • Forget Gate: conditionally decides what information to throw away from the block.
  • Input Gate: conditionally decides which values from the input to update the memory state.
  • Output Gate: conditionally decides what to output based on input and the memory of the block.

Using lstm to predict sun spots, has some autocorrelation usage

Dynamic Time Warping (DTW)

how to compute a better distance for two time series.

  1. Youtube - explains everything
  2. Python code with a good tutorial.
  3. Another function for dtw distance in python
  4. Medium, mentions prunedDTW, sparseDTW and fastDTW
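For orientation, the basic dynamic-programming form of DTW, which the pruned, sparse and fast variants above speed up, can be sketched as follows. It assumes 1-D series and absolute difference as the local cost.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # a stretched copy aligns perfectly
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Unlike a point-by-point distance, the two series may have different lengths, and a time-stretched copy of a series gets distance zero.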

alt_text

CLUSTERING TS

  1. Clustering time series, subsequences with a rolling window, the pitfall.
  2. Clustering using tslearn

ANOMALY DETECTION TS

  1. What is stationary (process), stationary time series analysis (shay palachi),
  2. mastery on arimas
  3. TS anomaly algos (stl, trees, arima)
  4. AD techniques, part 2, part 3
  5. Z-score, modified z-score and IQR: an intro to why the z-score is not robust
  6. Adtk a sklearn-like toolkit with an amazing intro, various algorithms for non seasonal and seasonal, transformers, ensembles.
  7. Awesome TS anomaly detection on github
  8. Transfer learning toolkit, paper and benchmarks
  9. Ransac is a good baseline - random sample consensus for outlier detection
    1. Ransac, 2, 3, 4, 5, 6
    2. You can feed ransac with tsfresh/tslearn features.
  10. Anomaly detection for time series
  11. STL:
    1. AD where anomalies coincide with seasonal peaks!!
    2. AD challenges, stationary, seasonality, trend
    3. Rt anomaly detection for time series pinterest using stl decomposition
    4. AD
  12. Sliding windows
    1. Solving sliding window problems
    2. Rolling window regression
  13. Forecasting using Arima 1, 2
  14. Auto arima 1, 2, 3
  15. Twitters ESD test for outliers, using z-score and t test
    1. Another esd test inside here
  16. Minimal sample size for seasonal forecasting
  17. Golden signals, youtube
  18. Graph-based Anomaly Detection and Description: A Survey
  19. Time2vec, paper (for deep learning, as a layer)
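To make the z-score / modified z-score / IQR item above concrete, here is a small sketch using only the standard library. The data is invented, and the cut-offs (|z| > 3, |modified z| > 3.5, 1.5 x IQR fences) are common conventions assumed here, not values taken from the linked posts.

```python
import statistics

def z_scores(xs):
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def modified_z_scores(xs):
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)   # median absolute deviation
    return [0.6745 * (x - med) / mad for x in xs]       # assumes mad > 0

def iqr_outliers(xs, k=1.5):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [x for x in xs if x < lo or x > hi]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]   # one obvious outlier
z_flagged = [x for x, z in zip(data, z_scores(data)) if abs(z) > 3]
mz_flagged = [x for x, z in zip(data, modified_z_scores(data)) if abs(z) > 3.5]
iqr_flagged = iqr_outliers(data)
```

On this sample the plain z-score misses the outlier, because the outlier itself inflates the mean and standard deviation, while the median-based score and the IQR fences both flag it.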

Digital Signal Processing (DSP)

  1. (Out of place) - using self-attention for sound signal processing
  2. Scipy signal processing
  3. Script find peaks
  4. Beat detection
    1. Real time bpm beat det
    2. Librosa: Beat detection (and tempo)
  5. Mastery on Human activity recognition, smartphones

Graph Theory

  1. General purpose and community detection GIT karate club bene
  2. Connectivity
  3. Min-cut: 1, 2, 3, 4, 5, 6
  4. Louvain community
  5. Girvan-Newman gist, this worked, this is potentially good too, another, another, another
  6. Node2vec, paper, medium1, medium 2 - tutorial - code, git code, original py2 code, taboola code for their medium paper
  7. Evaluation metrics for community detection
  8. Review for community detection algorithms - paper
  9. Term: community structure
  10. Term: modularity of networks
  11. Unread paper
  12. Unread comparison of community detection algos
  13. Clustering adjacency matrices
  14. Spectral-clustering (is this supposed to be here?)
  15. Finding natural groups in undirected graphs
  16. Awesome community detection on github
  17. Various algorithms

5. Centrality algorithms

1. [5.1. The PageRank algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/)
2. [5.2. The Betweenness Centrality algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/betweenness-centrality/)
3. [5.3. The Closeness Centrality algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/closeness-centrality/)
4. [5.4. The Degree Centrality algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/degree-centrality/)

6. Community detection algorithms

1. [6.1. The Louvain algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/louvain/)
2. [6.2. The Label Propagation algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/label-propagation/)
3. [6.3. The Connected Components algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/connected-components/)

7. Experimental algorithms

1. [7.1. Procedures](https://neo4j.com/docs/graph-algorithms/current/experimental-procedures/)
2. [7.2. Centrality algorithms](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/centrality/)
3. [7.3. Community detection algorithms](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/community/)
4. [7.4. Path finding algorithms](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/pathfinding/)
5. [7.5. Similarity algorithms](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/similarity/)
6. [7.6. Link Prediction algorithms](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/linkprediction/)
7. [7.7. Preprocessing functions and](https://neo4j.com/docs/graph-algorithms/current/experimental-algorithms/preprocessing/)

SOCIAL NETWORK ANALYSIS

  1. Wiki

  2. Paper: algorithmic approach to social networks

  3. Steve borgatti

  4. Intro to SNA

    1. Centrality
    2. Betweenness centrality
    3. Network centralization
    4. Network reach
    5. Network integration
    6. Boundary spanners
    7. Peripheral players
  5. Social Network Analysis: Can Quantity Compensate for Quality?

    Nicholas Christakis of Harvard and James Fowler of UC San Diego have produced a series of ground-breaking papers analyzing the spread of various traits in social networks: obesity, smoking, happiness, and most recently, in collaboration with John Cacioppo, loneliness. The Christakis-Fowler collaboration has now become well-known, but from a technical perspective, what was special about their work?

    It turns out that they found a way to distinguish between the three reasons why people who are related in a social network are similar to each other.

    • Homophily is the tendency of people to seek others who are alike. For example, most of us restrict our dates to smokers or non-smokers, mirroring our own behavior.

    • Confounding is the phenomenon of related individuals developing a trait because of a (shared) environmental circumstance. For example, people living right next to a McDonald’s might all gradually become obese.

    • Induction is the process of one individual passing a trait or behavior on to their friends, whether by active encouragement or by setting an example.

  6. Networkx - Centrality is just a fraction of the algorithms contained in networkx.

![alt_text](images/image163.png)

(DEEP) NEURAL NETS

Jay Alammar on NN: [Part 1](http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/), [Part 2](http://jalammar.github.io/feedforward-neural-networks-visual-interactive/)
  • NN in general - 5 introductory tutorials.

  • Segmentation examples

    MLP: fully connected, with input, hidden and output layers. The gradient in backprop takes a lot of time to calculate. It has the vanishing gradient problem: because of the repeated multiplications, by the time the loss correction reaches the first layers it is very small (0.1 × 0.1 × 0.1 = 0.001). Therefore the early layers train slower than the last ones, even though the early ones capture the basic structures and so are the more important ones.
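The multiplication effect can be sketched in a few lines. This assumes sigmoid activations, one unit per layer and a weight of 1.0, which is a deliberate simplification: each layer contributes a factor `weight * sigmoid'(z)` to the backpropagated gradient, and that factor is at most 0.25.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z == 0

def gradient_scale(num_layers, weight=1.0, z=0.0):
    """Product of the per-layer factors picked up going back through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_grad(z)
    return scale

shallow = gradient_scale(3)    # 0.25 ** 3
deep = gradient_scale(10)      # 0.25 ** 10, under one millionth
```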

    AutoEncoder - unsupervised. Drives the input through fully connected layers, sometimes reducing the number of neurons, then does the reverse and expands the layers’ size to get back to the input (images are multiplied by the transpose matrix, many times over). The predicted output is compared to the input and the cost is corrected using gradient descent, repeating until the network learns to reproduce its input.

  • Convolutional auto encoder

  • Denoiser auto encoder - masking areas in order to create an encoder that understands noisy images

  • Variational autoencoder - doesn't rely on the distance between pixels; rather, it maps them to a function (Gaussian), and eventually the dataset should be explained by this mapping. Uses 2 new layers added to the network. The Gaussian will create blurry, but similar, images. Please note that it also works with CNN.

    What are logits in neural net - the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
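A small sketch of the logits-to-probabilities step described above. The three logit values are arbitrary, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]     # raw, non-normalized scores for 3 classes
probs = softmax(logits)      # normalized probabilities, one per class
```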

    WORD2VEC - based on an autoencoder, we keep only the hidden layer, Part 2

    RBM - restricted (no 2 nodes in the same layer share a connection) Boltzmann machine

    An Autoencoder of features, tries to encode its own structure.

    Works best on pics, video, voice, sensor data. 2 layers, visible and hidden, error and bias calculated via KL Divergence.

  • Also known as a shallow network.

  • Two layers, input and output, goes back and forth until it learns its output.

    DBN - deep belief networks, similar in structure to a multi-layer perceptron: fully connected, with input, hidden(s) and output layers. Can be thought of as stacks of RBMs. Training uses GPU optimization, is accurate and needs a smaller labelled data set to complete the training.

    Solves the ‘vanishing gradient’ problem. Imagine a fully connected network, advancing 2 layers at a time, step by step, until each Boltzmann network (2 layers) learns its output, and continuing until finished. Each layer learns the entire input.

    The next step is to fine-tune using a labelled set, which improves performance and alters the net. So basically, using labelled samples we fine-tune and associate features and patterns with a name. Weights and biases are altered slightly and there is also an increase in performance. This is unlike a CNN, which learns features and then high-level features.

    Accurate and reasonable in time, unlike a fully connected network that has the vanishing gradient problem.

    **Transfer Learning** = like Inception in TensorFlow, use a prebuilt network to solve many problems that “work” similarly to the original network.

  • CS course definition - also very good explanation of the common use cases:

    • Feature extraction from the CNN part (removing the fully connected layer)
    • Fine-tuning, everything or partial selection of the hidden layers, mainly good to keep low level neurons that know what edges and color blobs are, but not dog breeds or something not as general.
  • CNN checkpoints for many problems with transfer learning. Has several relevant references

  • Such as this: “How transferable are features in deep neural networks?”

  • (the indian guy on facebook) IMDB transfer learning using cnn vgg and word2vec, the word2vec is interesting, the cnn part is very informative. With python code, keras.

    CNN, Convolutional Neural Net (this link explains CNN quite well, 2nd tutorial - both explain about convolution, padding, relu - sparsity, max and avg pooling):

  • Common layers: input -> convolution -> ReLU activation -> pooling to reduce dimensionality -> fully connected layer

  • The convolution -> ReLU -> pooling block is repeated several times over, as this discovers patterns but needs another layer -> fully connected layer

  • Then we connect at the end a fully connected layer (fcl) to classify data samples.

  • Good for face detection, images etc.

  • Requires lots of data, not always possible in a real world situation

  • Relu is quite resistant to vanishing gradient & allows for deactivating neurons and for sparsity.
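The convolution -> ReLU -> pooling block from the list above, sketched in 1-D with plain lists to keep it dependency-free. The signal and the hand-picked edge kernel are illustrative only; a real CNN learns its kernels and works on 2-D images with many channels.

```python
def conv1d(signal, kernel):
    """'Valid' cross-correlation, as used in CNN layers (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]      # negative responses are switched off

def max_pool1d(xs, size=2):
    # keep the strongest response in each non-overlapping window
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

signal = [0, 0, 1, 1, 1, 0, 0, 0]
edge_kernel = [-1, 1]                     # responds to rising edges
feature_map = conv1d(signal, edge_kernel)
features = max_pool1d(relu(feature_map))
```

The falling edge produces a negative response that ReLU zeroes out (the sparsity mentioned above), and pooling shortens the seven responses to three.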

    RNN - what is RNN by Andrej Karpathy - The Unreasonable Effectiveness of Recurrent Neural Networks, basically a lot of information about RNNs and their usage cases

  • basic NN node with a loop, previous output is merged with current input. for the purpose of remembering history, for time series, to predict the next X based on the previous Y.

  • 1 to N = frame captioning

  • N to 1 = classification

  • N to N = predict frames in a movie

  • N\2 with time delay to N\2 = predict supply and demand

  • Vanishing gradient is 100 times worse.

  • Gated networks like LSTM solve the vanishing gradient problem.

SNN - SELU activation function is inside not outside, results converge better.

Probably useful for feedforward networks

DEEP REINFORCEMENT LEARNING COURSE (for motion planning) or
DEEP RL COURSE (Q-LEARNING?) - using unlabeled data, reward, and probably a CNN to solve games beyond human level.

A brief survey of DL for Reinforcement learning

WIKI has many types of RNN networks (unread)

Unread and potentially good tutorials:

  1. deep learning python

EXAMPLES of Using NN on images:

Deep image prior / denoiser/ high res/ remove artifacts/ etc..

GRADIENT DESCENT

What batch, stochastic, and mini-batch gradient descent are, and the benefits and limitations of each method.

What is gradient descent, how to use it, local minima okay to use, compared to global. Saddle points, learning rate strategies and research points

  1. Gradient descent is an **optimization algorithm** often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.
  2. The model makes predictions on training data, then uses the error on the predictions to update the model to reduce the error.
  3. The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”

Stochastic

  • calculates the error and updates the model after every training sample

Batch

  • calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.

Mini batch (most common)

  • splits the training dataset into small batches, used to calculate model error and update model coefficients.
  • Implementations may choose to sum the gradient over the mini-batch or take the average of the gradient (reduces variance of gradient) (unclear?)
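The three variants differ only in how many samples feed each update, which a single `batch_size` argument can express. A sketch on a toy one-parameter regression; the data, learning rate and epoch count are made up, and the gradient is averaged over the batch (one of the two options mentioned above).

```python
import random

def fit_slope(xs, ys, batch_size, lr=0.01, epochs=50, seed=0):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                                   # new sample order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # average gradient of (w * x - y) ** 2 over the batch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad                                 # one model update per batch
    return w

xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]                                 # true slope is 3
w_sgd = fit_slope(xs, ys, batch_size=1)                    # stochastic: update per sample
w_mini = fit_slope(xs, ys, batch_size=4)                   # mini-batch
w_batch = fit_slope(xs, ys, batch_size=len(xs))            # batch: one update per epoch
```

Summing instead of averaging would multiply the effective step size by the batch size, so the learning rate would need retuning whenever the batch size changes.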

+ Tips on how to choose and train using mini batch in the link above

Don't decay the learning rate, increase the batch size - paper (optimization of a network)

![alt_text](images/image164.png)

![alt_text](images/image165.png)

Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.

![alt_text](images/image166.png)

GD with Momentum - explain
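As a starting point for the explanation still missing here, a sketch of the usual "classical momentum" update on a one-dimensional quadratic. The learning rate, `beta`, and the toy objective are assumptions for illustration.

```python
def gd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad_fn(w)   # velocity: decaying sum of past gradient steps
        w = w + v                        # move along the accumulated velocity
    return w

# Minimise (w - 5) ** 2, whose gradient is 2 * (w - 5).
w_min = gd_momentum(lambda w: 2 * (w - 5.0), w0=0.0)
```

With `beta=0` this reduces to plain gradient descent. A larger `beta` lets consistent gradient directions build up speed while oscillating directions partly cancel.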

Batch size

(a good read) about batch sizes in keras, specifically LSTM, read this first!

A sequence prediction problem makes a good case for a varied batch size as you may want to have a batch size equal to the training dataset size (batch learning) during training and a batch size of 1 when making predictions for one-step outputs.

Power of 2: batch sizes that are a power of 2 have some advantages with regards to vectorized operations in certain packages, so if it's close it might be faster to keep your batch_size at a power of 2.

(pushing batches of samples to memory in order to train) -

Batch size defines the number of samples that are going to be propagated through the network.

For instance, let's say you have 1050 training samples and you want to set batch_size equal to 100. The algorithm takes the first 100 samples (from 1st to 100th) from the training dataset and trains the network. Next it takes the second 100 samples (from 101st to 200th) and trains the network again. We can keep doing this until we have propagated all samples through the network. A problem usually happens with the last set of samples: in our example we've used 1050, which is not divisible by 100 without remainder. The simplest solution is just to take the final 50 samples and train the network.
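The slicing in this example can be sketched as follows; `iter_batches` is a hypothetical helper name, not a Keras function.

```python
def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index ranges; the last batch may be smaller."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

batches = list(iter_batches(1050, 100))
sizes = [stop - start for start, stop in batches]   # ten batches of 100, then one of 50
```

The same helper reproduces the earlier epoch arithmetic: 1000 samples with a batch size of 500 give 2 iterations per epoch.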

Advantages:

  • It requires less memory. Since you train the network using fewer samples at a time, the overall training procedure requires less memory. This is especially important if you are not able to fit the dataset in memory.
  • Typically networks train faster with mini-batches. That's because we update the weights after each propagation. In our example we've propagated 11 batches (10 of them had 100 samples and 1 had 50 samples) and after each of them we've updated the network's parameters. If we used all samples during propagation we would make only 1 update of the network's parameters.

Disadvantages:

  • The smaller the batch, the less accurate the estimate of the gradient. In the figure below you can see that the mini-batch (green color) gradient's direction **fluctuates** compared to the full batch (blue color).

![alt_text](images/image167.png)

Small batch size has an effect on validation accuracy.

![alt_text](images/image168.png)

IMPORTANT: a batch size in ‘.predict’ is needed for some models, only for technical reasons as seen here, in Keras.

  1. (unread) about mini batches and performance.
  2. (unread) tradeoff between batch size and number of iterations

Another observation, probably empirical - to answer your questions on Batch Size and Epochs:

In general: Larger batch sizes result in faster progress in training, but don't always converge as fast. Smaller batch sizes train slower, but can converge faster. It's definitely problem dependent.

In general, the models improve with more epochs of training, to a point. They'll start to plateau in accuracy as they converge. Try something like 50 and plot number of epochs (x axis) vs. accuracy (y axis). You'll see where it levels out.

BIAS

The role of bias in NN - similarly to the ‘b’ in linear regression.

![alt_text](images/image169.png)

![alt_text](images/image170.png)

BATCH NORMALIZATION

  1. The best explanation of what BN is and why to use it, including busting the myth that it solves internal covariate shift (a shifting input distribution), and saying that it should come after activations as it makes more sense (it does). Also has a really good quote on where a layer ends - it can end at the activation (or not). How to use BN at test time, hint: use a moving window. BN allows us to use 2 parameters to control the input distribution instead of controlling all the weights.
  2. Medium on BN
  3. Medium on BN
  4. Ian goodfellow on BN
  5. Medium #2 - a better one on BN, and adding to VGG
  6. Reddit on BN, mainly on the paper saying to use it before, but best practice is to use after
  7. Diff between batch and norm (weak explanation)
  8. Weight normalization for keras and TF
  9. Layer normalization keras
  10. Instance normalization keras
  11. batch/layer/instance in TF with code
  12. Layer norm for rnn’s or whatever name it is in this post with code for GRU

What is the diff between batch/layer/recurrent batch and back rnn normalization

  • Layer normalization (Ba 2016): Does not use batch statistics. Normalize using the statistics collected from all units within a layer of the current sample. Does not work well with ConvNets.
  • Recurrent Batch Normalization (BN) (Cooijmans, 2016; also proposed concurrently by Qianli Liao & Tomaso Poggio, but tested on Recurrent ConvNets, instead of RNN/LSTM): Same as batch normalization. Use different normalization statistics for each time step. You need to store a set of mean and standard deviation for each time step.
  • Batch Normalized Recurrent Neural Networks (Laurent, 2015): batch normalization is only applied between the input and hidden state, but not between hidden states. i.e., normalization is not applied over time.
  • Streaming Normalization (Liao et al. 2016): it summarizes existing normalizations and overcomes most issues mentioned above. It works well with ConvNets, recurrent learning and online learning (i.e., small mini-batch or one sample at a time).
  • Weight Normalization (Salimans and Kingma 2016): whenever a weight is used, it is divided by its L2 norm first, such that the resulting weight has L2 norm 1. That is, output y=x∗(w/|w|), where x and w denote the input and weight respectively. A scalar scaling factor g is then multiplied to the output y=y∗g. But in my experience g seems not essential for performance (also downstream learnable layers can learn this anyway).
  • Cosine Normalization (Luo et al. 2017): weight normalization is very similar to cosine normalization, where the same L2 normalization is applied to both weight and input: y=(x/|x|)∗(w/|w|). Again, manual or automatic differentiation can compute appropriate gradients of x and w.
  • Note that both Weight and Cosine Normalization have been extensively used (called normalized dot product) in the 2000s in a class of ConvNets called HMAX (Riesenhuber 1999) to model biological vision. You may find them interesting.
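The two formulas quoted above, y = g * x . (w/|w|) and y = (x/|x|) . (w/|w|), written out for a single neuron. Plain lists stand in for tensors, and the helper names are made up for this sketch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_norm(v):
    return math.sqrt(dot(v, v))

def weight_norm_output(x, w, g=1.0):
    """y = g * x . (w / |w|): only the direction of w matters, g sets the scale."""
    return g * dot(x, w) / l2_norm(w)

def cosine_norm_output(x, w):
    """y = (x / |x|) . (w / |w|): the cosine of the angle between x and w."""
    return dot(x, w) / (l2_norm(x) * l2_norm(w))

x = [3.0, 4.0]
w = [3.0, 4.0]
y_weight = weight_norm_output(x, w)      # 25 / 5
y_cosine = cosine_norm_output(x, w)      # same direction, so cosine is 1
```

Rescaling `w` leaves both outputs unchanged, which is the point of the normalization. Rescaling `x` changes the weight-normalized output but not the cosine one.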

More about batch/layer/instance/group norm, which are different methods for normalizing the inputs to the layers of deep neural networks

  1. Layer normalization solves the RNN case that batch normalization couldn't - it is done per sample, across the features within the layer, and the normalized features replace the originals
  2. Instance does it for (cnn?) using per channel normalization
  3. Group does it for group of channels
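The difference between these methods is mostly which axis the mean and variance are computed over. A sketch for the two simplest cases on a [sample][feature] table, with the learnable scale/shift left out and `eps` chosen arbitrarily:

```python
import statistics

def normalize(values, eps=1e-5):
    mu = statistics.mean(values)
    var = statistics.pvariance(values)
    return [(v - mu) / (var + eps) ** 0.5 for v in values]

def batch_norm(x):
    """x[sample][feature]: normalise each feature (column) across the batch."""
    cols = [normalize(col) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]

def layer_norm(x):
    """x[sample][feature]: normalise each sample (row) across its own features."""
    return [normalize(row) for row in x]

x = [[1.0, 10.0, 100.0],
     [2.0, 20.0, 200.0],
     [3.0, 30.0, 300.0]]
bn = batch_norm(x)   # every column now has mean ~0
ln = layer_norm(x)   # every row now has mean ~0
```

Instance and group normalization apply the same `normalize` step to other groupings (per channel of one sample, or per group of channels). Layer norm needs no batch statistics, which is why it suits RNNs and a batch size of 1.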

![alt_text](images/image171.png)

Part1: intuitive explanation to batch normalization

Part2: batch/layer/weight normalization - This is a good resource for advantages for every layer

  • Layer, per feature in a batch,
  • weight - divided by the norm

![alt_text](images/image172.png)

HYPER PARAM GRID SEARCHES

  1. A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

LOSS

Very basic advice: you should probably switch the train/validation split to something like 80% training and 20% validation. In most cases it will improve the classifier performance overall (more training data = better performance).

If the training error and test error are too close (your system is unable to overfit on your training data), this means that your model is too simple. Solution: more layers or more neurons per layer.

Early stopping

If you have never heard about "early-stopping" you should look it up, it's an important concept in the neural network domain (https://en.wikipedia.org/wiki/Early_stopping). To summarize, the idea behind early stopping is to stop the training once the validation loss starts plateauing. Indeed, when this happens it almost always means you are starting to overfit your classifier. The training loss value in itself is not something you should trust, because it will continue to decrease even when you are overfitting your classifier.
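A sketch of the stopping rule itself, run over an already recorded list of validation losses. The `patience` and `min_delta` names mirror common callback parameters but are used here as plain assumptions, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0          # improvement: remember it, reset the counter
        else:
            wait += 1                     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1            # never triggered: ran to the end

val = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.75]
stop_epoch = early_stopping_epoch(val)
```

In practice one would also keep the weights from the best epoch (index 3 here) rather than those from the epoch where training stops.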

With cross entropy there can be an issue where the accuracy is the same for two cases, one where the loss is decreasing and the other when the loss is not changing much.

![alt_text](images/image173.png)

How to read LOSS graphs (and accuracy on top)

![alt_text](images/image174.png)

![alt_text](images/image175.png)

This indicates that the model is overfitting. It continues to get better and better at fitting the data that it sees (training data) while getting worse and worse at fitting the data that it does not see (validation data).

This is a very good example of a train/test loss and an accuracy behavior.

![alt_text](images/image176.png)

![alt_text](images/image177.png)

Cross entropy formula with soft labels (probability) rather than classes.
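For reference, the formula is H(p, q) = -sum_i p_i * log(q_i), where p is the target distribution and q the predicted one. A hard (one-hot) label is just the special case where p puts all its mass on one class. A sketch with invented probabilities; the `eps` clamp is an assumption to avoid log(0).

```python
import math

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_probs, predicted_probs))

predicted = [0.7, 0.2, 0.1]
hard = cross_entropy([1.0, 0.0, 0.0], predicted)   # only the true class contributes
soft = cross_entropy([0.8, 0.1, 0.1], predicted)   # every class contributes, weighted by p
```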

Mastery on cross entropy, brier, roc auc, how to ‘game’ them and calibrate them

Game changer paper - a general adaptive loss search in nn

LEARNING RATE REDUCTION

Intro to Learning Rate methods - what they are doing and what they are fixing in other algos.

Callbacks, especially ReduceLROnPlateau - this callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.
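The logic described above, sketched in plain Python rather than through the Keras callback itself. The parameter names follow the callback's `factor`, `patience` and `min_lr`, but the default values and the loss sequence are made up, and features such as cooldown are left out.

```python
def reduce_lr_on_plateau(val_losses, lr=0.1, factor=0.5, patience=2, min_lr=1e-4):
    """Return the learning rate in effect after each epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0                  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:                  # plateau: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

lrs = reduce_lr_on_plateau([1.0, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8])
```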

Cs123 (very good): explains about many things related to CNN, but also about LR and adaptive methods.

An excellent comparison of several learning rate schedule methods and adaptive methods: (same here but not as good)

![alt_text](images/image178.png)

Adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop, Adam, provide an alternative to classical SGD.

These per-parameter learning rate methods provide heuristic approach without requiring expensive work in tuning hyperparameters for the learning rate schedule manually.

  1. **Adagrad** performs larger updates for more sparse parameters and smaller updates for less sparse parameters. It has good performance with sparse data and when training large-scale neural networks. However, its monotonically decreasing learning rate usually proves too aggressive and stops learning too early when training deep neural networks.
  2. **Adadelta** is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
  3. **RMSprop** adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate.
  4. **Adam** is an update to the RMSprop optimizer, which is like RMSprop with momentum.
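The relationship between three of these methods is easier to see side by side. A sketch of the per-parameter update for one scalar weight. The hyperparameter defaults are illustrative assumptions, and Adadelta is left out to keep it short.

```python
import math

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g     # grows forever, so steps only shrink
    return w - lr * g / (math.sqrt(state["sum_sq"]) + eps)

def rmsprop_step(w, g, state, lr=0.1, rho=0.9, eps=1e-8):
    # decaying average instead of a sum, so the step size can recover
    state["avg_sq"] = rho * state.get("avg_sq", 0.0) + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(state["avg_sq"]) + eps)

def adam_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g       # momentum (first moment)
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g   # RMSprop-like second moment
    m_hat = state["m"] / (1 - b1 ** t)                         # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Constant gradient of 1.0: Adagrad's steps shrink like lr / sqrt(t).
ada_state = {}
w1 = adagrad_step(0.0, 1.0, ada_state)
w2 = adagrad_step(w1, 1.0, ada_state)
```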

![alt_text](images/image179.png)

Adaptive learning rate methods demonstrate better performance than learning rate schedules, and they require much less effort in hyperparameter settings.

![alt_text](images/image180.png)

Recommended paper: practical recommendation for gradient based DNN

Another great comparison - pdf paper and webpage link -

  • if your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods.
  • An additional benefit is that you will not need to tune the learning rate but will likely achieve the best results with the default value.
  • In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [10] show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice

TRAIN / VAL accuracy in NN

The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:

![alt_text](images/image181.jpg)

  • The gap between the training and validation accuracy indicates the amount of overfitting.
  • Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point).
  • NOTE: When you see this in practice you probably want to increase regularization:
    • stronger L2 weight penalty
    • Dropout
    • collect more data.
  • The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.

INITIALIZERS

XAVIER GLOROT:

Why’s Xavier initialization important?

In short, it helps signals reach deep into the network.

  • If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
  • If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.

Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.

To go any further than this, you’re going to need a small amount of statistics - specifically you need to know about random distributions and their variance.

When to use glorot uniform-over-normal initialization?

However, I am still not seeing anything empirical that says that Glorot surpasses everything else under certain conditions (except the Glorot paper). Most importantly, does it really help in LSTM, where the vanishing gradient is ~no longer an issue?

He-et-al Initialization

This method of initializing became famous through a paper submitted in 2015 by He et al, and is similar to Xavier initialization, with the factor multiplied by two. In this method, the weights are initialized keeping in mind the size of the previous layer which helps in attaining a global minimum of the cost function faster and more efficiently.

w=np.random.randn(layer_size[l],layer_size[l-1])*np.sqrt(2/layer_size[l-1])
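A dependency-free sketch of the same idea, next to a Glorot (Xavier) variant for comparison. The layer sizes are arbitrary. The Glorot uniform limit sqrt(6 / (fan_in + fan_out)) is the form from the Glorot and Bengio paper, which differs slightly from the simplified 1/fan_in variance the text above alludes to.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    std = math.sqrt(2.0 / fan_in)          # same scaling as the one-liner above
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 256, 128                 # size of the previous layer, size of this layer
w_glorot = glorot_uniform(fan_in, fan_out, rng)
w_he = he_normal(fan_in, fan_out, rng)     # shape: fan_out rows, fan_in columns
```

Both keep the weights small in proportion to the layer width, so the signal neither shrinks nor blows up as it passes through many layers.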

ACTIVATION FUNCTIONS

(a bunch of observations, seems like a personal list) -

  • Output layer - linear for regression, softmax for classification
  • Hidden layers - hyperbolic tangent for shallow networks (less than 3 hidden layers), and ReLU for deep networks

**ReLU** - The purpose of ReLU is to introduce non-linearity, since most of the real-world data we would want our network to learn would be nonlinear (e.g. convolution is a linear operation – element wise matrix multiplication and addition, so we account for nonlinearity by introducing a nonlinear function like ReLU, e.g. here - search for ReLU).

  • Relu is quite resistant to vanishing gradient & allows for deactivating neurons and for sparsity.
  • Other nonlinear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
  1. Visual + description of activation functions
  2. A very good explanation + figures about activations functions

Selu - better than RELU? Possibly.

Mish: A Self Regularized Non-Monotonic Neural Activation Function, yam peleg’s code

Mish, Medium, Keras Code, with benchmarks, computationally expensive.

OPTIMIZERS

There are several optimizers, each had its 15 minutes of fame; some optimizers are recommended for CNN, time series, etc.

There are also what I call ‘experimental’ optimizers. It seems like these pop up every now and then, with or without a formal proof. It is recommended to follow the literature and see what the ‘supposedly’ state-of-the-art optimizers are at the moment.

Adamod deeplearning optimizer with memory

Backstitch - September 17 - supposedly an improvement over SGD for speech recognition using DNNs. Note: it wasn't tested with other datasets or other network types.

(how does it work?) take a negative step back, then a positive step forward. I.e., When processing a minibatch, instead of taking a single SGD step, we first take a step with −α times the current learning rate, for α > 0 (e.g. α = 0.3), and then a step with 1 + α times the learning rate, with the same minibatch (and a recomputed gradient). So we are taking a small negative step, and then a larger positive step. This resulted in quite large improvements – around 10% relative improvement [37] – for our best speech recognition DNNs. The recommended hyper parameters are in the paper.

Drawbacks: takes twice as long to train, momentum is not implemented or tested, dropout is mandatory for the improvement, slow starter.
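The two-step update described above can be sketched on a toy linear regression. This is only an illustration under my own assumptions (plain NumPy, a mean-squared-error loss, made-up data and learning rate), not the paper's speech recognition setup:

```python
import numpy as np

def mse_grad(theta, x_batch, y_batch):
    """Gradient of the mean squared error of a linear model y ~ x @ theta."""
    residual = x_batch @ theta - y_batch
    return 2.0 * x_batch.T @ residual / len(y_batch)

def backstitch_step(theta, x_batch, y_batch, lr, alpha=0.3):
    # Step 1: learning rate of -alpha * lr, i.e. a small step *up* the gradient
    theta_back = theta - (-alpha * lr) * mse_grad(theta, x_batch, y_batch)
    # Step 2: recompute the gradient on the same minibatch, then take a
    # larger step of (1 + alpha) * lr in the usual descent direction
    return theta_back - (1.0 + alpha) * lr * mse_grad(theta_back, x_batch, y_batch)

rng = np.random.default_rng(0)
true_theta = np.array([1.5, -2.0, 0.5])  # hypothetical target weights
x = rng.standard_normal((64, 3))
y = x @ true_theta                       # noise-free toy targets

theta = np.zeros(3)
for _ in range(300):
    idx = rng.choice(len(x), size=16, replace=False)
    theta = backstitch_step(theta, x[idx], y[idx], lr=0.1)

print(np.round(theta, 3))
```

Each minibatch costs two gradient evaluations, which is where the "twice as long to train" drawback comes from.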

Documentation about optimizers in keras

  • SGD can be fine-tuned
  • For the others, leave most parameters at their defaults

Best description on optimizers with momentum etc, from sgd to nadam, formulas and intuition

![alt_text](images/image182.png)

DROPOUT LAYERS IN KERAS AND GENERAL

A very influential paper about dropout and how beneficial it is - bottom line always use it.

**OPEN QUESTIONS:**

  1. Does a separate dropout layer improve performance even if the LSTM layer already has dropout or recurrent dropout?
  2. What is the difference between a separate dropout layer and dropout inside the LSTM layer?
  3. What is the difference, in practice and intuitively, between dropout and recurrent dropout?

Dropout layers in keras, or dropout regularization:

  • Dropout is a technique where randomly selected neurons are ignored during training.
  • Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass.
  • As a neural network learns, neuron weights settle into their context within the network.
  • Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which if taken too far can result in a fragile model too specialized to the training data (overfitting).
  • This reliance on context for a neuron during training is referred to as complex co-adaptation.
  • After dropout, **other neurons will have to step in and handle the representation required to make predictions for the missing neurons**, which is believed to result in multiple independent internal representations being learned by the network.
  • Thus, the effect of dropout is that the network becomes less sensitive to the specific weights of neurons.
  • This in turn leads to a network with better generalization capability that is less likely to overfit the training data.

Another great answer about drop out -

  • as a consequence of the 50% dropout, the neural network will learn different, redundant representations; the network can’t rely on the particular neurons and the combination (or interaction) of these to be present.
  • Another nice side effect is that training will be faster.
  • Rules:
    • Dropout is only applied during training,
    • Need to rescale the remaining neuron activations. E.g., if you set 50% of the activations in a given layer to zero, you need to scale up the remaining ones by a factor of 2.
    • if the training has finished, you’d use the complete network for testing (or in other words, you set the dropout probability to 0).

The implementation of dropout in Keras is “inverted dropout”: in the Keras implementation, the output values are corrected during training (by dividing, in addition to randomly dropping out the values) instead of during testing (by multiplying).

Inverted dropout is functionally equivalent to original dropout (as per your link to Srivastava's paper), with a nice feature that the network does not use dropout layers at all during test and prediction. This is explained a little in this Keras issue.
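A minimal NumPy sketch of inverted dropout as described above. It is an illustration of the idea, not the Keras source code; `inverted_dropout` and its arguments are names I made up:

```python
import numpy as np

def inverted_dropout(activations, rate, training, rng):
    """Drop a fraction `rate` of activations and rescale the rest during training."""
    if not training or rate == 0.0:
        # At test time the layer is a no-op: the full network is used
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = inverted_dropout(a, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(a, rate=0.5, training=False, rng=rng)

print(np.unique(train_out), round(float(train_out.mean()), 3))
print(np.unique(test_out))
```

With a rate of 50% the surviving activations are scaled by a factor of 2, matching the rescaling rule in the list above.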

Dropout notes and rules of thumb aka “best practice” -

  • dropout value of 20%-50% of neurons with 20% providing a good starting point. (A probability too low has minimal effect and a value too high results in underlearning by the network.)
  • Use a large network for better performance, i.e., when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use dropout on VISIBLE AND HIDDEN. Application of dropout at each layer of the network has shown good results.
  • **Unclear?** Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
  • **Unclear?** Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.

Difference between LSTM ‘dropout’ and ‘recurrent_dropout’ - vertical vs horizontal.

I suggest taking a look at (the first part of) this paper. Regular dropout is applied on the inputs and/or the outputs, meaning the vertical arrows from x_t and to h_t. If you add it as an argument to your layer, it will mask the inputs; you can add a Dropout layer after your recurrent layer to mask the outputs as well. Recurrent dropout masks (or "drops") the connections between the recurrent units; that would be the horizontal arrows in your picture.

This picture is taken from the paper above. On the left, regular dropout on inputs and outputs. On the right, regular dropout PLUS recurrent dropout:

![alt_text](images/image183.png)

NEURAL NETWORK OPTIMIZATION TECHNIQUES

Basically do these after you have a working network

  1. Don't decay the learning rate, increase the batch size - paper (optimization of a network)
  2. Add one neuron with skip connection, or to every layer in a binary classification network to get global minimum.

Fine tuning

  1. 3 methods to fine-tune: cut the softmax layer, use a smaller learning rate, freeze layers
  2. Fine tuning on a subset of data

Deep Learning for NLP

MULTI LABEL/OUTPUT

  1. A machine learning framework for multi-output/multi-label and stream data. Inspired by MOA and MEKA, following scikit-learn's philosophy. https://scikit-multiflow.github.io/
  2. Medium on MO, sklearn and keras
  3. [MO in keras, see functional API on how.](https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/)

FUZZY MULTI LABEL

  1. I.e., probabilities or soft values instead of hard labels

DNN FRAMEWORKS

PYTORCH

  1. Deep learning with pytorch - The book
  2. Pytorch DL course, git - yann lecun

FAST.AI

  1. git

KERAS

A sensible introduction to Keras, with several videos on the topic, going through many network types, creating custom activation functions, and going through examples.

Didn’t read:

  1. Keras cheatsheet
  2. Seq2Seq RNN
  3. Stateful LSTM - Example script showing how to use stateful RNNs to model long sequences efficiently.
  4. CONV LSTM - this script demonstrates the use of a conv LSTM network, used to predict the next frame of an artificially generated movie which contains moving squares.

How to force keras to use tensorflow and not theano (set the .bat file)

Callbacks - how to create an AUC ROC score callback with keras - with code example.

Batch size vs. Iterations in NN \ Keras.

Keras metrics - classification regression and custom metrics

Keras Metrics 2 - accuracy, ROC, AUC, classification, regression r^2.

Introduction to regression models in Keras, using MSE, comparing baseline vs wide vs deep networks.

How does Keras calculate accuracy? Formula and explanation

For binary labels, Keras compares the label with the rounded predicted float, i.e. greater than 0.5 = 1, otherwise 0.

For categorical labels, we take the argmax of the label and of the prediction and compare their positions.

In both cases, we average the results.
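The two accuracy rules above can be sketched in NumPy as follows. This mirrors the description, not the actual Keras implementation, and the function names and sample values are my own:

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    # Threshold the predicted float, then compare with the label and average
    return float(np.mean((y_pred > threshold).astype(int) == y_true))

def categorical_accuracy(y_true_onehot, y_pred_probs):
    # Compare the position of the largest value in the label and the prediction
    same = np.argmax(y_true_onehot, axis=1) == np.argmax(y_pred_probs, axis=1)
    return float(np.mean(same))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.4, 0.3, 0.7])
print(binary_accuracy(y_true, y_pred))  # 0.75 (3 of 4 correct)

labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]])
print(categorical_accuracy(labels, probs))  # 2 of 3 correct
```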

Custom metrics (precision recall) in keras. Which are taken from here, including entropy and f1

KERAS MULTI GPU

  1. When using SGD, only batch sizes between 32 and 512 are adequate; larger batches can lead to lower performance, smaller ones to slow training.
  2. Note: this probably doesn't apply to Adam. Is there a reference?
  3. Parallel GPU code for Keras. It's a one-liner, but remember to scale the batch size by the number of GPUs used in order to see a (non-linear) scalability in training time.
  4. Pitfalls in GPU training. This is a very important post: be aware that you can corrupt your weights using the wrong combination of batch size and input size in Keras-TensorFlow.
    When you do multi-GPU training, it is important to feed all the GPUs with data. It can happen that the very last batch of your epoch has less data than defined (because the size of your dataset cannot be divided exactly by the size of your batch). This might cause some GPUs not to receive any data during the last step. Unfortunately some Keras layers, most notably the Batch Normalization layer, can’t cope with that, leading to NaN values appearing in the weights (the running mean and variance in the BN layer).
  5. 5 things to be aware of for multi-GPU training using Keras, crucial to look at before doing anything
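A small sanity check for the pitfalls above, in plain Python. It assumes each global batch is split evenly across the GPUs, so a final batch with fewer samples than GPUs leaves some of them empty; all function names and numbers are hypothetical:

```python
def global_batch_size(per_gpu_batch, n_gpus):
    # Scale the batch by the number of GPUs, as recommended above
    return per_gpu_batch * n_gpus

def last_batch_size(n_samples, batch_size):
    remainder = n_samples % batch_size
    return remainder if remainder else batch_size

def last_batch_starves_a_gpu(n_samples, batch_size, n_gpus):
    # True if the final batch cannot give every GPU at least one sample
    return last_batch_size(n_samples, batch_size) < n_gpus

def trim_to_full_batches(n_samples, batch_size):
    # One possible workaround: drop the leftover samples at the end
    return n_samples - (n_samples % batch_size)

batch = global_batch_size(per_gpu_batch=64, n_gpus=4)
print(batch)                                       # 256
print(last_batch_size(10_242, batch))              # 2
print(last_batch_starves_a_gpu(10_242, batch, 4))  # True
print(trim_to_full_batches(10_242, batch))         # 10240
```

Trimming is only one option; padding the dataset or picking a batch size that divides it exactly would avoid the short last batch as well.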

KERAS FUNCTIONAL API

What is and how to use? A flexible way to declare layers in parallel, i.e. parallel ways to deal with input, feature extraction, models and outputs as seen in the following images.

![alt_text](images/image184.png)

KERAS EMBEDDING LAYER

  1. Injecting glove to keras embedding layer and using it for classification + what is and how to use the embedding layer in keras.
  2. Keras blog - using GLOVE for pretrained embedding layers.
  3. Word embedding using keras, continuous BOW - CBOW, SKIPGRAM, word2vec - really good.
  4. Fasttext - comparison of key feature against word2vec
  5. Multiclass classification using word2vec/glove + code
  6. word2vec/doc2vec/tfidf code in python for text classification
  7. Lda & word2vec
  8. Text classification with word2vec
  9. Gensim word2vec, and another one
  10. Fasttext paper

Keras: Predict vs Evaluate

here:

.predict() generates output predictions based on the input you pass it (for example, the predicted characters in the MNIST example)

.evaluate() computes the loss based on the input you pass it, along with any other metrics that you requested in the metrics param when you compiled your model (such as accuracy in the MNIST example)

Keras metrics

For classification methods - how does keras calculate accuracy, all functions.

LOSS IN KERAS

Why is the training loss much higher than the testing loss? A Keras model has two modes: training and testing. Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.

The training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
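A toy illustration of the second explanation, in plain Python. The loss curve is made up purely for the example; the point is only that the average over an epoch of a decreasing loss is higher than the loss of the model at the end of that epoch:

```python
def loss_at_step(step):
    # Hypothetical loss that decays as the model is updated batch after batch
    return 1.0 / (1.0 + 0.1 * step)

batches_per_epoch = 50
batch_losses = [loss_at_step(step) for step in range(batches_per_epoch)]

# What gets reported as the training loss: the mean over all batches of the epoch
reported_training_loss = sum(batch_losses) / batches_per_epoch
# What the testing loss sees: the model as it is once the epoch has finished
end_of_epoch_loss = loss_at_step(batches_per_epoch)

print(round(reported_training_loss, 3), round(end_of_epoch_loss, 3))
```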

DNN ALGORITHMS

AUTOENCODERS

  1. How to use AE for dimensionality reduction + code - using keras’ functional API
  2. Keras.io blog post about AE’s - regular, deep, sparse, regularized, cnn, variational
    1. A keras.io replicate post but explains AE quite nicely.
  3. Examples of vanilla, multi layer, CNN and sparse AE’s
  4. Another example of CNN-AE
  5. Another AE tutorial
  6. Hinton’s coursera course on PCA vs AE, basically some info about what PCA does - maximizing variance and projecting and then what AE does and can do to achieve similar but non-linear dense representations
  7. A great tutorial on what the clusters look like after applying PCA/ICA/AE
  8. Another great presentation on PCA vs AE, summarized in the KPCA section of this notebook. +another one +StackExchange
  9. Autoencoder tutorial with python code and how to encode after
  10. Git code for low dimensional auto encoder
  11. Bart denoising AE, sequence to sequence pre training for NL generation translation and comprehension.

AE for anomaly detection, fraud detection

Variational AE

  1. Unread - Simple explanation
  2. Pixel art VAE
  3. Unread - another VAE
  4. Pixel GAN VAE
  5. Disentangled VAE - improves VAE

SELF ORGANIZING MAPS (SOM)

  1. Git
    1. Sompy,
    2. ***minisom!
    3. Many graph examples, example
  2. Step by step with examples, calculations
  3. Adds intuition regarding “magnetism”
  4. Implementation and faces, intuition towards each node and what it represents in a vision. I.e., each face resembles one of K clusters.
  5. Medium on kohonen networks, i.e., SOM
  6. Som on iris, explains inference - averaging, and cons of the method.
  7. Simple explanation
  8. Algorithm, formulas

NEURO EVOLUTION (GA/GP based)

NEAT

NEAT stands for NeuroEvolution of Augmenting Topologies. It is a method for evolving artificial neural networks with a genetic algorithm.

NEAT implements the idea that it is most effective to start evolution with small, simple networks and allow them to become increasingly complex over generations.


That way, just as organisms in nature increased in complexity since the first cell, so do neural networks in NEAT.

This process of continual elaboration allows finding highly sophisticated and complex neural networks.

A great article about NEAT

HYPER-NEAT

HyperNEAT computes the connectivity of its neural networks as a function of their geometry.

HyperNEAT is based on a theory of representation that hypothesizes that a good representation for an artificial neural network should be able to describe its pattern of connectivity compactly.

The encoding in HyperNEAT, called compositional pattern producing networks, is designed to represent patterns with regularities such as symmetry, repetition, and repetition with variation.

(WIKI) Compositional pattern-producing networks (CPPNs) are a variation of artificial neural networks (ANNs) that have an architecture whose evolution is guided by genetic algorithms

![alt_text](images/image185.png)

A great HyperNeat tutorial on Medium.

Radial Basis Function Network (RBFN)

The RBFN approach is more intuitive than the MLP.

  • An RBFN performs classification by **measuring the input’s similarity to examples from the training set.**
  • Each RBFN neuron stores a “prototype”, which is just **one of the examples from the training set.**
  • When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype.
  • Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.
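A rough NumPy sketch of the classification rule above. The Gaussian activation `exp(-beta * distance^2)`, the unweighted sum per class, the `beta` value and the toy prototypes are my assumptions; a trained RBFN would also learn output weights:

```python
import numpy as np

def rbfn_predict(x, prototypes, prototype_labels, beta=1.0):
    # Euclidean distance between the input and every stored prototype
    distances = np.linalg.norm(prototypes - x, axis=1)
    # Turn distance into similarity: close prototypes respond strongly
    activations = np.exp(-beta * distances ** 2)
    classes = np.unique(prototype_labels)
    # Sum the responses of the prototypes belonging to each class
    scores = np.array([activations[prototype_labels == c].sum() for c in classes])
    return classes[np.argmax(scores)], scores

prototypes = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.5, 2.8]])
prototype_labels = np.array(["A", "A", "B", "B"])

label, scores = rbfn_predict(np.array([0.3, 0.1]), prototypes, prototype_labels)
print(label, np.round(scores, 4))
```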

![alt_text](images/image186.png)

Bayesian Neural Network (BNN)

BNN - (what is?) Bayesian neural network (BNN) according to Uber - **an architecture that more accurately forecasts time series predictions and uncertainty estimations at scale.** Uber describes how it applied this model to large-scale time series anomaly detection, in order to better accommodate rider demand during high-traffic intervals.

Under the BNN framework, prediction uncertainty can be categorized into three types:

  1. Model uncertainty captures our ignorance of the model parameters and **can be reduced as more samples are collected.**
  2. Model misspecification
  3. Inherent noise captures the uncertainty in the data generation process and is irreducible.

Note: in a series of articles, uber explains about time series and leads to a BNN architecture.

  1. Neural networks - training on multi-signal raw data, training X and Y are window-based and the window size(lag) is determined in advance.

    Vanilla LSTM did not work properly, therefore an architecture of

Regarding point 1: ‘run prediction with dropout 100 times’

*** MEDIUM with code how to do it.

Why do we need a confidence measure when we have a softmax probability layer? The blog post explains, for example, that with a CNN trained on apples, oranges, cats and dogs, an unrelated example such as a frog image may lead the network to decide it's an apple, therefore we can’t rely on the probability as a confidence measure. The ‘run prediction with dropout 100 times’ approach should give us a confidence measure because it draws each weight from a Bernoulli distribution.

“By applying dropout to all the weight layers in a neural network, we are essentially drawing each weight from a Bernoulli distribution. In practice, this mean that we can sample from the distribution by running several forward passes through the network. This is referred to as Monte Carlo dropout.”

Taken from Yarin Gal’s blog post. In this figure we see how sporadic the signal from a single forward pass is (black line) compared to a much cleaner signal from 100 dropout passes.

![alt_text](images/image187.png)

Is it applicable for time series? In the figure below he tried to predict the missing signal between each two dotted lines. A is a bad estimation, but with a dropout layer we can see that in most cases the signal is better predicted.

![alt_text](images/image188.png)

Going back to Uber: they actually use this idea to predict time series with an LSTM, using an encoder-decoder framework.

![alt_text](images/image189.png)

Note: this is probably applicable in other types of networks.

PhD thesis by Yarin Gal: he talks about uncertainty in neural networks and using BNNs. He may have proved this idea, but I did not read it. This blog post links to his full PhD.

Old note: The idea behind uncertainty is (paper here) that in order to trust your network’s classification, you drop some of the neurons during prediction, you do this ~100 times and you average the results. Intuitively this will give you confidence in your classification and increase your classification accuracy, because only a partial part of your network participated in the classification, randomly, 100 times. Please note that Softmax doesn't give you certainty.

Medium post on prediction with drop out

The solution for keras says to call every dropout layer with training=True and add another dropout at the end of the model. Thanks sam.

    import keras

    inputs = keras.Input(shape=(10,))
    x = keras.layers.Dense(3)(inputs)
    # training=True keeps dropout active at prediction time as well
    outputs = keras.layers.Dropout(0.5)(x, training=True)
    model = keras.Model(inputs, outputs)
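To make the ‘run prediction with dropout 100 times’ idea concrete without depending on Keras, here is a NumPy sketch. The tiny two-layer network and its random weights are hypothetical stand-ins for a trained model; the mean over passes is the prediction and the standard deviation is the uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical fixed weights standing in for a trained network
W1 = rng.standard_normal((10, 32))
W2 = rng.standard_normal((32, 1))

def stochastic_forward(x, rate=0.5):
    hidden = np.maximum(0.0, x @ W1)
    # Dropout stays ON at prediction time: a fresh random mask on every call
    mask = rng.random(hidden.shape) >= rate
    hidden = hidden * mask / (1.0 - rate)
    return hidden @ W2

def mc_dropout_predict(x, n_passes=100):
    samples = np.stack([stochastic_forward(x) for _ in range(n_passes)])
    # Mean = prediction, standard deviation = spread across the dropout passes
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.standard_normal((5, 10))
mean, std = mc_dropout_predict(x)
print(np.round(mean.ravel(), 2))
print(np.round(std.ravel(), 2))
```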

CONVOLUTIONAL NEURAL NET

![alt_text](images/image190.png)

(an excellent and thorough explanation about LeNet) -

  • **Convolution Layer** - its primary purpose is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data.
  • **ReLU** (more in the activation chapter) - The purpose of ReLU is to introduce non-linearity in our ConvNet.
  • Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.
  • **Dense / Fully Connected** - a traditional Multi Layer Perceptron that uses a softmax **activation** function in the output layer to classify. The outputs of the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.

The overall training process of the Convolutional Network may be summarized as below:

  • Step1: We **initialize** all **filters** and **parameters** / **weights** with random values
  • Step2: The network takes a single training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
    • Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
    • Since weights are randomly assigned for the first training example, output probabilities are also random.
  • Step3: Calculate the total error at the output layer (summation over all 4 classes)
    • (L2) Total Error = ∑ ½ (target probability – output probability) ²
  • Step4: Use **Backpropagation** to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
    • The weights are adjusted in proportion to their contribution to the total error.
    • When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
    • This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
    • Parameters like number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during training process – only the values of the filter matrix and connection weights get updated.
  • Step5: Repeat steps 2-4 with all images in the training set.
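Plugging the example numbers from the steps above into the total error formula (plain Python, with the target vector [0, 0, 1, 0] from Step 4):

```python
def total_error(target, output):
    # Total Error = sum over classes of 1/2 * (target - output)^2
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

target = [0, 0, 1, 0]
before = total_error(target, [0.2, 0.4, 0.1, 0.3])  # random initial weights
after = total_error(target, [0.1, 0.1, 0.7, 0.1])   # after a weight update

print(round(before, 2), round(after, 2))  # 0.55 0.06
```

The error drops from 0.55 to 0.06, which is what "the output error is reduced" means numerically.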

The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.

Illustrated 10 CNNS architectures

A study that deals with class imbalance in CNN’s - we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue

  1. Over sampling
  2. Undersampling
  3. Thresholding probabilities (ROC?)
  4. Cost sensitive classification - different costs for misclassification
  5. One class - novelty detection. This is a concept learning technique that recognizes positive instances rather than discriminating between two classes

Using several imbalance scenarios, on several known data sets, such as MNIST

![alt_text](images/image191.png)

The results indicate (loosely) that oversampling is the better choice in most cases, and that it doesn't cause overfitting in CNNs.

CONV-1D

  1. How to setup a conv1d in keras, most importantly how to reshape your input vector
  2. Mastery on Character ngram cnn for sentiment analysis
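On the reshaping point: a Keras Conv1D layer expects 3D input of shape (samples, steps, channels) in its default channels-last layout, so a 2D array of univariate series needs an extra channel axis. A NumPy-only sketch with made-up sizes:

```python
import numpy as np

n_samples, n_timesteps = 100, 24  # hypothetical dataset dimensions
series = np.arange(n_samples * n_timesteps, dtype=float).reshape(n_samples, n_timesteps)

# Add a trailing channel axis: (samples, steps) -> (samples, steps, 1)
conv1d_input = series.reshape(n_samples, n_timesteps, 1)
# Equivalent alternative: series[..., np.newaxis]

print(series.shape, "->", conv1d_input.shape)
```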

1x1 CNN

  1. Mastery on 1x1 cnn, for dim reduction, decreasing feature maps and other usages.
    1. “This is the most common application of this type of filter and in this way, the layer is often called a feature map pooling layer.”
    2. “In the paper, the authors propose the need for an MLP convolutional layer and the need for cross-channel pooling to promote learning across channels.”
    3. “the 1×1 filter was used explicitly for dimensionality reduction and for increasing the dimensionality of feature maps after pooling in the design of the inception module, used in the GoogLeNet model”
    4. “The 1×1 filter was used as a projection technique to match the number of filters of input to the output of residual modules in the design of the residual network “
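A NumPy sketch of why a 1×1 convolution acts as dimensionality reduction across channels: at every pixel it is just a linear map from the input channels to the output channels. Shapes and random weights are illustrative only, and bias and activation are left out:

```python
import numpy as np

def conv1x1(feature_maps, kernel):
    """feature_maps: (H, W, C_in); kernel: (C_in, C_out)."""
    # The same channel-mixing matrix is applied independently at each pixel
    return feature_maps @ kernel

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((28, 28, 256))
kernel = rng.standard_normal((256, 64))

reduced = conv1x1(feature_maps, kernel)
print(feature_maps.shape, "->", reduced.shape)
```

The spatial size stays 28×28 while the number of feature maps drops from 256 to 64.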

MASK R-CNN

  1. [Using Mask R-CNN for object detection](https://machinelearningmastery.com/how-to-perform-object-detection-in-photographs-with-mask-r-cnn-in-keras/)

Invariance in CNN

  1. Making CNNs shift invariant - “Small shifts -- even by a single pixel -- can drastically change the output of a deep network (bars on left). We identify the cause: aliasing during downsampling. We anti-alias modern deep networks with classic signal processing, stabilizing output classifications (bars on right). We even observe accuracy increases (see plot below).”

MAX AVERAGE POOLING

Intuitions to the differences between max and average pooling:

  1. A max-pool layer compresses by taking the maximum activation in a block. If you have a block with mostly small activations, but a small bit of large activation, you will lose the information on the low activations. I think of this as saying "this type of feature was detected in this general area".
  2. A mean-pool layer compresses by taking the mean activation in a block. If large activations are balanced by negative activations, the overall compressed activations will look like no activation at all. On the other hand, you retain some information about low activations in the previous example.
  3. MAX pooling, in other words: max pooling roughly means that only those features that are most strongly triggering outputs are used in the subsequent layers. You can look at it a little like focusing the network’s attention on what’s most characteristic for the image at hand.
  4. GLOBAL AVERAGE pooling: In the last few years, experts have turned to global average pooling (GAP) layers to minimize overfitting by reducing the total number of parameters in the model. Similar to max pooling layers, GAP layers are used to reduce the spatial dimensions of a three-dimensional tensor. However, GAP layers perform a more extreme type of dimensionality reduction.
  5. Hinton’s controversial thoughts on pooling
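The first two intuitions can be checked on a tiny hand-made feature map (NumPy sketch; `pool2d` is my own helper, not a library function). The left block has mostly small activations plus one large one, the right block has large positive and negative activations that cancel out:

```python
import numpy as np

def pool2d(feature_map, size, reducer):
    """Non-overlapping size x size pooling of a 2D feature map."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % size, :w - w % size]
    blocks = cropped.reshape(h // size, size, w // size, size)
    return reducer(blocks, axis=(1, 3))

fm = np.array([[0.1, 0.2, 5.0, -5.0],
               [0.1, 9.0, 4.0, -4.0]])

max_pooled = pool2d(fm, 2, np.max)
mean_pooled = pool2d(fm, 2, np.mean)
print(max_pooled)   # [[9. 5.]]
print(mean_pooled)  # left block 2.35, right block 0.0

# Global average pooling: one value per channel of an (H, W, C) tensor
tensor = np.arange(24, dtype=float).reshape(2, 3, 4)
gap = tensor.mean(axis=(0, 1))
print(gap.shape)    # (4,)
```

Max pooling keeps only the 9.0 and forgets the small activations, while mean pooling makes the right block look like no activation at all.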

Dilated CNN

  1. For improved performance

RESNET, DENSENET UNET

  1. A Medium post on ResNets, DenseNets and UNets (https://medium.com/swlh/resnets-densenets-unets-6bbdbcfdf010) and the trick behind them: skip connections that combine f(x) with x.
Graph Convolutional Networks

Explanation here, with some examples

CAPSULE NEURAL NETS

  1. The solution to CNN’s shortcomings, where features can be identified without relations to each other in an image, i.e. changing the location of body parts will not affect the classification, and changing the orientation of the image will. The promise of capsule nets is that these two issues are solved.
  2. Understanding capsule nets - part 2, there are more parts to the series

Transfer Learning using CNN

  1. To Add keras book chapter 5 (i think)
  2. Mastery on TL using CNN
    1. Classifier: The pre-trained model is used directly to classify new images.
    2. Standalone Feature Extractor: The pre-trained model, or some portion of the model, is used to pre-process images and extract relevant features.
    3. Integrated Feature Extractor: The pre-trained model, or some portion of the model, is integrated into a new model, but layers of the pre-trained model are frozen during training.
    4. Weight Initialization: The pre-trained model, or some portion of the model, is integrated into a new model, and the layers of the pre-trained model are trained in concert with the new model.

VISUALIZE CNN

  1. How to

Recurrent Neural Net (RNN)

**RNN** - a basic NN node with a loop, previous output is merged with current input (using tanh?), for the purpose of remembering history, for time series - to predict the next X based on the previous Y.
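A minimal NumPy sketch of that loop, assuming the common vanilla formulation h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the sizes and random weights are made up:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """inputs: (T, n_in). Returns the hidden state after every time step, (T, n_hidden)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The previous hidden state is merged with the current input through tanh
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4
inputs = rng.standard_normal((T, n_in))
W_xh = rng.standard_normal((n_hidden, n_in)) * 0.5
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.5
b_h = np.zeros(n_hidden)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)
```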


(**What is RNN**?) by Andrej Karpathy - [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), basically a lot of information about RNNs and their use cases:

  • 1 to N = frame captioning
  • N to 1 = classification
  • N to N = predict frames in a movie
  • N\2 with time delay to N\2 = predict supply and demand
  • Vanishing gradient is 100 times worse.
  • Gated networks like LSTM solve vanishing gradient.

    (how to initialize?) Benchmarking RNN networks for text - don't worry about initialization, use normalization and GRU for big networks.

    ** Experimental improvements:

    Ref - ”Simplified RNN, with pytorch implementation” - changing the underlying mechanism in RNNs for the purpose of parallelizing calculation, seems to work nicely in terms of speed, not sure about state of the art results. Controversy regarding said work: another author claims he already proposed these ideas (QRNN) a year before, however it seems like his ideas have also been reviewed as incremental (PixelRNN). It's probably best to read all 3 papers in chronological order and use the most suitable solution.

RNNCELLS - recurrent shop, enables you to build complex rnns with keras. Details on their significance are inside the link

Masking for RNNs - the idea is simple: we want to use variable-length inputs, but although RNNs can process sequences of any length, batched training requires fixed-size inputs, so shorter sequences are padded. A mask of 1’s and 0’s tells the network the real length, i.e. where the information is in the input. Motivation: padded inputs would otherwise contribute to our loss, and we don't want that.

Source 1, source 2,
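A NumPy sketch of the masking idea, using my own helper names and toy numbers: pad the sequences to a common length, build a 1/0 mask, and let only the real time steps contribute to the loss.

```python
import numpy as np

def pad_sequences_with_mask(sequences, pad_value=0.0):
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value)
    mask = np.zeros((len(sequences), max_len))
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s
        mask[i, :len(s)] = 1.0  # 1 = real value, 0 = padding
    return padded, mask

def masked_mse(y_true, y_pred, mask):
    squared_error = (y_true - y_pred) ** 2
    # Padded positions are multiplied by 0, so they add nothing to the loss
    return float((squared_error * mask).sum() / mask.sum())

targets, mask = pad_sequences_with_mask([[1.0, 2.0, 3.0], [4.0]])
predictions = np.array([[1.0, 2.0, 2.0], [4.0, 9.0, 9.0]])

print(mask)
print(masked_mse(targets, predictions, mask))         # 0.25
print(float(np.mean((targets - predictions) ** 2)))   # much larger without the mask
```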

Visual attention RNNS - Same idea as masking but on a window-based cnn. Paper

LSTM

![alt_text](images/image192.png)

  • Hidden state vs cell state - you have to understand this concept before you dive in. I.e., the hidden state is the overall state of what we have seen so far.
    The cell state is a selective memory of the past. The hidden state (h) carries the information about what an RNN cell has seen over time and supplies it to the present time step, so that the loss function depends not just on the data it is seeing at this time instant but also on the data it has seen historically.
  • Illustrated rnn lstm gru
  • Paper - a comparison of many LSTMs variants and they are pretty much the same performance wise
  • Paper - comparison of lstm variants, vanilla is mostly the best, forget and output gates are the most important in terms of performance. Other conclusions in the paper..
  • Mastery on unrolling RNN’s introductory post
  • Mastery on under/over fitting lstms - but makes sense for all types of networks
  • Mastery on return_sequences and return_state in keras LSTM
    • return_sequences returns the hidden state output for each input time step.
    • return_state returns the hidden state output and cell state for the last input time step.
    • return_sequences and return_state can be used at the same time.
  • Mastery on understanding stateful vs stateless, stateful stateless for time series
  • Mastery on timedistributed layer and seq2seq
    • TimeDistributed Layer - used to connect the 3D outputs of LSTMs to dense layers while keeping the time element. Otherwise the output gets flattened when the connection is direct, defeating the purpose of the LSTM. Note: a nice trick, because the size of the dense layer does not grow with the number of time steps. It loops over each time step!
      I.e., TimeDistributed achieves this **trick** by applying the same Dense layer (same weights) to the LSTM's outputs for one time step at a time. In this way, the output layer only needs one connection to each LSTM unit (plus one bias).

      For this reason, the number of training epochs needs to be increased to account for the smaller network capacity. I doubled it from 500 to 1000 to match the first one-to-one example.

    • Sequence Learning Problem

    • One-to-One LSTM for Sequence Prediction

    • Many-to-One LSTM for Sequence Prediction (without TimeDistributed)

    • Many-to-Many LSTM for Sequence Prediction (with TimeDistributed)

  • Mastery on wrapping a CNN-LSTM with TimeDistributed, either as a whole-model wrap or on every layer in the model; the two are equivalent, and the latter is preferred.
  • Mastery on visual examples for sequence prediction
  • Unread - sentiment classification of IMDB movies using Keras and LSTM
  • Very important - how to interpret LSTM neurons in keras
  • LSTM for time-series - (jakob) single point prediction, sequence prediction and shifted-sequence prediction with code.
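
To make the return_sequences / return_state notes above concrete, here is a small NumPy sketch of an LSTM forward pass. It is an illustration under stated assumptions, not Keras code: the function `lstm_forward`, the gate ordering (input, forget, candidate, output) and the random weights are all choices of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, W, U, b, units):
    """Run one LSTM layer over x of shape (time, features).

    W: (features, 4*units), U: (units, 4*units), b: (4*units,)
    Returns every hidden state, plus the final hidden and cell state.
    """
    h = np.zeros(units)  # hidden state: what the cell exposes at each step
    c = np.zeros(units)  # cell state: the selective memory
    hidden_states = []
    for x_t in x:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:units])                 # input gate
        f = sigmoid(z[units:2 * units])        # forget gate
        g = np.tanh(z[2 * units:3 * units])    # candidate values
        o = sigmoid(z[3 * units:])             # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        hidden_states.append(h)
    return np.stack(hidden_states), h, c

rng = np.random.default_rng(0)
time_steps, features, units = 5, 3, 4
x = rng.normal(size=(time_steps, features))
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

# "return_sequences" corresponds to `sequence` (one hidden state per step);
# "return_state" corresponds to `last_h` and `last_c` (final step only).
sequence, last_h, last_c = lstm_forward(x, W, U, b, units)
```

The last row of `sequence` is the same vector as `last_h`, which is why a stacked LSTM needs the full sequence from the layer below while a final classifier usually only needs the last state.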

Stateful vs Stateless: crucial for understanding how to leverage LSTM networks:

  1. A good description on what it is and how to use it.
  2. ML mastery
  3. Philippe Remy on stateful vs stateless, mostly intuition with code, but not 100% clear

Machine Learning mastery:

A good tutorial on LSTM: important notes:

  1. **Scale to [-1, 1]**, because the internal activation in the LSTM cell is tanh.

  2. Stateful - True means the internal states are kept and need to be reset manually; False means stateless. Great info & results HERE, with seeding, with training resets (and not) and predicting resets (and not) - note: empirically matching the shampoo input, network config, etc.

Another explanation/tutorial about stateful LSTM, should be thorough.

  3. What return_sequences and return_state are, and how to use each one and both at the same time.

return_sequences is needed for stacked LSTM layers.

  4. Stacked LSTM - each layer represents a higher level of abstraction in TIME!
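
The scaling note in item 1 can be sketched as a min-max transform whose range is learned from the training series only. The helper `fit_minmax` and the numbers are hypothetical; this is not the tutorial's code.

```python
import numpy as np

def fit_minmax(train, low=-1.0, high=1.0):
    """Return a function that maps the training range onto [low, high]."""
    lo, hi = train.min(), train.max()
    scale = (high - low) / (hi - lo)
    return lambda x: low + (x - lo) * scale

train = np.array([10.0, 20.0, 30.0, 50.0])
test = np.array([40.0, 60.0])

to_unit = fit_minmax(train)
train_scaled = to_unit(train)  # spans exactly [-1, 1]
test_scaled = to_unit(test)    # may fall outside [-1, 1] if test exceeds train
```

Test values outside the training range land outside [-1, 1]; that is expected when the scaler is fitted on training data alone.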

Keras Input shape - a good explanation of the differences between input_shape and dim, and what each one is. It also covers how a layer's inputs and outputs are calculated from the input shape, and the Sequential model vs the functional API model.

A comparison of LSTM/GRU/MGU with batch normalization and various initializations; GRU/Xavier/batch normalization are the best and recommended for RNNs

Benchmarking LSTM variants: - it looks like LSTM and GRU are competitive with the mutated architectures (I believe it's only in pytorch). Adding a bias to the LSTM works (a bias of 1, as recommended in the paper). Generally speaking, there is no conclusive empirical evidence that one type of network is better than the others for all tests, but the mutated networks tend to win over LSTM/GRU variants.

BIAS 1 in Keras - unit_forget_bias: Boolean. If True, add 1 to the bias of the forget gate at initialization. Setting it to true will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.

![image193](images/image193.png)

Validation_split arg - The validation_split argument in Keras is a value between 0 and 1. Keras splits your training set proportionally by this value: the first part is used for training and the second part for validation after each epoch.

This is a nice helper add-on by Keras; in most other Keras examples you have seen, the training and test sets were passed into the fit method after the split was made manually. Having a validation set is a vital step in understanding how well your model is training. Ideally, on a curve, you want your training accuracy to stay close to your validation accuracy; the moment your validation curve falls below your training curve, the alarm bells should go off, since your model is probably over-fitting.

Keras is a wonderful framework for deep learning, and there are many different ways of doing things with plenty of helpers.
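
A rough sketch of what the validation split described above amounts to. As far as I know, Keras takes the validation samples from the end of the arrays, before any shuffling; the helper `split_like_validation_split` is a hypothetical name used only here.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling first."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10).reshape(10, 1)
y = np.arange(10)

(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
```

Because the held-out part is simply the tail of the data, an ordered dataset (for example, sorted by class) should be shuffled before it is passed in.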

Return_sequence: unclear.

sequence.pad_sequences - given maxlen, it either pads a sequence with zeros if it is shorter, or truncates it if it is longer.
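
A pure-Python sketch of the pad-or-truncate behaviour. It pads and truncates at the start of each sequence, which I believe mirrors the documented Keras defaults (`padding='pre'`, `truncating='pre'`); the helper `pad_or_truncate` itself is hypothetical and assumes `maxlen` is at least 1.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    out = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # truncate from the front
        out.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return out

padded = pad_or_truncate([[1, 2], [3, 4, 5, 6, 7], [8, 9, 10]], maxlen=3)
```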

Using batch size for LSTM in Keras

Imbalanced classes? Use class_weights, another explanation here about class_weights and sample_weights.

SKlearn Formula for balanced class weights and why it works, example
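
The balanced heuristic, as documented for scikit-learn's `class_weight='balanced'`, is `n_samples / (n_classes * count_of_class)`. A small stand-alone sketch (the helper name and the 90/10 toy labels are assumptions):

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

It works because every class ends up with the same total weight (count times weight), so the rare class is no longer drowned out in the loss.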

number of units in LSTM

Calculate how many params are in an LSTM layer?
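
For a standard LSTM layer the count follows from its four gates, each with input weights, recurrent weights and a bias. A sketch, assuming no peepholes or projection layers:

```python
def lstm_param_count(input_dim, units):
    """4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

params = lstm_param_count(input_dim=3, units=4)  # 4 * (4 * 7 + 4)
```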

![image194](images/image194.png)

Understanding TimeDistributed in Keras, with a focus on LSTM one-to-one, one-to-many and many-to-many - here TimeDistributed applies a dense layer to the LSTM output at every time step; the LSTM uses return_sequences=True for that purpose.

This tutorial clearly shows how to manipulate input construction, lstm output neurons and the target layer for the purpose of those three problems (1:1, 1:m, m:m).
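
The TimeDistributed idea can be sketched in NumPy as one dense layer, with a single set of weights, applied to every time step of the LSTM output. The function name and shapes below are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def time_distributed_dense(lstm_outputs, kernel, bias):
    """Apply the same Dense layer to each time step.

    lstm_outputs: (batch, time, units)
    kernel: (units, out_dim), bias: (out_dim,)
    """
    steps = [lstm_outputs[:, t, :] @ kernel + bias
             for t in range(lstm_outputs.shape[1])]
    return np.stack(steps, axis=1)  # (batch, time, out_dim)

rng = np.random.default_rng(0)
batch, time_steps, units, out_dim = 2, 5, 4, 1
lstm_outputs = rng.normal(size=(batch, time_steps, units))
kernel = rng.normal(size=(units, out_dim))
bias = np.zeros(out_dim)

y = time_distributed_dense(lstm_outputs, kernel, bias)
n_params = kernel.size + bias.size  # independent of the number of time steps
```

With 4 LSTM units and one output, the layer has 5 parameters no matter how long the sequence is: one connection per LSTM unit plus one bias.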

BIDIRECTIONAL LSTM

(what is?) Wiki - The basic idea of BRNNs is to connect two hidden layers of opposite directions to the same output. By this structure, the output layer can get information from past and future states.

BRNNs are especially useful when the context of the input is needed. For example, in handwriting recognition, the performance can be enhanced by knowledge of the letters located before and after the current letter.

Another explanation- It involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.

It allows you to specify the merge mode, that is, how the forward and backward outputs should be combined before being passed on to the next layer. The options are:

  • 'sum': The outputs are added together.
  • 'mul': The outputs are multiplied together.
  • 'concat': The outputs are concatenated together (the default), providing double the number of outputs to the next layer.
  • 'ave': The average of the outputs is taken.

The default mode is to concatenate, and this is the method often used in studies of bidirectional LSTMs.
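
The four merge modes are simple element-wise or concatenation operations. A NumPy sketch with a hypothetical `merge` helper and toy forward/backward outputs:

```python
import numpy as np

def merge(forward, backward, mode="concat"):
    """Combine forward and backward outputs of shape (time, units)."""
    if mode == "sum":
        return forward + backward
    if mode == "mul":
        return forward * backward
    if mode == "ave":
        return (forward + backward) / 2.0
    if mode == "concat":
        return np.concatenate([forward, backward], axis=-1)
    raise ValueError(f"unknown merge mode: {mode}")

forward = np.array([[1.0, 2.0], [3.0, 4.0]])
backward = np.array([[5.0, 6.0], [7.0, 8.0]])

concat = merge(forward, backward)  # feature dimension doubles: (2, 4)
```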

Another simplified example

BACK PROPAGATION

A great Slide about back prop, on a simple 3 neuron network, with very easy to understand calculations.

UNSUPERVISED LSTM

  1. Paper, paper2, paper3
  2. In keras

GRU

A tutorial about GRU - To solve the vanishing gradient problem of a standard RNN, GRU uses the so-called update gate and reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago without washing it out through time, or to remove information which is irrelevant to the prediction.

  1. **Update gate** helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future.
  2. **Reset gate** is used by the model to decide how much of the past information to forget.
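
A single GRU step can be sketched in NumPy as follows. Biases are omitted, and the final blend uses the convention `z * h_prev + (1 - z) * h_new`; some references swap the roles of `z` and `1 - z`. The weight names in the dictionary `p` are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step without biases; p is a dict of weight matrices."""
    z = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"])  # update gate: how much past to keep
    r = sigmoid(x_t @ p["Wr"] + h_prev @ p["Ur"])  # reset gate: how much past to forget
    h_new = np.tanh(x_t @ p["Wh"] + (r * h_prev) @ p["Uh"])  # candidate state
    return z * h_prev + (1.0 - z) * h_new

rng = np.random.default_rng(0)
features, units = 3, 4
p = {name: rng.normal(size=(features if name[0] == "W" else units, units))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

h = np.zeros(units)
for x_t in rng.normal(size=(6, features)):
    h = gru_step(x_t, h, p)
```

When the update gate is close to 1, the previous state passes through almost unchanged, which is how information from long ago can survive many steps.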

RECURRENT WEIGHTED AVERAGE (RNN-WA)

What is it? A type of cell that converges to higher accuracy faster than LSTM.

It implements attention within the recurrent neural network:

  1. the keras implementation is available at https://github.com/keisuke-nakata/rwa

  2. the whitepaper is at https://arxiv.org/pdf/1703.01253.pdf

![image195](images/image195.png)

QRNN

Potential competitor to the transformer

GRAPH NEURAL NETWORKS (GNN)

  1. Learning on graphs youtube - uriel singer
  2. Benchmarking GNN’s, methodology, git, the works.
  3. Awesome graph classification on github
  4. Octavian in medium on graphs, a really good intro to graph networks, too long to summarize: clevr, mcgraph, regression, classification, embedding on graphs.
  5. Application of graph networks
  6. Recommender systems using GNN, w2v, pytorch w2v, networkx, sparse matrices, matrix factorization, dictionary optimization, part 1 here (how to find product relations, important: creating negative samples)
  7. Transformers are GNN, original: Transformers are graphs, not the typical embedding on a graph, but a more holistic approach to understanding text as a graph.
  8. Cnn for graphs
  9. Starting with GNN
  10. Really good - Basics deep walk and graphsage
  11. Application of gnn
  12. Michael Bronstein’s Central page for Graph deep learning articles on Medium (worth reading)
  13. GAT graph attention networks, paper, examples - The graph attentional layer utilised throughout these networks is computationally efficient (does not require costly matrix operations, and is parallelizable across all nodes in the graph), allows for (implicitly) assigning different importances to different nodes within a neighborhood while dealing with different sized neighborhoods, and does not depend on knowing the entire graph structure upfront, thus addressing many of the theoretical issues with previous approaches.
  14. Medium on Intro, basics, deep walk, graph sage

Deep walk

1. [Git](https://github.com/phanein/deepwalk)
2. [Paper](https://arxiv.org/abs/1403.6652)
3. [Medium ](https://medium.com/@_init_/an-illustrated-explanation-of-using-skipgram-to-encode-the-structure-of-a-graph-deepwalk-6220e304d71b) and medium on [W2v, deep walk, graph2vec, n2v](https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007)

Node2vec

4. [Git](https://github.com/eliorc/node2vec)
5. [Stanford](https://snap.stanford.edu/node2vec/)
6. [Elior on medium](https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef), [youtube](https://www.youtube.com/watch?v=828rZgV9t1g)
7. [Paper](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf)

Graphsage

8. [medium](https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3)

SDNE - structural deep network embedding

9. [medium](https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007)

Diff2vec

10. [Git](https://github.com/benedekrozemberczki/diff2vec)
![image196](images/image196.png)

Splitter

[git](https://github.com/benedekrozemberczki/Splitter), [paper](http://epasto.org/papers/www2019splitter.pdf), “Is a Single Embedding Enough? Learning Node Representations that Capture Multiple Social Contexts”


Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph, a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.

![image197](images/image197.png)

![image198](images/image198.png)

16. [Self clustering graph embeddings](https://github.com/benedekrozemberczki/GEMSEC)

![image199](images/image199.png)

17. [Walklets](https://github.com/benedekrozemberczki/walklets), similar to deep walk but with node skips - lots of improvements, works at scale due to smaller representations, improves results, etc.

SIGNAL PROCESSING NN (FFT, WAVELETS, SHAPELETS)

  1. Fourier Transform - decomposing frequencies
  2. WAVELETS On youtube (4 videos):
    1. Used for denoising, compression, detecting edges, detecting features with various orientations, analysing signal power, detecting and localizing transients and change points in time series data, and finding an optimal signal representation (peaks etc.) in time-frequency analysis of images and data.
    2. Can also be used to reconstruct time and frequencies, analyse images in space, frequency and orientation, and identify coherent time oscillations in time series.
    3. Analyse signal variability and correlation
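
As a small illustration of "decomposing frequencies" with the Fourier transform, the sketch below builds a toy signal from two sine waves and recovers their frequencies and amplitudes with NumPy's real FFT. The sampling rate and the 5 Hz / 20 Hz components are made-up values.

```python
import numpy as np

fs = 100                 # sampling rate in Hz (hypothetical)
t = np.arange(fs) / fs   # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Amplitude per frequency bin; bins are 1 Hz apart for a 1-second window.
spectrum = np.abs(np.fft.rfft(signal)) * 2 / len(t)
freqs = np.fft.rfftfreq(len(t), d=1 / fs)

# The two strongest bins are the two sine components.
top_two = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())
```

The FFT tells us which frequencies are present but not when they occur; wavelets trade some frequency resolution for that time localization.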

HIERARCHICAL RNN

  1. GitHub code

NN-Sequence Analysis

(did not read) [A causal framework for explaining the predictions of black-box sequence-to-sequence models](http://people.csail.mit.edu/tommi/papers/AlvJaa_EMNLP2017.pdf) - can this be applied to other time series prediction?

SIAMESE NETWORKS (one shot)

  1. Siamese CNN, learns a similarity between images, not to classify
  2. Visual tracking, explains contrastive and triplet loss
  3. One shot learning, very thorough, baseline vs siamese
  4. What is triplet loss
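
The two losses mentioned above can be sketched on plain embedding vectors. These are common textbook forms using Euclidean distance; the linked articles may use variants (squared distances, a factor of 1/2), and the margin of 1.0 is an arbitrary choice.

```python
import numpy as np

def contrastive_loss(a, b, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs beyond the margin."""
    d = float(np.linalg.norm(a - b))
    return d ** 2 if same else max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """The anchor should be closer to the positive than to the negative, by a margin."""
    d_pos = float(np.linalg.norm(anchor - positive))
    d_neg = float(np.linalg.norm(anchor - negative))
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 0.1])   # same identity, nearby embedding
negative = np.array([3.0, 4.0])   # different identity, far away

good = triplet_loss(anchor, positive, negative)  # already satisfied
bad = triplet_loss(anchor, negative, positive)   # roles swapped: large loss
```

Neither loss classifies anything; both only shape the distances between embeddings, which is what makes one-shot comparison against a single reference example possible.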

MULTI NETWORKS

  1. Google whitening black boxes using multi nets, segmentation and classification

OPTIMIZING NEURAL NETS

PRUNING / KNOWLEDGE DISTILLATION / LOTTERY TICKET

  1. Awesome Knowledge distillation
  2. Lottery ticket
    1. 1, 2-paper
    2. Uber on Lottery ticket, masking weights retraining
    3. Facebook article and paper
  3. Knowledge distillation 1, 2, 3
  4. Pruning 1, 2
  5. Teacher-student knowledge distillation focusing on Knowledge & Ranking distillation

![image200](images/image200.png)

  6. Deep network compression using teacher student

  7. Lottery ticket on BERT, magnitude vs structured pruning on various metrics, i.e., LT works on BERT.
    The classical Lottery Ticket Hypothesis was mostly tested with unstructured pruning, specifically magnitude pruning (m-pruning) where the weights with the lowest magnitude are pruned irrespective of their position in the model. We iteratively prune 10% of the least magnitude weights across the entire fine-tuned model (except the embeddings) and evaluate on dev set, for as long as the performance of the pruned subnetwork is above 90% of the full model.

    We also experiment with structured pruning (s-pruning) of entire components of BERT architecture based on their importance scores: specifically, we 'remove' the least important self-attention heads and MLPs by applying a mask. In each iteration, we prune 10% of BERT heads and 1 MLP, for as long as the performance of the pruned subnetwork is above 90% of the full model. To determine which heads/MLPs to prune, we use a loss-based approximation: the importance scores proposed by Michel, Levy and Neubig (2019) for self-attention heads, which we extend to MLPs. Please see our paper and the original formulation for more details.
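
The iterative magnitude pruning described above can be sketched with a binary mask over a weight matrix. This is only the pruning step: the evaluation loop that stops once performance drops below 90% of the full model is not modelled, and the random weights are placeholders.

```python
import numpy as np

def magnitude_prune(weights, mask, fraction=0.10):
    """Mask out `fraction` of the still-active weights with the smallest |w|."""
    active = np.flatnonzero(mask)
    n_prune = int(len(active) * fraction)
    if n_prune == 0:
        return mask
    # Rank the active weights by magnitude, regardless of their position.
    order = np.argsort(np.abs(weights.ravel()[active]))
    new_mask = mask.copy().ravel()
    new_mask[active[order[:n_prune]]] = 0
    return new_mask.reshape(mask.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 10))
mask = np.ones_like(weights)

for _ in range(3):  # three pruning rounds: 100 -> 90 -> 81 -> 73 active weights
    mask = magnitude_prune(weights, mask)

pruned_weights = weights * mask
```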

Troubleshooting Neural Nets

(37 reasons, 10 more) - copy-pasted and rewritten here for convenience. It's pretty thorough, but long and extensive; you should have some sort of intuition and not go through all of these. The article itself has much more insight and information than the following list.

The author of the original article suggests turning everything off and then building your network step by step, i.e., “a divide and conquer ‘debug’ method”.

Dataset Issues

  1. Check your input data - for stupid mistakes

  2. Try random input - if the error behaves the same on random data, there is a problem in the net. Debug layer by layer

  3. Check the data loader - input data is possibly broken. Check the input layer.

  4. Make sure input is connected to output - do samples have correct labels, even after shuffling?

  5. Is the relationship between input and output too random? - the inputs are not sufficiently related to the output. It's pretty amorphous, just look at the data.

  6. Is there too much noise in the dataset? - badly labelled datasets.

  7. Shuffle the dataset - useful to counteract order in the DS, always shuffle input and labels together.

  8. Reduce class imbalance - imbalanced datasets may add a bias to class prediction. Balance your classes, your loss, do something.

  9. Do you have enough training examples? - training from scratch? ~1000 images per class, ~probably similar numbers for other types of samples.

  10. **Make sure your batches don’t contain a single label - this is probably something you won’t notice and will waste a lot of time figuring out!** In certain cases shuffle the DS to prevent batches from having the same label.

  11. Reduce batch size - This paper points out that having a very large batch can reduce the generalization ability of the model. However, please note that I found other references that claim a too small batch will impact performance.

  12. Test on well known Datasets
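
Two of the dataset checks above (shuffle inputs and labels together; look for batches with a single label) can be sketched like this. Both helper names and the toy data, which is sorted by class on purpose, are assumptions.

```python
import numpy as np

def shuffle_together(x, y, seed=0):
    """Apply one permutation to both arrays so samples keep their labels."""
    perm = np.random.default_rng(seed).permutation(len(x))
    return x[perm], y[perm]

def single_label_batches(y, batch_size):
    """Indices of batches whose labels are all identical."""
    bad = []
    for i, start in enumerate(range(0, len(y), batch_size)):
        if len(np.unique(y[start:start + batch_size])) == 1:
            bad.append(i)
    return bad

# Toy dataset sorted by class: samples 0-9 are class 0, samples 10-19 are class 1.
x = np.arange(20).reshape(20, 1)
y = np.array([0] * 10 + [1] * 10)

before = single_label_batches(y, batch_size=5)  # every batch is single-label
x_s, y_s = shuffle_together(x, y)
after = single_label_batches(y_s, batch_size=5)  # inspect after shuffling
```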


Data Normalization/Augmentation

  1. Standardize the features - zero mean and unit variance, sounds like normalization.

  2. Do you have too much data augmentation?

Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.

  3. Check the preprocessing of your pretrained model - when using a pretrained model, make sure your input data is in the range the model expects: [0, 1], [-1, 1] or [0, 255]?

  4. Check the preprocessing for train/validation/test set - CS231n points out a common pitfall:

Any preprocessing should be computed ONLY on the training data, then applied to val/test.
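
A minimal NumPy sketch of that rule for standardization: the mean and standard deviation come from the training set only and are then reused for validation data. The toy arrays are made up.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val = np.array([[4.0, 40.0]])

# Statistics are computed on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_std = (train - mean) / std  # zero mean, unit variance per feature
val_std = (val - mean) / std      # same transform, no peeking at val statistics
```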


Implementation issues

  1. Try solving a simpler version of the problem - divide and conquer the prediction, e.g., if you predict both class and box coordinates, start with just one of them.

  2. Look for correct loss “at chance” - calculate the loss at chance level, e.g., with 10 classes (a 10% baseline) it is -ln(0.1) = 2.3, since the softmax loss is the negative log probability. Afterwards, increase the regularization strength, which should increase the loss.

  3. Check your custom loss function.

  4. Verify loss input - parameter confusion.

  5. Adjust loss weights - If your loss is composed of several smaller loss functions, make sure their magnitudes relative to each other are correct. This might involve testing different combinations of loss weights.

  6. Monitor other metrics - like accuracy.

  7. Test any custom layers, debugging them.

  8. Check for “frozen” layers or variables - accidentally frozen?

  9. Increase network size - more layers, more neurons.

  10. Check for hidden dimension errors - an input shaped like (64, 64, 64) makes it easy to confuse dimensions.

  11. Explore Gradient checking - does your backprop work for custom gradients? 1 2 3.
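
Two of the checks above are easy to sketch: the expected softmax loss at chance level, and a numerical gradient check against a hand-written gradient. The cubic "custom loss" and its gradient are hypothetical stand-ins for whatever you implemented.

```python
import math
import numpy as np

def chance_level_loss(n_classes):
    """Cross-entropy of a uniform guess over n_classes."""
    return -math.log(1.0 / n_classes)

def numerical_grad(f, w, eps=1e-5):
    """Central-difference estimate of df/dw, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = eps
        grad.flat[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Hypothetical custom loss and the gradient we derived by hand for it.
def loss(w):
    return float(np.sum(w ** 3))

def analytic_grad(w):
    return 3 * w ** 2

w = np.array([0.5, -1.0, 2.0])
max_diff = float(np.max(np.abs(numerical_grad(loss, w) - analytic_grad(w))))
ten_class_baseline = chance_level_loss(10)  # about 2.3
```

If the first training losses are far from the chance-level value, or the numerical and analytic gradients disagree noticeably, the bug is likely in the loss or backprop code rather than in the data.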


Training issues

  1. Solve for a really small dataset - can you overfit on 2 samples?

  2. Check weights initialization - Xavier or He or forget about it for networks such as RNN.

  3. Change your hyperparameters - grid search

  4. Reduce regularization - too much may cause underfitting; try reducing dropout, batch norm, and weight/bias L2 regularization.

  5. Give it more training time as long as the loss is decreasing.

  6. Switch from Train to Test mode - not clear.

  7. Visualize the training - activations, weights, layer updates, biases. Tensorboard and Crayon. Tips on Deeplearning4j. Expect a gaussian distribution for weights; biases start at 0 and end up almost gaussian. Keep an eye out for parameters that are diverging to +/- infinity. **Keep an eye out for biases that become very large.** This can sometimes occur in the output layer for classification if the distribution of classes is very imbalanced.

  8. Try a different optimizer, Check this excellent post about gradient descent optimizers.

  9. Exploding / Vanishing gradients - Gradient clipping may help. Tips on: Deeplearning4j: “A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations.”

  10. Increase/Decrease Learning Rate, or use adaptive learning

  11. Overcoming NaNs, a big issue for RNNs - decrease the LR, read up on how to deal with NaNs, and evaluate layer by layer to see where they first appear.
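
Gradient clipping and the layer-by-layer NaN hunt can be sketched in NumPy. Clipping by global norm is one common variant; the helper names, layer names and values are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def first_non_finite(named_arrays):
    """Name of the first array containing NaN or inf, or None if all are finite."""
    for name, arr in named_arrays:
        if not np.all(np.isfinite(arr)):
            return name
    return None

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

# Walk the layers in forward order to find where the NaN first shows up.
activations = [("dense_1", np.array([0.1, 0.2])),
               ("lstm_1", np.array([np.nan, 0.3])),
               ("dense_2", np.array([0.5]))]
culprit = first_non_finite(activations)
```

Clipping keeps the direction of the update and only limits its size, so it addresses exploding gradients but does nothing for vanishing ones.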

EMBEDDINGS

TOOLS

  1. Name-Entity Recognition (NER): It can recognise whether a word represents a person, location or names in the text.
  2. Parts-of-Speech Tagging (PoS): Tags all the words in the given text as to which “part of speech” they belong to.
  3. Text Classification: Classifying text based on the criteria (labels)
  4. Training Custom Models: Making our own custom models.
  5. It comprises popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, Character Embeddings, etc. These are very easy to use thanks to the Flair API
  6. Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn leads to a significant uptick in results
  7. ‘Flair Embedding’ is the signature embedding provided within the Flair library. It is powered by contextual string embeddings. We’ll understand this concept in detail in the next section
  8. Flair supports a number of languages – and is always looking to add new ones

HUGGING FACE

  1. git

LANGUAGE EMBEDDINGS

>>>>> gd2md-html alert: inline image link here (to images/image201.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(