This list was compiled while preparing for AI Residency programs at Google, Facebook, Microsoft, Uber, etc. The questions have been borrowed from https://medium.com/subhrajit-roy/cracking-the-machine-learning-interview-1d8c5bb752d8.
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
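As a minimal NumPy sketch (with made-up shapes), adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the vector across every row:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)   # shape (4, 3)
b = np.array([10, 20, 30])        # shape (3,)

# b is stretched ("broadcast") across each row of a without being copied
result = a + b
print(result.shape)               # (4, 3)
```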
Scalar: a single number. Vector: a one-dimensional list of values (a rank-1 tensor). Matrix: a two-dimensional array of values (a rank-2 tensor). Tensor: a multi-dimensional array of rank n.
Hadamard product is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands where each element i, j is the product of elements i, j of the original two matrices.
The inverse of a matrix A is a matrix that, when multiplied by A, results in the identity. Invertible matrices have connections back to systems of equations and to other concepts like linear independence or dependence.
Step 1: calculating the Matrix of Minors, Step 2: then turn that into the Matrix of Cofactors, Step 3: then the Adjugate, and Step 4: multiply that by 1/Determinant.
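In practice you would rarely carry out the cofactor expansion by hand; a quick NumPy sketch (with an arbitrary 2x2 example) verifies the result of `np.linalg.inv` against the defining property A·A⁻¹ = I:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)                   # only valid when det(A) != 0
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A times its inverse gives the identity
```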
6. What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?
The determinant is a scalar value that can be computed from the elements of a square matrix and encodes certain properties of the linear transformation described by the matrix. Pick any row or column in the matrix. Multiply every element in that row or column by its cofactor and add. The result is the determinant. If A is an n × n matrix, then the sum of the n eigenvalues of A is the trace of A and the product of the n eigenvalues is the determinant of A.
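A short NumPy check of that connection, using an arbitrary example matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals = np.linalg.eigvals(A)
print(np.isclose(np.prod(eigvals), np.linalg.det(A)))  # product of eigenvalues equals the determinant
print(np.isclose(np.sum(eigvals), np.trace(A)))        # sum of eigenvalues equals the trace
```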
If we use a linearly dependent set to construct a span, then we can always create the same infinite set with a starting set that is one vector smaller in size. However, this will not be possible if we build a span from a linearly independent set. So in a certain sense, using a linearly independent set to formulate a span is the best possible way — there are not any extra vectors being used to build up all the necessary linear combinations.
If and only if b is a linear combination of the columns of A
The homogeneous system Ax = 0 always has the trivial solution x = 0. The claim that it has infinitely many solutions is true for fat (more columns than rows) and singular square matrices, but fails for a nonsingular square matrix, where x = 0 is the only solution.
The inverse of a matrix exists when the matrix is invertible, i.e., when its determinant is non-zero: det(A) ≠ 0, where A is the matrix of interest.
The L1 norm is the sum of the magnitudes of a vector's components. As a distance between vectors it is the sum of the absolute differences of their components, which is a very natural way to measure distance. The L2 norm gives the shortest (Euclidean) distance from one point to another. The L-infinity norm gives the largest magnitude among the elements of a vector.
If the norm of x is greater than 0, then x is not the zero vector; the norm equals 0 if and only if x is the zero vector.
Raising the power to the cube (L3), fourth power (L4), or higher makes the norm increasingly sensitive to outliers and thus introduces unwanted bias into the distance calculation. The squared L2 norm is easier to work with computationally; for instance, its derivative with respect to each element depends only on that element.
L1 is used for feature selection, dealing with sparsity and has less computational cost.
Yes, although it is actually not a norm. It corresponds to the total number of nonzero elements in a vector.
The Frobenius norm is a matrix norm defined as the square root of the sum of the absolute squares of the matrix's elements.
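A small NumPy sketch computing the norms mentioned above on an arbitrary vector and matrix:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # L1 norm: sum of absolute values -> 7.0
print(np.linalg.norm(x, 2))        # L2 norm: Euclidean length -> 5.0
print(np.linalg.norm(x, np.inf))   # L-infinity norm: largest magnitude -> 4.0
print(np.count_nonzero(x))         # "L0": number of nonzero elements -> 2
print(np.linalg.norm(M, 'fro'))    # Frobenius norm of a matrix
```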
A matrix having non-zero elements only in the diagonal running from the upper left to the lower right.
18. Why is multiplication by a diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrices?
The inverse of a diagonal matrix is obtained by replacing each element on the diagonal with its reciprocal; the product of the matrix and its inverse is then the identity matrix I.
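Regarding why multiplication by a diagonal matrix is cheap: it reduces to scaling each row (or column) by the corresponding diagonal entry, so no full matrix product is needed. A short NumPy sketch with arbitrary values:

```python
import numpy as np

d = np.array([2.0, 3.0, 4.0])
A = np.random.rand(3, 5)

# Multiplying by diag(d) is just a row-wise scaling of A
full = np.diag(d) @ A
cheap = d[:, None] * A
print(np.allclose(full, cheap))    # True

# The inverse of a diagonal matrix: take the reciprocal of each diagonal entry
print(np.allclose(np.linalg.inv(np.diag(d)), np.diag(1.0 / d)))  # True
```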
A symmetric matrix is a square matrix that is equal to its transpose.
A vector which has a magnitude of one.
Two vectors are orthogonal if and only if their dot product is zero, i.e. they make an angle of 90° (π/2 radians), or one of the vectors is zero.
Given several non-zero vectors, if they are mutually orthogonal, then they are linearly independent. So their number cannot exceed n if they are in ℝ^n.
To summarize, for a set of vectors to be orthogonal, they should be mutually perpendicular to each other.
An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors), so that AᵀA = AAᵀ = I, where I is the identity matrix. It is of interest because it is invertible and its inverse is simply its transpose, which is very cheap to compute.
Eigendecomposition is one of the most widely used kinds of matrix decomposition, in which we decompose a square matrix into a set of eigenvectors and eigenvalues. An eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.
- How to find the eigenvalues of a matrix?
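The eigenvalues are the roots of the characteristic polynomial det(A − λI) = 0; in practice a routine such as `np.linalg.eig` is used. A sketch with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each eigenpair satisfies A v = lambda * v
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))   # True
```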
Model selection refers to the problem of selecting a few representative models from a large set of computational models for the purpose of decision making or optimization under uncertainty.
Bias refers to the simplifying assumptions made by the model to make the target function easier to approximate. Variance is the amount that the estimate of the target function will change given different training data. The trade-off is the tension between the error introduced by the bias and the error introduced by the variance.
An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.
Cross validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate overfitting. It is also useful for determining the hyperparameters of your model, i.e., which values will result in the lowest test error.
Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.
Hold-out is when you split up your dataset into a 'train' and 'test' set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data.
K-fold cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. Advantages: (1) There is no randomness in which observations end up being used only for training, since every observation is used for both training and validation across the k folds. (2) As the validation set is larger than in LOOCV, there is less variability in the test-error estimate, since more observations are used for each iteration's prediction. Disadvantages: the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point.
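A minimal scikit-learn sketch of both procedures, assuming the iris dataset and a logistic-regression model purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: 5 model fits, each validated on 1/5 of the data
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# LOOCV: N model fits, each validated on a single held-out point
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```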
Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.
There are two main types of feature selection techniques: wrapper and filter methods. Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable; these scores are then used to filter out the least relevant features.
Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature that best improves our model, until the addition of a new variable no longer improves the performance of the model.
Backward selection starts with all features contained in the dataset. It then runs a model and calculates a p-value associated with the t-test or F-test of the model for each feature. The feature with the largest insignificant p-value will then be removed from the model, and the process starts again.
Filter method: selecting a subset of features by a measure other than error (a measure that is inherent to the feature and not dependent on a model).
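A short scikit-learn sketch of a filter method, assuming the breast-cancer dataset and mutual information as the (model-independent) scoring measure:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature against the target and keep only the top 10
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, X_reduced.shape)   # (569, 30) (569, 10)
```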
Mutual information is a symmetric measure, i.e. I(X;Y) = I(Y;X). However, KL divergence is an asymmetric measure: D(P||Q) ≠ D(Q||P).
Very often in probability and statistics we'll replace observed data or a complex distribution with a simpler, approximating distribution. KL divergence helps us to measure just how much information we lose when we choose an approximation.
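A small sketch of the discrete KL divergence, using two made-up distributions to show the asymmetry mentioned above:

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions (assumes p_i, q_i > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # D(P || Q)
print(kl_divergence(q, p))   # D(Q || P) -- a different value: KL divergence is asymmetric
```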
1. How can the SVM optimization function be derived from the logistic regression optimization function?
SVM tries to maximize the margin between the closest support vectors, while LR maximizes the posterior class probability. Thus, SVM finds a solution that is as far as possible from both categories, while LR does not have this property.
The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice.
It maximizes the margin of the hyperplane. This is the best hyperplane because it reduces the generalization error the most. If we add new data, the Maximum Margin Classifier is the best hyperplane to correctly classify the new data.
SVM is not very robust to outliers; the presence of a few outliers can lead to very bad global misclassification.
C is a regularization parameter that controls the trade-off between achieving a low training error and a low testing error, that is, the ability of your classifier to generalize to unseen data.
The weight vector theta has to be perpendicular to the decision boundary, because the goal here is to maximize the margin. To see this, think of calculating the distance of a point from a line: it is done by projecting the point onto the direction perpendicular to that line.
SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. These functions can be of different types, for example linear, polynomial, radial basis function (RBF), and sigmoid.
Given two vectors, the similarity is the length of the projection of one vector onto the other. Another interesting kernel example is the Gaussian kernel.
The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x and x′ in the input space X, certain functions k(x, x′) can be expressed as an inner product ⟨φ(x), φ(x′)⟩ in another space V. The function k is often referred to as a kernel or a kernel function.
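A worked sketch of the idea with the degree-2 polynomial kernel k(x, z) = (xᵀz)²: the kernel value equals the inner product of explicit quadratic feature maps, but is computed without ever constructing those features (the vectors below are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated directly in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # map explicitly, then take the inner product -> 16.0
print(poly_kernel(x, z))        # same value with no explicit mapping -- the kernel trick
```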
SVM tries to find the widest possible separating margin, while Logistic Regression optimizes the log likelihood function, with probabilities modeled by the sigmoid function.
The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by tuning the C parameter, which controls the number of margin violations allowed in the training data: allowing more violations increases the bias but decreases the variance.
So, you can typically expect SVM to perform marginally better than logistic regression: SVM tries to maximize the margin between the closest support vectors, while LR maximizes the posterior class probability, so SVM finds a solution that is as far as possible from both categories, a property LR does not have.
The key difference between Bayesian and frequentist approaches lies in the definition of a probability. If it is necessary to treat probabilities strictly as a long-run frequency, then frequentist approaches are reasonable; if not, a Bayesian approach can be used.
MLE gives you the value which maximises the Likelihood P(D|θ). And MAP gives you the value which maximises the posterior probability P(θ|D).
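A tiny worked sketch, assuming coin flips with a Beta(2, 2) prior on the heads probability (numbers are made up): the MLE is the raw frequency, while the MAP estimate is pulled toward the prior:

```python
# 7 heads out of 10 tosses, Beta(alpha, beta) prior on theta
heads, n = 7, 10
alpha, beta = 2.0, 2.0

mle = heads / n                                              # argmax P(D | theta)
map_estimate = (heads + alpha - 1) / (n + alpha + beta - 2)  # argmax P(theta | D) under a Beta prior

print(mle)           # 0.70
print(map_estimate)  # ~0.67, shrunk toward the prior mean of 0.5
```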
Bayesian regularization is a mathematical process that converts a nonlinear regression into a "well-posed" statistical problem in the manner of a ridge regression.
- Use of priors
- Bayesian analysis tells us both how likely version A is to be the winner and by how much
- Bayesian analysis can be more robust to outliers, by using more flexible distributions
- Bayesian analysis does not require picking a bunch of thresholds in advance to formulate a valid hypothesis
The L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients. The L1 regularization will shrink some parameters to zero, so some variables will not play any role in the model; L1 regularization can thus be seen as a way to select features in a model.
The L2 regularization adds a penalty equal to the sum of the squared value of the coefficients. The L2 regularization will force the parameters to be relatively small, the bigger the penalization, the smaller (and the more robust) the coefficients are.
The difference between L1 and L2 is just that L2 penalizes the sum of the squares of the weights, while L1 penalizes the sum of the absolute values of the weights. L1 helps perform feature selection in sparse feature spaces; feature selection is about knowing which features are helpful and which are redundant. L2 has one very important advantage over L1, and that is invariance to rotation and scale. This is especially important in geographical / physical applications.
The reason for using the L1 norm to find a sparse solution is due to its special shape. It has spikes that happen to be at sparse points. Using it to touch the solution surface very likely finds a touch point on a spike tip and thus a sparse solution.
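A hedged scikit-learn sketch on synthetic data (arbitrary sizes and penalty strength) showing that the L1 penalty zeroes out coefficients while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 20 features are truly informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print(np.sum(lasso.coef_ == 0))      # several coefficients driven exactly to zero
print(np.sum(ridge.coef_ == 0))      # typically zero: coefficients are shrunk, not eliminated
```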
Dropout is a technique used to prevent a model from overfitting. Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase.
In forward propagation, inputs are set to zero with probability p, and otherwise scaled up by 1/(1 − p). In backward propagation, gradients for the same dropped units are zeroed out; other gradients are scaled up by the same 1/(1 − p).
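A minimal NumPy sketch of inverted dropout as described above (hypothetical helper functions, not tied to any particular framework):

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    """Zero each unit with probability p and scale the survivors by 1 / (1 - p)."""
    if not training:
        return x, None
    mask = (np.random.rand(*x.shape) > p) / (1.0 - p)
    return x * mask, mask

def dropout_backward(grad_out, mask):
    """Gradients of dropped units are zeroed; the rest are scaled by the same factor."""
    return grad_out * mask

activations = np.random.randn(4, 8)
out, mask = dropout_forward(activations, p=0.5)
grads = dropout_backward(np.ones_like(out), mask)
```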
Sensitivity, specificity, accuracy, and ROC are the terms most commonly associated with a binary classification test, and they statistically measure the performance of the test.
Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
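A quick scikit-learn sketch with made-up labels, showing how both metrics come out of the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision = TP / (TP + FP)
print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall    = TP / (TP + FN)
```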
A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
A non-convex optimization problem is any problem where the objective or any of the constraints are non-convex, as pictured below. Such a problem may have multiple feasible regions and multiple locally optimal points within each region.
The EM algorithm is a general procedure to estimate the parameters in a model with latent (unobserved) factors.
Gaussian mixture models are a probabilistic model for representing normally distributed subpopulations within an overall population.
It works by choosing random values for the missing data points, and using those guesses to estimate a second set of data. The new values are used to create a better guess for the first set, and the process continues until the algorithm converges on a fixed point.
E-step: perform probabilistic assignments of each data point to some class based on the current hypothesis h for the distributional class parameters; M-step: update the hypothesis h for the distributional class parameters based on the new data assignments.
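A short scikit-learn sketch: `GaussianMixture` runs exactly this E-step/M-step loop internally, here on a made-up mixture of two one-dimensional Gaussians:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 0.5, 300),
                       rng.normal(3, 1.0, 700)]).reshape(-1, 1)

# EM alternates soft assignments (E-step) and parameter updates (M-step) until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_.ravel())   # close to the true component means (-2 and 3)
print(gmm.weights_)         # close to the true mixing proportions (0.3 and 0.7)
```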
It reduces the time and storage space required. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model. It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
PCA will NOT consider the response variable, only the variance of the independent variables. Logistic regression will consider how each independent variable impacts the response variable.
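A minimal scikit-learn sketch, assuming the iris dataset purely for illustration; note that the labels are never passed to PCA:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)         # 4 features; the response variable is ignored

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # project onto the top-2 principal components

print(X.shape, X_2d.shape)                # (150, 4) (150, 2)
print(pca.explained_variance_ratio_)      # fraction of the variance retained by each component
```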
Frequentists use probability only to model certain processes broadly described as "sampling." Bayesians use probability more widely to model both sampling and other kinds of uncertainty.
A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment's outcomes.
A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence.
A probability mass function (pmf) is a function used to describe the probability associated with a discrete random variable.
A probability density function (PDF) is a statistical expression that defines the probability distribution (the relative likelihood of outcomes) of a continuous random variable.
A joint distribution is the probability distribution of two or more random variables considered together. Furthermore, the strength of any relationship between the variables can be measured from it.
All probabilities are non-negative: f_X(x) ≥ 0. Any event in the distribution (e.g. “scoring between 20 and 30”) has a probability of happening between 0 and 1 (i.e. 0% and 100%), and the density integrates to 1 over its whole domain: ∫ f_X(x) dx = 1.
The absolute likelihood of a continuous random variable taking on any particular value is 0 (since there is an infinite set of possible values to begin with). Instead, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would equal one sample compared to the other.
Joint probability is the probability of two events occurring simultaneously. Marginal probability is the probability of an event irrespective of the outcome of another variable.
10. What is conditional probability? Given the joint probability function, how will you calculate it?
The conditional probability P(A given B) is the probability of event A given that event B has occurred; from the joint probability it is calculated as P(A | B) = P(A and B) / P(B). The joint probability is symmetrical, meaning that P(A and B) is the same as P(B and A).
The chain rule permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities; for example, P(A, B, C) = P(A | B, C) · P(B | C) · P(C).
If two random variables, X and Y, are independent, they satisfy the following conditions. P(x|y) = P(x), for all values of X and Y. P(x ∩ y) = P(x) * P(y), for all values of X and Y.
The expectation describes the average value and the variance describes the spread (amount of variability) around the expectation. Covariance refers to the measure of the directional relationship between two random variables.
Covariance can be positive, zero, or negative. If X and Y are independent variables, then their covariance is 0: Cov(X, Y) = E(XY) − µ_X µ_Y = E(X)E(Y) − µ_X µ_Y = 0. The converse, however, is not always true.
A covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.
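A short NumPy sketch with made-up data (rows are variables, which is `np.cov`'s default convention):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(3, 100)          # 3 variables, 100 observations each
X[1] += 0.8 * X[0]             # make variables 0 and 1 positively related

C = np.cov(X)                  # 3x3 covariance matrix
print(C.shape)                 # (3, 3)
print(C[0, 1] > 0)             # True: positive covariance between variables 0 and 1
```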
16. What is a Bernoulli distribution? Calculate the expectation and variance of a random variable that follows Bernoulli distribution?
The Bernoulli distribution is a discrete distribution having two possible outcomes, labelled n = 0 and n = 1, in which n = 1 ("success") occurs with probability p and n = 0 ("failure") occurs with probability q = 1 − p. Its expectation is E[n] = p and its variance is Var(n) = p(1 − p).
It is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified.
The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.
There is no reason to use the Normal distribution as a default prior for a real parameter.
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement , then the distribution of the sample means will be approximately normally distributed.
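A quick simulation sketch (arbitrary sample size and population): even for a strongly skewed population, the distribution of sample means is approximately normal with standard deviation σ/√n:

```python
import numpy as np

rng = np.random.RandomState(0)

# A heavily skewed population, far from normal
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 50, drawn with replacement
sample_means = [rng.choice(population, size=50, replace=True).mean() for _ in range(5_000)]

print(np.mean(sample_means))   # close to the population mean (2.0)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2 / sqrt(50) ~= 0.28
```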
It is the distribution of differences between two independent variates with identical exponential distributions.
It is used to model the density of an idealized point mass or point charge as a function equal to zero everywhere except for zero and whose integral over the entire real line is equal to one.
A mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized.
Empirical and Gaussian Mixture
A property holds almost everywhere if it holds for all elements in a set except a subset of measure zero, or equivalently, if the set of elements for which the property holds is conull.
Self-information is defined as the amount of information that knowledge about (the outcome of) a certain event, adds to someone's overall knowledge. The amount of self-information is expressed in the unit of information: a bit.
Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions.
Kullback–Leibler divergence is a measure of how one probability distribution is different from a second, reference probability distribution.
Although the KL divergence measures the “distance” between two distributions, it is not a distance measure, because it is not a metric. In particular, it is not symmetric: the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).
Cross-entropy is commonly used in machine learning as a loss function. Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating the difference between two probability distributions.
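A tiny sketch of the discrete cross-entropy, with a one-hot label and made-up predicted probabilities (the usual multi-class classification loss):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * log(q_i): expected code length when q is used to encode samples from p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

y_true = np.array([0.0, 1.0, 0.0])   # one-hot true label
y_pred = np.array([0.1, 0.7, 0.2])   # predicted class probabilities

print(cross_entropy(y_true, y_pred))  # -log(0.7) ~= 0.357
```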
A structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables.
Sample Mean is the mean of sample values collected. Population Mean is the mean of all the values in the population. If the sample is random and sample size is large then the sample mean would be a good estimate of the population mean.
The formula we use for standard deviation depends on whether the data is being considered a population of its own, or the data is a sample representing a larger population. If the data is being considered a population on its own, we divide by the number of data points, N. If the data is a sample from a larger population, we divide by one fewer than the number of data points in the sample, n − 1.
Step 1: Calculate the mean of the data Step 2: Subtract the mean from each data point. Step 3: Square each deviation to make it positive. Step 4: Add the squared deviations together. Step 5: Divide the sum by one less than the number of data points in the sample.
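The same steps in a short NumPy sketch with made-up data, alongside the built-in equivalents:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

deviations = data - data.mean()                            # steps 1-2
sample_var = np.sum(deviations ** 2) / (len(data) - 1)     # steps 3-5: divide by n - 1
print(np.sqrt(sample_var))                                 # sample standard deviation ~= 2.14

print(np.std(data, ddof=1))   # same value via NumPy (sample)
print(np.std(data, ddof=0))   # population version: divide by N -> 2.0
```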
A Confidence interval is a range of values that likely would contain an unknown population parameter. Confidence level refers to the percentage of probability, or certainty, that the confidence interval would contain the true population parameter when you draw a random sample many times.
The standard error is a statistical term that measures the accuracy with which a sample distribution represents a population by using standard deviation.
1. The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions).
In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. This prior states that the function we learn should not change very much within a small region.
An autoencoder is an unsupervised artificial neural network used for learning efficient codings of the input data; it is used for reducing the dimensions of the dataset while learning how to ignore noise.
2. What were Autoencoders traditionally used for? Why has there been a resurgence of Autoencoders for generative modeling?
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. They are a self-supervised technique for representation learning, where the network learns about its input so that it can reconstruct, or generate, data resembling that input.
Cross Entropy, KL divergence
5. What is a linear autoencoder? Can it be optimal (lowest training reconstruction error)? If yes, under what conditions?
PCA is essentially a linear transformation, but autoencoders are capable of modelling complex nonlinear functions. PCA is faster and computationally cheaper than autoencoders.
7. What is the impact of the size of the hidden layer in Autoencoders?
The goal of an autoencoder is to learn a good representation of the input which means that it should not be simply learning the identity function. One way to enforce this is by making the number of hidden units less than the number of features. There are also other constraints that work very well in practice, which allow you to have more hidden units than features potentially (Sparse Autoencoder, Denoising Autoencoder, Contractive Autoencoder, etc)
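A minimal undercomplete autoencoder sketch in PyTorch, with hypothetical sizes (784-dimensional input, 32-unit bottleneck) and random stand-in data; with the ReLU removed and a squared-error loss, such a linear autoencoder learns the same subspace that PCA spans, which relates to the linear-autoencoder question above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=784, n_hidden=32):
        super().__init__()
        # A bottleneck smaller than the input forces a compressed representation
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)                  # stand-in batch of inputs
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)           # reconstruction error against the input itself
    loss.backward()
    optimizer.step()
```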
Representation learning is learning representations of input data typically by transforming it, that makes it easier to perform a task like classification or prediction.
In deep learning, the representations are formed by composition of multiple non-linear transformations of the input data with the goal of yielding abstract and useful representations for tasks like classification, prediction etc.
In one shot learning, you get only 1 or a few training examples in some categories. In zero shot learning, you are not presented with every class label in training. So in some categories, you get 0 training examples.
5. What is greedy layer-wise unsupervised pretraining (GLUP)? Why greedy? Why layer-wise? Why unsupervised? Why pretraining?
Greedy layer-wise pretraining is called so because it optimizes one layer at a time, greedily: each new layer is trained on its own while the previously trained layers are kept fixed. After unsupervised training, there is usually a fine-tuning stage, when a joint supervised training algorithm is applied to all the layers.
Greedy layer-wise pretraining provides a way to develop deep multi-layered neural networks whilst only ever training shallow networks. Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model.
Unsupervised pre-training initializes the model to a point in parameter space that somehow renders the optimization process more effective, in the sense of achieving a lower minimum of the empirical cost function.
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data and the system's algorithms act on the data without prior training. The output is dependent upon the coded algorithms. Unsupervised learning algorithms can perform more complex processing tasks than supervised learning systems.
Unsupervised pre-training sets the parameter in a region from which better basins of attraction can be reached, in terms of generalization. Hence, although unsupervised pre-training is a regularizer, it can have a positive effect on the training objective when the number of training examples is large.
10. What is the disadvantage of unsupervised pretraining compared to other forms of unsupervised learning?
You cannot get precise information regarding data sorting, since the data used in unsupervised learning is unlabeled and the desired output is not known. The results are less accurate because the input data is not known and not labeled by people in advance.
Unsupervised pre-training tries to learn a representation g(X) from X. g(X) might result in a sparse representation of the data (e.g. PCA, non-linear dimensionality reduction etc). As a result, the prediction function f(g(X)) now is further restricted.
A deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states.
A Las Vegas algorithm is a randomized algorithm that always gives correct results; that is, it either produces the correct result or it reports failure.
An approximation algorithm is a way of dealing with NP-completeness for an optimization problem. This technique does not guarantee the best solution. The goal of an approximation algorithm is to come as close as possible to the optimum value in a reasonable amount of time, which is at most polynomial time.
A Monte Carlo algorithm is a randomized algorithm whose output may be incorrect with a certain (typically small) probability.
@misc{Abhinav:2020,
Author = {Abhinav Sagar},
Title = {Machine Learning Interview Questions and Answers},
Year = {2020},
Publisher = {GitHub},
Journal = {GitHub repository},
Howpublished = {\url{https://github.com/abhinavsagar/machine-learning-interview}}
}
If you want to stay updated with my latest articles and projects, follow me on Medium. These are some of my contact details: