1704 Machine Learning, Data Science & Python Interview Questions (ANSWERED) To Land Your Next Six-Figure Job Offer from MLStack.Cafe
MLStack.Cafe is the biggest hand-picked collection of top Machine Learning, Data Science, Python and Coding interview questions for junior and experienced data analysts, machine learning engineers/developers and data scientists, with more than 1704 ML & DS interview questions and answers. Prepare for your next ML, DS & Python interview and land a six-figure job offer in no time.
Get All 1704 Answers + PDFs + Latex Math on MLStack.Cafe - Kill Your ML, DS & Python Interview
Hiring Data Analysts, Machine Learning Engineers or Developers? Post your Job on MLStack.Cafe and reach thousands of motivated engineers who are looking for an ML job right now!
- Anomaly Detection
- Autoencoders
- Bias & Variance
- Big Data
- Big-O Notation
- Classification
- Clustering
- Cost Function
- Data Structures
- Databases
- Datasets
- Decision Trees
- Deep Learning
- Dimensionality Reduction
- Ensemble Learning
- Genetic Algorithms
- Gradient Descent
- K-Means Clustering
- K-Nearest Neighbors
- Linear Algebra
- Linear Regression
- Logistic Regression
- Machine Learning
- Model Evaluation
- Natural Language Processing
- Naïve Bayes
- Neural Networks
- NumPy
- Optimization
- Pandas
- Probability
- Python
- Random Forests
- SQL
- SVM
- Scikit-Learn
- Searching
- Sorting
- Statistics
- Supervised Learning
- TensorFlow
- Unsupervised Learning
[⬆] Anomaly Detection Interview Questions
Anomaly detection (or outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
Source: towardsdatascience.com
- The goal of anomaly detection is to identify cases that are unusual within data that is seemingly comparable; hence, anomaly detection can be used effectively as a tool for risk mitigation and fraud detection.
- When preparing datasets for machine learning models, it is really important to detect all the outliers and either get rid of them or analyze them to know why you had them there in the first place.
Source: towardsdatascience.com
Normalization rescales the values into a range of [0,1]. This might be useful in some cases where all parameters need to have the same positive scale. However, the outliers from the data set are lost.
Standardization rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance).
For most applications standardization is recommended.
Source: stats.stackexchange.com
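To make the contrast concrete, here is a minimal NumPy sketch (the sample values are illustrative) that rescales the same small sample both ways; note how the outlier squashes the normalized values into a narrow band but survives as a large z-score after standardization:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier

# Normalization (min-max scaling) into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-scores): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm)  # most values squashed near 0 because of the outlier
print(x_std)   # the outlier stands out as a large positive z-score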
The Median is the most suitable measure of central tendency for skewed distributions or distributions with outliers. For example, the median is often used as a measure of central tendency for income distributions, which are generally highly skewed.
Because the median only uses one or two values, it's unaffected by extreme outliers or non-symmetric distributions of scores. In contrast, the mean and mode can vary in skewed distributions.
Source: en.wikipedia.org
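A quick illustration of that robustness, using hypothetical income figures with one extreme earner:
import numpy as np

incomes = np.array([30_000, 35_000, 40_000, 45_000, 1_000_000])

print(np.mean(incomes))    # 230000.0 - pulled far up by the outlier
print(np.median(incomes))  # 40000.0  - unaffected by the extreme value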
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q12: What are some advantages of using the Isolation Forest algorithm for outlier detection? ★★★★
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q14: Imagine that you know there are outliers in your data, would you use Logistic Regression? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Autoencoders Interview Questions
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q4: What are the main differences between Sparse Autoencoders and Convolution Autoencoders? ββββ
Read answer on π MLStack.Cafe
Q5: What are some differences between the Undercomplete Autoencoder and the Sparse Autoencoder? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Bias & Variance Interview Questions
In supervised machine learning an algorithm learns a model from training data.
The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X). The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.
Bias is the set of simplifying assumptions made by a model to make the target function easier to learn.
Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible.
- Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
- Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Source: machinelearningmastery.com
- High Bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- High Variance may result from an algorithm modeling random noise in the training data (overfitting).
- The Bias-Variance tradeoff is a central problem in supervised learning. Ideally, a model should be able to accurately capture the regularities in its training data, but also generalize well to unseen data.
- It is called a tradeoff because it is typically impossible to do both simultaneously:
  - Algorithms with high variance will be prone to overfitting the dataset, but
  - Algorithms with high bias will underfit the dataset.
Source: en.wikipedia.org
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Big-O Notation Interview Questions
Big-O notation (also called "asymptotic growth" notation) is a relative representation of the complexity of an algorithm. It shows how an algorithm scales based on input size, and we use it to talk about how things scale. Big-O complexity can be visualized by plotting how the number of basic operations grows as the input grows.
Source: stackoverflow.com
Say we have an array of n elements:
int array[n];
If we wanted to access the first (or any) element of the array, this would be O(1) since it doesn't matter how big the array is: it always takes the same constant time to get the first item:
x = array[0];
Source: stackoverflow.com
Big-O is often used to make statements about functions that measure the worst case behavior of an algorithm. Worst case analysis gives the maximum number of basic operations that have to be performed during execution of the algorithm. It assumes that the input is in the worst possible state and maximum work has to be done to put things right.
Source: stackoverflow.com
O(log n) means that at each step you only have to look at a fraction of the remaining elements, so in total only about log n of them are examined. This is usually because you know something about the elements that lets you make an efficient choice (for example, to halve a search space).
The most common attributes of a logarithmic running-time function are that:
- the choice of the next element on which to perform some action is one of several possibilities, and only one will need to be chosen, or
- the elements on which the action is performed are digits of n.
Binary search is the classic O(log n) example of this divide-and-conquer approach: the search space is halved at every step. The most efficient comparison sorts combine this idea with a full pass over the elements: merge sort and quicksort divide the array into two parts at every level, and each level takes O(n) work (for example, finding and partitioning around a pivot), hence they are O(n log n).
Plotting log(n) on a plain piece of paper will result in a graph where the rise of the curve decelerates as n increases:
Source: stackoverflow.com
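A minimal sketch of binary search, the classic O(log n) example mentioned above (the sorted sample list and target are illustrative):
def binary_search(sorted_list, target):
    # Return the index of target in sorted_list, or -1 if absent.
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # look at the middle element
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1                  # discard the left half
        else:
            hi = mid - 1                  # discard the right half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 9))  # 4
Every iteration halves the remaining search space, so a list of n elements needs at most about log2(n) comparisons.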
The fact is it's difficult to determine the exact runtime of an algorithm. It depends on the speed of the computer processor. So instead of talking about the runtime directly, we use Big O Notation to talk about how quickly the runtime grows depending on input size.
With Big O Notation, we use the size of the input, which we call n. So we can say things like the runtime grows "on the order of the size of the input" (O(n)) or "on the order of the square of the size of the input" (O(n^2)). Our algorithm may have steps that seem expensive when n is small but are eclipsed eventually by other steps as n gets larger. For Big O Notation analysis, we care more about the stuff that grows fastest as the input grows, because everything else is quickly eclipsed as n gets very large.
O(n^2) means that for every element, you're doing something with every other element, such as comparing them. Bubble sort is an example of this.
Source: stackoverflow.com
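A minimal sketch of bubble sort, whose nested compare-and-swap loops give exactly this quadratic behaviour:
def bubble_sort(arr):
    # Sort arr in place by repeatedly comparing neighbouring elements.
    n = len(arr)
    for i in range(n):
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]  # swap the out-of-order pair
    return arr

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]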
Let's say we wanted to find a number in the list:
for (int i = 0; i < n; i++){
if(array[i] == numToFind){ return i; }
}
What will be the time complexity (Big O) of that code snippet?
This would be O(n) since at most we would have to look through the entire list to find our number. The Big-O is still O(n) even though we might find our number on the first try and run through the loop only once, because Big-O describes the upper bound for an algorithm.
Source: stackoverflow.com
O(1). Note that you don't have to insert at the end of the list: if you insert at the front of a (singly-linked) list, push and pop are both O(1).
Stack contains 1,2,3:
[1]->[2]->[3]
Push 5:
[5]->[1]->[2]->[3]
Pop:
[1]->[2]->[3] // returning 5
Source: stackoverflow.com
Let's consider a traversal algorithm for traversing a list.
- O(1) denotes constant space use: the algorithm allocates the same number of pointers irrespective of the list size. That will happen if we move (reuse) our pointer along the list.
- In contrast, O(n) denotes linear space use: the algorithm's space use grows with the input size n. That will happen if, say, for some reason the algorithm needs to allocate n pointers (or other variables) when traversing the list.
Source: stackoverflow.com
Consider:
f(n) = log n + 3n
What is the Big O notation of this function?
It is simply O(n).
When you have a composite of multiple parts in Big O notation which are added, you have to choose the biggest one. In this case it is O(3n), but there is no need to include constants inside parentheses, so we are left with O(n).
Source: stackoverflow.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q14: What is meant by "Constant Amortized Time" when talking about time complexity of an algorithm? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q22: What are some algorithms which we use daily that have O(1), O(n log n) and O(log n) complexities? ★★★★
Read answer on π MLStack.Cafe
[⬆] Classification Interview Questions
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications:
- we consider that these predictors are independent
- we consider that all the predictors have an equal effect on the outcome (like the day being windy does not have more importance in deciding to play golf or not)
Source: towardsdatascience.com
- A Perceptron is a fundamental unit of a Neural Network that is also a single-layer Neural Network.
- Perceptron is a linear classifier. Since it uses already labeled data points, it is a supervised learning algorithm.
- The activation function applies a step rule (convert the numerical output into +1 or -1) to check if the output of the weighting function is greater than zero or not.
A Perceptron is shown in the figure below:
Source: towardsdatascience.com
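A minimal from-scratch sketch of the perceptron learning rule described above; the toy AND-style data, learning rate and epoch count are illustrative choices:
import numpy as np

def perceptron_train(X, y, epochs=10, lr=0.1):
    # Learn weights w and bias b for labels y in {-1, +1} using the step rule.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else -1  # step activation
            if pred != target:                         # update only on mistakes
                w += lr * target * xi
                b += lr * target
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])   # linearly separable (logical AND)
w, b = perceptron_train(X, y)
print([1 if np.dot(w, xi) + b > 0 else -1 for xi in X])  # [-1, -1, -1, 1]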
A decision boundary is a line or a hyperplane that separates the classes. This is what we expect to obtain from logistic regression, as with any other classifier. With this, we can figure out some way to split the data to allow for an accurate prediction of a given observation's class using the available information.
In the case of a generic two-dimensional example, the split might look something like this:
Source: medium.com
- Logistic Regression: ideally used for classification of binary variables. Implements the sigmoid function to calculate the probability that a data point belongs to a certain class.
- K-Nearest Neighbours (kNN): calculates the distance of one data point from every other data point and then takes a majority vote from the k nearest neighbors of each data point to classify the output.
- Decision Trees: use multiple if-else statements in the form of a tree structure that includes nodes and leaves. The nodes break down the one major structure into smaller structures and eventually provide the final outcome.
- Random Forest: uses multiple decision trees to predict the outcome of the target variable. Each decision tree provides its own outcome, and then the majority vote is taken to classify the final outcome.
- Support Vector Machines: create an n-dimensional space for the n features in the dataset and then try to create the hyperplane that divides and classifies the data points with the maximum margin possible.
Source: www.upgrad.com
- K-nearest neighbors (KNN) is a supervised classification algorithm. This means that we need labeled data to classify an unlabeled data point. It attempts to classify a data point based on its proximity to the K nearest data points in the feature space.
- K-means Clustering is an unsupervised clustering algorithm. It requires only a set of unlabeled points and a number K, and it gathers and groups the data into K clusters.
Source: www.quora.com
There is no rule of thumb for choosing a standard optimal k. This value depends on and varies from dataset to dataset, but as a general rule, the main goal is to keep it:
- small enough to exclude the samples of the other classes, but
- large enough to minimize any noise in the data.
A way of looking for this optimal parameter, commonly called the Elbow method, consists in creating a loop that trains various KNN models with different k values, keeping track of the error for each of these models, and using the model whose k value achieves the best accuracy, as sketched below.
Source: medium.com
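A sketch of that loop using scikit-learn; the Iris dataset, the candidate range k = 1..20 and the 70/30 validation split are assumptions made purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

errors = {}
for k in range(1, 21):                        # candidate k values
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_val, y_val)   # validation error for this k

best_k = min(errors, key=errors.get)          # the best-performing k
print(best_k, errors[best_k])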
In Logistic Regression models, we are modeling the probability that an input (X) belongs to the default class (Y=1), that is to say:
$$P(X) = P(Y=1|X)$$
where the P(X) values are given by the logistic function,
$$P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
The β0 and β1 values are estimated during the training stage using maximum-likelihood estimation or gradient descent. Once we have them, we can make predictions by simply putting numbers into the logistic regression equation and calculating a result.
For example, let's consider that we have a model that can predict whether a person is male or female based on their height, such that if P(X) ≥ 0.5 the person is male, and if P(X) < 0.5 then the person is female.
During the training stage we obtain β0 = -100 and β1 = 0.6, and we want to evaluate the probability that a person with a height of 150 cm is male, so with that intention we compute:
$$P(X) = \frac{e^{-100 + 0.6 \times 150}}{1 + e^{-100 + 0.6 \times 150}} = \frac{e^{-10}}{1 + e^{-10}} \approx 0.0000454$$
Since this value is far below 0.5, we can use it directly to predict that the person is female.
Source: machinelearningmastery.com
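The arithmetic in the worked example can be checked directly in Python:
import math

beta_0, beta_1 = -100, 0.6
height = 150

z = beta_0 + beta_1 * height            # -100 + 0.6 * 150 = -10
p_male = math.exp(z) / (1 + math.exp(z))

print(p_male)                           # ~0.0000454, far below 0.5 -> predict female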
When it comes to classification problems, the goal is to establish a decision boundary that maximizes the margin between the classes. However, in the real world, this task can become difficult when we have to deal with non-linearly separable data. One approach to solving this problem is to perform a data transformation process, in which we map all the data points to a higher dimension, find the boundary, and make the classification.
That sounds alright, however, when there are more and more dimensions, computations within that space become more and more expensive. In such cases, the kernel trick allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space and therefore offers a more efficient and less expensive way to transform data into higher dimensions.
There exist different kernel functions, such as:
- linear,
- nonlinear,
- polynomial,
- radial basis function (RBF), and
- sigmoid.
Each one of them can be suitable for a particular problem depending on the data.
Source: medium.com
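As an illustration, scikit-learn's SVC exposes several of these kernels through a single parameter; the make_circles toy dataset below is an assumption chosen because it is non-linearly separable:
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class sits inside a ring of the other, so no straight line separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))   # the RBF kernel typically separates this data best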
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q11: What is the difference between a Weak Learner and a Strong Learner, and why could they be useful? ★★★
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q17: How would you use Naive Bayes classifier for categorical features? What if some features are numerical? βββ
Read answer on π MLStack.Cafe
Q18: What's the difference between Generative Classifiers and Discriminative Classifiers? Name some examples of each one βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q25: What are some advantages and disadvantages of using AUC to measure the performance of the model? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q28: Name some advantages of using Support Vector Machines vs Logistic Regression for classification ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q31: What's the difference between Random Oversampling and Random Undersampling and when they can be used? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q34: What are the trade-offs between the different types of Classification Algorithms? How would you choose the best one? ★★★★
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Clustering Interview Questions
- Cluster analysis is also called clustering.
- It is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters.
- Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them.
Source: Handbook of Cluster Analysis from Chapman and Hall/CRC
- Clustering, when the data are similar pairs of points is called similarity-based clustering.
- A typical example of similarity-based clustering is community detection in social networks, where the observations are individual links between people, which may be due to friendship, shared interests, and work relationships. The strength of a link can be the frequency of interactions, for example, communications by e-mail, phone, or other social media, co-authorships, or citations.
- In this clustering paradigm, the points to be clustered are not assumed to be part of a vector space. Their attributes (or features) are incorporated into a single dimension, the link strength, or similarity, which takes a numerical value $$S_{ij}$$ for each pair of points i, j. Hence, the natural representation for this problem is by means of the similarity matrix $$S = [S_{ij}]_{i,j=1}^n$$ The similarities are symmetric, $$S_{ij} = S_{ji}$$, and nonnegative, $$S_{ij} \geq 0$$.
Source: Handbook of Cluster Analysis from Chapman and Hall/CRC
- Identifying cancerous data: Initially we take known samples of a cancerous and non-cancerous dataset, and label both the samples dataset. Then both the samples are mixed and different clustering algorithms are applied to the mixed samples dataset. It has been found through experiments that a cancerous dataset gives the best results with unsupervised non-linear clustering algorithms.
- Search engines: Search engines try to group similar objects in one cluster and the dissimilar objects far from each other. It provides results for the searched data according to the nearest similar object which is clustered around the data to be searched.
- Wireless sensor network-based applications: Clustering algorithms can be used effectively in wireless sensor network-based applications, for example in landmine detection. The clustering algorithm plays the role of finding the cluster heads (or cluster centers), which collect all the data in their respective clusters.
Source: sites.google.com
- Mean Shift is a non-parametric feature-space analysis technique for locating the maxima of a density function. What we're trying to achieve here is, to keep shifting the window to a region of higher density.
- We can understand this algorithm by thinking of our data points to be represented as a probability density function. Naturally, in a probability function, higher density regions will correspond to the regions with more points, and lower density regions will correspond to the regions with less points. In clustering, we need to find clusters of points, i.e the regions with a lot of points together. More points together mean higher density. Hence, we observe that clusters of points are more like the higher density regions in our probability density function.
So, we must iteratively go from lower density to higher density regions, in order to find our clusters.
- The mean shift method is an iterative method, and we start with an initial estimate x. Let a kernel function $$K(x_i - x)$$ be given. This function determines the weight of nearby points for re-estimation of the mean. Typically a Gaussian kernel on the distance to the current estimate is used, $$K(x_i - x) = e^{-c\|x_i - x\|^2}$$ The weighted mean of the density in the window determined by K is $$m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)}$$ where N(x) is the neighborhood of x, a set of points for which $$K(x_i - x) \neq 0$$.
- The difference m(x) - x is called the mean shift. The mean-shift algorithm now sets $$x \leftarrow m(x)$$ and repeats the estimation until m(x) converges. It means that, after a sufficient number of steps, the position of the centroid of all the points and the current location of the window will coincide. This is when we reach convergence, as no new points are added to our window in this step.
Source: en.wikipedia.org
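A sketch of mean shift in practice using scikit-learn's MeanShift; the two synthetic blobs and the bandwidth estimation settings are illustrative assumptions:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # blob around (0, 0)
               rng.normal(5, 0.5, size=(50, 2))])  # blob around (5, 5)

bandwidth = estimate_bandwidth(X, quantile=0.2)    # size of the kernel window
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(ms.cluster_centers_)   # the density maxima the windows converged to
print(set(ms.labels_))       # typically {0, 1} for these two well-separated blobs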
- Self-Organizing Maps (SOMs) are a class of self-organizing clustering techniques.
- It is an unsupervised form of artificial neural networks. A self-organizing map consists of a set of neurons that are arranged in a rectangular or hexagonal grid. Each neuronal unit in the grid is associated with a numerical vector of fixed dimensionality. The learning process of a self-organizing map involves the adjustment of these vectors to provide a suitable representation of the input data.
- Self-organizing maps can be used for clustering numerical data in vector format.
Source: medium.com
- Significance testing addresses an important aspect of cluster validation. Many cluster analysis methods will deliver clusterings even for homogeneous data. They assume implicitly that clustering has to be found, regardless of whether this is meaningful or not.
A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation.
- Significance testing is performed to distinguish between a clustering that reflects meaningful heterogeneity in the data and an artificial clustering of homogeneous data.
- Significance testing is also used for more specific tasks in cluster analysis, such as estimating the number of clusters and interpreting some or all of the individual clusters, to show the significance of the individual clusters.
Source: www.ncbi.nlm.nih.gov
Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
Source: stats.stackexchange.com
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Source: en.wikipedia.org
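A direct translation of that definition into Python:
def jaccard_index(a, b):
    # |A intersection B| / |A union B| for two finite sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_index({"apple", "banana", "cherry"},
                    {"banana", "cherry", "date"}))  # 2 / 4 = 0.5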
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q11: What are some different types of Clustering Structures that are used in Clustering Algorithms? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q25: When using various Clustering Algorithms, why is Euclidean Distance not a good metric in High Dimensions? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q27: How would you choose the number of Clusters when designing a K-Medoid Clustering Algorithm? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q31: How to tell if data is clustered enough for clustering algorithms to produce meaningful results? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Cost Function Interview Questions
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Data Structures Interview Questions
A Stack is a container of objects that are inserted and removed according to the last-in first-out (LIFO) principle. In the pushdown stacks only two operations are allowed: push the item into the stack, and pop the item out of the stack.
There are basically three operations that can be performed on stacks. They are:
- inserting an item into a stack (push).
- deleting an item from the stack (pop).
- displaying the contents of the stack (peek or top).
A stack is a limited access data structure - elements can be added and removed from the stack only at the top. push adds an item to the top of the stack, pop removes the item from the top. A helpful analogy is to think of a stack of books; you can remove only the top book, also you can add a new book on the top.
Source: www.cs.cmu.edu
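A minimal sketch of the three stack operations using a Python list as the backing store (the end of the list plays the role of the top):
class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)   # insert at the top

    def pop(self):
        return self._items.pop()   # remove and return the top item

    def peek(self):
        return self._items[-1]     # look at the top without removing it

s = Stack()
s.push(1); s.push(2); s.push(3)
print(s.peek())  # 3
print(s.pop())   # 3
print(s.pop())   # 2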
A stack is a recursive data structure, in the sense that:
- a stack is either empty, or
- it consists of a top and the rest, which is itself a stack.
Source: www.cs.cmu.edu
A linked list is a linear data structure where each element is a separate object. Each element (we will call it a node) of a list comprises two items: the data and a reference (pointer) to the next node. The last node has a reference to null. The entry point into a linked list is called the head of the list. It should be noted that the head is not a separate node, but the reference to the first node. If the list is empty then the head is a null reference.
Source: www.cs.cmu.edu
Arrays are:
- Finite (fixed-size): an array is finite because it contains only a limited number of elements.
- Ordered: all the elements are stored one by one, in contiguous locations of computer memory, in a linear order and fashion.
- Homogeneous: all the elements of an array are of the same data type only, and hence an array is termed a collection of homogeneous elements.
Source: codelack.com
A queue is a container of objects (a linear collection) that are inserted and removed according to the first-in first-out (FIFO) principle. The process to add an element into queue is called Enqueue and the process of removal of an element from queue is called Dequeue.
Source: www.cs.cmu.edu
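A minimal sketch of FIFO behaviour using Python's collections.deque as the queue:
from collections import deque

queue = deque()

# Enqueue: add elements at the back
queue.append("a")
queue.append("b")
queue.append("c")

# Dequeue: remove elements from the front, in the order they were added
print(queue.popleft())  # 'a'
print(queue.popleft())  # 'b'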
A Heap is a special Tree-based data structure which is an almost complete tree that satisfies the heap property:
- in a max heap, for any given node C, if P is a parent node of C, then the key (the value) of P is greater than or equal to the key of C.
- In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap (with no parents) is called the root node.
A common implementation of a heap is the binary heap, in which the tree is a binary tree.
Source: www.geeksforgeeks.org
A hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. Hash tables implement an associative array, which is indexed by arbitrary objects (keys). A hash table uses a hash function to compute an index, also called a hash value, into an array of buckets or slots, from which the desired value can be found.
Source: en.wikipedia.org
A priority queue is a data structure that stores priorities (comparable values) and perhaps associated information. A priority queue is different from a "normal" queue, because instead of being a "first-in-first-out" data structure, values come out in order by priority. Think of a priority queue as a kind of bag that holds priorities. You can put one in, and you can take out the current highest priority.
Source: pages.cs.wisc.edu
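A minimal sketch using Python's heapq module, which implements a min-heap, so the smallest priority value comes out first (the tasks and priority numbers are illustrative):
import heapq

pq = []
heapq.heappush(pq, (2, "write report"))   # (priority, task); lower number = higher priority here
heapq.heappush(pq, (1, "fix outage"))
heapq.heappush(pq, (3, "reply to email"))

while pq:
    priority, task = heapq.heappop(pq)    # always returns the smallest priority first
    print(priority, task)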
Trees are well-known as a non-linear data structure. They don't store data in a linear way. They organize data hierarchically.
A tree is a collection of entities called nodes. Nodes are connected by edges. Each node contains a value or data or key, and it may or may not have a child node. The first node of the tree is called the root. Leaves are the last nodes on a tree. They are nodes without children.
Source: www.freecodecamp.org
A graph is a common data structure that consists of a finite set of nodes (or vertices) and a set of edges connecting them. A pair (x, y) is referred to as an edge, which communicates that the x vertex connects to the y vertex.
Graphs are used to solve real-life problems that involve representation of the problem space as a network. Examples of networks include telephone networks, circuit networks, social networks (like LinkedIn, Facebook etc.).
Source: www.educative.io
A string is generally considered as a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence (or list) data types and structures.
Source: dev.to
Trie (also called **digital tree** or **prefix tree**) is a tree-based data structure which is used for efficient retrieval of a key in a large data-set of strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated; i.e., the value of the key is distributed across the structure. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Each complete English word has an arbitrary integer value associated with it.
Source: medium.com
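A compact sketch of a trie that maps complete words to integer values, in line with the description above (the class and method names are illustrative):
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # value stored at complete words (None = not a word)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, value):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.value

t = Trie()
t.insert("tea", 3)
t.insert("ten", 12)
print(t.search("tea"))  # 3
print(t.search("te"))   # None - a prefix only, not a stored word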
A normal tree has no restrictions on the number of children each node can have. A binary tree is made of nodes, where each node contains a "left" pointer, a "right" pointer, and a data element.
There are three different types of binary trees:
- Full binary tree: Every node other than leaf nodes has 2 child nodes.
- Complete binary tree: All levels are filled except possibly the last one, and all nodes are filled in as far left as possible.
- Perfect binary tree: All nodes have two children and all leaves are at the same level.
Source: study.com
Because they help manage your data in a more particular way than arrays and lists. It means that when you're debugging a problem, you won't have to wonder if someone randomly inserted an element into the middle of your list, messing up some invariants.
Arrays and lists are random access. They are very flexible and also easily corruptible. If you want to manage your data as FIFO or LIFO it's best to use those, already implemented, collections.
More practically you should:
- Use a queue when you want to get things out in the order that you put them in (FIFO)
- Use a stack when you want to get things out in the reverse order than you put them in (LIFO)
- Use a list when you want to get anything out, regardless of when you put them in (and when you don't want them to automatically be removed).
Source: stackoverflow.com
- Queues offer random access to their contents only by shifting the first element off the front of the queue. You have to do this repeatedly to access an arbitrary element somewhere in the queue. Therefore, access is O(n).
- Searching for a given value in the queue requires iterating until you find it. So search is O(n).
- Inserting into a queue, by definition, can only happen at the back of the queue, similar to someone getting in line for a delicious Double-Double burger at In 'n Out. Assuming an efficient queue implementation, queue insertion is O(1).
- Deleting from a queue happens at the front of the queue. Assuming an efficient queue implementation, queue deletion is O(1).
Source: github.com
Queue can be classified into following types:
- Simple Queue - is a linear data structure in which removal of elements is done in the same order they were inserted i.e., the element will be removed first which is inserted first.
- Circular Queue - is a linear data structure in which the operations are performed based on FIFO (First In First Out) principle and the last position is connected back to the first position to make a circle. It is also called Ring Buffer. Circular queue avoids the wastage of space in a regular queue implementation using arrays.
- Priority Queue - is a type of queue where each element has a priority value and the deletion of the elements depends upon the priority value.
  - In case of a max-priority queue, the element with the largest priority value will be deleted first.
  - In case of a min-priority queue, the element with the minimum priority value will be deleted first.
- Deque (double-ended queue) - allows insertion and deletion from both ends, i.e. elements can be added or removed from the rear as well as the front end.
  - Input-restricted deque - the insertion operation is performed at only one end and the deletion operation is performed at both ends.
  - Output-restricted deque - the deletion operation is performed at only one end and the insertion operation is performed at both ends.
Source: www.ques10.com
- A singly linked list - each node has a single reference that points to the next node in the list.
- A doubly linked list - a list in which each node has two references, one to the next node and another to the previous node.
- A multiply linked list - each node contains two or more link fields, each field being used to connect the same set of data records in a different order (e.g., by name, by department, by date of birth, etc.).
- A circular linked list - where last node of the list points back to the first node (or the head) of the list.
Source: www.cs.cmu.edu
A dynamic array is an array with a big improvement: automatic resizing.
One limitation of arrays is that they're fixed size, meaning you need to specify the number of elements your array will hold ahead of time. A dynamic array expands as you add more elements. So you don't need to determine the size ahead of time.
Source: www.interviewcake.com
The easiest solution that comes to mind here is iteration:
function fib(n){
  let arr = [0, 1];
  for (let i = 2; i < n + 1; i++){
    arr.push(arr[i - 2] + arr[i - 1])
  }
  return arr[n]
}
And output:
fib(4)
=> 3
Notice that the first two numbers cannot really be generated by the for loop, because our loop involves adding two numbers together, so instead of creating an empty array we assign our arr variable to [0, 1], which we know for a fact will always be there. After that we create a loop that starts iterating from i = 2 and adds numbers to the array until the length of the array is equal to n + 1. Finally, we return the number at index n of the array.
Source: medium.com
Time Complexity: O(n) Space Complexity: O(n)
An algorithm in our iterative solution takes linear time to complete the task. Basically we iterate through the loop n-2 times, so Big O (the notation used to describe our worst case scenario) would simply be equal to O(n) in this case. The space complexity is O(n).
function fib(n){
  let arr = [0, 1]
  for (let i = 2; i < n + 1; i++){
    arr.push(arr[i - 2] + arr[i - 1])
  }
  return arr[n]
}
double fibonacci(int n){
  double prev = 0d, next = 1d;
  for (int i = 0; i < n; i++) {
    double result = prev + next;
    prev = next;
    next = result;
  }
  return prev;
}
def fib_iterative(n):
    a, b = 0, 1
    while n > 0:
        a, b = b, a + b
        n -= 1
    return a
Few disadvantages of linked lists are :
- They use more memory than arrays because of the storage used by their pointers.
- Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a back-pointer.
- Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
- Random access has linear time.
- Nodes are stored non-contiguously (no or poor cache locality), greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
- If the link to a list's node is accidentally destroyed, then the chances of data loss after the destruction point are huge. Data recovery is not possible.
- Search is linear versus logarithmic for sorted arrays and binary search trees.
- A different amount of time is required to access each element.
- Not easy to sort the elements stored in the linear linked list.
Source: www.quora.com
Recursive solution looks pretty simple (see code).
Let's look at the diagram that will help you understand what's going on here with the rest of our code. Function fib is called with argument 5:
Basically our fib function will continue to recursively call itself, creating more and more branches of the tree until it hits the base case, from which it will start summing up each branch's return values bottom up, until it finally sums them all up and returns an integer equal to 5.
Source: medium.com
Time Complexity: O(2^n)
In the case of recursion the solution takes exponential time, which can be explained by the fact that the size of the tree grows exponentially as n increases. So for every additional element in the Fibonacci sequence we get an increase in function calls. Big O in this case is equal to O(2^n). Exponential time complexity denotes an algorithm whose growth doubles with each addition to the input data set.
function fib(n) {
if (n < 2){
return n
}
return fib(n - 1) + fib (n - 2)
}
public int fibonacci(int n) {
if (n < 2) return n;
return fibonacci(n - 1) + fibonacci(n - 2);
}
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1) + F(n-2)
The space complexity of a data structure indicates how much space it occupies in relation to the number of elements it holds. For example, a space complexity of O(1) would mean that the data structure always consumes constant space no matter how many elements you put in it. O(n) would mean that the space consumption grows linearly with the number of elements in it.
A hashtable typically has a space complexity of O(n).
Source: stackoverflow.com
A Binary Heap is a Binary Tree with following properties:
- It's a complete tree (all levels are completely filled except possibly the last level, and the last level has all keys as far left as possible). This property of a Binary Heap makes it suitable to be stored in an array.
- A Binary Heap is either a Min Heap or a Max Heap. In a Min Binary Heap, the key at the root must be the minimum among all keys present in the Binary Heap. The same property must be recursively true for all nodes in the Binary Tree. A Max Binary Heap is similar, with the key at the root being the maximum.
10 10
/ \ / \
20 100 15 30
/ / \ / \
30 40 50 100 40
Source: www.geeksforgeeks.org
A binary search tree is a data structure that allows us to quickly maintain a sorted list of numbers.
- It is called a binary tree because each tree node has a maximum of two children.
- It is called a search tree because it can be used to search for the presence of a number in O(log n) time.
The properties that separates a binary search tree from a regular binary tree are:
- All nodes of left subtree are less than root node
- All nodes of right subtree are more than root node
- Both subtrees of each node are also BSTs i.e. they have the above two properties
Source: www.programiz.com
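A minimal sketch of insertion and search in a binary search tree; the search is O(log n) on average for a reasonably balanced tree and O(n) in the worst case:
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # Insert key and return the (possibly new) root of the subtree.
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)     # smaller keys go left
    else:
        root.right = insert(root.right, key)   # larger (or equal) keys go right
    return root

def contains(root, key):
    if root is None:
        return False
    if key == root.key:
        return True
    return contains(root.left, key) if key < root.key else contains(root.right, key)

root = None
for k in [8, 3, 10, 1, 6, 14]:
    root = insert(root, k)
print(contains(root, 6), contains(root, 7))  # True False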
Char arrays:
- Static-sized
- Fast access
- Few built-in methods to manipulate strings
- A char array doesn't define a data type
Strings:
- Slower access
- Define a data type
- Dynamic allocation
- More built-in functions to support string manipulations
Source: dev.to
That is a basic (generic) tree structure that can be used for String or any other object:
Source: stackoverflow.com
public class Tree<T> {
private Node<T> root;
public Tree(T rootData) {
root = new Node<T>();
root.data = rootData;
root.children = new ArrayList<Node<T>>();
}
public static class Node<T> {
private T data;
private Node<T> parent;
private List<Node<T>> children;
}
}
Generic Tree:
class Tree(object):
    "Generic tree node."
    def __init__(self, name='root', children=None):
        self.name = name
        self.children = []
        if children is not None:
            for child in children:
                self.add_child(child)
    def __repr__(self):
        return self.name
    def add_child(self, node):
        assert isinstance(node, Tree)
        self.children.append(node)

#    *
#   /|\
#  1 2 +
#     / \
#    3   4
t = Tree('*', [Tree('1'),
               Tree('2'),
               Tree('+', [Tree('3'),
                          Tree('4')])])
Binary tree:
class Tree:
    def __init__(self):
        self.left = None
        self.right = None
        self.data = None
To convert a singly linked list to a circular linked list, we will set the next pointer of the tail node to the head pointer.
- Create a copy of the head pointer, let's say temp.
- Using a loop, traverse the linked list till the tail node (last node) using the temp pointer.
- Now set the next pointer of the tail node to the head node: temp->next = head
Source: www.techcrashcourse.com
def convertTocircular(head):
    # keep a reference to the head node
    start = head
    # traverse the list until we reach the tail node,
    # i.e. the node whose next pointer is None
    while head.next is not None:
        head = head.next
    # link the tail node back to the head to close the circle
    head.next = start
    return start
Graph:
- Consists of a set of vertices (or nodes) and a set of edges connecting some or all of them
- Any edge can connect any two vertices that aren't already connected by an identical edge (in the same direction, in the case of a directed graph)
- Doesn't have to be connected (the edges don't have to connect all vertices together): a single graph can consist of a few disconnected sets of vertices
- Could be directed or undirected (which would apply to all edges in the graph)
Tree:
- A type of graph (fits within the category of Directed Acyclic Graphs, or DAGs)
- Vertices are more commonly called "nodes"
- Edges are directed and represent an "is child of" (or "is parent of") relationship
- Each node (except the root node) has exactly one parent (and zero or more children)
- Has exactly one "root" node (if the tree has at least one node), which is a node without a parent
- Has to be connected
- Is acyclic, meaning it has no cycles: "a cycle is a path [AKA sequence] of edges and vertices wherein a vertex is reachable from itself"
- Trees aren't a recursive data structure
Source: stackoverflow.com
Linked lists are very useful when you need :
- to do a lot of insertions and removals, but not too much searching, on a list of arbitrary (unknown at compile-time) length.
- splitting and joining (bidirectionally-linked) lists is very efficient.
- You can also combine linked lists - e.g. tree structures can be implemented as "vertical" linked lists (parent/child relationships) connecting together horizontal linked lists (siblings).
Using an array based list for these purposes has severe limitations:
- Adding a new item means the array must be reallocated (or you must allocate more space than you need to allow for future growth and reduce the number of reallocations)
- Removing items leaves wasted space or requires a reallocation
- inserting items anywhere except the end involves (possibly reallocating and) copying lots of the data up one position
Source: stackoverflow.com
For traversing a (non-empty) binary tree in pre-order fashion, we must do these three things for every node N starting from the root node of the tree:
- (N) Process N itself.
- (L) Recursively traverse its left subtree. When this step is finished we are back at N again.
- (R) Recursively traverse its right subtree. When this step is finished we are back at N again.
Source: github.com
Time Complexity: O(n) Space Complexity: O(n)
// Recursive function to perform pre-order traversal of the tree
public static void preorder(TreeNode root)
{
// return if the current node is empty
if (root == null) {
return;
}
// Display the data part of the root (or current node)
System.out.print(root.data + " ");
// Traverse the left subtree
preorder(root.left);
// Traverse the right subtree
preorder(root.right);
}
# Recursive function to perform pre-order traversal of the tree
def preorder(root):
    # return if the current node is empty
    if root is None:
        return
    # Display the data part of the root (or current node)
    print(root.data, end=' ')
    # Traverse the left subtree
    preorder(root.left)
    # Traverse the right subtree
    preorder(root.right)
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q36: LIS: Find length of the longest increasing subsequence (LIS) in the array. Solve using DP. βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q47: Binet's formula: How to calculate Fibonacci numbers without Recursion or Iteration? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Databases Interview Questions
Normalization is basically designing a database schema such that duplicate and redundant data are avoided. If the same information is repeated in multiple places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.
There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of some specific problem.
By having a database with normalization errors, you run the risk of getting invalid or corrupt data into the database. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.
Source: stackoverflow.com
Q2: What is the difference between Data Definition Language (DDL) and Data Manipulation Language (DML)? ββ
-
Data definition language (DDL) commands are the commands which are used to define the database. CREATE, ALTER, DROP and TRUNCATE are some common DDL commands.
-
Data manipulation language (DML) commands are commands which are used for manipulation or modification of data. INSERT, UPDATE and DELETE are some common DML commands.
Source: en.wikibooks.org
NoSQL is better than RDBMS because of the following reasons/properities of NoSQL:
- It supports semi-structured and volatile data
- It does not have a schema
- Read/Write throughput is very high
- Horizontal scalability can be achieved easily
- Will support Big Data in volumes of terabytes and petabytes
- Provides good support for analytic tools on top of Big Data
- Can be hosted in cheaper hardware machines
- In-memory caching option is available to increase the performance of queries
- Faster development life cycles for developers
Still, RDBMS is better than NoSQL for the following reasons/properties of RDBMS:
- Transactions with ACID properties - Atomicity, Consistency, Isolation & Durability
- Adherence to Strong Schema of data being written/read
- Real-time query management (in case of data size < 10 terabytes)
- Execution of complex queries involving join & group by clauses
Source: stackoverflow.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Datasets Interview Questions
- Covariance measures whether a variation in one variable results in a variation in another variable, and deals with the linear relationship of only 2 variables in the dataset. Its value can range from -∞ to +∞. Simply speaking, covariance indicates the direction of the linear relationship between variables.
- Correlation measures how strongly two or more variables are related to each other. Its values are between -1 and 1. Correlation measures both the strength and direction of the linear relationship between two variables. Correlation is a function of the covariance.
Source: careerfoundry.com
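A quick NumPy illustration of the two quantities on made-up data:
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 10 * x + np.array([0.5, -0.3, 0.2, -0.1, 0.4])   # strongly, positively related to x

print(np.cov(x, y)[0, 1])       # covariance: the sign gives the direction, the scale depends on units
print(np.corrcoef(x, y)[0, 1])  # correlation: bounded in [-1, 1], close to 1 here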
It's not recommended to perform K-NN on large datasets, given that the computational and memory cost can increase. To understand the reason why, we should remember how the K-NN algorithm works:
- It starts by calculating the distances to all vectors in a training set and stores them.
- Then, it sorts the calculated distances.
- Then, it stores the K nearest vectors.
- And finally, it calculates the most frequent class displayed by the K nearest vectors.
So implementing K-NN on a large dataset is not only a bad decision in terms of storing a large amount of data, but it is also computationally costly to keep calculating and sorting all the values. For that reason, K-NN is not recommended, and another classification algorithm like Naive Bayes or SVM is preferred in such cases.
Source: towardsdatascience.com
- Cross-validation is a method of assessing how the results of a statistical analysis will generalize to an independent dataset.
- It can be used in machine learning tasks to evaluate the predictive capability of the model.
- It also helps us to avoid overfitting and underfitting.
- A common way to cross-validate is to divide the dataset into training, validation, and testing, where:
  - the training dataset is a dataset of known data on which the training is run;
  - the validation dataset is a dataset of unknown data against which the model is tested; it is used after each epoch of learning to gauge the improvement of the model;
  - the testing dataset is also an unknown dataset that is used to measure the performance of the model after it has finished learning.
Source: en.wikipedia.org
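A minimal sketch of such a three-way split with scikit-learn; the 60/20/20 proportions and the dummy data are illustrative choices, not a fixed rule:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve out the test set, then split the remainder into training and validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200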
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q17: How would you address the problem of Heteroskedasticity caused for a Measurement error? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q20: What's the difference between Random Oversampling and Random Undersampling and when they can be used? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[⬆] Decision Trees Interview Questions
- A decision tree is a tool that uses a tree-like model of decisions and their possible consequences. If an algorithm only contains conditional control statements, a decision tree can model that algorithm really well.
- Decision trees are a non-parametric, supervised learning method.
- Decision trees are used for classification and regression tasks.
- The diagram below shows an example of a decision tree (the dataset used is the Titanic dataset to predict whether a passenger survived or not):
Source: towardsdatascience.com
A decision tree is a flowchart-like structure in which:
- Each internal node represents the test on an attribute (e.g. outcome of a coin flip).
- Each branch represents the outcome of the test.
- Each leaf node represents a class label.
- The paths from the root to leaf represent the classification rules.
Source: en.wikipedia.org
A decision tree consists of three types of nodes:
- Decision nodes: Represented by squares. It is a node where a flow branches into several optional branches.
- Chance nodes: Represented by circles. It represents the probability of certain results.
- End nodes: Represented by triangles. It shows the final outcome of the decision path.
Source: en.wikipedia.org
- It is simple to understand and interpret. It can be visualized easily.
- It does not require as much data preprocessing as other methods.
- It can handle both numerical and categorical data.
- It can handle multiple output problems.
Source: scikit-learn.org
- If the Gini Index of the data is 0, then it means that all the elements belong to a specific class. When this happens it is said to be pure.
- When all of the data belongs to a single class (pure), then the leaf node is reached in the tree.
- The leaf node represents the class label in the tree (which means that it gives the final output).
Source: medium.com
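A small helper that computes the Gini impurity of a set of labels, matching the behaviour described above (0 for a pure node):
from collections import Counter

def gini_index(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["yes", "yes", "yes"]))       # 0.0 -> pure node, a leaf is reached
print(gini_index(["yes", "no", "yes", "no"]))  # 0.5 -> maximally mixed for two classes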
- Random forest is an ensemble learning method that works by constructing a multitude of decision trees. A random forest can be constructed for both classification and regression tasks.
- Random forest outperforms decision trees, and it also does not have the habit of overfitting the data as decision trees do.
- A decision tree trained on a specific dataset will become very deep and cause overfitting. To create a random forest, decision trees can be trained on different subsets of the training dataset, and then the different decision trees can be averaged with the goal of decreasing the variance.
Source: en.wikipedia.org
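A minimal sketch, assuming scikit-learn is available, that contrasts a single decision tree with a random forest trained on the same synthetic data; the dataset and parameters are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)               # a single deep tree tends to overfit
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # many trees on bootstrap samples, then averaged

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())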
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q25: Imagine that you know there are outliers in your data, would you use Logistic Regression? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q27: While building a Decision Tree, how do you choose which attribute to split on at each node? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Deep Learning Interview Questions
- Machine Learning depends on humans to learn. Humans determine the hierarchy of features used to distinguish between data inputs. It usually requires more structured data to learn.
- Deep Learning automates much of the feature extraction piece of the process. It eliminates the manual human intervention required.
- Machine Learning is less dependent on the amount of data as compared to deep learning.
- Deep Learning requires a lot of data to give high accuracy. It can take thousands or millions of data points, trained for days or weeks, to give an acceptably accurate model.
Source: ibm.com
- Deep Learning gives a better performance compared to machine learning if the dataset is large enough.
- Deep Learning does not need the person designing the model to have a lot of domain understanding for feature introspection. Deep learning outshines other methods if there is no feature engineering done.
- Deep Learning really shines when it comes to complex problems such as image classification, natural language processing, and speech recognition.
Source: towardsdatascience.com
- One of the best benefits of Deep Learning is its ability to perform automatic feature extraction from raw data.
- As the amount of data fed into the learning algorithm increases, more edge cases are taken into consideration, and hence the algorithm learns to make the right decisions in those edge cases.
Source: machinelearningmastery.com
- When researchers started to create large artificial neural networks, they started to use the word deep to refer to them.
- As the term deep learning started to be used, it is generally understood that it stands for artificial neural networks which are deep as opposed to shallow artificial neural networks.
- Deep Artificial Neural Networks and Deep Learning are generally the same thing and mostly used interchangeably.
Source: machinelearningmastery.com
- Early stopping in deep learning is a type of regularization where training is halted before the model starts to overfit.
- When training a large network, there will be a point during training when the model stops generalizing and starts learning the statistical noise in the training dataset. This hurts the network's ability to predict new data.
- Using early stopping in a neural network helps prevent the network from overfitting.
- One way of implementing early stopping is to train the model while monitoring its performance on a validation set, and to stop the training process as soon as that performance starts to degrade.
Source: Neural Networks and Deep Learning: A Textbook by Charu C. Aggarwal
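One possible way to set this up, sketched here with Keras/TensorFlow on toy data (the layer sizes, patience value, and dataset are assumptions, not from the original answer):

import numpy as np
import tensorflow as tf

# toy binary classification data
X = np.random.rand(500, 10).astype("float32")
y = (X.sum(axis=1) > 5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# stop when validation loss has not improved for 5 epochs, keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[early_stop], verbose=0)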
- Ensemble methods are used to increase the generalization power of a model. These methods are applicable to both deep learning as well as machine learning algorithms.
- Some ensemble methods introduced in neural networks are Dropout and DropConnect. The improvement in the model depends on the type of data and the nature of the neural architecture.
Source: Neural Networks and Deep Learning: A Textbook by Charu C. Aggarwal
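As an illustrative sketch (assuming Keras/TensorFlow), Dropout can be added between layers like this; the architecture and dropout rate are arbitrary choices:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")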
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q11: What does the hidden layer in a Neural Network compute? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q13: What is the difference between Linear Activation Function and Non-linear Activation Function? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q18: How can you optimise the architecture of a Deep Learning classifier using Genetic Algorithms? βββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Dimensionality Reduction Interview Questions
- As the amount of data required to train a model increases, it becomes harder and harder for machine learning algorithms to handle. As more features are added to the machine learning process, the more difficult the training becomes.
- In very high-dimensional space, supervised algorithms learn to separate points and build function approximations to make good predictions. When the number of features increases, this search becomes expensive, both in time and compute. It might become impossible to find a good solution fast enough. This is the curse of dimensionality.
- Using unsupervised dimensionality reduction, the most salient features can be discovered in the original feature set. The dimension of the feature set can then be reduced to a more manageable number while losing very little information in the process. This helps supervised learning find the optimal function to approximate the dataset.
Source: www.amazon.com
- Principal Component Analysis (PCA) is the process of computing principal components and using them to perform a change of basis on the data.
- The principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i - 1 vectors. The best-fitting line is defined as the line that minimizes the average squared distance from the points to the line.
- PCA is commonly used in dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
Source: en.wikipedia.org
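A minimal sketch of this projection onto the first few principal components, assuming scikit-learn and its bundled Iris dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep only the first two principal components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance retained by each component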
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q16: What are the rules for generating a random matrix when Gaussian Random Projection is used? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Ensemble Learning Interview Questions
Ensemble learning is a machine learning paradigm where multiple models (often called βweak learnersβ) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.
Source: towardsdatascience.com
- Random Forests is a type of ensemble learning method for classification, regression, and other tasks.
- Random Forests work by constructing many decision trees at training time and averaging the predictions of trees trained on different parts of the same training set.
Source: en.wikipedia.org
In ensemble learning theory, we call weak learners (or base models) models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform so well by themselves, either because they have a high bias (low degree of freedom models, for example) or because they have too much variance to be robust (high degree of freedom models, for example).
Source: towardsdatascience.com
- Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model.
- Random Forest is a type of ensemble method.
- The number of component classifiers in an ensemble has a great impact on the accuracy of the prediction, although there is a law of diminishing returns in ensemble construction.
Source: towardsdatascience.com
- Random forest is an ensemble learning method that works by constructing a multitude of decision trees. A random forest can be constructed for both classification and regression tasks.
- Random forest generally outperforms a single decision tree, and it also does not have the same tendency to overfit the data that decision trees do.
- A decision tree trained on a specific dataset will become very deep and cause overfitting. To create a random forest, decision trees can be trained on different subsets of the training dataset, and then the different decision trees can be averaged with the goal of decreasing the variance.
Source: en.wikipedia.org
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q9: What is the difference between a Weak Learner and a Strong Learner, and why could they be useful? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q17: What are the trade-offs between the different types of Classification Algorithms? How would you choose the best one? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q19: How would you find the optimal number of random features to consider at each split? βββββ
Read answer on π MLStack.Cafe
[β¬] Genetic Algorithms Interview Questions
- Each individual is represented by a chromosome represented by a collection of genes.
- A chromosome is represented by a string of bits, where each bit represents a single gene.
- A population is shown as a group of binary strings where each string represents an individual.
A Genetic Algorithm (GA) is a heuristic search algorithm used to solve search and optimization problems. It belongs to the family of evolutionary algorithms. Genetic algorithms employ the concepts of genetics and natural selection to provide solutions to problems.
These algorithms have better intelligence than random search algorithms because they use historical data to take the search to the best performing region within the solution space.
GAs are also based on the behavior of chromosomes and their genetic structure. Every chromosome plays the role of providing a possible solution. The fitness function evaluates the characteristics of all individuals within the population. The greater the fitness value, the better the solution.
Source: www.section.io
There are some of the basic terminologies related to genetic algorithms:
- Population: This is a subset of all the probable solutions that can solve the given problem.
- Chromosomes: A chromosome is one of the solutions in the population.
- Gene: This is an element in a chromosome.
- Allele: This is the value given to a gene in a specific chromosome.
- Fitness function: This is a function that evaluates how good a candidate solution is. The solution is used as the input, while the output is a measure of the solution's suitability.
- Genetic operators: In genetic algorithms, the best individuals mate to reproduce an offspring that is better than the parents. Genetic operators are used for changing the genetic composition of this next generation.
Source: www.section.io
- A fitness function is a function that maps the chromosome representation into a scalar value.
- At each iteration of the algorithm, each individual is evaluated using a fitness function.
- The individuals with a better fitness score are more likely to be chosen for reproduction and be represented in the next generation.
- The fitness function seeks to optimize the problem that is being solved.
Source: ai.stackexchange.com
- Mutation introduces new patterns in the chromosomes, and it helps to find solutions in uncharted areas.
- Mutations are implemented as random changes in the chromosomes. It may be programmed, for example, as a bit flip where a single bit of the chromosome changes.
- The purpose of mutation is to periodically refresh the population.
Source: www.amazon.com
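A small illustrative sketch (not from the original answer) of a bit-flip mutation on a binary chromosome; the mutation probability and chromosome values are made up:

import random

def mutate(chromosome, p_mutation=0.01):
    # flip each bit independently with probability p_mutation
    return [1 - gene if random.random() < p_mutation else gene for gene in chromosome]

parent = [0, 1, 1, 0, 1, 0, 0, 1]
child = mutate(parent, p_mutation=0.1)
print(parent, "->", child)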
- It has the capability to globally optimize the problem instead of just finding the local minima or maxima.
- It can handle problems with complex mathematical representation.
- It can handle problems that lack mathematical representation.
- It is resilient to noise.
- It can support parallel and distributed processing.
- It is suitable for continuous learning.
- It provides answers that improve over time.
- A genetic algorithm does not need derivative information.
Source: www.amazon.com
- While the average fitness of the population increases as the generations go by, the best individuals from the current generation may be lost due to the selection, crossover, and mutation operators. This problem is solved by the process known as Elitism.
- Elitism guarantees that the best individuals always make it to the next generation: a predefined number n of the best individuals are duplicated into the next generation. The individuals selected for duplication are also eligible to be parents of the new individuals.
Source: en.wikipedia.org
The crossover and mutation operations are applied separately to each dimension of the array that forms the real-coded chromosome.
For example, if [1.23, 9.81, 6.34] and [-30.23, 12.67, -42.69] are selected for the crossover operation, the crossover will be done between 1.23 and -30.23 (first dimension), between 9.81 and 12.67 (second dimension), and between 6.34 and -42.69 (third dimension).
Source: www.amazon.com
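As a hedged sketch of one common per-dimension operator (arithmetic/blend crossover, which is only one of several possible choices), using the example chromosomes above:

import random

def arithmetic_crossover(parent_a, parent_b):
    # combine the two real-coded chromosomes dimension by dimension
    child = []
    for a, b in zip(parent_a, parent_b):
        alpha = random.random()              # blend factor drawn for this dimension
        child.append(alpha * a + (1 - alpha) * b)
    return child

p1 = [1.23, 9.81, 6.34]
p2 = [-30.23, 12.67, -42.69]
print(arithmetic_crossover(p1, p2))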
Read answer on π MLStack.Cafe
Q10: What are the differences between Genetic Algorithms and Traditional Search and Optimization Algorithms? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q15: What is the difference between Stochastic Universal Sampling and Roulette Wheel Selection? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q24: Can you explain Elitism in the context of Genetic Algorithms and its impact on GA performance? βββ
Read answer on π MLStack.Cafe
Q25: What is the difference between Tournament Selection and Elitism in Genetic Algorithm? βββ
Read answer on π MLStack.Cafe
Q26: What are the advantages of using Floating-Point number to represent chromosomes instead of Binary numbers? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q38: What are Constraint Satisfaction Problems and why are Genetic Algorithms suited to solve them? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q42: How can you optimise the architecture of a Deep Learning classifier using Genetic Algorithms? βββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Gradient Descent Interview Questions
- A Gradient Descent is a type of optimization algorithm used to find the local minimum of a differentiable function.
- The main idea behind the gradient descent is to take steps in the negative direction of the gradient. This will lead to the steepest descent and eventually it will lead to the minimum point.
- It is expressed by the update equation: $$ a_{n+1} = a_n - \gamma \nabla F(a_n) $$
Where:
- $$a_n$$ is the current point.
- $$\gamma$$ is the step size (learning rate).
- $$F$$ is the multi-variable function being minimized, and $$\nabla F(a_n)$$ is its gradient at $$a_n$$.
Source: en.wikipedia.org
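A minimal sketch of this update rule in plain Python, using a hypothetical one-dimensional function F(x) = (x - 3)^2 whose gradient is 2(x - 3):

def gradient_descent(grad_f, start, step_size=0.1, n_iters=100):
    # repeatedly step against the gradient: a_{n+1} = a_n - gamma * grad F(a_n)
    a = start
    for _ in range(n_iters):
        a = a - step_size * grad_f(a)
    return a

print(gradient_descent(lambda x: 2 * (x - 3), start=0.0))  # converges towards 3.0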
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q4: Compare the Mini-batch Gradient Descent, Stochastic Gradient Descent, and Batch Gradient Descent βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q7: In which case would you use the Gradient Descent method or Ordinary Least Squares and why? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q12: How is the Adam Optimization Algorithm different when compared to Stochastic Gradient Descent? ββββ
Read answer on π MLStack.Cafe
Q13: Name some advantages of using Gradient descent vs Ordinary Least Squares for Linear Regression ββββ
Read answer on π MLStack.Cafe
[β¬] K-Means Clustering Interview Questions
- K-nearest neighbors (KNN) is a supervised classification algorithm. This means that we need labeled data to classify an unlabeled data point. It attempts to classify a data point based on its proximity to the other K data points in the feature space.
- K-means Clustering is an unsupervised clustering algorithm. It requires only a set of unlabeled points and a threshold K, and it gathers and groups the data into K clusters.
Source: www.quora.com
- Clustering in which the data consist of similarities between pairs of points is called similarity-based clustering.
- A typical example of similarity-based clustering is community detection in social networks, where the observations are individual links between people, which may be due to friendship, shared interests, and work relationships. The strength of a link can be the frequency of interactions, for example, communications by e-mail, phone, or other social media, co-authorships, or citations.
- In this clustering paradigm, the points to be clustered are not assumed to be part of a vector space. Their attributes (or features) are incorporated into a single dimension, the link strength or similarity, which takes a numerical value $$S_{ij}$$ for each pair of points $$i, j$$. Hence, the natural representation for this problem is by means of the similarity matrix: $$ S = [S_{ij}]_{i,j=1}^n $$ The similarities are symmetric, $$S_{ij} = S_{ji}$$, and nonnegative, $$S_{ij} \geq 0$$.
Source: Handbook of Cluster Analysis from Chapman and Hall/CRC
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q9: How to tell if data is clustered enough for clustering algorithms to produce meaningful results? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] K-Nearest Neighbors Interview Questions
There is no single rule of thumb for choosing an optimal k. This value varies from dataset to dataset, but as a general rule, the main goal is to keep it:
- small enough to exclude the samples of the other classes but
- large enough to minimize any noise in the data.
A way of looking for this optimal parameter, commonly called the Elbow method, consists in creating a loop that trains several KNN models with different k values, keeping track of the error for each of these models, and using the model with the k value that achieves the best accuracy.
Source: medium.com
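A rough sketch of that loop, assuming scikit-learn and its Iris dataset; the range of k values tried here is arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):                                   # try several values of k
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # cross-validated accuracy for this k

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))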
- KNN:
  - The k-neighbors classification is a very commonly used technique and is widely applied in various scenarios.
  - KNN implements learning based on the k nearest neighbors of each query point, where k is an integer-valued hyperparameter.
  - The optimal choice of the value k is highly data-dependent: in general, a larger k suppresses the effects of noise but makes the classification boundaries less distinct.
- RNN:
  - The r-neighbors classification is used in cases where the data is not uniformly sampled or is sparse.
  - RNN implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a float-valued hyperparameter.
  - The optimal fixed radius r is chosen such that points in sparser neighborhoods use fewer nearest neighbors for the classification.
Source: scikit-learn.org
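A brief sketch, assuming scikit-learn, of the two estimators side by side; the value of k and the radius are illustrative guesses, not recommendations:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # a common radius only makes sense on scaled features

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)      # k is an integer hyperparameter
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)    # r is a float hyperparameter

print(knn.predict(X[:3]))
print(rnn.predict(X[:3]))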
It's not recommended to perform K-NN on large datasets, given that the computational and memory costs can be high. To understand why, remember how the K-NN algorithm works:
- It starts by calculating the distances to all vectors in the training set and storing them.
- Then, it sorts the calculated distances.
- Then, it keeps the K nearest vectors.
- And finally, it returns the most frequent class among the K nearest vectors.
So, implementing K-NN on a large dataset means not only storing a large amount of data but also repeatedly calculating and sorting all of those distances, which is computationally costly. For that reason, K-NN is not recommended in such cases, and another classification algorithm like Naive Bayes or SVM is preferred.
Source: towardsdatascience.com
- k-Nearest Neighbors is a supervised machine learning algorithm that can be used to solve both classification and regression problems.
- It assumes that similar things are closer to each other in certain feature spaces, in other words, similar things are in close proximity.
- Similar points lie close to each other in the feature space; KNN hinges on this assumption being true enough for the algorithm to be useful.
- There are many different ways of calculating the distance between the points, however, the straight line distance (Euclidean distance) is a popular and familiar choice.
Source: towardsdatascience.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Linear Algebra Interview Questions
Read answer on π MLStack.Cafe
[β¬] Linear Regression Interview Questions
Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It's used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog).
Source: ml-cheatsheet.readthedocs.io
- Simple linear regression uses the traditional slope-intercept form, where x represents our input data and y represents our prediction: $$ y = mx + b $$
- A more complex, multi-variable linear equation might look like this, where w represents the coefficients, or weights, our model will try to learn: $$ f(x, y, z) = w_1 x + w_2 y + w_3 z $$
The variables x, y, z represent the attributes, or distinct pieces of information, we have about each observation. For sales predictions, these attributes might include a company's advertising spend on radio, TV, and newspapers: $$ Sales = w_1 \cdot Radio + w_2 \cdot TV + w_3 \cdot News $$
Source: ml-cheatsheet.readthedocs.io
We can use the following statistics to test the model's fitness:
- R-squared: It is a statistical measure of how close the data points are to the fitted regression line. Its value is always between 0 and 1. The closer to 1, the better the regression model fits the observations.
- F-test: It evaluates the null hypothesis that the data is described by an intercept-only model (a regression with all the coefficients equal to zero) versus the alternative hypothesis that at least one coefficient is not zero. If the p-value for the F-test is less than the significance level, we can reject the null hypothesis and conclude that the model provides a better fit than the intercept-only model.
- Root Mean Square Error (RMSE): It measures the average deviation of the estimates from the observed values. How good this value is must be assessed for each project and context. For example, an RMSE of 1,000 for a house price prediction is probably good, as houses tend to have prices over $100,000, but an RMSE of 1,000 for a life expectancy prediction is probably terrible, as the average life expectancy is around 78.
Source: en.wikipedia.org
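A small sketch, assuming scikit-learn, of computing R-squared and RMSE for some made-up house-price predictions:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([250_000, 310_000, 190_000, 420_000])
y_pred = np.array([245_000, 320_000, 200_000, 400_000])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE is in the same units as the target

print(f"R-squared: {r2:.3f}, RMSE: {rmse:,.0f}")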
- The Mean Squared Error measures the variance of the residuals and is used when we want to punish outliers in the dataset. It's defined as: $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
- The Mean Absolute Error measures the average absolute size of the residuals in the dataset. It is used when we don't want outliers to play a big role. It can also be useful if we know that our distribution is multimodal, and it's desirable to have predictions at one of the modes, rather than at the mean of them. It's defined as: $$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| $$
Source: medium.com
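The different sensitivity to outliers can be seen in a tiny NumPy sketch with made-up numbers, where a single bad prediction inflates the MSE far more than the MAE:

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([10.5, 11.5, 11.0, 12.5, 30.0])   # the last prediction is an outlier

mse = np.mean((y_true - y_pred) ** 2)   # squares the residuals, so the outlier dominates
mae = np.mean(np.abs(y_true - y_pred))  # treats all residuals linearly

print(f"MSE: {mse:.2f}  MAE: {mae:.2f}")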
The common pattern for overfitting can be seen on learning curve plots, where model performance on the training dataset continues to improve (e.g. loss or error continues to fall) and performance on the test or validation set improves to a point and then begins to get worse.
So an overfit model will have extremely low training error but a high testing error.
Source: towardsdatascience.com
- Covariance measures whether a variation in one variable results in a variation in another variable, and deals with the linear relationship of only 2 variables in the dataset. Its value can range from -∞ to +∞. Simply speaking, Covariance indicates the direction of the linear relationship between variables.
- Correlation measures how strongly two or more variables are related to each other. Its values are between -1 and 1. Correlation measures both the strength and direction of the linear relationship between two variables. Correlation is a function of the covariance.
Source: careerfoundry.com
The Learning Rate is a hyper-parameter that determines the step size at each iteration while moving towards a minimum in Gradient Descent. This value should be neither too small nor too large: if it's too small, convergence takes too long, and if it's too large, the steps overshoot and the algorithm may never reach the global minimum, even after many iterations.
Source: priyaroychowdhury.medium.com
- Measuring the distance of the observed y-values from the predicted y-values at each value of x.
- Squaring each of these distances.
- Calculating the mean of the squared distances.
MSE = (1/n) * Σ(actual − forecast)²
- The smaller the Mean Squared Error, the closer you are to finding the line of best fit.
- How good or bad this final value is always depends on the context of the problem, but the main goal is to make it as small as possible.
Source: www.scribbr.com
- Linear regression is a linear approach for modeling the relationship between a scalar response and one or more explanatory variables.
- In a supervised linear regression, the model tries to find a linear relationship between the input and output data points. This linear relationship is a straight line if graphed.
- If there is only one explanatory variable it is called simple linear regression, and if there are more than one explanatory variable it is called multiple linear regression.
- A linear function is given by the following equation: $$ y = X\beta + \epsilon $$ where y is the vector of observed values, X is the matrix of explanatory variables, β is the vector of coefficients to be estimated, and ε is the error term.
Source: en.wikipedia.org
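A minimal sketch, assuming scikit-learn, of fitting such a model and recovering the estimated coefficients on synthetic data:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

model = LinearRegression().fit(X, y)       # estimates the coefficient vector beta
print("coefficients:", model.coef_)        # one weight per explanatory variable
print("intercept   :", model.intercept_)
print("prediction  :", model.predict(X[:1]))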
- Non-linear functions have variables with powers greater than 1, like $$x^2$$. If these non-linear functions are graphed, they do not produce a straight line (their direction changes constantly).
- Linear functions have variables with only powers of 1. They form a straight line when graphed.
- Non-linear regression analysis tries to model a non-linear relationship between the independent and dependent variables.
- A simple non-linear relationship is shown below:
- Linear regression analysis tries to model a linear relationship between the independent and dependent variables.
- A simple linear relationship is shown below:
Source: www.columbia.edu
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q16: How would you decide on the importance of variables for the Multivariate Regression model? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q23: In which case would you use the Gradient Descent method or Ordinary Least Squares and why? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q38: How would you address the problem of Heteroskedasticity caused by a Measurement error? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q40: Name some advantages of using Gradient descent vs Ordinary Least Squares for Linear Regression ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Logistic Regression Interview Questions
Logistic regression can be used in classification problems where the output or dependent variable is categorical or binary. However, in order to implement logistic regression correctly, the dataset must also satisfy the following properties:
- There should not be a high correlation between the independent variables. In other words, the predictor variables should be independent of each other.
- There should be a linear relationship between the logit of the outcome and each predictor variable. The logit function is given as logit(p) = log(p/(1-p)), where p is the probability of the outcome.
- The sample size must be large. How large depends on the number of independent variables of the model.
When all the requirements above are satisfied, logistic regression can be used.
Source: careerfoundry.com
Although the task we are targeting in logistic regression is classification, logistic regression does not actually classify things for you by itself: it just gives you probabilities (or log-odds ratios in the logit form).
The only way logistic regression can actually classify is if you apply a rule to the probability output. For example, you may round probabilities greater than or equal to 50% to 1, and probabilities less than 50% to 0, and that's your classification.
Source: ryxcommar.com
A decision boundary is a line or a hyperplane that separates the classes. This is what we expect to obtain from logistic regression, as with any other classifier. With this, we can figure out some way to split the data to allow for an accurate prediction of a given observation's class using the available information.
In the case of a generic two-dimensional example, the split might look something like this:
Source: medium.com
In logistic regression models, we are modeling the probability that an input X belongs to the default class (Y = 1), that is to say: $$ P(X) = P(Y=1|X) $$
where the P(X) values are given by the logistic function: $$ P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} $$
The β0 and β1 values are estimated during the training stage using maximum-likelihood estimation or gradient descent. Once we have them, we can make predictions by simply plugging numbers into the logistic regression equation and calculating the result.
For example, consider a model that predicts whether a person is male or female based on their height: if P(X) ≥ 0.5 the person is predicted to be male, and if P(X) < 0.5 the person is predicted to be female.
Suppose that during the training stage we obtain β0 = -100 and β1 = 0.6, and we want to evaluate the probability that a person with a height of 150 cm is male, so we compute: $$ P(X) = \frac{e^{-100 + 0.6 \cdot 150}}{1 + e^{-100 + 0.6 \cdot 150}} \approx 0.0000454 $$
Given that logistic regression solves a classification task, we can use this value directly to predict that the person is female.
Source: machinelearningmastery.com
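The computation above can be reproduced with a few lines of Python; the coefficients are the hypothetical values from the example, not learned from real data:

import math

b0, b1 = -100, 0.6     # hypothetical coefficients from the worked example above
height = 150

p_male = math.exp(b0 + b1 * height) / (1 + math.exp(b0 + b1 * height))
print(round(p_male, 7))   # ~0.0000454, far below the 0.5 threshold -> predict female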
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q10: What can you infer from each of the hand drawn decision boundary of Logistic Regression below? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q18: Imagine that you know there are outliers in your data, would you use Logistic Regression? ββββ
Read answer on π MLStack.Cafe
Q19: Name some advantages of using Support Vector Machines vs Logistic Regression for classification ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Machine Learning Interview Questions
Essentially, Machine Learning is a method of teaching computers to make and improve predictions or behaviors based on some data. Machine Learning introduces a class of algorithms which are data-driven, i.e. unlike "normal" algorithms, it is the data that "tells" what the "good answer" is. Machine learning creates a model based on sample data and uses the model to make predictions.
A more rigorous explanation: Machine Learning is a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible.
Source: stackoverflow.com
- Machine learning code records "facts" or approximations in some sort of storage, and the algorithms use them to calculate different probabilities.
- The code itself (usually) will not be modified when a machine learns, only the database of what "it knows".
- One example of code actually being modified is Genetic Programming, where you essentially evolve a program to complete a task (of course, the program doesn't modify itself - but it does modify another computer program).
Source: stackoverflow.com
- Overfitting refers to a model that models the training data too well.
- Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Source: machinelearningmastery.com
- Underfitting refers to a model that can neither model the training data nor generalizes to new data.
- An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
- Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms.
Source: machinelearningmastery.com
Every machine learning model has parameters and can additionally have hyper-parameters. Hyper-parameters are those parameters that cannot be directly learned from the regular training process. These parameters express higher-level properties of the model such as its complexity or how fast it should learn.
If a machine learning model were an AM radio, the knobs for tuning the station would be its parameters, but things like the angle of the antenna, the height of the antenna, and the volume knob would be hyperparameters.
Source: medium.com
In ensemble learning theory, we call weak learners (or base models) models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform so well by themselves, either because they have a high bias (low degree of freedom models, for example) or because they have too much variance to be robust (high degree of freedom models, for example).
Source: towardsdatascience.com
What exactly do they have in common and where do they differ? If there is some kind of hierarchy between them, what would it be?
In short
- Statistics studies probability
- Data Mining explains patterns
- Machine Learning predicts with models
- Artificial Intelligence behaves and reasons
More detailed:
- Statistics is concerned with probabilistic models, specifically inference on these models using data.
- Data Mining is about using Statistics as well as other programming methods to find patterns hidden in the data so that you can explain some phenomenon. Data Mining builds intuition about what is really happening in some data and is still a little more toward math than programming, but uses both.
- Machine Learning uses Data Mining techniques and other learning algorithms to build models of what is happening behind some data so that it can predict future outcomes. Math is the basis for many of the algorithms, but this is more toward programming.
- Artificial Intelligence uses models built by Machine Learning and other ways to reason about the world and give rise to intelligent behavior, whether this is playing a game or driving a robot/car. Artificial Intelligence has some goal to achieve by predicting how actions will affect the model of the world and chooses the actions that will best achieve that goal. Very programming based.
Source: stats.stackexchange.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Model Evaluation Interview Questions
- Overfitting refers to a model that models the training data too well.
- Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Source: machinelearningmastery.com
- Underfitting refers to a model that can neither model the training data nor generalizes to new data.
- An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
- Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms.
Source: machinelearningmastery.com
Every machine learning model has parameters and can additionally have hyper-parameters. Hyper-parameters are those parameters that cannot be directly learned from the regular training process. These parameters express higher-level properties of the model such as its complexity or how fast it should learn.
If a machine learning model were an AM radio, the knobs for tuning the station would be its parameters, but things like the angle of the antenna, the height of the antenna, and the volume knob would be hyperparameters.
Source: medium.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q8: What are some advantages and disadvantages of using AUC to measure the performance of the model? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q14: If one algorithm has Higher Precision but Lower Recall than other, how can you tell which algorithm is better? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Natural Language Processing Interview Questions
- Text preprocessing is done to transform a text into a more digestible form so that the machine learning algorithms can perform better. It is found that in tasks such as sentiment analysis, performing some preprocessing such as removing stop-words helps improve the accuracy of the machine learning model.
- Some common text preprocessing steps are:
- removing HTML tags,
- removing stop-words,
- removing numbers,
- lowercasing all letters,
- lemmatization.
Source: towardsdatascience.com
- Stemming just removes the last few characters of a word, often leading to incorrect meanings and spellings.
Consider:
eating -> eat, Caring -> Car.
- Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.
Consider:
Stripes -> Strip (verb) -or- Stripe (noun), better -> good
Source: stackoverflow.com
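A hedged sketch using NLTK (this assumes nltk is installed and the WordNet corpus has been downloaded; the exact outputs depend on the stemmer and lemmatizer chosen):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["eating", "caring", "stripes", "better"]:
    # stemming chops characters off the end; lemmatization looks the word up in a dictionary
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos="v"))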
- Stemming is not computationally expensive, so it should be used where the dataset is large and performance is an issue.
- Lemmatization is computationally expensive because it involves dictionary look-up or a rule-based system. It is recommended for smaller datasets where accuracy is more important.
Source: stackoverflow.com
- PoS tagging is used to classify each word into its part of speech.
- Parts of speech can be used to find grammatical, or lexical patterns without specifying the word used.
- In English especially, the same word can be different parts of speech, hence PoS tagging can be helpful to differentiate between them.
Source: www.sketchengine.eu
- It identifies the occurrence of words in a document. It identifies the vocabulary and the presence of known words. Hence, it is very simple and flexible.
- It is intuitive that documents consisting of similar content will be similar in other ways such as meaning too. So, the BoW process will create a simple and quick group of features that can be used.
- The BoW model can be made as simple, and as complicated as possible. The main difference is how the vocabulary of words is maintained, and how the different words are scored.
Source: machinelearningmastery.com
Definition:
- TF-IDF: Term Frequency-Inverse Document Frequency
- TF: Term Frequency
Difference:
- TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
- TF is a count of the number of times a word occurs in a document.
- TF-IDF is given by the product of term frequency and inverse document frequency: $$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D), \qquad idf(t,D) = \log\frac{N}{|\{d \in D : t \in d\}|} $$ where $$N$$ is the total number of documents in the corpus.
- TF is given by the count of a word in the document divided by the total number of words in d: $$ tf(t,d) = \frac{f_{t,d}}{\sum_{t'\in d} f_{t',d}} $$
Source: towardsdatascience.com
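A short sketch, assuming a recent version of scikit-learn, of computing TF-IDF features for a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)        # rows are documents, columns are terms

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))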
- A one-hot vector is a group of bits in which only one bit is high (1) and all other bits are low (0).
- In Natural Language Processing, one-hot vectors can be used to represent each word of a sentence as a vector of size 1 x N, where N is the number of individual words in the corpus.
- For example, the sentence "Peter picked a piece of pickled pepper" contains 7 distinct words, so each word is represented by a 1 x 7 vector, and the whole sentence becomes: [0000001, 0000010, 0000100, 0001000, 0010000, 0100000, 1000000]
Source: en.wikipedia.org
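A tiny sketch in plain Python that builds such one-hot vectors for the example sentence:

sentence = "Peter picked a piece of pickled pepper".split()
vocab = sorted(set(sentence))                    # 7 distinct words

one_hot = {}
for i, word in enumerate(vocab):
    vec = [0] * len(vocab)
    vec[i] = 1                                   # a single high bit per word
    one_hot[word] = vec

for word in sentence:
    print(word, one_hot[word])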
Steps of text preprocessing can be divided into 3 major types:
- Tokenization: It is a process where a group of texts are divided into smaller pieces, or tokens. Paragraphs are tokenized into sentences, and sentences are tokenized into words.
- Normalization: In databases, normalization converts the structure of the database into a series of normal forms, which organizes the data so it appears similar across all records and fields. Similarly, in NLP, normalization can be the process of converting all words to lowercase. This makes all the sentences and tokens appear the same and does not complicate the machine learning algorithm.
- Noise Removal: It is the process of cleaning up the text, for example removing characters that are not required, such as extra white space, numbers, and special characters.
Source: towardsdatascience.com
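A compact sketch of these three steps in plain Python; the stop-word list and regular expressions are illustrative only:

import re

STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}  # tiny illustrative list

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)       # noise removal: strip HTML tags
    text = text.lower()                        # normalization: lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)      # noise removal: drop numbers and punctuation
    tokens = text.split()                      # tokenization: split into words
    return [t for t in tokens if t not in STOP_WORDS]   # remove stop-words

print(preprocess("<p>The 3 quick foxes are Eating!</p>"))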
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q12: What are some disadvantages of using a One-Hot Vector for Natural Language Processing? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q28: What are some of the differences between NLP and CUI (Conversational User Interface)? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] NaΓ―ve Bayes Interview Questions
- Naive Bayes Classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.
- Bayes' theorem is given by the following equation: $$ P(A|B)=\frac{P(B|A)P(A)}{P(B)} $$
- Using Bayes' theorem, the probability of A happening given that B has occurred can be found.
- An example of the way a Naive Bayes Classifier can be used: given that it has rained, the probability of the temperature being low is P(Temperature|Rain).
Source: towardsdatascience.com
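A tiny numeric sketch of Bayes' theorem with made-up probabilities for the rain/temperature example:

# A = "temperature is low", B = "it rained" (all probabilities are invented for illustration)
p_a = 0.3          # P(A): prior probability of low temperature
p_b = 0.2          # P(B): probability of rain
p_b_given_a = 0.5  # P(B|A): probability of rain given low temperature

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
print(p_a_given_b)                      # 0.75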
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications:
- we consider that these predictors are independent
- we consider that all the predictors have an equal effect on the outcome (like the day being windy does not have more importance in deciding to play golf or not)
Source: towardsdatascience.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q6: How would you use Naive Bayes classifier for categorical features? What if some features are numerical? βββ
Read answer on π MLStack.Cafe
Q7: What's the difference between Generative Classifiers and Discriminative Classifiers? Name some examples of each one βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q13: What are the trade-offs between the different types of Classification Algorithms? How would you choose the best one? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Neural Networks Interview Questions
- A neural network is a network or circuit of neurons composed of artificial neurons or nodes.
- A neural network can be either a biological neural network or an artificial neural network. Artificial neural networks are the ones used to solve AI problems.
- The artificial networks may be used for predictive modeling, adaptive control and applications where they can be trained via a dataset. Self-learning resulting from experience can occur within networks, which can derive conclusions from a complex and seemingly unrelated set of information.
A simple feed-forward neural network is shown below:
Source: en.wikipedia.org
- Artificial neural networks are modelled from biological neurons.
- The connections of the biological neuron are modeled as weights.
- A positive weight reflects an excitatory connection, while negative values mean inhibitory connections.
- All inputs are modified by a weight and summed. This activity is referred to as a linear combination.
- Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1.
Source: en.wikipedia.org
- The neural networks get the optimal weights and bias values through an Error Gradient.
- To decide whether to increase or decrease the current weights and bias, they need to be compared to the optimal values. This comparison uses the gradients of the error with respect to the weights and bias, $$\frac{\partial E}{\partial w}$$ and $$\frac{\partial E}{\partial b}$$.
- The gradient value is calculated by an algorithm called backpropagation.
- An optimization algorithm utilizes the gradient to improve the weight values and bias.
Source: towardsdatascience.com
- A Perceptron is a fundamental unit of a Neural Network that is also a single-layer Neural Network.
- Perceptron is a linear classifier. Since it uses already labeled data points, it is a supervised learning algorithm.
- The activation function applies a step rule (convert the numerical output into +1 or -1) to check if the output of the weighting function is greater than zero or not.
A Perceptron is shown in the figure below:
Source: towardsdatascience.com
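A minimal NumPy sketch of a perceptron's forward pass with hypothetical weights (the step rule maps the weighted sum to +1 or -1):

import numpy as np

def perceptron_predict(x, weights, bias):
    # weighted sum of the inputs followed by a step activation
    linear = np.dot(weights, x) + bias
    return 1 if linear > 0 else -1

weights = np.array([0.4, -0.7, 0.2])   # hypothetical learned weights
bias = 0.1
print(perceptron_predict(np.array([1.0, 0.5, 2.0]), weights, bias))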
- Neural network requires a loss function to be chosen when designing and configuring the model.
- When optimizing the model, the objective function is either the loss function or its negative. The objective function is the function to be maximized or minimized (the candidate solution with the highest or lowest score, respectively). Typically, in a neural network, the error is minimized.
- The loss function should reduce all the aspects of a complex model down to a single scalar value, which allows the candidate solutions to be ranked and compared.
- The loss function chosen by the designer should capture the properties of the problem and be motivated by concerns that are important to the project.
Source: machinelearningmastery.com
- An activation function applies a step rule (converting the numerical output into +1 or -1) to check if the output of the weighting function is greater than zero or not.
- The activation function of a node defines the output of the node given an input or set of inputs to the node.
- Activation functions can be divided into three categories:
- ridge functions,
- radial functions, and
- fold functions.
- A type of ridge function called the Rectified Linear Function (ReLU) is shown below:
- A type of radial function called the Gaussian Function is shown below:
- A fold function performs aggregation over the inputs, such as taking the mean, minimum or maximum.
Source: en.wikipedia.org
- Activation Functions help keep the value of the output from the neuron restricted to a certain limit as per the requirement. If no limit is set, the output can reach very high magnitudes. Most activation functions convert the output to a range of -1 to 1 or of 0 to 1.
- The most important role of the activation function is its ability to add non-linearity to the neural network. Most real-life models are non-linear, so activation functions help to create a non-linear model.
- The activation function is responsible for deciding whether a neuron should be activated or not.
Source: towardsdatascience.com
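A short NumPy sketch of a few common activation functions and the ranges they squash their inputs into:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes output into (0, 1)

def tanh(x):
    return np.tanh(x)              # squashes output into (-1, 1)

def relu(x):
    return np.maximum(0, x)        # non-linear, but unbounded above

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")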
- Forward propagation is the process in which the input data is fed in the forward direction through the network. Each hidden layer accepts the input data, processes it as per the activation function, and passes it to the successive layer.
- Back propagation is the practice of fine-tuning the weights of the neural network based on the error rate obtained from the previous epoch. Proper tuning of the weights ensures low error rates, making the model more reliable.
Source: towardsdatascience.com
Some applications for ANN include:
- System identification and control: Vehicle control, trajectory prediction.
- Medical diagnosis: Identifying cancer, distinguishing highly invasive cancer cell lines from less invasive lines using only cell shape information.
- Sequence recognition: Gesture, speech, handwritten and printed text recognition.
- Geoscience: Hydrology, ocean modelling, coastal engineering, geomorphology.
- Cybersecurity: Discriminating between legitimate activities and malicious ones.
Source: en.wikipedia.org
- When researchers started to create large artificial neural networks, they started to use the word deep to refer to them.
- As the term deep learning started to be used, it is generally understood that it stands for artificial neural networks which are deep as opposed to shallow artificial neural networks.
- Deep Artificial Neural Networks and Deep Learning are generally the same thing and mostly used interchangeably.
Source: machinelearningmastery.com
- Early stopping in deep learning is a type of regularization where training is halted before the model starts to overfit.
- When training a large network, there will be a point during training when the model stops generalizing and starts learning the statistical noise in the training dataset. This hurts the network's ability to predict new data.
- Using early stopping in a neural network helps prevent the network from overfitting.
- One way of implementing early stopping is to train the model while monitoring its performance on a validation set, and to stop the training process as soon as that performance starts to degrade.
Source: Neural Networks and Deep Learning: A Textbook by Charu C. Aggarwal
- Self-Organizing Maps (SOMs) are a class of self-organizing clustering techniques.
- It is an unsupervised form of artificial neural networks. A self-organizing map consists of a set of neurons that are arranged in a rectangular or hexagonal grid. Each neuronal unit in the grid is associated with a numerical vector of fixed dimensionality. The learning process of a self-organizing map involves the adjustment of these vectors to provide a suitable representation of the input data.
- Self-organizing maps can be used for clustering numerical data in vector format.
Source: medium.com
Q13: What are some advantages of using Multilayer Perceptron over a Single-layer Perceptron? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q18: How would you tune the Network Structure (Model Design) Hyperparameters to get the highest accuracy in an artificial Neural Network? βββ
Read answer on π MLStack.Cafe
Q19: How would you tune the Training Algorithm Hyperparameters to get the highest accuracy in a Neural Network? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q35: What does the hidden layer in a Neural Network compute? βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q40: What is the difference between ReLU, Leaky ReLU, Exponential Linear Unit (ELU), and Parametric ReLU (PReLU)? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Optimization Interview Questions
[β¬] Pandas Interview Questions
DataFrame.iloc
is an integer position-based indexer (from 0 to length-1 of the axis) used to retrieve data from a DataFrame; it may also be used with a boolean array. It takes as input an integer, an array of integers, a slice object, a boolean array or a callable function.
df.iloc[0]
df.iloc[-5:]
df.iloc[:, 2] # the : in the first position indicates all rows
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
DataFrame.loc
gets rows (and/or columns) with particular labels. It takes as input a single label, a list or array of labels, or a slice object with labels.
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
df.loc['a'] # equivalent to df.iloc[0]
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]
Source: stackoverflow.com
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
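A minimal split-apply-combine sketch (the column names here are purely illustrative):
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b", "b"],
                   "score": [1, 2, 3, 4]})

# split by "team", apply mean() to each group, combine the results into a Series
print(df.groupby("team")["score"].mean())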
Source: pandas.pydata.org
- By simply iterating over the columns and printing the values:
for col in data.columns:
    print(col)
- Using the .columns attribute of the DataFrame object, which returns the column labels of the DataFrame:
list(data.columns)
- Using the .columns.values attribute, which returns an array of the column labels:
list(data.columns.values)
- Using the sorted() function, which returns the list of columns sorted in alphabetical order:
sorted(data)
Source: www.geeksforgeeks.org
Q4: In Pandas, what do you understand as a bar plot and how can you generate a bar plot visualization? ββ
- A Bar Plot is a plot that presents categorical data with rectangular bars with lengths proportional to the values that they represent.
- A bar plot shows comparisons among discrete categories.
- One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.
# Code sample for how to plot
df.plot.bar(x='x_values', y='y_values')
Source: pandas.pydata.org
DataFrame.iterrows
is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
for index, row in df.iterrows():
print(row['c1'], row['c2'])
10 100
11 110
12 120
Source: stackoverflow.com
You can use the attribute df.empty
to check whether it's empty or not:
if df.empty:
print('DataFrame is empty!')
Source: stackoverflow.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q9: When cleaning data, mention how you will identify outliers present in a DataFrame object βββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Probability Interview Questions
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Python Interview Questions
Common immutable types:
- numbers: int(), float(), complex()
- immutable sequences: str(), tuple(), frozenset(), bytes()
Common mutable types (almost everything else):
- mutable sequences: list(), bytearray()
- set type: set()
- mapping type: dict()
- classes, class instances
- etc.
You have to understand that Python represents all its data as objects. Some of these objects, like lists and dictionaries, are mutable, meaning you can change their content without changing their identity. Other objects, like integers, floats, strings and tuples, cannot be changed.
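A quick illustration of the difference:
nums = [1, 2, 3]
nums[0] = 99              # fine: lists are mutable
print(nums)               # [99, 2, 3]

point = (1, 2, 3)
try:
    point[0] = 99         # tuples are immutable
except TypeError as err:
    print(err)            # 'tuple' object does not support item assignment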
Source: techbeamers.com
Here are a few key points:
- Python is an interpreted language. That means that, unlike languages like C and its variants, Python does not need to be compiled before it is run. Other interpreted languages include PHP and Ruby.
- Python is dynamically typed: you don't need to state the types of variables when you declare them. You can do things like x = 111 and then x = "I'm a string" without error.
- Python is well suited to object-oriented programming in that it allows the definition of classes along with composition and inheritance. Python does not have access specifiers (like C++'s public and private); the justification given for this is "we are all adults here".
- In Python, functions are first-class objects. This means that they can be assigned to variables, returned from other functions and passed into functions. Classes are also first-class objects.
- Writing Python code is quick, but running it is often slower than compiled languages. Fortunately, Python allows the inclusion of C-based extensions, so bottlenecks can be (and often are) optimised away. The numpy package is a good example of this: it is really quite fast because much of the number crunching it does isn't actually done by Python.
Source: codementor.io
You can't, because strings are immutable. In most situations, you should simply construct a new string from the various parts you want to assemble it from. Work with them as lists; turn them into strings only when needed.
>>> s = list("Hello zorld")
>>> s
['H', 'e', 'l', 'l', 'o', ' ', 'z', 'o', 'r', 'l', 'd']
>>> s[6] = 'W'
>>> s
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
>>> "".join(s)
'Hello World'
Source: docs.python.org
- Python is a dynamically typed language. It means that you don't need to mention the data type of variables during their declaration.
- Python supports object-oriented programming, as you can define classes along with composition and inheritance.
- Functions in Python are first-class objects. This means you can assign them to variables, return them from other methods and pass them as arguments.
- Developing in Python is quick, but running it is often slower than compiled languages.
- Python has several usages like web-based applications, test automation, data modeling, big data analytics, and much more.
Source: techbeamers.com
A Lambda Function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression.
Consider:
x = lambda a : a + 10
print(x(5)) # Output: 15
Source: stackoverflow.com
- Use a tuple to store a sequence of items that will not change.
- Use a list to store a sequence of items that may change.
- Use a dictionary when you want to associate pairs of two items.
Source: stackoverflow.com
While in many or most other programming languages variables are treated as global if not declared otherwise, Python deals with variables the other way around. They are local, if not otherwise declared.
- In Python, variables that are only referenced inside a function are implicitly global.
- If a variable is assigned a value anywhere within the functionβs body, itβs assumed to be a local unless explicitly declared as global.
Requiring global for assigned variables provides a bar against unintended side-effects.
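A short sketch of the rule in action:
counter = 0               # global variable

def bump():
    global counter        # without this declaration, "counter += 1" raises UnboundLocalError
    counter += 1

bump()
print(counter)            # 1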
Source: docs.python.org
Negative numbers mean that you count from the right instead of the left. So, list[-1] refers to the last element, list[-2] is the second-last, and so on.
Source: stackoverflow.com
- Global Variables: Variables declared outside a function or in global space are called global variables. These variables can be accessed by any function in the program.
- Local Variables: Any variable declared inside a function is known as a local variable. This variable is present in the local space and not in the global space.
Source: edureka.co
Descriptors were introduced to Python way back in version 2.2. They provide the developer with the ability to add managed attributes to objects. The methods needed to create a descriptor are __get__, __set__ and __delete__. If you define any of these methods, then you have created a descriptor.
Descriptors power a lot of the magic of Python's internals. They are what make properties, methods and even the super function work. They are also used to implement the new-style classes that were also introduced in Python 2.2.
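A minimal descriptor sketch (the class names are illustrative; it uses the __set_name__ hook available since Python 3.6):
class Positive:
    """A managed attribute that validates values on assignment."""
    def __set_name__(self, owner, name):
        self.name = "_" + name
    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)
    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("must be positive")
        setattr(obj, self.name, value)

class Order:
    quantity = Positive()   # the descriptor manages this attribute

o = Order()
o.quantity = 5              # stored via __set__
print(o.quantity)           # 5, read back via __get__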
Source: blog.pythonlibrary.org
Q11: Given variables a and b, switch their values so that b has the value of a, and a has the value of b, without using an intermediary variable ββ
a, b = b, a
Source: adevait.com
Suppose lst is [2, 33, 222, 14, 25]. What is lst[-1]?
It's 25. Negative numbers mean that you count from the right instead of the left. So, lst[-1] refers to the last element, lst[-2] is the second-last, and so on.
Source: adevait.com
- To convert a string into a number, use built-in constructors such as int(): for example, int("1") == 1. float() is used the same way to produce a floating-point number: float("1") == 1.0.
- By default the string is interpreted as decimal, so int("0x1") raises a ValueError. The int(string, base) form takes an explicit base, e.g. int("0x1", 16) == 1. If the base parameter is 0, the prefix of the string determines the base: "0o" indicates octal, "0x" hexadecimal and "0b" binary.
- There is also the eval() function, which can be used to convert a string into a number, but it is a bit slower and presents many security risks.
Source: careerride.com
In Python before 3.10, there is no switch-case statement. You may write a switch-like function (for example, backed by a dictionary) or use a chain of if-elif-else statements. To implement such a function, we may use a dictionary:
def switch_demo(argument):
switcher = {
1: "January",
2: "February",
3: "March",
4: "April",
5: "May",
6: "June",
7: "July",
8: "August",
9: "September",
10: "October",
11: "November",
12: "December"
}
    print(switcher.get(argument, "Invalid month"))
Python 3.10 (2021) introduced the match-case statement, which provides a first-class implementation of a "switch" for Python. For example:
def f(x):
match x:
case 'a':
return 1
case 'b':
return 2
The match-case statement is considerably more powerful than this simple example.
Source: github.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q79: What is the difference between a function decorated with @staticmethod and one decorated with @classmethod? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q87: Is there any downside to the -O flag apart from missing on the built-in debugging information? βββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Random Forests Interview Questions
- Random Forest is an ensemble learning method for classification, regression, and other tasks.
- A Random Forest works by constructing many decision trees at training time. It does this by training several decision trees on different random parts of the same training set and then averaging their predictions.
Source: en.wikipedia.org
- Classification: The output of the Random Forest is the one selected by the most trees.
- Regression: The output of the Random Forest is the mean or average prediction of the individual trees.
Source: en.wikipedia.org
- Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model.
- Random Forest is a type of ensemble method.
- The number of component classifiers in an ensemble has a great impact on the accuracy of the prediction, although there is a law of diminishing returns in ensemble construction.
Source: towardsdatascience.com
In Random Forest the hyperparameters include:
- Number of decision trees in the forest.
- Number of features considered by each tree when splitting a node.
- The maximum depth of the individual trees.
- The minimum samples to split on at an internal node.
- The maximum number of leaf nodes.
- Number of random features.
- The size of the bootstrapped dataset.
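As a hedged sketch, the list above maps roughly onto the constructor parameters of scikit-learn's RandomForestClassifier (the values shown are illustrative, not recommendations):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # number of decision trees in the forest
    max_features="sqrt",     # features considered by each tree when splitting a node
    max_depth=None,          # maximum depth of the individual trees
    min_samples_split=2,     # minimum samples to split on at an internal node
    max_leaf_nodes=None,     # maximum number of leaf nodes
    max_samples=None,        # size of the bootstrapped dataset (None = same size as the training set)
    random_state=42,
)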
Source: towardsdatascience.com
- Because the observations are sampled with replacement, each bootstrapped dataset will be different even when its size matches that of the original training set.
- Due to this, the full size of the training data can be used.
Most of the time, the best thing to do is not to touch this hyperparameter.
Source: towardsdatascience.com
- Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances.
- Random Forest usually does not require pruning because it will not over-fit like a single decision tree. This happens due to the fact that the trees are bootstrapped and that multiple random trees use random features so the individual trees are strong without being correlated with each other.
Source: stats.stackexchange.com
- The OOB (Out-Of-Bag) error for a random forest plays a role similar to cross-validation, so it is not necessary to perform cross-validation separately.
- By default, random forest picks up about 2/3rd of the data for training and the rest for testing in regression, and almost 70% of the data for training and the rest for testing in classification. In principle, since it randomizes the variable selection during each tree split, it is not prone to overfit like other models.
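A hedged sketch of using the OOB score as a built-in validation estimate with scikit-learn (X and y stand for any classification feature matrix and label vector, assumed to be defined):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)                 # X, y: your training data
print(rf.oob_score_)         # accuracy estimated from the out-of-bag samples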
Source: datascience.stackexchange.com
- Random forest is an ensemble learning method that works by constructing a multitude of decision trees. A random forest can be constructed for both classification and regression tasks.
- Random forest outperforms decision trees, and it also does not have the habit of overfitting the data as decision trees do.
- A decision tree trained on a specific dataset will become very deep and cause overfitting. To create a random forest, decision trees can be trained on different subsets of the training dataset, and then the different decision trees can be averaged with the goal of decreasing the variance.
Source: en.wikipedia.org
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q31: Explain how it is possible to get feature importance in Random Forest using Out Of Bag Error ββββ
Read answer on π MLStack.Cafe
Q32: Imagine that you know there are outliers in your data, would you use Logistic Regression? ββββ
Read answer on π MLStack.Cafe
Q33: How would you find the optimal number of random features to consider at each split? βββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] SQL Interview Questions
A view is simply a virtual table that is made up of elements of multiple physical or "real" tables. Views are most commonly used to join multiple tables together, or control access to any tables existing in background server processes.
Source: github.com/dhaval1406
In a nutshell, a temp table is a temporary storage structure. Basically, you can use a temp table to store data temporarily so you can manipulate and change it before it reaches its destination format.
Source: github.com/dhaval1406
- A PRIMARY KEY constraint is a unique identifier for a row within a database table.
- Every table should have a primary key constraint to uniquely identify each row and only one primary key constraint can be created for each table.
- The primary key constraints are used to enforce entity integrity.
Source: github.com/dhaval1406
- A DEFAULT constraint supplies a value for a column if no value is set for that column.
- DEFAULT can be defined on number and datetime fields.
- DEFAULT cannot be defined on timestamp and IDENTITY columns.
Source: github.com/chetansomani
- A FOREIGN KEY constraint prevents any actions that would destroy links between tables with the corresponding data values.
- A foreign key in one table points to a primary key in another table.
- Foreign keys prevent actions that would leave orphaned rows, i.e. rows whose foreign key value has no matching primary key.
- Foreign key constraints are used to enforce referential integrity.
Source: github.com/dhaval1406
Normalization is basically designing a database schema so that duplicate and redundant data are avoided. If the same information is repeated in multiple places in the database, there is a risk that it is updated in one place but not the others, leading to data corruption.
There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of a specific kind of problem.
A database with normalization errors opens the risk of getting invalid or corrupt data into the database. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.
Source: stackoverflow.com
- DELETE is a Data Manipulation Language (DML) command. It can be used for deleting some specified rows from a table. The DELETE command can be used with a WHERE clause.
- TRUNCATE is a Data Definition Language (DDL) command. It deletes all the records of a particular table. TRUNCATE is faster in comparison to DELETE. While the DELETE command can be rolled back, TRUNCATE cannot be rolled back in MySQL.
Source: stackoverflow.com
Q8: What is the difference between Data Definition Language (DDL) and Data Manipulation Language (DML)? ββ
- Data definition language (DDL) commands are the commands which are used to define the database. CREATE, ALTER, DROP and TRUNCATE are some common DDL commands.
- Data manipulation language (DML) commands are commands which are used for manipulation or modification of data. INSERT, UPDATE and DELETE are some common DML commands.
Source: en.wikibooks.org
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] SVM Interview Questions
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
Support vector machines focus only on the points that are the most difficult to tell apart, whereas other classifiers pay attention to all of the points.
The intuition behind the support vector machine approach is that if a classifier is good at the most challenging comparisons (the points in B and A that are closest to each other), then the classifier will be even better at the easy comparisons (comparing points in B and A that are far away from each other).
Source: towardsdatascience.com
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Source: towardsdatascience.com
- Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane.
- Using these support vectors, we maximize the margin of the classifier.
- For computing predictions, only the support vectors are used.
Source: towardsdatascience.com
- Data is rarely so clean that a single hyperplane can linearly separate and classify it. In order to classify such a dataset, it is necessary to move away from a 2D view of the data to a 3D view.
- This "lifting" of the data points represents the mapping of data into a higher dimension. This is known as kernelling.
- For example, data that is not linearly separable in 1 dimension becomes linearly separable after applying the transformation Φ(x) = x² and adding this second dimension to our feature space.
Source: www.kdnuggets.com
- Hard-Margin SVMs require linearly separable training data; no data points are allowed in the margin area. This type of linear classification is known as hard-margin classification.
- Soft-Margin SVMs handle training data that is not linearly separable. A margin violation means choosing a hyperplane that allows some data points to stay either inside the margin area or on the incorrect side of the hyperplane.
- Hard-Margin SVMs are quite sensitive to outliers.
- Soft-Margin SVMs try to find the best balance between keeping the margin as large as possible and limiting the margin violations.
Source: towardsdatascience.com
SVM is a supervised learning algorithm. The aim of using SVM is to correctly classify unseen data. Some common applications of SVM are:
- Face detection: SVMs classify parts of the image as face and non-face and create a square boundary around the face.
- Text and hypertext categorization: SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories, scoring each document and comparing the score with a threshold value.
- Classification of images: SVMs provide better search accuracy for image classification, in comparison to traditional query-based searching techniques.
- Bioinformatics: this includes protein classification and cancer classification. SVMs are used for identifying the classification of genes, classifying patients on the basis of genes, and other biological problems.
- Protein fold and remote homology detection: SVM algorithms are applied to protein remote homology detection.
- Handwriting recognition: SVMs are widely used to recognize handwritten characters.
- Generalized predictive control (GPC): SVM-based GPC is used to control chaotic dynamics with useful parameters.
Source: data-flair.training
- Guaranteed optimality: owing to the nature of convex optimization, the solution will always be a global minimum, not a local minimum.
- Abundance of implementations: we can access it conveniently, be it from Python or Matlab.
- SVM can be used for linearly separable as well as non-linearly separable data: linearly separable data calls for a hard margin, whereas non-linearly separable data calls for a soft margin.
- SVMs extend to semi-supervised learning models, so they can be used where data is labeled as well as unlabeled. This only requires an extra condition on the minimization problem, known as the Transductive SVM.
- Feature mapping used to be quite a load on the computational complexity of training the model; with the help of the kernel trick, however, SVM can carry out the feature mapping using a simple dot product.
Source: data-flair.training
Let's say I am not using any kind of kernel, and it is a hard-margin SVM.
- For a hard-margin SVM, all of the support vectors lie exactly on the margin.
- Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
Source: stats.stackexchange.com
- Linear kernel: also referred to as the non-kernel, it is defined as the inner product of x and y plus an optional constant term c: $$ K(x,y) = x^Ty + c $$ It is typically used on data sets with large numbers of features.
- Polynomial kernel: a more generalized form of the linear kernel that can distinguish curved or nonlinear input space: $$ K(x,y) = (\alpha x^T y + c)^d $$ where the three parameters are α, c, and d. The most common degree d used is 2, as larger degrees can lead to overfitting.
- Radial Basis Function (RBF) kernel: can map an input space into an infinite-dimensional space and is defined as: $$ K(x,y) = e^{-\gamma ||x-y||^2} $$ where γ is a parameter that scales the amount of influence two points have on each other; its range lies from 0 to 1. A higher value of gamma will fit the training dataset more closely, which can cause overfitting. Generally, γ = 0.1 is considered to be a good default value.
Source: www.datacamp.com
When it comes to classification problems, the goal is to establish a decision boundary that maximizes the margin between the classes. However, in the real world this task can become difficult when we have to deal with non-linearly separable data. One approach to solving this problem is to perform a data transformation: map all the data points to a higher dimension, find the boundary there, and make the classification.
That sounds alright; however, as the number of dimensions grows, computations within that space become more and more expensive. In such cases the kernel trick allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space, and therefore offers a more efficient and less expensive way to transform data into higher dimensions.
There exist different kernel functions, such as:
- linear,
- nonlinear,
- polynomial,
- radial basis function (RBF), and
- sigmoid.
Each one of them can be suitable for a particular problem depending on the data.
Source: medium.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q36: Is there a relation between the Number of Support Vectors and the classifier's performance? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q42: Name some advantages of using Support Vector Machines vs Logistic Regression for classification ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q47: How do you approximate RBF kernel to scale with large numbers of training samples? βββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q49: Why does SVM work well in practice, even if the reproduced space is very high dimensional? βββββ
Read answer on π MLStack.Cafe
[β¬] Scikit-Learn Interview Questions
Read answer on π MLStack.Cafe
[β¬] Searching Interview Questions
When the list is sorted we can use the binary search (also known as half-interval search, logarithmic search, or binary chop) technique to find items on the list. Here's a step-by-step description of using binary search:
- Let min = 1 and max = n.
- Guess the average of max and min, rounded down so that it is an integer.
- If you guessed the number, stop. You found it!
- If the guess was too low, set min to be one larger than the guess.
- If the guess was too high, set max to be one smaller than the guess.
- Go back to step two.
Each binary search operation halves the size of the remaining problem, hence the complexity of binary search is O(log n). The binary search algorithm can be written either recursively or iteratively.
Source: www.tutorialspoint.com
Time Complexity: O(log n) Space Complexity: O(1) for the iterative version, O(log n) for the recursive version
var binarySearch = function(array, value) {
var guess,
min = 0,
max = array.length - 1;
while(min <= max){
guess = Math.floor((min + max) /2);
if(array[guess] === value)
return guess;
else if(array[guess] < value)
min = guess + 1;
else
max = guess - 1;
}
return -1;
}
// binary search example in Java
/* here Arr is an array of integer type, n is the size of the array
   and target is the element to be found */
int binarySearch(int Arr[], int n, int target) {
// set starting and ending index
int start = 0, end = n - 1;
while(start <= end) {
// take mid of the list
int mid = (start + end) / 2;
// we found a match
if(Arr[mid] == target) {
return mid;
}
// go on right side
else if(Arr[mid] < target) {
start = mid + 1;
}
// go on left side
else {
end = mid - 1;
}
}
// element is not present in list
return -1;
}
def BinarySearch(lys, val):
first = 0
last = len(lys)-1
index = -1
while (first <= last) and (index == -1):
mid = (first+last)//2
if lys[mid] == val:
index = mid
else:
if val<lys[mid]:
last = mid -1
else:
first = mid +1
return index
Linear (sequential) search goes through all possible elements in some array and compares each one with the desired element. It may take up to O(n) operations, where n is the size of the array, and is widely considered to be horribly slow. With each operation, linear search reduces the size of the problem by only one element (whereas one binary search operation halves the problem). Despite this, it can still be used when:
- You need to perform this search only once,
- You are forbidden to rearrange the elements and you do not have any extra memory,
- The array is tiny, such as ten elements or less, or the performance is not an issue at all,
- Even though in theory other search algorithms may be faster than linear search (for instance binary search), in practice even on medium-sized arrays (around 100 items or less) it might be infeasible to use anything else. On larger arrays, it only makes sense to use other, faster search methods if the data is large enough, because the initial time to prepare (sort) the data is comparable to many linear searches,
- When the list items are arranged in order of decreasing probability, and these probabilities are geometrically distributed, the cost of linear search is only O(1),
- You have no idea what you are searching for.
When you ask MySQL something like SELECT x FROM y WHERE z = t, and z is a column without an index, a linear search is performed, with all the consequences of it. This is why adding an index to searchable columns is important.
Source: bytescout.com
Time Complexity: O(n) Space Complexity: O(1)
- A linear search runs in at worst linear time and makes at most n comparisons, where n is the length of the list. If each element is equally likely to be searched, then linear search has an average case of (n+1)/2 comparisons, but the average case can be affected if the search probabilities for each element vary.
- When the list items are arranged in order of decreasing probability, and these probabilities are geometrically distributed, the cost of linear search is only O(1).
function linearSearch(array, toFind){
for(let i = 0; i < array.length; i++){
if(array[i] === toFind) return i;
}
return -1;
}
# can be simply done using the 'in' operator
if x in arr:
    print(arr.index(x))
# If you want to implement Linear Search in Python
def search(arr, x):
for i in range(len(arr)):
if arr[i] == x:
return i
return -1
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q15: For Binary Search, why do we need to round down the average? Could we round up instead? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Sorting Interview Questions
Bubble Sort is based on the idea of repeatedly comparing pairs of adjacent elements and then swapping their positions if they are in the wrong order. Bubble sort is a stable, in-place sort algorithm.
How it works:
- In an unsorted array of n elements, start with the first two elements and sort them in ascending order (compare the elements to check which one is greater).
- Compare the second and third elements to check which one is greater, and sort them in ascending order.
- Compare the third and fourth elements to check which one is greater, and sort them in ascending order.
- ...
- Repeat steps 1 through n until no more swaps are required.
Source: github.com
Time Complexity: O(n^2) Space Complexity: O(1)
Bubble sort has a worst-case and average complexity of O(n^2), where n is the number of items being sorted. When the list is already sorted (best-case), the complexity of bubble sort is only O(n). The space complexity for Bubble Sort is O(1), because only a single additional memory space is required (for the temp swap element).
// Normal
const bubbleSort = function(array) {
let swaps;
do {
swaps = false;
for (let i = 0; i < array.length - 1; i++) {
if (array[i] > array[i + 1]) {
let temp = array[i + 1];
array[i + 1] = array[i];
array[i] = temp;
swaps = true;
}
}
} while (swaps);
return array;
};
// Recursively
const bubbleSort = function (array, pointer = array.length - 1) {
// Base Case
if (pointer === 0) {
return array;
}
for (let i = 0; i < pointer; i++) {
if (array[i] > array[i + 1]) {
let temp = array[i + 1];
array[i + 1] = array[i];
array[i] = temp;
}
}
// Recursive call on smaller portion of the array
return bubbleSort(array, pointer - 1);
};
def bubbleSort(arr):
n = len(arr)
# Traverse through all array elements
for i in range(n):
# Last i elements are already in place
for j in range(0, n-i-1):
# traverse the array from 0 to n-i-1
# Swap if the element found is greater
# than the next element
if arr[j] > arr[j+1] :
arr[j], arr[j+1] = arr[j+1], arr[j]
Efficient sorting is important for optimizing the efficiency of other algorithms (such as search and merge algorithms) that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output. Sorting have direct applications in database algorithms, divide and conquer methods, data structure algorithms, and many more.
Source: en.wikipedia.org
The idea of an in-place algorithm isn't unique to sorting, but sorting is probably the most important case, or at least the most well-known. The idea is about space efficiency - using the minimum amount of RAM, hard disk or other storage that you can get away with.
The idea is to produce an output in the same memory space that contains the input by successively transforming that data until the output is produced. This avoids the need to use twice the storage - one area for the input and an equal-sized area for the output.
Quicksort is one example of In-Place Sorting.
Source: stackoverflow.com
The Ideal Sorting Algorithm would have the following properties:
- Stable: Equal keys aren't reordered.
- Operates in place: requires O(1) extra space.
- Worst-case O(n log n) key comparisons.
- Worst-case O(n) swaps.
- Adaptive: Speeds up to O(n) when data is nearly sorted or when there are few unique keys.
There is no algorithm that has all of these properties, and so the choice of sorting algorithm depends on the application.
Source: www.toptal.com
Sorting algorithms can be categorised based on the following parameters:
- Based on Number of Swaps or Inversions. This is the number of times the algorithm swaps elements to sort the input. Selection Sort requires the minimum number of swaps.
- Based on Number of Comparisons. This is the number of times the algorithm compares elements to sort the input. Using Big-O notation, the sorting algorithm examples listed above require at least O(n log n) comparisons in the best case and O(n^2) comparisons in the worst case for most of the outputs.
- Based on Recursion or Non-Recursion. Some sorting algorithms, such as Quick Sort, use recursive techniques to sort the input. Other sorting algorithms, such as Selection Sort or Insertion Sort, use non-recursive techniques. Finally, some sorting algorithms, such as Merge Sort, make use of both recursive and non-recursive techniques to sort the input.
- Based on Stability. Sorting algorithms are said to be stable if the algorithm maintains the relative order of elements with equal keys; in other words, two equivalent elements remain in the same order in the sorted output as they were in the input. Insertion Sort, Merge Sort, and Bubble Sort are stable; Heap Sort and Quick Sort are not stable.
- Based on Extra Space Requirement. Sorting algorithms are said to be in place if they require a constant O(1) extra space for sorting. Insertion Sort and Quick Sort are in-place sorts, as we move the elements about the pivot and do not actually use a separate array, which is NOT the case in Merge Sort, where the size of the input must be allocated beforehand to store the output during the sort. Merge Sort is an example of an out-of-place sort, as it requires extra memory space for its operations.
Source: www.freecodecamp.org
Insertion Sort is an in-place, stable, comparison-based sorting algorithm. The idea is to maintain a sub-list which is always sorted. An element which is to be 'insert'ed into this sorted sub-list has to find its appropriate place and then be inserted there. Hence the name, insertion sort.
Steps on how it works:
- If it is the first element, it is already sorted.
- Pick the next element.
- Compare it with all the elements in the sorted sub-list.
- Shift all the elements in the sorted sub-list that are greater than the value to be sorted.
- Insert the value.
- Repeat until the list is sorted.
Source: medium.com
Time Complexity: O(n^2) Space Complexity: O(1)
- Insertion sort runs in O(n^2) in its worst and average cases. It runs in O(n) time in its best case.
- Insertion sort performs two operations: it scans through the list, comparing each pair of elements, and it swaps elements if they are out of order. Each operation contributes to the running time of the algorithm. If the input array is already in sorted order, insertion sort compares O(n) elements and performs no swaps. Therefore, in the best case, insertion sort runs in O(n) time.
- Space complexity is O(1) because an extra variable key is used (as a temp variable for insertion).
var insertionSort = function(a) {
// Iterate through our array
for (var i = 1, value; i < a.length; i++) {
// Our array is split into two parts: values preceding i are sorted, while others are unsorted
// Store the unsorted value at i
value = a[i];
// Iterate backwards through the unsorted values until we find the correct location for our `next` value
for (var j = i; a[j - 1] > value; j--) {
// Shift the value to the right
a[j] = a[j - 1];
}
// Once we've created an open "slot" in the correct location for our value, insert it
a[j] = value;
}
// Return the sorted array
return a;
};
import java.util.Arrays;
class InsertionSort {
void insertionSort(int array[]) {
int size = array.length;
for (int step = 1; step < size; step++) {
int key = array[step];
int j = step - 1;
// Compare key with each element on the left of it until an element smaller than
// it is found.
// For descending order, change key<array[j] to key>array[j].
while (j >= 0 && key < array[j]) {
array[j + 1] = array[j];
--j;
}
// Place key after the element just smaller than it.
array[j + 1] = key;
}
}
// Driver code
public static void main(String args[]) {
int[] data = { 9, 5, 1, 4, 3 };
InsertionSort is = new InsertionSort();
is.insertionSort(data);
System.out.println("Sorted Array in Ascending Order: ");
System.out.println(Arrays.toString(data));
}
}
def insertionSort(array):
for step in range(1, len(array)):
key = array[step]
j = step - 1
# Compare key with each element on the left of it until an element smaller than it is found
# For descending order, change key<array[j] to key>array[j].
while j >= 0 and key < array[j]:
array[j + 1] = array[j]
j = j - 1
# Place key after the element just smaller than it.
array[j + 1] = key
data = [9, 5, 1, 4, 3]
insertionSort(data)
print('Sorted Array in Ascending Order:')
print(data)
Advantages:
- Simple to understand.
- The ability to efficiently detect that the list is already sorted is built into the algorithm. When the list is already sorted (best-case), the complexity of bubble sort is only O(n).
Disadvantages:
- It is very slow and runs in O(n^2) time in the worst as well as the average case. Because of that, Bubble sort does not deal well with a large set of data. For example, Bubble sort is three times slower than Quicksort even for n = 100.
Source: en.wikipedia.org
In Bubble sort, you know that after k passes, the largest k elements are sorted at the k last entries of the array, so the conventional Bubble sort uses:
public static void bubblesort(int[] a) {
for (int i = 1; i < a.length; i++) {
boolean is_sorted = true;
for (int j = 0; j < a.length - i; j++) { // skip the already sorted largest elements, compare to a.length - 1
if (a[j] > a[j+1]) {
int temp = a[j];
a[j] = a[j+1];
a[j+1] = temp;
is_sorted = false;
}
}
if(is_sorted) return;
}
}
Now, that would still do a lot of unnecessary iterations when the array has a long sorted tail of largest elements. If you remember where you made your last swap, you know that after that index, there are the largest elements in order, so:
public static void bubblesort(int[] a) {
int lastSwap = a.length - 1;
for (int i = 1; i< a.length; i++) {
boolean is_sorted = true;
int currentSwap = -1;
for (int j = 0; j < lastSwap; j++) { // compare to a.length - i
if (a[j] > a[j+1]) {
int temp = a[j];
a[j] = a[j+1];
a[j+1] = temp;
is_sorted = false;
currentSwap = j;
}
}
if (is_sorted) return;
lastSwap = currentSwap;
}
}
This allows skipping over many elements, resulting in about a 50% worst-case improvement in comparison count (though no improvement in swap counts), and adds very little complexity.
Source: stackoverflow.com
The add() method below walks down the list until it finds the appropriate position. Then, it splices in the new node and updates the start, prev, and curr pointers where applicable.
Note that the reverse operation, namely removing elements, doesn't need to change, because you are simply throwing things away which would not change any order in the list.
Source: stackoverflow.com
public void add(T x) {
Node newNode = new Node();
newNode.info = x;
// case: start is null; just assign start to the new node and return
if (start == null) {
start = newNode;
curr = start;
// prev is null, hence not formally assigned here
return;
}
// case: new node to be inserted comes before the current start;
// in this case, point the new node to start, update pointers, and return
if (x.compareTo(start.info) < 0) {
newNode.link = start;
start = newNode;
curr = start;
// again we leave prev undefined, as it is null
return;
}
// otherwise walk down the list until reaching either the end of the list
// or the first position whose element is greater than the node to be
// inserted; then insert the node and update the pointers
prev = start;
curr = start;
while (curr != null && x.compareTo(curr.info) >= 0) {
prev = curr;
curr = curr.link;
}
// splice in the new node and update the curr pointer (prev already correct)
newNode.link = prev.link;
prev.link = newNode;
curr = newNode;
}
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Statistics Interview Questions
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.
Normal distributions have the following features:
- symmetric bell shape
- mean, median and mode are equal, all located at the center of the distribution
- approximately 68% of the data falls within 1 standard deviation of the mean
- approximately 95% of the data falls within 2 standard deviations of the mean
- approximately 99.7% of the data falls within 3 standard deviations of the mean
Source: statisticsbyjim.com
Normalization rescales the values into a range of [0,1]. This might be useful in some cases where all parameters need to have the same positive scale. However, the outliers from the data set are lost.
Standardization rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance).
For most applications standardization is recommended.
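A small sketch contrasting the two rescalings with scikit-learn (the toy array is illustrative):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])    # 100 acts as an outlier

X_norm = MinMaxScaler().fit_transform(X)        # squeezed into the range [0, 1]
X_std  = StandardScaler().fit_transform(X)      # mean 0, unit variance

print(X_norm.ravel())
print(round(X_std.mean(), 6), round(X_std.std(), 6))   # ~0.0 and ~1.0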
Source: stats.stackexchange.com
- We can talk about the error of a single measurement, but bias is the average of errors of many repeated measurements,
- Bias is a statistical property of the error of a measuring technique,
- Sometimes the term "bias error" is used as opposed to "root-mean-square error".
Source: stats.stackexchange.com
- The standard deviation (SD) measures the amount of variability, or dispersion, from the individual data values to the mean. It's defined as:
$$\sigma = \sqrt{ \frac{\sum_{i=1}^n (x_i - \bar{x})^2 }{n-1} }$$ - The standard error of the mean (SEM) measures how far the sample mean (average) of the data is likely to be from the true population mean. It's defined as:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Therefore, the relationship between the standard error of the mean and the standard deviation is such that, for given sample size, the standard error of the mean equals the standard deviation divided by the square root of the sample size.
Source: www.investopedia.com
- The confidence level is the percentage of times we expect to get close to the same estimate if we run our experiment again or resample the population in the same way.
- The confidence interval is the actual upper and lower bounds of the estimate we expect to find at a given level of confidence.
For example, if we are estimating a 95% confidence interval around the mean proportion of female babies born every year based on a random sample of babies, we might find an upper bound of 0.56 and a lower bound of 0.48. These are the upper and lower bounds of the confidence interval for a confidence level of 95%.
This means that 95% of the time, we can expect our estimate to fall between 0.56 and 0.48.
Source: www.scribbr.com
The Median is the most suitable measure of central tendency for skewed distributions or distributions with outliers. For example, the median is often used as a measure of central tendency for income distributions, which are generally highly skewed.
Because the median only uses one or two values, itβs unaffected by extreme outliers or non-symmetric distributions of scores. In contrast, the mean and mode can vary in skewed distributions.
Source: en.wikipedia.org
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. In the simplest of terms, it attempts to find a single value that best represents an entire distribution of scores.
Mean, Median and Mode are average values or central tendency of a numerical data set.
Source: statistics.laerd.com
Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test. Significance is usually denoted by a p-value, or probability value.
Statistical significance is arbitrary: it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that data as extreme as observed would occur less than 5% of the time under the null hypothesis.
When the p-value falls below the chosen alpha value, then we say the result of the test is statistically significant.
Source: www.scribbr.com
- Arithmetic mean: often simply called the mean or the average; it is the sum of all values divided by the total number of values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$
- Geometric mean: often used for a set of numbers whose values are meant to be multiplied together or are exponential, such as values of the human population or interest rates of a financial investment over time:
$$\bar{x} = \left( \prod_{i=1}^n x_i \right)^{1/n}$$
- Harmonic mean: an average that is often used for averaging rates, as in the case of speed (i.e., distance per unit of time):
$$\bar{x} = n \left( \sum_{i=1}^n \frac{1}{x_i} \right)^{-1}$$
Source: en.wikipedia.org
The mode is the value that appears most frequently in a data set. A data set can have no mode, one mode, or more than one mode: it all depends on how many different values repeat most frequently.
For example, in the following list of numbers, 16 is the mode since it appears more times in the set than any other number:
- 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48
Your data can be:
- without any mode
- unimodal, with one mode,
- bimodal, with two modes,
- trimodal, with three modes, or
- multimodal, with four or more modes.
Source: www.scribbr.com
The empirical rule, or the 68-95-99.7 rule, tells you where most of the values lie in a normal distribution:
- Around 68% of values are within 1 standard deviation of the mean.
- Around 95% of values are within 2 standard deviations of the mean.
- Around 99.7% of values are within 3 standard deviations of the mean.
The empirical rule is a quick way to get an overview of your data and check for any outliers or extreme values that don't follow this pattern.
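The three figures can be checked against the standard normal CDF, for example with SciPy (assumed to be available):
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 4))    # 0.6827, 0.9545, 0.9973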
Source: www.scribbr.com
- Descriptive statistics, as its name suggests, focuses on describing the characteristics or features of a dataset. Here we look for measures of distribution, central tendency and variability in order to draw conclusions based on known data.
- Inferential statistics focuses on making generalizations about a larger population based on a representative sample of that population. It also allows us to make predictions, so its results are usually in the form of a probability. Here we perform hypothesis testing, compute confidence intervals, and carry out regression and correlation analyses, in order to draw conclusions that go beyond the available data.
Source: careerfoundry.com
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q37: What is the difference between Central Limit Theorem and the Law of Large Numbers? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] Supervised Learning Interview Questions
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
Support vector machines focus only on the points that are the most difficult to tell apart, whereas other classifiers pay attention to all of the points.
The intuition behind the support vector machine approach is that if a classifier is good at the most challenging comparisons (the points in B and A that are closest to each other), then the classifier will be even better at the easy comparisons (comparing points in B and A that are far away from each other).
Source: towardsdatascience.com
Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. Itβs used to predict values within a continuous range, (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog).
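A minimal sketch with scikit-learn (X is any numeric feature matrix and y a continuous target, e.g. floor area vs. price; both are assumed to be defined):
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)    # X, y: your training data
print(model.coef_, model.intercept_)    # fitted slope(s) and intercept
predictions = model.predict(X)          # continuous predicted values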
Source: ml-cheatsheet.readthedocs.io
- A decision tree is a tool that uses a tree-like model of decisions and their possible consequences. If an algorithm only contains conditional control statements, a decision tree can model that algorithm really well.
- Decision trees are a non-parametric, supervised learning method.
- Decision trees are used for classification and regression tasks.
- A classic example is a decision tree built on the Titanic dataset to predict whether a passenger survived or not.
Source: towardsdatascience.com
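A hedged sketch of training a small decision tree classifier with scikit-learn; the Iris dataset stands in for the Titanic example above purely for brevity:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits the tree so the "flowchart" stays readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(export_text(tree, feature_names=load_iris().feature_names))
print("Test accuracy:", tree.score(X_test, y_test))
```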
- Supervised learning is a subcategory of machine learning and artificial intelligence.
- It uses a *labeled dataset*: each input has a corresponding output, and algorithms are trained to predict the output based on the input.
- As input data is fed into the model, it adjusts the weights until the model is fitted properly.
Source: www.ibm.com
Supervised learning can be separated into two types of problems when data mining:
- Classification: It uses algorithms to assign the test data into specific categories. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest.
- Regression: It is used to understand the relationship between dependent and independent variables. Linear regression, logistic regression, and polynomial regression are popular regression algorithms.
Source: www.ibm.com
- Supervised learning is when the data you feed your algorithm is tagged or labelled, to help your logic make decisions.
Example: a hypothetical non-machine-learning algorithm for face detection in images would try to define what a face is (a round, skin-colored disk with a dark area where you expect the eyes, etc.). A machine learning algorithm would not have such a coded definition, but would "learn by examples": you show it several images of faces and not-faces, and a good algorithm will eventually learn and be able to predict whether or not an unseen image is a face.
- Unsupervised learning refers to algorithms that try to find correlations without any external inputs other than the raw data (your examples are not labeled, i.e. you don't say anything). In such a case the algorithm itself cannot "invent" what a face is, but it can try to cluster the data into different groups, e.g. it can distinguish that faces are very different from landscapes, which are very different from horses.
Source: stackoverflow.com
- Supervised learning examples:
  - You get a bunch of photos with information about what is on them, and then you train a model to recognize new photos.
  - You have a bunch of molecules and information about which are drugs, and you train a model to answer whether a new molecule is also a drug.
  - Based on past information about spam, filtering a new incoming email into the Inbox (normal) or Junk folder (spam).
  - Cortana, or any automated speech system on your mobile phone, is trained on your voice and then starts working based on that training.
  - You train an OCR system on your handwriting, and once trained, it will be able to convert your handwriting images into text (up to some accuracy, obviously).
- Unsupervised learning examples:
  - You have a bunch of photos of 6 people but without information about who is in which one, and you want to divide the dataset into 6 piles, each with the photos of one individual.
  - You have molecules, part of them are drugs and part are not, but you do not know which are which, and you want the algorithm to discover the drugs.
  - A friend invites you to a party where you meet total strangers. You will classify them using unsupervised learning (no prior knowledge), and this classification can be on the basis of gender, age group, dressing, educational qualification, or whatever way you like. Why is this learning different from supervised learning? Because you didn't use any past/prior knowledge about the people and classified them "on the go".
  - NASA discovers new heavenly bodies, finds them different from previously known astronomical objects (stars, planets, asteroids, black holes, etc., i.e. it has no knowledge about these new bodies), and classifies them the way it likes (distance from the Milky Way, intensity, gravitational force, red/blue shift, or whatever).
  - Suppose you have never seen a cricket match before and by chance watch a video on the internet; you can then classify players on the basis of different criteria: players wearing the same sort of kit are in one class, players of one style are in one class (batsmen, bowlers, fielders), or players grouped by their playing hand (RH vs LH), or whatever way you would observe and classify them.
Source: stackoverflow.com
In supervised machine learning an algorithm learns a model from training data.
The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X). The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn.
Generally, linear algorithms have a high bias making them fast to learn and easier to understand but generally less flexible.
- Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
- Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Source: machinelearningmastery.com
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications:
- we consider the predictors to be independent of one another,
- we consider all the predictors to have an equal effect on the outcome (e.g., the day being windy does not carry more weight than other features in deciding whether to play golf or not).
Source: towardsdatascience.com
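The "naive" independence assumption can be written out explicitly. Under it, the class-conditional likelihood factorizes and prediction reduces to the following (a standard formulation, added here for reference):

$$ P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \quad\Rightarrow\quad \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) $$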
- A Perceptron is a fundamental unit of a Neural Network; on its own, a single Perceptron is also a single-layer Neural Network.
- A Perceptron is a linear classifier. Since it uses already-labeled data points, it is a supervised learning algorithm.
- The activation function applies a step rule (converting the numerical output into +1 or -1) to check whether the output of the weighting function is greater than zero.
A Perceptron is shown in the figure below:
Source: towardsdatascience.com
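A minimal NumPy sketch of the perceptron learning rule described above (the toy data, learning rate and number of epochs are illustrative assumptions):

```python
import numpy as np

# Toy linearly separable data with labels in {-1, +1} (illustrative)
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])  # weights
b = 0.0                   # bias
lr = 0.1                  # learning rate

for _ in range(10):  # a few passes over the data
    for xi, yi in zip(X, y):
        # Step activation: +1 if the weighted sum is positive, else -1
        pred = 1 if (xi @ w + b) > 0 else -1
        if pred != yi:  # update only on mistakes
            w += lr * yi * xi
            b += lr * yi

print("Weights:", w, "Bias:", b)
print("Predictions:", [1 if (xi @ w + b) > 0 else -1 for xi in X])
```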
- K-nearest neighbors (KNN) is a supervised classification algorithm. This means that we need labeled data to classify an unlabeled data point. It attempts to classify a data point based on its proximity to the K nearest data points in the feature space.
- K-means clustering is an unsupervised classification algorithm. It requires only a set of unlabeled points and a number K, and it gathers and groups the data into K clusters (see the sketch below).
Source: www.quora.com
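To make the contrast concrete, a hedged scikit-learn sketch: `KNeighborsClassifier` needs the labels, while `KMeans` only needs the points and a chosen K (the toy data is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: KNN uses the labels y to classify a new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN prediction:", knn.predict([[0.0, 0.0]]))

# Unsupervised: K-means ignores y and just groups points into K clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means cluster assignment:", kmeans.predict([[0.0, 0.0]]))
```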
A decision tree is a flowchart-like structure in which:
- Each internal node represents a test on an attribute (e.g. the outcome of a coin flip).
- Each branch represents the outcome of the test.
- Each leaf node represents a class label.
- The paths from the root to leaf represent the classification rules.
Source: en.wikipedia.org
A decision tree consists of three types of nodes:
- Decision nodes: represented by squares. A decision node is where the flow branches into several optional branches.
- Chance nodes: represented by circles. A chance node represents the probability of certain results.
- End nodes: represented by triangles. An end node shows the final outcome of the decision path.
Source: en.wikipedia.org
- We can talk about the error of a single measurement, but bias is the average of the errors of many repeated measurements.
- Bias is a statistical property of the error of a measuring technique.
- Sometimes the term "bias error" is used as opposed to "root-mean-square error".
Source: stats.stackexchange.com
- High Bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- High Variance may result from an algorithm modeling random noise in the training data (overfitting).
- The Bias-Variance tradeoff is a central problem in supervised learning. Ideally, a model should be able to accurately capture the regularities in its training data, but also generalize well to unseen data.
- It is called a tradeoff because it is typically impossible to do both simultaneously:
  - Algorithms with high variance will be prone to overfitting the dataset, but
  - Algorithms with high bias will underfit the dataset.
Source: en.wikipedia.org
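For squared-error loss, this tradeoff has a standard decomposition of the expected test error, added here as a reference formula:

$$ \mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathrm{Bias}[\hat{f}(x)]\right)^2}_{\text{bias}} + \underbrace{\mathrm{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}} $$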
- Classification is the problem of identifying which of a set of categories an observation belongs to.
- Regression is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
- Classification is used to predict the values of a categorical variable, so the output is generally in the form of integers, or binary (0 or 1).
- Regression is used to predict a continuous variable, so the output is a floating-point number (0.1, 0.74, 0.69, etc.).
Source: en.wikipedia.org
- Linear regression is a linear approach for modeling the relationship between a scalar response and one or more explanatory variables.
- In a supervised linear regression, the model tries to find a linear relationship between the input and output data points. This linear relationship is a straight line if graphed.
- If there is only one explanatory variable it is called simple linear regression, and if there is more than one explanatory variable it is called multiple linear regression.
- A linear regression model is given by the following equation: $$ y = X\beta + \epsilon $$ where $y$ is the vector of responses, $X$ is the matrix of explanatory variables, $\beta$ is the vector of coefficients, and $\epsilon$ is the error term.
Source: en.wikipedia.org
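Under the usual least-squares assumptions, the coefficients of the model above can be estimated in closed form as $\hat{\beta} = (X^T X)^{-1} X^T y$. A minimal NumPy sketch (the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # intercept column + one feature
true_beta = np.array([5.0, 3.0])
y = X @ true_beta + rng.normal(scale=1.0, size=50)

# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T y
# (lstsq is preferred over an explicit inverse for numerical stability)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated coefficients:", beta_hat)  # close to [5, 3]
```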
- k-Nearest Neighbors is a supervised machine learning algorithm that can be used to solve both classification and regression problems.
- It assumes that similar things are closer to each other in the feature space; in other words, similar things are in close proximity.
- The image above shows how similar points are closer to each other. KNN hinges on this assumption being true enough for the algorithm to be useful.
- There are many different ways of calculating the distance between the points, however, the straight line distance (Euclidean distance) is a popular and familiar choice.
Source: towardsdatascience.com
- Cross-validation is a method of assessing how the results of a statistical analysis will generalize to an independent dataset.
- It can be used in machine learning tasks to evaluate the predictive capability of a model.
- It also helps us to avoid overfitting and underfitting.
- A common way to cross-validate is to divide the dataset into training, validation, and testing sets, where:
  - The training dataset is the known data on which the training is run.
  - The validation dataset is held-out data against which the model is tested. The validation dataset is used after each epoch of learning to gauge the improvement of the model.
  - The testing dataset is another held-out dataset used to test the model. The testing dataset is used to measure the performance of the model after it has finished learning.
A k-fold example is sketched below.
Source: en.wikipedia.org
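A hedged sketch of k-fold cross-validation with scikit-learn, which automates the split/train/evaluate loop described above (the model and dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold takes a turn as the held-out set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```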
Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
Source: stats.stackexchange.com
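The difference also shows up in the shape of the target: one label per sample versus one 0/1 indicator per label per sample (a small illustrative sketch, not tied to any particular library API):

```python
import numpy as np

# Multiclass: each sample gets exactly one label out of {orange, apple, pear}
y_multiclass = np.array(["apple", "pear", "orange", "apple"])

# Multilabel: each sample gets a 0/1 indicator for every possible topic
topics = ["religion", "politics", "finance", "education"]
y_multilabel = np.array([
    [1, 0, 0, 0],   # document about religion only
    [0, 1, 1, 0],   # document about politics and finance
    [0, 0, 0, 0],   # document about none of the topics
])

print(y_multiclass.shape)  # (4,)   -- one label per sample
print(y_multilabel.shape)  # (3, 4) -- one indicator per label per sample
```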
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
[β¬] TensorFlow Interview Questions
[β¬] Unsupervised Learning Interview Questions
Some common real-world applications of unsupervised learning are:
- News selections: Google News uses unsupervised learning to categorize articles on the same story from various online news outlets.
- Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such as object recognition.
- Medical imaging: Unsupervised machine learning provides essential features to medical imaging devices, such as image detection, classification, and segmentation, used in radiology and pathology to diagnose patients quickly and accurately.
- Anomaly detection: Unsupervised learning models can comb through large amounts of data and discover atypical data points within a dataset. These anomalies can raise awareness around faulty equipment, human error, or breaches in security.
Source: www.ibm.com
Some common challenges that unsupervised learning can help with are:
- Insufficient labeled data: For supervised learning, there is a requirement for a lot of labeled data for the model to perform well. Unsupervised learning can automatically label unlabeled examples. This would work by clustering all the data points and then applying the labels from the labeled ones to the unlabeled ones.
- Overfitting: Machine learning algorithms can sometimes overfit the training data by extracting too much from the noise in the data. When this happens, the algorithm is memorizing the training data rather than learning how to generalize the knowledge of the training data. Unsupervised learning can be introduced as a regularizer. Regularization is a process that helps to reduce the complexity of a machine learning algorithm, helping it capture the signal in the data without adjusting too much to the noise.
- Outliers: The quality of data is very important. If machine learning algorithms train on outliers (rare cases), their generalization error will be higher than if the outliers were ignored or handled separately. Unsupervised learning can perform outlier detection using dimensionality reduction and create solutions specifically for the outliers and, separately, a solution for the normal data.
- Feature engineering: Feature engineering is a vital task for data scientists to perform, but feature engineering is very labor-intensive, and it requires a human to creatively engineer the features. Representation learning from unsupervised learning can be used to automatically learn the right type of features to help the task at hand.
Source: www.amazon.com
- Supervised learning is when the data you feed your algorithm is tagged or labelled, to help your logic make decisions.
Example: a hypothetical non-machine-learning algorithm for face detection in images would try to define what a face is (a round, skin-colored disk with a dark area where you expect the eyes, etc.). A machine learning algorithm would not have such a coded definition, but would "learn by examples": you show it several images of faces and not-faces, and a good algorithm will eventually learn and be able to predict whether or not an unseen image is a face.
- Unsupervised learning refers to algorithms that try to find correlations without any external inputs other than the raw data (your examples are not labeled, i.e. you don't say anything). In such a case the algorithm itself cannot "invent" what a face is, but it can try to cluster the data into different groups, e.g. it can distinguish that faces are very different from landscapes, which are very different from horses.
Source: stackoverflow.com
- Supervised learning examples:
  - You get a bunch of photos with information about what is on them, and then you train a model to recognize new photos.
  - You have a bunch of molecules and information about which are drugs, and you train a model to answer whether a new molecule is also a drug.
  - Based on past information about spam, filtering a new incoming email into the Inbox (normal) or Junk folder (spam).
  - Cortana, or any automated speech system on your mobile phone, is trained on your voice and then starts working based on that training.
  - You train an OCR system on your handwriting, and once trained, it will be able to convert your handwriting images into text (up to some accuracy, obviously).
- Unsupervised learning examples:
  - You have a bunch of photos of 6 people but without information about who is in which one, and you want to divide the dataset into 6 piles, each with the photos of one individual.
  - You have molecules, part of them are drugs and part are not, but you do not know which are which, and you want the algorithm to discover the drugs.
  - A friend invites you to a party where you meet total strangers. You will classify them using unsupervised learning (no prior knowledge), and this classification can be on the basis of gender, age group, dressing, educational qualification, or whatever way you like. Why is this learning different from supervised learning? Because you didn't use any past/prior knowledge about the people and classified them "on the go".
  - NASA discovers new heavenly bodies, finds them different from previously known astronomical objects (stars, planets, asteroids, black holes, etc., i.e. it has no knowledge about these new bodies), and classifies them the way it likes (distance from the Milky Way, intensity, gravitational force, red/blue shift, or whatever).
  - Suppose you have never seen a cricket match before and by chance watch a video on the internet; you can then classify players on the basis of different criteria: players wearing the same sort of kit are in one class, players of one style are in one class (batsmen, bowlers, fielders), or players grouped by their playing hand (RH vs LH), or whatever way you would observe and classify them.
Source: stackoverflow.com
Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.
Principal component analysis is a useful technique when dealing with large datasets. In some fields (bioinformatics, internet marketing, etc.) we end up collecting data which has many thousands or tens of thousands of dimensions. Manipulating the data in this form is not desirable, because of practical considerations like memory and CPU time. However, we can't just arbitrarily ignore dimensions either. We might lose some of the information we are trying to capture!
Principal component analysis is a common method used to manage this tradeoff. The idea is that we can somehow select the 'most important' directions, and keep those, while throwing away the ones that contribute mostly noise.
For example, this picture shows a 2D dataset being mapped to one dimension:
Note that the dimension chosen was not one of the original two: in general, it won't be, because that would mean your variables were uncorrelated to begin with.
We can also see that the direction of the principal component is the one that maximizes the variance of the projected data. This is what we mean by 'keeping as much information as possible.'
Source: math.stackexchange.com
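A minimal scikit-learn sketch of the 2D-to-1D projection described above (the correlated toy data is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated 2D data (illustrative): the second feature roughly tracks the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=500)
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)  # project onto the direction of maximum variance

print("Projected shape:", X_1d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Principal direction:", pca.components_[0])
```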
- K-nearest neighbors (KNN) is a supervised classification algorithm. This means that we need labeled data to classify an unlabeled data point. It attempts to classify a data point based on its proximity to the K nearest data points in the feature space.
- K-means clustering is an unsupervised classification algorithm. It requires only a set of unlabeled points and a number K, and it gathers and groups the data into K clusters.
Source: www.quora.com
- As the amount of data required to train a model increases, it becomes harder and harder for machine learning algorithms to handle. As more features are added to the machine learning process, training becomes more difficult.
- In very high-dimensional space, supervised algorithms learn to separate points and build function approximations to make good predictions.
When the number of features increases, this search becomes expensive, both from a time and compute perspective. It might become impossible to find a good solution fast enough. This is the curse of dimensionality.
- Using dimensionality reduction from unsupervised learning, the most salient features can be discovered in the original feature set. The dimension of this feature set can then be reduced to a more manageable number while losing very little information in the process. This helps supervised learning find the optimum function to approximate the dataset.
Source: www.amazon.com
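One common pattern that follows from this is to chain unsupervised dimensionality reduction with a supervised model. A hedged scikit-learn sketch (dataset, component count, and classifier are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per image

# Reduce 64 dimensions to 16 before the supervised classifier
pipeline = make_pipeline(PCA(n_components=16), LogisticRegression(max_iter=2000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean accuracy with 16 PCA components:", scores.mean())
```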
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q18: What are the main differences between Sparse Autoencoders and Convolution Autoencoders? ββββ
Read answer on π MLStack.Cafe
Q19: What are some differences between the Undercomplete Autoencoder and the Sparse Autoencoder? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Q23: How to tell if data is clustered enough for clustering algorithms to produce meaningful results? ββββ
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe
Read answer on π MLStack.Cafe