DS-Interview-Questions-Flashcards

A flashcards-like collection of interview questions for Data Science

Machine Learning-related

General Questions

Give me an example where MLE is equivalent to MAP
ANS
When using a uniform prior, we assign equal weight to all possible values of $\theta$. The posterior is then just the likelihood multiplied by a constant, and a constant factor can be dropped from the MAP objective because it does not affect the maximization, so MAP reduces to MLE.
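Concretely, with a uniform prior $p(\theta) \propto c$ the constant factors out of the argmax:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \, p(x \mid \theta)\, p(\theta) = \arg\max_{\theta} \, c \cdot p(x \mid \theta) = \arg\max_{\theta} \, p(x \mid \theta) = \hat{\theta}_{MLE}$$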

See details here

How to decide between L1 and L2 Loss Function?
ANS
Generally, the L2 loss function is preferred in most cases. But when outliers are present in the dataset, the L2 loss does not perform well: because the differences are squared, outliers contribute disproportionately large errors and dominate the fit. In that case, prefer the L1 loss function, which is not as affected by outliers, or remove the outliers and then use the L2 loss function.
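A minimal NumPy sketch (hypothetical data) showing how a single outlier inflates the L2 loss much more than the L1 loss:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 2.1, 2.9, 4.2, 5.0])

l1 = np.mean(np.abs(y_true - y_pred))    # mean absolute error
l2 = np.mean((y_true - y_pred) ** 2)     # mean squared error

print(l1)   # ~19.1  -> the outlier adds its error linearly
print(l2)   # ~1805  -> the outlier dominates because its error is squared
```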

See details here

How to deal with Skewed Data?
ANS
  • Log-transformation
  • Box-Cox transformation
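A minimal sketch of both transformations with NumPy/SciPy (hypothetical right-skewed data):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data (log-normal)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

x_log = np.log1p(x)                   # log-transformation (log(1 + x) also handles zeros)
x_boxcox, lam = stats.boxcox(x)       # Box-Cox transformation (requires strictly positive data)

print(stats.skew(x), stats.skew(x_log), stats.skew(x_boxcox))   # skewness shrinks toward 0
```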

See details here

Linear Regression

What is Bayesian Linear Regression?
ANS
The response variable is assumed to be generated from a normal (Gaussian) distribution characterized by a mean and variance. The mean for linear regression is the transpose of the weight matrix multiplied by the predictor matrix. The variance is the square of the standard deviation σ (multiplied by the identity matrix because this is a multi-dimensional formulation of the model).

The aim of Bayesian Linear Regression is not to find the single “best” value of the model parameters, but rather to determine the posterior distribution for the model parameters.
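In symbols (the standard formulation, with weights $\beta$, predictors $X$, and noise scale $\sigma$), the likelihood and the posterior over the weights are:

$$y \sim \mathcal{N}(\beta^{T}X,\ \sigma^{2}I), \qquad p(\beta \mid y, X) \propto p(y \mid \beta, X)\, p(\beta)$$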

See details here

How to interpret the Regression Coefficients for Curvilinear Relationships and Interactive Terms?
ANS
For curvilinear relationships (e.g., a squared term), the coefficient on the higher-order term describes how the slope itself changes as the predictor increases, so the effect of the predictor is no longer a single constant per-unit change. Interaction terms indicate that the effect of one predictor depends on the value of another predictor.
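For example, with two predictors and an interaction term,

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon, \qquad \frac{\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2,$$

so the marginal effect of $x_1$ is not a single number; it depends on the current value of $x_2$.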

See details here

What are the assumptions required for linear regression?
ANS
  1. There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating actually fits the data

  2. The errors or residuals of the data are normally distributed and independent from each other

  3. There is minimal multicollinearity between explanatory variables

  4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

See details here

What are limitations in Linear Regression models?
ANS
  1. Linear regression models are sensitive to outliers

  2. Overfitting - It is easy to overfit your model such that your regression begins to model the random error (noise) in the data, rather than just the relationship between the variables. This most commonly arises when you have too many parameters compared to the number of samples

  3. Linear regressions are meant to describe linear relationships between variables. So, if there is a nonlinear relationship, then you will have a bad model. However, you can sometimes compensate for this by transforming some of the parameters with a log, square root, etc. transformation.

  4. The data may not fit the model if the underlying assumptions (see the previous question) are violated.

See details here

Logistic Regression

Is Logistic Regression a linear model? Why?
ANS
Logistic regression is considered a generalized linear model because the outcome depends on the inputs and parameters only through a weighted sum (a linear combination). In other words, the output cannot depend on the product (or quotient, etc.) of its parameters.
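Concretely, the model is linear in the log-odds:

$$\log\frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$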

See details here

How to interpret the weights in Logistic Regression?
ANS
  • For intercept $\beta_0$, it just denotes that when all numerical features and categorical features are zero, the estimated odds (probability of event divided by probability of no event) are $\exp(\beta_0)$

  • For numerical features: if you increase the value of feature $x_j$ by one unit, the estimated odds change by a factor of $\exp(\beta_j)$

  • For binary categorical features: one of the two values of the feature is the reference category. Changing the feature $x_j$ from the reference category to the other category changes the estimated odds by $\exp(\beta_j)$

  • Categorical feature with more than two categories: One solution to deal with multiple categories is one-hot-encoding, meaning that each category has its own column. You only need L-1 columns for a categorical feature with L categories, otherwise it is over-parameterized. The L-th category is then the reference category. You can use any other encoding that can be used in linear regression. The interpretation for each category then is equivalent to the interpretation of binary features.
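As a small worked example (hypothetical coefficient value): if $\beta_j = 0.7$ for a numerical feature, increasing $x_j$ by one unit multiplies the estimated odds by $\exp(0.7) \approx 2.01$, i.e., it roughly doubles them.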

See details here

Naive Bayes

What is naive about Naive Bayes?
ANS
Its core assumption of conditional independence (i.e., all input features are independent of one another given the class) rarely holds true in the real world.
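Formally, the assumption is that the features factorize given the class $y$:

$$p(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} p(x_i \mid y)$$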

See details here

Random Forest

How does random forest calculate Feature Importance?
ANS
By the decrease in node impurity weighted by the probability of reaching that node, averaged over all trees. The node probability can be calculated as the number of samples that reach that node divided by the total number of samples.
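A minimal sketch with scikit-learn's impurity-based importances on synthetic data (`feature_importances_` implements this weighted impurity-decrease calculation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only a few of the 10 features are actually informative
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity, weighted by node probability, averaged over trees
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```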

See details here

Clustering

How is k-NN different from k-means clustering?
ANS
k-NN, or k-nearest neighbors, is a classification algorithm, where k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means is a clustering algorithm, where k is an integer describing the number of clusters to be created from the given data.

See details here

Statistics-related

What is Simpson's Paradox?
ANS
Simpson's paradox occurs when groups of data show one particular trend, but this trend is reversed when the groups are combined together. Understanding and identifying this paradox is important for correctly interpreting data.

See details here

What is the Central Limit Theorem and why is it important?
ANS
Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes, what can we say about the average height of the entire population given a single sample. The Central Limit Theorem addresses this question exactly.

Formally, it states that if we take sufficiently large random samples from a population, the distribution of the sample means will be approximately normal (assuming truly random sampling). What’s especially important is that this holds regardless of the distribution of the original population.
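A quick simulation sketch (NumPy, using an exponential population, which is clearly non-normal) illustrating that the sample means are approximately normally distributed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population distribution: exponential (skewed, non-normal); draw 10,000 samples of size 50
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]

# The means cluster around the population mean (2.0) in a roughly normal shape
print(np.mean(sample_means), np.std(sample_means))   # ~2.0 and ~2.0 / sqrt(50) ≈ 0.28
```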

See details here

What is sampling? How many sampling methods do you know?
ANS
Sampling is a statistical analysis technique used to select, manipulate and analyze a *representative* subset of points to identify trends and patterns in the larger data set being examined.
  • Sampling based on Probability:

    • Simple random sampling: Software is used to randomly select subjects from the whole population

    • Stratified sampling: Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup

    • Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed

    • Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed

    • Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze

  • Sampling based on Non-Probability:

    • Convenience sampling: Data is collected from an easily accessible and available group

    • Consecutive sampling: Data is collected from every subject that meets the criteria until the predetermined sample size is met

    • Purposive or judgmental sampling: The researcher selects the data to sample based on predefined criteria

    • Quota sampling: The researcher ensures equal representation within the sample for all subgroups in the data set or population
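A minimal pandas sketch (hypothetical DataFrame with a `group` column) of three of the probability-based methods:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with a grouping factor
df = pd.DataFrame({"group": np.repeat(["A", "B", "C"], 100),
                   "value": np.random.randn(300)})

simple_random = df.sample(n=30, random_state=0)              # simple random sampling

stratified = df.groupby("group", group_keys=False).apply(    # stratified sampling:
    lambda g: g.sample(frac=0.1, random_state=0))            # 10% from each stratum

systematic = df.iloc[::10]                                   # systematic: every 10th row
```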

See details here

What is the difference between type I vs type II error?
ANS
A type I error (false positive) occurs when the null hypothesis is actually true but is rejected by the test.

A type II error (false negative) occurs when the null hypothesis is actually false but the test erroneously fails to reject it.

See details here

How to interpret P-values in Linear Regression Analysis?
ANS
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.

See details here

What is a statistical interaction?
ANS
Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

See details here

What is selection bias?
ANS
Selection (or “sampling”) bias occurs, in an “active” sense, when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.

See details here

What is an example of a data set with a non-Gaussian distribution?
ANS
The Gaussian distribution is part of the exponential family of distributions, but there are many others in that family that are often just as easy to work with, and someone doing machine learning with a solid grounding in statistics can use them where appropriate.

In a Poisson or Bernoulli process, the waiting time to the next event is not normally distributed (it is exponential or geometric, respectively), but the data collected in such processes is the number of events per time unit, and for large $n$, that is approximately normal.

See details here

What is F-test?
ANS
The F-test can be used in regression analysis to determine whether a complex model is better than a simpler version of the same model in explaining the variance in the dependent variable.

The test statistic of the F-test is a random variable whose Probability Density Function is the F-distribution under the assumption that the null hypothesis is true.

The testing procedure for the F-test for regression is identical in its structure to that of other parametric tests of significance such as the t-test.

See details here

What is the Standard Deviation?
ANS
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance. If the data points are further from the mean, there is a higher deviation within the data set; thus, the more spread out the data, the higher the standard deviation.
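For a sample $x_1, \dots, x_n$ with mean $\bar{x}$ (using the unbiased $n-1$ denominator):

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$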

See details here

What is the Standard Error?
ANS
The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample population. The standard error is a statistical term that measures the accuracy with which a sample distribution represents a population by using standard deviation. In statistics, a sample mean deviates from the actual mean of a population—this deviation is the standard error of the mean.
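For the mean of a sample of size $n$ with sample standard deviation $s$:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$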

See details here

Describe the procedure of Hypothesis Testing.
ANS
  1. Determine the null hypothesis and alternative hypothesis
  2. Verify the data conditions
  3. Assume that the null hypothesis is true, calculate the p-value
  4. Decide whether or not the result is statistically significant
  5. Report the conclusion
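A minimal one-sample t-test sketch of this procedure with SciPy (hypothetical data, testing $H_0: \mu = 0$ at $\alpha = 0.05$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=40)     # hypothetical sample

# H0: population mean is 0; Ha: population mean differs from 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```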

See details here

How to compute Confidence Interval for population Mean?
ANS
With sample mean $\bar{x}$, sample standard deviation $s$, and sample size $n$, a $(1-\alpha)$ confidence interval for the population mean is $\bar{x} \pm t^{*}_{n-1} \cdot \frac{s}{\sqrt{n}}$ (use the $z^{*}$ critical value instead of $t^{*}_{n-1}$ when the population standard deviation is known or $n$ is large).
See details here

How to compute Confidence Interval for population Median?
ANS

For a general probability-based solution and an interesting bootstrap solution, see the links below.
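A minimal percentile-bootstrap sketch for a median confidence interval (NumPy, hypothetical data; this is one common approach, not necessarily the one in the linked references):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=3.0, size=200)        # hypothetical skewed sample

# Resample with replacement many times and record each resample's median
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])

lower, upper = np.percentile(boot_medians, [2.5, 97.5])   # 95% percentile bootstrap CI
print(lower, upper)
```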

See details here and here

Programming-related

Includes questions about Data Structures/Algorithms/Coding Concepts. For these questions, I assume the focus should be more on coding practice.

What are some pros and cons about your favorite statistical software?
ANS
For Python and R:
  • Pros:

    • A large number of repositories in GitHub

    • Packages for Data Science

  • Cons:

    • Python doesn't have good documentation

    • R is an erratic tool for machine learning projects

    • R is a slow programming language

See more details here and here

How would you sort a large list of numbers?
ANS
If the list fits in memory, any O(n log n) sort works (e.g., Python's built-in Timsort via `sorted()`). If it does not fit in memory, sort it in chunks and merge the sorted runs (external merge sort); a minimal sketch of the external-merge idea follows the link below.
See details here
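A minimal external merge sort sketch in Python (illustrative only; `external_sort` and `_dump_sorted` are hypothetical helpers, and the chunk size would be tuned to available memory):

```python
import heapq
import tempfile

def _dump_sorted(chunk):
    """Sort one in-memory chunk and write it to a temporary file as one number per line."""
    f = tempfile.TemporaryFile(mode="w+")
    for x in sorted(chunk):
        f.write(f"{x}\n")
    f.seek(0)
    return f

def external_sort(numbers, chunk_size=100_000):
    """Sort an iterable of numbers too large for memory: sort chunks to disk, then merge."""
    chunk_files, chunk = [], []
    for x in numbers:
        chunk.append(x)
        if len(chunk) >= chunk_size:
            chunk_files.append(_dump_sorted(chunk))
            chunk = []
    if chunk:
        chunk_files.append(_dump_sorted(chunk))
    # heapq.merge lazily merges the sorted runs, keeping only one value per file in memory
    readers = [(float(line) for line in f) for f in chunk_files]
    return heapq.merge(*readers)

print(list(external_sort(iter([3.5, 1.0, 2.25, 0.5]), chunk_size=2)))   # [0.5, 1.0, 2.25, 3.5]
```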

What Native Data Structures Can You Name in Python?
ANS
List, Dictionary, Set, String, Tuple

See details here

In Python, How is Memory Managed?
ANS
In Python, memory is managed in a private heap space. This means that all the objects and data structures will be located in a private heap. However, the programmer won’t be allowed to access this heap. Instead, the Python interpreter will handle it. At the same time, the core API will enable access to some Python tools for the programmer to start coding.

See details here

What is worst case time complexity of quick sort?
ANS
O(n^2)
What is average case time complexity of quick sort?
ANS
$O(n\log{n})$
What is worst case time complexity of looking up a value in a hashtable?
ANS
O(n)
What is average case time complexity of looking up a value in a hashtable?
ANS
O(1) if the number of entries is no more than the number of buckets
How do you deal with collisions in a hashtable?
ANS
  • Separate chaining: each bucket contains a linked list of the elements that hash to that bucket. This is why a bad hash function can make lookups in hash tables very slow.
  • Dynamic resizing: if the hash table gets too full, it can increase the number of buckets and redistribute all of its elements. The hash function returns an integer and the table mods that result against its size to pick a bucket, so after resizing the rehash/modulo step may send objects to different buckets.
  • Open addressing: on a collision, probe for another free slot within the table itself.
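A minimal separate-chaining sketch in Python (illustrative only; Python lists stand in for the per-bucket linked lists):

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by separate chaining."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # Mod the hash against the table size to pick a bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision or new key: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):    # worst case O(n) if everything chains into one bucket
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("a", 1)
table.put("b", 2)
print(table.get("a"), table.get("b"))
```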

See details here

What is the difference between a red-black tree and a binary search tree?
ANS
A red-black tree is a self-balancing tree, while a binary search tree is not. So a binary search tree is able to form long chains of nodes that can cause searches to take linear time, but a red-black tree guarantees (because it is self-balancing) a search operation takes logarithmic time.

See details here

What are the largest and smallest possible heights of a binary tree with n elements?
ANS
The largest possible height is $n - 1$ (a degenerate chain) and the smallest possible height is $\lceil \log_2(n+1) \rceil - 1$ (roughly $\log_2 n$).

See details [here](https://www.geeksforgeeks.org/relationship-number-nodes-height-binary-tree/)
What is the largest possible height of a balanced binary search tree with n elements?
ANS
$O(\log{n})$

SQL-related

Tell me the difference between an inner join, left join/right join, and union.
ANS
In Venn-diagram terms, an inner join returns only the rows where both tables match; a left join returns all rows from the left table plus the matching rows from the right table (with NULLs where there is no match); a right join is the mirror image of a left join; and a full join returns all of the data combined. A UNION, by contrast, stacks the result sets of two queries on top of each other rather than matching rows on a key.
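As an illustration using pandas, whose `merge`/`concat` mirror these SQL semantics (hypothetical tables):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

inner = left.merge(right, on="id", how="inner")        # ids 2, 3 only
left_join = left.merge(right, on="id", how="left")     # ids 1, 2, 3; score is NaN for id 1
right_join = left.merge(right, on="id", how="right")   # ids 2, 3, 4; name is NaN for id 4
full = left.merge(right, on="id", how="outer")         # ids 1, 2, 3, 4

union_like = pd.concat([left[["id"]], right[["id"]]]).drop_duplicates()  # UNION-style stacking
```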

See details here

What does UNION do? What is the difference between UNION and UNION ALL?
ANS
UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.

See details here

Deep Learning-related

What is model checkpointing?
ANS
It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.
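A minimal PyTorch-style sketch (the placeholder model, file name, and the exact keys saved here are conventions chosen for illustration, not a fixed API):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch, loss = 5, 0.123                                     # hypothetical training state

# Save a snapshot of the training state
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, "checkpoint.pt")

# Later (or after a failure): restore and pick up where training left off
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```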

See details here

What are the problems with sigmoid as an activation function?
ANS
  • Sigmoids saturate and kill gradients: the curve becomes nearly flat (parallel to the x-axis) for very large positive or negative inputs. During backpropagation the local gradient is multiplied through, and if it is near zero the gradient gets killed. **This vanishing-gradient problem is one reason ReLU is preferred**
  • Not zero-centered: sigmoid outputs are not zero-centered, which is undesirable because it can indirectly introduce undesirable zig-zagging dynamics in the gradient updates for the weights
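A quick numerical illustration of the saturation (plain NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 5.0, 10.0])
s = sigmoid(x)
local_grad = s * (1.0 - s)     # derivative of the sigmoid
print(local_grad)              # ~[0.25, 0.0066, 0.000045] -> gradient vanishes for large |x|
```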

See details here

What regularization techniques for neural nets do you know?
ANS
  • L1 & L2 regularization (in the loss function): smaller weight matrices lead to simpler models
  • Dropout: it can be thought of as an ensemble technique in machine learning
  • Data Augmentation
  • Early Stopping
  • Weight Decay
  • Batch Normalization

See details here

What cautions need to be taken when updating pretrained weights in language models? And what are the solutions?
ANS
When updating the weights during training on the target problem, useful pretrained information might be overwritten. Some solutions:
  • Freezing: train layers independently to give them time to adapt to the new task and data, then train all parameters jointly at the end
  • Lower learning rates: a warm-up reduces variance in the early stage of training
  • Regularization: encourage the target model parameters to stay close to the parameters of the pretrained model

See details here