Collection of resources for revising topics in Machine Learning.
https://sebastianraschka.com/faq/
Data Engineering: https://docs.google.com/spreadsheets/d/1GOO4s1NcxCR8a44F0XnsErz5rYDxNbHAHznu4pJMRkw/edit#gid=0
STATS:
- Estimate the minimum sample size for hypothesis testing
- In a linear regression model, a variable has a low coefficient value, p < 0.05, and a high sum of squared error (ANOVA?) contribution for the variable. What would you conclude?
- Hypothesis testing for 1) skewed data 2) proportion data with a non-normal distribution.
- Categorical encoding methods. How do you prevent overfitting with target encoding?
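As a starting point for the sample-size question above, here is a minimal sketch using the common normal-approximation formula for a two-sided, two-sample comparison of means: n per group = 2 * ((z_(1-alpha/2) + z_power) * sigma / effect)^2. The function name, the default alpha and power, and the example numbers are illustrative assumptions, not part of the original notes; an exact t-test calculation gives a slightly larger n.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    effect: smallest difference in means worth detecting (assumed known)
    sigma:  common standard deviation of both groups (assumed known)
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return ceil(n)  # round up: a fraction of an observation cannot be collected


# Hypothetical example: detect a half-standard-deviation shift.
n_needed = sample_size_per_group(effect=0.5, sigma=1.0)
print(n_needed)
```

Halving the detectable effect roughly quadruples the required sample size, because the effect enters the formula squared.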
Regression:
- What are the assumptions of linear regression? (Delhivery, ANZ bank, Citi Bank, Accenture)
- What is the meaning of multicollinearity? (ANZ, Amazon)
- How to detect multicollinearity? (Amazon, Delhivery)
- What do you understand by VIF (Variance Inflation Factor)? (Amazon)
- What is the difference between R-squared and adjusted R-squared? (Delhivery, ANZ bank, Citi Bank, Accenture)
- How to deal with multicollinearity in data? (Citi, Accenture)
- Explain forward and backward elimination? (Accenture)
- How does PCA work? (ICICI securities, Amazon, Miko.ai)
- Explain Ridge and Lasso Regression? (Delhivery, ANZ bank, Citi Bank, Accenture, Amazon)
- Can SVM be used for regression? (Miko.ai)
- What is the curse of dimensionality? Can you give an example?
- What is the difference between the coefficient of determination and coefficient of correlation?
- Give methods of variable selection in Regression Analysis? (Delhivery, ANZ bank, ICICI securities)
- Why do we perform the residual analysis? (ANZ)
- What are L1 and L2 penalization? (Miko.ai)
- What is heteroscedasticity? How does it affect the regression coefficients? (ANZ)
- Why does only VIF > 10 imply that there is multicollinearity? Why not choose VIF > 8? (IDFC First Bank)
- In my dataset, if I have 100 observations and 1500 features, do you think I would be able to fit the regression model onto that or not? (IDFC First Bank)
- For a single variable, how will you detect outliers? (ICICI Lombard)
- How will the correlation between two variables change in the presence of an outlier? Will it increase, decrease or remain constant? Explain how, using its formula. (ICICI Lombard)
- What are influential and leverage points? Which of them has more effect on the model? (ANZ, Wells Fargo)
- Does multicollinearity impact the prediction of a machine learning algorithm? (Wells Fargo)
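For the VIF questions above, a small sketch may help when revising. In general VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The code below covers only the special case of exactly two predictors, where R_j^2 equals the squared Pearson correlation between them. The helper names and the toy data are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Toy data (hypothetical): x2_related nearly duplicates x1, x2_unrelated does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_related = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x2_unrelated = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]

print(vif_two_predictors(x1, x2_related))    # far above the usual rule-of-thumb cutoffs
print(vif_two_predictors(x1, x2_unrelated))  # close to 1, i.e. almost no inflation
```

The formula also shows why cutoffs such as 8 or 10 are conventions rather than theorems: VIF = 10 corresponds to R_j^2 = 0.9 and VIF = 8 to R_j^2 = 0.875, and nothing changes qualitatively between the two.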
Logistic regression:
- What are the different types of loss functions in regression and classification?
- What is the difference between R2 and adjusted R2?
- What are the basic assumptions of linear regression and logistic regression?
- How do you check for multicollinearity in a dataset, and does multicollinearity affect the final performance of the model?
- In linear regression train R2 is 0.95 & test R2 = 0.93 but |y-y`| is large. How is this possible?
- How is hypothesis testing used in linear regression?
- How to decide feature importance in linear regression?
- How do you decide whether your linear regression model fits the data?
- What is the formula of the loss function in logistic regression?
- Why can't MSE be used as a loss function for logistic regression?
- How to decide feature importance in Logistic regression?
- What is the difference between L1 and L2 regularization, and why does L1 create sparsity?
- How do you handle the bias-variance tradeoff? How do you find out whether a model is overfitting?
- What are different types of optimization techniques used to train classical ML algorithms and why do we need SGD over GD?
- Can we use the tanh function in place of sigmoid in logistic regression?
- How to do bivariate analysis between a categorical variable and a continuous variable?
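For the loss-function question above, the standard loss for logistic regression is the binary cross-entropy (log loss): L = -(1/n) * sum over i of [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], where p_i is the predicted probability of class 1. The sketch below is a plain-Python version for revision. The function name and the clipping constant eps are illustrative choices, not taken from any particular library.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) is never evaluated
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print(log_loss(labels, confident_and_right))  # small loss
print(log_loss(labels, confident_and_wrong))  # much larger loss
```

Confidently wrong predictions are penalized heavily because log(p) goes to minus infinity as p approaches 0. This loss is also convex in the model weights, whereas MSE applied to a sigmoid output is not.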
The p-value is a number, calculated from a statistical test, that describes how likely you are to have found a set of observations at least as extreme as the one observed if the null hypothesis were true. P-values are used in hypothesis testing to help decide whether to reject the null hypothesis. The p-value is not the probability that the null hypothesis is true, and 1 minus the p-value is not the probability that the alternative hypothesis is true. A statistically significant test result (p ≤ 0.05, at the conventional 5% significance level) means that the observed data would be unlikely if the null hypothesis held, so the null hypothesis is rejected; it does not prove that the null hypothesis is false. A p-value greater than 0.05 means that the test failed to reject the null hypothesis, not that no effect exists.
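To make the definition concrete, here is a minimal sketch that computes an exact two-sided p-value for a coin-flip experiment: the probability, under the null hypothesis of a fair coin, of a head count at least as far from the expected count as the one observed. The function name and the example of 60 heads in 100 flips are hypothetical. Measuring "as extreme" by distance from the expected count is one common convention and is most natural for the symmetric case p_null = 0.5.

```python
from math import comb


def binomial_two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided p-value for a binomial count under the null hypothesis."""
    expected = flips * p_null
    observed_distance = abs(heads - expected)
    p_value = 0.0
    for k in range(flips + 1):
        # Keep every outcome at least as far from the expected count as the data.
        if abs(k - expected) >= observed_distance - 1e-12:
            p_value += comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
    return p_value


p = binomial_two_sided_p_value(heads=60, flips=100)
print(round(p, 4))
if p <= 0.05:
    print("Reject the null hypothesis at the 5% level.")
else:
    print("Fail to reject the null hypothesis at the 5% level.")
```

The result is a statement about the data under the assumption of a fair coin. It says nothing directly about the probability that the coin is fair.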