Zep - Machine Learning Interview Questions
COMPREHENSIVE GUIDE TO INTERVIEWS FOR MACHINE LEARNING
ZEP ANALYTICS
Introduction
We've curated this series of interview guides to
accelerate your learning and your mastery of data
science skills and tools.
TABLE OF CONTENTS
1. What are different types of Machine Learning?
2. What is Overfitting, and How Can You Avoid It?
3. What is ‘Training Set’ and ‘Test Set’ in a Machine
Learning Model? How Much Data Will You Allocate for
Your Training, Validation, and Test Sets?
4. How Do You Handle Missing or Corrupted Data in a
Dataset?
5. How Can You Choose a Classifier Based on a
Training Set Data Size?
6. What Are the Three Stages of Building a Model in
Machine Learning?
7. What Are the Applications of Supervised Machine
Learning in Modern Businesses?
8. What is Semi-supervised Machine Learning?
9. What Are Unsupervised Machine Learning
Techniques?
10. What is the Difference Between Supervised and
Unsupervised Machine Learning?
11. What Is ‘Naive’ in the Naive Bayes Classifier?
12. Explain How a System Can Play a Game of Chess
Using Reinforcement Learning.
13. How Will You Know Which Machine Learning
Algorithm to Choose for Your Classification Problem?
14. When Will You Use Classification over Regression?
15. What is a Random Forest?
16. What is Bias and Variance in a Machine Learning
Model?
17. What is the Trade-off Between Bias and Variance?
18. What is a Decision Tree Classification?
19. What is Pruning in Decision Trees, and How Is It
Done?
20. Briefly Explain Logistic Regression.
21. Explain the K Nearest Neighbor Algorithm.
22. What is a Recommendation System?
23. What is Kernel SVM?
24. What Are Some Methods of Reducing
Dimensionality?
25. What is Principal Component Analysis?
26. What are Support Vectors in SVM?
27. What is Ensemble learning?
28. What is Cross-Validation?
29. What are the different methods to split a tree in a
decision tree algorithm?
30. How does the Support Vector Machine algorithm
handle self-learning?
31. What is the difference between Lasso and Ridge
regression?
32. What are the assumptions you need to take before
starting with linear regression?
33. Explain the Confusion Matrix with Respect to
Machine Learning Algorithms.
34. What Is a False Positive and False Negative and
How Are They Significant?
35. Define Precision and Recall.
36. What do you understand by Type I vs Type II error?
37. What is a Decision Tree in Machine Learning?
38. What is Hypothesis in Machine Learning?
39. What are the differences between Deep Learning
and Machine Learning?
40. What is Entropy in Machine Learning?
41. What is Epoch in Machine Learning?
42. How is the suitability of a Machine Learning
Algorithm determined for a particular problem?
43. What is the Variance Inflation Factor?
44. When should Classification be used over
Regression?
45. Why is rotation required in PCA? What will happen if
the components are not rotated?
46. What is ROC Curve and what does it represent?
47. Why are Validation and Test Datasets Needed?
48. Explain the difference between KNN and K-means
Clustering.
49. What is Dimensionality Reduction?
50. Both being Tree-based Algorithms, how is Random
Forest different from Gradient Boosting Machine
(GBM)?
51. What is meant by Parametric and Non-parametric
Models?
52. Differentiate between Sigmoid and Softmax
Functions.
53. In Machine Learning, for how many classes can
Logistic Regression be used?
54. What is meant by Correlation and Covariance?
55. What are the Various Tests for Checking the
Normality of a Dataset?
56. What are the Two Main Types of Filtering in Machine
Learning? Explain.
57. Outlier Values can be Discovered from which Tools?
58. What is meant by Ensemble Learning?
59. What are the Various Kernels that are present in
SVM?
60. Suppose you found that your model is suffering
from high variance. Which algorithm do you think could
handle this situation and why?
61. What is Binarizing of Data? How to Binarize?
62. How to Standardize Data?
63. We know that one-hot encoding increases the
dimensionality of a dataset, but label encoding doesn’t.
How?
64. Imagine you are given a dataset consisting of
variables having more than 30% missing values. Let’s
say, out of 50 variables, 16 variables have missing
values, which is higher than 30%. How will you deal with
them?
65. Explain False Negative, False Positive, True Negative,
and True Positive with a simple example.
66. What is F1-score and How Is It Used?
67. How can you avoid overfitting ?
68. What is inductive machine learning?
69. What is Genetic Programming?
70. What is Inductive Logic Programming in Machine
Learning?
71. What is Model Selection in Machine Learning?
72. What is the difference between heuristic for rule
learning and heuristics for decision trees?
73. What is the general principle of an ensemble
method and what is bagging and boosting in ensemble
method?
74. What is bias-variance decomposition of
classification error in ensemble method?
75. What is an Incremental Learning algorithm in
ensemble?
76. What is PCA, KPCA and ICA used for?
77. When does regularization come into play in Machine
Learning?
78. How can we relate standard deviation and
variance?
79. Is a high variance in data good or bad?
80. Explain the handling of missing or corrupted values
in the given dataset.
81. What is Time series?
82. What is a Box-Cox transformation?
83. What is the difference between stochastic gradient
descent (SGD) and gradient descent (GD)?
84. What is the exploding gradient problem while using
back propagation technique?
85. Explain the differences between Random Forest and
Gradient Boosting machines.
86. What’s a Fourier transform?
87. What do you mean by Associative Rule Mining
(ARM)?
88. What is Marginalisation? Explain the process.
89. Explain the phrase “Curse of Dimensionality”.
90. What is the difference between regularization and
normalisation?
91. Explain the difference between Normalization and
Standardization.
92. List the most popular distribution curves along with
scenarios where you will use them in an algorithm.
93. When does the linear regression line stop rotating
or finds an optimal spot where it is fitted on data?
94. Which machine learning algorithm is known as the
lazy learner and why is it called so?
95. Is it possible to use KNN for image processing?
96. Explain the term instance-based learning.
97. What is Bayes’ Theorem? State at least 1 use case
with respect to the machine learning context?
98. What is Naive Bayes? Why is it Naive?
99. Explain the difference between Lasso and Ridge?
100. Why would you Prune your tree?
101. Model accuracy or Model performance? Which one
will you prefer and why?
102. Mention some of the EDA Techniques?
103. Differentiate between Statistical Modeling and
Machine Learning?
104. Differentiate between Boosting and Bagging?
105. What is the significance of Gamma and
Regularization in SVM?
106. What is the difference between a generative and
discriminative model?
107. What are hyperparameters and how are they
different from parameters?
108. Can logistic regression be used for classes more
than 2?
109. How to deal with multicollinearity?
110. What is Heteroscedasticity?
111. Is ARIMA model a good fit for every time series
problem?
112. What is a voting model?
113. How to deal with very few data samples? Is it
possible to make a model out of it?
114. What is Pandas Profiling?
115. When should ridge regression be preferred over
lasso?
116. What is a good metric for measuring the level of
multicollinearity?
117. When can be a categorical value treated as a
continuous variable and what effect does it have when
done so?
118. What is the role of maximum likelihood in logistic
regression.
119. What is a pipeline?
120. What do you understand by L1 and L2
regularization?
121. What do you mean by AUC curve?
122. Why does XGBoost perform better than SVM?
123. What is the difference between SVM Rank and SVR
(Support Vector Regression)?
124. What is the difference between the normal soft
margin SVM and SVM with a linear kernel?
125. What are the advantages of using a naive Bayes
for classification?
126. Are Gaussian Naive Bayes the same as binomial
Naive Bayes?
127. What is the difference between the Naive Bayes
Classifier and the Bayes classifier?
128. In what real world applications is Naive Bayes
classifier used?
129. Is naive Bayes supervised or unsupervised?
130. What do you understand by selection bias in
Machine Learning?
131. What Are the Three Stages of Building a Model in
Machine Learning?
132. What is the difference between Entropy and
Information Gain?
133. What are collinearity and multicollinearity?
134. What is A/B Testing?
135. What is Cluster Sampling?
136. What is deep learning, and how does it contrast
with other machine learning algorithms?
1. What Are the Different Types of Machine Learning?
There are three types of machine learning:
Supervised Learning
In supervised machine learning, a model makes predictions
or decisions based on past or labeled data. Labeled data
refers to sets of data that are given tags or labels, and thus
made more meaningful.
Unsupervised Learning
In unsupervised learning, we don't have labeled data. A
model can identify patterns, anomalies, and relationships in
the input data.
Reinforcement Learning
Using reinforcement learning, the model can learn based on
the rewards it received for its previous action.
zepanalytics.com
ML | COMPREHENSIVE GUIDE TO INTERVIEWS FOR MACHINE LEARNING
02
2. What is Overfitting, and How Can You Avoid It?
Overfitting is a situation that occurs when a model learns the training set too well, taking up random fluctuations in the training data as concepts. These impact the model’s ability to generalize and don’t apply to new data.
When a model is given the training data, it shows close to 100 percent accuracy, technically a slight loss. But when we use the test data, there may be errors and low efficiency. This condition is known as overfitting.
Overfitting can be avoided with techniques such as cross-validation, regularization, early stopping, and training with more data.
3. What is ‘Training Set’ and ‘Test Set’ in a Machine Learning Model? How Much Data Will You Allocate for Your Training, Validation, and Test Sets?
Training Set
The training set is examples given to the model to
analyze and learn.
70% of the total data is typically taken as the training
dataset.
This is labeled data used to train the model.
Test Set
The test set is used to test the accuracy of the hypothesis generated by the model.
The remaining 30% of the data is typically taken as the testing dataset.
We test without the labels and then verify the results against the labels.
Consider a case where you have labeled data for 1,000
records. One way to train the model is to expose all 1,000
records during the training process. Then you take a small
set of the same data to test the model, which would give
good results in this case.
But, this is not an accurate way of testing. So, we set aside a
portion of that data called the ‘test set’ before starting the
training process. The remaining data is called the ‘training
set’ that we use for training the model. The training set
passes through the model multiple times until the accuracy
is high, and errors are minimized.
Now, we pass the test data to check if the model can
accurately predict the values and determine if training is
effective. If you get errors, you either need to change your
model or retrain it with more data.
Regarding the question of how to split the data into a
training set and test set, there is no fixed rule, and the ratio
can vary based on individual preferences.
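A minimal, illustrative sketch of such a split, assuming scikit-learn and synthetic data (the 70/30 ratio and all names here are just one choice):

# A 70/30 train/test split (scikit-learn assumed; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # 1,000 labeled records
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # hold out 30% as the test set
print(len(X_train), len(X_test))  # 700 300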
6. What Are the Three Stages of Building a Model in Machine Learning?
Model Building
Choose a suitable algorithm for the model and train it
according to the requirement
Model Testing
Check the accuracy of the model through the test data
Applying the Model
Make the required changes after testing and use the final
model for real-time projects
8. What is Semi-supervised Machine Learning?
Supervised learning uses data that is completely labeled, whereas unsupervised learning uses no labeled training data.
In the case of semi-supervised learning, the training data
contains a small amount of labeled data and a large
amount of unlabeled data.
9. What Are Unsupervised Machine Learning Techniques?
Clustering
Clustering problems involve dividing the data into subsets.
These subsets, also called clusters, contain data that are
similar to each other. Different clusters reveal different
details about the objects, unlike classification or regression.
Association
In an association problem, we identify patterns of
associations between different variables or items.
For example, an e-commerce website can suggest other
items for you to buy, based on the prior purchases that you
have made, spending habits, items in your wishlist, other
customers’ purchase habits, and so on.
10. What is the Difference Between Supervised and
Unsupervised Machine Learning?
Supervised learning - This model learns from the labeled
data and makes a future prediction as output
Unsupervised learning - This model uses unlabeled input
data and allows the algorithm to act on that information
without guidance.
12. Explain How a System Can Play a Game of Chess Using Reinforcement Learning.
Earlier, chess programs had to determine the best moves
after much research on numerous factors. Building a
machine designed to play such games would require many
rules to be specified.
With reinforcement learning, we don’t have to deal with this
problem as the learning agent learns by playing the game. It
will make a move (decision), check if it’s the right move
(feedback), and keep the outcomes in memory for the next
step it takes (learning). There is a reward for every correct
decision the system takes and punishment for the wrong
one.
14. When Will You Use Classification over Regression?
Classification is used when your target is categorical, while
regression is used when your target variable is continuous.
Both classification and regression belong to the category of
supervised machine learning algorithms.
Examples of classification problems include:
Predicting yes or no
Estimating gender
Breed of an animal
Type of color
Examples of regression problems include:
Estimating sales and price of a product
Predicting the score of a team
Predicting the amount of rainfall
16. What is Bias and Variance in a Machine Learning Model?
Bias
Bias refers to the error introduced by the simplifying assumptions a model makes about the data. For a good model, the bias should be low.
Underfitting: High bias can cause an algorithm to miss the
relevant relations between features and target outputs.
Variance
Variance refers to the amount the target model will change
when trained with different training data. For a good model,
the variance should be minimized.
18. What is a Decision Tree Classification?
A decision tree builds classification (or regression) models
as a tree structure, with datasets broken up into ever-
smaller subsets while developing the decision tree, literally in
a tree-like way with branches and nodes. Decision trees can
handle both categorical and numerical data.
20. Briefly Explain Logistic Regression.
Logistic regression is a classification algorithm used to
predict a binary outcome for a given set of independent
variables.
The output of logistic regression is either a 0 or 1 with a
threshold value of generally 0.5. Any value above 0.5 is
considered as 1, and any point below 0.5 is considered as 0.
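A minimal sketch of this thresholding behaviour, assuming scikit-learn; the data is synthetic and illustrative:

# Logistic regression outputs a probability; apply the 0.5 threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])[:, 1]  # P(class = 1) for five points
labels = (proba >= 0.5).astype(int)     # values >= 0.5 become 1, else 0
print(proba.round(3), labels)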
21. Explain the K Nearest Neighbor Algorithm.
K nearest neighbors (KNN) is a supervised algorithm that classifies a new data point based on the majority class among its K closest points in the training data.
Suppose the new data point to be classified is a black ball, and we use KNN to classify it. Assume K = 5 initially.
Next, we find the K (five) nearest data points and assign the majority class among them to the black ball.
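A minimal sketch of this procedure with K = 5, assuming scikit-learn; the data is synthetic and purely illustrative:

# KNN with K = 5: the query point takes the majority class of its
# five nearest neighbours.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:1]))  # predicted class for one query point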
23. What is Kernel SVM?
Kernel SVM is the abbreviated version of the kernel support
vector machine. Kernel methods are a class of algorithms
for pattern analysis, and the most common one is the kernel
SVM.
27. What is Ensemble learning?
Ensemble learning is a combination of the results obtained
from multiple machine learning models to increase the
accuracy for improved decision-making.
Example: A Random Forest with 100 trees can provide much
better results than using just one decision tree.
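A minimal sketch of this comparison, assuming scikit-learn; the dataset is synthetic, so the exact scores will vary:

# One decision tree versus a 100-tree random forest (cross-validated).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(tree, X, y).mean())    # single tree
print(cross_val_score(forest, X, y).mean())  # ensemble, typically higher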
29. What are the different methods to split a tree in a
decision tree algorithm?
The common methods to split a decision tree are:
Information Gain: Splitting using entropy and information gain is done when the target variable is categorical.
Gini Impurity: Splitting using the Gini index is also used when the target variable is categorical.
Variance: Splitting the nodes of a decision tree using the variance is done when the target variable is continuous.
31. What is the difference between Lasso and Ridge regression?
In Lasso or L1 regression, the penalty function is determined by the sum of the absolute values of the
coefficients. In Ridge or L2 regression, the penalty function is
determined by the sum of the squares of the coefficients.
33. Explain the Confusion Matrix with Respect to Machine Learning Algorithms.
A confusion matrix tabulates predicted classes against actual classes. Consider the following matrix for a binary (Yes/No) classifier (reconstructed from the totals below):
             Predicted Yes   Predicted No
Actual Yes        12               1
Actual No          3               9
Here,
For actual values:
Total Yes = 12+1 = 13
Total No = 3+9 = 12
Similarly, for predicted values:
Total Yes = 12+3 = 15
Total No = 1+9 = 10
For a model to be accurate, the values across the diagonals
should be high. The total sum of all the values in the matrix
equals the total observations in the test data set.
For the above matrix, total observations = 12+3+1+9 = 25
Now, accuracy = sum of the values across the
diagonal/total dataset
= (12+9) / 25
= 21 / 25
= 84%
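The same accuracy can be recomputed programmatically; a minimal sketch with NumPy, using the matrix values from above:

# Accuracy = diagonal sum / total observations.
import numpy as np

cm = np.array([[12, 1],   # actual Yes: 12 predicted Yes, 1 predicted No
               [3,  9]])  # actual No:  3 predicted Yes, 9 predicted No
print(np.trace(cm) / cm.sum())  # 0.84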
34. What Is a False Positive and False Negative and How Are
They Significant?
False positives are those cases that wrongly get classified as
True but are False.
False negatives are those cases that wrongly get classified
as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’
row of the predicted value in the confusion matrix. The
complete term indicates that the system has predicted it as
a positive, but the actual value is negative.
35. Define Precision and Recall.
Precision
Precision is the ratio of events correctly predicted as positive to the total number of events predicted as positive (a mix of correct and wrong predictions).
Precision = (True Positive) / (True Positive + False Positive)
Recall
Recall is the ratio of events correctly predicted as positive to the total number of actual positive events.
Recall = (True Positive) / (True Positive + False Negative)
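A minimal sketch of both formulas, assuming scikit-learn; the label vectors are illustrative:

# Precision = TP / (TP + FP); Recall = TP / (TP + FN).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]
print(precision_score(y_true, y_pred))  # 3 TP, 1 FP -> 0.75
print(recall_score(y_true, y_pred))     # 3 TP, 1 FN -> 0.75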
37. What is a Decision Tree in Machine Learning?
A decision tree is used to explain the sequence of actions
that must be performed to get the desired output. It is a
hierarchical diagram that shows the actions.
38. What is Hypothesis in Machine Learning?
A hypothesis is an approximation of
the unknown target function that maps all plausible
observations based on the given problem in the best
manner. Hypothesis in Machine learning is a model that
helps in approximating the target function and performing
the necessary input-to-output mappings. The choice and
configuration of algorithms allow defining the space of
plausible hypotheses that may be represented by a model.
In this notation, lowercase h is used for a specific hypothesis, while uppercase H is used for the hypothesis space that is being searched. Let us briefly understand these notations:
Hypothesis (h): A hypothesis is a specific model that
helps in mapping input to output; the mapping can
further be used for evaluation and prediction.
Hypothesis set (H): Hypothesis set consists of a space of
hypotheses that can be used to map inputs to outputs,
which can be searched. The general constraints include
the choice of problem framing, the model, and the
model configuration.
39. What are the differences between Deep Learning and Machine Learning?
Machine Learning typically relies on structured data and hand-engineered features, whereas Deep Learning is a subset of Machine Learning built on multi-layer artificial neural networks and can work as end-to-end systems as well. The systems acquire various properties and features with the help of the given data, and the problem is solved using an end-to-end method.
41. What is Epoch in Machine Learning?
An epoch in Machine Learning indicates one complete pass of the algorithm over the given training dataset. Generally, when there is a large chunk of data, it is grouped into several batches, and each pass of a batch through the model is referred to as an iteration. If the batch size comprises the complete training dataset, then the count of iterations is the same as that of epochs.
In case there is more than one batch, d*e=i*b is the formula
used, wherein d is the dataset, e is the number of epochs, i is
the number of iterations, and b is the batch size.
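A small worked example of the d*e = i*b relationship (the numbers are illustrative):

# With 1,000 samples (d), batch size 100 (b), and 10 epochs (e),
# the total number of iterations i is d * e / b.
d, b, e = 1000, 100, 10
i = d * e // b
print(i)  # 100 iterations in total, i.e., 10 per epoch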
42. How is the suitability of a Machine Learning Algorithm determined for a particular problem?
Step 1: Classifying the problem: First, the problem is classified on the basis of its input and output.
If the output is a number, then regression techniques must
be used; if the output is a different cluster of inputs, then
clustering techniques should be used.
Step 2: Checking the algorithms in hand: After classifying the
problem, the available algorithms that can be deployed for
solving the classified problem should be considered.
Step 3: Implementing the algorithms: If there are multiple
algorithms available, then all of them are to be
implemented. Finally, the algorithm that gives the best
performance is selected.
44. When should Classification be used over Regression?
Classification is chosen over regression when the output of
the model needs to yield the belongingness of data points in
a dataset to a particular category.
For example, If you want to predict the price of a house, you
should use regression since it is a numerical variable.
However, if you are trying to predict whether a house
situated in a particular area is going to be high-, medium-,
or low-priced, then a classification model should be used.
46. What is ROC Curve and what does it represent?
ROC stands for receiver operating characteristic. The ROC curve is used to graphically represent the trade-off between the true positive rate and the false positive rate.
In ROC, the area under the curve (AUC) gives an idea about the accuracy of the model: the greater the AUC, the better the performance of the model.
47. Why are Validation and Test Datasets Needed?
Validation dataset: The validation dataset is used to look into a model’s response during training; the hyperparameters are then tuned on the basis of the estimates obtained on it. When a model’s response is evaluated using the validation dataset, the model is indirectly trained with the validation set. This may lead to the overfitting of the model to that specific data, so the model may not be strong enough to give the desired response to real-world data.
Test dataset: The test dataset is the subset of the actual dataset that has not yet been used to train the model. The model is unaware of this dataset, so by using the test dataset, the response of the created model can be computed on unseen data. The model’s performance is evaluated on the basis of the test dataset.
48. Explain the difference between KNN and K-means
Clustering.
K-nearest neighbors (KNN): It is a supervised Machine
Learning algorithm. In KNN, identified or labeled data is given
to the model. The model then matches the points based on
the distance from the closest points.
K-means clustering: It is an unsupervised Machine Learning algorithm. In K-means clustering, unlabeled data is given to the model, and the algorithm groups the points into K clusters based on their distance from the cluster centroids.
49. What is Dimensionality Reduction?
In the real world, Machine Learning models are built on top
of features and parameters. These features can be
multidimensional and large in number. Sometimes, the
features may be irrelevant and it becomes a difficult task to
visualize them.
This is where dimensionality reduction is used to cut down irrelevant and redundant features with the help of principal variables. These principal variables are a subgroup of the parent variables and conserve their essential information.
50. Both being Tree-based Algorithms, how is Random Forest different from Gradient Boosting Machine (GBM)?
Random Forest uses bagging, which mainly helps to reduce variance, whereas GBM uses boosting. Boosting assists in reducing bias and variance by strengthening the weak learners.
54. What is meant by Correlation and Covariance?
Correlation is a mathematical concept used in statistics and
probability theory to measure, estimate, and compare data
samples taken from different populations. In simpler terms,
correlation helps in establishing a quantitative relationship
between two variables.
Covariance is also a mathematical concept; it is a simpler way to arrive at a correlation between two variables. Covariance basically helps in determining what change or effect one variable has on another.
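A minimal sketch of the contrast, assuming NumPy; the arrays are illustrative:

# Covariance is scale-dependent; correlation rescales it to [-1, 1].
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(np.cov(x, y)[0, 1])       # covariance of x and y
print(np.corrcoef(x, y)[0, 1])  # correlation = 1.0 (perfect linear relation)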
55. What are the Various Tests for Checking the Normality of
a Dataset?
In Machine Learning, checking the normality of a dataset is
very important. Hence, certain tests are performed on a
dataset to check its normality. Some of them are:
D’Agostino Skewness Test
Shapiro-Wilk Test
Anderson-Darling Test
Jarque-Bera Test
Kolmogorov-Smirnov Test
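A minimal sketch of running two of these tests, assuming SciPy; the sample is synthetic:

# Both tests return a statistic and a p-value; a large p-value means
# normality cannot be rejected.
import numpy as np
from scipy import stats

sample = np.random.default_rng(0).normal(size=200)
print(stats.shapiro(sample))         # Shapiro-Wilk test
print(stats.kstest(sample, "norm"))  # Kolmogorov-Smirnov test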
56. What are the Two Main Types of Filtering in Machine Learning? Explain.
The two main types of filtering are collaborative filtering and content-based filtering.
Collaborative filtering refers to a recommender system
where the interests of the individual user are matched with
preferences of multiple users to predict new content.
Content-based filtering is a recommender system where the
focus is only on the preferences of the individual user and
not on multiple users.
59. What are the Various Kernels that are present in SVM?
The various kernels that are present in SVM are:
Linear
Polynomial
Radial Basis
Sigmoid
60. Suppose you found that your model is suffering from
high variance. Which algorithm do you think could handle
this situation and why?
Handling High Variance
For handling issues of high variance, we should use the
bagging algorithm.
The bagging algorithm would split the data into subgroups with replicated, random sampling (bootstrapping).
Once the algorithm splits the data, we can use these random samples to create rules with a particular training algorithm.
After that, we can use voting for combining the predictions of the models.
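A minimal sketch of this bagging procedure, assuming scikit-learn; the data and base model are illustrative:

# Bagging: 50 trees on bootstrap samples, combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)
print(bag.score(X, y))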
63. We know that one-hot encoding increases the
dimensionality of a dataset, but label encoding doesn’t.
How?
When one-hot encoding is used, there is an increase in the
dimensionality of a dataset. The reason for the increase in
dimensionality is that, for every class in categorical
variables, it forms a different variable.
Example: Suppose there is a variable “Color.” It has three
sublevels, “Yellow,” “Purple,” and “Orange.” So, one-hot
encoding “Color” will create three different variables as
Color.Yellow, Color.Purple, and Color.Orange.
In label encoding, the subclasses of a certain variable get values of 0 and 1. So, label encoding is only used for binary variables.
This is why one-hot encoding increases the dimensionality
of data and label encoding does not.
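A minimal sketch of the “Color” example, assuming Pandas; the frame is illustrative:

# One-hot encoding adds one column per class; label encoding keeps
# a single column of integer codes.
import pandas as pd

df = pd.DataFrame({"Color": ["Yellow", "Purple", "Orange"]})
print(pd.get_dummies(df["Color"], prefix="Color"))  # three new columns
print(df["Color"].astype("category").cat.codes)     # one column of codes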
65. Explain False Negative, False Positive, True Negative, and
True Positive with a simple example.
True Positive (TP): When the Machine Learning model
correctly predicts the condition, it is said to have a True
Positive value.
True Negative (TN): When the Machine Learning model
correctly predicts the negative condition or class, then it is
said to have a True Negative value.
False Positive (FP): When the Machine Learning model incorrectly predicts the positive class for an actually negative condition, it is said to have a False Positive value.
False Negative (FN): When the Machine Learning model incorrectly predicts the negative class for an actually positive condition, it is said to have a False Negative value.
Example: in a test for a disease, a False Positive flags a healthy person as sick, while a False Negative clears a sick person as healthy.
66. What is F1-score and How Is It Used?
Precision = (No. of True Positives) / (No. of True Positives + No. of False Positives)
Both precision and recall are partial measures of accuracy
of a model. F1-score combines precision and recall and
provides an overall score to measure a model’s accuracy.
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
This is why F1-score is the most popular measure of accuracy in Machine-Learning-based binary classification models.
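A small worked example of the formula (the precision and recall values are illustrative):

# F1 is the harmonic mean of precision and recall.
precision, recall = 0.75, 0.6
print(round(2 * (precision * recall) / (precision + recall), 3))  # 0.667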
68. What is inductive machine learning?
Inductive machine learning involves the process of learning by examples, where a system tries to induce a general rule from a set of observed instances.
73. What is the general principle of an ensemble method
and what is bagging and boosting in ensemble method?
The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, while boosting methods are applied sequentially to reduce the bias of the combined model. Both boosting and bagging can reduce errors by reducing the variance term.
77. When does regularization come into play in Machine
Learning?
When the model begins to underfit or overfit, regularization becomes necessary. It is a technique that regularizes or shrinks the coefficient estimates towards zero. It reduces flexibility and discourages complexity in a model to avoid the risk of overfitting. The model complexity is reduced, and it becomes better at predicting (generalizing).
80. Explain the handling of missing or corrupted values in the
given dataset.
An easy way to handle missing or corrupted values is to drop the corresponding rows or columns. If there are too many rows or columns to drop, then we consider replacing the missing or corrupted values with some new value.
Identifying missing values and dropping the rows or columns can be done using the isnull() and dropna() functions in Pandas. Also, the fillna() function in Pandas replaces the incorrect values with a placeholder value.
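A minimal sketch of those Pandas calls; the frame and the placeholder value 0 are illustrative:

# Count, drop, and fill missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
print(df.isnull().sum())  # missing values per column
print(df.dropna())        # drop rows containing missing values
print(df.fillna(0))       # replace missing values with a placeholder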
83. What is the difference between stochastic gradient
descent (SGD) and gradient descent (GD)?
Gradient Descent and Stochastic Gradient Descent are algorithms that find the set of parameters that minimizes a loss function.
The difference is that in Gradient Descent, all training samples are evaluated for each parameter update, while in Stochastic Gradient Descent, only one training sample is evaluated for each update.
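A minimal sketch of the contrast on a one-dimensional least-squares problem, assuming NumPy; the data and learning rate are illustrative:

# GD averages the gradient over all samples per update;
# SGD uses the gradient of a single random sample per update.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3.0

w_gd, w_sgd, lr = 0.0, 0.0, 0.1
for _ in range(50):
    w_gd -= lr * np.mean((w_gd * x - y) * x)    # full-batch gradient
    i = rng.integers(len(x))
    w_sgd -= lr * (w_sgd * x[i] - y[i]) * x[i]  # one-sample gradient
print(w_gd, w_sgd)  # both approach 3.0; the SGD estimate is noisier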
85. Explain the differences between Random Forest and
Gradient Boosting machines.
Random forests pool a significant number of decision trees using averages or majority rules at the end, while gradient boosting machines also combine decision trees but start the combining process at the beginning, unlike random forests. Random forest creates each tree independently of the others, while gradient boosting develops one tree at a time. Gradient boosting yields better outcomes than random forests if parameters are carefully tuned, but it is not a good option if the dataset contains a lot of outliers, anomalies, or noise, as these can result in overfitting of the model. Random forests perform well for multiclass object detection. Gradient boosting performs well when there is unbalanced data, such as in real-time risk assessment.
87. What do you mean by Associative Rule Mining (ARM)?
Associative Rule Mining is one of the techniques used to discover patterns in data, like features (dimensions) that occur together and features (dimensions) that are correlated. It is mostly used in Market Basket Analysis to find how frequently an itemset occurs in a transaction. Association rules have to satisfy minimum support and minimum confidence at the same time. Association rule generation generally comprises two steps:
A minimum support threshold is applied to obtain all frequent itemsets in a database.
A minimum confidence constraint is applied to these frequent itemsets in order to form the association rules.
Support is a measure of how often the itemset appears in the dataset, and confidence is a measure of how often a particular rule has been found to be true.
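A small worked example of support and confidence on a toy transaction list (the items are illustrative):

# Support({bread, milk}) and confidence of the rule bread -> milk.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
n = len(transactions)
support_bm = sum({"bread", "milk"} <= t for t in transactions) / n
support_b = sum("bread" in t for t in transactions) / n
print(support_bm, support_bm / support_b)  # 0.5 and 0.666...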
89. Explain the phrase “Curse of Dimensionality”.
The Curse of Dimensionality refers to the situation when your
data has too many features.
The phrase is used to express the difficulty of using brute
force or grid search to optimize a function with too many
inputs.
It can also refer to several other issues like:
If we have more features than observations, we have a
risk of overfitting the model.
When we have too many features, observations become
harder to cluster. Too many dimensions cause every
observation in the dataset to appear equidistant from all
others and no meaningful clusters can be formed.
Dimensionality reduction techniques like PCA come to the
rescue in such cases.
91. Explain the difference between Normalization and
Standardization.
Normalization and standardization are two very popular methods used for feature scaling. Normalization refers to re-scaling the values to fit into a range of [0, 1]. Standardization refers to re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance). Normalization is useful when all parameters need to have an identical positive scale; however, the outliers from the dataset are lost. Hence, standardization is recommended for most applications.
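A minimal sketch of both scalers, assuming scikit-learn; the array is illustrative:

# Normalization maps values into [0, 1]; standardization gives
# mean 0 and unit variance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])
print(MinMaxScaler().fit_transform(X).ravel())    # normalized
print(StandardScaler().fit_transform(X).ravel())  # standardized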
92. List the most popular distribution curves along with scenarios where you will use them in an algorithm.
Normal distribution describes how the values of a variable
are distributed. It is typically a symmetric distribution where
most of the observations cluster around the central peak.
The values further away from the mean taper off equally in
both directions. An example would be the height of students
in a classroom.
Poisson distribution helps predict the probability of certain
events happening when you know how often that event has
occurred. It can be used by businessmen to make forecasts
about the number of customers on certain days and allows
them to adjust supply according to the demand.
Exponential distribution is concerned with the amount of
time until a specific event occurs. For example, how long a
car battery would last, in months.
We can switch the evaluation system to AUC-ROC. Since we added/deleted data (upsampling or downsampling), we can go ahead with a stricter algorithm like SVM, Gradient Boosting, or AdaBoost.
96. Explain the term instance-based learning.
Instance-based learning is a set of procedures for regression and classification which produce a class label prediction based on resemblance to the nearest neighbors in the training data set. These algorithms simply collect all the data and produce an answer only when required or queried. In simple words, they are a set of procedures for solving new problems based on the solutions of already solved past problems that are similar to the current problem.
98. What is Naive Bayes? Why is it Naive?
Naive Bayes is considered naive because the attributes in it (for the class) are assumed to be independent of the others in the same class. This assumed lack of dependence between two attributes of the same class creates the quality of naiveness.
101. Model accuracy or Model performance? Which one will you prefer and why?
Model accuracy is only a subset of model performance, and performance can include speed as an important feature. Example: the best of search results will lose its virtue if the query results do not appear fast.
As for why accuracy is not always the most important virtue: for any imbalanced data set, an F1 score, more than accuracy, will explain the business case, and precision and recall will be more important than the rest.
103. Differentiate between Statistical Modeling and Machine
Learning?
Machine learning models are about making accurate predictions about situations, like footfall in restaurants or stock price, whereas statistical models are designed for inference about the relationships between variables, such as what drives the sales in a restaurant: is it the food or the ambience?
104. Differentiate between Boosting and Bagging?
Bagging trains several weak learners in parallel on bootstrapped samples of the data and combines their outputs by averaging or voting, whereas boosting trains weak learners sequentially, with each new learner focusing on the errors made earlier in the process. Weak classifiers used are generally logistic regression, shallow decision trees, etc.
There are many algorithms which make use of boosting processes, but three of them are mainly used: AdaBoost, Gradient Boosting, and XGBoost.
107. What are hyperparameters and how are they different
from parameters?
A parameter is a variable that is internal to the model and
whose value is estimated from the training data. They are
often saved as part of the learned model. Examples include
weights, biases etc.
A hyperparameter is a variable that is external to the model and whose value cannot be estimated from the data. Hyperparameters are often used to help estimate model parameters, and their choice is sensitive to implementation. Examples include the learning rate, the number of hidden layers, etc.
108. Can logistic regression be used for classes more than
2?
No, logistic regression cannot be used for more than 2 classes, as it is a binary classifier. For multi-class classification, algorithms like Decision Trees and Naive Bayes classifiers are better suited.
111. Is ARIMA model a good fit for every time series problem?
No, the ARIMA model is not suitable for every type of time series problem. There are situations where the ARMA model and others also come in handy.
ARIMA is best when different standard temporal structures need to be captured in the time series data.
112. What is a voting model?
A voting model is an ensemble model that combines several classifiers. To produce the final result in a classification-based model, it takes into account the class predicted by each model for a certain data point and picks the most voted option from all the given classes in the target column.
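A minimal sketch of a hard-voting ensemble, assuming scikit-learn; the base models and data are illustrative:

# Each classifier votes; the majority class is the final prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
vote = VotingClassifier(estimators=[("lr", LogisticRegression()),
                                    ("dt", DecisionTreeClassifier()),
                                    ("nb", GaussianNB())],
                        voting="hard")
print(vote.fit(X, y).predict(X[:5]))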
116. What is a good metric for measuring the level of
multicollinearity?
VIF, or 1/tolerance, is a good measure of multicollinearity in models. The VIF of a predictor measures how much the variance of its estimated coefficient is inflated by its correlation with the other predictors. So, the higher the VIF value, the greater the multicollinearity amongst the predictors.
A rule of thumb for interpreting the variance inflation factor:
1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
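A minimal sketch of computing per-predictor VIFs, assuming statsmodels and Pandas; the collinear frame is constructed for illustration:

# x2 is built to be nearly collinear with x1, so both get high VIFs.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100)})
X["x2"] = X["x1"] * 0.9 + rng.normal(scale=0.1, size=100)
X["x3"] = rng.normal(size=100)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))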
119. What is a pipeline?
A pipeline is a way of writing software such that each intended action while building a model is serialized, and the process calls the individual functions for the individual tasks. The tasks are carried out in sequence for a given set of data points, and the entire chain can be composed using the composite estimators in scikit-learn.
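A minimal sketch of such a pipeline, assuming scikit-learn; the two steps are illustrative:

# Scaling and classification chained into one estimator.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
print(pipe.fit(X, y).score(X, y))  # fit runs both steps in sequence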
122. Why does XGBoost perform better than SVM?
XGBoost is an ensemble method that uses many trees and keeps improving by iterating over its mistakes. An SVM, given a suitable kernel, can fit almost any data, but at the same time it needs to use a kernel, and we can argue that there is no perfect kernel for every dataset.
124. What is the difference between the normal soft margin SVM and SVM with a linear kernel?
Soft margin
A soft margin SVM allows the classifier to make mistakes on some points, so the training error will not be 0, but the average error over all points is minimized.
Kernels
The above assumes that the best classifier is a straight line. But what if it is not a straight line (e.g., it is a circle: inside the circle is one class, outside is another)? If we are able to map the data into higher dimensions, the higher dimension may give us a straight line.
126. Are Gaussian Naive Bayes the same as binomial Naive
Bayes?
Binomial Naive Bayes: It assumes that all our features are binary, i.e., they take only two values. This means 0s can represent “word does not occur in the document” and 1s “word occurs in the document.”
Gaussian Naive Bayes: Because of its assumption of the normal distribution, Gaussian Naive Bayes is used when all our features are continuous. For example, in the Iris dataset, the features are sepal width, petal width, sepal length, and petal length, which can take different values since width and length can vary. We can’t represent such features in terms of their occurrences; this means the data is continuous, hence we use Gaussian Naive Bayes here.
127. What is the difference between the Naive Bayes Classifier and the Bayes classifier?
The Naive Bayes Classifier assumes that all features are conditionally independent given the class. A full Bayes (Bayesian network) classifier instead learns the structure (connections and directions) and the parameters (likelihoods) using the data. After the structure has been learned, the class is only determined by the nodes in the Markov blanket (its parents, its children, and the parents of its children), and all variables outside the Markov blanket are discarded.
129. Is naive Bayes supervised or unsupervised?
Naive Bayes is a supervised learning algorithm: it learns from labeled training data to classify new observations. Since these are generative models, based upon the assumptions about the random variable mapping of each feature vector, they may even be classified as Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.
130. What do you understand by selection bias in Machine Learning?
Selection bias occurs when the data used to train or evaluate a model is not representative of the population it is meant to describe. Common kinds include:
Data: When specific subsets of data are chosen to
support a conclusion or rejection of bad data on
arbitrary grounds, instead of according to previously
stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused
by attrition (loss of participants) discounting trial
subjects/tests that did not run to completion.
133. What are collinearity and multicollinearity?
Collinearity is a linear association between two predictors.
Multicollinearity is a situation where two or more predictors
are highly linearly related.
136. What is deep learning, and how does it contrast with
other machine learning algorithms?
Deep learning is a subset of machine learning that is
concerned with neural networks: how to use
backpropagation and certain principles from neuroscience
to more accurately model large sets of unlabelled or semi-
structured data. In that sense, deep learning represents an
unsupervised learning algorithm that learns representations
of data through the use of neural nets.
This brings our list of 120+
Machine Learning interview
questions to an end.
Ready to take the next steps?
Zep offers a platform for education to learn,
grow & earn.
zepanalytics.com