Hearing about an interview always makes us feel jittery, but we know the process is worth it because it may land you your dream job. A machine learning interview is no exception; in fact, machine learning roles are among the most in-demand positions today, and preparing for one takes a great deal of effort and perseverance.
Trying to prepare for everything will only leave you confused. Instead, focus on the key topics that strengthen your core concepts. The Machine Learning interview questions below will help you crack your upcoming interviews.
Top Machine Learning Interview Questions and Answers
Let us dig in and study the top Machine Learning interview questions and answers below:
Question: What is Machine Learning? And how is it different from Artificial Intelligence?
Answer: Machine learning is the process by which a machine learns from experience. A dataset is fed to a program that is capable of learning from it; in the output, the program can recognize things that fit the patterns of that dataset even if it has never seen those exact examples before. ML works on pattern recognition; AI, on the other hand, embodies the broader idea of intelligence and is about training a system to react the way a human brain would.
Question: Define three stages of building a model in Machine Learning.
Answer: The three stages of building a model in ML are:
- Model Building
Choose an appropriate algorithm for the model and train it according to the requirement.
- Model Testing
Check the accuracy of the model by testing it on held-out datasets.
- Applying the Model
Make the required changes after testing and use the final model for real-time projects.
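A minimal sketch of these three stages, assuming scikit-learn and its built-in iris dataset as a stand-in:

```python
# Minimal sketch of the three stages with scikit-learn (iris used as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Model building: choose an algorithm and train it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 2. Model testing: check accuracy on held-out data.
print("Test accuracy:", model.score(X_test, y_test))

# 3. Applying the model: use the trained model on new, unseen samples.
print("Prediction:", model.predict(X_test[:1]))
```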
Question: Explain parametric models with examples. How are they different from non-parametric models?
Answer: Models with a finite number of parameters are Parametric models. We need to know the parameters of the model to predict new data — for example, linear regression, logistic regression, and linear SVMs.
Models with an unbounded number of parameters are non-parametric models, which allows for more flexibility. To predict new data, we need both the parameters of the model and the observed data itself. Examples include decision trees, k-nearest neighbors, and topic models using latent Dirichlet allocation.
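A minimal sketch of the contrast, assuming scikit-learn and a synthetic dataset: a linear regression (parametric) is summarized by a fixed set of coefficients, while a k-nearest-neighbors regressor (non-parametric) relies directly on the stored training data.

```python
# Hedged sketch: parametric (linear regression) vs non-parametric (k-NN) on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Parametric: the fitted model is fully described by a fixed set of parameters.
lin = LinearRegression().fit(X, y)
print("Linear regression parameters:", lin.coef_, lin.intercept_)

# Non-parametric: k-NN keeps the training data itself; predictions depend on it directly.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print("k-NN relies on all", X.shape[0], "stored training points for prediction")
```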
Question: Differentiate between Type I and Type II error.
Answer:
- Type I Error: Rejecting a true null hypothesis; it is also called a false positive and is generally considered the more serious error. The probability of making this error equals the level of significance (α). It amounts to claiming that something has happened when it hasn't.
- Type II Error: Accepting a false null hypothesis; it is also called a false negative. The probability of making this error depends mainly on the sample size and the population variance, and it is more likely when the effect is hard to detect due to difficult sampling or high variability. The probability of correctly rejecting a false null hypothesis is 1 - β, also known as the power of the test. It amounts to claiming that nothing has happened when something actually has.
Question: What are the types of machine learning? Differentiate between them.
Answer:
| | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Definition | Taught by labeled data. | Taught without any guidance using unlabeled data. | Taught by self-learning through interaction with the surrounding environment. |
| Types of Problems | Regression and classification | Association and clustering | Reward-based |
| Type of data | Labeled data | Unlabeled data | No pre-defined data |
| Training | Involves external supervision. | Doesn’t involve supervision. | Doesn’t involve supervision. |
| Approach | Maps labeled input to output. | Discovers output by understanding patterns. | Trial and error to discover the output. |
| Popular Algorithms | Linear regression, KNN | K-means, C-means | Q-learning |
You may read about Supervised and Unsupervised Learning in detail here.
Question: Explain generalization, overfitting, and underfitting.
Answer:
Generalization
A model is built and trained on a dataset so that it can make accurate predictions on unseen data. If the trained model makes accurate predictions on data it has not seen during training, we say the model generalizes from the training set to the test set.
Overfitting
When a model is fit too closely to the particularities of the training set, we obtain a model that works well on the training data but is not able to generalize to new data; this is overfitting. In simple words, the model picks up so much detail and noise from the training data that it gives poor results on anything else.
Underfitting
When a model is too simple and doesn't cover all the aspects and variability of the data, it may perform poorly even on the training set. Choosing a model that is too simple is underfitting.
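A minimal sketch of both failure modes, assuming scikit-learn and a synthetic noisy sine dataset: fitting polynomials of increasing degree shows a too-simple model underfitting and a too-flexible one overfitting.

```python
# Hedged sketch: underfitting vs overfitting with polynomial regression on noisy sine data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")
```

A large gap between training and test scores signals overfitting; poor scores on both signal underfitting.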
Question: What is inductive machine learning?
Answer:
Inductive machine learning is learning by example: the system tries to induce a general rule or model from a set of observed training instances, which can then be applied to new data.
Question: Name some tools that are used for running machine learning algorithms in parallel.
Answer: Some of the tools are:
- GPUs
- MATLAB
- MapReduce
- Spark
- GraphLab
- Giraph
- Vowpal Wabbit
Question: What is the difference between Causation and Correlation? Explain with an example.
Answer: Causation is a relationship between two variables in which one of them causes the occurrence of the other.
Correlation is a relationship between two variables that move together but are not caused by each other.
For example, inflation causes price fluctuations in petrol and groceries, so inflation has a causal relationship with both of them. Between petrol and groceries there is a correlation: both of them can increase or decrease due to changes in inflation, but neither of them causes or impacts the other.
Question: Define Sampling. Why do we need it?
Answer: Sampling is the process of choosing a subset of a target population to serve as its representative. We use the data from the sample to understand patterns in the population as a whole. Sampling is necessary because we often cannot gather or process the complete data within a reasonable time. Sampling can be performed with several techniques; some of them are Random Sampling, Stratified Sampling, and Cluster Sampling.
Question: State the difference between classification and regression.
Answer: Classification is a supervised learning technique where the output label is discrete or categorical. Regression, on the other hand, is a supervised learning technique used to predict continuous or real-valued variables.
For instance, predicting a stock price is a regression problem because the stock price is a continuous variable that can take real values, whereas predicting whether an email is spam or not is a classification problem because the output is discrete and has only two possible values, yes or no.
Question: What is stratified sampling?
Answer: Stratified sampling is a probability sampling technique wherein the entire population is divided into different subgroups called strata, and then a probability sample is drawn proportionally from each stratum. For instance, in binary classification, if the ratio of positive to negative labeled data is 9:1, then in stratified sampling you randomly select subsamples from the positive and negative labeled subsets such that the 9:1 ratio is preserved after sampling.
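A minimal sketch of stratified sampling, assuming scikit-learn's train_test_split with its stratify option on an imbalanced label set:

```python
# Hedged sketch: stratified vs plain random splitting on an imbalanced (9:1) label set.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 900 + [1] * 100)   # 9:1 class ratio
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

_, _, _, y_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_strat = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

print("plain split positive fraction:", y_plain.mean())      # may drift from 0.10
print("stratified positive fraction:", y_strat.mean())       # stays at 0.10
```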
Question: Define confidence interval
Answer: It is an interval estimate that is likely to include an unknown population parameter, the estimated range being calculated from the given sample dataset. It is the range of values within which you are reasonably confident the true value of the parameter lies.
Question: Define conditional probability.
Answer: Conditional probability is the measure of the likelihood of one event, given that another event has occurred. For two events A and B, the conditional probability of A, given that B has already occurred, is:
P(A|B) = P(A ∩ B) / P(B)
where ∩ stands for the intersection. So, the conditional probability is the joint probability of both events divided by the probability of event B.
Question: Explain what Bayes theorem is and why it is useful.
Answer: The theorem is used to describe the probability of an event based on prior knowledge of other events related to it. For example, the probability of a person having a particular disease would be estimated based on the symptoms shown.
Bayes theorem is mathematically formulated as:
P(A|B) = P(B|A) × P(A) / P(B)
where A and B are events and P(B) ≠ 0. Most of the time we want to find P(A|B) but only know P(B|A), so we can use Bayes theorem to find the missing value.
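A small worked example in Python with hypothetical numbers (a 1% disease prevalence, a 95% sensitive test, and a 5% false positive rate):

```python
# Hedged worked example of Bayes' theorem with hypothetical numbers.
p_disease = 0.01                # P(disease)
p_pos_given_disease = 0.95      # P(positive | disease), test sensitivity
p_pos_given_healthy = 0.05      # P(positive | no disease), false positive rate

# Total probability of a positive test, P(positive).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.161
```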
Question: How are True Positive Rate and Recall related?
Answer: True Positive Rate is the same as Recall, which is also called sensitivity. The formula to calculate it is:
Recall = TP / (TP + FN)
where TP = true positives and FN = false negatives.
Question: What is a probabilistic graphical model?
Answer: A probabilistic graphical model is a robust framework that represents the conditional dependence among the random variables in a graph structure. It can be used in modelling a large number of random variables having complex interactions with each other.
Question: What are the two representations of graphical models? Differentiate between them.
Answer: The two branches of graphical representation of a distribution are Markov Networks and Bayesian Networks. They differ in the set of independencies that they can encode.
- Bayesian Networks: When the model structure is a Directed Acyclic Graph (DAG), the model represents a factorization of the joint probability of all the random variables. Bayesian networks capture conditional independence between random variables and reduce the number of parameters required to estimate the joint probability distribution.
- Markov Networks: They are used when the underlying structure of the model is an undirected graph. They follow the Markov property, i.e. given the current state, the future states are independent of the past states. A Markov network represents the distribution over the sequence of nodes.
Question: How is the k-Nearest Neighbor (k-NN) algorithm different from the k-Means algorithm?
Answer:
- The fundamental difference between these algorithms is that k-NN is a Supervised algorithm, whereas k-Means is unsupervised.
- k-NN is a classification algorithm, and k-Means is a clustering algorithm.
- k-NN tries to classify an observation based on its “k” nearest neighbours. It is also known as a lazy learner because it does absolutely nothing at the training stage. On the other hand, the k-Means algorithm partitions the training data set into clusters such that the data points in one cluster are well separated from those in other clusters; the algorithm tries to maintain enough separability between the clusters.
Question: How is KNN different from k-means clustering?
Answer:
| kNN | k-means clustering |
| --- | --- |
| Supervised learning algorithm used for classification. | Unsupervised method used for clustering. |
| Data is labelled for training. | No labelled data; the machine trains itself. |
| The ‘k’ refers to the number of nearest neighbours of a target label. | The ‘k’ refers to the number of clusters, which is set at the beginning of the algorithm. |
| The algorithm stops when it gives the highest possible accuracy. | The algorithm is complete when no more points move from one cluster to another. |
| We can optimize the algorithm using a confusion matrix and cross-validation. | Optimization can be performed using the silhouette and elbow methods. |
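A minimal sketch of the two algorithms side by side, assuming scikit-learn and the iris dataset: k-NN uses the labels, while k-means never sees them.

```python
# Hedged sketch: k-NN (supervised, uses labels) vs k-means (unsupervised, no labels) on iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN: trained with labels, then classifies new points by their nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))

# k-means: labels are never used; the data is partitioned into k clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", [int((kmeans.labels_ == c).sum()) for c in range(3)])
```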
Question: Define F-test. Where would you use it?
Answer: An F-test is any statistical hypothesis test where the test statistic follows an F-distribution under the null hypothesis. If you have two models that have been fitted to a dataset, you can use an F-test to identify which model better fits the sample population.
Question: What is a chi-squared test?
Answer: Chi-squared is any statistical hypothesis test where the test statistic follows a chi-squared distribution (distribution of the sum of the squared standard normal deviates) under the null hypothesis. It measures how well the observed distribution of data fits the expected distribution if the variables are independent.
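A minimal sketch of a chi-squared test of independence, assuming SciPy and a hypothetical 2x2 contingency table (say, ad clicks under two page layouts):

```python
# Hedged sketch: chi-squared test of independence on a hypothetical 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 70],    # layout A: clicked, not clicked
                     [55, 45]])   # layout B: clicked, not clicked

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
# A small p-value suggests the click rate is not independent of the layout.
```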
Question: What is the p-value? Why is it important?
Answer: The p-value represents the marginal significance level within a statistical hypothesis test; it is the smallest significance level at which the null hypothesis can be rejected. A small p-value (typically <= 0.05) means there is strong evidence against the null hypothesis, and therefore you can reject it. A large p-value (> 0.05) signifies weak evidence against the null hypothesis, so you cannot reject it. The smaller the p-value, the greater the significance with which the null hypothesis can be rejected.
Question: Explain how a ROC curve works.
Answer: ROC curve or Receiver Operating Characteristic curve is the graphical representation of the performance of a classification model across all classification thresholds. The graph shows two parameters, True Positive Rate (TPR) and False Positive Rate (FPR), at different classification thresholds.
A typical ROC curve plots TPR on the vertical axis against FPR on the horizontal axis. Lowering the threshold classifies more items as positive, thereby increasing both true positives and false positives. To summarize the curve in a single number, we use AUC (Area Under the Curve), which measures the entire two-dimensional area underneath the ROC curve.
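A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn and a logistic-regression classifier on synthetic data:

```python
# Hedged sketch: ROC curve and AUC for a logistic-regression classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR / TPR at every threshold
print("AUC:", roc_auc_score(y_test, scores))
```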
Question: Define precision and recall.
Answer: Precision and recall are measures used to evaluate the performance of a classification algorithm. In a perfect classifier, precision and recall are equal to 1. Precision is the fraction of relevant instances amongst the retrieved instances, whereas recall is the fraction of retrieved instances amongst the relevant instances.
Precision = true positive/(true positive + false positive)
Recall = true positive/(true positive + false negative)
Question: What is the difference between L1 and L2 regularization?
Answer: Both L1 and L2 regularization are used to avoid overfitting. L1 is also called the Lasso and L2 the Ridge regularization technique. A common intuition is that minimizing an L1 (absolute-error) loss corresponds to estimating the median of the data, whereas minimizing an L2 (squared-error) loss corresponds to estimating the mean.
In L1 regularization, the penalty is the sum of the absolute values of the coefficients; coefficients of unimportant features are driven to exactly zero, so only the most relevant features are kept. In L2 regularization, the penalty is the sum of the squared coefficients, which shrinks all coefficients towards zero without eliminating them.
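A minimal sketch contrasting the two penalties, assuming scikit-learn's Lasso and Ridge on synthetic regression data with many uninformative features:

```python
# Hedged sketch: L1 (Lasso) vs L2 (Ridge) regularization on data with many irrelevant features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero (implicit feature selection).
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
# L2 shrinks coefficients toward zero but rarely makes them exactly zero.
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```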
Question: What is the difference between ‘training Set’ and ‘test Set’ in a Machine Learning Model?
Answer: Whenever we obtain a dataset, we split the data into two sets – training and testing. Usually, 70-80% of the data is taken for training and the rest for testing. The training dataset is used to create or build the model, while the test dataset is used to evaluate the model and measure its accuracy.
Question: How Do You Handle Missing or Corrupted Data in a Dataset?
Answer: There are many ways to do this:
- Remove or drop the rows or columns that contain missing values.
- Replace the missing values with another value, for example the mean, median or mode of the column.
- Assign them to a new category if a trend or pattern is seen.
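A minimal pandas sketch of these three options, using a small hypothetical DataFrame:

```python
# Hedged sketch: common ways to handle missing values with pandas (hypothetical toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [50000, 62000, np.nan, 58000],
    "city": ["Delhi", None, "Mumbai", "Pune"],
})

dropped = df.dropna()                                   # 1. drop rows with missing values
df["age"] = df["age"].fillna(df["age"].median())        # 2. replace with a statistic (median)
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna("Unknown")               # 3. assign a new category
print(df)
```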
Question: What Are the Applications of Supervised Machine Learning in Modern Businesses?
Answer: There are many practical applications of supervised learning:
- image classification
- recommender systems
- dynamic pricing
- customer segmentation
- identify the most valuable customers (Customer lifetime value modeling)
Question: What is Semi-supervised Machine Learning?
Answer: Semi-supervised learning is an approach that mixes the supervised and unsupervised learning mechanisms. It combines a small amount of labelled data with a large amount of unlabelled data, which is fed into the system for training. Speech recognition is a good example of semi-supervised learning. This type of ML approach helps when you don't have enough labelled data, since it effectively increases the size of the training data.
Question: What Are Unsupervised Machine Learning Techniques?
Answer: Unsupervised learning methods are used when we don't have labelled data, i.e. only the input is known and the output is unknown. Patterns, trends and the underlying structure are identified and modelled using the unlabelled training dataset. Because there are no labels, the results are generally harder to evaluate than in supervised learning. The most popular technique is cluster analysis, used in Exploratory Data Analysis (EDA) to find patterns, groupings and trends.
Question: What is an F1 score?
Answer: The F1 score is a measure of a model's accuracy on a classification task. It is the harmonic mean of the model's precision and recall: F1 = 2 × (precision × recall) / (precision + recall). The result ranges between 0 and 1, with 0 being the worst and 1 the best. The F1 score is widely used in Information Retrieval and Natural Language Processing.