
Company Name - TCS

Role - Data Scientist

Today's Interview Questions and Answers (23-11-2021)

Q1. GMM vs K means?

Ans. Gaussian mixture models (GMMs) are often used for data clustering. You can
use GMMs to perform either hard clustering or soft clustering on query data. To
perform hard clustering, the GMM assigns query data points to the multivariate
normal components that maximize the component posterior probability, given the
data.

K-means clustering aims to partition data into k clusters in such a way that data points in
the same cluster are similar and data points in different clusters are farther
apart. The similarity of two points is determined by the distance between them.
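
A minimal sketch of the difference, assuming scikit-learn is available; the two-blob dataset and all parameter values are illustrative:

```python
# Hard clustering with K-means vs. hard and soft clustering with a GMM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # blob around (0, 0)
               rng.normal(5, 1, (100, 2))])  # blob around (5, 5)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # hard assignments only

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard = gmm.predict(X)         # hard clustering: component with the highest posterior
soft = gmm.predict_proba(X)   # soft clustering: posterior probability for each component

print(km_labels[:3], hard[:3], soft[:3].round(3))
```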

Q2. Regression imputation?

Ans. Regression imputation fits a statistical model on a variable with missing
values. Predictions of this regression model are used to substitute the missing
values in this variable.
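
A rough sketch of the idea with pandas and scikit-learn (both assumed available); the column names and toy values are made up for illustration:

```python
# Fit a regression on the rows where the target column is observed,
# then use its predictions to fill the missing entries.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 4.2, np.nan, 8.1, np.nan, 12.2]})

observed = df[df["y"].notna()]
missing = df[df["y"].isna()]

model = LinearRegression().fit(observed[["x"]], observed["y"])
df.loc[df["y"].isna(), "y"] = model.predict(missing[["x"]])
print(df)
```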

Q3. Z score for outlier’s treatment?

Ans. The Z-score test is one of the most commonly used methods to detect outliers. It
measures the number of standard deviations an observation is away from the mean.
A z-score of 1.5 indicates that the observation is 1.5 standard deviations
above the mean, and -1.5 means that the observation is 1.5 standard deviations
below the mean.
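
A small illustrative sketch with NumPy (assumed available); the data and the cutoff of 2 are arbitrary, and a cutoff of 3 is also commonly used:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 12])   # 95 is an obvious outlier
z = (data - data.mean()) / data.std()                # z-score of every observation
outliers = data[np.abs(z) > 2]                       # flag points more than 2 SDs from the mean
print(z.round(2), outliers)
```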

Q4. Handle data with lot of noise?

Ans. Noisy data can be handled by following the given procedures:

1) Binning:
• Binning methods smooth a sorted data value by consulting the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.

2) Regression:
• Here data can be smoothed by fitting the data to a function.
• Linear regression involves finding the “best” line to fit two attributes, so that one
attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than
two attributes are involved and the data are fitted to a multidimensional surface.

3) Clustering:
• Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.”
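
As a rough illustration of the binning idea above (smoothing by bin means), here is a sketch with NumPy; the values and the number of bins are arbitrary:

```python
import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 4)                                      # 4 equal-frequency buckets
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])  # replace each value by its bin mean
print(smoothed)
```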
Q5. Min sample leaf and max depth in random forest?

Ans. min_samples_leaf specifies the minimum number of samples that must be
present in a leaf node after splitting a node. The max_depth of a tree in a Random
Forest is defined as the longest path between the root node and a leaf node.
Using the max_depth parameter, I can limit the depth to which I want every tree in
my random forest to grow.
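
A minimal sketch with scikit-learn (assumed available); the dataset and parameter values are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,   # every leaf must keep at least 5 samples
                            max_depth=6,          # no tree is grown deeper than 6 levels
                            random_state=0).fit(X, y)
print(rf.estimators_[0].get_depth())              # depth of the first tree, capped at 6
```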
Q6. Dictionary and list comprehension?

Ans. The only difference between a list comprehension and a dictionary comprehension is
that the dictionary comprehension produces key: value pairs, so the expression
must supply both a key and a value instead of a single value.
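
A short illustration (the numbers are arbitrary):

```python
numbers = [1, 2, 3, 4]
squares_list = [n ** 2 for n in numbers]      # list comprehension -> [1, 4, 9, 16]
squares_dict = {n: n ** 2 for n in numbers}   # dict comprehension -> {1: 1, 2: 4, 3: 9, 4: 16}
print(squares_list, squares_dict)
```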

Part 1 – Basic Statistics and Distributions

20 Questions (www.analyticsindiamag.com)

1. What is the difference between data analysis and machine learning?

Data analysis requires strong knowledge of coding and basic knowledge of statistics.

Machine learning, on the other hand, requires basic knowledge of coding and
strong knowledge of statistics and business.
2. What is big data?

Big data has 3 major components – volume (size of data), velocity (inflow of data)
and variety (types of data)

Big data causes “overloads”

3. What are the four main things we should know before studying data
analysis?

Descriptive statistics

Inferential statistics

Distributions (normal distribution / sampling distribution)

Hypothesis testing

4. What is the difference between inferential statistics and descriptive statistics?

Descriptive statistics – provides exact and accurate information.

Inferential statistics – provides information from a sample; we need inferential
statistics to reach a conclusion about the population.

5. What is the difference between population and sample in inferential statistics?

From the population we take a sample. We cannot work on the whole population, either
due to computational costs or due to the unavailability of all data points for the
population.

From the sample we calculate the statistics.

From the sample statistics we draw conclusions about the population.

6. What are descriptive statistics?


Descriptive statistics are used to describe the data (data properties).

The 5-number summary is the most commonly used set of descriptive statistics.

7. Most common characteristics used in descriptive statistics?

 Center – middle of the data. Mean / Median / Mode are the most commonly
used as measures.
 Mean – average of all the numbers
 Median – the number in the middle
 Mode – the number that occurs the most. The disadvantage of using
Mode is that there may be more than one mode.
 Spread – How the data is dispersed. Range / IQR / Standard Deviation /
Variance are the most commonly used as measures.
 Range = Max – Min
 Inter Quartile Range (IQR) = Q3 – Q1
 Standard Deviation (σ) = √(∑(x-µ)² / n)
 Variance = σ²

 Shape – the shape of the data can be symmetric or skewed


 Symmetric – the part of the distribution that is on the left side of the
median is same as the part of the distribution that is on the right side
of the median
 Left skewed – the left tail is longer than the right side
 Right skewed – the right tail is longer than the left side 
 Outlier – An outlier is an abnormal value
 Keep the outlier based on judgement
 Remove the outlier based on judgement

8. What is quantitative data and qualitative data?

Quantitative data is also known as numeric data

Qualitative data is also known as categorical data

9. How to calculate range and interquartile range?

Range = Max – Min

IQR = Q3 – Q1

Where Q3 is the third quartile (75th percentile) and Q1 is the first quartile (25th percentile)

10. Why do we need the 5-number summary?

Low extreme (minimum)

Lower quartile (Q1)

Median

Upper quartile (Q3)

Upper extreme (maximum)

11. What is the benefit of using a box plot?

Shows the 5-number summary pictorially

Can be used to compare groups of histograms

12. What is the meaning of standard deviation?

It represents how far the data points are from the mean.

σ = √(∑(x-µ)² / n)

Variance is the square of the standard deviation.
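
A quick check of the formula with NumPy (assumed available); ddof=0 gives the population version shown above, while ddof=1 gives the sample version:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(x.std(ddof=0))   # 2.0  (standard deviation)
print(x.var(ddof=0))   # 4.0  (variance = square of the standard deviation)
```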

13. What is left skewed distribution and right skewed distribution?

 Left skewed
 The left tail is longer than the right side
 Mean < median < mode
 Right skewed
 The right tail is longer than the left side
 Mode < median < mean
14. What does symmetric distribution mean?

The part of the distribution that is on the left side of the median is same as the part
of the distribution that is on the right side of the median

A few examples are – uniform distribution, binomial distribution, normal distribution

15. What is the relationship between mean and median in a normal distribution?

In a normal distribution, the mean is equal to the median.

16. What does it mean by bell curve distribution and Gaussian distribution?

Normal distribution is called bell curve distribution / Gaussian distribution

It is called bell curve because it has the shape of a bell

It is called Gaussian distribution as it is named after Carl Gauss

17. How to convert normal distribution to standard normal distribution?

Standardized normal distribution has mean = 0 and standard deviation = 1

To convert a normal distribution to the standard normal distribution, we can use the formula

X (standardized) = (x - µ) / σ

18. What is an outlier?

An outlier is an abnormal value (It is at an abnormal distance from rest of the data
points). 

19. Mention one method to find outliers?


The 5-number summary (box plot) can be used to identify outliers.

Widely used rule – any data point that lies outside 1.5 * IQR from the quartiles:

Lower bound = Q1 – (1.5 * IQR)

Upper bound = Q3 + (1.5 * IQR)
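
A small sketch of the 1.5 * IQR rule with NumPy (assumed available); the data is made up so that two obvious outliers fall outside the bounds:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
                 12, 14, 17, 16, 107, 10, 13, 12, 14, 11])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)   # only 102 and 107 fall outside the bounds
```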

20. What can I do with outlier?

 Remove outlier
 When we know the data-point is wrong (negative age of a person)
 When we have lots of data
 We should provide two analyses. One with outliers and another
without outliers.
 Keep outlier
 When there are a lot of outliers (skewed data)
 When results are critical
 When outliers have meaning (fraud data)

Part 2 – Advanced Statistics and Hypothesis Testing

20 Questions

21. What is the difference between population parameters and sample statistics?

 Population parameters are:


 Mean = µ
 Standard deviation = σ
 Sample statistics are:
 Mean = x (bar)
 Standard deviation = s

22. Why we need sample statistics?

Population parameters are usually unknown hence we need sample statistics.


23. How to find the mean length of all fishes in the sea?

Define the confidence level (most common is 95%)

Take a sample of fishes from the sea (to get better results the number of fishes >
30)

Calculate the mean length and standard deviation of the lengths

Calculate t-statistics

Get the confidence interval in which the mean length of all the fishes should be.

24. What are the effects of the width of confidence interval?

 Confidence interval is used for decision making


 As the confidence level increases the width of the confidence interval also
increases
 As the width of the confidence interval increases, we tend to get useless
information also.
 Useless information – wide CI
 High risk – narrow CI

25. Mention the relationship between standard error and margin of error?

As the standard error increases the margin of error also increases

26. Mention the relationship between confidence interval and margin of error?

As the confidence level increases the margin of error also increases

27. What is the proportion of confidence interval that will not contain the
population parameter?

Alpha is the portion of confidence interval that will not contain the population
parameter
α = 1 – CL

28. What is the difference between 95% confidence level and 99% confidence
level?

The confidence interval widens as we move from a 95% confidence level to a 99%
confidence level.

29. What do you mean by degree of freedom?

DF is defined as the number of options we have 

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)

30. What do you think if DF is more than 30?

As DF increases the t-distribution reaches closer to the normal distribution

At low DF, we have fat tails

If DF > 30, then t-distribution is as good as normal distribution

31. When to use t distribution and when to use z distribution?

 The following conditions must be satisfied to use Z-distribution


 Do we know the population standard deviation?
 Is the sample size > 30?
 CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
 Else we should use t-distribution
 CI = x (bar) – t*s/√n to x (bar) + t*s/√n
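
A minimal sketch of both interval formulas using SciPy and NumPy (assumed available); the sample values and the 95% confidence level are illustrative:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.0])
n, xbar, s = len(sample), sample.mean(), sample.std(ddof=1)

# t-interval: population sigma unknown and/or small sample
t_crit = stats.t.ppf(0.975, df=n - 1)
t_ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

# z-interval: would apply if sigma were known and n > 30 (s is used here just to compare widths)
z_crit = stats.norm.ppf(0.975)
z_ci = (xbar - z_crit * s / np.sqrt(n), xbar + z_crit * s / np.sqrt(n))

print(t_ci, z_ci)   # the t-interval is slightly wider because of the fatter tails
```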

32. What is H0 and H1? What is H0 and H1 for two-tail test?

 H0 is known as null hypothesis. It is the normal case / default case.


 For a one-tail test: H0: x ≤ µ
 For a two-tail test: H0: x = µ
 H1 is known as the alternate hypothesis. It is the other case.
 For a one-tail test: H1: x > µ
 For a two-tail test: H1: x ≠ µ

33. What is p-value in hypothesis testing?

 If the p-value is more than the critical value, then we fail to reject H0
 If p-value = 0.15 (critical value = 0.05) – strong evidence for H0
 If p-value = 0.055 (critical value = 0.05) – weak evidence for H0
 If the p-value is less than the critical value, then we reject H0
 If p-value = 0.045 (critical value = 0.05) – weak evidence against H0
 If p-value = 0.005 (critical value = 0.05) – strong evidence against H0

34. How to calculate p-value using manual method?

Find H0 and H1

Find n, x(bar) and s

Find DF for t-distribution

Find the type of distribution – t or z distribution

Find t or z value (using the look-up table)

Compare the p-value to the critical value

35. How to calculate p-value using EXCEL?

Go to Data tab

Click on Data Analysis

Select Descriptive Statistics

Choose the column

Select summary statistics and confidence level (0.95)


36. What do we mean by – making decision based on comparing p-value with
significance level?

If the p-value is more than the critical value, then we fail to reject the H0

If the p-value is less than the critical value, then we reject the H0

37. What is the difference between one tail and two tail hypothesis testing?

 2-tail test: Critical region is on both sides of the distribution


 H0: x = µ
 H1: x <> µ
 1-tail test: Critical region is on one side of the distribution
 H0: x ≤ µ
 H1: x > µ

38. What do you think of the tail (one tail or two tail) if H0 is equal to one
value only?

It is a two-tail test

39. What is the critical value in one tail or two-tail test?

Critical value in 1-tail = alpha

Critical value in 2-tail = alpha / 2

40. Why is the t-value same for 90% two tail and 95% one tail test?

P-value of 1-tail = P-value of 2-tail / 2

It is because in two tail there are 2 critical regions


Top 75 Statistics Interview Questions
Statistics is a very interesting field and has a lot of impact in today’s world of
computing and large-scale data handling. Many companies are investing billions of
dollars into statistics and analytics.
This creates a lot of jobs in this sector, along with the increased competition it
brings. To help you with your statistics interview, we have come up with these
interview questions and answers that can guide you on how to approach questions
and answer them effectively, thereby helping you ace the interview you are
preparing for.
Q1. How is the statistical significance of an insight assessed?
Q2. Where are long-tailed distributions used?
Q3. What is the central limit theorem?
Q4. What are observational and experimental data in statistics?
Q5. What is meant by mean imputation for missing data? Why is it bad?
Q6. What is an outlier? How can outliers be determined in a dataset?
Q7. How is missing data handled in statistics?
Q8. What is exploratory data analysis?
Q9. What is the meaning of selection bias?
Q10. What are the types of selection bias in statistics?
This Top Statistics Interview Questions and answers blog is divided into three
sections:
1. Basic
2. Intermediate
3. Advanced
Basic Interview Questions

1. How is the statistical significance of an insight assessed?


Hypothesis testing is used to find out the statistical significance of the insight. To
elaborate, the null hypothesis and the alternate hypothesis are stated, and the p-
value is calculated.
The p-value is calculated under the assumption that the null hypothesis is true. To
fine-tune the result, the alpha value, which denotes the significance level, is
chosen. If the p-value turns out to be less than alpha, then the null hypothesis is
rejected, and the result obtained is considered statistically significant.

2. Where are long-tailed distributions used?


A long-tailed distribution is a type of distribution where the tail drops off gradually
toward the end of the curve.
The Pareto principle and the product sales distribution are good examples to denote
the use of long-tailed distributions. Also, it is widely used in classification and
regression problems.

3. What is the central limit theorem?


The central limit theorem states that the sampling distribution of the sample mean
approaches a normal distribution as the sample size increases, regardless of the
shape of the population distribution.
The central limit theorem is key because it is widely used in performing
hypothesis testing and in calculating confidence intervals accurately.

4. What is observational and experimental data in Statistics?


Observational data correlates to the data that is obtained from observational
studies, where variables are observed to see if there is any correlation between
them.
Experimental data is derived from experimental studies, where certain variables are
held constant or deliberately manipulated to observe their effect on other variables.

5. What is meant by mean imputation for missing data? Why is it bad?


Mean imputation is a rarely used practice where null values in a dataset are
replaced directly with the corresponding mean of the data.
It is considered a bad practice as it completely ignores feature correlation. It
also means that the data will have artificially low variance and increased bias,
which reduces the accuracy of the model and produces misleadingly narrow
confidence intervals.

6. What is an outlier? How can outliers be determined in a dataset?


Outliers are data points that vary in a large way when compared to other
observations in the dataset. Depending on the learning process, an outlier can
worsen the accuracy of a model and decrease its efficiency sharply.
Outliers are determined by using two methods:

 Standard deviation/z-score
 Interquartile range (IQR)

7. How is missing data handled in statistics?


There are many ways to handle missing data in Statistics:

 Prediction of the missing values


 Assignment of individual (unique) values
 Deletion of rows, which have the missing data
 Mean imputation or median imputation
 Using random forests, which support the missing values

8. What is exploratory data analysis?


Exploratory data analysis is the process of performing investigations on data to
understand the data better.
In this, initial investigations are done to determine patterns, spot abnormalities, test
hypotheses, and also check if the assumptions are right.

9. What is the meaning of selection bias?


Selection bias is a phenomenon that involves the selection of individual or grouped
data in a way that is not considered to be random. Randomization plays a key role
in performing analysis and understanding model functionality better.
If correct randomization is not achieved, then the resulting sample will not
accurately represent the population.

10. What are the types of selection bias in statistics?


There are many types of selection bias as shown below:

 Observer selection
 Attrition
 Protopathic bias
 Time intervals
 Sampling bias

11. What is the meaning of an inlier?


An inlier is a data point that lies at the same level as the rest of the dataset. Finding
an inlier in the dataset is difficult when compared to an outlier as it requires
external data to do so. Inliers, similar to outliers reduce model accuracy. Hence,
even they are removed when they’re found in the data. This is done mainly to
maintain model accuracy at all times.

12. What is the probability of throwing two fair dice and getting a sum of 5? A sum of 8?
There are 4 ways of rolling a 5 (1+4, 4+1, 2+3, 3+2):
P(Getting a 5) = 4/36 = 1/9
Now, there are 5 ways of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(Getting an 8) = 5/36 ≈ 0.139

13. State the case where the median is a better measure when compared to the
mean.
In the case where there are a lot of outliers that can positively or negatively skew
data, the median is preferred as it provides an accurate measure in this case of
determination.

14. Can you give an example of root cause analysis?


Root cause analysis, as the name suggests, is a method used to solve problems by
first identifying the root cause of the problem.
Example: If the higher crime rate in a city is directly associated with the higher
sales in a red-colored shirt, it means that they are having a positive correlation.
However, this does not mean that one causes the other.
Causation can always be tested using A/B testing or hypothesis testing.

15. What is the meaning of six sigma in statistics?


Six sigma is a quality assurance methodology used widely in statistics to provide
ways to improve processes and functionality when working with data.
A process is considered as six sigma when 99.99966% of the outcomes of the
model are considered to be defect-free.

16. What is DOE?


DOE is an acronym for the Design of Experiments in statistics. It is considered as
the design of a task that describes the information and the change of the same
based on the changes to the independent input variables.

17. What is the meaning of KPI in statistics?


KPI stands for Key Performance Indicator. It is used as a reliable
metric to measure the success of a company with respect to achieving its
required business objectives.
There are many good examples of KPIs:

 Profit margin percentage


 Operating profit margin
 Expense ratio

18. What type of data does not have a log-normal distribution or a Gaussian
distribution?
Exponentially distributed data does not follow a log-normal distribution or a Gaussian
distribution. In fact, any type of data that is categorical will not have these
distributions either.
Example: duration of a phone call, time until the next earthquake, etc.

19. What is the Pareto principle?


The Pareto principle is also called the 80/20 rule, which means that 80 percent of
the results are obtained from 20 percent of the causes in an experiment.
A simple example of the Pareto principle is the observation that 80 percent of peas
come from 20 percent of pea plants on a farm.

20. What is the meaning of the five-number summary in Statistics?


The five-number summary is a measure of five entities that cover the entire range
of data as shown below:

 Low extreme (Min)


 First quartile (Q1)
 Median
 Upper quartile (Q3)
 High extreme (Max)

21. What are population and sample in Inferential Statistics, and how are they
different?
A population is a large volume of observations (data). The sample is a small
portion of that population. Because of the large volume of data in the population, it
raises the computational cost. The availability of all data points in the population is
also an issue.
In short:

 We calculate the statistics using the sample.


 Using these sample statistics, we make conclusions about the population.

22. What are quantitative data and qualitative data?

 Quantitative data is also known as numeric data.


 Qualitative data is also known as categorical data.

23. What is Mean?


Mean is the average of a collection of values. We can calculate the mean by
dividing the sum of all observations by the number of observations.

24. What is the meaning of standard deviation?


Standard deviation represents the magnitude of how far the data points are from the
mean. A low value of standard deviation is an indication of the data being close to
the mean, and a high value indicates that the data is spread to extreme ends, far
away from the mean.

25. What is a bell-curve distribution?


A normal distribution can be called a bell-curve distribution. It gets its name from
the bell curve shape that we get when we visualize the distribution.

26. What is Skewness?


Skewness measures the lack of symmetry in a data distribution. It indicates that
there are significant differences between the mean, the mode, and the median of
data. Skewed data cannot be used to create a normal distribution.

27. What is kurtosis?


Kurtosis is used to describe the extreme values present in one tail of distribution
versus the other. It is actually the measure of outliers present in the distribution. A
high value of kurtosis represents large amounts of outliers being present in data. To
overcome this, we have to either add more data into the dataset or remove the
outliers.

28. What is correlation?


Correlation is used to test relationships between quantitative variables and
categorical variables. Unlike covariance, correlation tells us how strong the
relationship is between two variables. The value of correlation between two
variables ranges from -1 to +1.
The -1 value represents a high negative correlation, i.e., if the value in one variable
increases, then the value in the other variable will drastically decrease. Similarly,
+1 means a positive correlation, and here, an increase in one variable will lead to
an increase in the other. Whereas, 0 means there is no correlation.
If two variables are strongly correlated, then they may have a negative impact on
the statistical model, and one of them must be dropped.
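
A small sketch with NumPy (assumed available); the two series are arbitrary toy data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient, always between -1 and +1
print(round(r, 3))            # about 0.85: a strong positive correlation
```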
Next up on this top Statistics Interview Questions and Answers blog, let us take a
look at the intermediate set of questions.
Intermediate Interview Questions

29. What are left-skewed and right-skewed distributions?


A left-skewed distribution is one where the left tail is longer than that of the right
tail. Here, it is important to note that the mean < median < mode.
Similarly, a right-skewed distribution is one where the right tail is longer than the
left one. But, here mean > median > mode.

30. What is the difference between Descriptive and Inferential Statistics?


Descriptive Statistics: Descriptive statistics is used to summarize a sample set of
data like the standard deviation or the mean.
Inferential statistics: Inferential statistics is used to draw conclusions from the test
data that are subjected to random variations.

31. What are the types of sampling in Statistics?


There are four main types of data sampling as shown below:

 Simple random: Pure random division


 Cluster: Population divided into clusters
 Stratified: Data divided into unique groups
 Systematic: Picks up every ‘n’ member in the data

32. What is the meaning of covariance?


Covariance is a measure of how two random variables vary together.
The systematic relationship between a pair of random variables is examined to see
whether a change in one will affect the other variable in the pair.
33. Imagine that Jeremy took part in an examination. The test is having a
mean score of 160, and it has a standard deviation of 15. If Jeremy’s z-score is
1.20, what would be his score on the test?
To determine the solution to the problem, the following formula is used:
X = μ + Zσ

Here:
μ: Mean
σ: Standard deviation
X: Value to be calculated

Therefore, X = 160 + (15 × 1.2) = 178

34. If a distribution is skewed to the right and has a median of 20, will the
mean be greater than or less than 20?
If the given distribution is a right-skewed distribution, then the mean should be
greater than 20, while the mode remains to be less than 20.

35. What is Bessel's correction?


Bessel’s correction is a factor that is used when estimating a population’s standard
deviation from its sample. It causes the standard deviation to be less biased,
thereby providing more accurate results.

36. The standard normal curve has a total area to be under one, and it is
symmetric around zero. True or False?
True. The total area under the standard normal curve is one, and the curve is
symmetric around zero. Here, all of the measures of central tendency are equal to zero
due to the symmetric nature of the standard normal curve.

37. In an observation, there is a high correlation between the time a person
sleeps and the amount of productive work he does. What can be inferred from this?
First, correlation does not imply causation here. Correlation is only used to
measure the relationship, which is linear between rest and productive work. If both
vary rapidly, then it means that there is a high amount of correlation between them.

38. What is the relationship between the confidence level and the significance
level in statistics?
The significance level is the probability of obtaining a result that is extremely
different from the condition where the null hypothesis is true. While the confidence
level is used as a range of similar values in a population.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level

39. A regression analysis between apples (y) and oranges (x) resulted in the
following least-squares line: y = 100 + 2x. What is the implication if oranges
are increased by 1?
If the oranges are increased by one, there will be an increase of 2 apples since the
equation is:
y = 100 + 2x.

40. What types of variables are used for Pearson’s correlation coefficient?
Variables used for Pearson’s correlation coefficient must be measured on either a
ratio scale or an interval scale.
Note that there can exist a condition where one variable is measured on a ratio scale,
while the other is an interval score.

41. In a scatter diagram, what is the line that is drawn above or below the
regression line called?
The line that is drawn above or below the regression line in a scatter diagram is
called the residual or also the prediction error.

42. What are the examples of symmetric distribution?


Symmetric distribution means that the data on the left side of the median is the
same as the one present on the right side of the median.
There are many examples of symmetric distribution, but the following three are the
most widely used ones:

 Uniform distribution
 Binomial distribution
 Normal distribution

43. Where is inferential statistics used?


Inferential statistics is used for several purposes, such as research, in which we
wish to draw conclusions about a population using some sample data. This is
performed in a variety of fields, ranging from government operations to quality
control and quality assurance teams in multinational corporations.

44. What is the relationship between mean and median in a normal distribution?
In a normal distribution, the mean is equal to the median. To know if the
distribution of a dataset is normal, we can just check the dataset’s mean and
median.

45. What is the difference between the 1st quartile, the 2nd quartile, and the
3rd quartile?
Quartiles are used to describe the distribution of data by splitting it into four
equal portions, and the boundaries of these portions are called quartiles.
That is,

 The lower quartile (Q1) is the 25th percentile.


 The middle quartile (Q2), also called the median, is the 50th percentile.
 The upper quartile (Q3) is the 75th percentile.

46. How do the standard error and the margin of error relate?
The standard error and the margin of error are quite closely related to each other. In
fact, the margin of error is calculated using the standard error. As the standard error
increases, the margin of error also increases.

47. What is one sample t-test?


This T-test is a statistical hypothesis test in which we check if the mean of the
sample data is statistically or significantly different from the population’s mean.

48. What is an alternative hypothesis?


The alternative hypothesis (denoted by H1) is the statement that must be true if the
null hypothesis is false. That is, it is a statement used to contradict the null
hypothesis. It is the opposing point of view that gets proven right when the null
hypothesis is proven wrong.

49. Given a left-skewed distribution that has a median of 60, what conclusions
can we draw about the mean and the mode of the data?
Given that it is a left-skewed distribution, the mean will be less than the median,
i.e., less than 60, and the mode will be greater than 60.

50. What are the types of biases that we encounter while sampling?
Sampling biases are errors that occur when taking a small sample of data from a
large population as the representation in statistical analysis. There are three types
of biases:

 The selection bias


 The survivorship bias
 The undercoverage bias
Advanced Interview Questions

51. What are the scenarios where outliers are kept in the data?
There are not many scenarios where outliers are kept in the data, but there are some
important situations when they are kept. They are kept in the data for analysis if:

 Results are critical


 Outliers add meaning to the data
 The data is highly skewed

52. Briefly explain the procedure to measure the length of all sharks in the
world.
Following steps can be used to determine the length of sharks:

 Define the confidence level (usually around 95%)


 Use sample sharks to measure
 Calculate the mean and standard deviation of the lengths
 Determine t-statistics values
 Determine the confidence interval in which the mean length lies

53. How does the width of the confidence interval change with the confidence level?
The width of the confidence interval is used to determine the decision-making
steps. As the confidence level increases, the width also increases.
The following also apply:

 Wide confidence interval: Useless information


 Narrow confidence interval: High-risk factor

54. What is the meaning of degrees of freedom (DF) in statistics?


Degrees of freedom or DF is used to define the number of options at hand when
performing an analysis. It is mostly used with t-distribution and not with the z-
distribution.
If there is an increase in DF, the t-distribution will reach closer to the normal
distribution. If DF > 30, this means that the t-distribution at hand is having all of
the characteristics of a normal distribution.

55. How can you calculate the p-value using MS Excel?


Following steps are performed to calculate the p-value easily:

 Find the Data tab above


 Click on Data Analysis
 Select Descriptive Statistics
 Select the corresponding column
 Input the confidence level

56. What is the law of large numbers in statistics?


The law of large numbers in statistics is a theorem that states that as the
number of trials performed increases, the average of the results gets closer and
closer to the expected value.
Example: The probability of flipping a fair coin and landing heads is closer to 0.5
when it is flipped 100,000 times when compared to 100 flips.
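
A quick simulation of the coin example with NumPy (assumed available); the flip counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    flips = rng.integers(0, 2, size=n)   # 0 = tails, 1 = heads
    print(n, flips.mean())               # the running average drifts toward 0.5 as n grows
```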

57. What are some of the properties of a normal distribution?


A normal distribution, regardless of its size, will have a bell-shaped curve that is
symmetric along the axes.
Following are some of the important properties:

 Unimodal: It has only one mode.


 Symmetrical: Left and right halves of the curve are mirrored.
 Central tendency: The mean, median, and mode are at the midpoint.
58. If there is a 30 percent probability that you will see a supercar in any 20-
minute time interval, what is the probability that you see at least one supercar
in the period of an hour (60 minutes)?
The probability of not seeing a supercar in 20 minutes is:
= 1 − P(Seeing one supercar)
= 1 − 0.3
= 0.7
Probability of not seeing any supercar in the period of 60 minutes is:
= (0.7) ^ 3 = 0.343
Hence, the probability of seeing at least one supercar in 60 minutes is:
= 1 − P(Not seeing any supercar)
= 1 − 0.343 = 0.657

59. What is the meaning of sensitivity in statistics?


Sensitivity, as the name suggests, is used to determine the proportion of actual
positive events that a classifier (logistic regression, random forest, etc.) correctly identifies.
The simple formula to calculate sensitivity is:
Sensitivity = True Positives / (True Positives + False Negatives)
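
A tiny sketch of the calculation (the confusion-matrix counts are made up):

```python
tp, fn = 80, 20               # true positives, false negatives
sensitivity = tp / (tp + fn)  # share of actual positive events that were caught
print(sensitivity)            # 0.8
```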

60. What are the types of biases that you can encounter while sampling?
There are three types of biases:

 Selection bias
 Survivorship bias
 Under coverage bias

61. What is the meaning of TF/IDF vectorization?


TF-IDF is an acronym for Term Frequency – Inverse Document Frequency. It is
used as a numerical measure to denote the importance of a word in a document.
This document is usually called the collection or the corpus.
The TF-IDF value is directly proportional to the number of times a word is
repeated in a document. TF-IDF is vital in the field of Natural Language
Processing (NLP) as it is mostly used in the domain of text mining and information
retrieval.
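
A minimal sketch with scikit-learn's TfidfVectorizer (assumed available); the three-document corpus is made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)     # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```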

62. What are some of the low and high-bias Machine Learning algorithms?
There are many low and high-bias Machine Learning algorithms, and the following
are some of the widely used ones:

 Low bias: SVM, decision trees, KNN algorithm, etc.


 High bias: Linear and logistic regression

63. What is the use of Hash tables in statistics?


Hash tables are the data structures that are used to denote the representation of key-
value pairs in a structured way. The hashing function is used by a hash table to
compute an index that contains all of the details regarding the keys that are mapped
to their associated values.

64. What are some of the techniques to reduce underfitting and overfitting
during model training?
Underfitting refers to a situation where data has high bias and low variance, while
overfitting is the situation where there are high variance and low bias.
Following are some of the techniques to reduce underfitting and overfitting:
For reducing underfitting:

 Increase model complexity


 Increase the number of features
 Remove noise from the data
 Increase the number of training epochs

For reducing overfitting:

 Increase training data


 Stop early while training
 Lasso regularization
 Use random dropouts

65. Can you give an example to denote the working of the central limit
theorem?
Let’s consider a population of men whose weights are normally distributed, with
a mean of 60 kg and a standard deviation of 10 kg. We want the probability that one
randomly selected man weighs more than 65 kg, and the probability that the mean
weight of 40 randomly selected men is more than 65 kg.
The solution to this can be shown as below:
For a single man: Z = (x − µ) / σ = (65 − 60) / 10 = 0.5, and P(Z > 0.5) ≈ 0.309
For the mean of 40 men: the standard error is σ / √n = 10 / √40 ≈ 1.58, so
Z = (65 − 60) / 1.58 ≈ 3.16, and P(Z > 3.16) ≈ 0.0008
The mean of 40 men is therefore far less likely to exceed 65 kg than the weight of a
single man, which is exactly what the central limit theorem predicts: sample means
cluster tightly around the population mean.
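
The same calculation can be checked with SciPy (assumed available):

```python
from math import sqrt
from scipy import stats

mu, sigma, n = 60, 10, 40
p_single = 1 - stats.norm.cdf(65, loc=mu, scale=sigma)             # about 0.309
p_mean40 = 1 - stats.norm.cdf(65, loc=mu, scale=sigma / sqrt(n))   # about 0.0008
print(round(p_single, 3), round(p_mean40, 4))
```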

66. How do you stay up-to-date with the new and upcoming concepts in
statistics?
This is a commonly asked question in a statistics interview. Here, the interviewer is
trying to assess your interest and ability to find out and learn new things
efficiently. Do talk about how you plan to learn new concepts and make sure to
elaborate on how you practically implemented them while learning.

67. What is the benefit of using box plots?


Box plots allow us to provide a graphical representation of the 5-number summary
and can also be used to compare groups of histograms.

68. Does a symmetric distribution need to be unimodal?


A symmetric distribution does not need to be unimodal (having only one mode or
one value that occurs most frequently). It can be bi-modal (having two values that
have the highest frequencies) or multi-modal (having multiple or more than two
values that have the highest frequencies).

69. What is the impact of outliers in statistics?


Outliers in statistics have a very negative impact as they skew the result of any
statistical query. For example, if we want to calculate the mean of a dataset that
contains outliers, then the mean calculated will be different from the actual mean
(i.e., the mean we will get once we remove the outliers).

70. When creating a statistical model, how do we detect overfitting?


Overfitting can be detected by cross-validation. In cross-validation, we divide the
available data into multiple parts and iterate on the entire dataset. In each iteration,
one part is used for testing, and others are used for training. This way, the entire
dataset will be used for training and testing purposes, and we can detect if the data
is being overfitted.
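
A minimal sketch with scikit-learn (assumed available); the unpruned decision tree and the toy data are chosen only because they overfit easily:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)
scores = cross_validate(model, X, y, cv=5, return_train_score=True)
# A training score near 1.0 with a much lower validation score is a sign of overfitting.
print(scores["train_score"].mean(), scores["test_score"].mean())
```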

71. What is a survivorship bias?


The survivorship bias is the flaw of the sample selection that occurs when a dataset
only considers the ‘surviving’ or existing observations and fails to consider those
observations that have already ceased to exist.

72. What is an undercoverage bias?


The undercoverage bias is a bias that occurs when some members of the population
are inadequately represented in the sample.

74. What is the relationship between standard deviation and variance?
Standard deviation is the square root of the variance. Basically, standard
deviation looks at how far the data is spread out from the mean. On the other
hand, the variance describes how much the data varies from the
mean of the entire dataset.

Statistics Interview Questions

Statistical computing is the process through which data scientists take raw
data and create predictions and models. Without an advanced knowledge of
statistics it is difficult to succeed as a data scientist; accordingly, a
good interviewer will likely try to probe your understanding of the subject matter
with statistics-oriented data science interview questions. Be prepared to
answer some fundamental statistics questions as part of your data science
interview.

Here are examples of rudimentary statistics questions we’ve found:

1. What is the Central Limit Theorem and why is it important?

“Suppose that we are interested in estimating the average height among all
people. Collecting data for every person in the world is impossible. While we
can’t obtain a height measurement from everyone in the population, we can
still sample some people. The question now becomes, what can we say about
the average height of the entire population given a single sample. The Central
Limit Theorem addresses this question exactly.” 

An Introduction to the Central Limit Theorem

In a world full of data that seldom follows nice theoretical distributions,
the Central Limit Theorem is a beacon of light. Often referred to as the
cornerstone of statistics, it is an important concept to understand when
performing any type of data analysis.

Motivation

Suppose that we are interested in estimating the average height among all
people. Collecting data for every person in the world is impractical, bordering
on impossible. While we can’t obtain a height measurement from everyone in
the population, we can still sample some people. The question now becomes,
what can we say about the average height of the entire population given a
single sample.

The Central Limit Theorem addresses this question exactly. Formally, it
states that if we sample from a population using a sufficiently large sample
size, the mean of the samples (also known as the sample mean) will be
normally distributed (assuming true random sampling). What’s especially
important is that this will be true regardless of the distribution of the original
population.

When I first read this description I did not completely understand what it
meant. However, after visualizing a few examples it became clearer. Let’s
look at an example of the Central Limit Theorem in action.

Example

Suppose we have the following population distribution.


I manually generated the above population by choosing numbers between 0
and 100, and plotted it as a histogram. The height of the histogram denotes
the frequency of the number in the population. As we can see, the distribution
is pretty ugly. It certainly isn’t normal, uniform, or any other commonly
known distribution.

In order to sample from the above distribution, we need to define a sample
size, referred to as N. This is the number of observations that we will sample
at a time. Suppose that we choose N to be 3. This means that we will sample in
groups of 3. So for the above population, we might sample groups such as [5,
20, 41], [60, 17, 82], [8, 13, 61], and so on.
Suppose that we gather 1,000 samples of 3 from the above population. For
each sample, we can compute its average. If we do that, we will have 1,000
averages. This set of 1,000 averages is called a sampling distribution, and
according to the Central Limit Theorem, the sampling distribution will approach
a normal distribution as the sample size N used to produce it increases. Here
is what our sample distribution looks like for N = 3.

As we can see, it certainly looks uni-modal, though not necessarily normal. If
we repeat the same process with a larger sample size, we should see the
sampling distribution start to become more normal. Let’s repeat the same
process again with N = 10. Here is the sampling distribution for that sample
size.
This certainly looks more normal, and if we repeated this process one more
time for N = 30 we observe this result.
The above plots demonstrate that as the sample size N is increased, the
resultant sample mean distribution becomes more normal. Further, the
distribution variance also decreases. Keep in mind that the original
population that we are sampling from was that weird ugly distribution above.
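
A sketch that reproduces the experiment described above with NumPy (assumed available); an exponential population stands in for the "ugly" hand-drawn one, and the sample sizes follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # skewed, decidedly non-normal population

for N in (3, 10, 30):
    # 1,000 samples of size N; each row's mean is one point of the sampling distribution
    sample_means = rng.choice(population, size=(1000, N)).mean(axis=1)
    print(N, round(sample_means.mean(), 2), round(sample_means.std(), 2))
# As N grows the sample means cluster more tightly around the population mean
# and their histogram looks increasingly normal.
```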

Further Intuition

When I first saw an example of the Central Limit Theorem like this, I didn’t
really understand why it worked. The best intuition that I have come across
involves the example of flipping a coin. Suppose that we have a fair coin and
we flip it 100 times. If we observed 48 heads and 52 tails we would probably
not be very surprised. Similarly, if we observed 40 heads and 60 tails, we
would probably still not be very surprised, though it might seem more rare
than the 48/52 scenario. However, if we observed 20 heads and 80 tails we
might start to question the fairness of the coin.

This is essentially what the normal-ness of the sample distribution represents.


For the coin example, we are likely to get about half heads and half tails.
Outcomes farther away from the expected 50/50 result are less likely, and thus
less expected. The normal distribution of the sampling distribution captures
this concept.

The mean of the sampling distribution will approximate the mean of the true
population distribution. Additionally, the variance of the sampling
distribution is a function of both the population variance and the sample size
used. A larger sample size will produce a smaller sampling distribution
variance. This makes intuitive sense, as we are considering more samples
when using a larger sample size, and are more likely to get a representative
sample of the population. So roughly speaking, if the sample size used is large
enough, there is a good chance that it will estimate the population pretty well.
Most sources state that for most applications N = 30 is sufficient.

These principles can help us to reason about samples from any population.
Depending on the scenario and the information available, the way that it is
applied may vary. For example, in some situations we might know the true
population mean and variance, which would allow us to compute the variance
of any sampling distribution. However, in other situations, such as the original
problem we discussed of estimating average human height, we won’t know the
true population mean and variance. Understanding the nuances of sampling
distributions and the Central Limit Theorem is an essential first step toward
tackling many of these problems.
2. What is sampling? How many sampling methods do you know?

“Data sampling is a statistical analysis technique used to select, manipulate
and analyze a representative subset of data points to identify patterns and
trends in the larger data set being examined.”

Data Sampling

Data sampling is a statistical analysis technique used to select, manipulate and
analyze a representative subset of data points to identify patterns and trends
in the larger data set being examined. It enables data scientists, predictive
modelers and other data analysts to work with a small, manageable amount of
data about a statistical population to build and run analytical models more
quickly, while still producing accurate findings.

Advantages and challenges of data sampling

Sampling can be particularly useful with data sets that are too large to
efficiently analyze in full -- for example, in big data analytics applications or
surveys. Identifying and analyzing a representative sample is more efficient
and cost-effective than surveying the entirety of the data or population.

An important consideration, though, is the size of the required data sample
and the possibility of introducing a sampling error. In some cases, a small
sample can reveal the most important information about a data set. In others,
using a larger sample can increase the likelihood of accurately representing
the data as a whole, even though the increased size of the sample may impede
ease of manipulation and interpretation.
Types of data sampling methods

There are many different methods for drawing samples from data; the ideal
one depends on the data set and situation. Sampling can be based
on probability, an approach that uses random numbers that correspond to
points in the data set to ensure that there is no correlation between points
chosen for the sample. Further variations in probability sampling include:

Simple random sampling: Software is used to randomly select subjects from the whole population.

Stratified sampling: Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup.

Cluster sampling: The larger data set is divided into subsets (clusters) based
on a defined factor, then a random sampling of clusters is analyzed.

Multistage sampling: A more complicated form of cluster sampling, this method also
involves dividing the larger population into a number of clusters.
Second-stage clusters are then broken out based on a secondary factor, and
those clusters are then sampled and analyzed. This staging could continue as
multiple subsets are identified, clustered and analyzed.

Systematic sampling: A sample is created by setting an interval at which to
extract data from the larger population -- for example, selecting every 10th
row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.

Sampling can also be based on nonprobability, an approach in which a data
sample is determined and extracted based on the judgment of the analyst. As
inclusion is determined by the analyst, it can be more difficult to extrapolate
whether the sample accurately represents the larger population than when
probability sampling is used.
Data sampling can be accomplished using probability or nonprobability methods.

Nonprobability data sampling methods include:

Convenience sampling: Data is collected from an easily accessible and available group.

Consecutive sampling: Data is collected from every subject that meets the
criteria until the predetermined sample size is met.

Purposive or judgmental sampling: The researcher selects the data to sample based on predefined criteria.

Quota sampling: The researcher ensures equal representation within the
sample for all subgroups in the data set or population.
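
A rough sketch of three of the probability methods on a toy pandas DataFrame (pandas assumed available; the "group" column and sizes are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": range(200),
                   "group": np.repeat(["A", "B", "C", "D"], 50)})

simple_random = df.sample(n=20, random_state=0)                 # simple random sampling
systematic = df.iloc[::10]                                      # systematic: every 10th row
stratified = df.groupby("group").sample(n=5, random_state=0)    # stratified: 5 rows per group

print(len(simple_random), len(systematic), len(stratified))
```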
3. What is the difference between type I and type II errors?

“A type I error occurs when the null hypothesis is true, but is rejected. A type
II error occurs when the null hypothesis is false, but erroneously fails to be
rejected.” 

In statistical test theory, the notion of statistical error is an integral part of
hypothesis testing. The statistical test requires an unambiguous statement of a
null hypothesis (H0), for example, "this person is healthy", "this accused
person is not guilty" or "this product is not broken". The result of the test of
the null hypothesis may be positive (healthy, not guilty, not broken) or may
be negative (not healthy, guilty, broken).

If the result of the test corresponds with reality, then a correct decision has
been made (e.g., person is healthy and is tested as healthy, or the person is not
healthy and is tested as not healthy).  However, if the result of the test does not
correspond with reality, then two types of error are distinguished: type I
error and type II error.

Type I Error (False Positive Error)

A type I error occurs when the null hypothesis is true, but is rejected. Let me
say this again, a type I error occurs when the null hypothesis is actually true, but
was rejected as false by the testing.

A type I error, or false positive, is asserting something as true when it is
actually false. This false positive error is basically a "false alarm" – a result
that indicates a given condition has been fulfilled when it actually has not been
fulfilled (i.e., erroneously a positive result has been assumed).
Let’s use a shepherd and wolf example.  Let’s say that our null hypothesis is
that there is “no wolf present.”  A type I error (or false positive) would be
“crying wolf” when there is no wolf present. That is, the actual condition was
that there was no wolf present; however, the shepherd wrongly indicated
there was a wolf present by calling "Wolf! Wolf!”  This is a type I error or
false positive error.

Type II Error (False Negative)

A type II error occurs when the null hypothesis is false, but erroneously fails
to be rejected. Let me say this again, a type II error occurs when the null
hypothesis is actually false, but was accepted as true by the testing.

A type II error, or false negative, is where a test result indicates that a
condition failed, while it actually was successful. A type II error is
committed when we fail to believe a true condition.

Continuing our shepherd and wolf example.  Again, our null hypothesis is that
there is “no wolf present.”  A type II error (or false negative) would be doing
nothing (not “crying wolf”) when there is actually a wolf present.  That is,
the actual situation was that there was a wolf present; however, the shepherd
wrongly indicated there was no wolf present and continued to play Candy
Crush on his iPhone.  This is a type II error or false negative error.

A tabular relationship between truthfulness/falseness of the null hypothesis
and outcomes of the test can be seen below:

Reject null hypothesis, null hypothesis is true: Type I Error (False Positive)
Reject null hypothesis, null hypothesis is false: Correct Outcome (True Positive)
Fail to reject null hypothesis, null hypothesis is true: Correct Outcome (True Negative)
Fail to reject null hypothesis, null hypothesis is false: Type II Error (False Negative)

Examples

Let’s walk through a few examples and use a simple form to help us to
understand the potential cost ramifications of type I and type II errors.  Let’s
start with our shepherd / wolf example.

Null Hypothesis: Wolf is not present
Type I Error / False Positive: Shepherd thinks a wolf is present (shepherd cries wolf) when no wolf is actually present
Type II Error / False Negative: Shepherd thinks a wolf is NOT present (shepherd does nothing) when a wolf is actually present
Cost Assessment (Type I): Costs (actual costs plus shepherd credibility) associated with scrambling the townsfolk to kill the non-existing wolf
Cost Assessment (Type II): Replacement cost for the sheep eaten by the wolf, and replacement cost for hiring a new shepherd
Note: I added a row called “Cost Assessment.”  Since it can not be universally
stated that a type I or type II error is worse (as it is highly dependent upon the
statement of the null hypothesis), I’ve added this cost assessment to help me
understand which error is more “costly” and for which I might want to do
more testing.

Let’s look at the classic criminal dilemma next.  In colloquial usage, a type I
error can be thought of as "convicting an innocent person" and type II error
"letting a guilty person go free".

Null Hypothesis: Person is not guilty of the crime
Type I Error / False Positive: Person is judged as guilty when the person actually did not commit the crime (convicting an innocent person)
Type II Error / False Negative: Person is judged not guilty when they actually did commit the crime (letting a guilty person go free)
Cost Assessment (Type I): Social costs of sending an innocent person to prison and denying them their personal freedoms (which in our society is considered an almost unbearable cost)
Cost Assessment (Type II): Risks of letting a guilty criminal roam the streets and committing future crimes

Let’s look at some business-related examples. In these examples I have
reworded the null hypothesis, so be careful on the cost assessment.

Null Hypothesis: Medicine A cures Disease B
Type I Error / False Positive (H0 true, but rejected as false): Medicine A cures Disease B, but is rejected as false
Type II Error / False Negative (H0 false, but accepted as true): Medicine A does not cure Disease B, but is accepted as true
Cost Assessment (Type I): Lost opportunity cost for rejecting an effective drug that could cure Disease B
Cost Assessment (Type II): Unexpected side effects (maybe even death) for using a drug that is not effective

Let’s try one more.

Null Hypothesis: Display Ad A is effective in driving conversions
Type I Error / False Positive (H0 true, but rejected as false): Display Ad A is effective in driving conversions, but is rejected as false
Type II Error / False Negative (H0 false, but accepted as true): Display Ad A is not effective in driving conversions, but is accepted as true
Cost Assessment (Type I): Lost opportunity cost for rejecting an effective Display Ad A
Cost Assessment (Type II): Lost sales for promoting an ineffective Display Ad A to your target visitors

The cost ramifications in the Medicine example are quite substantial, so
additional testing would likely be justified in order to minimize the impact of
the type II error (using an ineffective drug) in our example.  However, the cost
ramifications in the Display Ad example are quite small, for both the type I
and type II errors, so additional investment in addressing the type I and type
II errors is probably not worthwhile.

Summary

Type I and type II errors are highly dependent upon the language or positioning
of the null hypothesis. Changing the positioning of the null hypothesis can
cause type I and type II errors to switch roles.

It’s hard to create a blanket statement that a type I error is worse than a type
II error, or vice versa.  The severity of the type I and type II errors can only
be judged in context of the null hypothesis, which should be thoughtfully
worded to ensure that we’re running the right test. 
I highly recommend adding the “Cost Assessment” analysis like we did in the
examples above.  This will help identify which type of error is more “costly”
and identify areas where additional testing might be justified.

4. What is linear regression? What do the terms p-value, coefficient, and r-


squared value mean? What is the significance of each of these components?

A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship.
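As a rough sketch of how these three quantities appear in practice, the statsmodels library can fit a simple regression; the size/price numbers below are made up purely for illustration:

import numpy as np
import statsmodels.api as sm

# hypothetical data: house size (sq ft) and price (in $1000s)
size = np.array([1000, 1200, 1500, 1800, 2000, 2300, 2600, 3000])
price = np.array([200, 230, 280, 320, 360, 400, 450, 510])

X = sm.add_constant(size)       # adds the intercept term
model = sm.OLS(price, X).fit()  # ordinary least squares fit

print(model.params)             # coefficients: intercept and slope of the best-fit line
print(model.pvalues)            # p-values: is each coefficient significantly different from zero?
print(model.rsquared)           # r-squared: share of the variance in price explained by size

A small p-value for the slope suggests the size/price relationship is unlikely to be due to chance alone, and an r-squared close to 1 means the fitted line explains most of the variation.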

5. What are the assumptions required for linear regression?

There are four major assumptions:

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.

2. The errors or residuals of the data are normally distributed and independent from each other.

3. There is minimal multicollinearity between explanatory variables.

4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.

6. What is a statistical interaction?

“Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.”
7. What is selection bias?

“Selection (or ‘sampling’) bias occurs in an ‘active’ sense when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data are systematically (i.e., non-randomly) excluded from analysis.”

8. What is an example of a data set with a non-Gaussian distribution?

“The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate.”

9. What is the Binomial Probability Formula?

“The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of π (the Greek letter pi) of occurring.”
Statistics Interview Questions

1. What is the Central Limit Theorem?

The Central Limit Theorem is the cornerstone of statistics. It states that the distribution of the sample mean, computed from samples of a large enough size, will be approximately normal. In other words, the shape of the original population distribution does not matter.

Central Limit Theorem is widely used in the calculation of confidence intervals


and hypothesis testing. Here is an example – We want to calculate the average
height of people in the world, and we take some samples from the general
population, which serves as the data set. Since it is hard or impossible to obtain
data regarding the height of every person in the world, we will simply calculate the
mean of our sample. 
By repeating this sampling several times, we will obtain many sample means and their frequencies, which we can plot on a graph to create a distribution. It will form a bell-shaped, approximately normal curve centred near the mean of the original data set.
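A small NumPy simulation (illustrative only) of this idea: even when the individual values are drawn from a skewed, non-normal population, the means of repeated samples pile up into an approximately normal, bell-shaped curve.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a skewed, non-normal population

# draw many samples and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

print(np.mean(sample_means))   # close to the population mean
print(np.std(sample_means))    # close to population std / sqrt(50), as the CLT predicts

Plotting a histogram of sample_means would show the bell-shaped curve described above.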

2. What is the assumption of normality?

The assumption of normality dictates that the mean distribution across samples is
normal. This is true across independent samples as well.  

3. Describe Hypothesis Testing. How is the statistical significance of an insight


assessed?

Hypothesis Testing in statistics is used to see if a certain experiment yields


meaningful results. It essentially helps to assess the statistical significance of
insight by determining the odds of the results occurring by chance. The first thing
is to know the null hypothesis and then state it. Then the p-value is calculated under the assumption that the null hypothesis is true. The alpha value denotes the significance level and is chosen accordingly.

If the p-value is less than alpha, the null hypothesis is rejected, but if it is greater than alpha, we fail to reject the null hypothesis. The rejection of the null hypothesis indicates that the results obtained are statistically significant.

4. What are observational and experimental data in statistics?

Observational data is derived from the observation of certain variables from


observational studies. The variables are observed to determine any correlation
between them.

Experimental data is derived from those experimental studies where certain


variables are kept constant to determine any discrepancy or causality. 

5. What is an outlier? 
Outliers can be defined as the data points within a data set that varies largely in
comparison to other observations. Depending on its cause, an outlier can decrease
the accuracy as well as efficiency of a model. Therefore, it is crucial to remove
them from the data set. 

6. How to screen for outliers in a data set?

There are many ways to screen and identify potential outliers in a data set. Two
key methods are described below –

Standard deviation/z-score – The z-score (standard score) measures how many standard deviations a data point lies from the mean. A common rule is to compute the range mean ± 3 standard deviations and flag the data points outside that range. If the z-score is positive, the data point is above average.

If the z-score is negative, the data point is below average.

If the z-score is close to zero, the data point is close to average.

If the z-score is above 3 or below −3, the data point is considered unusual and treated as an outlier.

The formula for calculating a z-score is –

z = (data point − mean) / standard deviation, or z = (x − μ) / σ

Interquartile range (IQR) – IQR, also called midspread, is a method to identify outliers and can be described as the range spanned by the middle 50% of a data set. It is simply the difference between the third and first quartiles of the observations; points lying more than 1.5 × IQR below Q1 or above Q3 are commonly flagged as outliers.

IQR=Q3 – Q1
Other methods to screen outliers include Isolation Forests, Robust Random Cut
Forests, and DBScan clustering.
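A short sketch of both screening rules with NumPy; the data points are made up, with one value (90) injected as an obvious outlier:

import numpy as np

data = np.array([10, 11, 12, 12, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 15,
                 16, 16, 16, 16, 17, 17, 17, 18, 18, 19, 20, 90])

# z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])                                       # flags 90

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])   # flags 90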

7. What is the meaning of an inlier?

An inlier is a data point within a data set that lies at the same level as the others. It is usually an error and is removed to improve the model accuracy. Unlike outliers, inliers are hard to find and often require external data for accurate identification.

8. What is the meaning of six sigma in statistics?

Six sigma in statistics is a quality control method to produce an error or defect-free


data set. Standard deviation is known as Sigma or σ. The larger the standard deviation, the less accurately the process performs and the more likely it is to cause a defect. If a process outcome is 99.99966% error-free, it is considered six sigma. A six sigma model works better than 1σ, 2σ, 3σ, 4σ, 5σ processes and is reliable enough to produce defect-free work.

9. What is the meaning of KPI in statistics?

KPI is an acronym for a key performance indicator. It can be defined as a


quantifiable measure to understand whether the goal is being achieved or not. KPI
is a reliable metric to measure the performance level of an organization or
individual with respect to the objectives. An example of KPI in an organization is
the expense ratio.

10. What is the Pareto principle?

Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or results in an experiment are obtained from 20% of the causes. A simple example is – 80% of sales come from 20% of customers.
11. What is the Law of Large Numbers in statistics?

According to the law of large numbers, as the number of trials in an experiment increases, the average of the results comes progressively closer to the expected value. As an example, consider rolling a six-sided die three times: the average of those three outcomes may be far from the expected value. But if we roll the die a large number of times, we will obtain an average result closer to the expected value (which is 3.5 in this case).

12. What are some of the properties of a normal distribution?

Also known as Gaussian distribution, Normal distribution refers to the data which
is symmetric to the mean, and data far from the mean is less frequent in
occurrence. It appears as a bell-shaped curve in graphical form, which is
symmetrical along the axes.

The properties of a normal distribution are –

Symmetrical – the curve is symmetric about the mean; its exact shape is determined by the parameter values (mean and standard deviation).

Unimodal – Has only one mode.

Mean – the measure of central tendency

Central tendency – the mean, median, and mode lie at the centre, which means that
they are all equal, and the curve is perfectly symmetrical at the midpoint. 

13. How would you describe a ‘p-value’?

P-value in statistics is calculated during hypothesis testing, and it is a number that indicates the likelihood of the data occurring by random chance. If a p-value is 0.05 and is less than alpha, we can conclude that there is a probability of 5% that the experiment results occurred by chance, or you can say, 5% of the time we could observe these results by chance.
14. How can you calculate the p-value using MS Excel?

The formula used in MS Excel to calculate p-value is –

 =tdist(x,deg_freedom,tails)

The p-value is expressed in decimals in Excel. Here are the steps to calculate it –

Go to the Data tab

In the Analysis group, click on the Data Analysis icon

Select Descriptive Statistics and then click OK

Select the relevant column

Input the confidence level and other variables 

15. What are the types of biases that you can encounter while sampling?

Sampling bias occurs when you lack the fair representation of data samples during
an investigation or a survey. The six main types of biases that one can encounter
while sampling are –

Undercoverage bias

Observer Bias

Survivorship bias

Self-Selection/Voluntary Response Bias

Recall Bias
Exclusion Bias

16. What is cherry-picking, P-hacking, and significance chasing?

Cherry-picking can be defined as the practice in statistics where only that


information is selected which supports a certain claim and ignores any other claim
that refutes the desired conclusion.

P-hacking refers to a technique in which data collection or analysis is manipulated until significant patterns can be found that have no underlying effect whatsoever.

Significance chasing is also known by the names of Data Dredging, Data Fishing,
or Data Snooping. It refers to the reporting of insignificant results as if they are
almost significant. 

17. What is the difference between type I vs type II errors?

A type 1 error occurs when the null hypothesis is rejected even if it is true. It is
also known as false positive.

A type 2 error occurs when the null hypothesis fails to get rejected, even if it is
false. It is also known as a false negative.

18. What is a statistical interaction?

A statistical interaction refers to the phenomenon which occurs when the influence of one input variable on the output variable depends on the level of another input variable. A real-life example is the interaction of adding sugar to tea and stirring it: neither variable on its own fully determines the sweetness experienced, but it is the combination of the two variables that does.
19. Give an example of a data set with a non-Gaussian distribution?

A non-Gaussian distribution is a common occurrence in many processes in


statistics. This happens when the data naturally follows a non-normal distribution with data clumped on one side or the other of a graph. For example, the growth of bacteria naturally follows a non-Gaussian distribution such as the exponential or Weibull distribution.

20. What is the Binomial Distribution Formula?

The binomial distribution formula is:

b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

Where:

b = binomial probability

x = total number of “successes” (pass or fail, heads or tails, etc.)

P = probability of success on an individual trial

n = number of trials
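As a quick sanity check of the formula, the same expression can be evaluated directly and compared with scipy.stats.binom; the numbers (3 successes in 5 trials with P = 0.5) are purely illustrative:

from math import comb
from scipy.stats import binom

n, P, x = 5, 0.5, 3

# direct use of the formula: nCx * P^x * (1 - P)^(n - x)
manual = comb(n, x) * P**x * (1 - P)**(n - x)

print(manual)                 # 0.3125
print(binom.pmf(x, n, P))     # same value from scipy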

21. What are the criteria that Binomial distributions must meet?

Here are the three main criteria that Binomial distributions must meet –

The number of observation trials must be fixed. It means that one can only find the
probability of something when done only a certain number of times.

Each trial needs to be independent. It means that none of the trials should impact
the probability of other trials.

The probability of success remains the same across all trials. 


22. What is linear regression? 

In statistics, linear regression is an approach that models the relationship between


one or more explanatory variables and one outcome variable. For example, linear regression can be used to quantify or model the relationship between predictor variables such as age, gender, genetics, and diet and an outcome variable such as height.

23. What are the assumptions required for linear regression?

Four major assumptions for linear regression are as under –

There’s a linear relationship between the predictor (independent) variables and the
outcome (dependent) variable. It means that the relationship between X and the
mean of Y is linear.

The errors are normally distributed with no correlation between them; correlation between the errors is known as autocorrelation, and the assumption is that it is absent.

There is an absence of correlation between predictor variables; correlation between predictors is called multicollinearity, and the assumption is that it is minimal.

The variation in the outcome or response variable is the same for all values of
independent or predictor variables. This phenomenon of assumption of equal
variance is known as homoscedasticity. 

24. What are some of the low and high-bias Machine Learning algorithms?

Some of the widely used low and high-bias Machine Learning algorithms are –

Low bias -Decision trees, Support Vector Machines, k-Nearest Neighbors, etc.

High bias -Linear Regression, Logistic Regression, Linear Discriminant Analysis,


etc. 
25. When should you use a t-test vs a z-test?

The z-test is used for hypothesis testing in statistics with a normal distribution. It is used when the population variance is known or the sample is large.

The t-test is used with a t-distribution and is applied when the population variance is unknown and you have a small sample size.

In case the sample size is large or n>30, a z-test is used. T-tests are helpful when
the sample size is small or n<30.

26. What is the equation for confidence intervals for means vs for
proportions?

To calculate the confidence interval for a mean, we use the following equations –

For n > 30

Use the Z table for the standard normal distribution: CI = x̄ ± z* (s / √n)

For n < 30

Use the t table with df = n − 1: CI = x̄ ± t* (s / √n)

Confidence Interval for the Population Proportion –

CI = p̂ ± z* √( p̂(1 − p̂) / n )

27. What is the empirical rule?

In statistics, the empirical rule states that every piece of data in a normal
distribution lies within three standard deviations of the mean. It is also known as
the 68–95–99.7 rule. According to the empirical rule, the percentage of values that
lie in a normal distribution follow the 68%, 95%, and 99.7% rule. In other words,
68% of values will fall within one standard deviation of the mean, 95% will fall
within two standard deviations, and 99.7% will fall within three standard deviations
of the mean.

28. How are confidence tests and hypothesis tests similar? How are they
different?

Confidence tests and hypothesis tests both form the foundation of statistics. 

The confidence interval holds importance in research to offer a strong base for
research estimations, especially in medical research. The confidence interval
provides a range of values that helps in capturing the unknown parameter. 

We can calculate a confidence interval using this formula –

CI = point estimate ± (critical value × standard error), for example x̄ ± z* (σ / √n) for a population mean.

Hypothesis testing is used to test an experiment or observation and determine whether the results occurred purely by chance or luck, using a test statistic of the general form (sample estimate − hypothesized value of the parameter ‘p’) / standard error.

Confidence intervals and hypothesis testing are inferential techniques used to either estimate a parameter or test the validity of a hypothesis using a sample of data from that data set. While the confidence interval provides a range of values for an accurate estimation of the precision of that parameter, hypothesis testing tells us how confident we can be in accurately drawing conclusions about a parameter from a sample. Both can be used to infer population parameters in tandem.
In case we include 0 in the confidence interval, it indicates that the sample and population have no difference. If we get a p-value that is higher than alpha from hypothesis testing, it means that we will fail to reject the null hypothesis.

29. What general conditions must be satisfied for the central limit theorem to
hold?

Here are the conditions that must be satisfied for the central limit theorem to hold –

The data must follow the randomization condition which means that it must be
sampled randomly.

The Independence Assumptions dictate that the sample values must be independent
of each other.

Sample sizes must be large. They must be equal to or greater than 30 for the CLT to hold; a large sample size is required for the CLT's normal approximation to be accurate.

30. What is Random Sampling? Give some examples of some random


sampling techniques.

Random sampling is a sampling method in which each sample has an equal


probability of being chosen as a sample. It is also known as probability sampling.

Let us check four main types of random sampling techniques –

Simple Random Sampling technique – In this technique, a sample is chosen


randomly using randomly generated numbers. A sampling frame with the list of
members of a population is required, which is denoted by ‘n’. Using Excel, one can
randomly generate a number for each element that is required.
Systematic Random Sampling technique - This technique is very common and easy to use in statistics. In this technique, every kth element is sampled: one element is selected, and the next is taken after skipping a pre-defined number of elements.

In a sampling frame, divide the size of the frame N by the sample size n to get k, the index number. Then pick every kth element to create your sample.

Cluster Random Sampling technique -In this technique, the population is divided
into clusters or groups in such a way that each cluster represents the population.
After that, you can randomly select clusters to sample.  

Stratified Random Sampling technique – In this technique, the population is


divided into groups that have similar characteristics. Then a random sample can be
taken from each group to ensure that different segments are represented equally
within a population. 
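A brief pandas sketch of three of these techniques; the DataFrame, column names, and sizes here are made up for illustration:

import pandas as pd

# hypothetical population with a 'segment' column used as the strata
population = pd.DataFrame({
    "customer_id": range(1, 1001),
    "segment": ["A", "B", "C", "D"] * 250,
})

# simple random sampling: every row has an equal chance of selection
simple_sample = population.sample(n=100, random_state=42)

# systematic sampling: every kth row, where k = N / n
k = len(population) // 100
systematic_sample = population.iloc[::k]

# stratified sampling: 10% from each segment so every group is represented
stratified_sample = population.groupby("segment", group_keys=False).sample(frac=0.10, random_state=42)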

31. What is the difference between population and sample in inferential


statistics?

A population in inferential statistics refers to the entire group we take samples


from and are used to draw conclusions. A sample, on the other hand, is a specific
group we take data from and this data is used to calculate the statistics. Sample size
is always less than that of the population. 

32. What are descriptive statistics?

Descriptive statistics are used to summarize the basic characteristics of a data set in
a study or experiment. It has three main types – 

Distribution – refers to the frequencies of responses.

Central Tendency – gives a measure or the average of each response.


Variability – shows the dispersion of a data set.

33. What are quantitative data and qualitative data?

Qualitative data is used to describe the characteristics of data and is also known as Categorical data, for example, what type or category something belongs to. Quantitative data is a measure of numerical values or counts, for example, how much or how often. It is also known as Numeric data.

34. How to calculate range and interquartile range?

The range is the difference between the highest and the lowest values whereas the
Interquartile range is the difference between upper and lower medians.  

Range (X) = Max(X) – Min(X)

IQR = Q3 – Q1

Here, Q3 is the third quartile (75 percentile) 

Here, Q1 is the first quartile (25 percentile)

35. What is the meaning of standard deviation?

Standard deviation gives the measure of the variation or dispersion of values in a data set. It represents the difference of each observation or data point from the mean.

σ = √( ∑(x − µ)² / n )

The variance is the square of the standard deviation.
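A quick check of the formula with NumPy (the values are illustrative):

import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.std(x))   # population standard deviation sigma -> 2.0
print(np.var(x))   # variance = sigma squared -> 4.0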


36. What is the relationship between mean and median in normal
distribution?

In a normal distribution, the mean and the median are equal. 

37. What is the left-skewed distribution and the right-skewed distribution?

In the left-skewed distribution, the left tail is longer than the right side.  

Mean < median < mode

In the right-skewed distribution, the right tail is longer. It is also known as


positive-skew distribution.

Mode < median < mean

38. How to convert normal distribution to standard normal distribution?

Any point (x) from the normal distribution can be converted into standard normal
distribution (Z) using this formula –

Z(standardized) = (x-µ) / σ

Here, Z for any particular x value indicates how many standard deviations x is
away from the mean of all values of x.

39. What can you do with an outlier?

Outliers affect A/B testing and they can be either removed or kept according to
what situation demands or the data set requirements. 

Here are some ways to deal with outliers in data –

Filter out outliers especially when we have loads of data.


If a data point is wrong, it is best to remove the outliers.

Alternatively, two options can be provided – one with outliers and one without.

During post-test analysis, outliers can be removed or modified. The best way to
modify them is to trim the data set.

If there are a lot of outliers and results are critical, then it is best to change the
value of the outliers to other variables. They can be changed to a value that is
representative of the data set.

When outliers have meaning, they can be considered, especially in the case of mild
outliers. 

40. How to detect outliers?

The best way to detect outliers is through graphical means. Apart from that,
outliers can also be detected through the use of statistical methods using tools such
as Excel, Python, SAS, among others. The most popular graphical ways to detect
outliers include box plot and scatter plot. 

41. Why do we need sample statistics?

Sampling in statistics is done when population parameters are not known,


especially when the population size is too large.

42. What is the relationship between standard error and margin of error?

Margin of error = Critical value X Standard deviation for the population 

and

Margin of error = Critical value X Standard error of the sample.


The margin of error will increase with the standard error. 

43. What is the proportion of confidence intervals that will not contain the
population parameter?

Alpha is the probability in a confidence interval that will not contain the population
parameter. 

α = 1 – CL

Alpha is usually expressed as a proportion. For instance, if the confidence level is


95%, then alpha would be equal to 1-0.95 or 0.05. 

44. What is skewness?

Skewness provides the measure of the symmetry of a distribution. If a distribution


is not normal or asymmetrical, it is skewed. A distribution can exhibit positive
skewness or negative skewness if the tail on the right is longer and the tail on the
left side is longer, respectively. 

45. What is the meaning of covariance?

In statistics, covariance is a measure of the association between two random variables, i.e. how they vary together relative to their respective means.

46. What is a confounding variable?

A confounding variable in statistics is an ‘extra’ or ‘third’ variable that is


associated with both the dependent variable and the independent variable, and it
can give a wrong estimate that provides useless results. 
For example, if we are studying the effect of weight gain, then lack of workout will
be the independent variable, and weight gain will be the dependent variable. In this
case, the amount of food consumption can be the confounding variable as it will
mask or distort the effect of other variables in the study. The effect of weather can
be another confounding variable that may alter the experiment design.

47. What does it mean if a model is heteroscedastic?

A model is said to be heteroscedastic when the variation in errors comes out to be


inconsistent. It often occurs in two forms – conditional and unconditional.

48. What is selection bias and why is it important?

Selection bias is a term in statistics used to denote the situation when selected individuals or a group within a study differ from the population of interest in a manner that introduces a systematic error in the outcome.

Typically selection bias can be identified using bivariate tests apart from using
other methods of multiple regression such as logistic regression.

It is crucial to understand and identify selection bias to avoid skewing results in a


study. Selection bias can lead to false insights about a particular population group
in a study.

Different types of selection bias include –

Sampling bias – It is often caused by non-random sampling. The best way to


overcome this is by drawing from a sample that is not self-selecting.

Participant attrition – The dropout rate of participants from a study constitutes


participant attrition. It can be avoided by following up with the participants who
dropped off to determine if the attrition is due to the presence of a common factor
between participants or something else.
Exposure – It occurs due to the incorrect assessment or the lack of internal validity
between exposure and effect in a population.

Data – It includes dredging of data and cherry-picking and occurs when a large
number of variables are present in the data causing even bogus results to appear
significant. 

Time-interval – It is a sampling error that occurs when observations are selected


from a certain time period only. For example, analyzing sales during the Christmas
season.

Observer selection- It is a kind of discrepancy or detection bias that occurs during


the observation of a process and dictates that for the data to be observable, it must
be compatible with the life that observes it.

49. What does autocorrelation mean?

Autocorrelation is a representation of the degree of correlation between the two


variables in a given time series. It means that the data is correlated in a way that
future outcomes are linked to past outcomes. Autocorrelation makes a model less
accurate because even errors follow a sequential pattern. 

50. What does Design of Experiments mean?

The Design of Experiments or DOE is a systematic method that explains the


relationship between the factors affecting a process and its output. It is used to
infer and predict an outcome by changing the input variables. 

51. What is Bessel’s correction?

Bessel’s correction advocates the use of n-1 instead of n in the formula of standard
deviation. It helps to increase the accuracy of results while analyzing a sample of
data to derive more general conclusions.
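In NumPy this corresponds to the ddof (delta degrees of freedom) argument; a small illustrative check:

import numpy as np

sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.std(sample, ddof=0))   # divides by n: population formula
print(np.std(sample, ddof=1))   # divides by n - 1: Bessel's correction for a sample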
52. What types of variables are used for Pearson’s correlation coefficient?

Variables (both the dependent and independent variables) used for Pearson’s
correlation coefficient must be quantitative. It will only test for the linear
relationship between two variables.

53. What is the use of Hash tables in statistics?

In statistics, hash tables are used to store key values or pairs in a structured way. It
uses a hash function to compute an index into an array of slots in which the desired
elements can be searched. 

54. Does symmetric distribution need to be unimodal?

A symmetrical distribution does not necessarily need to be unimodal; it can be bimodal with two peaks or multimodal with multiple peaks.

55. What is the benefit of using box plots?

Boxplot is a visually effective representation of two or more data sets and


facilitates quick comparison between a group of histograms.

56. What is the meaning of TF/IDF vectorization?

TF/IDF is an acronym for Term Frequency – Inverse Document Frequency and is a


numerical measure widely used in statistics and text summarization. It reflects the importance of a word or term in a document relative to a collection of documents, which is called the corpus.
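A minimal sketch using scikit-learn's TfidfVectorizer on a toy corpus (the documents are made up for illustration; get_feature_names_out requires scikit-learn 1.0 or newer):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the vocabulary of terms
print(tfidf.toarray())                      # TF-IDF weight of each term in each document

Common words such as "the" receive low weights because they appear in many documents, while rarer, more distinctive words receive higher weights.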

57. What is the meaning of sensitivity in statistics?

Sensitivity refers to the accuracy of a classifier on the actual positive events in a test. It can be calculated using the formula –

Sensitivity = True Positives / (True Positives + False Negatives), i.e. the proportion of actual events that the classifier correctly predicts as events.

58. What is the difference between the first quartile, the second quartile, and
the third quartile?

The first quartile is denoted by Q1 and it is the median of the lower half of the data
set.

The second quartile is denoted by Q2 and is the median of the data set.

The third quartile is denoted by Q3 and is the median of the upper half of the data
set.

About 25% of the data set lies above Q3, 75% lies below Q3 and 50% lies below
Q2. The Q1, Q2, and Q3 are the 25th, 50th, and 75th percentile respectively.

59. What is kurtosis?

Kurtosis is a measure of the degree of the extreme values present in one tail of
distribution or the peaks of frequency distribution as compared to the others. The
standard normal distribution has a kurtosis of 3 whereas the values of symmetry
and kurtosis between -2 and +2 are considered normal and acceptable. The data
sets with a high level of kurtosis imply that there is a presence of outliers. One
needs to add data or remove outliers to overcome this problem. Data sets with low
kurtosis levels have light tails and lack outliers.

60. What is a bell-curve distribution?

A bell-curve distribution is represented by the shape of a bell and indicates normal


distribution. It occurs naturally in many situations especially while analyzing
financial data. The top of the curve shows the mode, mean and median of the data
and is perfectly symmetrical. The key characteristics of a bell-shaped curve are –
The empirical rule says that approximately 68% of data lies within one standard
deviation of the mean in either of the directions.

Around 95% of data falls within two standard deviations and

Around 99.7% of data fall within three standard deviations in either direction. 

Statistics Interview Questions and Answers For Fresher

Q1). What are the various branches of statistics?

Ans:- Statistics have two main branches, namely:


Descriptive Statistics: This usually summarizes the data from the sample by
making use of an index like mean or standard deviation. The methods which are
used in the descriptive statistics are displaying, organizing, and describing the
data.

Inferential Statistics: These conclude from data that are subject to random


variations like observation mistakes and other sample variations.

Q2). Enumerate various fields where statistics can be used?

Ans:- Statistics is used in many different kinds of research fields. The list of fields in which statistics is used includes:

Science

Technology

Biology

Computer Science

Chemistry

Business

It is also used in the following areas:

Providing comparison

Explaining action which has already occurred

Predicting the future result

Estimation of quantities that are not known



Q3). What is the difference between Data Science and Statistics?

Ans:- Data Science is a science that is led by data. It includes the interdisciplinary


fields of scientific methods, algorithms, and even the process for extracting
insights from the data. The data can be either structured or unstructured. There are
many similarities between data science and data mining as both abstract useful information from the data. Now, data science also includes mathematical statistics and computer science and its applications. It is by the combination of statistics, visualization, applied mathematics, and computer science that data science can convert a vast amount of data into insights and knowledge. Thus, statistics forms a main part of data science; it is a branch of mathematics concerned with the collection, analysis, interpretation, organization, and presentation of data.

Q4). What is the meaning of correlation and covariance in statistics?

Ans:- Both correlation and covariance are basically two concepts of mathematics


that are widely used in statistics. They not only help in establishing the relations
between two random variables but also help in measuring the dependency between
the two. Although the work between these two mathematical terms is similar, they
are quite different from each other.
Correlation: It is considered as the best technique for measurement and also for
estimation of the quantitative relationship between the two variables. Correlation
measures how efficiently two variables are related.

Covariance: In this, two terms vary together; it is a measure that shows the extent to which two random variables change together. It forms a statistical relationship between a pair of random variables, where any change in one variable is reciprocated by a corresponding change in the other variable.

Q5). What is Bayesian?

Ans:- Bayesian rests on the data which is observed in reality and further considers
the probability distribution on the hypothesis.

Q6). What is Frequentist?

Ans:- Frequentists rest on the hypothesis of choice and further consider the


probability distribution on the data, whether it is observed or not.

Q7). What is the Likelihood?

Ans:- The probability of the observed outcomes under specific parameter values is regarded as the likelihood of that set of parameter values, given the observed outcomes.

Q8). What is P-value?

Ans:- In terms of statistical significance testing, the p-value represents the probability of obtaining a test value at least as extreme as the one which had been observed originally, under the condition that the null hypothesis is true.

Q9). Explain P-value with the help of an example?


Ans:- Let us suppose the experimental result shows the coin turning up heads 14 times in 20 flips in total. Here is what is derived:

Null hypothesis (H0): a fair coin

Observation O: 14 heads out of 20 flips

P-value of observation O given H0 = Prob(≥ 14 heads or ≥ 14 tails) = 0.115

We can see above that the p-value overshoots the value of 0.05, so the observation is in line with the null hypothesis - that means the observed result of 14 heads in 20 flips can be attributed to chance alone, as it comes within the range of what would happen 95% of the time if the coin were fair. In this example, we fail to reject the null hypothesis at the 5% level: even if the coin did not fall perfectly evenly, the deviation from the expected outcome is too small to be reported as statistically significant at the 5% level.
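The 0.115 figure can be reproduced with scipy by doubling the one-tailed probability P(X ≥ 14) for a fair coin (a two-sided test, by symmetry):

from scipy.stats import binom

n, k, p = 20, 14, 0.5

p_value = 2 * binom.sf(k - 1, n, p)   # sf(13) = P(X >= 14)
print(round(p_value, 3))              # about 0.115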

Q10). What do you mean by sampling?

Ans:- Sampling is considered as part of the statistical practice which is concerned


with the selection of an unbiased or random subset of single observations in a
population of individuals which are directed to yield some knowledge about the
population of concern.

Q11). What are the various methods of sampling?

Ans:- Sampling can be done in 4 broad methods:

Randomly or in a simple yet random method

Systematically or taking every kth member of the population

Cluster when the population is considered in groups or clusters


Stratified, i.e. dividing the population into exclusive groups or strata and taking a sample from each group.

Q12). What do you mean by Mode?

Ans:- The mode is defined as that element of the data sample, which appears most
often in the collection.

X = [1 5 5 6 3 2]

mode(X) % returns 5, the value that appears most often

Q13). What do you mean by Median?

Ans:- Median is often described as that numerical value that separates the higher
half of the sample, which can be either a group or a population or even a
probability distribution from the lower half. The median can usually be found by a
limited list of numbers when all the observations are arranged from the lowest to
the highest value and picking the middle one.

Q14). What do you mean by skewness?

Ans:- Skewness describes the asymmetry of data around its mean. If skewness is negative, the data is spread out more to the left of the mean than to the right. If skewness is positive, the data is spread out more to the right of the mean.


Q15). What is the meaning of Covariance?

Ans:-Covariance is a measure of how two variables move in sync with each other.

y2 = [1 3 4 5 6 7 8]

cov(x, y2) % returns a 2x2 matrix; the diagonal elements are the variances

Q16). What is One Sample test?

Ans:-T-test refers to any statistical hypothesis test in which the statistic of the test
follows a Student’s t distribution if the null hypothesis is supported.

[h, p, ci] = ttest(y2, 0) % returns h = 1, p = 0.0018, ci = [2.6280, 7.0863]

Q17). What do you mean by Alternative Hypothesis?

Ans:-The Alternative-hypothesis, which is represented by H1 is the statement that 


holds true if the null hypothesis is false.

Q18). What do you mean by Significance Level?

Ans:- The probability of rejecting the null hypothesis when it is actually true is known as the significance level α, and very common choices are α = 0.05 and α = 0.01.

Q19). Give Examples of Central Limit Theorem?


Ans:-Let us suppose that the population of the men has normally distributed
weights, with a mean of 173lb and a standard deviation of 30 lb and one has to find
the probability

If one man is randomly selected, the weight is greater than 180 lb

If 36 different men are randomly selected, the mean weight is more than 180 lb.

The solution will be:

(a) z = (x − µ)/σ = (180 − 173)/30 = 0.23

For the normal distribution, P(Z > 0.23) = 0.4090

(b) σx̄ = σ/√n = 30/√36 = 5

z = (180 − 173)/5 = 1.40

P(Z > 1.4) = 0.0808
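The same two probabilities can be checked with scipy.stats.norm:

from scipy.stats import norm

mu, sigma = 173, 30

# (a) one randomly selected man weighs more than 180 lb
print(1 - norm.cdf(180, loc=mu, scale=sigma))             # about 0.41

# (b) the mean weight of 36 men exceeds 180 lb; standard error = sigma / sqrt(n)
print(1 - norm.cdf(180, loc=mu, scale=sigma / 36**0.5))   # about 0.08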

Q20). What is Binary Search?

Ans:- In a binary search, the array has to be arranged either in ascending or descending order. In every step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, a matching element is discovered and its index (position) is returned. Otherwise, if the search key is less than the key of the middle element, the algorithm repeats the search on the sub-array to the left of the middle element, and if the search key is greater, it repeats the search on the sub-array to the right.
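A compact Python sketch of the algorithm described above (the input list must already be sorted in ascending order):

def binary_search(arr, key):
    # return the index of key in the sorted list arr, or -1 if it is absent
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == key:       # keys match: element found
            return mid
        elif key < arr[mid]:      # search the left sub-array
            high = mid - 1
        else:                     # search the right sub-array
            low = mid + 1
    return -1

print(binary_search([2, 5, 8, 12, 16, 23, 38], 16))   # prints 4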

Q21). Can you throw more light on the Hash Table?


Ans:-A hash table refers to a data structure that is used for implementation in an
associative way in a structure that can map keys to values. A hash table makes use
of a hash function for computing an index into an array of buckets or slots from
which the correct value can be obtained.
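In Python, the built-in dict type is a hash table; a toy illustration (the keys and values are made up):

phone_book = {"alice": "555-0101", "bob": "555-0102"}   # keys are hashed into bucket slots

phone_book["carol"] = "555-0103"   # insert: the hash of "carol" picks the bucket
print(phone_book["bob"])           # lookup by key is O(1) on average
print(hash("bob"))                 # the underlying hash value used for indexing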

Q22). Explain the difference between ‘long’ and ‘wide’ format data?

Ans:- In the wide format, the repeated responses of a subject fall in a single row, and each response goes in a separate column. In the long format, each row represents one time point per subject. Data in the wide format can be recognized by the fact that the columns basically represent the groups.
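A small pandas illustration of the two layouts (subject and test names are made up):

import pandas as pd

# wide format: one row per subject, one column per repeated measurement
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "test1": [85, 90],
    "test2": [88, 95],
})

# long format: one row per subject per measurement
long = wide.melt(id_vars="subject", var_name="test", value_name="score")
print(long)

# and back to wide again
print(long.pivot(index="subject", columns="test", values="score"))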

Q23). What is the meaning of normal distribution?

Ans:- Data can be distributed in many ways, inclining to the left or to the right. There are also high chances that data is focused around a middle value without any particular inclination to the left or the right; such data approaches the normal distribution and forms a bell-shaped curve.

 The normal distribution has the following properties:

Unimodal or one-mode.

Both the left and right halves are symmetrical and are mirror images of each other.

It is bell-shaped with a maximum height at the center.

Mean, mode, and even the median are all present at the center.

Asymptotic

Q24). What is the primary goal of A/B testing?


Ans:-A/B testing refers to a statistical hypothesis with two variables A and B. The
primary goal of A/B testing is the identification of any changes to the web page for
maximizing or increasing the outcome of interest. A/B testing is a fantastic method
for finding the most suitable online promotional and marketing strategies for the
business. It is basically used for testing everything from website copy to even the
emails made for sales and also search ads.

Q25). What is the meaning of statistical power of sensitivity, and how is it


calculated?

Ans:- The statistical power of sensitivity refers to the validation of the accuracy of a classifier, which can be Logistic Regression, SVM, Random Forest, etc. Sensitivity is basically True Positives / (True Positives + False Negatives): true positives are the events that are actually true and are also predicted as true by the model.


Q26). What do you think is the difference between overfitting and


underfitting?

Ans:-In both statistics and machine learning, fitting the model to a set of training
data to be able to make more reliable predictions on general untrained data is a
common task. In the case of overfitting, random errors or noise is described by a
statistical model instead of an underlying relationship. In the case of overfitting,
the model is highly complex, like having too many parameters which are relative
to many observations. The overfit model has a poor predictive performance, and it
overreacts to many minor fluctuations in the training data. In the case of
underfitting, the underlying trend of the data cannot be captured by the statistical
model or even the machine learning algorithm. Even such a model has poor
predictive performance.

Conclusion

Statistics have widespread applications everywhere. It has been used in many


applications like biology, meteorology, demography, economics, and mathematics.
Economic planning without statistics is hardly possible and largely baseless. The
field, due to its immense applications, has a huge scope. One has to have sharp
clarity of concepts and knowledge of the basics in addition to a keen interest in the
field. These probability and statistics interview questions and answers that are
mentioned above should be prepared  properly as they will be a great help during
your interview. Go to Janbask Training to get help on more such interview
questions on statistics.

Data Science Interview Questions

 What is Data Science?


 What is the difference between Data Analytics, Big Data, and Data Science?
 Which language R or Python is most suitable for text analytics?
 Explain Recommender System.
 What are the benefits of R language?
 How is statistics used by Data Scientists?
 What is the importance of data cleansing in data analysis?
 In real world scenario, how the machine learning is deployed?
 What is Linear Regression?
 Explain K-means algorithm.

Data Science Interview Questions & Answers

Q1). What is Data Science?

Data Science is a combination or mix of mathematical and technical skills, which may require business vision as well. These skills are used to predict future trends and to analyze data.

Q2). What is the difference between Data Analytics, Big Data, and Data
Science?

Big Data: Big Data deals with huge data volume in structured and semi structured
form and require just basic knowledge of mathematics and statistics.

Data Analytics: Data Analytics provide the operational insights of complex


scenarios of business

Data Science: Data Science deals with slicing and dicing of data and require deep
knowledge of mathematics and statistics

Q3). Which language R or Python is most suitable for text analytics?


As Python has the rich Pandas library, analysts can use high-level data analysis tools and data structures; since this feature set is less readily available in R, Python is more suitable for text analytics.

Q4). Explain Recommender System.

The recommender system works on the basis of the past behavior of a person and is
widely deployed in a number of fields like music preferences, movie
recommendations, research articles, social tags and search queries. With this
system, the future model can also be prepared, which can predict the person’s
future behavior and can be used to know the product the person would prefer
buying or which movie he will view or which book he will read. It uses the discrete
characteristics of the items to recommend any additional item.

Q5). What are the benefits of R language?

R programming uses a number of software suites for statistical computing,


graphical representation, data calculation and manipulation. Following are a few
characteristics of R programming:

It has an extensive tool collection

Tools have the operators to perform Matrix operations and calculations using
arrays

Analysing techniques using graphical representation

It is a language with many effective features but is simple as well

It supports machine learning applications

It acts as a connecting link between a number of data sets, tools and software
It can be used to solve data oriented problem

Q6). How is statistics used by Data Scientists?

With the help of statistics, the Data Scientists can convert the huge amount of data
to provide its insights. The data insights can provide a better idea of what the
customers are expecting? With the help of statistics, the Data scientists can know
the customer’s behavior, his engagements, interests and final conversion. They can
make powerful predictions and certain inferences. It can also be converted into
powerful propositions of business and the customers can also be offered suitable
deals.

Q7). What is the importance of data cleansing in data analysis?

As the data come from various multiple sources, so it becomes important to extract
useful and relevant data and therefore data cleansing become very important. Data
cleansing is basically the process of correcting and detecting accurate and relevant
data components and deletion of the irrelevant one. For data cleansing, the data is
processed concurrently or in batches.

Data cleansing is one of the important and essential steps for data science, as the
data can be prone to errors due to a number of reasons, including human
negligence. It takes a lot of time and effort to cleanse the data, as it comes from
various sources.

Q8). In real world scenario, how the machine learning is deployed?

The real world applications of machine learning include:

Finance: To evaluate risks, investment opportunities and in the detection of fraud

Robotics: To handle the non ordinary situations


Search Engine: To rank the pages as per the user’s personal preferences

Information Extraction: To frame the possible questions to extract the answers


from database

E-commerce: To deploy targeted advertising, re-marketing and customer churn

Q9). What is Linear Regression?

Linear regression is basically used for predictive analysis. This method describes
the relationship between dependent and independent variables. In linear regression,
a single line is fitted within a scatter plot. It consists of the following three
methods:

Analyzing and determining the direction and correlation of the data

Deployment of estimation model

To ensure the validity and usefulness of the model. It also helps to determine the
outcomes of various events

Q10). Explain K-means algorithm.

K-Means is a basic unsupervised learning algorithm that uses data clusters, known as K clusters, to classify the data. Data similarity is identified by grouping the data: a center is defined for each of the K clusters, and each object is assigned to its nearest cluster center, forming the K groups. All objects of the same cluster are related to each other and different from the objects of other clusters. This algorithm works well for large sets of data.
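A minimal scikit-learn sketch of K-Means on toy 2-D points (the data and the choice of K are illustrative only):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)            # the cluster assigned to each point
print(kmeans.cluster_centers_)   # the K cluster centers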

1. What is Linear Regression. What are the Assumptions involved in it?

Answer : The question can also be phrased as to why linear regression is not a very
effective algorithm.
Linear Regression is a mathematical relationship between an independent and
dependent variable. The relationship is a direct proportion, relation making it the
most simple relationship between the variables.

Y = mX+c

Y – Dependent Variable

X – Independent Variable

m and c are constants

Assumptions of Linear Regression :

The relationship between Y and X must be Linear.

The features must be independent of each other.

Homoscedasticity – The variation between the output must be constant for


different input data.

The distribution of Y along X should be the Normal Distribution.

2. What is Logistic Regression? What is the loss function in LR?

Answer : Logistic Regression is used for binary classification. It is a statistical model that applies the logit (sigmoid) function on top of a linear combination of the inputs to give a probability of class 0 or 1 as a result.

The loss function in LR is known as the Log Loss (binary cross-entropy) function. The equation is:

Log Loss = −(1/N) ∑ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ], where yᵢ is the true label and pᵢ is the predicted probability.
3. Difference between Regression and Classification?

Answer :  The major difference between Regression and Classification is that


Regression results in a continuous quantitative value while Classification is
predicting the discrete labels.

However, there is no clear line that draws the difference between the two. We have
a few properties of both Regression and Classification. These are as follows:

Regression

Regression predicts the quantity.

We can have discrete as well as continuous values as input for regression.

If input data are ordered with respect to the time it becomes time series forecasting.

Classification

The Classification problem for two classes is known as Binary Classification.

Classification can be split into Multi- Class Classification or Multi-Label


Classification.

We focus more on accuracy in Classification while we focus more on the error


term in Regression.
4. What is Natural Language Processing?  State some real life example of
NLP.

Answer : Natural Language Processing is a branch of Artificial Intelligence that


deals with the conversation of Human Language to Machine Understandable
language so that it can be processed by ML models.

Examples – NLP has so many practical applications including chatbots, google


translate,  and many other real time applications like Alexa.

Some of the other applications of NLP are in text completion, text suggestions, and
sentence correction.

5. Why do we need Evaluation Metrics. What do you understand by


Confusion Matrix ?

Answer : Evaluation Metrics are statistical measures of model performance. They


are very important because to determine the performance of any model it is very
significant to use various Evaluation Metrics. Few of the evaluation Metrics are –
Accuracy, Log Loss, Confusion Matrix.

Confusion Matrix is a matrix to find the performance of a Classification model. It


is in general a 2×2 matrix with one side as prediction and the other side as actual
values.
6. How does Confusion Matrix help in evaluating model performance?

Answer: We can find different accuracy measures using a confusion matrix. These
parameters are Accuracy, Recall, Precision, F1 Score, and Specificity.
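A short scikit-learn sketch using made-up true and predicted labels:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows are actual classes, columns are predicted classes
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))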

7. What is the significance of Sampling? Name some techniques for Sampling?

Answer : For analyzing the data we cannot proceed with the whole volume at once
for large datasets. We need to take some samples from the data which can
represent the whole population. While making a sample out of complete data, we
should take that data which can be a true representative of the whole data set.
There are mainly two types of Sampling techniques based on Statistics.

Probability Sampling and Non Probability Sampling

Probability Sampling – Simple Random, Clustered Sampling, Stratified Sampling.

Non Probability Sampling – Convenience Sampling, Quota Sampling, Snowball


Sampling.

8.  What are Type 1 and Type 2 errors? In which scenarios the Type 1 and
Type 2 errors become significant?

Answer :  Rejection of True Null Hypothesis is known as a Type 1 error. In simple


terms, False Positive are known as a Type 1 Error.

Not rejecting the False Null Hypothesis is known as a Type 2 error. False


Negatives are known as a Type 2 error.

Type 1 Error is significant where the cost of a false positive is high. For example – if a man who is not suffering from a particular disease is marked as positive for that infection, the medications given to him might damage his organs.

While Type 2 Error is significant in cases where the cost of missing a positive is high. For example – the alarm has to be raised in case of burglary in a bank, but if the system identifies it as a false case, the alarm won't be raised on time, resulting in a heavy loss.

9. What are the conditions for Overfitting and Underfitting?

Answer :
In Overfitting the model performs well for the training data, but for any new data it
fails to provide output. For Underfitting the model is very simple and not able to
identify the correct relationship. Following are the bias and variance conditions.

Overfitting – Low bias and High Variance results in overfitted model. Decision


tree is more prone to Overfitting.

Underfitting – High bias and Low Variance. Such model doesn’t perform well on
test data also. For example – Linear Regression is more prone to Underfitting.

10. What do you mean by Normalisation? Difference between Normalisation


and Standardization?

Answer : Normalisation is a process of bringing the features in a simple range, so


that model can perform well and do not get inclined towards any particular feature.

For example – If we have a dataset with multiple features and one feature is the
Age data which is in the range 18-60 , Another feature is the salary feature ranging
from 20000 – 2000000. In such a case, the values have a very much difference in
them. Age ranges in two digits integer while salary is in range significantly higher
than the age. So to bring the features in comparable range we need Normalisation.

Both Normalisation and Standardization are methods of Features Conversion.


However, the methods are different in terms of the conversions. The data after Normalisation scales in the range of 0-1, while in the case of Standardization the data is scaled such that its mean comes out to be 0 (and its standard deviation to be 1).
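A brief illustration with scikit-learn's scalers, reusing the age/salary idea above (the numbers are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[25, 30000], [35, 90000], [45, 250000], [60, 2000000]], dtype=float)

print(MinMaxScaler().fit_transform(data))     # normalisation: each feature squeezed into [0, 1]
print(StandardScaler().fit_transform(data))   # standardization: each feature has mean 0 and std 1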
11. What do you mean by Regularisation? What are L1 and L2
Regularisation?

Answer: Regularisation is a method to improve a model which is overfitted, by introducing extra terms in the loss function. This helps in making the model performance better for unseen data.
There are two types of Regularisation:

L1 Regularisation – In L1 we add lambda times the absolute weight terms to the


loss function. In this the feature weights are penalised on the basis of absolute
value.

L2 Regularisation – In L2 we add lambda times the squared weight terms to the


loss function. In this the feature weights are penalised on the basis of squared
values.
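In scikit-learn these penalties correspond to Lasso (L1) and Ridge (L2); a minimal sketch on synthetic data, where alpha plays the role of lambda:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                              # 5 features, only the first 2 matter
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: irrelevant weights tend to be driven exactly to zero
print(Ridge(alpha=0.1).fit(X, y).coef_)   # L2: weights are shrunk but rarely become exactly zero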

12. Describe Decision tree Algorithm and what are entropy and information
gain?

Answer : A decision tree is a supervised machine learning approach. It uses labelled historical data to build a model and follows a set of rules to identify patterns and predict the class or output variable for new data.
The decision tree works in the following manner –

It takes the complete set of data and tries to identify the split point with the highest information gain and least entropy, marks it as a node, and proceeds recursively in this manner. Entropy and information gain are the deciding factors for selecting each node in a decision tree.
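
A small, self-contained sketch (plain NumPy, with a made-up binary split) of how entropy and information gain for a candidate split could be computed:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent minus the weighted entropy of the children
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # hypothetical class labels
left, right = parent[:4], parent[4:]          # a candidate split point
print(entropy(parent), information_gain(parent, left, right))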

13. What is Ensemble Learning? Give an important example of Ensemble Learning.

Answer : Ensemble Learning is a process of combining multiple models to form a better prediction model. In Ensemble Learning the performance of each individual model contributes to the overall result. There are two common techniques – Bagging and Boosting.
Bagging – The data set is split into samples so that models can be trained in parallel, and the results are aggregated to achieve better accuracy.

Boosting – This is a sequential technique in which the result from one model is passed to the next model to reduce the error at every step, producing a better-performing model.

The most important example of Ensemble Learning is the Random Forest classifier. It combines multiple decision trees to form a better-performing Random Forest model.
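
As a quick illustration of bagging-style ensembling with a Random Forest (scikit-learn, toy data; the hyperparameters are just defaults):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a bootstrap sample; predictions combined by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(rf.score(X_test, y_test))
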
14. Explain Naive Bayes Classifier and the principle on which it works?

Answer : The Naive Bayes classifier is a probabilistic model. It works on the principle of Bayes' Theorem. The accuracy of Naive Bayes can often be increased by combining it with kernel functions (for example, kernel density estimation) to build a better classifier.

Bayes' Theorem – This theorem describes conditional probability: the probability of occurrence of event A given that event B has already occurred.

15. What is Imbalanced Data? How do you manage to balance the data?
Answer : If data is distributed across different categories and the distribution is highly skewed, it is known as imbalanced data. Such datasets hurt model performance because the category with many samples dominates the model, resulting in an inaccurate model for the minority classes.

There are various techniques to handle imbalanced data. We can increase the number of samples for minority classes (over-sampling), decrease the number of samples for classes with an extremely high number of data points (under-sampling), or use a cluster-based technique to increase the number of data points for all the categories.

16. Explain Unsupervised Clustering approach?

Answer : Grouping the data into different clusters based on the distribution of data
is known as Clustering technique.

There are various Clustering Techniques –

1. Density Based Clustering – DBSCAN , HDBSCAN

2. Hierarchical Clustering.

3. Partition Based Clustering

4. Distribution Based Clustering.

17. Explain the DBSCAN clustering technique and in what terms DBSCAN is better than K-Means clustering?

Answer : DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised approach which splits the points into different groups based on a distance threshold and the number of points lying within that range. In DBSCAN clustering we have two significant parameters –

Epsilon – The maximum radius or distance between two data points for them to be tagged into the same neighbourhood.

Min Sample Points – The minimum number of samples which should fall within that radius for the region to be identified as a cluster.

DBSCAN has a few advantages over other clustering algorithms –

1. In DBSCAN we do not need to provide a fixed number of clusters; as many clusters can form as the distribution of the data points allows. In K-Means, by contrast, we need to provide the number of clusters we want to split our data into.

2. In DBSCAN we also get a noise cluster, which helps us identify outliers. This can also act as a useful signal for tuning the hyperparameters of a model accordingly.
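
A hedged sketch with scikit-learn (the eps and min_samples values are arbitrary and would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = Epsilon (neighbourhood radius), min_samples = minimum points to form a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                 # cluster index per point; -1 marks noise/outliers
print(set(labels))
print("noise points:", np.sum(labels == -1))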

18.  What do you mean by Cross Validation. Name some common cross
Validation techniques?

Answer : Cross Validation is a model evaluation and performance improvement technique. It is a statistics-based approach in which the model is trained and validated in rotation within the training dataset, so that it can perform well on unknown or test data.

In this, the training data is split into different groups and, in rotation, those groups are used to validate the model's performance.
The common Cross Validation techniques are –

K- Fold Cross Validation

Leave p-out Cross Validation

Leave-one-out cross-validation.

Holdout method
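
A minimal K-Fold example with scikit-learn (the model and k = 5 are just placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: the training data is split into 5 groups, each used once as the validation fold
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())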

19.  What is Deep Learning?

Answer: Deep Learning is a branch of Machine Learning and AI that uses deep neural networks to learn complex models and achieve better accuracy. Deep Learning models are loosely inspired by the human brain, with an input layer, hidden layers, activation functions and an output layer.

Deep Learning has many real-time applications –

Self Driving Cars

Computer Vision and Image Processing

Real Time Chat bots

Home Automation Systems


20. Difference between RNN and CNN?

Answer:

CNN: It is used for distributed data such as images.
RNN: It is used for sequential data.

CNN: Generally has better performance than RNN.
RNN: Does not have as many features.

CNN: Requires input and output to be of fixed size.
RNN: Can take data of any dimensions.

CNN: A feed-forward network with multi-layer, easy processing.
RNN: Not a feed-forward mechanism; it uses its own internal memory.

CNN: Uses patterns between different layers to identify the next results.
RNN: Uses time-series information and processes results based on past memories.

CNN: Typical application – image processing.
RNN: Typical applications – time-series forecasting, text classification.
170 Machine Learning Interview Questions and Answers for 2021

Machine Learning Interview Questions

1. Explain the terms AI, ML and Deep Learning?
2. What’s the difference between Type I and Type II error?
3. State the differences between causality and correlation?
4. How can we relate standard deviation and variance?
5. Is a high variance in data good or bad?
6. What is Time series?
7. What is a Box-Cox transformation?
8. What’s a Fourier transform?
9. What is Marginalization? Explain the process.
10. Explain the phrase “Curse of Dimensionality”.

Top 100 Machine Learning Questions with Answers for Interview

1. Explain the terms Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning?

Artificial Intelligence (AI) is the domain of producing intelligent machines. ML refers to systems that can learn from experience (training data), and Deep Learning (DL) refers to systems that learn from experience on large data sets. ML can be considered a subset of AI, and DL is ML applied to large data sets.

In summary, DL is a subset of ML, and both are subsets of AI.

Additional information: ASR (Automatic Speech Recognition) and NLP (Natural Language Processing) fall under AI and overlap with ML and DL, as ML is often utilised for NLP and ASR tasks.
2. What are the different types of Learning/ Training models in ML?

ML algorithms can be primarily classified depending on the presence/absence of


target variables.

A. Supervised learning: [Target is present]


The machine learns using labelled data. The model is trained on an existing data
set before it starts making decisions with the new data.
The target variable is continuous : Linear Regression, polynomial Regression,
quadratic Regression.
The target variable is categorical : Logistic regression, Naive Bayes, KNN,
SVM, Decision Tree, Gradient Boosting, ADA boosting, Bagging, Random
forest etc.

B. Unsupervised learning: [Target is absent]


The machine is trained on unlabelled data and without any proper guidance. It
automatically infers patterns and relationships in the data by creating clusters.
The model learns through observations and deduced structures in the data.
Principal component Analysis, Factor analysis, Singular Value Decomposition
etc.

C.  Reinforcement Learning:


The model learns through a trial and error method. This kind of learning
involves an agent that will interact with the environment to create actions and
then discover errors or rewards of that action.

3. What is the difference between deep learning and machine learning?


Machine Learning involves algorithms that learn from patterns in data and then apply that learning to decision making. Deep Learning, on the other hand, is able to learn by processing data on its own and is loosely similar to the human brain: it identifies something, analyses it, and makes a decision.
The key differences are as follow:

 The manner in which data is presented to the system.


 Machine learning algorithms always require structured data and deep
learning networks rely on layers of artificial neural networks.

4. What is the main key difference between supervised and unsupervised


machine learning?

Supervised learning technique needs labeled data to train the model. For
example, to solve a classification problem (a supervised learning task), you need
to have label data to train the model and to classify the data into your labeled
groups. Unsupervised learning does not  need any labelled dataset. This is the
main key difference between supervised learning and unsupervised learning.
5. How do you select important variables while working on a data set? 

There are various means to select important variables from a data set that
include the following:

 Identify and discard correlated variables before finalizing on important


variables
 The variables could be selected based on ‘p’ values from Linear
Regression
 Forward, Backward, and Stepwise selection
 Lasso Regression
 Random Forest and plot variable chart
 Top features can be selected based on information gain for the
available set of features.

6. There are many machine learning algorithms till now. If given a data set,
how can one determine which algorithm to be used for that?

Machine Learning algorithm to be used purely depends on the type of data in a


given dataset. If data is linear then, we use linear regression. If data shows non-
linearity then, the bagging algorithm would do better. If the data is to be
analyzed/interpreted for some business purposes then we can use decision trees
or SVM. If the dataset consists of images, videos, audios then, neural networks
would be helpful to get the solution accurately.

So, there is no certain metric to decide which algorithm to be used for a given
situation or a data set. We need to explore the data using EDA (Exploratory
Data Analysis) and understand the purpose of using the dataset to come up with
the best fit algorithm. So, it is important to study all the algorithms in detail.
7. How are covariance and correlation different from one another?

Covariance measures how two variables are related to each other and how one
would vary with respect to changes in the other variable. If the value is positive
it means there is a direct relationship between the variables and one would
increase or decrease with an increase or decrease in the base variable
respectively, given that all other conditions remain constant.

Correlation quantifies the relationship between two random variables and always lies between -1 and 1.
A value of 1 denotes a perfect positive relationship, -1 denotes a perfect negative relationship, and 0
denotes that the two variables are uncorrelated with each other.
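
A quick NumPy illustration with made-up data; correlation is simply covariance rescaled by the two standard deviations:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]                      # covariance (scale-dependent)
corr_xy = np.corrcoef(x, y)[0, 1]                # correlation, always in [-1, 1]
print(cov_xy, corr_xy)
print(cov_xy / (x.std(ddof=1) * y.std(ddof=1)))  # same value as corr_xy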

8. State the differences between causality and correlation?

Causality applies to situations where one action, say X, causes an outcome, say
Y, whereas Correlation is just relating one action (X) to another action(Y) but X
does not necessarily cause Y.
9. We look at machine learning software almost all the time. How do we
apply Machine Learning to Hardware?

We have to build ML algorithms in System Verilog which is a Hardware


development Language and then program it onto an FPGA to apply Machine
Learning to hardware.

10. Explain One-hot encoding and Label Encoding. How do they affect the
dimensionality of the given dataset?

One-hot encoding is the representation of categorical variables as binary vectors. Label Encoding is converting labels/words into numeric form. One-hot encoding increases the dimensionality of the data set because it creates a new binary variable for each level of the categorical variable, whereas Label Encoding keeps the dimensionality unchanged since each level is simply encoded as an integer (0, 1, 2, …).
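
A small illustrative sketch with pandas and scikit-learn (the "city" column is hypothetical) showing the effect on dimensionality:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi"]})

one_hot = pd.get_dummies(df["city"])                  # 3 new binary columns, one per level
label = LabelEncoder().fit_transform(df["city"])      # single column of integers 0..n-1

print(one_hot.shape)   # (4, 3) -> dimensionality grows with the number of levels
print(label)           # e.g. [0 1 2 0] -> dimensionality unchanged
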
Deep Learning Interview Questions

Deep Learning is a part of machine learning that works with neural networks. It
involves a hierarchical structure of networks that set up a process to help
machines learn the human logics behind any action. We have compiled a list of
the frequently asked deep leaning interview questions to help you prepare.

What is overfitting?

Overfitting is a type of modelling error which results in the failure to predict future observations effectively or fit additional data in the existing model. It occurs when a function is fit too closely to a limited set of data points and typically involves more parameters than the data can justify.

What is Multilayer Perceptron and Boltzmann Machine?

The Boltzmann machine is a simplified version of the multilayer perceptron. It is a two-layer model with a visible input layer and a hidden layer which makes stochastic decisions.

11. When does regularization come into play in Machine Learning?

At times when the model begins to underfit or overfit, regularization becomes


necessary. It is a regression that diverts or regularizes the coefficient estimates
towards zero. It reduces flexibility and discourages learning in a model to avoid
the risk of overfitting. The model complexity is reduced and it becomes better at
predicting.
12. What is Bias, Variance and what do you mean by Bias-Variance
Tradeoff?

Both are errors in Machine Learning Algorithms. When the algorithm has
limited flexibility to deduce the correct observation from the dataset, it results in
bias. On the other hand, variance occurs when the model is extremely sensitive
to small fluctuations.

If one adds more features while building a model, it will add more complexity
and we will lose bias but gain some variance. In order to maintain the optimal
amount of error, we perform a tradeoff between bias and variance based on the
needs of a business.
Source: Understanding the Bias-Variance Tradeoff: Scott Fortmann – Roe

Bias stands for the error caused by erroneous or overly simplistic assumptions in the learning algorithm. These assumptions can lead to the model underfitting the data, making it hard for it to have high predictive accuracy and for you to generalise your knowledge from the training set to the test set.

Variance is the error caused by too much complexity in the learning algorithm. This makes the algorithm highly sensitive to small variations in the training data, which can lead your model to overfit the data, carrying so much noise from the training data that the model is not very useful on your test data.

The bias-variance decomposition essentially decomposes the learning error of any algorithm into the bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance; to get the optimally reduced amount of error, you'll have to trade off bias and variance. You don't want either high bias or high variance in your model.
13. How can we relate standard deviation and variance?

Standard deviation refers to the spread of your data from the mean. Variance is the average squared deviation of each point from the mean. The two are directly related: standard deviation is the square root of variance.

14. A data set is given to you and it has missing values which spread along 1 standard deviation from the mean. How much of the data would remain untouched?

It is given that the missing data lies within 1 standard deviation of the mean, so we can presume a normal distribution. In a normal distribution, about 68% of the data lies within 1 standard deviation of the mean. That means about 32% of the data remains uninfluenced by the missing values.

15. Is a high variance in data good or bad?

Higher variance directly means that the data spread is big and the feature has a
variety of data. Usually, high variance in a feature is seen as not so good
quality.

16. If your dataset is suffering from high variance, how would you handle
it?

For datasets with high variance, we could use the bagging algorithm to handle
it. Bagging algorithm splits the data into subgroups with sampling replicated
from random data. After the data is split, random data is used to create rules
using a training algorithm. Then we use polling technique to combine all the
predicted outcomes of the model.
17. A data set is given to you about utilities fraud detection. You have built a classifier model and achieved a performance score of 98.5%. Is this a good model? If yes, justify. If not, what can you do about it?

A data set about utilities fraud detection is usually not balanced, i.e. it is imbalanced. On such a data set, the accuracy score cannot be the measure of performance, as the model may only predict the majority class label correctly while our point of interest is predicting the minority label. Minority instances are often treated as noise and ignored, so there is a high probability of misclassifying the minority label compared to the majority label. For evaluating model performance on imbalanced data sets, we should use Sensitivity (True Positive rate) or Specificity (True Negative rate) to determine class-wise performance of the classification model. If the minority class label's performance is not good enough, we could do the following:

1. We can use under sampling or over sampling to balance the data.


2. We can change the prediction threshold value.
3. We can assign weights to labels such that the minority class labels get
larger weights.
4. We could detect anomalies.

18. Explain the handling of missing or corrupted values in the given


dataset.

An easy way to handle missing values or corrupted values is to drop the


corresponding rows or columns. If there are too many rows or columns to drop
then we consider replacing the missing or corrupted values with some new
value.

Identifying missing values and dropping the rows or columns can be done using the isnull() and dropna() functions in Pandas. Also, the fillna() function in Pandas replaces the missing values with a placeholder value.
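
For reference, a small pandas sketch of the functions mentioned above (the column names are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())          # count missing values per column
dropped = df.dropna()             # drop rows containing any missing value
filled = df.fillna(df.median())   # or replace them with a placeholder such as the median
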
19. What is Time series?

A Time series is a sequence of numerical data points in successive order. It


tracks the movement of the chosen data points, over a specified period of time
and records the data points at regular intervals. Time series doesn’t require any
minimum or maximum time input. Analysts often use Time series to examine
data according to their specific requirement.

20. What is a Box-Cox transformation?

Box-Cox transformation is a power transform which transforms non-normal


dependent variables into normal variables as normality is the most common
assumption made while using many statistical techniques. It has a lambda
parameter which when set to 0 implies that this transform is equivalent to log-
transform. It is used for variance stabilization and also to normalize the
distribution.


21. What is the difference between stochastic gradient descent (SGD) and
gradient descent (GD)?

Gradient Descent and Stochastic Gradient Descent are the algorithms that find
the set of parameters that will minimize a loss function.
The difference is that in Gradient Descend, all training samples are evaluated
for each set of parameters. While in Stochastic Gradient Descent only one
training sample is evaluated for the set of parameters identified.
22. What is the exploding gradient problem while using back propagation
technique?

When large error gradients accumulate and result in large changes in the neural
network weights during training, it is called the exploding gradient problem.
The values of weights can become so large as to overflow and result in NaN
values. This makes the model unstable and the learning of the model to stall just
like the vanishing gradient problem.

23. Can you mention some advantages and disadvantages of decision trees?

The advantages of decision trees are that they are easier to interpret, are
nonparametric and hence robust to outliers, and have relatively few parameters
to tune.
On the other hand, the disadvantage is that they are prone to overfitting.
24. Explain the differences between Random Forest and Gradient Boosting
machines.

Random forests pool a large number of decision trees using averaging or majority voting at the end. Gradient boosting machines also combine decision trees, but they build them sequentially from the start of the process, unlike random forests: a random forest creates each tree independently of the others, while gradient boosting develops one tree at a time. Gradient boosting yields better outcomes than random forests if parameters are carefully tuned, but it is not a good option if the data set contains a lot of outliers/anomalies/noise, as this can result in overfitting of the model. Random forests perform well for multiclass object detection. Gradient boosting performs well when the data is not balanced, such as in real-time risk assessment.

25. What is a confusion matrix and why do you need it?

Confusion matrix (also called the error matrix) is a table that is frequently used
to illustrate the performance of a classification model i.e. classifier on a set of
test data for which the true values are well-known.

It allows us to visualize the performance of an algorithm/model. It allows us to


easily identify the confusion between different classes. It is used as a
performance measure of a model/algorithm.

A confusion matrix is known as a summary of predictions on a classification


model. The number of right and wrong predictions were summarized with count
values and broken down by each class label. It gives us information about the
errors made through the classifier and also the types of errors made by a
classifier.
26. What’s a Fourier transform?

Fourier Transform is a mathematical technique that transforms a function of time into a function of frequency. The Fourier transform is closely related to the Fourier series. It takes any time-based pattern as input and calculates the overall cycle offset, rotation speed and strength for all possible cycles. The Fourier transform is best applied to waveforms since they are functions of time or space. Once a Fourier transform is applied to a waveform, it gets decomposed into sinusoids.
27. What do you mean by Associative Rule Mining (ARM)?

Associative Rule Mining is one of the techniques to discover patterns in data


like features (dimensions) which occur together and features (dimensions)
which are correlated. It is mostly used in Market-based Analysis to find how
frequently an itemset occurs in a transaction. Association rules have to satisfy
minimum support and minimum confidence at the very same time. Association
rule generation generally comprised of two different steps:

 “A min support threshold is given to obtain all frequent item-sets in a


database.”
 “A min confidence constraint is given to these frequent item-sets in
order to form the association rules.”

Support is a measure of how often the “item set” appears in the data set and
Confidence is a measure of how often a particular rule has been found to be
true.

28. What is Marginalisation? Explain the process.

Marginalisation is summing the probability of a random variable X given joint


probability distribution of X with other variables. It is an application of the law
of total probability.

P(X=x) = ∑_Y P(X=x, Y)

Given the joint probability P(X=x,Y), we can use marginalization to find


P(X=x). So, it is to find distribution of one random variable by exhausting cases
on other random variables.

29. Explain the phrase “Curse of Dimensionality”.

The Curse of Dimensionality refers to the situation when your data has too
many features.
The phrase is used to express the difficulty of using brute force or grid search to
optimize a function with too many inputs.

It can also refer to several other issues like:

 If we have more features than observations, we have a risk of


overfitting the model.
 When we have too many features, observations become harder to
cluster. Too many dimensions cause every observation in the dataset to
appear equidistant from all others and no meaningful clusters can be
formed.

Dimensionality reduction techniques like PCA come to the rescue in such cases.

30. What is Principal Component Analysis?

The idea here is to reduce the dimensionality of the data set by reducing the
number of variables that are correlated with each other. Although the variation
needs to be retained to the maximum extent.

The variables are transformed into a new set of variables known as Principal Components. These PCs are the eigenvectors of the covariance matrix and are therefore orthogonal.
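
A brief, hedged scikit-learn sketch (the number of components is chosen arbitrarily):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=2)                       # keep the 2 orthogonal directions with most variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)            # share of total variance retained by each PC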

NLP Interview Questions

NLP or Natural Language Processing helps machines analyse natural languages


with the intention of learning them. It extracts information from data by
applying machine learning algorithms. Apart from learning the basics of NLP, it
is important to prepare specifically for the interviews.

Explain Dependency Parsing in NLP?

Dependency Parsing, also known as Syntactic parsing in NLP is a process of


assigning syntactic structure to a sentence and identifying its dependency
parses. This process is crucial to understand the correlations between the “head”
words in the syntactic structure.
The process of dependency parsing can be a little complex considering how any
sentence can have more than one dependency parses. Multiple parse trees are
known as ambiguities. Dependency parsing needs to resolve these ambiguities
in order to effectively assign a syntactic structure to a sentence.

Dependency parsing can be used in the semantic analysis of a sentence apart


from the syntactic structuring.

Which of the following architectures can be trained faster and needs less training data?

a. LSTM-based Language Modelling

b. Transformer architecture

31. Why is rotation of components so important in Principal Component Analysis (PCA)?

Rotation in PCA is very important as it maximizes the separation within the


variance obtained by all the components because of which interpretation of
components would become easier. If the components are not rotated, then we
need extended components to describe variance of the components.
32. What are outliers? Mention three methods to deal with outliers.

A data point that is considerably distant from the other similar data points is
known as an outlier. They may occur due to experimental errors or variability in
measurement. They are problematic and can mislead a training process, which
eventually results in longer training time, inaccurate models, and poor results.

The three methods to deal with outliers are:


Univariate method – looks for data points having extreme values on a single
variable
Multivariate method – looks for unusual combinations on all the variables
Minkowski error – reduces the contribution of potential outliers in the training
process
33. What is the difference between regularization and normalisation? 

Normalisation adjusts the data; regularisation adjusts the prediction function. If your data is on very different scales (especially low to high), you would want to normalise the data: alter each column to have compatible basic statistics. This can be helpful to make sure there is no loss of accuracy. One of the goals of model training is to identify the signal and ignore the noise; if the model is given free rein to minimise error, there is a possibility of it suffering from overfitting. Regularisation imposes some control on this by favouring simpler fitting functions over complex ones.

34. Explain the difference between Normalization and Standardization.

Normalization and Standardization are two very popular methods used for feature scaling. Normalization refers to re-scaling the values to fit into a range of [0,1]. Standardization refers to re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance). Normalization is useful when all features need to be on the same positive scale; however, information about outliers in the data set is lost. Hence, standardization is recommended for most applications.

35. List the most popular distribution curves along with scenarios where
you will use them in an algorithm.

The most popular distribution curves are as follows- Bernoulli Distribution,


Uniform Distribution, Binomial Distribution, Normal Distribution, Poisson
Distribution, and Exponential Distribution.
Each of these distribution curves is used in various scenarios.

Bernoulli Distribution can be used to check if a team will win a championship


or not, a newborn child is either male or female, you either pass an exam or not,
etc.
Uniform distribution is a probability distribution that has a constant probability.
Rolling a single dice is one example because it has a fixed number of outcomes.

Binomial distribution is a probability with only two possible outcomes, the


prefix ‘bi’ means two or twice. An example of this would be a coin toss. The
outcome will either be heads or tails.

Normal distribution describes how the values of a variable are distributed. It is


typically a symmetric distribution where most of the observations cluster around
the central peak. The values further away from the mean taper off equally in
both directions. An example would be the height of students in a classroom.

Poisson distribution helps predict the probability of certain events happening


when you know how often that event has occurred. It can be used by
businessmen to make forecasts about the number of customers on certain days
and allows them to adjust supply according to the demand.

Exponential distribution is concerned with the amount of time until a specific


event occurs. For example, how long a car battery would last, in months.

36. How do we check the normality of a data set or a feature? 

Visually, we can check it using plots. There is a list of Normality checks, they
are as follow:

 Shapiro-Wilk W Test
 Anderson-Darling Test
 Martinez-Iglewicz Test
 Kolmogorov-Smirnov Test
 D’Agostino Skewness Test
37. What is Linear Regression ?

A linear function can be defined as a mathematical function on a 2D plane as Y = Mx + C, where Y is the dependent variable, X is the independent variable, C is the intercept and M is the slope. The same can be expressed as Y is a function of X, or Y = F(X).

At any given value of X, one can compute the value of Y using the equation of the line. This relation between Y and X, with the degree of the polynomial equal to 1, is called Linear Regression.

In Predictive Modeling, LR is represented as Y = Bo + B1x1 + B2x2


The value of B1 and B2 determines the strength of the correlation between
features and the dependent variable.

Example: Stock Value in $ = Intercept + (+/-B1)*(Opening value of Stock) +


(+/-B2)*(Previous Day Highest value of Stock)
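
A minimal, hypothetical fit of Y = B0 + B1*x1 with scikit-learn on synthetic data (not the stock example above):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * x[:, 0] + rng.normal(0, 1, size=100)   # true intercept 4.0, slope 2.5

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimates of B0 and B1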

38. Differentiate between regression and classification.

Regression and classification are categorized under the same umbrella of


supervised machine learning. The main difference between them is that the
output variable in the regression is numerical (or continuous) while that for
classification is categorical (or discrete).

Example: To predict the definite Temperature of a place is Regression problem


whereas predicting whether the day will be Sunny cloudy or there will be rain is
a case of classification. 
39. What is target imbalance? How do we fix it? A scenario where you have
performed target imbalance on data. Which metrics and algorithms do you
find suitable to input this data onto? 

If you have a categorical target variable and, on performing a frequency count, you find that certain categories are present in significantly larger numbers than others, this is known as target imbalance.

Example: Target column – 0,0,0,1,0,2,0,0,1,1 [0s: 60%, 1: 30%, 2:10%] 0 are in


majority. To fix this, we can perform up-sampling or down-sampling. Before
fixing this problem let’s assume that the performance metrics used was
confusion metrics. After fixing this problem we can shift the metric system to
AUC: ROC. Since we added/deleted data [up sampling or downsampling], we
can go ahead with a stricter algorithm like SVM, Gradient boosting or ADA
boosting. 

40. List all assumptions for data to be met before starting with linear
regression.

Before starting linear regression, the assumptions to be met are as follow:

 Linear relationship
 Multivariate normality
 No or little multicollinearity
 No auto-correlation
 Homoscedasticity

41. When does the linear regression line stop rotating or finds an optimal
spot where it is fitted on data? 

A place where the highest RSquared value is found, is the place where the line
comes to rest. RSquared represents the amount of variance captured by the
virtual linear regression line with respect to the total variance captured by the
dataset. 

42. Why is logistic regression a type of classification technique and not a


regression? Name the function it is derived from? 

Since the target column is categorical, it uses a linear combination of the features to model the log of the odds, which is then passed through a logistic (sigmoid) function so that regression can be used as a classifier. Hence, it is a type of classification technique and not a regression. It is derived from the logistic (sigmoid) function.

43. What could be the issue when the beta value for a certain variable
varies way too much in each subset when regression is run on different
subsets of the given dataset?

Variations in the beta values in every subset implies that the dataset is


heterogeneous. To overcome this problem, we can use a different model for
each of the clustered subsets of the dataset or use a non-parametric model such
as decision trees.

44. What does the term Variance Inflation Factor mean?

Variation Inflation Factor (VIF) is the ratio of variance of the model to variance
of the model with only one independent variable. VIF gives the estimate of
volume of multicollinearity in a set of many regression variables.

VIF = Variance of the model / Variance of the model with one independent variable

45. Which machine learning algorithm is known as the lazy learner and
why is it called so?

KNN is a Machine Learning algorithm known as a lazy learner. K-NN is a lazy


learner because it doesn’t learn any machine learnt values or variables from the
training data but dynamically calculates distance every time it wants to classify,
hence memorises the training dataset instead. 

Python Interview Questions

Here’s a list of the top 101 interview questions with answers to help you
prepare. The first set of questions and answers are curated for freshers while the
second set is designed for advanced users.

What are functions in Python?

Functions in Python are blocks of organised, reusable code that perform a single, related action. Functions provide better modularity for applications that reuse a high degree of code. Python also ships with a number of built-in functions.

What are dataframes?

A pandas dataframe is a mutable data structure in pandas. Pandas supports heterogeneous data arranged across two axes (rows and columns).


46. Is it possible to use KNN for image processing? 


Yes, it is possible to use KNN for image processing. It can be done by
converting the 3-dimensional image into a single-dimensional vector and using
the same as input to KNN. 

47. Differentiate between K-Means and KNN algorithms?

KNN is Supervised Learning where-as K-Means is Unsupervised Learning.


With KNN, we predict the label of the unidentified element based on its nearest
neighbour and further extend this approach for solving classification/regression-
based problems. 

K-Means is Unsupervised Learning, where we don’t have any Labels present, in


other words, no Target Variables and thus we try to cluster the data based upon
their coordinates and try to establish the nature of the cluster based on the
elements filtered for that cluster.

48. How does the SVM algorithm deal with self-learning? 

SVM has a learning rate and expansion rate which takes care of this. The
learning rate compensates or penalises the hyperplanes for making all the wrong
moves and expansion rate deals with finding the maximum separation area
between classes. 

49. What are Kernels in SVM? List popular kernels used in SVM along
with a scenario of their applications.

The function of kernel is to take data as input and transform it into the required
form. A few popular Kernels used in SVM are as follows: RBF, Linear,
Sigmoid, Polynomial, Hyperbolic, Laplace, etc. 
50. What is Kernel Trick in an SVM Algorithm?

Kernel Trick is a mathematical function which when applied on data points, can
find the region of classification between two different classes. Based on the
choice of function, be it linear or radial, which purely depends upon the
distribution of data, one can build a classifier. 
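
As an illustration of the kernel choice (scikit-learn, toy concentric-circle data; the parameters are left at common defaults):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svc = SVC(kernel="linear").fit(X_train, y_train)
rbf_svc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)  # kernel trick: implicit mapping to a higher-dimensional space

print(linear_svc.score(X_test, y_test))   # a linear boundary struggles on concentric circles
print(rbf_svc.score(X_test, y_test))      # the radial kernel separates them easily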

51. What are ensemble models? Explain how ensemble techniques yield
better learning as compared to traditional classification ML algorithms? 

Ensemble is a group of models that are used together for prediction both in
classification and regression class. Ensemble learning helps improve ML results
because it combines several models. By doing so, it allows a better predictive
performance compared to a single model. 
They are superior to individual models as they reduce variance, average out
biases, and have lesser chances of overfitting.

52. What are overfitting and underfitting? Why does the decision tree
algorithm suffer often with overfitting problem?

Overfitting is a statistical model or machine learning algorithm which captures


the noise of the data. Underfitting is a model or machine learning algorithm
which does not fit the data well enough and occurs if the model or algorithm
shows low variance but high bias.

In decision trees, overfitting occurs when the tree is designed to perfectly fit all
samples in the training data set. This results in branches with strict rules or
sparse data and affects the accuracy when predicting samples that aren’t part of
the training set.
53. What is OOB error and how does it occur? 

For each bootstrap sample, about one-third of the data is not used in the creation of that tree, i.e. it is out of the sample. This data is referred to as out of bag data. In order to get an unbiased measure of the accuracy of the model over test data, the out of bag error is used: the out of bag data for each tree is passed through that tree and the outputs are aggregated to give the out of bag error. This percentage error is quite effective in estimating the error on the testing set and does not require further cross-validation.

54. Why boosting is a more stable algorithm as compared to other ensemble


algorithms? 

Boosting focuses on errors found in previous iterations until they become


obsolete. Whereas in bagging there is no corrective loop. This is why boosting
is a more stable algorithm compared to other ensemble algorithms. 

55. How do you handle outliers in the data?

Outlier is an observation in the data set that is far away from other observations
in the data set. We can discover outliers using tools and functions like box plot,
scatter plot, Z-Score, IQR score etc. and then handle them based on the
visualization we have got. To handle outliers, we can cap at some threshold, use
transformations to reduce skewness of the data and remove outliers if they are
anomalies or errors.
56. List popular cross validation techniques.

There are mainly six types of cross validation techniques. They are as follow:

 K fold
 Stratified k fold
 Leave one out
 Bootstrapping
 Random search cv
 Grid search cv

57. Is it possible to test for the probability of improving model accuracy


without cross-validation techniques? If yes, please explain.

Yes, it is possible to test for the probability of improving model accuracy


without cross-validation techniques. We can do so by running the ML model for
say n number of iterations, recording the accuracy. Plot all the accuracies and
remove the 5% of low probability values. Measure the left [low] cut off and
right [high] cut off. With the remaining 95% confidence, we can say that the
model can go as low or as high [as mentioned within cut off points]. 

58. Name a popular dimensionality reduction algorithm.

Popular dimensionality reduction algorithms are Principal Component Analysis


and Factor Analysis.
Principal Component Analysis creates one or more index variables from a larger
set of measured variables. Factor Analysis is a model of the measurement of a
latent variable. This latent variable cannot be measured with a single variable
and is seen through a relationship it causes in a set of y variables.

59. How can we use a dataset without the target variable into supervised
learning algorithms? 

Input the data set into a clustering algorithm, generate optimal clusters, label the
cluster numbers as the new target variable. Now, the dataset has independent
and target variables present. This ensures that the dataset is ready to be used in
supervised learning algorithms. 

60. List all types of popular recommendation systems ? Name and explain


two personalized recommendation systems along with their ease of
implementation. 

Popularity based recommendation, content-based recommendation, user-based


collaborative filter, and item-based recommendation are the popular types of
recommendation systems.
Personalised Recommendation systems are- Content-based recommendation,
user-based collaborative filter, and item-based recommendation. User-based
collaborative filter and item-based recommendations are more personalised.
Ease to maintain: Similarity matrix can be maintained easily with Item-based
recommendation.
61. How do we deal with sparsity issues in recommendation systems? How
do we measure its effectiveness? Explain. 

Singular value decomposition can be used to generate the prediction matrix.


RMSE is the measure that helps us understand how close the prediction matrix
is to the original matrix.  

62. Name and define techniques used to find similarities in the


recommendation system. 

Pearson correlation and cosine similarity are techniques used to find similarities in recommendation systems.

63. State the limitations of Fixed Basis Function.

Linear separability in feature space doesn’t imply linear separability in input


space. So, Inputs are non-linearly transformed using vectors of basic functions
with increased dimensionality. Limitations of Fixed basis functions are:

1. Non-Linear transformations cannot remove overlap between two


classes but they can increase overlap.
2. Often it is not clear which basis functions are the best fit for a given
task. So, learning the basic functions can be useful over using fixed
basis functions.
3. If we want to use only fixed ones, we can use a lot of them and let the
model figure out the best fit but that would lead to overfitting the
model thereby making it unstable. 

64. Define and explain the concept of Inductive Bias with some examples.

Inductive Bias is a set of assumptions that humans use to predict outputs given
inputs that the learning algorithm has not encountered yet. When we are trying
to learn Y from X and the hypothesis space for Y is infinite, we need to reduce
the scope by our beliefs/assumptions about the hypothesis space which is also
called inductive bias. Through these assumptions, we constrain our hypothesis
space and also get the capability to incrementally test and improve on the data
using hyper-parameters. Examples:

1. We assume that Y varies linearly with X while applying Linear


regression.
2. We assume that there exists a hyperplane separating negative and
positive examples.

65. Explain the term instance-based learning.

Instance Based Learning is a set of procedures for regression and classification


which produce a class label prediction based on resemblance to its nearest
neighbors in the training data set. These algorithms just collects all the data and
get an answer when required or queried. In simple words they are a set of
procedures for solving new problems based on the solutions of already solved
problems in the past which are similar to the current problem.

66. Keeping train and test split criteria in mind, is it good to perform
scaling before the split or after the split? 

Scaling should be done post-train and test split ideally. If the data is closely
packed, then scaling post or pre-split should not make much difference.
67. Define precision, recall and F1 Score?
The metric used to access the performance of the classification model is
Confusion Metric. Confusion Metric can be further interpreted with the
following terms:-

True Positives (TP) – These are the correctly predicted positive values. It
implies that the value of the actual class is yes and the value of the predicted
class is also yes.

True Negatives (TN) – These are the correctly predicted negative values. It
implies that the value of the actual class is no and the value of the predicted
class is also no.

False positives and false negatives, these values occur when your actual class
contradicts with the predicted class.

Now,
Recall, also known as Sensitivity, is the ratio of true positives (TP) to all observations that are actually positive.
Recall = TP/(TP+FN)

Precision is the positive predictive value: it measures how many of the positives the model predicted are actually positive.
Precision = TP/(TP+FP)

Accuracy is the most intuitive performance measure and it is simply a ratio of


correctly predicted observation to the total observations.
Accuracy = (TP+TN)/(TP+FP+FN+TN)

F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have a similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
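
A compact, illustrative check of these formulas with scikit-learn (the label vectors are made up):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), recall_score(y_true, y_pred))       # recall
print(tp / (tp + fp), precision_score(y_true, y_pred))    # precision
print(f1_score(y_true, y_pred))                           # harmonic mean of the two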

68. Plot validation score and training score with data set size on the x-axis
and another plot with model complexity on the x-axis.

For high bias in the models, the performance of the model on the validation data
set is similar to the performance on the training data set. For high variance in
the models, the performance of the model on the validation set is worse than the
performance on the training set.
69. What is Bayes’ Theorem? State at least 1 use case with respect to the
machine learning context?

Bayes’ Theorem describes the probability of an event, based on prior knowledge


of conditions that might be related to the event. For example, if cancer is related
to age, then, using Bayes’ theorem, a person’s age can be used to more
accurately assess the probability that they have cancer than can be done without
the knowledge of the person’s age.
Chain rule for Bayesian probability can be used to predict the likelihood of the
next word in the sentence.

70. What is Naive Bayes? Why is it Naive?

Naive Bayes classifiers are a series of classification algorithms that are based on
the Bayes theorem. This family of algorithm shares a common principle which
treats every pair of features independently while being classified. 
Naive Bayes is considered naive because it assumes that each attribute (feature) is independent of the others given the class. This assumed lack of dependence between attributes of the same class creates the quality of naiveness.


71. Explain how a Naive Bayes Classifier works.

Naive Bayes classifiers are a family of algorithms which are derived from the
Bayes theorem of probability. It works on the fundamental assumption that
every set of two features that is being classified is independent of each other and
every feature makes an equal and independent contribution to the outcome.

72. What do the terms prior probability and marginal likelihood in context
of Naive Bayes theorem mean?

Prior probability is the percentage of dependent binary variables in the data set.
If you are given a dataset and dependent variable is either 1 or 0 and percentage
of 1 is 65% and percentage of 0 is 35%. Then, the probability that any new
input for that variable of being 1 would be 65%.
Marginal likelihood is the denominator of the Bayes equation and it makes sure
that the posterior probability is valid by making its area 1.

73. Explain the difference between Lasso and Ridge?

Lasso(L1) and Ridge(L2) are the regularization techniques where we penalize


the coefficients to find the optimum solution. In ridge, the penalty function is
defined by the sum of the squares of the coefficients and for the Lasso, we
penalize the sum of the absolute values of the coefficients. Another type of
regularization method is ElasticNet, it is a hybrid penalizing function of both
lasso and ridge. 

74. What’s the difference between probability and likelihood?

Probability is the measure of the likelihood that an event will occur that is, what
is the certainty that a specific event will occur? Where-as a likelihood function
is a function of parameters within the parameter space that describes the
probability of obtaining the observed data.
So the fundamental difference is, Probability attaches to possible results;
likelihood attaches to hypotheses. 

75. Why would you Prune your tree?

In the context of data science or AIML, pruning refers to the process of


reducing redundant branches of a decision tree. Decision Trees are prone to
overfitting, pruning the tree helps to reduce the size and minimizes the chances
of overfitting. Pruning involves turning branches of a decision tree into leaf
nodes and removing the leaf nodes from the original branch. It serves as a tool
to perform the tradeoff.

76. Model accuracy or Model performance? Which one will you prefer and
why?

This is a trick question, one should first get a clear idea, what is Model
Performance? If Performance means speed, then it depends upon the nature of
the application, any application related to the real-time scenario will need high
speed as an important feature. Example: The best of Search Results will lose its
virtue if the Query results do not appear fast.

If "performance" hints at why accuracy is not the most important virtue – for any imbalanced data set, the F1 score, more than accuracy, will explain the business case, and when the data is imbalanced, Precision and Recall will be more important than the rest.

77. List the advantages and limitations of the Temporal Difference


Learning Method.

Temporal Difference Learning Method is a mix of Monte Carlo method and


Dynamic programming method. Some of the advantages of this method include:

1. It can learn in every step online or offline.


2. It can learn from a sequence which is not complete as well.
3. It can work in continuous environments.
4. It has lower variance compared to MC method and is more efficient
than MC method.

Limitations of TD method are:

1. It is a biased estimation.
2. It is more sensitive to initialization.
78. How would you handle an imbalanced dataset?

Sampling Techniques can help with an imbalanced dataset. There are two ways
to perform sampling, Under Sample or Over Sampling.

In Under Sampling, we reduce the size of the majority class to match minority
class thus help by improving performance w.r.t storage and run-time execution,
but it potentially discards useful information.

For Over Sampling, we upsample the Minority class and thus solve the problem
of information loss, however, we get into the trouble of having Overfitting.

There are other techniques as well –


Cluster-Based Over Sampling – In this case, the K-means clustering algorithm
is independently applied to minority and majority class instances. This is to
identify clusters in the dataset. Subsequently, each cluster is oversampled such
that all clusters of the same class have an equal number of instances and all
classes have the same size

Synthetic Minority Over-sampling Technique (SMOTE) – A subset of data


is taken from the minority class as an example and then new synthetic similar
instances are created which are then added to the original dataset. This
technique is good for Numerical data points.
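
As a rough sketch of SMOTE in code (this assumes the third-party imbalanced-learn package is installed; the class ratio is invented):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                       # roughly 900 majority vs 100 minority samples

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                   # synthetic minority samples added until the classes balance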

79. Mention some of the EDA Techniques?

Exploratory Data Analysis (EDA) helps analysts to understand the data better
and forms the foundation of better models. 

Visualization

 Univariate visualization
 Bivariate visualization
 Multivariate visualization
Missing Value Treatment – Replace missing values with Either Mean/Median

Outlier Detection – Use a boxplot to identify the distribution of outliers, then apply the IQR rule to set the boundaries for outlier treatment

Transformation – Based on the distribution, apply a transformation on the


features

Scaling the Dataset – Apply MinMax, Standard Scaler or Z Score Scaling


mechanism to scale the data.

Feature Engineering – Need of the domain, and SME knowledge helps Analyst
find derivative fields which can fetch more information about the nature of the
data

Dimensionality reduction — Helps in reducing the volume of data without


losing much information

80. Mention why feature engineering is important in model building and


list out some of the techniques used for feature engineering.

Algorithms necessitate features with some specific characteristics to work


appropriately. The data is initially in a raw form. You need to extract features
from this data before supplying it to the algorithm. This process is called feature
engineering. When you have relevant features, the complexity of the algorithms
reduces. Then, even if a non-ideal algorithm is used, results come out to be
accurate.

Feature engineering primarily has two goals:

 Prepare the suitable input data set to be compatible with the machine
learning algorithm constraints.
 Enhance the performance of machine learning models.
Some of the techniques used for feature engineering include Imputation,
Binning, Outliers Handling, Log transform, grouping operations, One-Hot
encoding, Feature split, Scaling, Extracting date.

81. Differentiate between Statistical Modeling and Machine Learning?

Machine learning models are about making accurate predictions about the
situations, like Foot Fall in restaurants, Stock-Price, etc. where-as, Statistical
models are designed for inference about the relationships between variables, as
What drives the sales in a restaurant, is it food or Ambience.

82. Differentiate between Boosting and Bagging?

Bagging and Boosting are variants of Ensemble Techniques.

Bootstrap Aggregation or bagging is a method used to reduce the variance of algorithms having very high variance. Decision trees are a particular family of classifiers which are susceptible to having high variance.

Decision trees have a lot of sensitiveness to the type of data they are trained on.
Hence generalization of results is often much more complex to achieve in them
despite very high fine-tuning. The results vary greatly if the training data is
changed in decision trees.

Hence bagging is utilised where multiple decision trees are made which are
trained on samples of the original data and the final result is the average of all
these individual models.

Boosting is the process of using an n-weak classifier system for prediction such
that every weak classifier compensates for the weaknesses of its classifiers. By
weak classifier, we imply a classifier which performs poorly on a given data
set. 
It’s evident that boosting is not an algorithm rather it’s a process. Weak
classifiers used are generally logistic regression, shallow decision trees etc.

There are many algorithms which make use of boosting processes but two of
them are mainly used: Adaboost and Gradient Boosting and XGBoost.
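A minimal bagging-versus-boosting sketch with scikit-learn follows; the dataset
and settings are purely illustrative:

# Bagging vs boosting sketch (illustrative data and parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions averaged/voted.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: a sequence of weak learners, each focusing on its predecessors' mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())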

83. What is the significance of Gamma and Regularization in SVM?

The gamma parameter defines how far the influence of a single training example
reaches: low values mean ‘far’ and high values mean ‘close’. If gamma is too
large, the radius of the area of influence of the support vectors includes only
the support vector itself, and no amount of regularization with C will be able
to prevent overfitting. If gamma is very small, the model is too constrained
and cannot capture the complexity of the data.

The regularization parameter C serves as the degree of importance given to
misclassifications, and can be used to control the trade-off with overfitting:
a large C penalizes misclassifications heavily, while a small C tolerates them
and yields a simpler, smoother decision boundary.
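A small sketch of tuning gamma and C for an RBF-kernel SVM with scikit-learn;
the grid values and toy data are only illustrative:

# Grid-searching C and gamma for an RBF SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)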

84. Define how the ROC curve works

The graphical representation of the contrast between the true positive rate and
the false positive rate at various thresholds is known as the ROC curve. It is
used as a proxy for the trade-off between true positives and false positives.
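A minimal ROC/AUC sketch with scikit-learn; the model and data below are
illustrative only:

# Computing ROC points and AUC for a simple classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]           # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)     # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))
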
85. What is the difference between a generative and discriminative model?

A generative model learns the different categories of data. On the other hand, a
discriminative model will only learn the distinctions between different
categories of data. Discriminative models perform much better than the
generative models when it comes to classification tasks.

86. What are hyperparameters and how are they different from
parameters?

A parameter is a variable that is internal to the model and whose value is
estimated from the training data. Parameters are often saved as part of the
learned model; examples include weights and biases.

A hyperparameter is a variable that is external to the model and whose value
cannot be estimated from the data. Hyperparameters are often used to control
how the model parameters are estimated, and their choice is sensitive to the
implementation. Examples include the learning rate and the number of hidden
layers.
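A short illustrative sketch of the distinction: C is a hyperparameter we set,
while coef_ and intercept_ are parameters learned from the data (toy data and
values are made up):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(C=0.5, max_iter=1000)   # C: hyperparameter (chosen, not learned)
model.fit(X, y)

print("learned weights:", model.coef_)             # parameters (estimated from data)
print("learned bias   :", model.intercept_)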

87. What is shattering a set of points? Explain VC dimension.

In order to shatter a given configuration of points, a classifier must be able,
for all possible assignments of positive and negative labels to the points, to
perfectly partition the plane such that positive points are separated from
negative points. For a configuration of n points, there are 2^n possible
assignments of positive or negative.

When choosing a classifier, we need to consider the type of data to be
classified, and this can be known from the VC dimension of the classifier. It
is defined as the cardinality of the largest set of points that the
classification algorithm (i.e. the classifier) can shatter. In order to have a
VC dimension of at least n, a classifier must be able to shatter a single given
configuration of n points.

88. What are some differences between a linked list and an array?

Arrays and Linked lists are both used to store linear data of similar types.
However, there are a few difference between them.

Array: Elements are well-indexed, making access to a specific element easier.
Linked List: Elements need to be accessed in a cumulative (sequential) manner.

Array: Operations (insertion, deletion) are faster.
Linked List: Operations take linear time, making them a bit slower.

Array: Arrays are of fixed size.
Linked List: Linked lists are dynamic and flexible.

Array: Memory is assigned at compile time.
Linked List: Memory is allocated during execution or runtime.

Array: Elements are stored consecutively.
Linked List: Elements are stored randomly (at arbitrary memory locations).

Array: Memory utilization is inefficient.
Linked List: Memory utilization is efficient.
89. What is the meshgrid() method and the contourf() method? State some
uses of both.

The meshgrid() function in NumPy takes two arguments as input: the range of
x-values in the grid and the range of y-values in the grid. The meshgrid needs
to be built before the contourf() function in matplotlib is used, which takes
many inputs: x-values, y-values, the fitted surface (contour values) to be
plotted on the grid, colours, etc.

meshgrid() is used to create a grid from 1-D arrays of x-axis inputs and y-axis
inputs to represent matrix indexing. contourf() is used to draw filled contours
using the given x-axis inputs, y-axis inputs, contour values, colours, etc.
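A short meshgrid + contourf sketch; the function being plotted is arbitrary and
only serves to show the calling pattern:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)          # 2-D coordinate grids built from the 1-D ranges

Z = np.sin(X) * np.cos(Y)         # any surface defined on the grid

plt.contourf(X, Y, Z, levels=20, cmap="viridis")  # filled contours of Z over the grid
plt.colorbar()
plt.show()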

90. Describe a hash table.

Hashing is a technique for identifying unique objects from a group of similar
objects. A hash function converts a large key into a smaller key, and the
resulting values are stored in a data structure known as a hash table.

91. List the advantages and disadvantages of using neural networks .

Advantages:

Information is stored across the entire network rather than in a single
database, so the network can still give good accuracy even with incomplete
information. A neural network also has parallel processing ability and
distributed memory.

Disadvantages:

Neural networks require processors capable of parallel processing. Their
unexplained (black-box) functioning is also an issue, as it reduces trust in
the network in situations where we need to explain the decision it made. The
training duration is largely unknown in advance: we can only tell that training
has finished by watching the error value, and even then it may not give us
optimal results.

92. You have to train a 12GB dataset using a neural network with a
machine which has only 3GB RAM. How would you go about it?

We can use memory-mapped NumPy arrays to solve this issue. A memory-mapped
array maps the complete dataset on disk without loading it entirely into RAM.
We can then index into the array to divide the data into batches, and feed each
batch to the neural network in turn, keeping the batch size reasonable.
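A hedged sketch of batch-wise loading with a memory-mapped NumPy array; the
file name, shape, dtype and the training call are hypothetical placeholders:

import numpy as np

# 'big_dataset.dat', the dtype and the shape are placeholders for a real on-disk dataset.
data = np.memmap("big_dataset.dat", dtype="float32",
                 mode="r", shape=(3_000_000, 1000))     # stays on disk, not in RAM

batch_size = 1024
for start in range(0, data.shape[0], batch_size):
    batch = np.asarray(data[start:start + batch_size])  # only this slice is read into memory
    # model.train_on_batch(batch, ...)                   # hypothetical training call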
93. Write a simple code to binarize data.

Conversion of data into binary values on the basis of certain threshold is known
as binarizing of data. Values below the threshold are set to 0 and those above
the threshold are set to 1 which is useful for feature engineering.

Code:

from sklearn.preprocessing import Binarizer
import pandas
import numpy

names_list = ['Alaska', 'Pratyush', 'Pierce', 'Sandra', 'Soundarya', 'Meredith',
              'Richard', 'Jackson', 'Tom', 'Joe']

# url is assumed to point to a CSV file whose columns match names_list
data_frame = pandas.read_csv(url, names=names_list)
array = data_frame.values

# Splitting the array into input (first 7 columns) and output (8th column)
A = array[:, 0:7]
B = array[:, 7]

binarizer = Binarizer(threshold=0.0).fit(A)
binaryA = binarizer.transform(A)

numpy.set_printoptions(precision=5)
print(binaryA[0:7, :])

94. What is an Array?

An array is a collection of similar items stored in a contiguous manner. Arrays
are an intuitive concept, since the need to group similar objects together
arises in our day-to-day lives, and arrays satisfy that need. How are they
stored in memory? Arrays consume blocks of memory, where each element in the
array consumes one unit of memory. The size of the unit depends on the type of
data being used. For example, if the data type of the elements is int, 4 bytes
will be used to store each element; for character data, 1 byte will be used.
This is implementation specific, and the above units may change from computer
to computer.

Example:

fruits = ['apple', 'banana', 'pineapple']

In the above case, fruits is a list that comprises three fruits. To access them
individually, we use their indexes. Python and C are 0-indexed languages, that
is, the first index is 0. MATLAB, on the contrary, starts from 1 and is thus a
1-indexed language.

95. What are the advantages and disadvantages of using an Array?

Advantages:

1. Random access is enabled
2. Saves memory
3. Cache friendly
4. Predictable compile timing
5. Helps in re-usability of code

Disadvantages:

1. Addition and deletion of records is time consuming even though we get the
element of interest immediately through random access. This is because the
elements need to be reordered after insertion or deletion.
2. If contiguous blocks of memory are not available, there is an overhead on
the CPU to search for the most optimal contiguous location available for the
requirement.

Now that we know what arrays are, we shall understand them in detail by solving
some interview questions. Before that, let us see the functions that Python as
a language provides for arrays, also known as lists:

append() – adds an element at the end of the list
copy() – returns a copy of the list
reverse() – reverses the elements of the list
sort() – sorts the elements in ascending order by default

96. What are Lists in Python?

Lists are an effective data structure provided in Python, with various
functionalities associated with them. Let us consider the scenario where we
want to copy a list to another list. If the same operation had to be done in
the C programming language, we would have to write our own function to
implement it.

On the contrary, Python provides us with a function called copy. We can copy a
list to another just by calling the copy function.
new_list = old_list.copy()
We need to be careful while using the function. copy() is a shallow copy
function, that is, it only stores the references of the original list in the new list.
If the given argument is a compound data structure like a list then python
creates another object of the same type (in this case, a new list) but for
everything inside old list, only their reference is copied. Essentially, the new list
consists of references to the elements of the older list.

Hence, upon changing the original list, the new list values also change. This can
be dangerous in many applications. Therefore, Python provides us with another
functionality called as deepcopy.  Intuitively, we may consider that deepcopy()
would follow the same paradigm, and the only difference would be that for
each element we will recursively call deepcopy. Practically, this is not the case.

deepcopy() preserves the graphical structure of the original compound data. Let
us understand this better with the help of an example:

from copy import deepcopy

a = [1,2]

b = [a,a] # there's only 1 object a

c = deepcopy(b)

# check the result by executing these lines

c[0] is a # return False, a new object a' is created

c[0] is c[1] # return True, c is [a',a'] not [a',a'']


This is the tricky part, during the process of deepcopy() a hashtable
implemented as a dictionary in python is used to map: old_object reference onto
new_object reference. 

Therefore, this prevents unnecessary duplicates and thus preserves the structure
of the copied compound data structure. Thus, in this case, c[0] is not equal to a,
as internally their addresses are different.

Normal copy

>>> a = [[1, 2, 3], [4, 5, 6]]

>>> b = list(a)

>>> a

[[1, 2, 3], [4, 5, 6]]

>>> b

[[1, 2, 3], [4, 5, 6]]

>>> a[0][1] = 10

>>> a

[[1, 10, 3], [4, 5, 6]]

>>> b # b changes too -> Not a deepcopy.

[[1, 10, 3], [4, 5, 6]]

Deep copy
>>> import copy

>>> b = copy.deepcopy(a)

>>> a

[[1, 10, 3], [4, 5, 6]]

>>> b

[[1, 10, 3], [4, 5, 6]]

>>> a[0][1] = 9

>>> a

[[1, 9, 3], [4, 5, 6]]

>>> b # b doesn't change -> Deep Copy

[[1, 10, 3], [4, 5, 6]]


Now that we have understood the concept of lists, let us solve interview
questions to get better exposure on the same.

97. Given an array of integers where each element represents the max
number of steps that can be made forward from that element. The task is to
find the minimum number of jumps to reach the end of the array (starting
from the first element). If an element is 0, then cannot move through that
element.

Solution: This problem is famously called the minimum-jumps (end of array)
problem. We want to determine the minimum number of jumps required in order to
reach the end. Each element in the array represents the maximum number of jumps
that can be taken from that position.

Let us understand how to approach the problem initially.

We need to reach the end, so let us keep a count that tells us how near we are
to the end. Consider the array A = [1, 2, 3, 1, 1].

In the above example we can go from

1 -> 2 -> 3 -> 1 -> 1 : 4 jumps

1 -> 2 -> 1 -> 1 : 3 jumps

1 -> 2 -> 3 -> 1 : 3 jumps

Hence, we have a fair idea of the problem. Let us come up with a logic for the
same.

Let us start from the end and move backwards, as that makes more sense
intuitively. We will use variables right and prev_r (denoting the previous
right) to keep track of the jumps.

Initially, right = prev_r = index of the last element. For each element we
consider its index plus the number of jumps possible from it; if that sum
reaches at least prev_r, we can jump from it, and we keep the leftmost such
element. Try it out using pen and paper first; the logic will seem very
straightforward to implement. Later, implement it on your own and then verify
with the result.

def min_jmp(arr):
    n = len(arr)
    right = prev_r = n - 1
    count = 0

    # We start from the rightmost index and traverse the array to find the
    # leftmost index from which we can reach index 'right'.
    while True:
        for j in range(prev_r - 1, -1, -1):
            if j + arr[j] >= prev_r:
                right = j
        if prev_r != right:
            prev_r = right
        else:
            break
        count += 1

    return count if right == 0 else -1


# Enter the elements separated by a space
arr = list(map(int, input().split()))
print(min_jmp(arr))

98. Given a string S consisting only ‘a’s and ‘b’s, print the last index of the
‘b’ present in it.

When we are given a string of a’s and b’s, we can immediately find the first
location at which a character occurs. Therefore, to find the last occurrence of
a character, we reverse the string and find the first occurrence there, which
is equivalent to the last occurrence in the original string.

Here the input is given as a string, so we begin by splitting it into
characters element-wise using the function split. Then we reverse the array,
find the first occurrence of ‘b’, and compute the original index as
len – position – 1, where position is the index in the reversed array.

def split(word):
    return [char for char in word]

a = input()
a = split(a)
a_rev = a[::-1]
pos = -1

for i in range(len(a_rev)):
    if a_rev[i] == 'b':
        pos = len(a_rev) - i - 1
        print(pos)
        break
    else:
        continue

if pos == -1:
    print(-1)

99. Rotate the elements of an array by d positions to the left.

Let us initially look at an example.

A = [1,2,3,4,5]

A << 2

[3,4,5,1,2]

A << 3

[4,5,1,2,3]

There exists a pattern here: the first d elements are interchanged with the
last n-d elements. Therefore we could just swap the blocks. Correct? But what
if the size of the array is huge, say 10000 elements? There are chances of
memory error, run-time error, etc., so we do it more carefully: we rotate the
elements one by one in order to prevent the above errors in the case of large
arrays.

# Rotate all the elements left by 1 position
def rot_left_once(arr):
    n = len(arr)
    tmp = arr[0]
    for i in range(n - 1):   # indices [0, n-2]
        arr[i] = arr[i + 1]
    arr[n - 1] = tmp

# Use the above function to repeat the process d times.
def rot_left(arr, d):
    for i in range(d):
        rot_left_once(arr)

arr = list(map(int, input().split()))
rot = int(input())
rot_left(arr, rot)
for i in range(len(arr)):
    print(arr[i], end=' ')

100. Water Trapping Problem

Given an array arr[] of N non-negative integers which represents the height of
blocks at index i, where the width of each block is 1, compute how much water
can be trapped in between the blocks after raining.

# Structure is like below:
# | |
# |_|
# answer: we can trap two units of water.

Solution: We are given an array where each element denotes the height of a
block. One unit of height can hold one unit of water, provided there is space
between two taller elements to store it. Therefore, we need to find all such
positions that can store water, taking care of two conditions:

1. There should be no double counting of the water stored
2. Water should not overflow

Therefore, let us start with the extreme elements and move towards the centre.

n = int(input())
arr = [int(i) for i in input().split()]

# left[i] keeps the maximum height seen so far from the left,
# right[i] keeps the maximum height from position i to the right end.
left, right = [arr[0]], [0] * n
right[n - 1] = arr[-1]   # rightmost element

for elem in arr[1:]:
    left.append(max(left[-1], elem))

for i in range(len(arr) - 2, -1, -1):
    right[i] = max(arr[i], right[i + 1])

water = 0
# Once we have the arrays left and right, the water trapped at each position is
# the difference between the lower of its bounding walls and the block itself.
for i in range(1, n - 1):
    add_water = min(left[i - 1], right[i]) - arr[i]
    if add_water > 0:
        water += add_water

print(water)

101. Explain Eigenvectors and Eigenvalues.

Ans. Eigenvectors are helpful for understanding linear transformations. They
find their prime usage in the creation of covariance and correlation matrices
in data science.

Simply put, eigenvectors are the directions along which a linear transformation
acts purely by stretching, flipping or compressing. Eigenvalues are the
magnitudes of the linear transformation along each such eigenvector direction.
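A small NumPy sketch of an eigen-decomposition; the 2x2 matrix is an arbitrary
illustrative example:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("eigenvalues :", eigenvalues)        # 3 and 1 for this matrix
print("eigenvectors:\n", eigenvectors)     # columns are the eigenvector directions

# Check the defining property A v = lambda v for the first pair.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))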

102. How would you define the number of clusters in a clustering algorithm?

Ans. The number of clusters can be determined by computing the silhouette
score. Often we aim to get some inferences from data using clustering
techniques so that we can have a broader picture of the number of classes
represented by the data. In this case, the silhouette score helps us determine
the number of cluster centres to cluster our data along.

Another technique that can be used is the elbow method.
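A sketch of choosing k with the silhouette score (and inertia for the elbow
method); the blob dataset and the range of k values are illustrative:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # toy data with 4 true clusters

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "silhouette:", round(silhouette_score(X, km.labels_), 3),
          "inertia:", round(km.inertia_, 1))
# Pick the k with the highest silhouette score, or the 'elbow' in the inertia curve.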

103. What are the performance metrics that can be used to estimate the
efficiency of a linear regression model?

Ans. The performance metrics that are used in this case are:

1. Mean Squared Error
2. R2 score
3. Adjusted R2 score
4. Mean Absolute Error
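A small sketch of the listed regression metrics with scikit-learn; the toy
values and the assumed number of predictors p are made up:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R2 penalizes adding predictors: n = samples, p = number of predictors (assumed).
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(mse, mae, r2, adj_r2)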

104. What is the default method of splitting in decision trees?

The default method of splitting in decision trees is the Gini Index. Gini Index is
the measure of impurity of a particular node.

This can be changed by making changes to classifier parameters. 


105. How is p-value useful?

Ans. The p-value is the probability of observing results at least as extreme as
those measured, assuming the null hypothesis is true. It gives us the
statistical significance of our results: a small p-value indicates strong
evidence against the null hypothesis.

106. Can logistic regression be used for classes more than 2?

Ans. Plain logistic regression is a binary classifier, so it cannot directly
handle more than 2 classes. It can, however, be extended to multi-class
problems using one-vs-rest or multinomial (softmax) formulations; alternatively,
algorithms like Decision Trees and Naïve Bayes classifiers are well suited to
multi-class classification.

107. What are the hyperparameters of a logistic regression model?

Ans. The penalty, the solver and the regularization strength C are the tunable
hyperparameters of a Logistic Regression classifier. These can be specified
explicitly with value grids in Grid Search to hyper-tune a logistic classifier.

108. Name a few hyper-parameters of decision trees?

Ans. The most important hyperparameters which one can tune in decision trees
are:

1. Splitting criterion (e.g. gini or entropy)
2. min_samples_leaf
3. min_samples_split
4. max_depth

109. How to deal with multicollinearity?

Ans. Multicollinearity can be dealt with by the following steps:

 Remove highly correlated predictors from the model.
 Use Partial Least Squares (PLS) regression or Principal Components
Analysis.
110. What is Heteroscedasticity?

Ans. Heteroscedasticity is a situation in which the variance of a variable is
unequal across the range of values of the predictor variable.

It should be accounted for in regression, as it violates the constant-variance
assumption and makes the estimated standard errors unreliable.

111. Is ARIMA model a good fit for every time series problem?

Ans. No, the ARIMA model is not suitable for every type of time series problem.
There are situations where the ARMA model and others also come in handy.

ARIMA is best when different standard temporal structures need to be captured
in the time series data.

112. How do you deal with the class imbalance in a classification problem?

Ans. Class imbalance can be dealt with in the following ways:

1. Using class weights


2. Using Sampling
3. Using SMOTE
4. Choosing loss functions like Focal Loss

113. What is the role of cross-validation?

Ans. Cross-validation is a technique used to assess how well a machine learning
model generalizes, by repeatedly training and evaluating it on different
samples of the same data. The data is split into several equally sized parts
(folds); in each round a different fold is held out as the test set while the
remaining folds are used as the training set, and the scores are averaged.
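A minimal k-fold cross-validation sketch with scikit-learn; the model and data
are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one score per fold, plus the averaged estimate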

114. What is a voting model?


Ans. A voting model is an ensemble model which combines several classifiers
but to produce the final result, in case of a classification-based model, takes into
account, the classification of a certain data point of all the models and picks the
most vouched/voted/generated option from all the given classes in the target
column.

115. How to deal with very few data samples? Is it possible to make a model
out of it?

Ans. Yes. If very few data samples are available, we can make use of
oversampling (or data augmentation) to produce new data points.

116. What are the hyperparameters of an SVM?

Ans. The gamma value, c value and the type of kernel are the hyperparameters
of an SVM model.

117. What is Pandas Profiling?

Ans. Pandas profiling is a step to find the effective number of usable data
points. It gives us statistics on NULL values and usable values and thus makes
variable selection and data selection for building models in the preprocessing
phase very effective.

118. What impact does correlation have on PCA?

Ans. PCA actually relies on correlation: when variables are correlated, a few
principal components can capture most of the variance, so PCA compresses the
data well. If the variables are uncorrelated, PCA offers little dimensionality
reduction, because the variance is already spread across independent
directions.

119. How is PCA different from LDA?

Ans. PCA is unsupervised. LDA is supervised.

PCA takes into consideration the variance. LDA takes into account the
distribution of classes.

120. What distance metrics can be used in KNN?

Ans. The following distance metrics can be used in KNN:

 Euclidean
 Manhattan
 Minkowski
 Tanimoto
 Jaccard
 Mahalanobis

121. Which metrics can be used to measure correlation of categorical data?

Ans. Chi square test can be used for doing so. It gives the measure of
correlation between categorical predictors.

122. Which algorithm can be used in value imputation in both categorical


and continuous categories of data?

Ans. KNN is an algorithm that can be used for imputation of both categorical
and continuous variables.

123. When should ridge regression be preferred over lasso?

Ans. We should use ridge regression when we want to use all predictors and not
remove any as it reduces the coefficient values but does not nullify them.

124. Which algorithms can be used for important variable selection?

Ans. Random Forest, Xgboost and plot variable importance charts can be used
for variable selection.

125. What ensemble technique is used by Random forests?


Ans. Bagging is the technique used by Random Forests. Random forests are a
collection of trees which work on sampled data from the original dataset with
the final prediction being a voted average of all trees.

126. What ensemble technique is used by gradient boosting trees?

Ans. Boosting is the technique used by GBM.

127. If we have a high bias error what does it mean? How to treat it?

Ans. A high bias error means that the model we are using is ignoring the
important trends in the data and is underfitting.

To reduce underfitting:

 We need to increase the complexity of the model
 The number of features needs to be increased

Sometimes it also gives the impression that the data is noisy. Hence noise
should be removed from the data so that the most important signals can be found
by the model to make effective predictions.

Increasing the number of epochs also increases the duration of training of the
model, and it is helpful in reducing the error.

128. Which type of sampling is better for a classification model and why?

Ans. Stratified sampling is better in the case of classification problems
because it takes into account the balance of classes in the train and test
sets: the proportion of classes is maintained, and hence the model performs
better. In the case of random sampling, the data is divided into two parts
without taking into consideration the class balance in the train and test sets.
Hence some classes might be present only in the train set or only in the
validation set, and the resulting model performs poorly.
129. What is a good metric for measuring the level of multicollinearity?

Ans. VIF, or 1/tolerance, is a good measure of multicollinearity in models. VIF
measures how much the variance of an estimated regression coefficient is
inflated because of collinearity with the other predictors, so the higher the
VIF value, the greater the multicollinearity amongst the predictors.

A rule of thumb for interpreting the variance inflation factor:

 1 = not correlated.
 Between 1 and 5 = moderately correlated.
 Greater than 5 = highly correlated.
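A short VIF sketch with statsmodels; the DataFrame below is synthetic and built
so that x1 and x2 are deliberately correlated:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 * 0.9 + rng.normal(scale=0.1, size=200),  # highly correlated with x1
                   "x3": rng.normal(size=200)})

X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 2))
# x1 and x2 get large VIFs (> 5), flagging multicollinearity; x3 stays near 1.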

130. When can a categorical value be treated as a continuous variable and
what effect does it have when done so?

Ans. A categorical predictor can be treated as a continuous one when the nature
of data points it represents is ordinal. If the predictor variable is having ordinal
data then it can be treated as continuous and its inclusion in the model increases
the performance of the model.

131. What is the role of maximum likelihood in logistic regression.

Ans. The maximum likelihood equation helps in the estimation of the most
probable values of the predictor variable coefficients, producing results which
are the most likely or most probable and are quite close to the true values.

132. Which distance do we measure in the case of KNN?

Ans. Euclidean distance is most commonly measured in KNN for the determination
of the nearest neighbours when the features are continuous; Hamming distance is
used for categorical features. K-means also uses Euclidean distance.

133. What is a pipeline?


Ans. A pipeline is a sophisticated way of writing software such that each
intended action while building a model can be serialized and the process calls
the individual functions for the individual tasks. The tasks are carried out in
sequence for a given sequence of data points and the entire process can be run
onto n threads by use of composite estimators in scikit learn.

134. Which sampling technique is most suitable when working with time-
series data?

Ans. We can use custom iterative sampling such that we continuously add samples
to the train set. We should only keep in mind that the sample used for
validation should be added to the next train set, and a new sample is used for
validation.

135. What are the benefits of pruning?

Ans. Pruning helps in the following:

1. Reduces overfitting
2. Shortens the size of the tree
3. Reduces complexity of the model
4. Increases bias

136. What is normal distribution?

Ans. The distribution having the below properties is called normal distribution. 

 The mean, mode and median  are all equal.


 The curve is symmetric at the center (i.e. around the mean, μ).
 Exactly half of the values are to the left of center and exactly half the
values are to the right.
 The total area under the curve is 1.

137. What is the 68 per cent rule in normal distribution?


Ans. The normal distribution is a bell-shaped curve with most of the data
points around the mean. Approximately 68 per cent of the data falls within one
standard deviation of the mean, since the distribution is symmetric,
bell-shaped and has no skewness.

138. What is a chi-square test?

Ans. A chi-square goodness-of-fit test determines if sample data matches a
population.

A chi-square test for independence compares two variables in a contingency
table to see if they are related.

A very small chi-square test statistic implies that the observed data fits the
expected data extremely well.
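A small chi-square test of independence with SciPy; the contingency table below
is made up purely for illustration:

from scipy.stats import chi2_contingency

# Rows: two groups, columns: three category counts (illustrative values).
table = [[30, 10, 20],
         [35, 5, 25]]

chi2, p, dof, expected = chi2_contingency(table)
print("chi2 =", round(chi2, 3), "p-value =", round(p, 3), "dof =", dof)
# A large p-value suggests the two categorical variables are independent.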

139. What is a random variable?

Ans. A random variable is a set of possible values from a random experiment.

Example: tossing a coin, we could get Heads or Tails; rolling a die, we get one
of 6 values.

140. What is the degree of freedom?

Ans. It is the number of independent values or quantities which can be assigned


to a statistical distribution. It is used in Hypothesis testing and chi-square test.

141. Which kind of recommendation system is used by amazon to


recommend similar items?

Ans. Amazon uses a collaborative filtering algorithm for the recommendation of
similar items, notably item-to-item collaborative filtering, which maps
similarities between items based on users’ purchase and browsing behaviour.
142. What is a false positive?

Ans. It is a test result which wrongly indicates that a particular condition or
attribute is present.

Example – “Stress testing, a routine diagnostic tool used in detecting heart
disease, results in a significant number of false positives in women.”

143. What is a false negative?

Ans. It is a test result which wrongly indicates that a particular condition or
attribute is absent.

Example – “It’s possible to have a false negative – the test says you aren’t
pregnant when you are.”

144. What is the error term composed of in regression?

Ans. In regression, the error is the sum of the bias error, the variance error
and the irreducible error. Bias and variance error can be reduced but not the
irreducible error.

145. Which performance metric is better R2 or adjusted R2?

Ans. Adjusted R2, because it accounts for the number of predictors: it only
increases when a newly added predictor genuinely improves the model. Plain R2
never decreases when more predictors are added, so it can show an apparent
improvement simply because the number of predictors has increased.

146. What’s the difference between Type I and Type II error?

Type I and Type II errors in machine learning refer to false results. A Type I
error is equivalent to a false positive, while a Type II error is equivalent to
a false negative. In a Type I error, a null hypothesis which ought to be
accepted gets rejected. Conversely, in a Type II error, a hypothesis gets
accepted (fails to be rejected) when it should have been rejected.
147. What do you understand by L1 and L2 regularization?

L2 regularization: It tends to spread the error among all the terms, shrinking
weights smoothly towards zero. L2 corresponds to a Gaussian prior on the terms.

L1 regularization: It is more binary/sparse, with many weights being driven
exactly to zero. L1 corresponds to placing a Laplace prior on the terms.
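A small sketch contrasting L2 (Ridge) and L1 (Lasso) regularization in linear
regression; the synthetic data and alpha values are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients, rarely to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives many coefficients to exactly zero (sparse)

print("ridge zero coefs:", np.sum(ridge.coef_ == 0))
print("lasso zero coefs:", np.sum(lasso.coef_ == 0))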

148. Which one is better, Naive Bayes Algorithm or Decision Trees?

Although it depends on the problem you are solving, but some general
advantages are following:

Naive Bayes:

 Work well with small dataset compared to DT which need more data
 Lesser overfitting
 Smaller in size and faster in processing

Decision Trees:

 Decision Trees are very flexible, easy to understand, and easy to debug
 No preprocessing or transformation of features required
 Prone to overfitting but you can use pruning or Random forests to
avoid that.

149. What do you mean by the ROC curve?

Receiver operating characteristic (ROC) curve: the ROC curve illustrates the
diagnostic ability of a binary classifier. It is created by plotting the True
Positive rate against the False Positive rate at various threshold settings.
The summary metric of the ROC curve is the AUC (area under the curve): the
higher the area under the curve, the better the prediction power of the model.

150. What do you mean by AUC curve?

AUC (area under curve). Higher the area under the curve, better the prediction
power of the model.

151. What is log likelihood in logistic regression?

It is the sum of the likelihood residuals. At record level, the natural log of the
error (residual) is calculated for each record, multiplied by minus one, and those
values are totaled. That total is then used as the basis for deviance (2 x ll) and
likelihood (exp(ll)).

The same calculation can be applied to a naive model that assumes absolutely
no predictive power, and a saturated model assuming perfect predictions.

The likelihood values are used to compare different models, while the deviances
(test, naive, and saturated) can be used to determine the predictive power and
accuracy. The apparent accuracy of a logistic regression model on the
development data set is generally optimistic, and it will usually drop once the
model is applied to another data set.

152. How would you evaluate a logistic regression model?

Model Evaluation is a very important part in any analysis to answer the


following questions,

How well does the model fit the data? Which predictors are most important? Are
the predictions accurate?

The following are the criteria to assess the model performance:

1. Akaike Information Criteria (AIC): In simple terms, AIC estimates the


relative amount of information lost by a given model. So the less information
lost the higher the quality of the model. Therefore, we always prefer models
with minimum AIC.

2. Receiver operating characteristics (ROC curve): ROC curve illustrates the


diagnostic ability of a binary classifier. It is calculated/ created by plotting True
Positive against False Positive at various threshold settings. The performance
metric of ROC curve is AUC (area under curve). Higher the area under the
curve, better the prediction power of the model.

3. Confusion Matrix: In order to find out how well the model does in
predicting the target variable, we use a confusion matrix/ classification rate. It is
nothing but a tabular representation of actual Vs predicted values which helps
us to find the accuracy of the model.

153. What are the advantages of SVM algorithms?

SVM algorithms have advantages mainly in terms of complexity. First, note that
both Logistic Regression and SVM can form non-linear decision surfaces and can
be coupled with the kernel trick. If Logistic Regression can be coupled with a
kernel, then why use SVM?

● SVM is found to have better performance practically in most cases.

● SVM is computationally cheaper, O(N^2*K) where K is the number of support
vectors (support vectors are those points that lie on the class margin),
whereas kernelized logistic regression is O(N^3).

● The classifier in SVM depends only on a subset of points. Since we need to
maximize the distance between the closest points of the two classes (aka the
margin), we need to care about only a subset of points, unlike logistic
regression.

154. Why does XGBoost perform better than SVM?

First reason is that XGBoost is an ensemble method that uses many trees to
make a decision so it gains power by repeating itself.

SVM is a linear separator, when data is not linearly separable SVM needs a
Kernel to project the data into a space where it can separate it, there lies its
greatest strength and weakness, by being able to project data into a high
dimensional space SVM can find a linear separation for almost any data but at
the same time it needs to use a Kernel and we can argue that there’s not a
perfect kernel for every dataset.

155. What is the difference between SVM Rank and SVR (Support Vector
Regression)?

One is used for ranking and the other is used for regression.

There is a crucial difference between regression and ranking. In regression, the


absolute value is crucial. A real number is predicted.

In ranking, the only thing of concern is the ordering of a set of examples. We


only want to know which example has the highest rank, which one has the
second-highest, and so on. From the data, we only know that example 1 should
be ranked higher than example 2, which in turn should be ranked higher than
example 3, and so on. We do not know by how much example 1 is ranked higher
than example 2, or whether this difference is bigger than the difference between
examples 2 and 3.
156. What is the difference between the normal soft margin SVM and SVM
with a linear kernel?

Hard-margin

You have the basic SVM – hard margin. This assumes that data is very well
behaved, and you can find a perfect classifier – which will have 0 error on train
data.

Soft-margin

Data is usually not well behaved, so SVM hard margins may not have a solution
at all. So we allow for a little bit of error on some points. So the training error
will not be 0, but average error over all points is minimized.

Kernels

The above assumes that the best classifier is a straight line. But what if it
is not a straight line (e.g. it is a circle: inside the circle is one class,
outside is another)? If we are able to map the data into higher dimensions, the
higher dimension may give us a linear separator (a straight line or hyperplane).

157. How is linear classifier relevant to SVM?

An SVM is a type of linear classifier. If you don’t mess with kernels, it’s
arguably the most simple type of linear classifier.

Linear classifiers learn linear functions from your data that map your input to
scores like so: scores = Wx + b, where W is a matrix of learned weights, b is a
learned bias vector that shifts your scores, and x is your input data. This
type of function may look familiar to you if you remember y = mx + b from high
school.

A typical SVM loss function (the function that tells you how good your
calculated scores are in relation to the correct labels) would be hinge loss.
It takes the form: Loss = sum over all incorrect classes of
max(0, score – score(correct class) + 1).

158. What are the advantages of using a naive Bayes for classification?

 Very simple, easy to implement and fast.


 If the NB conditional independence assumption holds, then it will
converge quicker than discriminative models like logistic regression.
 Even if the NB assumption doesn’t hold, it works great in practice.
 Need less training data.
 Highly scalable. It scales linearly with the number of predictors and
data points.
 Can be used for both binary and multi-class classification problems.
 Can make probabilistic predictions.
 Handles continuous and discrete data.
 Not sensitive to irrelevant features.

159. Are Gaussian Naive Bayes the same as binomial Naive Bayes?

Binomial Naive Bayes: It assumes that all our features are binary such that they
take only two values. Means 0s can represent “word does not occur in the
document” and 1s as “word occurs in the document”.

Gaussian Naive Bayes: Because of the assumption of the normal distribution,


Gaussian Naive Bayes is used in cases when all our features are continuous. For
example in Iris dataset features are sepal width, petal width, sepal length, petal
length. So its features can have different values in the data set as width and
length can vary. We can’t represent features in terms of their occurrences. This
means data is continuous. Hence we use Gaussian Naive Bayes here.

160. What is the difference between the Naive Bayes Classifier and the
Bayes classifier?

Naive Bayes assumes conditional independence: P(X|Y, Z) = P(X|Z), i.e. each
feature is independent of the others given the class. More general Bayes Nets
(sometimes called Bayesian Belief Networks) allow the user to specify which
attributes are, in fact, conditionally independent.

For the Bayesian network as a classifier, the features are selected based on some
scoring functions like Bayesian scoring function and minimal description
length(the two are equivalent in theory to each other given that there is enough
training data). The scoring functions mainly restrict the structure (connections
and directions) and the parameters(likelihood) using the data. After the structure
has been learned the class is only determined by the nodes in the Markov
blanket(its parents, its children, and the parents of its children), and all variables
given the Markov blanket are discarded.

161. In what real world applications is Naive Bayes classifier used?

Some of real world examples are as given below

 To mark an email as spam, or not spam?


 Classify a news article about technology, politics, or sports?
 Check a piece of text expressing positive emotions, or negative
emotions?
 Also used for face recognition software
162. Is naive Bayes supervised or unsupervised?

First, Naive Bayes is not one algorithm but a family of algorithms that
inherits the following attributes:

1. Discriminant functions
2. Probabilistic generative models
3. Bayes’ theorem
4. Naive assumptions of independence and equal importance of feature vectors

Moreover, it is a special type of Supervised Learning algorithm that can do
simultaneous multi-class predictions (as depicted by standing topics in many
news apps).

Since these are generative models, based on the assumption made about the
distribution of each feature they may be further classified as Gaussian Naive
Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.

163. What do you understand by selection bias in Machine Learning?

Selection bias stands for the bias which was introduced by the selection of
individuals, groups or data for doing analysis in a way that the proper
randomization is not achieved. It ensures that the sample obtained is not
representative of the population intended to be analyzed and sometimes it is
referred to as the selection effect. This is the part of distortion of a statistical
analysis which results from the method of collecting samples. If you don’t take
the  selection bias into the account then some conclusions of the study may not
be accurate.

The types of selection bias includes:

1. Sampling bias: It is a systematic error due to a non-random sample of


a population causing some members of the population to be less likely
to be included than others resulting in a biased sample.
2. Time interval: A trial may be terminated early at an extreme value
(often for ethical reasons), but the extreme value is likely to be reached
by the variable with the largest variance, even if all variables have a
similar mean.
3. Data: When specific subsets of data are chosen to support a conclusion
or rejection of bad data on arbitrary grounds, instead of according to
previously stated or generally agreed criteria.
4. Attrition: Attrition bias is a kind of selection bias caused by attrition
(loss of participants) discounting trial subjects/tests that did not run to
completion.

164. What do you understand by Precision and Recall?

In pattern recognition, information retrieval and classification, precision
(also called positive predictive value) is the fraction of relevant instances
among the retrieved instances.

Recall, also known as sensitivity, is the fraction of the total amount of
relevant instances which were actually retrieved.

Both precision and recall are therefore based on an understanding and measure
of relevance.

165. What Are the Three Stages of Building a Model in Machine Learning?

To build a model in machine learning, you need to follow few steps:

1. Understand the business model


2. Data acquisitions
3. Data cleaning
4. Exploratory data analysis
5. Use machine learning algorithms to make a model
6. Use unknown dataset to check the accuracy of the model

166. How Do You Design an Email Spam Filter in Machine Learning?

1. Understand the business model: Try to understand the related attributes


for the spam mail
2. Data acquisitions: Collect the spam mail to read the hidden pattern
from them
3. Data cleaning: Clean the unstructured or semi structured data
4. Exploratory data analysis: Use statistical concepts to understand the
data like spread, outlier, etc.
5. Use machine learning algorithms to make a model: can use naive bayes
or some other algorithms as well
6. Use unknown dataset to check the accuracy of the model

167. What is the difference between Entropy and Information Gain?

Information gain is based on the decrease in entropy after a dataset is split
on an attribute. Constructing a decision tree is all about finding the
attribute that returns the highest information gain (i.e., the most homogeneous
branches). Step 1 is to calculate the entropy of the target; the gain of a
split is then the target entropy minus the weighted entropy of the resulting
branches.
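A tiny entropy and information-gain calculation sketch with NumPy; the target
labels and the binary split attribute below are made up:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([1, 1, 1, 0, 0, 1, 0, 1])          # made-up target
split = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # made-up binary attribute to split on

parent = entropy(y)
children = sum((split == v).mean() * entropy(y[split == v]) for v in np.unique(split))
print("information gain:", parent - children)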

168. What are collinearity and multicollinearity?

Collinearity is a linear association between two


predictors. Multicollinearity is a situation where two or more predictors are
highly linearly related.

169. What is Kernel SVM?

A kernel SVM is an SVM that uses the kernel trick: a kernel function (such as
the RBF, polynomial or sigmoid kernel) computes inner products between data
points as if they had been mapped into a higher-dimensional feature space,
without ever computing that mapping explicitly. This lets the SVM learn
non-linear decision boundaries in the original input space while still solving
a linear margin-maximization problem in the implicit feature space. The
advantages discussed in question 153 (practical performance, dependence on only
the support vectors) carry over to the kernelized case.

170. What is the process of carrying out a linear regression?

Linear regression analysis consists of more than just fitting a linear line
through a cloud of data points. It consists of 3 stages:

(1) analyzing the correlation and directionality of the data,

(2) estimating the model, i.e., fitting the line, and

(3) evaluating the validity and usefulness of the model.


FAQ:

1. How do I start a career in machine learning?


There is no fixed or definitive guide through which you can start your machine
learning career. The first step is to understand the basic principles of the subject
and learn a few key concepts such as algorithms and data structures, coding
capabilities, calculus, linear algebra, statistics. The next step would be to take
up a ML course, or read the top books for self-learning. You can also work on
projects to get a hands-on experience.

2. What is the best way to learn machine learning?

Any way that suits your style of learning can be considered as the best way to
learn. Different people may enjoy different methods. Some of the common ways
would be through taking up a Machine Learning Course, watching YouTube
videos, reading blogs with relevant topics, read books which can help you self-
learn.

3. What degree do you need for machine learning?

Most hiring companies will look for a masters or doctoral degree in the relevant
domain. The field of study includes computer science or mathematics. But
having the necessary skills even without the degree can help you land a ML job
too.

4. How do you break into machine learning?

The most common way to get into a machine learning career is to acquire the
necessary skills. Learn programming languages such as C, C++, Python, and
Java. Gain basic knowledge about various ML algorithms, mathematical
knowledge about calculus and statistics. This will help you go a long way.

5. How difficult is machine learning?

Machine Learning is a vast field that contains a lot of different aspects. With
the right guidance and consistent hard work, it may not be very difficult to
learn. It definitely requires a lot of time and effort, but if you’re
interested in the subject and are willing to learn, it won’t be too difficult.

6. What is machine learning for beginners?

Machine Learning for beginners will consist of the basic concepts such as types
of Machine Learning (Supervised, Unsupervised, Reinforcement Learning).
Each of these types of ML have different algorithms and libraries within them,
such as, Classification and Regression. There are various classification
algorithms and regression algorithms such as Linear Regression. This would be
the first thing you will learn before moving ahead with other concepts.

7. What level of math is required for machine learning?

You will need to know statistical concepts, linear algebra, probability,


Multivariate Calculus, Optimization. As you go into the more in-depth concepts
of ML, you will need more knowledge regarding these topics.

8. Does machine learning require coding?

Programming is a part of Machine Learning. It is important to know


programming languages such as Python.


Machine Learning Interview Questions

A list of frequently asked machine learning interview questions and answers is
given below.

1) What do you understand by Machine learning?

Machine learning is the form of Artificial Intelligence that deals with system
programming and automates data analysis to enable computers to learn and act
through experiences without being explicitly programmed.

For example, Robots are coded in such a way that they can perform the tasks
based on data they collect from sensors. They automatically learn programs from
data and improve with experiences.

2) Differentiate between inductive learning and deductive learning?

In inductive learning, the model learns by example from a set of observed
instances to draw a generalized conclusion. In deductive learning, the model
starts from a general conclusion or rule and applies it to specific cases.

o Inductive learning is the method of using observations to draw conclusions.

o Deductive learning is the method of using conclusions to form observations.

For example, if we have to explain to a kid that playing with fire can cause burns.
There are two ways we can explain this to a kid; we can show training examples of
various fire accidents or images of burnt people and label them as "Hazardous". In
this case, a kid will understand with the help of examples and not play with the
fire. It is the form of Inductive machine learning. The other way to teach the same
thing is to let the kid play with the fire and wait to see what happens. If the kid gets
a burn, it will teach the kid not to play with fire and avoid going near it. It is the
form of deductive learning.
3) What is the difference between Data Mining and Machine Learning?

Data mining can be described as the process in which the structured data tries to
abstract knowledge or interesting unknown patterns. During this process, machine
learning algorithms are used.

Machine learning represents the study, design, and development of the algorithms


which provide the ability to the processors to learn without being explicitly
programmed.

4) What is the meaning of Overfitting in Machine learning?

Overfitting can be seen in machine learning when a statistical model describes
random error or noise instead of the underlying relationship. Overfitting is
usually observed when a model is excessively complex. It happens because of
having too many parameters relative to the number of training data points. A
model that has been overfitted displays poor performance on new data.

5) Why overfitting occurs?

The possibility of overfitting occurs when the criteria used for training the model is
not as per the criteria used to judge the efficiency of a model.

6) What is the method to avoid overfitting?

Overfitting occurs when we have a small dataset, and a model is trying to learn
from it. By using a large amount of data, overfitting can be avoided. But if we have
a small database and are forced to build a model based on that, then we can use a
technique known as cross-validation. In this method, a model is usually given a
dataset of a known data on which training data set is run and dataset of unknown
data against which the model is tested. The primary aim of cross-validation is to
define a dataset to "test" the model in the training phase. If there is sufficient data,
'Isotonic Regression' is used to prevent overfitting.

7) Differentiate supervised and unsupervised machine learning.

o In supervised machine learning, the machine is trained using labeled data.


Then a new dataset is given into the learning model so that the algorithm
provides a positive outcome by analyzing the labeled data. For example, we
first require to label the data which is necessary to train the model while
performing classification.
o In the unsupervised machine learning, the machine is not trained using
labeled data and let the algorithms make the decisions without any
corresponding output variables.

8) How does Machine Learning differ from Deep Learning?

o Machine learning is all about algorithms which are used to parse data, learn
from that data, and then apply whatever they have learned to make informed
decisions.
o Deep learning is a part of machine learning, which is inspired by the
structure of the human brain and is particularly useful in feature detection.

9) How is KNN different from k-means?

KNN or K nearest neighbors is a supervised algorithm which is used for


classification purpose. In KNN, a test sample is given as the class of the majority
of its nearest neighbors. On the other side, K-means is an unsupervised algorithm
which is mainly used for clustering. In k-means clustering, it needs a set of
unlabeled points and a threshold only. The algorithm further takes unlabeled data
and learns how to cluster it into groups by computing the mean of the distance
between different unlabeled points.

10) What are the different types of Algorithm methods in Machine Learning?

The different types of algorithm methods in machine learning are:

o Supervised Learning
o Semi-supervised Learning
o Unsupervised Learning
o Transduction
o Reinforcement Learning
11) What do you understand by Reinforcement Learning technique?

Reinforcement learning is an algorithm technique used in Machine Learning. It


involves an agent that interacts with its environment by producing actions &
discovering errors or rewards. Reinforcement learning is employed by different
software and machines to search for the best suitable behavior or path it should
follow in a specific situation. It usually learns on the basis of reward or penalty
given for every action it performs.

12) What is the trade-off between bias and variance?

Both bias and variance are errors. Bias is an error due to erroneous or overly
simplistic assumptions in the learning algorithm. It can lead to the model under-
fitting the data, making it hard to have high predictive accuracy and generalize the
knowledge from the training set to the test set.

Variance is an error due to too much complexity in the learning algorithm. It leads
to the algorithm being highly sensitive to high degrees of variation in the training
data, which can lead the model to overfit the data.

To optimally reduce the number of errors, we will need to tradeoff bias and
variance.

13) How do classification and regression differ?

o Classification is the task of predicting a discrete class label, whereas
regression is the task of predicting a continuous quantity.

o In a classification problem, data is labeled into one of two or more classes,
whereas a regression problem needs the prediction of a quantity.

o A classification problem with two classes is called binary classification,
and one with more than two classes is called multi-class classification; a
regression problem containing multiple input variables is called a multivariate
regression problem.

o Classifying an email as spam or non-spam is an example of a classification
problem, whereas predicting the price of a stock over a period of time is a
regression problem.

14) What are the five popular algorithms we use in Machine Learning?

Five popular algorithms are:

o Decision Trees
o Probabilistic Networks
o Neural Networks
o Support Vector Machines
o Nearest Neighbor

15) What do you mean by ensemble learning?

Numerous models, such as classifiers are strategically made and combined to solve
a specific computational program which is known as ensemble learning. The
ensemble methods are also known as committee-based learning or learning
multiple classifier systems. It trains various hypotheses to fix the same issue. One
of the most suitable examples of ensemble modeling is the random forest trees
where several decision trees are used to predict outcomes. It is used to improve the
classification, function approximation, prediction, etc. of a model.

16) What is a model selection in Machine Learning?


The process of choosing models among diverse mathematical models, which are
used to describe the same data is known as Model Selection. Model selection is applied in the fields of statistics, data mining, and machine learning.

17) What are the three stages of building the hypotheses or model in machine
learning?

There are three stages to build hypotheses or model in machine learning:

o Model building
It chooses a suitable algorithm for the model and trains it according to the
requirement of the problem.
o Model testing
It checks the accuracy of the model through the test data.
o Applying the model
It makes the required changes after testing and applies the final model.

18) What according to you, is the standard approach to supervised learning?

In supervised learning, the standard approach is to split the set of examples into a training set and a test set.

19) Describe 'Training set' and 'Test set'.

In machine learning, the set of data used to discover potentially predictive relationships is known as the 'Training Set'.
The training set is an example that is given to the learner. Besides, the 'Test set' is
used to test the accuracy of the hypotheses generated by the learner. It is the set of
instances held back from the learner. Thus, the training set is distinct from the test
set.

20) What are the common ways to handle missing data in a dataset?

Missing data is one of the standard issues when working with real-world data. It is considered one of the greatest challenges faced by data analysts. There
are many ways one can impute the missing values. Some of the common methods
to handle missing data in datasets can be defined as deleting the rows, replacing
with mean/median/mode, predicting the missing values, assigning a unique
category, using algorithms that support missing values, etc.
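As a small, hedged sketch (the DataFrame and column names below are invented purely for illustration), a few of these strategies might look like this with pandas:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (column names are illustrative only).
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["Pune", "Mumbai", None, "Pune", "Delhi"],
})

# 1) Delete rows that contain any missing value.
dropped = df.dropna()

# 2) Replace numeric missing values with the mean (or median).
df["age"] = df["age"].fillna(df["age"].mean())

# 3) Replace categorical missing values with the mode.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```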

21) What do you understand by ILP?

ILP stands for Inductive Logic Programming. It is a subfield of machine learning which uses logic programming. It aims at searching for patterns in data which can be used to build predictive models. In this process, the logic programs are treated as
a hypothesis.

22) What are the necessary steps involved in Machine Learning Project?

There are several essential steps we must follow to achieve a good working model in a Machine Learning project. Those steps include data collection, data preparation, training the model, parameter tuning, model evaluation, and prediction.

23) Describe Precision and Recall?

Precision and Recall are both measures used in the information retrieval domain to measure how well an information retrieval system retrieves the data requested by the user.

Precision is also known as the positive predictive value. It is the fraction of relevant instances among the retrieved instances.

On the other side, recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. Recall is also known as sensitivity.
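A minimal, hedged sketch of computing both measures with scikit-learn (the label arrays are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Invented ground-truth and predicted labels (1 = relevant/positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Precision: fraction of predicted positives that are truly positive.
print("precision:", precision_score(y_true, y_pred))
# Recall (sensitivity): fraction of true positives that were retrieved.
print("recall:", recall_score(y_true, y_pred))
```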

24) What do you understand by Decision Tree in Machine Learning?

A Decision Tree is a supervised machine learning method in which the data is continuously split according to certain parameters. It builds classification or regression models in the form of a tree structure, with datasets broken up into ever
smaller subsets while developing the decision tree. The tree can be defined by two
entities, namely decision nodes, and leaves. The leaves are the decisions or the
outcomes, and the decision nodes are where the data is split. Decision trees can
manage both categorical and numerical data.

25) What are the functions of Supervised Learning?

o Classification
o Speech Recognition
o Regression
o Predict Time Series
o Annotate Strings

26) What are the functions of Unsupervised Learning?

o Finding clusters of the data


o Finding low-dimensional representations of the data
o Finding interesting directions in data
o Finding novel observations/ database cleaning
o Finding interesting coordinates and correlations

27) What do you understand by algorithm independent machine learning?

Algorithm-independent machine learning can be defined as machine learning where the mathematical foundations are independent of any particular classifier or
learning algorithm.

28) Describe the classifier in machine learning.


A classifier is a case of a hypothesis or discrete-valued function which is used to
assign class labels to particular data points. It is a system that inputs a vector of
discrete or continuous feature values and outputs a single discrete value, the class.

29) What do you mean by Genetic Programming?

Genetic Programming (GP) is a type of Evolutionary Algorithm, a subset of machine learning. Genetic programming software systems implement an
algorithm that uses random mutation, a fitness function, crossover, and multiple
generations of evolution to resolve a user-defined task. The genetic programming
model is based on testing and choosing the best option among a set of results.

30) What is SVM in machine learning? What are the classification methods
that SVM can handle?

SVM stands for Support Vector Machine. SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

The classification methods that SVM can handle are:

o Combining binary classifiers


o Modifying binary to incorporate multiclass learning

31) How will you explain a linked list and an array?

An array is a data type which is widely implemented as a default type in almost all modern programming languages. It is used to store data of a similar type.

But there are many use-cases where we don't know the quantity of data to be
stored. For such cases, advanced data structures are required, and one such data
structure is linked list.

There are some points which explain how the linked list is different from an array:

ARRAY

o An array is a group of elements of a similar data type.
o Elements are stored consecutively in memory.
o An array supports Random Access: elements can be accessed directly using their
index value, like arr[0] for the 1st element, arr[5] for the 6th element, etc. As a
result, accessing elements in an array is fast, with a constant time complexity
of O(1).
o Memory is allocated at compile time, as soon as the array is declared. This is
known as Static Memory Allocation.
o Insertion and deletion operations take more time in an array, as the memory
locations are consecutive and fixed.
o The size of the array must be declared at the time of array declaration.

LINKED LIST

o A linked list is an ordered group of elements of the same type, which are
connected using pointers.
o New elements can be stored anywhere in memory.
o A linked list supports Sequential Access: we have to traverse the linked list
sequentially up to the element/node we want to access. To access the nth element
of a linked list, the time complexity is O(n).
o Memory is allocated at runtime, whenever a new node is added. This is known as
Dynamic Memory Allocation.
o A new element is stored at the first free available memory location, so insertion
and deletion operations are fast in a linked list.
o The size of a linked list is variable. It grows at runtime whenever nodes are
added to it.
32) What do you understand by the Confusion Matrix?

A confusion matrix is a table which is used for summarizing the performance of a classification algorithm. It is also known as the error matrix.

Where,

TN= True Negative


TP= True Positive
FN= False Negative
FP= False Positive

33) Explain True Positive, True Negative, False Positive, and False
Negative in Confusion Matrix with an example.

o True Positive
When a model correctly predicts the positive class, it is said to be a true
positive.
For example, Umpire gives a Batsman NOT OUT when he is NOT OUT.
o True Negative
When a model correctly predicts the negative class, it is said to be a true
negative.
For example, Umpire gives a Batsman OUT when he is OUT.
o False Positive
When a model incorrectly predicts the positive class, it is said to be a false
positive. It is also known as 'Type I' error.
For example, Umpire gives a Batsman NOT OUT when he is OUT.
o False Negative
When a model incorrectly predicts the negative class, it is said to be a false
negative. It is also known as 'Type II' error.
For example, Umpire gives a Batsman OUT when he is NOT OUT.
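For reference, a short hedged sketch (labels invented; 1 = OUT, 0 = NOT OUT in the umpire example above) showing how these four counts can be read off a confusion matrix in scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = OUT, 0 = NOT OUT.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```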

34) What according to you, is more important between model accuracy and
model performance?

Model accuracy is a subset of model performance. The accuracy of the model is directly proportional to the performance of the model: the better the performance of the model, the more accurate its predictions.

35) What is Bagging and Boosting?

o Bagging is a process in ensemble learning which is used for improving unstable estimation or classification schemes.
o Boosting methods are used sequentially to reduce the bias of the combined
model.

36) What are the similarities and differences between bagging and boosting in
Machine Learning?

Similarities of Bagging and Boosting

o Both are ensemble methods that build N learners from 1 base learner.
o Both generate several training data sets with random sampling.
o Both generate the final result by taking the average of N learners.
o Both reduce variance and provide higher scalability.

Differences between Bagging and Boosting

o In Bagging, the individual models are built independently, whereas in Boosting, each new model is added to perform well where previous models fail.
o Only Boosting determines the weight for the data to tip the scales in favor of the most challenging cases.
o Only Boosting tries to reduce bias, whereas Bagging may solve the problem of over-fitting while Boosting can increase it.
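A hedged sketch of the two ideas using scikit-learn's standard ensemble estimators on a toy dataset (default-ish settings, not a tuned comparison):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent estimators (decision trees by default) trained on bootstrap
# samples; their predictions are averaged/voted.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: estimators added sequentially, each focusing on previously misclassified cases.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging  CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```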

37) What do you understand by Cluster Sampling?

Cluster Sampling is a process of randomly selecting intact groups, sharing similar characteristics, within a defined population. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

For example, if we are sampling managers across a set of companies, the managers (the sample) represent the elements and the companies represent the clusters.

38) What do you know about Bayesian Networks?

Bayesian Networks, also referred to as 'belief networks' or 'causal networks', are used to represent the graphical model for probability relationships among a set of variables.

For example, a Bayesian network can be used to represent the probabilistic relationships between diseases and symptoms. Given the symptoms, the network can compute the probabilities of the presence of various diseases.

Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g., speech signals or protein sequences) are called dynamic Bayesian networks.

39) Which are the two components of Bayesian logic program?

A Bayesian logic program consists of two components:

o Logical
It contains a set of Bayesian clauses, which capture the qualitative structure
of the domain.
o Quantitative
It encodes the quantitative information about the domain, such as the conditional
probability distributions associated with the Bayesian clauses.

Machine Learning Interview Questions For Freshers
o 1.Why was Machine Learning Introduced?
o 2.What are Different Types of Machine Learning algorithms?
o 3.What is Supervised Learning?
o 4.What is Unsupervised Learning?
o 5.What is ‘Naive’ in a Naive Bayes?
o 6.What is PCA? When do you use it?
o 7.Explain SVM Algorithm in Detail
o 8.What are Support Vectors in SVM?
o 9.What are Different Kernels in SVM?
o 10.What is Cross-Validation?
o 11.What is Bias in Machine Learning?
o 12.Explain the Difference Between Classification and Regression?
Advanced Machine Learning Questions
o 13.What is F1 score? How would you use it?
o 14.Define Precision and Recall?
o 15.How to Tackle Overfitting and Underfitting?
o 16.What is a Neural Network?
o 17.What are Loss Function and Cost Functions? Explain the key Difference
Between them?
o 18.What is Ensemble learning?
o 19.How do you make sure which Machine Learning Algorithm to use?
o 20.How to Handle Outlier Values?
o 21.What is a Random Forest? How does it work?
o 22.What is Collaborative Filtering? And Content-Based Filtering?
o 23.What is Clustering?
o 24.How can you select K for K-means Clustering?
o 25.What are Recommender Systems?
o 26.How do you check the Normality of a dataset?
o 27.Can logistic regression be used for more than 2 classes?
o 28.Explain Correlation and Covariance?
o 29.What is P-value?
o 30.What are Parametric and Non-Parametric Models?
o 31.What is Reinforcement Learning?
o 32.Difference Between Sigmoid and Softmax functions?

Firstly, Machine Learning refers to the process of training a computer program to build a statistical model based on data. The goal of machine learning (ML) is to identify the key patterns in data and extract useful insights from it.
For example, if we have a historical dataset of actual sales figures, we can train
machine learning models to predict sales for the coming future.

Why is the Machine Learning trend emerging so fast?

Machine Learning solves Real-World problems. Unlike the hard coding rule to
solve the problem, machine learning algorithms learn from the data.

The learnings can later be used to predict the future. It is paying off for early
adopters.

A full 82% of enterprises adopting machine learning and Artificial Intelligence (AI) have gained a significant financial advantage from their investments.

Machine Learning Interview Questions For Freshers

1. Why was Machine Learning Introduced?

The simplest answer is to make our lives easier. In the early days of “intelligent”
applications, many systems used hardcoded rules of “if” and “else” decisions to
process data or adjust the user input. Think of a spam filter whose job is to move
the appropriate incoming email messages to a spam folder.  

But with the machine learning algorithms, we are given ample information for the
data to learn and identify the patterns from the data.

Unlike traditional rule-based programming, in machine learning we don't need to write new rules for each problem; we just reuse the same workflow with a different dataset.

Consider Alan Turing: in his 1950 paper, “Computing Machinery and Intelligence”, Turing asked, “Can machines think?”


The paper describes the “Imitation Game”, which includes three participants -

 A human acting as a judge,
 Another human, and
 A computer that attempts to convince the judge that it is human.

The judge asks the other two participants to talk. While they respond, the judge needs to decide which response came from the computer. If the judge cannot tell the difference, the computer wins the game.
The test continues today as an annual competition in artificial intelligence. The
aim is simple enough: convince the judge that they are chatting to a human
instead of a computer chatbot program.

2. What are Different Types of Machine Learning algorithms?

There are various types of machine learning algorithms. Here is the list of them in
a broad category based on:

 Whether they are trained with human supervision (supervised, unsupervised, and reinforcement learning)
 These criteria are not exclusive; we can combine them in any way we like.

[Figure: Types of Machine Learning algorithms]

3. What is Supervised Learning?

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consists of a set of training examples.

Example: 01

Knowing the height and weight, identify the gender of the person. Below are the popular supervised learning algorithms:
 Support Vector Machines
 Regression
 Naive Bayes
 Decision Trees
 K-nearest Neighbour Algorithm and Neural Networks.
Example: 02

If you build a T-shirt classifier, the labels will be “this is an S, this is an M and
this is L”, based on showing the classifier examples of S, M, and L.

4. What is Unsupervised Learning?

Unsupervised learning is a type of machine learning algorithm used to find patterns in a given set of data. Here, we don't have any dependent variable or label to predict. Unsupervised learning algorithms include:

 Clustering, 
 Anomaly Detection, 
 Neural Networks and Latent Variable Models.
Example:

In the same example, a T-shirt clustering will categorize as “collar style and V
neck style”, “crew neck style” and “sleeve types”.

5. What is ‘Naive’ in a Naive Bayes?

The Naive Bayes method is a supervised learning algorithm. It is called 'naive' because it applies Bayes' theorem under the assumption that all attributes are independent of each other given the class.

Bayes’ theorem states the following relationship, given the class variable yi and a dependent feature vector x1 through xn:

P(yi | x1, ..., xn) = P(yi) * P(x1, ..., xn | yi) / P(x1, ..., xn)

Using the naive conditional independence assumption that each xi is independent of the other features given the class,

P(xi | yi, x1, ..., xi-1, xi+1, ..., xn) = P(xi | yi)

for all i, this relationship simplifies to:

P(yi | x1, ..., xn) = P(yi) * P(x1 | yi) * ... * P(xn | yi) / P(x1, ..., xn)

Since P(x1, ..., xn) is constant given the input, we can use the following classification rule:

P(yi | x1, ..., xn) ∝ P(yi) * P(x1 | yi) * ... * P(xn | yi)

y = arg max over yi of P(yi) * P(x1 | yi) * ... * P(xn | yi)

We can use Maximum A Posteriori (MAP) estimation to estimate P(yi) and P(xi | yi); the former is then the relative frequency of class yi in the training set.

The different naive Bayes classifiers mainly differ by the assumptions they make regarding the distribution of P(xi | yi): it can be Bernoulli, multinomial, Gaussian, and so on.
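A brief, hedged sketch of the Gaussian variant in scikit-learn, which estimates P(yi) and a per-class Gaussian P(xi | yi) from the training data (Iris data used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB assumes features are conditionally independent and Gaussian given the class.
model = GaussianNB().fit(X_train, y_train)
print("class priors P(yi):", model.class_prior_)
print("test accuracy:", model.score(X_test, y_test))
```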

6. What is PCA? When do you use it?

Principal component analysis (PCA) is most commonly used for dimension reduction.

In this case, PCA measures the variation in each variable (or column in the table).
If there is little variation, it throws the variable out, as illustrated in the figure
below:
[Figure: Principal component analysis (PCA)]

Thus making the dataset easier to visualize. PCA is used in finance, neuroscience,
and pharmacology.
It is very useful as a preprocessing step, especially when there are linear
correlations between features.
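A minimal, hedged sketch with scikit-learn (Iris data used purely for illustration), keeping only the components that explain most of the variance:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Standardize first so no single feature dominates the variance calculation.
X_scaled = StandardScaler().fit_transform(X)

# Keep 2 principal components (or pass a float like 0.95 to keep 95% of the variance).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```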

7. Explain SVM Algorithm in Detail

A Support Vector Machine (SVM) is a very powerful and versatile supervised machine learning model, capable of performing linear or non-linear classification,
regression, and even outlier detection.

Suppose we have given some data points that each belong to one of two classes,
and the goal is to separate two classes based on a set of examples.

In SVM, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier.

There are many hyperplanes that could classify the data. We choose the hyperplane that represents the largest separation, or margin, between the two classes. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier.

We have data (x1, y1), ..., (xn, yn), where each xi has features (xi1, ..., xip) and each yi is either 1 or -1.

The equation of the separating hyperplane is the set of points x satisfying:

w . x - b = 0

where w is the normal vector of the hyperplane. The parameter b / ||w|| determines the offset of the hyperplane from the origin along the normal vector w.

For a maximum-margin classifier, each xi should lie on the correct side of the margin, that is:

w . xi - b >= 1 (for yi = 1)   or   w . xi - b <= -1 (for yi = -1)

[Figure: Support Vector Machine (SVM)]

8. What are Support Vectors in SVM?

A Support Vector Machine (SVM) is an algorithm that tries to fit a line (or plane
or hyperplane) between the different classes that maximizes the distance from the
line to the points of the classes.

In this way, it tries to find a robust separation between the classes. The support vectors are the data points that lie closest to the dividing hyperplane, on the edge of the margin, as in the figure below.

[Figure: Support Vector Machine (SVM)]

9. What are Different Kernels in SVM?

There are several types of kernels in SVM; the most commonly used are:

 Linear kernel - used when the data is linearly separable.
 Polynomial kernel - used when you have discrete data that has no natural notion of smoothness.
 Radial basis function (RBF) kernel - creates a decision boundary that can do a much better job of separating two classes than the linear kernel.
 Sigmoid kernel - used as an activation function for neural networks.

10. What is Cross-Validation?

Cross-validation is a method of splitting your data so that every part of it is used for both training and evaluation. In k-fold cross-validation, the data is split into k subsets and the model is trained on k-1 of those subsets.

The remaining subset is held out for testing. This is repeated for each of the k subsets; this is k-fold cross-validation. Finally, the scores from all the k folds are averaged to produce the final score.
[Figure: Cross-validation]
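As a hedged sketch, k-fold cross-validation in scikit-learn (k = 5, toy data) averages the per-fold scores as described above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold scores:", scores)
print("mean score:", scores.mean())
```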

11. What is Bias in Machine Learning?

Bias in data tells us there is inconsistency in data. The inconsistency may occur
for several reasons which are not mutually exclusive.

For example, a tech giant like Amazon built an engine to speed up the hiring process: given 100 resumes, it would spit out the top five candidates to hire.

When the company realized the software was not producing gender-neutral
results it was tweaked to remove this bias.
12. Explain the Difference Between Classification and Regression?

Classification is used to produce discrete results; it is used to classify data into specific categories.
For example, classifying emails into spam and non-spam categories. 

Whereas, regression deals with continuous data.


For example, predicting stock prices at a certain point in time.

Classification is used to predict the output into a group of classes. 


For example, Is it Hot or Cold tomorrow?

Whereas, regression is used to predict the relationship that data represents. 


For example, What is the temperature tomorrow?

Advanced Machine Learning Questions

13. What is F1 score? How would you use it?

Let’s have a look at this table before directly jumping into the F1 score.

              Predicted Yes         Predicted No
Actual Yes    True Positive (TP)    False Negative (FN)
Actual No     False Positive (FP)   True Negative (TN)

In binary classification, we consider the F1 score to be a measure of the model's accuracy. The F1 score is the harmonic mean of the precision and recall scores.

F1 = 2TP / (2TP + FP + FN)

We see scores for F1 between 0 and 1, where 0 is the worst score and 1 is the best
score. 
The F1 score is typically used in information retrieval to see how well a model retrieves relevant results and how well it is performing overall.
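A minimal, hedged example of the formula above, computed both by hand and with scikit-learn (labels invented):

```python
from sklearn.metrics import f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
manual_f1 = 2 * tp / (2 * tp + fp + fn)
print("manual F1:", manual_f1)
print("sklearn F1:", f1_score(y_true, y_pred))
```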
 

14. Define Precision and Recall?

Precision and recall are ways of monitoring the power of a machine learning implementation, and they are often used at the same time.

Precision answers the question, “Out of the items that the classifier predicted to
be relevant, how many are truly relevant?”
Whereas, recall answers the question, “Out of all the items that are truly relevant, how many are found by the classifier?”

In general, precision means being exact and accurate. The same applies to our machine learning model: of the set of items that the model predicts to be relevant, how many are truly relevant?

The figure below shows precision and recall as a Venn diagram.

[Figure: Precision and recall]

Mathematically, precision and recall can be defined as the following:

precision = # relevant items retrieved / # total items returned by the ranker

recall = # relevant items retrieved / # total relevant items

15. How to Tackle Overfitting and Underfitting?

Overfitting means the model fits the training data too well; in this case, we need to resample the data and estimate the model accuracy using techniques like k-fold cross-validation.
Whereas for the Underfitting case we are not able to understand or capture the
patterns from the data, in this case, we need to change the algorithms, or we need
to feed more data points to the model.

16. What is a Neural Network?


It is a simplified model of the human brain. Much like the brain, it has neurons
that activate when encountering something similar.

The different neurons are connected via connections that help information flow
from one neuron to another.

17. What are Loss Function and Cost Functions? Explain the key Difference
Between them?

When calculating loss we consider only a single data point, then we use the term
loss function.

Whereas, when calculating the sum of error for multiple data then we use the cost
function. There is no major difference.

In other words, the loss function is to capture the difference between the actual
and predicted values for a single record whereas cost functions aggregate the
difference for the entire training dataset.

The Most commonly used loss functions are Mean-squared error and Hinge loss.

Mean Squared Error (MSE): In simple words, it measures how far our model's predicted values are from the actual values, on average:

MSE = (1/n) * Σ (predicted value - actual value)^2

Hinge loss: It is used to train machine learning classifiers such as SVMs:

L(ŷ) = max(0, 1 - y * ŷ)

where y = -1 or 1 indicates the true class and ŷ represents the raw output of the classifier. For a simple linear model y = mx + b, the most common cost function totals these per-record errors (for example, the squared errors) over the entire training set.
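A hedged NumPy sketch of both loss functions as defined above (the input arrays are invented for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences.
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def hinge_loss(y_true, raw_scores):
    # Hinge loss: y_true in {-1, +1}, raw_scores are the classifier's raw outputs.
    return np.mean(np.maximum(0.0, 1.0 - np.asarray(y_true) * np.asarray(raw_scores)))

print("MSE:", mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))          # regression example
print("hinge:", hinge_loss([1, -1, 1], [0.8, -0.5, -0.2]))    # classification example
```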

18. What is Ensemble learning?

Ensemble learning is a method that combines multiple machine learning models to create more powerful models.

There are many reasons for a model to be different. Few reasons are:

 Different Population
 Different Hypothesis
 Different modeling techniques
When working with the model’s training and testing data, we will experience an
error. This error might be bias, variance, and irreducible error.

Now the model should always have a balance between bias and variance, which
we call a bias-variance trade-off.

This ensemble learning is a way to perform this trade-off.

There are many ensemble techniques available but when aggregating multiple
models there are two general methods:

 Bagging, a simple method: take the training set and generate new training sets from it by sampling with replacement.
 Boosting, a more elegant method: similar to bagging, but boosting is used to optimize the weighting scheme for a training set.

19. How do you make sure which Machine Learning Algorithm to use?

It completely depends on the dataset we have. For example, if the target is discrete (categorical) we might use a classifier such as SVM, and if the target is continuous we might use linear regression.

So there is no specific way that lets us know which ML algorithm to use, it all
depends on the exploratory data analysis (EDA).

EDA is like “interviewing” the dataset; as part of our interview we do the following:

 Classify our variables as continuous, categorical, and so forth. 


 Summarize our variables using descriptive statistics. 
 Visualize our variables using charts.

Based on the above observations select one best-fit algorithm for a particular
dataset.

20. How to Handle Outlier Values?

An outlier is an observation in the dataset that is far away from the other observations. Tools used to discover outliers are:

 Box plot
 Z-score
 Scatter plot, etc.

Typically, we need to follow three simple strategies to handle outliers:

 We can drop them. 


 We can mark them as outliers and include them as a feature. 
 Likewise, we can transform the feature to reduce the effect of the outlier.
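A small, hedged sketch of the z-score approach (the data is invented; a threshold of 3 standard deviations is a common rule of thumb):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
                 12, 11, 10, 13, 12, 11, 10, 12, 11, 95], dtype=float)

# Z-score: how many standard deviations each point is from the mean.
z = (data - data.mean()) / data.std()
outlier_mask = np.abs(z) > 3  # common rule-of-thumb threshold

print("outliers:", data[outlier_mask])

# Strategy 1: drop them.
cleaned = data[~outlier_mask]

# Strategy 3: transform/cap them (e.g., clip to the 1st and 99th percentiles).
capped = np.clip(data, np.percentile(data, 1), np.percentile(data, 99))
```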

21. What is a Random Forest? How does it work?

Random forest is a versatile machine learning method capable of performing both regression and classification tasks.

Like bagging and boosting, random forest works by combining a set of other tree models. Random forest builds each tree from a random sample of the rows and a random subset of the columns (features) in the training data.

Here are the steps for how a random forest creates its trees:

 Take a sample from the training data.
 Begin with a single node.
 Run the following algorithm, starting from the root node:
o If the number of observations is less than the node size, then stop.
o Select random variables.
o Find the variable that does the “best” job of splitting the observations.
o Split the observations into two nodes.
o Repeat this splitting procedure on each of these nodes.
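A hedged scikit-learn sketch on toy data; the key random-forest knobs discussed elsewhere in this document (max_depth, min_samples_leaf, max_features) are shown explicitly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the rows and a random subset of features per split.
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=5,           # longest allowed root-to-leaf path
    min_samples_leaf=3,    # minimum samples required in a leaf after a split
    max_features="sqrt",   # random variables considered at each split
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```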

22. What is Collaborative Filtering? And Content-Based Filtering?

Collaborative filtering is a proven technique for personalized content recommendations. It is a type of recommendation system
that predicts new content by matching the interests of the individual user with the
preferences of many users.

Content-based recommender systems are focused only on the preferences of the individual user. New recommendations are made to the user from content similar to the user's previous choices.

[Figure: Collaborative Filtering and Content-Based Filtering]

23. What is Clustering?

Clustering is the process of grouping a set of objects into a number of groups. Objects should be similar to one another within the same cluster and dissimilar to those in other clusters.

A few types of clustering are:

 Hierarchical clustering
 K means clustering
 Density-based clustering
 Fuzzy clustering, etc.
24. How can you select K for K-means Clustering?

There are two kinds of methods that include direct methods and statistical testing
methods:

 Direct methods: these include the elbow and silhouette methods.
 Statistical testing methods: these include the gap statistic.

The silhouette is the most frequently used while determining the optimal value of
k.
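A hedged sketch of both the elbow (inertia) and silhouette approaches with scikit-learn (synthetic blobs used purely for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: look for the k where inertia stops dropping sharply.
    # Silhouette: higher is better (ranges from -1 to 1).
    print(k, "inertia:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```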

25. What are Recommender Systems?

A recommendation engine is a system used to predict users' interests and recommend products that are likely to interest them.

Data required for recommender systems stems from explicit user ratings after
watching a film or listening to a song, from implicit search engine queries and
purchase histories, or from other knowledge about the users/items themselves.

26. How do you check the Normality of a dataset?

Visually, we can use plots such as histograms or Q-Q plots. A few statistical normality tests are as follows:

 Shapiro-Wilk Test
 Anderson-Darling Test
 Martinez-Iglewicz Test
 Kolmogorov-Smirnov Test
 D’Agostino Skewness Test
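A minimal, hedged example of one of these tests (Shapiro-Wilk) with SciPy; the sample here is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
stat, p_value = stats.shapiro(sample)
print("statistic:", round(stat, 4), "p-value:", round(p_value, 4))
# A p-value above the chosen significance level (e.g., 0.05) means we fail to reject normality.
```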

27. Can logistic regression be used for more than 2 classes?

No, by default logistic regression is a binary classifier, so it cannot directly be applied to more than 2 classes. However, it can be extended to solve multi-class classification problems (multinomial logistic regression).

28. Explain Correlation and Covariance?

Correlation is used for measuring and also for estimating the quantitative
relationship between two variables.  Correlation measures how strongly two
variables are related. Examples like, income and expenditure, demand and
supply, etc.

Covariance is a simple way to measure the degree to which two variables vary together. The problem with covariance is that covariances are hard to compare without normalization.
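A short, hedged NumPy sketch (invented income/expenditure figures) showing that correlation is simply covariance after normalization:

```python
import numpy as np

income      = np.array([30, 40, 50, 60, 70], dtype=float)
expenditure = np.array([25, 33, 38, 50, 55], dtype=float)

# Covariance: scale-dependent, hard to compare across variable pairs.
cov = np.cov(income, expenditure)[0, 1]

# Correlation: covariance normalized by both standard deviations, always in [-1, 1].
corr = np.corrcoef(income, expenditure)[0, 1]

print("covariance:", cov)
print("correlation:", corr)
```
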
29. What is P-value?

P-values are used to make decisions in a hypothesis test. The p-value is the minimum significance level at which you can reject the null hypothesis. The lower the p-value, the stronger the evidence for rejecting the null hypothesis.

30. What are Parametric and Non-Parametric Models?

Parametric models have a limited, fixed number of parameters, and to predict new data you only need to know the parameters of the model.

Non-parametric models have no fixed limit on the number of parameters, allowing for more flexibility. To predict new data, you need to know the model parameters as well as the state of the observed data.

31. What is Reinforcement Learning?

Reinforcement learning is different from the other types of learning, like supervised and unsupervised learning. In reinforcement learning, we are given neither data nor labels. Our learning is based on the rewards given to the agent by the environment.

32. Difference Between Sigmoid and Softmax functions?

The sigmoid function is used for binary classification; it outputs a single probability between 0 and 1. The softmax function is used for multi-class classification; it outputs one probability per class, and those probabilities sum to 1.
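A brief, hedged NumPy sketch of both functions (the scores are invented):

```python
import numpy as np

def sigmoid(z):
    # Maps a single score to a probability in (0, 1); used for binary classification.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Maps a vector of scores to class probabilities that sum to 1; used for multi-class.
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

print("sigmoid(1.2):", sigmoid(1.2))
print("softmax([2.0, 1.0, 0.1]):", softmax(np.array([2.0, 1.0, 0.1])))
```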
(Questions provided by VisionNLP)

Q1.  Different activation function?

Ans. Binary Step Function, Linear Activation Function, Sigmoid/Logistic Activation Function, Tanh Function (Hyperbolic Tangent), and ReLU Activation Function.

Q2.  How do you handle imbalance data?

Ans.  Follow these techniques:


Use the right evaluation metrics.
Use K-fold Cross-Validation in the right way.
Ensemble different resampled datasets.
Resample with different ratios.
Cluster the abundant class.
Design your own models.

Q3.  Difference between sigmoid and softmax?

Ans.  The sigmoid function is used for the two-class logistic regression, whereas
the softmax function is used for the multiclass logistic regression (a.k.a. MaxEnt,
multinomial logistic regression, softmax Regression, Maximum Entropy
Classifier).

Q4. Explain about optimizers?

Ans. Optimizers are algorithms or methods used to change the attributes of the
neural network such as weights and learning rate to reduce the losses. Optimizers
are used to solve optimization problems by minimizing the function.

Q5. Precision-Recall Trade off?

Ans.  The Idea behind the precision-recall trade-off is that when a person changes
the threshold for determining if a class is positive or negative it will tilt the
scales. It means that it will cause precision to increase and recall to decrease, or
vice versa.

Q6. Decision Tree Parameters?


Ans.  These are the parameters used for building Decision Tree:
min_samples_split, min_samples_leaf, max_features and criterion.

Q7. Bagging and boosting?

Ans. Bagging is a way to decrease the variance of the prediction by generating additional training data from the dataset, using combinations with repetitions to produce multiple sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification.
