Interview Questions
Ans. Gaussian mixture models (GMMs) are often used for data clustering. You can
use GMMs to perform either hard clustering or soft clustering on query data. To
perform hard clustering, the GMM assigns query data points to the multivariate
normal components that maximize the component posterior probability, given the
data.
K-means clustering aims to partition data into k clusters in such a way that data points in
the same cluster are similar and data points in different clusters are farther
apart. The similarity of two points is determined by the distance between them.
Ans. The z-score test is one of the most commonly used methods to detect outliers. It
measures the number of standard deviations an observation is from the mean
value. A z-score of 1.5 indicates that the observation is 1.5 standard deviations
above the mean, and -1.5 means that the observation is 1.5 standard deviations
below the mean.
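As an illustration, here is a minimal sketch of z-score-based outlier detection (the data and the cutoff of 3 are assumptions for the example, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), [95, 8])  # two injected extreme values

z_scores = (data - data.mean()) / data.std()   # distance from the mean in standard-deviation units

threshold = 3                                   # common cutoff; 2 or 2.5 are also used
outliers = data[np.abs(z_scores) > threshold]
print(outliers)                                 # recovers the injected extreme values
```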
1) Binning:
• Binning methods smooth a sorted data value by consulting the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
2) Regression:
• Here data can be smoothed by fitting the data to a function.
• Linear regression involves finding the “best” line to fit two attributes, so that one
attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than
two attributes are involved and the data are fit to a multidimensional surface.
3) Clustering:
• Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.”
Q5. Min sample leaf and max depth in random forest?
Ans. Min sample leaf specifies the minimum number of samples that should be
present in the leaf node after splitting a node. The max_depth of a tree in Random
Forest is defined as the longest path between the root node and the leaf node.
Using the max_depth parameter, I can limit up to what depth I want every tree in
my random forest to grow.
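A minimal scikit-learn sketch of these two parameters (the toy data and the chosen values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # toy data

# min_samples_leaf: every leaf must keep at least this many samples after a split.
# max_depth: no tree is grown deeper than this many levels from root to leaf.
model = RandomForestClassifier(
    n_estimators=100,
    min_samples_leaf=5,
    max_depth=6,
    random_state=42,
)
model.fit(X, y)
print(model.score(X, y))
```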
Q6. Dictionary and list comprehension?
Ans. The main difference between a list comprehension and a dictionary comprehension is
that a dictionary comprehension produces key-value pairs: the expression before the for
clause is written as key: value instead of a single value.
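A short illustrative example (the numbers are arbitrary):

```python
numbers = [1, 2, 3, 4, 5]

squares_list = [n ** 2 for n in numbers]       # list comprehension: one expression per item
squares_dict = {n: n ** 2 for n in numbers}    # dict comprehension: a key: value pair per item

print(squares_list)   # [1, 4, 9, 16, 25]
print(squares_dict)   # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
```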
20 Questions (www.analyticsindiamag.com)
Machine learning, on the other hand, requires basic knowledge of coding and
strong knowledge of statistics and business.
2. What is big data?
Big data has 3 major components – volume (size of data), velocity (inflow of data)
and variety (types of data)
3. What are the four main things we should know before studying data
analysis?
Descriptive statistics
Inferential statistics
Hypothesis testing
From the population we take a sample. We cannot work on the whole population, either
because of the computational cost or because not all data points of the population are
available.
Center – the middle of the data. Mean / Median / Mode are the most commonly
used measures.
Mean – the average of all the numbers
Median – the number in the middle
Mode – the number that occurs most often. The disadvantage of using the
mode is that there may be more than one mode.
Spread – how the data is dispersed. Range / IQR / Standard Deviation /
Variance are the most commonly used measures.
Range = Max – Min
Interquartile Range (IQR) = Q3 – Q1, where Q3 is the third quartile (75th percentile),
Q1 is the first quartile (25th percentile), and Q2 is the median.
Standard Deviation: σ = √(∑(x − µ)² / n). It represents how far the data points are from the mean.
Variance = σ²
Left skewed
The left tail is longer than the right side
Mean < median < mode
Right skewed
The right tail is longer than the left side
Mode < median < mean
14. What does symmetric distribution mean?
The part of the distribution that is on the left side of the median is the same as the part
of the distribution that is on the right side of the median.
X (standardized) = (x-µ) / σ
18. What is an outlier?
An outlier is an abnormal value (It is at an abnormal distance from rest of the data
points).
Widely used – any data point that lies beyond 1.5 × IQR below Q1 or above Q3 (see the sketch after this list)
Remove outlier
When we know the data-point is wrong (negative age of a person)
When we have lots of data
We should provide two analyses. One with outliers and another
without outliers.
Keep outlier
When there are lot of outliers (skewed data)
When results are critical
When outliers have meaning (fraud data)
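A minimal sketch of the widely used 1.5 × IQR rule mentioned above (the data values are an assumption for illustration):

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 80])   # 80 is a suspicious value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr                # the 1.5 * IQR fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # flags 80
```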
Take a sample of fishes from the sea (to get better results the number of fishes >
30)
Calculate t-statistics
Get the confidence interval in which the mean length of all the fishes should be.
27. What is the proportion of confidence interval that will not contain the
population parameter?
Alpha is the proportion of confidence intervals that will not contain the population
parameter.
α = 1 – CL
28. What is the difference between 95% confidence level and 99% confidence
level?
The confidence interval becomes wider as we move from a 95% confidence level to a 99%
confidence level.
If the p-value is greater than the critical value (significance level), then we fail to reject the H0
e.g., p-value = 0.055 (critical value = 0.05) – weak evidence against H0, so we fail to reject it
If the p-value is less than the critical value, then we reject the H0
e.g., p-value = 0.015 (critical value = 0.05) – strong evidence against H0, so we reject it
e.g., p-value = 0.005 (critical value = 0.05) – very strong evidence against H0, so we reject it
Find H0 and H1
Go to Data tab
If the p-value is more than the critical value, then we fail to reject the H0
If the p-value is less than the critical value, then we reject the H0
37. What is the difference between one tail and two tail hypothesis testing?
In a one-tail test, the alternative hypothesis specifies a direction (greater than or less than), so the rejection region lies in one tail of the distribution. In a two-tail test, the rejection region is split between both tails.
38. What do you think of the tail (one tail or two tail) if H0 is equal to one
value only?
It is a two-tail test
40. Why is the t-value the same for a 90% two-tail and a 95% one-tail test?
Because a 90% two-tail test leaves 5% in each tail, which is the same tail area as a 95% one-tail test, so the critical t-value is identical.
Standard deviation/z-score
Interquartile range (IQR)
Observer selection
Attrition
Protopathic bias
Time intervals
Sampling bias
12. What is the probability of throwing two fair dice when the sum is 5 and 8?
There are 4 ways of rolling a 5 (1+4, 4+1, 2+3, 3+2):
P(Getting a 5) = 4/36 = 1/9
Now, there are 5 ways of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(Getting an 8) = 5/36 ≈ 0.139
13. State the case where the median is a better measure when compared to the
mean.
When the data contains a lot of outliers that can positively or negatively skew it,
the median is preferred because it is robust to outliers and gives a more accurate
measure of the centre.
18. What type of data does not have a log-normal distribution or a Gaussian
distribution?
Data that follows an exponential distribution does not follow a log-normal or Gaussian
distribution. In fact, any type of data that is categorical will not have these
distributions either.
Example: duration of a phone call, time until the next earthquake, etc.
21. What are population and sample in Inferential Statistics, and how are they
different?
A population is a large volume of observations (data). The sample is a small
portion of that population. Because of the large volume of data in the population, it
raises the computational cost. The availability of all data points in the population is
also an issue.
In short:
Z = (X − μ) / σ
Here:
μ: Mean
σ: Standard deviation
X: Value to be standardized
34. If a distribution is skewed to the right and has a median of 20, will the
mean be greater than or less than 20?
If the given distribution is a right-skewed distribution, then the mean will be
greater than 20, while the mode will be less than 20.
36. The standard normal curve has a total area to be under one, and it is
symmetric around zero. True or False?
True. The standard normal curve has a total area under it equal to one and is symmetric
around zero. All measures of central tendency are equal to zero because of this
symmetry.
38. What is the relationship between the confidence level and the significance
level in statistics?
The significance level is the probability of rejecting the null hypothesis when it is
actually true, while the confidence level is the probability that a confidence interval
will capture the true population parameter.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level
39. A regression analysis between apples (y) and oranges (x) resulted in the
following least-squares line: y = 100 + 2x. What is the implication if oranges
are increased by 1?
If the oranges are increased by one, there will be an increase of 2 apples since the
equation is:
y = 100 + 2x.
40. What types of variables are used for Pearson’s correlation coefficient?
Variables to be used for the Pearson’s correlation coefficient must be either in a
ratio or in an interval.
Note that there can exist a condition when one variable is a ratio, while the other is
an interval score.
41. In a scatter diagram, what is the line that is drawn above or below the
regression line called?
The vertical distance of a point above or below the regression line in a scatter diagram is
called the residual, or the prediction error.
Uniform distribution
Binomial distribution
Normal distribution
45. What is the difference between the 1st quartile, the 2nd quartile, and the
3rd quartile?
Quartiles are used to describe the distribution of data by splitting it into four
equal portions; the three boundary values of these portions are the quartiles.
That is, Q1 is the 25th percentile, Q2 (the median) is the 50th percentile, and Q3 is the 75th percentile.
46. How do the standard error and the margin of error relate?
The standard error and the margin of error are quite closely related to each other. In
fact, the margin of error is calculated using the standard error. As the standard error
increases, the margin of error also increases.
49. Given a left-skewed distribution that has a median of 60, what conclusions
can we draw about the mean and the mode of the data?
Given that it is a left-skewed distribution, the mean will be less than the median,
i.e., less than 60, and the mode will be greater than 60.
50. What are the types of biases that we encounter while sampling?
Sampling biases are errors that occur when taking a small sample of data from a
large population as the representation in statistical analysis. There are three main
types of sampling bias: selection bias, survivorship bias, and undercoverage bias.
51. What are the scenarios where outliers are kept in the data?
There are not many scenarios where outliers are kept in the data, but there are some
important situations when they are kept: when there are a lot of outliers (skewed data),
when the results are critical, or when the outliers have meaning (e.g., fraud data).
52. Briefly explain the procedure to measure the length of all sharks in the
world.
The following steps can be used to determine the length of sharks: take a random sample of sharks (preferably more than 30), calculate the mean length and standard deviation of the sample, compute the t-statistic, and derive the confidence interval in which the mean length of all sharks should lie.
53. How does the width of the confidence interval change with length?
The width of the confidence interval is used in decision-making. As the confidence
level increases, the width of the interval also increases; as the sample size increases,
the width decreases.
60. What are the types of biases that you can encounter while sampling?
There are three types of biases:
Selection bias
Survivorship bias
Under coverage bias
62. What are some of the low and high-bias Machine Learning algorithms?
There are many low and high-bias Machine Learning algorithms, and the following
are some of the widely used ones: low bias – Decision Trees, Support Vector Machines, k-Nearest Neighbors; high bias – Linear Regression, Logistic Regression, Linear Discriminant Analysis.
64. What are some of the techniques to reduce underfitting and overfitting
during model training?
Underfitting refers to a situation where data has high bias and low variance, while
overfitting is the situation where there are high variance and low bias.
Following are some of the techniques to reduce underfitting and overfitting:
For reducing underfitting: increase the model complexity, add more features, and train for longer.
For reducing overfitting: gather more training data, use regularization, use cross-validation, and stop training early.
65. Can you give an example to denote the working of the central limit
theorem?
Let’s consider a population of men whose weights are normally distributed, with
a mean of 60 kg and a standard deviation of 10 kg. We want the probability that
one randomly selected man weighs more than 65 kg, and the probability that the mean
weight of 40 randomly selected men is more than 65 kg.
The solution to this can be as shown below:
For one man: Z = (x − µ) / σ = (65 − 60) / 10 = 0.5
For the mean of 40 men: σx̄ = σ / √n = 10 / √40 ≈ 1.58, so Z = (65 − 60) / 1.58 ≈ 3.16
The second Z value is much larger, so a sample mean above 65 kg is far less likely than a single weight above 65 kg.
66. How do you stay up-to-date with the new and upcoming concepts in
statistics?
This is a commonly asked question in a statistics interview. Here, the interviewer is
trying to assess your interest and ability to find out and learn new things
efficiently. Do talk about how you plan to learn new concepts and make sure to
elaborate on how you practically implemented them while learning.
Statistical computing is the process through which data scientists take raw
data and create predictions and models. Without an advanced knowledge of
statistics, it is difficult to succeed as a data scientist; accordingly, it is likely a
good interviewer will try to probe your understanding of the subject matter
with statistics-oriented data science interview questions. Be prepared to
answer some fundamental statistics questions as part of your data science
interview.
“Suppose that we are interested in estimating the average height among all
people. Collecting data for every person in the world is impossible. While we
can’t obtain a height measurement from everyone in the population, we can
still sample some people. The question now becomes, what can we say about
the average height of the entire population given a single sample. The Central
Limit Theorem addresses this question exactly.”
Motivation
Suppose that we are interested in estimating the average height among all
people. Collecting data for every person in the world is impractical, bordering
on impossible. While we can’t obtain a height measurement from everyone in
the population, we can still sample some people. The question now becomes,
what can we say about the average height of the entire population given a
single sample.
When I first read this description I did not completely understand what it
meant. However, after visualizing a few examples it became clearer. Let’s
look at an example of the Central Limit Theorem in action.
Example
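The figures from the original example are not reproduced here; a minimal simulation sketch conveys the same idea. The choice of an exponential population, the sample size of 40, and the number of repetitions are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal population (assumed: exponential with mean 2).
population = rng.exponential(scale=2.0, size=100_000)

sample_size = 40
sample_means = np.array(
    [rng.choice(population, size=sample_size).mean() for _ in range(5_000)]
)

print(population.mean(), population.std())                           # population parameters
print(sample_means.mean())                                           # close to the population mean
print(sample_means.std(), population.std() / np.sqrt(sample_size))   # close to sigma / sqrt(n)
```

A histogram of sample_means would look roughly bell-shaped even though the population itself is heavily skewed.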
Further Intuition
When I first saw an example of the Central Limit Theorem like this, I didn’t
really understand why it worked. The best intuition that I have come across
involves the example of flipping a coin. Suppose that we have a fair coin and
we flip it 100 times. If we observed 48 heads and 52 tails we would probably
not be very surprised. Similarly, if we observed 40 heads and 60 tails, we
would probably still not be very surprised, though it might seem more rare
than the 48/52 scenario. However, if we observed 20 heads and 80 tails we
might start to question the fairness of the coin.
The mean of the sampling distribution will approximate the mean of the true
population distribution. Additionally, the variance of the sampling
distribution is a function of both the population variance and the sample size
used. A larger sample size will produce a smaller sampling distribution
variance. This makes intuitive sense, as we are considering more samples
when using a larger sample size, and are more likely to get a representative
sample of the population. So roughly speaking, if the sample size used is large
enough, there is a good chance that it will estimate the population pretty well.
Most sources state that for most applications N = 30 is sufficient.
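Concretely (a standard result, not spelled out above): if the population has mean µ and variance σ², the sampling distribution of the sample mean x̄ has Mean(x̄) = µ and Variance(x̄) = σ² / n, i.e. Standard error = σ / √n. This is why quadrupling the sample size halves the spread of the sampling distribution.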
These principles can help us to reason about samples from any population.
Depending on the scenario and the information available, the way that it is
applied may vary. For example, in some situations we might know the true
population mean and variance, which would allow us to compute the variance
of any sampling distribution. However, in other situations, such as the original
problem we discussed of estimating average human height, we won’t know the
true population mean and variance. Understanding the nuances of sampling
distributions and the Central Limit Theorem is an essential first step toward
tackling many of these problems.
2. What is sampling? How many sampling methods do you know?
Data Sampling
Sampling can be particularly useful with data sets that are too large to
efficiently analyze in full -- for example, in big data analytics applications or
surveys. Identifying and analyzing a representative sample is more efficient
and cost-effective than surveying the entirety of the data or population.
There are many different methods for drawing samples from data; the ideal
one depends on the data set and situation. Sampling can be based
on probability, an approach that uses random numbers that correspond to
points in the data set to ensure that there is no correlation between points
chosen for the sample. Further variations in probability sampling include:
Cluster sampling: The larger data set is divided into subsets (clusters) based
on a defined factor, then a random sampling of clusters is analyzed.
“A type I error occurs when the null hypothesis is true, but is rejected. A type
II error occurs when the null hypothesis is false, but erroneously fails to be
rejected.”
If the result of the test corresponds with reality, then a correct decision has
been made (e.g., person is healthy and is tested as healthy, or the person is not
healthy and is tested as not healthy). However, if the result of the test does not
correspond with reality, then two types of error are distinguished: type I
error and type II error.
A type II error occurs when the null hypothesis is false, but erroneously fails
to be rejected. Let me say this again: a type II error occurs when the null
hypothesis is actually false, but was accepted as true by the testing.
Continuing our shepherd and wolf example. Again, our null hypothesis is that
there is “no wolf present.” A type II error (or false negative) would be doing
nothing (not “crying wolf”) when there is actually a wolf present. That is,
the actual situationwas that there was a wolf present; however, the shepherd
wrongly indicated there was no wolf present and continued to play Candy
Crush on his iPhone. This is a type II error or false negative error.
Reject null hypothesis: if H0 is actually true, this is a Type I Error (False Positive);
if H0 is actually false, this is a Correct Outcome (True Positive).
Examples
Let’s walk through a few examples and use a simple form to help us to
understand the potential cost ramifications of type I and type II errors. Let’s
start with our shepherd / wolf example.
Let’s look at the classic criminal dilemma next. In colloquial usage, a type I
error can be thought of as "convicting an innocent person" and type II error
"letting a guilty person go free".
Person is not guilty of the crime:
Type I error – the person is judged guilty when they did not actually commit the crime (convicting an innocent person).
Type II error – the person is judged not guilty when they actually did commit the crime (letting a guilty person go free).
Cost Assessment:
Type I error – the social cost of sending an innocent person to prison and denying them their personal freedoms (which in our society is considered an almost unbearable cost).
Type II error – the risk of letting a guilty criminal roam the streets and commit future crimes.
Cost Assessment (drug-trial example):
Type I error – the lost opportunity cost of rejecting an effective drug that could cure Disease B.
Type II error – unexpected side effects (maybe even death) from using a drug that is not effective.
Summary
Type I and type II errors depend heavily upon the language or positioning
of the null hypothesis. Changing the positioning of the null hypothesis can
cause type I and type II errors to switch roles.
It’s hard to create a blanket statement that a type I error is worse than a type
II error, or vice versa. The severity of the type I and type II errors can only
be judged in context of the null hypothesis, which should be thoughtfully
worded to ensure that we’re running the right test.
I highly recommend adding the “Cost Assessment” analysis like we did in the
examples above. This will help identify which type of error is more “costly”
and identify areas where additional testing might be justified.
A linear regression is a good tool for quick predictive analysis: for example,
the price of a house depends on a myriad of factors, such as its size or its
location. In order to see the relationship between these variables, we need to
build a linear regression, which predicts the line of best fit between them and
can help conclude whether or not these two factors have a positive or negative
relationship.
“Selection (or ‘sampling’) bias occurs in an ‘active,’ sense when the sample
data that is gathered and prepared for modeling has characteristics that are
not representative of the true, future population of cases the model will see.
That is, active selection bias occurs when a subset of the data are
systematically (i.e., non-randomly) excluded from analysis.”
The Central Limit Theorem is the cornerstone of statistics. It states that the sampling
distribution of the sample mean approaches a normal distribution as the sample size
becomes large, regardless of the shape of the original population distribution.
The assumption of normality dictates that the mean distribution across samples is
normal. This is true across independent samples as well.
If the p-value is less than alpha, the null hypothesis is rejected; if it is greater
than alpha, we fail to reject the null hypothesis. Rejecting the null hypothesis
indicates that the results obtained are statistically significant.
5. What is an outlier?
Outliers can be defined as the data points within a data set that varies largely in
comparison to other observations. Depending on its cause, an outlier can decrease
the accuracy as well as efficiency of a model. Therefore, it is crucial to remove
them from the data set.
There are many ways to screen and identify potential outliers in a data set. Two
key methods are described below –
If the z-score is above +3 or below −3, the data point is considered unusual and a
potential outlier.
IQR method: IQR = Q3 – Q1; points below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
Other methods to screen outliers include Isolation Forests, Robust Random Cut
Forests, and DBScan clustering.
An inlier is a data point that lies within the general distribution of the data set but
is nevertheless an error. Unlike outliers, inliers are hard to find and often require
external data for accurate identification; once found, they are usually removed to
improve model accuracy.
Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or
results in an experiment are obtained from 20% of the causes. A simple example is
– 80% of sales come from 20% of customers.
11. What is the Law of Large Numbers in statistics?
The Law of Large Numbers states that as the number of trials increases, the sample mean gets closer and closer to the expected value (the population mean).
Also known as Gaussian distribution, Normal distribution refers to the data which
is symmetric to the mean, and data far from the mean is less frequent in
occurrence. It appears as a bell-shaped curve in graphical form, which is
symmetrical along the axes.
Central tendency – the mean, median, and mode lie at the centre, which means that
they are all equal, and the curve is perfectly symmetrical at the midpoint.
=tdist(x,deg_freedom,tails)
The p-value is expressed in decimals in Excel. Here are the steps to calculate it –
15. What are the types of biases that you can encounter while sampling?
Sampling bias occurs when you lack the fair representation of data samples during
an investigation or a survey. The six main types of biases that one can encounter
while sampling are –
Undercoverage bias
Observer Bias
Survivorship bias
Recall Bias
Exclusion Bias
Selection Bias
Significance chasing is also known by the names of Data Dredging, Data Fishing,
or Data Snooping. It refers to the reporting of insignificant results as if they are
almost significant.
A type 1 error occurs when the null hypothesis is rejected even if it is true. It is
also known as false positive.
A type 2 error occurs when the null hypothesis fails to get rejected, even if it is
false. It is also known as a false negative.
A statistical interaction occurs when the effect of one input variable on the output
depends on the level of another input variable. A real-life example is the
interaction of adding sugar and stirring in tea: neither variable alone has much
impact on sweetness, but the combination of the two does.
19. Give an example of a data set with a non-Gaussian distribution?
A common example is the binomial distribution, whose probability is
b(x; n, P) = nCx · P^x · (1 − P)^(n − x)
Where:
b = binomial probability
x = number of successes
P = probability of success on a single trial
n = number of trials
21. What are the criteria that Binomial distributions must meet?
Here are the three main criteria that Binomial distributions must meet –
The number of observation trials must be fixed. It means that one can only find the
probability of something when done only a certain number of times.
Each trial needs to be independent. It means that none of the trials should impact
the probability of other trials.
The probability of success must be the same for every trial, and each trial has only two possible outcomes (success or failure).
There’s a linear relationship between the predictor (independent) variables and the
outcome (dependent) variable. It means that the relationship between X and the
mean of Y is linear.
The errors are normally distributed and uncorrelated with each other; correlation
between errors is known as autocorrelation and violates this assumption.
The variation in the outcome or response variable is the same for all values of
independent or predictor variables. This phenomenon of assumption of equal
variance is known as homoscedasticity.
24. What are some of the low and high-bias Machine Learning algorithms?
Some of the widely used low and high-bias Machine Learning algorithms are –
Low bias – Decision Trees, Support Vector Machines, k-Nearest Neighbors, etc.
High bias – Linear Regression, Logistic Regression, Linear Discriminant Analysis, etc.
The z-test is used for hypothesis testing in statistics with a normal distribution. It is
used when the population variance is known or the sample is large.
The t-test uses a t-distribution and is applied when the population variance is unknown
and the sample size is small.
In case the sample size is large or n>30, a z-test is used. T-tests are helpful when
the sample size is small or n<30.
26. What is the equation for confidence intervals for means vs for
proportions?
To calculate the confidence interval for a mean, we use the following equations –
For n > 30: x̄ ± z* · (s / √n)
For n < 30: x̄ ± t* · (s / √n), with n − 1 degrees of freedom
For a proportion: p̂ ± z* · √(p̂(1 − p̂) / n)
In statistics, the empirical rule states that every piece of data in a normal
distribution lies within three standard deviations of the mean. It is also known as
the 68–95–99.7 rule. According to the empirical rule, the percentage of values that
lie in a normal distribution follow the 68%, 95%, and 99.7% rule. In other words,
68% of values will fall within one standard deviation of the mean, 95% will fall
within two standard deviations, and 99.7% will fall within three standard deviations
of the mean.
28. How are confidence tests and hypothesis tests similar? How are they
different?
Confidence tests and hypothesis tests both form the foundation of statistics.
The confidence interval holds importance in research to offer a strong base for
research estimations, especially in medical research. The confidence interval
provides a range of values that helps in capturing the unknown parameter.
Confidence and hypothesis testing are inferential techniques used to either estimate
a parameter or test the validity of a hypothesis using a sample of data from that
data set. While confidence interval provides a range of values for an accurate
estimation of the precision of that parameter, hypothesis testing tells us how
confident we are in accurately drawing conclusions about a parameter from a
sample. Both can be used to infer population parameters in tandem.
In case we include 0 in the confidence interval, it indicates that the sample and
population have no difference. If we get a p-value that is higher than alpha from
hypothesis testing, it means that we will fail to reject the null hypothesis.
29. What general conditions must be satisfied for the central limit theorem to
hold?
Here are the conditions that must be satisfied for the central limit theorem to hold –
The data must follow the randomization condition which means that it must be
sampled randomly.
The Independence Assumptions dictate that the sample values must be independent
of each other.
Sample sizes must be sufficiently large, typically equal to or greater than 30, for the
CLT approximation to be accurate.
In a sampling frame, divide the size of the frame N by the sample size (n) to get
‘k’, the index number. Then pick every k’th element to create your sample.
Cluster Random Sampling technique -In this technique, the population is divided
into clusters or groups in such a way that each cluster represents the population.
After that, you can randomly select clusters to sample.
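A minimal sketch of the systematic and cluster sampling steps described above (the frame of 1000 units, the sample size, and the number of clusters are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
frame = np.arange(1000)            # hypothetical sampling frame of N = 1000 units

# Systematic sampling: k = N / n, then take every k-th element from a random start.
n = 50
k = len(frame) // n
start = rng.integers(0, k)
systematic_sample = frame[start::k]

# Cluster sampling: split the frame into clusters, then randomly pick whole clusters.
clusters = np.array_split(frame, 20)                        # 20 clusters of 50 units
chosen = rng.choice(len(clusters), size=4, replace=False)   # sample 4 whole clusters
cluster_sample = np.concatenate([clusters[i] for i in chosen])

print(len(systematic_sample), len(cluster_sample))
```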
Descriptive statistics are used to summarize the basic characteristics of a data set in
a study or experiment. It has three main types – measures of frequency (distribution), measures of central tendency, and measures of variability (spread).
Qualitative data describes the characteristics of the data and is also known as
categorical data, for example, the type or category something belongs to. Quantitative
data is a measure of numerical values or counts, for example, how much or how often;
it is also known as numeric data.
The range is the difference between the highest and the lowest values, whereas the
interquartile range is the difference between the upper and lower quartiles (Q3 and Q1).
IQR = Q3 – Q1
σ = √(∑(x − µ)² / n)
In the left-skewed distribution, the left tail is longer than the right side.
Any point (x) from the normal distribution can be converted into standard normal
distribution (Z) using this formula –
Z(standardized) = (x-µ) / σ
Here, Z for any particular x value indicates how many standard deviations x is
away from the mean of all values of x.
Outliers affect A/B testing and they can be either removed or kept according to
what situation demands or the data set requirements.
Alternatively, two options can be provided – one with outliers and one without.
During post-test analysis, outliers can be removed or modified. The best way to
modify them is to trim the data set.
If there are a lot of outliers and the results are critical, then it is best to change the
value of the outliers to other values that are representative of the data set.
When outliers have meaning, they can be considered, especially in the case of mild
outliers.
The best way to detect outliers is through graphical means. Apart from that,
outliers can also be detected through the use of statistical methods using tools such
as Excel, Python, SAS, among others. The most popular graphical ways to detect
outliers include box plot and scatter plot.
42. What is the relationship between standard error and margin of error?
The margin of error is calculated from the standard error: margin of error = critical value × standard error. As the standard error increases, the margin of error also increases.
43. What is the proportion of confidence intervals that will not contain the
population parameter?
Alpha is the probability in a confidence interval that will not contain the population
parameter.
α = 1 – CL
Selection bias is a term in statistics used to denote the situation when the selected
individuals or groups in a study differ from the population of interest in a way that
introduces a systematic error in the outcome.
Typically selection bias can be identified using bivariate tests apart from using
other methods of multiple regression such as logistic regression.
Data – It includes dredging of data and cherry-picking and occurs when a large
number of variables are present in the data causing even bogus results to appear
significant.
Bessel’s correction advocates the use of n-1 instead of n in the formula of standard
deviation. It helps to increase the accuracy of results while analyzing a sample of
data to derive more general conclusions.
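In NumPy, Bessel's correction corresponds to the ddof argument; a minimal sketch (the sample values are arbitrary):

```python
import numpy as np

sample = np.array([4.1, 4.8, 5.0, 5.3, 4.6, 5.1, 4.9])   # a small hypothetical sample

biased_var = np.var(sample)            # divides by n   (population formula)
unbiased_var = np.var(sample, ddof=1)  # divides by n-1 (Bessel's correction)

print(biased_var, unbiased_var)
print(np.std(sample, ddof=1))          # sample standard deviation with the correction
```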
52. What types of variables are used for Pearson’s correlation coefficient?
Variables (both the dependent and independent variables) used for Pearson’s
correlation coefficient must be quantitative. It will only test for the linear
relationship between two variables.
In statistics, hash tables are used to store key values or pairs in a structured way. It
uses a hash function to compute an index into an array of slots in which the desired
elements can be searched.
58. What is the difference between the first quartile, the second quartile, and
the third quartile?
The first quartile is denoted by Q1 and it is the median of the lower half of the data
set.
The second quartile is denoted by Q2 and is the median of the data set.
The third quartile is denoted by Q3 and is the median of the upper half of the data
set.
About 25% of the data set lies above Q3, 75% lies below Q3 and 50% lies below
Q2. The Q1, Q2, and Q3 are the 25th, 50th, and 75th percentile respectively.
Kurtosis is a measure of the degree of extreme values present in the tails of a
distribution, or of how peaked the frequency distribution is compared to others. The
standard normal distribution has a kurtosis of 3 (an excess kurtosis of 0), and values of
skewness and excess kurtosis between -2 and +2 are generally considered acceptable. The data
sets with a high level of kurtosis imply that there is a presence of outliers. One
needs to add data or remove outliers to overcome this problem. Data sets with low
kurtosis levels have light tails and lack outliers.
Around 99.7% of data fall within three standard deviations in either direction.
Ans:- Statistics is used in many different kinds of research fields. The
fields in which statistics is used include:
Science
Technology
Biology
Computer Science
Chemistry
Business
Providing comparison
Covariance: a measure of how two random variables vary together, i.e., the extent to
which a change in one variable is accompanied by a corresponding change in the other.
It forms a statistical relationship between a pair of random variables.
Ans:- Bayesian rests on the data which is observed in reality and further considers
the probability distribution on the hypothesis.
We can see above that the p-value exceeds 0.05, so the observation is in line with the
null hypothesis; the observed result of 14 heads in 20 flips can be attributed to chance
alone, as it falls within the range of what would happen 95% of the time if the coin were
fair. In this example, we failed to reject the null hypothesis at the 5% level. The
deviation from the expected outcome is small enough to be reported as "not statistically
significant at the 5% level".
Ans:- The mode is defined as that element of the data sample, which appears most
often in the collection.
X = [1 5 5 6 3 2]; here the mode is 5, since it appears more often than any other value.
Ans:- Median is often described as that numerical value that separates the higher
half of the sample, which can be either a group or a population or even a
probability distribution from the lower half. The median can usually be found by a
limited list of numbers when all the observations are arranged from the lowest to
the highest value and picking the middle one.
Ans:-Covariance is a measure of how two variables move in sync with each other.
y 2= [1 3 4 5 6 7 8]
Ans:-T-test refers to any statistical hypothesis test in which the statistic of the test
follows a Student’s t distribution if the null hypothesis is supported.
If 36 different men are randomly selected (with population mean µ = 173 lb and standard deviation σ = 30 lb), find the probability that their mean weight is more than 180 lb.
For a single man: z = (x − µ) / σ = (180 − 173) / 30 = 0.23
For the sample mean: σx̄ = σ / √n = 30 / √36 = 5
z = (180 − 173) / 5 = 1.40
P(Z > 1.4) = 0.0808
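These tail probabilities can be checked with SciPy; a minimal sketch using the same numbers:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 173, 30, 36

# Single man heavier than 180 lb
z_single = (180 - mu) / sigma
print(z_single, norm.sf(z_single))        # ~0.23, P ≈ 0.41

# Mean of 36 men heavier than 180 lb (CLT: sigma_xbar = sigma / sqrt(n))
z_mean = (180 - mu) / (sigma / sqrt(n))
print(z_mean, norm.sf(z_mean))            # 1.4, P ≈ 0.0808
```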
Ans:-In any binary search, the array has to be arranged either in ascending or
descending order. In every step, the search key value is compared with the key
value of the middle element of the array by the algorithm. If both the keys match, a
matching element is discovered, and its index or position is returned. Otherwise, if the
search key is less than the key of the middle element, the algorithm repeats the search
on the sub-array to the left of the middle element; if the search key is greater, it
repeats the search on the sub-array to the right.
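A minimal runnable sketch of the procedure just described:

```python
def binary_search(arr, key):
    """Return the index of key in the sorted list arr, or -1 if absent."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == key:
            return mid                # keys match: element found
        elif key < arr[mid]:
            high = mid - 1            # search the left sub-array
        else:
            low = mid + 1             # search the right sub-array
    return -1

print(binary_search([2, 5, 8, 12, 16, 23, 38], 16))   # 4
print(binary_search([2, 5, 8, 12, 16, 23, 38], 7))    # -1
```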
Q22). Explain the difference between ‘long’ and ‘wide’ format data?
Ans:-In the wide format, the repeated responses of the subject will fall in a single
row, and each response will go in a separate column. In the long format, every row
makes a one-time point per subject. The data in the wide format can be recognized
by the fact that the columns are basically represented by the groups.
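A minimal pandas sketch of the two formats (the subject/week columns are hypothetical names chosen for illustration):

```python
import pandas as pd

# Wide format: one row per subject, one column per repeated response.
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "week1": [5, 7],
    "week2": [6, 9],
})

# Wide -> long: every row becomes one time point per subject.
long = wide.melt(id_vars="subject", var_name="week", value_name="score")

# Long -> wide again.
wide_again = long.pivot(index="subject", columns="week", values="score").reset_index()

print(long)
print(wide_again)
```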
Ans:-Data is usually distributed in many ways which incline to left or right. There
are high chances that data is focussed around a middle value without any particular
inclination to the left or the right. It further reaches the normal distribution and
forms a bell-shaped curve.
Unimodal or one-mode.
Both the left and right halves are symmetrical and are mirror images of each other.
Mean, mode, and even the median are all present at the center.
Asymptotic
Ans:-In both statistics and machine learning, fitting a model to a set of training
data in order to make reliable predictions on general, untrained data is a
common task. In the case of overfitting, random errors or noise is described by a
statistical model instead of an underlying relationship. In the case of overfitting,
the model is highly complex, like having too many parameters which are relative
to many observations. The overfit model has a poor predictive performance, and it
overreacts to many minor fluctuations in the training data. In the case of
underfitting, the underlying trend of the data cannot be captured by the statistical
model or even the machine learning algorithm. Even such a model has poor
predictive performance.
Q2). What is the difference between Data Analytics, Big Data, and Data
Science?
Big Data: Big Data deals with huge data volumes in structured and semi-structured
form and requires just basic knowledge of mathematics and statistics.
Data Science: Data Science deals with slicing and dicing of data and requires deep
knowledge of mathematics and statistics.
A recommender system works on the basis of the past behavior of a person and is
widely deployed in a number of fields like music preferences, movie
recommendations, research articles, social tags and search queries. With this
system, the future model can also be prepared, which can predict the person’s
future behavior and can be used to know the product the person would prefer
buying or which movie he will view or which book he will read. It uses the discrete
characteristics of the items to recommend any additional item.
Tools have the operators to perform Matrix operations and calculations using
arrays
It acts as a connecting link between a number of data sets, tools and software
It can be used to solve data-oriented problems
With the help of statistics, Data Scientists can convert a huge amount of data
into insights. These insights can give a better idea of what the customers are
expecting. With the help of statistics, Data Scientists can understand
the customer’s behavior, his engagements, interests and final conversion. They can
make powerful predictions and certain inferences. It can also be converted into
powerful propositions of business and the customers can also be offered suitable
deals.
As the data comes from multiple sources, it becomes important to extract
useful and relevant data, and therefore data cleansing becomes very important. Data
cleansing is basically the process of detecting and correcting inaccurate or corrupt
data components and deleting the irrelevant ones. For data cleansing, the data is
processed concurrently or in batches.
Data cleansing is one of the important and essential steps for data science, as the
data can be prone to errors due to a number of reasons, including human
negligence. It takes a lot of time and effort to cleanse the data, as it comes from
various sources.
Linear regression is basically used for predictive analysis. This method describes
the relationship between dependent and independent variables. In linear regression,
a single line is fitted within a scatter plot. It consists of the following three
methods:
To ensure the validity and usefulness of the model. It also helps to determine the
outcomes of various events
Answer : The question can also be phrased as to why linear regression is not a very
effective algorithm.
Linear Regression is a mathematical relationship between an independent and
dependent variable. The relationship is a direct proportion, relation making it the
most simple relationship between the variables.
Y = mX+c
Y – Dependent Variable
X – Independent Variable
The loss function in LR is known as the Log Loss function. The equation for which
is given as:
Log Loss = −(1/N) ∑ [ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]
where y is the actual label and ŷ is the predicted probability.
3. Difference between Regression and Classification?
However, there is no clear line that draws the difference between the two. We have
a few properties of both Regression and Classification. These are as follows:
Regression
If input data are ordered with respect to the time it becomes time series forecasting.
Classification
Some of the other applications of NLP are in text completion, text suggestions, and
sentence correction.
Answer: We can find different accuracy measures using a confusion matrix. These
parameters are Accuracy, Recall, Precision, F1 Score, and Specificity.
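A minimal scikit-learn sketch of computing these measures from a confusion matrix (the labels and predictions are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(accuracy_score(y_true, y_pred))      # (TP + TN) / total
print(precision_score(y_true, y_pred))     # TP / (TP + FP)
print(recall_score(y_true, y_pred))        # sensitivity: TP / (TP + FN)
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print(tn / (tn + fp))                      # specificity: TN / (TN + FP)
```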
Answer : For analyzing the data we cannot proceed with the whole volume at once
for large datasets. We need to take some samples from the data which can
represent the whole population. While making a sample out of complete data, we
should take that data which can be a true representative of the whole data set.
There are mainly two types of Sampling techniques based on Statistics.
8. What are Type 1 and Type 2 errors? In which scenarios the Type 1 and
Type 2 errors become significant?
Type 2 error is significant in cases where detecting a positive case is critical.
For example, an alarm has to be raised in case of a burglary in a bank; if the system
identifies it as a false case, the alarm will not be raised on time,
resulting in a heavy loss.
Answer :
In Overfitting the model performs well for the training data, but for any new data it
fails to provide output. For Underfitting the model is very simple and not able to
identify the correct relationship. Following are the bias and variance conditions.
Underfitting – High bias and Low Variance. Such a model does not perform well on
test data either. For example, Linear Regression is more prone to Underfitting.
For example – suppose we have a dataset with multiple features: one feature is age,
which lies in the range 18-60, and another is salary, ranging from 20000 to 2000000.
In such a case, the values differ greatly in scale: age is a two-digit integer while
salary lies in a significantly higher range. To bring the features into a comparable
range, we need Normalisation.
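A minimal scikit-learn sketch of scaling such an age/salary matrix (the four rows are made-up values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: column 0 = age (18-60), column 1 = salary (20000-2000000).
X = np.array([[18, 20000], [30, 250000], [45, 900000], [60, 2000000]], dtype=float)

X_minmax = MinMaxScaler().fit_transform(X)       # rescales each column to [0, 1]
X_standard = StandardScaler().fit_transform(X)   # each column -> mean 0, std 1

print(X_minmax)
print(X_standard)
```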
12. Describe Decision tree Algorithm and what are entropy and information
gain?
It takes the complete set of data and tries to identify a split point with the highest
information gain and the least entropy, marks it as a decision node, and proceeds
further in this manner. Entropy and information gain are the deciding factors for
identifying the nodes in a decision tree.
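A minimal sketch of how entropy and information gain can be computed for a candidate split (the parent/child label lists are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent node and the two children produced by a candidate split.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

info_gain = entropy(parent) - (
    len(left) / len(parent) * entropy(left)
    + len(right) / len(parent) * entropy(right)
)
print(info_gain)   # a higher gain means a better split candidate
```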
15. What is Imbalanced Data? How do you manage to balance the data?
Answer : If data is distributed across different categories and the distribution is
highly uneven, it is known as imbalanced data. Such datasets cause errors in model
performance by making the category with the most samples dominate the model,
resulting in an inaccurate model.
There are various techniques to handle imbalance data. We can increase the
number of samples for minority classes. We can decrease the number of samples
for classes with extremely high numbers of data points. We can use a cluster based
technique to increase number of Data points for all the categories.
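One of the simplest of these techniques, oversampling the minority class, can be sketched with scikit-learn's resample utility (the 90/10 split is an assumption for illustration):

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_minority, y_minority = X[y == 1], y[y == 1]

# Oversample the minority class with replacement until it matches the majority size.
X_up, y_up = resample(X_minority, y_minority, replace=True, n_samples=90, random_state=0)

X_balanced = np.vstack([X[y == 0], X_up])
y_balanced = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_balanced))   # [90 90]
```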
Answer : Grouping the data into different clusters based on the distribution of data
is known as Clustering technique.
2. Hierarchical Clustering.
Epsilon – The minimum radius or distance between the two data points to tag them
in the same cluster.
Min – Sample Points – The number of minimum sample which should fall under
that range to be identified as one cluster.
1. In DBSCAN we do not need to provide a fixed number of clusters; as many clusters
can be formed as the distribution of the data points supports. In k-means, on the other
hand, we need to provide the number of clusters we want to split our data into
(a minimal usage sketch follows below).
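A minimal DBSCAN sketch showing the eps and min_samples parameters described above (the blob data and parameter values are assumptions for illustration):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)  # toy data

# eps         ~ "Epsilon": maximum distance for two points to be treated as neighbours.
# min_samples ~ minimum points within eps for a point to be a core point of a cluster.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(set(labels))   # cluster ids; -1 marks noise. No cluster count was supplied.
```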
18. What do you mean by Cross Validation. Name some common cross
Validation techniques?
In this, the training data is split into different groups, and in rotation those groups
are used to validate model performance.
The common Cross Validation techniques are –
Leave-one-out cross-validation.
Holdout method
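A minimal k-fold cross-validation sketch (the iris data and logistic regression model are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k-fold: the data is split into k groups; each group is used once for validation.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores, scores.mean())
```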
Answer: Deep Learning is a branch of Machine Learning and AI that tries to
achieve better accuracy and build more complex models. Deep Learning
models are structured like the human brain, with an input layer, hidden layers,
activation functions, and an output layer.
Answer:
CNN vs RNN:
– CNN is used for spatial data such as images; RNN is used for sequential data.
– CNN has better performance and more features; RNN does not have as many.
– CNN requires input and output of fixed size; RNN can take data of any length.
– CNN is a feed-forward network with multi-layer processing; RNN is not feed-forward, it uses its own internal memory.
– CNNs use patterns between different layers to identify the next results; recurrent neural networks use time-series information and process the results based on past memories.
– Typical applications: CNN – image processing; RNN – time-series forecasting, text classification.
Supervised learning needs labeled data to train the model. For
example, to solve a classification problem (a supervised learning task), you need
labeled data to train the model and to classify the data into your labeled
groups. Unsupervised learning does not need any labeled dataset. This is the
main difference between supervised learning and unsupervised learning.
5. How do you select important variables while working on a data set?
There are various means to select important variables from a data set, including:
correlation of each feature with the target, feature importance from tree-based models,
forward/backward stepwise selection, and regularization methods such as Lasso.
6. There are many machine learning algorithms till now. If given a data set,
how can one determine which algorithm to be used for that?
So, there is no certain metric to decide which algorithm to be used for a given
situation or a data set. We need to explore the data using EDA (Exploratory
Data Analysis) and understand the purpose of using the dataset to come up with
the best fit algorithm. So, it is important to study all the algorithms in detail.
7. How are covariance and correlation different from one another?
Covariance measures how two variables are related to each other and how one
would vary with respect to changes in the other variable. If the value is positive
it means there is a direct relationship between the variables and one would
increase or decrease with an increase or decrease in the base variable
respectively, given that all other conditions remain constant.
Correlation quantifies the strength of the linear relationship between two random
variables and ranges between -1 and 1.
A value of 1 denotes a perfect positive relationship, -1 denotes a perfect negative
relationship, and 0 denotes that the two variables are uncorrelated.
Causality applies to situations where one action, say X, causes an outcome, say
Y, whereas Correlation is just relating one action (X) to another action(Y) but X
does not necessarily cause Y.
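A minimal NumPy sketch contrasting the two measures (the x and y arrays are arbitrary):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)   # loosely increases with x

print(np.cov(x, y)[0, 1])       # covariance: sign shows direction, scale depends on units
print(np.corrcoef(x, y)[0, 1])  # correlation: unitless, always between -1 and 1
```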
9. We look at machine learning software almost all the time. How do we
apply Machine Learning to Hardware?
10. Explain One-hot encoding and Label Encoding. How do they affect the
dimensionality of the given dataset?
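The source does not include an answer here. Briefly, label encoding maps each category to an integer and leaves the number of columns unchanged, while one-hot encoding adds one binary column per category and therefore increases dimensionality. A minimal sketch, using a hypothetical "colour" column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})   # hypothetical column

# Label encoding: one column, categories mapped to integers (dimensionality unchanged).
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

# One-hot encoding: one new binary column per category (dimensionality grows).
one_hot = pd.get_dummies(df["colour"], prefix="colour")

print(df)
print(one_hot)       # 3 columns for 3 categories
```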
Deep Learning is a part of machine learning that works with neural networks. It
involves a hierarchical structure of networks that set up a process to help
machines learn the human logic behind any action. We have compiled a list of
the frequently asked deep learning interview questions to help you prepare.
What is overfitting?
Both are errors in Machine Learning Algorithms. When the algorithm has
limited flexibility to deduce the correct observation from the dataset, it results in
bias. On the other hand, variance occurs when the model is extremely sensitive
to small fluctuations.
If one adds more features while building a model, it will add more complexity
and we will lose bias but gain some variance. In order to maintain the optimal
amount of error, we perform a tradeoff between bias and variance based on the
needs of a business.
Source: Understanding the Bias-Variance Tradeoff, Scott Fortmann-Roe
Bias stands for the error because of the erroneous or overly simplistic
assumptions in the learning algorithm . This assumption can lead to the model
underfitting the data, making it hard for it to have high predictive accuracy and
for you to generalize your knowledge from the training set to the test set.
Variance is also an error because of too much complexity in the learning
algorithm. This can be the reason for the algorithm being highly sensitive to
high degrees of variation in the training data, which can lead your model to overfit
the data, carrying too much noise from the training data for the model to be
very useful on your test data.
14. A data set is given to you and it has missing values which spread along
1standard deviation from the mean. How much of the data would remain
untouched?
It is given that the data is spread across mean that is the data is spread across an
average. So, we can presume that it is a normal distribution. In a normal
distribution, about 68% of data lies in 1 standard deviation from averages like
mean, mode or median. That means about 32% of the data remains uninfluenced
by missing values.
Higher variance directly means that the data spread is big and the feature has a
variety of data. Usually, high variance in a feature is seen as not so good
quality.
16. If your dataset is suffering from high variance, how would you handle
it?
For datasets with high variance, we could use the bagging algorithm to handle
it. The bagging algorithm splits the data into subgroups by sampling with replacement
from the original data. After the data is split, a model is trained on each random
sample using a training algorithm, and a polling (voting) technique is then used to
combine all the predicted outcomes of the models.
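A minimal scikit-learn sketch of bagging (the synthetic data and number of estimators are assumptions; BaggingClassifier uses decision trees by default):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: resample the training data with replacement, fit one model per sample,
# then combine (vote/average) the predictions to reduce variance.
bagger = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagger, X, y, cv=5).mean())
```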
17. A data set is given to you about utilities fraud detection. You have built
aclassifier model and achieved a performance score of 98.5%. Is this a
goodmodel? If yes, justify. If not, what can you do about it?
A data set about utilities fraud detection is usually not balanced, i.e., it is imbalanced.
In such a data set, the accuracy score cannot be the measure of performance, as the model
may only predict the majority class label correctly, but in this case our point
of interest is to predict the minority label. But often minorities are treated as
noise and ignored. So, there is a high probability of misclassification of the
minority label as compared to the majority label. For evaluating the model
performance in case of imbalanced data sets, we should use Sensitivity (True
Positive rate) or Specificity (True Negative rate) to determine class label wise
performance of the classification model. If the minority class label’s
performance is not so good, we could do the following: oversample the minority class,
undersample the majority class, use class weights in the algorithm, or evaluate with
metrics such as F1-score or AUC instead of accuracy.
Identifying missing values and dropping the rows or columns can be done using the
isnull() and dropna() functions in Pandas. Also, the fillna() function in
Pandas replaces the incorrect or missing values with a placeholder value.
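A minimal pandas sketch of these three functions (the DataFrame contents are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 41], "city": ["Pune", "Delhi", None, "Mumbai"]})

print(df.isnull().sum())            # count missing values per column
dropped = df.dropna()               # drop rows that contain any missing value
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})  # placeholder values

print(dropped)
print(filled)
```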
19. What is Time series?
21. What is the difference between stochastic gradient descent (SGD) and
gradient descent (GD)?
Gradient Descent and Stochastic Gradient Descent are the algorithms that find
the set of parameters that will minimize a loss function.
The difference is that in Gradient Descend, all training samples are evaluated
for each set of parameters. While in Stochastic Gradient Descent only one
training sample is evaluated for the set of parameters identified.
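A minimal NumPy sketch of the difference, fitting a simple line (the synthetic data, learning rate, and iteration counts are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=200)   # synthetic line y ≈ 3x + 0.5

lr = 0.1

# Batch gradient descent: every update evaluates ALL training samples.
w, b = 0.0, 0.0
for _ in range(200):
    pred = w * X[:, 0] + b
    w -= lr * np.mean((pred - y) * X[:, 0])   # gradient of 1/2 * mean squared error
    b -= lr * np.mean(pred - y)

# Stochastic gradient descent: every update evaluates ONE randomly chosen sample.
w_sgd, b_sgd = 0.0, 0.0
for _ in range(2000):
    i = rng.integers(len(y))
    err = w_sgd * X[i, 0] + b_sgd - y[i]
    w_sgd -= lr * err * X[i, 0]
    b_sgd -= lr * err

print(w, b)          # close to 3.0 and 0.5
print(w_sgd, b_sgd)  # noisier, but also close
```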
22. What is the exploding gradient problem while using back propagation
technique?
When large error gradients accumulate and result in large changes in the neural
network weights during training, it is called the exploding gradient problem.
The values of weights can become so large as to overflow and result in NaN
values. This makes the model unstable and the learning of the model to stall just
like the vanishing gradient problem.
23. Can you mention some advantages and disadvantages of decision trees?
The advantages of decision trees are that they are easier to interpret, are
nonparametric and hence robust to outliers, and have relatively few parameters
to tune.
On the other hand, the disadvantage is that they are prone to overfitting.
24. Explain the differences between Random Forest and Gradient Boosting
machines.
Random forests are a significant number of decision trees pooled using averages
or majority rules at the end. Gradient boosting machines also combine decision
trees but at the beginning of the process unlike Random forests. Random forest
creates each tree independent of the others while gradient boosting develops one
tree at a time. Gradient boosting yields better outcomes than random forests if
parameters are carefully tuned but it’s not a good option if the data set contains
a lot of outliers/anomalies/noise, as it can result in overfitting of the
model. Random forests perform well for multiclass object detection. Gradient
Boosting performs well when the data is not balanced, such as in real-time
risk assessment.
Confusion matrix (also called the error matrix) is a table that is frequently used
to illustrate the performance of a classification model i.e. classifier on a set of
test data for which the true values are well-known.
Support is a measure of how often the “item set” appears in the data set and
Confidence is a measure of how often a particular rule has been found to be
true.
P(X = x) = ∑y P(X = x, Y = y)
The Curse of Dimensionality refers to the situation when your data has too
many features.
The phrase is used to express the difficulty of using brute force or grid search to
optimize a function with too many inputs.
Dimensionality reduction techniques like PCA come to the rescue in such cases.
The idea here is to reduce the dimensionality of the data set by reducing the
number of variables that are correlated with each other. Although the variation
needs to be retained to the maximum extent.
The variables are transformed into a new set of variables that are known as
Principal Components’. These PCs are the eigenvectors of a covariance matrix
and therefore are orthogonal.
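A minimal PCA sketch with scikit-learn (the iris data and the choice of two components are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 4 correlated features

pca = PCA(n_components=2)                # keep 2 orthogonal principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component
```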
Which of the following architectures can be trained faster and needs a smaller amount
of training data?
b. Transformer architecture
A data point that is considerably distant from the other similar data points is
known as an outlier. They may occur due to experimental errors or variability in
measurement. They are problematic and can mislead a training process, which
eventually results in longer training time, inaccurate models, and poor results.
Normalization and Standardization are the two very popular methods used for
feature scaling. Normalization refers to re-scaling the values to fit into a range
of [0,1]. Standardization refers to re-scaling data to have a mean of 0 and a
standard deviation of 1 (Unit variance). Normalization is useful when all
parameters need to have the identical positive scale however the outliers from
the data set are lost. Hence, standardization is recommended for most
applications.
35. List the most popular distribution curves along with scenarios where
you will use them in an algorithm.
Visually, we can check it using plots. There is a list of Normality checks, they
are as follow:
Shapiro-Wilk W Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test
37. What is Linear Regression ?
At any given value of X, one can compute the value of Y, using the equation of
Line. This relation between Y and X, with a degree of the polynomial as 1 is
called Linear Regression.
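A minimal scikit-learn sketch of fitting such a degree-1 relationship (the X and y values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]], dtype=float)   # hypothetical X values
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])                # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope (degree-1 term) and intercept
print(model.predict([[6]]))               # value of Y at a new X
```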
If the target variable is categorical and, when you perform a frequency count on the
categories, certain categories occur far more often than others by a very significant
number, this is known as target imbalance.
40. List all assumptions for data to be met before starting with linear
regression.
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
41. When does the linear regression line stop rotating or finds an optimal
spot where it is fitted on data?
A place where the highest RSquared value is found, is the place where the line
comes to rest. RSquared represents the amount of variance captured by the
virtual linear regression line with respect to the total variance captured by the
dataset.
Since the target column is categorical, it uses a linear function of the inputs to model
the log of the odds, wrapped with the logistic (sigmoid) function, so that regression can
be used as a classifier. Hence, it is a type of classification technique and not a
regression. Its parameters are estimated by minimising a cost function (log loss).
43. What could be the issue when the beta value for a certain variable
varies way too much in each subset when regression is run on different
subsets of the given dataset?
Variation Inflation Factor (VIF) is the ratio of variance of the model to variance
of the model with only one independent variable. VIF gives the estimate of
volume of multicollinearity in a set of many regression variables.
45. Which machine learning algorithm is known as the lazy learner and
why is it called so?
Here’s a list of the top 101 interview questions with answers to help you
prepare. The first set of questions and answers are curated for freshers while the
second set is designed for advanced users.
Functions in Python refer to blocks that have organised, and reusable codes to
perform single, and related events. Functions are important to create better
modularity for applications which reuse high degree of coding. Python has a
number of built-in functions read more…
SVM has a learning rate and expansion rate which takes care of this. The
learning rate compensates or penalises the hyperplanes for making all the wrong
moves and expansion rate deals with finding the maximum separation area
between classes.
49. What are Kernels in SVM? List popular kernels used in SVM along
with a scenario of their applications.
The function of kernel is to take data as input and transform it into the required
form. A few popular Kernels used in SVM are as follows: RBF, Linear,
Sigmoid, Polynomial, Hyperbolic, Laplace, etc.
50. What is Kernel Trick in an SVM Algorithm?
Kernel Trick is a mathematical function which when applied on data points, can
find the region of classification between two different classes. Based on the
choice of function, be it linear or radial, which purely depends upon the
distribution of data, one can build a classifier.
51. What are ensemble models? Explain how ensemble techniques yield
better learning as compared to traditional classification ML algorithms?
Ensemble is a group of models that are used together for prediction both in
classification and regression class. Ensemble learning helps improve ML results
because it combines several models. By doing so, it allows a better predictive
performance compared to a single model.
They are superior to individual models as they reduce variance, average out
biases, and have lesser chances of overfitting.
52. What are overfitting and underfitting? Why does the decision tree
algorithm suffer often with overfitting problem?
In decision trees, overfitting occurs when the tree is designed to perfectly fit all
samples in the training data set. This results in branches with strict rules or
sparse data and affects the accuracy when predicting samples that aren’t part of
the training set.
53. What is OOB error and how does it occur?
For each bootstrap sample, there is one-third of data that was not used in the
creation of the tree, i.e., it was out of the sample. This data is referred to as out
of bag data. In order to get an unbiased measure of the accuracy of the model
over test data, out of bag error is used. The out of bag data is passed for each
tree is passed through that tree and the outputs are aggregated to give out of bag
error. This percentage error is quite effective in estimating the error in the
testing set and does not require further cross-validation.
Outlier is an observation in the data set that is far away from other observations
in the data set. We can discover outliers using tools and functions like box plot,
scatter plot, Z-Score, IQR score etc. and then handle them based on the
visualization we have got. To handle outliers, we can cap at some threshold, use
transformations to reduce skewness of the data and remove outliers if they are
anomalies or errors.
56. List popular cross validation techniques.
There are mainly six types of cross validation techniques. They are as follow:
K fold
Stratified k fold
Leave one out
Bootstrapping
Random search cv
Grid search cv
59. How can we use a dataset without the target variable into supervised
learning algorithms?
Input the data set into a clustering algorithm, generate optimal clusters, label the
cluster numbers as the new target variable. Now, the dataset has independent
and target variables present. This ensures that the dataset is ready to be used in
supervised learning algorithms.
64. Define and explain the concept of Inductive Bias with some examples.
Inductive Bias is a set of assumptions that humans use to predict outputs given
inputs that the learning algorithm has not encountered yet. When we are trying
to learn Y from X and the hypothesis space for Y is infinite, we need to reduce
the scope by our beliefs/assumptions about the hypothesis space which is also
called inductive bias. Through these assumptions, we constrain our hypothesis
space and also get the capability to incrementally test and improve on the data
using hyper-parameters. Examples:
66. Keeping train and test split criteria in mind, is it good to perform
scaling before the split or after the split?
Scaling should be done post-train and test split ideally. If the data is closely
packed, then scaling post or pre-split should not make much difference.
67. Define precision, recall and F1 Score?
The metric used to access the performance of the classification model is
Confusion Metric. Confusion Metric can be further interpreted with the
following terms:-
True Positives (TP) – These are the correctly predicted positive values. It
implies that the value of the actual class is yes and the value of the predicted
class is also yes.
True Negatives (TN) – These are the correctly predicted negative values. It
implies that the value of the actual class is no and the value of the predicted
class is also no.
False positives and false negatives, these values occur when your actual class
contradicts with the predicted class.
Now,
Recall, also known as Sensitivity is the ratio of true positive rate (TP), to all
observations in actual class – yes
Recall = TP/(TP+FN)
Precision is the ratio of positive predictive value, which measures the amount
of accurate positives model predicted viz a viz number of positives it claims.
Precision = TP/(TP+FP)
F1 Score is the weighted average of Precision and Recall. Therefore, this score
takes both false positives and false negatives into account. Intuitively it is not as
easy to understand as accuracy, but F1 is usually more useful than accuracy,
especially if you have an uneven class distribution. Accuracy works best if false
positives and false negatives have a similar cost. If the cost of false positives
and false negatives are very different, it’s better to look at both Precision and
Recall.
68. Plot validation score and training score with data set size on the x-axis
and another plot with model complexity on the x-axis.
For high bias in the models, the performance of the model on the validation data
set is similar to the performance on the training data set. For high variance in
the models, the performance of the model on the validation set is worse than the
performance on the training set.
69. What is Bayes’ Theorem? State at least 1 use case with respect to the
machine learning context?
Naive Bayes classifiers are a series of classification algorithms that are based on
the Bayes theorem. This family of algorithm shares a common principle which
treats every pair of features independently while being classified.
Naive Bayes is considered Naive because the attributes in it (for the class) is
independent of others in the same class. This lack of dependence between two
attributes of the same class creates the quality of naiveness.
Naive Bayes classifiers are a family of algorithms which are derived from the
Bayes theorem of probability. It works on the fundamental assumption that
every set of two features that is being classified is independent of each other and
every feature makes an equal and independent contribution to the outcome.
72. What do the terms prior probability and marginal likelihood in context
of Naive Bayes theorem mean?
Prior probability is the percentage of dependent binary variables in the data set.
If you are given a dataset and dependent variable is either 1 or 0 and percentage
of 1 is 65% and percentage of 0 is 35%. Then, the probability that any new
input for that variable of being 1 would be 65%.
Marginal likelihood is the denominator of the Bayes equation and it makes sure
that the posterior probability is valid by making its area 1.
Probability is the measure of the likelihood that an event will occur that is, what
is the certainty that a specific event will occur? Where-as a likelihood function
is a function of parameters within the parameter space that describes the
probability of obtaining the observed data.
So the fundamental difference is, Probability attaches to possible results;
likelihood attaches to hypotheses.
76. Model accuracy or Model performance? Which one will you prefer and
why?
This is a trick question, one should first get a clear idea, what is Model
Performance? If Performance means speed, then it depends upon the nature of
the application, any application related to the real-time scenario will need high
speed as an important feature. Example: The best of Search Results will lose its
virtue if the Query results do not appear fast.
If Performance is hinted at Why Accuracy is not the most important virtue – For
any imbalanced data set, more than Accuracy, it will be an F1 score than will
explain the business case and in case data is imbalanced, then Precision and
Recall will be more important than rest.
1. It is a biased estimation.
2. It is more sensitive to initialization.
78. How would you handle an imbalanced dataset?
Sampling Techniques can help with an imbalanced dataset. There are two ways
to perform sampling, Under Sample or Over Sampling.
In Under Sampling, we reduce the size of the majority class to match minority
class thus help by improving performance w.r.t storage and run-time execution,
but it potentially discards useful information.
For Over Sampling, we upsample the Minority class and thus solve the problem
of information loss, however, we get into the trouble of having Overfitting.
Exploratory Data Analysis (EDA) helps analysts to understand the data better
and forms the foundation of better models.
Visualization
Univariate visualization
Bivariate visualization
Multivariate visualization
Missing Value Treatment – Replace missing values with Either Mean/Median
Feature Engineering – Need of the domain, and SME knowledge helps Analyst
find derivative fields which can fetch more information about the nature of the
data
Prepare the suitable input data set to be compatible with the machine
learning algorithm constraints.
Enhance the performance of machine learning models.
Some of the techniques used for feature engineering include Imputation,
Binning, Outliers Handling, Log transform, grouping operations, One-Hot
encoding, Feature split, Scaling, Extracting date.
Machine learning models are about making accurate predictions about the
situations, like Foot Fall in restaurants, Stock-Price, etc. where-as, Statistical
models are designed for inference about the relationships between variables, as
What drives the sales in a restaurant, is it food or Ambience.
Decision trees have a lot of sensitiveness to the type of data they are trained on.
Hence generalization of results is often much more complex to achieve in them
despite very high fine-tuning. The results vary greatly if the training data is
changed in decision trees.
Hence bagging is utilised where multiple decision trees are made which are
trained on samples of the original data and the final result is the average of all
these individual models.
Boosting is the process of using an n-weak classifier system for prediction such
that every weak classifier compensates for the weaknesses of its classifiers. By
weak classifier, we imply a classifier which performs poorly on a given data
set.
It’s evident that boosting is not an algorithm rather it’s a process. Weak
classifiers used are generally logistic regression, shallow decision trees etc.
There are many algorithms which make use of boosting processes but two of
them are mainly used: Adaboost and Gradient Boosting and XGBoost.
The gamma defines influence. Low values meaning ‘far’ and high values
meaning ‘close’. If gamma is too large, the radius of the area of influence of the
support vectors only includes the support vector itself and no amount of
regularization with C will be able to prevent overfitting. If gamma is very
small, the model is too constrained and cannot capture the complexity of the
data.
The graphical representation of the contrast between true positive rates and the
false positive rate at various thresholds is known as the ROC curve. It is used as
a proxy for the trade-off between true positives vs the false positives.
85. What is the difference between a generative and discriminative model?
A generative model learns the different categories of data. On the other hand, a
discriminative model will only learn the distinctions between different
categories of data. Discriminative models perform much better than the
generative models when it comes to classification tasks.
86. What are hyperparameters and how are they different from
parameters?
88. What are some differences between a linked list and an array?
Arrays and Linked lists are both used to store linear data of similar types.
However, there are a few difference between them.
Operations (insertion, deletion) are faster in Linked list takes linear time, making
array operations a bit slower
Arrays are of fixed size Linked lists are dynamic and flexible
Meshgrid () function is used to create a grid using 1-D arrays of x-axis inputs
and y-axis inputs to represent the matrix indexing. Contourf () is used to draw
filled contours using the given x-axis inputs, y-axis inputs, contour line, colours
etc.
Advantages:
92. You have to train a 12GB dataset using a neural network with a
machine which has only 3GB RAM. How would you go about it?
We can use NumPy arrays to solve this issue. Load all the data into an array. In
NumPy, arrays have a property to map the complete dataset without loading it
completely in memory. We can pass the index of the array, dividing data into
batches, to get the data required and then pass the data into the neural networks.
But be careful about keeping the batch size normal.
93. Write a simple code to binarize data.
Conversion of data into binary values on the basis of certain threshold is known
as binarizing of data. Values below the threshold are set to 0 and those above
the threshold are set to 1 which is useful for feature engineering.
Code:
import pandas
import numpy
array = dataframe.values
A = array [: 0:7]
B = array [:7]
binaryA = binarizer.transform(A)
numpy.set_printoptions(precision=5)
Example:
In the above case, fruits is a list that comprises of three fruits. To access them
individually, we use their indexes. Python and C are 0- indexed languages, that
is, the first index is 0. MATLAB on the contrary starts from 1, and thus is a 1-
indexed language.
1. Advantages:
On the contrary, Python provides us with a function called copy. We can copy a
list to another just by calling the copy function.
new_list = old_list.copy()
We need to be careful while using the function. copy() is a shallow copy
function, that is, it only stores the references of the original list in the new list.
If the given argument is a compound data structure like a list then python
creates another object of the same type (in this case, a new list) but for
everything inside old list, only their reference is copied. Essentially, the new list
consists of references to the elements of the older list.
Hence, upon changing the original list, the new list values also change. This can
be dangerous in many applications. Therefore, Python provides us with another
functionality called as deepcopy. Intuitively, we may consider that deepcopy()
would follow the same paradigm, and the only difference would be that for
each element we will recursively call deepcopy. Practically, this is not the case.
deepcopy() preserves the graphical structure of the original compound data. Let
us understand this better with the help of an example:
import copy.deepcopy
a = [1,2]
c = deepcopy(b)
Therefore, this prevents unnecessary duplicates and thus preserves the structure
of the copied compound data structure. Thus, in this case, c[0] is not equal to a,
as internally their addresses are different.
Normal copy
>>> b = list(a)
>>> a
>>> b
>>> a[0][1] = 10
>>> a
Deep copy
>>> import copy
>>> b = copy.deepcopy(a)
>>> a
>>> b
>>> a[0][1] = 9
>>> a
97. Given an array of integers where each element represents the max
number of steps that can be made forward from that element. The task is to
find the minimum number of jumps to reach the end of the array (starting
from the first element). If an element is 0, then cannot move through that
element.
Let us start from the end and move backwards as that makes more sense
intuitionally. We will use variables right and prev_r denoting previous right to
keep track of the jumps.
Initially, right = prev_r = the last but one element. We consider the distance of
an element to the end, and the number of jumps possible by that element.
Therefore, if the sum of the number of jumps possible and the distance is greater
than the previous element, then we will discard the previous element and use the
second element’s value to jump. Try it out using a pen and paper first. The logic
will seem very straight forward to implement. Later, implement it on your own
and then verify with the result.
def min_jmp(arr):
n = len(arr)
count = 0
# We start from rightmost index and travesre array to find the leftmost index
while True:
for j in (range(prev_r-1,-1,-1)):
right = j
if prev_r != right:
prev_r = right
else:
break
count += 1
print(min_jmp(n, arr))
98. Given a string S consisting only ‘a’s and ‘b’s, print the last index of the
‘b’ present in it.
When we have are given a string of a’s and b’s, we can immediately find out the
first location of a character occurring. Therefore, to find the last occurrence of a
character, we reverse the string and find the first occurrence, which is
equivalent to the last occurrence in the original string.
def split(word):
a = input()
a= split(a)
a_rev = a[::-1]
pos = -1
for i in range(len(a_rev)):
if a_rev[i] == ‘b’:
pos = len(a_rev)- i -1
print(pos)
break
else:
continue
if pos==-1:
print(-1)
A = [1,2,3,4,5]
A <<2
[3,4,5,1,2]
A<<3
[4,5,1,2,3]
There exists a pattern here, that is, the first d elements are being interchanged
with last n-d +1 elements. Therefore we can just swap the elements. Correct?
What if the size of the array is huge, say 10000 elements. There are chances of
memory error, run-time error etc. Therefore, we do it more carefully. We rotate
the elements one by one in order to prevent the above errors, in case of large
arrays.
n = len( arr)
arr[i] = arr[i + 1]
arr[n-1] = tmp
n = len (arr)
rot_left_once ( arr, n)
#||
# |_|
Solution: We are given an array, where each element denotes the height of the
block. One unit of height is equal to one unit of water, given there exists space
between the 2 elements to store it. Therefore, we need to find out all such pairs
that exist which can store water. We need to take care of the possible cases:
Therefore, let us find start with the extreme elements, and move towards the
centre.
n = int(input())
# left =[arr[0]]
left.append(max(left[-1], elem) )
water = 0
# once we have the arrays left, and right, we can find the water capacity
between these arrays.
if add_water > 0:
water += add_water
print(water)
103. What are the performance metrics that can be used to estimate the
efficiency of a linear regression model?
The default method of splitting in decision trees is the Gini Index. Gini Index is
the measure of impurity of a particular node.
Ans. The p-value gives the probability of the null hypothesis is true. It gives us
the statistical significance of our results. In other words, p-value determines the
confidence of a model in a particular output.
Ans. The most important features which one can tune in decision trees are:
1. Splitting criteria
2. Min_leaves
3. Min_samples
4. Max_depth
111. Is ARIMA model a good fit for every time series problem?
Ans. No, ARIMA model is not suitable for every type of time series problem.
There are situations where ARMA model and others also come in handy.
112. How do you deal with the class imbalance in a classification problem?
115. How to deal with very few data samples? Is it possible to make a model
out of it?
Ans. If very few data samples are there, we can make use of oversampling to
produce new data points. In this way, we can have new data points.
Ans. The gamma value, c value and the type of kernel are the hyperparameters
of an SVM model.
Ans. If data is correlated PCA does not work well. Because of the correlation of
variables the effective variance of variables decreases. Hence correlated data
when used for PCA does not work well.
Manhattan
Minkowski
Tanimoto
Jaccard
Mahalanobis
Ans. Chi square test can be used for doing so. It gives the measure of
correlation between categorical predictors.
Ans. KNN is the only algorithm that can be used for imputation of both
categorical and continuous variables.
Ans. We should use ridge regression when we want to use all predictors and not
remove any as it reduces the coefficient values but does not nullify them.
Ans. Random Forest, Xgboost and plot variable importance charts can be used
for variable selection.
127. If we have a high bias error what does it mean? How to treat it?
Ans. High bias error means that that model we are using is ignoring all the
important trends in the model and the model is underfitting.
To reduce underfitting:
Sometimes it also gives the impression that the data is noisy. Hence noise from
data should be removed so that most important signals are found by the model
to make effective predictions.
128. Which type of sampling is better for a classification model and why?
1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
Ans. A categorical predictor can be treated as a continuous one when the nature
of data points it represents is ordinal. If the predictor variable is having ordinal
data then it can be treated as continuous and its inclusion in the model increases
the performance of the model.
134. Which sampling technique is most suitable when working with time-
series data?
Ans. We can use a custom iterative sampling such that we continuously add
samples to the train set. We only should keep in mind that the sample used for
validation should be added to the next train sets and a new sample is used for
validation.
1. Reduces overfitting
2. Shortens the size of the tree
3. Reduces complexity of the model
4. Increases bias
A very small chi-square test statistics implies observed data fits the expected
data extremely well.
Example – “it’s possible to have a false negative—the test says you aren’t
pregnant when you are”
Type I and Type II error in machine learning refers to false values. Type I is
equivalent to a False positive while Type II is equivalent to a False negative. In
Type I error, a hypothesis which ought to be accepted doesn’t get accepted.
Similarly, for Type II error, the hypothesis gets rejected which should have been
accepted in the first place.
147. What do you understand by L1 and L2 regularization?
Although it depends on the problem you are solving, but some general
advantages are following:
Naive Bayes:
Work well with small dataset compared to DT which need more data
Lesser overfitting
Smaller in size and faster in processing
Decision Trees:
Decision Trees are very flexible, easy to understand, and easy to debug
No preprocessing or transformation of features required
Prone to overfitting but you can use pruning or Random forests to
avoid that.
AUC (area under curve). Higher the area under the curve, better the prediction
power of the model.
It is the sum of the likelihood residuals. At record level, the natural log of the
error (residual) is calculated for each record, multiplied by minus one, and those
values are totaled. That total is then used as the basis for deviance (2 x ll) and
likelihood (exp(ll)).
The same calculation can be applied to a naive model that assumes absolutely
no predictive power, and a saturated model assuming perfect predictions.
The likelihood values are used to compare different models, while the deviances
(test, naive, and saturated) can be used to determine the predictive power and
accuracy. Logistic regression accuracy of the model will always be 100 percent
for the development data set, but that is not the case once a model is applied to
another data set.
How well does the model fit the data?, Which predictors are most important?,
Are the predictions accurate?
So the following are the criterion to access the model performance,
3. Confusion Matrix: In order to find out how well the model does in
predicting the target variable, we use a confusion matrix/ classification rate. It is
nothing but a tabular representation of actual Vs predicted values which helps
us to find the accuracy of the model.
First reason is that XGBoost is an ensemble method that uses many trees to
make a decision so it gains power by repeating itself.
SVM is a linear separator, when data is not linearly separable SVM needs a
Kernel to project the data into a space where it can separate it, there lies its
greatest strength and weakness, by being able to project data into a high
dimensional space SVM can find a linear separation for almost any data but at
the same time it needs to use a Kernel and we can argue that there’s not a
perfect kernel for every dataset.
155. What is the difference between SVM Rank and SVR (Support Vector
Regression)?
One is used for ranking and the other is used for regression.
Hard-margin
You have the basic SVM – hard margin. This assumes that data is very well
behaved, and you can find a perfect classifier – which will have 0 error on train
data.
Soft-margin
Data is usually not well behaved, so SVM hard margins may not have a solution
at all. So we allow for a little bit of error on some points. So the training error
will not be 0, but average error over all points is minimized.
Kernels
The above assume that the best classifier is a straight line. But what is it is not a
straight line. (e.g. it is a circle, inside a circle is one class, outside is another
class). If we are able to map the data into higher dimensions – the higher
dimension may give us a straight line.
An svm is a type of linear classifier. If you don’t mess with kernels, it’s
arguably the most simple type of linear classifier.
Linear classifiers (all?) learn linear fictions from your data that map your input
to scores like so: scores = Wx + b. Where W is a matrix of learned weights, b is
a learned bias vector that shifts your scores, and x is your input data. This type
of function may look familiar to you if you remember y = mx + b from high
school.
A typical svm loss function ( the function that tells you how good your
calculated scores are in relation to the correct labels ) would be hinge loss. It
takes the form: Loss = sum over all scores except the correct score of max(0,
scores – scores(correct class) + 1).
158. What are the advantages of using a naive Bayes for classification?
159. Are Gaussian Naive Bayes the same as binomial Naive Bayes?
Binomial Naive Bayes: It assumes that all our features are binary such that they
take only two values. Means 0s can represent “word does not occur in the
document” and 1s as “word occurs in the document”.
160. What is the difference between the Naive Bayes Classifier and the
Bayes classifier?
P(X|Y,Z)=P(X|Z)
For the Bayesian network as a classifier, the features are selected based on some
scoring functions like Bayesian scoring function and minimal description
length(the two are equivalent in theory to each other given that there is enough
training data). The scoring functions mainly restrict the structure (connections
and directions) and the parameters(likelihood) using the data. After the structure
has been learned the class is only determined by the nodes in the Markov
blanket(its parents, its children, and the parents of its children), and all variables
given the Markov blanket are discarded.
First, Naive Bayes is not one algorithm but a family of Algorithms that inherits
the following attributes:
1.Discriminant Functions
3.Bayesian Theorem
Since these are generative models, so based upon the assumptions of the random
variable mapping of each feature vector these may even be classified as
Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.
Selection bias stands for the bias which was introduced by the selection of
individuals, groups or data for doing analysis in a way that the proper
randomization is not achieved. It ensures that the sample obtained is not
representative of the population intended to be analyzed and sometimes it is
referred to as the selection effect. This is the part of distortion of a statistical
analysis which results from the method of collecting samples. If you don’t take
the selection bias into the account then some conclusions of the study may not
be accurate.
Recall is also known as sensitivity and the fraction of the total amount of
relevant instances which were actually retrieved.
Both precision and recall are therefore based on an understanding and measure
of relevance.
165. What Are the Three Stages of Building a Model in Machine Learning?
FAQ:
Any way that suits your style of learning can be considered as the best way to
learn. Different people may enjoy different methods. Some of the common ways
would be through taking up a Machine Learning Course, watching YouTube
videos, reading blogs with relevant topics, read books which can help you self-
learn.
Most hiring companies will look for a masters or doctoral degree in the relevant
domain. The field of study includes computer science or mathematics. But
having the necessary skills even without the degree can help you land a ML job
too.
The most common way to get into a machine learning career is to acquire the
necessary skills. Learn programming languages such as C, C++, Python, and
Java. Gain basic knowledge about various ML algorithms, mathematical
knowledge about calculus and statistics. This will help you go a long way.
Machine Learning is a vast concept that contains a lot different aspects. With
the right guidance and with consistent hard-work, it may not be very difficult to
learn. It definitely requires a lot of time and effort, but if you’re interested in the
subject and are willing to learn, it won’t be too difficult.
Machine Learning for beginners will consist of the basic concepts such as types
of Machine Learning (Supervised, Unsupervised, Reinforcement Learning).
Each of these types of ML have different algorithms and libraries within them,
such as, Classification and Regression. There are various classification
algorithms and regression algorithms such as Linear Regression. This would be
the first thing you will learn before moving ahead with other concepts.
Stay tuned to this page for more such information on interview questions
and career assistance . You can check our other blogs about Machine
Learning for more information.
You can also take up the PGP Artificial Intelligence and Machine Learning
Course offered by Great Learning in collaboration with UT Austin. The course
offers online learning with mentorship and provides career assistance as well.
The curriculum has been designed by faculty from Great Lakes and The
University of Texas at Austin-McCombs and helps you power ahead your
career.
Machine learning is the form of Artificial Intelligence that deals with system
programming and automates data analysis to enable computers to learn and act
through experiences without being explicitly programmed.
For example, Robots are coded in such a way that they can perform the tasks
based on data they collect from sensors. They automatically learn programs from
data and improve with experiences.
For example, if we have to explain to a kid that playing with fire can cause burns.
There are two ways we can explain this to a kid; we can show training examples of
various fire accidents or images of burnt people and label them as "Hazardous". In
this case, a kid will understand with the help of examples and not play with the
fire. It is the form of Inductive machine learning. The other way to teach the same
thing is to let the kid play with the fire and wait to see what happens. If the kid gets
a burn, it will teach the kid not to play with fire and avoid going near it. It is the
form of deductive learning.
3) What is the difference between Data Mining and Machine Learning?
Data mining can be described as the process in which the structured data tries to
abstract knowledge or interesting unknown patterns. During this process, machine
learning algorithms are used.
The possibility of overfitting occurs when the criteria used for training the model is
not as per the criteria used to judge the efficiency of a model.
Overfitting occurs when we have a small dataset, and a model is trying to learn
from it. By using a large amount of data, overfitting can be avoided. But if we have
a small database and are forced to build a model based on that, then we can use a
technique known as cross-validation. In this method, a model is usually given a
dataset of a known data on which training data set is run and dataset of unknown
data against which the model is tested. The primary aim of cross-validation is to
define a dataset to "test" the model in the training phase. If there is sufficient data,
'Isotonic Regression' is used to prevent overfitting.
o Machine learning is all about algorithms which are used to parse data, learn
from that data, and then apply whatever they have learned to make informed
decisions.
o Deep learning is a part of machine learning, which is inspired by the
structure of the human brain and is particularly useful in feature detection.
10) What are the different types of Algorithm methods in Machine Learning?
o Supervised Learning
o Semi-supervised Learning
o Unsupervised Learning
o Transduction
o Reinforcement Learning
11) What do you understand by Reinforcement Learning technique?
Both bias and variance are errors. Bias is an error due to erroneous or overly
simplistic assumptions in the learning algorithm. It can lead to the model under-
fitting the data, making it hard to have high predictive accuracy and generalize the
knowledge from the training set to the test set.
Variance is an error due to too much complexity in the learning algorithm. It leads
to the algorithm being highly sensitive to high degrees of variation in the training
data, which can lead the model to overfit the data.
To optimally reduce the number of errors, we will need to tradeoff bias and
variance.
Classification Regression
14) What are the five popular algorithms we use in Machine Learning?
o Decision Trees
o Probabilistic Networks
o Neural Networks
o Support Vector Machines
o Nearest Neighbor
Numerous models, such as classifiers are strategically made and combined to solve
a specific computational program which is known as ensemble learning. The
ensemble methods are also known as committee-based learning or learning
multiple classifier systems. It trains various hypotheses to fix the same issue. One
of the most suitable examples of ensemble modeling is the random forest trees
where several decision trees are used to predict outcomes. It is used to improve the
classification, function approximation, prediction, etc. of a model.
17) What are the three stages of building the hypotheses or model in machine
learning?
o Model building
It chooses a suitable algorithm for the model and trains it according to the
requirement of the problem.
o Applying the model
It is responsible for checking the accuracy of the model through the test
data.
o Model testing
It performs the required changes after testing and apply the final model.
In supervised learning, the standard approach is to split the set of example into the
training set and the test.
20) What are the common ways to handle missing data in a dataset?
Missing data is one of the standard factors while working with data and handling.
It is considered as one of the greatest challenges faced by the data analysts. There
are many ways one can impute the missing values. Some of the common methods
to handle missing data in datasets can be defined as deleting the rows, replacing
with mean/median/mode, predicting the missing values, assigning a unique
category, using algorithms that support missing values, etc.
22) What are the necessary steps involved in Machine Learning Project?
There are several essential steps we must follow to achieve a good working model
while doing a Machine Learning Project. Those steps may include parameter
tuning, data preparation, data collection, training the model, model
evaluation, and prediction, etc.
Precision and Recall both are the measures which are used in the information
retrieval domain to measure how good an information retrieval system reclaims the
related data as requested by the user.
On the other side, recall is the fraction of relevant instances that have been
retrieved over the total amount or relevant instances. The recall is also known
as sensitivity.
Decision Trees can be defined as the Supervised Machine Learning, where the data
is continuously split according to a certain parameter. It builds classification or
regression models as similar as a tree structure, with datasets broken up into ever
smaller subsets while developing the decision tree. The tree can be defined by two
entities, namely decision nodes, and leaves. The leaves are the decisions or the
outcomes, and the decision nodes are where the data is split. Decision trees can
manage both categorical and numerical data.
o Classification
o Speech Recognition
o Regression
o Predict Time Series
o Annotate Strings
30) What is SVM in machine learning? What are the classification methods
that SVM can handle?
SVM stands for Support Vector Machine. SVM are supervised learning models
with an associated learning algorithm which analyze the data used for classification
and regression analysis.
But there are many use-cases where we don't know the quantity of data to be
stored. For such cases, advanced data structures are required, and one such data
structure is linked list.
There are some points which explain how the linked list is different from an array:
Where,
33) Explain True Positive, True Negative, False Positive, and False
Negative in Confusion Matrix with an example.
o True Positive
When a model correctly predicts the positive class, it is said to be a true
positive.
For example, Umpire gives a Batsman NOT OUT when he is NOT OUT.
o True Negative
When a model correctly predicts the negative class, it is said to be a true
negative.
For example, Umpire gives a Batsman OUT when he is OUT.
o False Positive
When a model incorrectly predicts the positive class, it is said to be a false
positive. It is also known as 'Type I' error.
For example, Umpire gives a Batsman NOT OUT when he is OUT.
o False Negative
When a model incorrectly predicts the negative class, it is said to be a false
negative. It is also known as 'Type II' error.
For example, Umpire gives a Batsman OUT when he is NOT OUT.
34) What according to you, is more important between model accuracy and
model performance?
36) What are the similarities and differences between bagging and boosting in
Machine Learning?
o Although they are built independently, but for Bagging, Boosting tries to
add new models which perform well where previous models fail.
o Only Boosting determines the weight for the data to tip the scales in favor of
the most challenging cases.
o Only Boosting tries to reduce bias. Instead, Bagging may solve the problem
of over-fitting while boosting can increase it.
o Logical
It contains a set of Bayesian Clauses, which capture the qualitative structure
of the domain.
Machine Learning solves Real-World problems. Unlike the hard coding rule to
solve the problem, machine learning algorithms learn from the data.
The learnings can later be used to predict the feature. It is paying off for early
adopters.
The simplest answer is to make our lives easier. In the early days of “intelligent”
applications, many systems used hardcoded rules of “if” and “else” decisions to
process data or adjust the user input. Think of a spam filter whose job is to move
the appropriate incoming email messages to a spam folder.
But with the machine learning algorithms, we are given ample information for the
data to learn and identify the patterns from the data.
Unlike the normal problems we don’t need to write the new rules for each
problem in machine learning, we just need to use the same workflow but with a
different dataset.
Let’s talk about Alan Turing, in his 1950 paper, “Computing Machinery and
Intelligence”, Alan asked, “Can machines think?”
Full paper here
The paper describes the “Imitation Game”, which includes three participants -
The judge asks the other two participants to talk. While they respond the judge
needs to decide which response came from the computer. If the judge could not
tell the difference the computer won the game.
The test continues today as an annual competition in artificial intelligence. The
aim is simple enough: convince the judge that they are chatting to a human
instead of a computer chatbot program.
There are various types of machine learning algorithms. Here is the list of them in
a broad category based on:
Example: 01
Knowing the height and weight identifying the gender of the person. Below are
the popular supervised learning algorithms.
Support Vector Machines
Regression
Naive Bayes
Decision Trees
K-nearest Neighbour Algorithm and Neural Networks.
Example: 02
If you build a T-shirt classifier, the labels will be “this is an S, this is an M and
this is L”, based on showing the classifier examples of S, M, and L.
Clustering,
Anomaly Detection,
Neural Networks and Latent Variable Models.
Example:
In the same example, a T-shirt clustering will categorize as “collar style and V
neck style”, “crew neck style” and “sleeve types”.
Bayes’ theorem states the following relationship, given class variable y and
dependent vector x1 through xn:
Using the naive conditional independence assumption that each xiis independent:
for all I this relationship is simplified to:
P(xi | yi, x1, ..., xi-1, xi+1, ...., xn) = P(xi | yi)
Since, P(x1,..., xn) is a constant given the input, we can use the following
classification rule:
P(yi | x1, ..., xn) = P(y) ni=1P(xi | yi)P(x1,...,xn) and we can also use Maximum
A Posteriori (MAP) estimation to estimate P(yi)and P(yi | xi) the former is then
the relative frequency of class yin the training set.
The different naive Bayes classifiers mainly differ by the assumptions they make
regarding the distribution of P(yi | xi): can be Bernoulli, binomial, Gaussian, and
so on.
In this case, PCA measures the variation in each variable (or column in the table).
If there is little variation, it throws the variable out, as illustrated in the figure
below:
Principal component analysis (PCA)
Thus making the dataset easier to visualize. PCA is used in finance, neuroscience,
and pharmacology.
It is very useful as a preprocessing step, especially when there are linear
correlations between features.
Suppose we have given some data points that each belong to one of two classes,
and the goal is to separate two classes based on a set of examples.
There are many hyperplanes that classify the data. To choose the best hyperplane
that represents the largest separation or margin between the two classes.
If such a hyperplane exists, it is known as a maximum-margin hyperplane and the
linear classifier it defines is known as a maximum margin classifier. The best
hyperplane that divides the data in H3
We have data (x1, y1), ..., (xn, yn), and different features (xii, ..., xip), and yiis
either 1 or -1.
w. x-b = 0
w . xi - b = 1 or w. xi - b = -1
Support Vector Machine (SVM)
A Support Vector Machine (SVM) is an algorithm that tries to fit a line (or plane
or hyperplane) between the different classes that maximizes the distance from the
line to the points of the classes.
In this way, it tries to find a robust separation between the classes. The Support
Vectors are the points of the edge of the dividing hyperplane as in the below
figure.
Support Vector Machine (SVM)
Cross-validation is a method of splitting all your data into three parts: training,
testing, and validation data. Data is split into k subsets, and the model has trained
on k-1of those datasets.
The last subset is held for testing. This is done for each of the subsets. This is k-
fold cross-validation. Finally, the scores from all the k-folds are averaged to
produce the final score.
Cross-validation
Bias in data tells us there is inconsistency in data. The inconsistency may occur
for several reasons which are not mutually exclusive.
For example, a tech giant like Amazon to speed the hiring process they build one
engine where they are going to give 100 resumes, it will spit out the top five, and
hire those.
When the company realized the software was not producing gender-neutral
results it was tweaked to remove this bias.
12. Explain the Difference Between Classification and Regression?
Let’s have a look at this table before directly jumping into the F1 score.
F1 = 2TP/2TP + FP + FN
We see scores for F1 between 0 and 1, where 0 is the worst score and 1 is the best
score.
The F1 score is typically used in information retrieval to see how well a model
retrieves relevant results and our model is performing.
Precision and recall are ways of monitoring the power of machine learning
implementation. But they often used at the same time.
Precision answers the question, “Out of the items that the classifier predicted to
be relevant, how many are truly relevant?”
Whereas, recall answers the question, “Out of all the items that are truly relevant,
how many are found by the classifier?
In general, the meaning of precision is the fact of being exact and accurate. So the
same will go in our machine learning model as well. If you have a set of items
that your model needs to predict to be relevant. How many items are truly
relevant?
The below figure shows the Venn diagram that precision and recall.
Overfitting means the model fitted to training data too well, in this case, we need
to resample the data and estimate the model accuracy using techniques like k-fold
cross-validation.
Whereas for the Underfitting case we are not able to understand or capture the
patterns from the data, in this case, we need to change the algorithms, or we need
to feed more data points to the model.
The different neurons are connected via connections that help information flow
from one neuron to another.
17. What are Loss Function and Cost Functions? Explain the key Difference
Between them?
When calculating loss we consider only a single data point, then we use the term
loss function.
Whereas, when calculating the sum of error for multiple data then we use the cost
function. There is no major difference.
In other words, the loss function is to capture the difference between the actual
and predicted values for a single record whereas cost functions aggregate the
difference for the entire training dataset.
The Most commonly used loss functions are Mean-squared error and Hinge loss.
Where y = -1 or 1 indicating two classes and y represents the output form of the
classifier. The most common cost function represents the total cost as the sum of
the fixed costs and the variable costs in the equation y = mx + b
There are many reasons for a model to be different. Few reasons are:
Different Population
Different Hypothesis
Different modeling techniques
When working with the model’s training and testing data, we will experience an
error. This error might be bias, variance, and irreducible error.
Now the model should always have a balance between bias and variance, which
we call a bias-variance trade-off.
There are many ensemble techniques available but when aggregating multiple
models there are two general methods:
Bagging, a native method: take the training set and generate new training sets off
of it.
Boosting, a more elegant method: similar to bagging, boosting is used to optimize
the best weighting scheme for a training set.
19. How do you make sure which Machine Learning Algorithm to use?
It completely depends on the dataset we have. If the data is discrete we use SVM.
If the dataset is continuous we use linear regression.
So there is no specific way that lets us know which ML algorithm to use, it all
depends on the exploratory data analysis (EDA).
Based on the above observations select one best-fit algorithm for a particular
dataset.
Box plot
Z-score
Scatter plot, etc.
Like bagging and boosting, random forest works by combining a set of other tree
models. Random forest builds a tree from a random sample of the columns in the
test data.
Here’s are the steps how a random forest creates the trees:
Hierarchical clustering
K means clustering
Density-based clustering
Fuzzy clustering, etc.
24. How can you select K for K-means Clustering?
There are two kinds of methods that include direct methods and statistical testing
methods:
The silhouette is the most frequently used while determining the optimal value of
k.
Data required for recommender systems stems from explicit user ratings after
watching a film or listening to a song, from implicit search engine queries and
purchase histories, or from other knowledge about the users/items themselves.
Visually, we can use plots. A few of the normality checks are as follows:
Shapiro-Wilk Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test
Correlation is used for measuring and also for estimating the quantitative
relationship between two variables. Correlation measures how strongly two
variables are related. Examples like, income and expenditure, demand and
supply, etc.
P-values are used to make a decision about a hypothesis test. P-value is the
minimum significant level at which you can reject the null hypothesis. The lower
the p-value, the more likely you reject the null hypothesis.
Parametric models will have limited parameters and to predict new data, you only
need to know the parameter of the model.
The sigmoid function is used for binary classification. The probabilities sum
needs to be 1. Whereas, Softmax function is used for multi-classification. The
probabilities sum will be 1.
(VisionNLP provide)
Ans. The sigmoid function is used for the two-class logistic regression, whereas
the softmax function is used for the multiclass logistic regression (a.k.a. MaxEnt,
multinomial logistic regression, softmax Regression, Maximum Entropy
Classifier).
Ans. Optimizers are algorithms or methods used to change the attributes of the
neural network such as weights and learning rate to reduce the losses. Optimizers
are used to solve optimization problems by minimizing the function.
Ans. The Idea behind the precision-recall trade-off is that when a person changes
the threshold for determining if a class is positive or negative it will tilt the
scales. It means that it will cause precision to increase and recall to decrease, or
vice versa.