100 Data Science Interview Questions and Answers

Uploaded by

Rajachandra Voodiga

100 Data Science Interview Questions and Answers (General) for 2018

Hone yourself to be the ideal candidate at your next data scientist job interview with these
frequently asked data science interview questions. Data scientist interview questions asked at a
job interview can fall into one of the following categories:

• Technical data scientist interview questions based on data science programming languages like
Python, R, etc.
• Technical data scientist interview questions based on statistics, probability, math, machine
learning, etc.
• Practical experience or role-based data scientist interview questions based on the projects you
have worked on and how they turned out.

DeZyre has got you covered with a series of blog posts that will help you prepare for your next
data science interview.

In collaboration with data scientists, industry experts and top counsellors, we have put together a list of
general data science interview questions and answers to help you with your preparation in applying for
data science jobs. This first part of a series of data science interview questions and answers articles
focuses only on general topics like questions around data, probability, statistics and other data science
concepts. It also includes a list of open-ended questions that interviewers ask to get a feel of how often
and how quickly you can think on your feet. There are some data analyst interview questions in this blog
which can also be asked in a data science interview. These kinds of analytics interview questions also
measure whether you were successful in applying data science techniques to real-life problems.

Data Science Interview Questions and Answers

Data Science is not an easy field to get into. This is something all data scientists will agree on. Apart from
having a degree in mathematics/statistics or engineering, a data scientist also needs to go through intense
training to develop all the skills required for this field. Apart from the degree/diploma and the training, it
is important to prepare the right resume for a data science job, and to be well versed with the data science
interview questions and answers.
Consider our top 100 Data Science Interview Questions and Answers as a starting point for your data
scientist interview preparation. Even if you are not looking for a data scientist position now, because you
are still working your way through hands-on projects and learning programming languages like Python and
R, you can start practicing these data scientist interview questions and answers. They will lay the
foundation for data science interviews and help you impress potential employers by knowing your subject
and being able to show the practical implications of data science.
Top 100 Data Scientist Interview Questions and Answers
1) Differentiate between Data Science , Machine Learning and AI.

Criteria | Data Science | Machine Learning | Artificial Intelligence

Definition | Data Science is not exactly a subset of machine learning but it uses machine learning to analyse and make future predictions. | A subset of AI that focuses on a narrow range of activities. | A wide term that focuses on applications ranging from Robotics to Text Analysis.

Role | It can take on a business role. | It is a purely technical role. | It is a combination of both business and technical aspects.

Scope | Data Science is a broad term for diverse disciplines and is not merely about developing and training models. | Machine learning fits within the data science spectrum. | AI is a sub-field of computer science.

AI | Loosely integrated. | Machine learning is a sub-field of AI and is tightly integrated. | A sub-field of computer science consisting of various tasks like planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, and creative work.

Data Science vs Machine Learning

2) Python or R – Which one would you prefer for text analytics?
The best possible answer for this would be Python, because it has the Pandas library, which provides
easy-to-use data structures and high-performance data analysis tools.

3) Which technique is used to predict categorical responses?

Classification techniques are widely used in data mining to predict categorical responses by assigning
each record in a data set to a class.

4) What is logistic regression? Or State an example when you have used logistic regression
recently.
Logistic regression, often referred to as the logit model, is a technique to predict a binary outcome from a
linear combination of predictor variables. For example, suppose you want to predict whether a particular
political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 1 or 0
(win/lose). The predictor variables here would be the amount of money spent on election campaigning
by a particular candidate, the amount of time spent campaigning, etc.

5) What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user
would give to a product. Recommender systems are widely used in movies, news, research articles,
products, social tags, music, etc.

6) Why does data cleaning play a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can
work with is a cumbersome process because, as the number of data sources increases, the time taken to
clean the data increases exponentially due to the number of sources and the volume of data generated by
these sources. Cleaning alone might take up to 80% of the time, making it a critical part of the
analysis task.

7) Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques which can be differentiated based on the number of
variables involved at a given point of time. For example, the pie charts of sales based on territory involve
only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot,
then it is referred to as bivariate analysis. For example, analysing sales volume against spending can be
considered an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the
responses is referred to as multivariate analysis.

8) What do you understand by the term Normal Distribution?

Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled
up. However, data can also be distributed around a central value without any bias to the left
or right, in which case it follows a normal distribution in the form of a bell-shaped curve. The random
variables are distributed in the form of a symmetrical bell-shaped curve.

Image Credit : mathisfun.com

9) What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a
second variable X. X is referred to as the predictor variable and Y as the criterion variable.
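
As a rough illustration of predicting Y from X, here is a minimal least-squares sketch in plain Python. The function name and the data are hypothetical, chosen only so that the fitted line is easy to check by hand.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear_regression(x, y)
predicted_y_at_6 = slope * 6 + intercept
```

Here X is the predictor and Y the criterion variable; the fitted slope and intercept are then used to score new values of X.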

10) What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is
approximating a value by extending a known set of values or facts beyond their range.
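
The difference can be sketched with a straight line through two known points. This assumes a linear relationship purely for illustration; the helper name and the numbers are hypothetical.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points: (1, 10) and (3, 30).
interpolated = linear_estimate(1, 10, 3, 30, 2)  # x = 2 lies between the known points
extrapolated = linear_estimate(1, 10, 3, 30, 5)  # x = 5 lies beyond the known range
```

The same formula serves both cases; what changes is whether the query point lies inside or outside the known range, and extrapolated values are generally less trustworthy.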

11) What is power analysis?

An experimental design technique for determining the sample size required to detect an effect of a given
size with a given degree of confidence.

12) What is K-means? How can you select K for K-means?

13) What is Collaborative filtering?
The process of filtering used by most of the recommender systems to find patterns or information by
collaborating viewpoints, various data sources and multiple agents.

14) What is the difference between Cluster and Systematic Sampling?

Cluster sampling is a technique used when it becomes difficult to study a target population spread
across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample
where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical
technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is
traversed in a circular manner, so once you reach the end of the list you continue from the top again.
The best example of systematic sampling is the equal probability method.
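
A minimal sketch of systematic sampling with wrap-around, assuming a random start and a fixed interval of len(frame) // n. The function name, the seed parameter and the frame are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick n elements from an ordered frame: random start, then every k-th element."""
    rng = random.Random(seed)
    k = len(frame) // n                 # sampling interval
    start = rng.randrange(len(frame))   # random starting position
    # The modulo makes the traversal circular: past the end, continue from the top.
    return [frame[(start + i * k) % len(frame)] for i in range(n)]


frame = list(range(100))  # hypothetical ordered sampling frame
sample = systematic_sample(frame, 10, seed=0)
```

Because the start is uniformly random and the interval is fixed, every element of this frame has the same chance of selection.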

15) Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally used when
talking about a probability distribution or sample population, whereas expected value is generally used
in the context of a random variable.

For Sampling Data

Mean value is the only value that comes from the sampling data.

Expected Value is the mean of all the means i.e. the value that is built from multiple samples. Expected
value is the population mean.

For Distributions
Mean value and Expected value are same irrespective of the distribution, under the condition that the
distribution is in the same population.

16) What does P-value signify about the statistical data?

P-value is used to determine the significance of results after a hypothesis test in statistics. P-value helps
the readers to draw conclusions and is always between 0 and 1.

• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null
hypothesis cannot be rejected.

• P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null
hypothesis can be rejected.

• P-value = 0.05 is the marginal value, indicating it is possible to go either way.

17) Do gradient descent methods always converge to same point?

No, they do not, because in some cases gradient descent reaches a local minimum (a local optimum) rather
than the global optimum. Where it ends up depends on the data and the starting conditions.
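
To illustrate the dependence on starting conditions, here is a small sketch that runs plain gradient descent on a one-dimensional function with two minima. The function f, the learning rate and the starting points are assumptions chosen for the demonstration.

```python
def gradient_descent(grad, x0, learning_rate=0.01, steps=5000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def f(x):
    # Hypothetical objective with two minima, one near x = -1 and one near x = 1.
    return (x * x - 1) ** 2 + 0.3 * x


def f_grad(x):
    # Derivative of f.
    return 4 * x * (x * x - 1) + 0.3


left = gradient_descent(f_grad, x0=-2.0)   # ends near the minimum around x = -1
right = gradient_descent(f_grad, x0=2.0)   # ends near the minimum around x = +1
```

Both runs converge, but to different points, and f(left) is lower than f(right), so the run started at x0 = 2.0 is stuck in a local minimum.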

18) What are categorical variables?

19) A test has a true positive rate of 100% and false positive rate of 5%. There is a population
with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the
probability of having that condition?
Let’s suppose you are being tested for a disease, if you have the illness the test will end up saying you
have the illness. However, if you don’t have the illness- 5% of the times the test will end up saying you
have the illness and 95% of the times the test will give accurate result that you don’t have the illness.
Thus there is a 5% error in case you do not have the illness.

Out of 1000 people, the 1 person who has the disease will get a true positive result.

Out of the remaining 999 people, 5% will get a false positive result.

That is, close to 50 people will get a false positive result for the disease.

This means that out of 1000 people, 51 people will test positive for the disease even though only one
person has the illness. There is only about a 2% probability (1 in 51) of you having the disease even if
your reports say that you have the disease.
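
The same calculation can be written with Bayes' rule. This is a small sketch; the function and parameter names are hypothetical.

```python
def posterior_given_positive(prevalence, true_positive_rate, false_positive_rate):
    """P(condition | positive test) by Bayes' rule."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Numbers from the question: 1/1000 prevalence, 100% true positive rate, 5% false positive rate.
probability = posterior_given_positive(
    prevalence=1 / 1000, true_positive_rate=1.0, false_positive_rate=0.05
)
```

The result is a little under 2%, in line with the rough count of 1 true positive among about 51 positive tests.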

20) How you can make data normal using Box-Cox transformation?
21) What is the difference between Supervised Learning and Unsupervised Learning?
If an algorithm learns from labelled training data so that the knowledge can be applied to the test
data, then it is referred to as supervised learning. Classification is an example of supervised learning.
If the algorithm has no response variable (labels) to learn from in the data, then it is referred to as
unsupervised learning. Clustering is an example of unsupervised learning.

22) Explain the use of Combinatorics in data science.

23) Why is vectorization considered a powerful method for optimizing numerical code?
24) What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of
A/B testing is to identify changes to a web page that maximize or increase an outcome of
interest. An example of this could be identifying the click-through rate for a banner ad.

25) What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the
eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a
particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to
as the strength of the transformation in the direction of eigenvector or the factor by which the
compression occurs.
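
As an illustration, the dominant eigenvalue and eigenvector of a small symmetric matrix can be approximated with power iteration. Power iteration is a technique chosen here for the sketch and is not mentioned in the text above; the matrix is a hypothetical covariance matrix.

```python
import math


def power_iteration(matrix, steps=100):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)  # arbitrary starting direction
    for _ in range(steps):
        # Apply the linear transformation, then rescale to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient: how much the matrix stretches v along its own direction.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v


covariance = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical 2x2 covariance matrix
eigenvalue, eigenvector = power_iteration(covariance)
```

For this matrix the direction settles on the diagonal (equal components) and the eigenvalue approaches 3, meaning vectors along that direction are stretched by a factor of 3.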

26) What is Gradient Descent?

27) How can outlier values be treated?
Outlier values can be identified by using univariate or any other graphical analysis method. If the number
of outlier values is small they can be assessed individually, but for a large number of outliers the values
can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outlier
values. The most common ways to treat outlier values are:

1) To change the value and bring it within a range

2) To just remove the value.
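
A minimal sketch of the first option, capping values at the 1st and 99th percentiles. It assumes a simple nearest-rank percentile definition; the helper names and the data are hypothetical.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile (q is an integer 0..100) of an already sorted list."""
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-q * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def cap_outliers(values, lower_q=1, upper_q=99):
    """Bring every value within the [lower_q, upper_q] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


data = list(range(1, 100)) + [10_000]  # 1..99 plus one extreme value
capped = cap_outliers(data)
```

In this example the extreme value 10,000 is pulled down to the 99th percentile (99) while all other values are left unchanged.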

28) How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis-

• Using Classification Matrix to look at the true negatives and false positives.

• Concordance that helps identify the ability of the logistic model to differentiate between the event
happening and not happening.

• Lift helps assess the logistic model by comparing it with random selection.

29) What are various steps involved in an analytics project?

• Understand the business problem

• Explore the data and become familiar with it.

• Prepare the data for modelling by detecting outliers, treating missing values, transforming
variables, etc.

• After data preparation, start running the model, analyse the result and tweak the approach. This is
an iterative step till the best possible outcome is achieved.
• Validate the model using a new data set.

• Start implementing the model and track the result to analyse the performance of the model over
the period of time.

30) How can you iterate over a list and also retrieve element indices at the same time?
This can be done using the enumerate function, which takes every element in a sequence such as a list
and pairs it with its index, placed just before the element.
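
A short example, using a hypothetical list:

```python
languages = ["Python", "R", "SQL"]

# enumerate yields (index, element) pairs, starting at 0 by default.
indexed = [(index, name) for index, name in enumerate(languages)]

# The optional start argument changes the first index.
numbered = [f"{position}. {name}" for position, name in enumerate(languages, start=1)]
```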

31) During analysis, how do you treat missing values?

The extent of the missing values is identified after identifying the variables with missing values. If any
patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and
meaningful business insights. If no patterns are identified, the missing values can be substituted with
mean or median values (imputation), or they can simply be ignored. There are various factors to be
considered when answering this question:

• Understand the problem statement and the data before giving the answer. Getting into the data is important. One
option is to assign a default value, which can be the mean, minimum or maximum value.
• If it is a categorical variable, the missing value is assigned a default value.
• If the data follows a normal distribution, substitute the mean value.
• Whether missing values should be treated at all is another important point to consider. If 80% of the values for a
variable are missing, you can answer that you would drop the variable instead of treating the missing values.
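
The imputation and "drop the variable" ideas above can be sketched as follows. Missing values are represented as None, and the function names and data are hypothetical.

```python
from statistics import mean, median


def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 28]  # hypothetical column with two missing values
share = missing_share(ages)          # check this first; a very high share suggests dropping
filled = impute(ages, strategy="median")
```

Checking the missing share before imputing mirrors the last point in the list: a variable that is mostly missing is usually dropped rather than filled in.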

32) Explain about the box cox transformation in regression models.

For some reason or other, the response variable for a regression analysis might not satisfy one or more
assumptions of an ordinary least squares regression. The residuals could either curve as the prediction
increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response
variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical
technique to transform non-normal dependent variables into a normal shape. Most statistical techniques
assume normality, so if the given data is not normal, applying a Box-Cox transformation
means that you can run a broader number of tests.
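
For reference, the Box-Cox transform of a positive value y for a given lambda is (y^lambda - 1) / lambda, or ln(y) when lambda is 0. A minimal sketch follows, with a hypothetical function name and data; in practice lambda is usually estimated from the data rather than fixed by hand.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # Limiting case of the formula as lambda approaches 0.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


skewed = [1.0, 4.0, 9.0]                 # hypothetical right-skewed values
square_root_like = box_cox(skewed, 0.5)  # lambda = 0.5 behaves like a square root
log_transformed = box_cox(skewed, 0)     # lambda = 0 is the natural log
```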

33) Can you use machine learning for time series analysis?
Yes, it can be used but it depends on the applications.

34) Write a function that takes in two sorted lists and outputs a sorted list that is their union.
The first solution that will come to mind is to merge the two lists and sort them afterwards.
Python code-
def return_union(list_a, list_b):
    return sorted(list_a + list_b)
R code-
return_union <- function(list_a, list_b)
{
list_c<-list(c(unlist(list_a),unlist(list_b)))
return(list(list_c[[1]][order(list_c[[1]])]))
}
Generally, the tricky part of the question is not to use any sorting or ordering function. In that case you
will have to write your own logic to answer the question and impress your interviewer.

Python code-
def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0  # current position in list_b
    k = 0  # current position in list_a

    for i in range(len1 + len2):
        if k == len1:
            # list_a is exhausted: append the rest of list_b
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            # list_b is exhausted: append the rest of list_a
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list

A similar function can be written in R by following the same steps.

return_union <- function(list_a,list_b)

{
#Initializing length variables
len_a <- length(list_a)
len_b <- length(list_b)
len <- len_a + len_b

#initializing counter variables

j=1
k=1

#Creating an empty list which has length equal to sum of both the lists
list_c <- vector("list", len)

#Here goes our for loop

for(i in 1:len)
{
    if(j>len_a)
      {
        list_c[i:len] <- list_b[k:len_b]
        break
      }
    else if(k>len_b)
      {
        list_c[i:len] <- list_a[j:len_a]
        break
      }
    else if(list_a[[j]] <= list_b[[k]])
      {
        list_c[[i]] <- list_a[[j]]
        j <- j+1
      }
    else if(list_a[[j]] > list_b[[k]])
    {
      list_c[[i]] <- list_b[[k]]
      k <- k+1
    }
}
return(list(unlist(list_c)))

}

35) What is the difference between Bayesian Estimate and Maximum Likelihood Estimation
(MLE)?
In Bayesian estimation we have some knowledge about the data or problem (the prior). There may be several
values of the parameters which explain the data, and hence we can look for multiple parameter values, like 5
gammas and 5 lambdas, that do this. As a result of Bayesian estimation, we get multiple models for making
multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example
needs to be predicted, then computing the weighted sum of these predictions serves the purpose.

Maximum likelihood does not take prior into consideration (ignores the prior) so it is like being a
Bayesian while using some kind of a flat prior.

36) What is Regularization and what kind of problems does regularization solve?
37) What is multicollinearity and how you can overcome it?
38) What is the curse of dimensionality?
39) How do you decide whether your linear regression model fits the data?
40) What is the difference between squared error and absolute error?
41) What is Machine Learning?
The simplest way to answer this question is – we give the data and equation to the machine. Ask the
machine to look at the data and identify the coefficient values in an equation.
For example for the linear regression y=mx+c, we give the data for the variable x, y and the machine
learns about the values of m and c from the data.

42) How are confidence intervals constructed and how will you interpret them?
43) How will you explain logistic regression to an economist, physician scientist and biologist?
44) How can you overcome Overfitting?
45) Differentiate between wide and tall data formats?
46) Is Naïve Bayes bad? If yes, under what aspects.
47) How would you develop a model to identify plagiarism?
48) How will you define the number of clusters in a clustering algorithm?
Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-
Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar
entities in a way that the entities within a group are similar to each other but the groups are different from
each other.

For example, the following image shows three different groups.

Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS
for a range of numbers of clusters, you will get the plot shown below. The graph is generally known as the
Elbow Curve.
The red circled point in the above graph, i.e. Number of Clusters = 6, is the point after which you don't see
any significant decrease in WSS. This point is known as the bending point and is taken as K in K-Means.

This is the widely used approach, but a few data scientists also use hierarchical clustering first to create
dendrograms and identify the distinct groups from there.
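
The elbow idea can be sketched with a tiny one-dimensional K-Means that reports WSS for several values of K. This is a simplified illustration under stated assumptions (1-D data, deterministic initial centroids, fixed number of iterations); the function name and the data are hypothetical.

```python
def kmeans_1d(points, k, iterations=20):
    """Very small 1-D K-Means; returns (centroids, within-cluster sum of squares)."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep the old one if the cluster is empty.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    # WSS: squared distance of every point to its nearest centroid.
    wss = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, wss


# Hypothetical data with three obvious groups around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss_by_k = {k: kmeans_1d(points, k)[1] for k in range(1, 5)}
```

For this data WSS drops sharply up to K = 3 and barely changes after that, so the bend in the curve suggests K = 3.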

49) Is it better to have too many false negatives or too many false positives?
50) Is it possible to perform logistic regression with Microsoft Excel?
It is possible to perform logistic regression with Microsoft Excel. There are two ways to do it using Excel.

a) One is to use Add-ins provided by many websites which we can use.

b) Second is to use fundamentals of logistic regression and use Excel’s computational power to build a
logistic regression

But when this question is asked in an interview, the interviewer is not looking for the name of an add-in
but rather a method using the base Excel functionalities.

Let’s use a sample data to learn about logistic regression using Excel. (Example assumes that you are
familiar with basic concepts of logistic regression)
Data shown above consists of three variables where X1 and X2 are independent variables and Y is a class
variable. We have kept only 2 categories for our purpose of binary logistic regression classifier.

Next we have to create a logit function using independent variables, i.e.

Logit = L = β0 + β1X1 + β2X2

We have kept the initial values of beta 1 and beta 2 as 0.1 for now, and we will use Excel Solver to optimize
the beta values in order to maximize our log likelihood estimate.

Assuming that you are aware of logistic regression basics, we calculate probability values from Logit
using following formula:

Probability= e^Logit/(1+ e^Logit )

e is the base of the natural logarithm, i.e. e ≈ 2.71828

Let’s put it into excel formula to calculate probability values for each of the observation.

The conditional probability is the probability of the predicted Y, given the set of independent variables X.

This probability can be calculated as:

$$P(X)^{Y_{actual}} \cdot [1 - P(X)]^{(1 - Y_{actual})}$$

Then we have to take the natural log of the above function:

$$\ln\left[ P(X)^{Y_{actual}} \cdot [1 - P(X)]^{(1 - Y_{actual})} \right]$$

Which turns out to be:

$$Y_{actual} \cdot \ln[P(X)] + (1 - Y_{actual}) \cdot \ln[1 - P(X)]$$

Log likelihood function LL is the sum of above equation for all the observations
Log likelihood LL will be sum of column G, which we just calculated
The objective is to maximize the Log Likelihood i.e. cell H2 in this example. We have to maximize H2
by optimizing B0, B1, and B2.
We’ll use Excel’s solver add-in to achieve the same.

Excel comes with this add-in pre-installed, and you should see it under the Data tab in Excel as shown below.

If you don't see it there, make sure you have loaded it. To load an add-in in Excel:

Go to File >> Options >> Add-Ins and see whether the checkbox in front of the required add-in is checked.
Make sure to check it to load the add-in into Excel.
If you don't see the Solver Add-in there, go to the bottom of the screen (Manage Add-Ins) and click on OK.
Next you will see a popup window which should have your Solver add-in present. Check the checkbox
in front of the add-in name. If you don't see it there either, click on Browse and direct it to the
folder which contains the Solver Add-in.

Once you have Solver loaded, click on the Solver icon under the Data tab and you will see a new window
pop up like this:

Put H2 in Set Objective, select Max and fill in cells E2 to E4 in the next form field.
By doing this we have told Solver to maximize H2 by changing the values in cells E2 to E4.

Now click on Solve button at the bottom –

You will see a popup like below -

This shows that Solver has found a local maximum, but we need the global maximum.
Keep clicking on Continue until it shows the popup below.

It shows that Solver was able to find a solution and converge. In case it is not able to converge, it will
throw an error. Select “Keep Solver Solution” and click on OK to accept the solution provided by Solver.

Now you can see that the values of the beta coefficients B0, B1 and B2 have changed and our log likelihood
function has been maximized.
Using these beta values you can calculate the probability, and hence the response variable, by deciding the
probability cut-off.
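
The same procedure (compute a probability per row, sum the log likelihood, then adjust the betas to maximize it) can be sketched outside Excel. The sketch below uses plain gradient ascent as a stand-in for Solver, which is an assumption made for illustration, and the rows are hypothetical because the spreadsheet data is not reproduced here.

```python
import math


def sigmoid(z):
    """Probability = e^Logit / (1 + e^Logit), written to avoid overflow."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)


def log_likelihood(betas, rows):
    """Sum over rows of Y*ln(P) + (1 - Y)*ln(1 - P)."""
    b0, b1, b2 = betas
    total = 0.0
    for x1, x2, y in rows:
        p = sigmoid(b0 + b1 * x1 + b2 * x2)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(rows, betas=(0.1, 0.1, 0.1), learning_rate=0.01, steps=5000):
    """Gradient ascent on the log likelihood (stand-in for Excel Solver)."""
    b = list(betas)
    for _ in range(steps):
        gradient = [0.0, 0.0, 0.0]
        for x1, x2, y in rows:
            error = y - sigmoid(b[0] + b[1] * x1 + b[2] * x2)
            for i, feature in enumerate((1.0, x1, x2)):
                gradient[i] += error * feature
        b = [bi + learning_rate * g for bi, g in zip(b, gradient)]
    return b


# Hypothetical (X1, X2, Y) rows; not the data from the original spreadsheet.
rows = [
    (1.0, 2.0, 0), (2.0, 1.0, 0), (2.5, 3.0, 1), (3.0, 2.0, 0), (3.0, 2.0, 1),
    (3.5, 3.5, 1), (4.0, 1.5, 1), (1.5, 3.5, 0), (4.5, 4.0, 1),
]
initial_betas = (0.1, 0.1, 0.1)
fitted_betas = fit(rows, initial_betas)
initial_ll = log_likelihood(initial_betas, rows)
fitted_ll = log_likelihood(fitted_betas, rows)
```

The fitted log likelihood is higher than the initial one, which is the same improvement Solver reports in cell H2 after optimizing the betas.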

51) What do you understand by Fuzzy merging ? Which language will you use to handle it?
52) What is the difference between skewed and uniform distribution?
When the observations in a dataset are spread equally across the range of the distribution, it is referred to
as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more
observations on one side of the graph than the other are referred to as skewed distributions. Distributions
with fewer observations on the left (towards lower values) are said to be skewed left, and distributions
with fewer observations on the right (towards higher values) are said to be skewed right.

53) You created a predictive model of a quantitative outcome variable using multiple regressions.
What are the steps you would follow to validate the model?
Since the question is about the post model-building exercise, we will assume that you have already
tested for the null hypothesis, multicollinearity and the standard error of the coefficients.

Once you have built the model, you should check for following –

· Global F-test to see the significance of group of independent variables on dependent variable

· R^2

· Adjusted R^2

· RMSE, MAPE

In addition to above mentioned quantitative metrics you should also check for-
· Residual plot

· Assumptions of linear regression
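
The quantitative metrics listed above can be computed directly from actual and predicted values. This is a minimal sketch; the function name and the numbers are hypothetical, and MAPE as written assumes no actual value is zero.

```python
import math


def regression_metrics(actual, predicted, n_predictors):
    """R^2, adjusted R^2, RMSE and MAPE for a fitted regression model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mape = 100 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return {"r2": r2, "adjusted_r2": adjusted_r2, "rmse": rmse, "mape": mape}


# Hypothetical hold-out values and model predictions.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 50.0]
metrics = regression_metrics(actual, predicted, n_predictors=1)
```

Adjusted R^2 is always at most R^2 because it penalizes the number of predictors, which makes it the safer figure to quote for multiple regression.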

54) What do you understand by Hypothesis in the context of Machine Learning?

55) What do you understand by Recall and Precision?
Recall measures "Of all the actual true samples how many did we classify as true?"

Precision measures "Of all the samples we classified as true how many are actually true?"

We will explain this with a simple example for better understanding -

Imagine that your wife gave you a surprise every year on your anniversary for the last 12 years. One day,
all of a sudden, your wife asks, "Darling, do you remember all the anniversary surprises from me?"

This simple question puts your life in danger. To save your life, you need to recall all 12 anniversary
surprises from memory. Thus, Recall (R) is the ratio of the number of events you can correctly recall to
the number of all correct events. If you can recall all 12 surprises correctly then the recall ratio is 1
(100%), but if you can recall only 10 surprises correctly out of the 12 then the recall ratio is 0.83 (83.3%).

However, you might be wrong in some cases. For instance, you answer 15 times; 10 of the surprises
you guess are correct and 5 are wrong. This implies that your recall ratio is 83.3% (10 of the 12 real
surprises) but your precision is 66.67% (10 of your 15 answers).

Precision is the ratio of the number of events you can correctly recall to the number of all events you recall
(combination of wrong and correct recalls).
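
In terms of counts, with 12 real surprises, 15 answers and 10 correct answers, there are 10 true positives, 5 false positives and 2 false negatives. A small sketch with a hypothetical function name:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# 10 correct answers, 5 wrong answers, 2 real surprises never mentioned.
precision, recall = precision_recall(true_positives=10, false_positives=5, false_negatives=2)
```

This gives a precision of about 66.7% (10 of 15 answers) and a recall of about 83.3% (10 of 12 surprises).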

56) How will you find the right K for K-means?

57) Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?
Regularization in statistics or in the field of machine learning is used to include some extra information
in order to solve a problem in a better way. L1 and L2 regularization are generally used to add constraints
to optimization problems.

In the example shown above, H0 is a hypothesis. If you observe, in L1 there is a high likelihood of hitting the
corners as solutions, while in L2 there is not. At a corner some of the parameters are exactly zero, so L1
solutions tend to be sparse.
In other words, the L2 penalty squares each parameter, so it penalizes large parameters heavily but hardly
penalizes parameters that are already close to zero, leaving little incentive to push them to exactly zero.
The L1 penalty grows with the absolute value, so it keeps pushing small parameters all the way to zero.
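
One way to see the difference is to compare how each penalty shrinks a single weight. The sketch below uses the closed-form minimizers of 0.5*(x - w)^2 + strength*|x| for L1 (soft-thresholding) and 0.5*(x - w)^2 + 0.5*strength*x^2 for L2. This one-weight setting is a simplification chosen for illustration, and the names and numbers are hypothetical.

```python
def l1_shrink(w, strength):
    """Soft-thresholding: move w toward zero by `strength`, stopping at exactly zero."""
    if w > strength:
        return w - strength
    if w < -strength:
        return w + strength
    return 0.0


def l2_shrink(w, strength):
    """L2 shrinkage: rescale w toward zero; it only reaches zero if w was zero."""
    return w / (1 + strength)


weights = [3.0, 0.4, -0.2, -2.5]  # hypothetical unregularized weights
l1_weights = [l1_shrink(w, 0.5) for w in weights]
l2_weights = [l2_shrink(w, 0.5) for w in weights]
```

With L1 the two small weights become exactly zero (a sparse result), while with L2 every weight is scaled down but stays non-zero.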
58) How can you deal with different types of seasonality in time series modelling?
Seasonality in a time series occurs when the series shows a repeated pattern over time. For example,
stationery sales decreasing during the holiday season and air conditioner sales increasing during the
summer are examples of seasonality in a time series.

Seasonality makes your time series non-stationary because the average value of the variable differs between
time periods. Differencing a time series is generally known as the best method of removing seasonality from
a time series. Seasonal differencing can be defined as the numerical difference between a particular value
and the value one periodic lag earlier (i.e. a lag of 12, if monthly seasonality is present).
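
A minimal sketch of seasonal differencing, using hypothetical quarterly data (lag 4) instead of monthly data to keep the example short:

```python
def seasonal_difference(series, lag):
    """Subtract from each value the value observed one season (lag periods) earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical quarterly series: a repeating seasonal pattern plus a rise of 5 per year.
pattern = [10, 20, 30, 20]
series = [pattern[t % 4] + 5 * (t // 4) for t in range(12)]
differenced = seasonal_difference(series, lag=4)
```

The repeating pattern cancels out and only the constant year-over-year change of 5 remains, so the differenced series has a stable mean.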

59) In experimental design, is it necessary to do randomization? If yes, why?

60) What do you understand by conjugate-prior with respect to Naïve Bayes?
61) Can you cite some examples where a false positive is more important than a false negative?
Before we start, let us understand what false positives and false negatives are.

False positives are the cases where you wrongly classify a non-event as an event, a.k.a. Type I error.

False negatives are the cases where you wrongly classify events as non-events, a.k.a. Type II error.
In the medical field, assume you have to give chemotherapy to patients. Your lab tests patients for certain
vital information, and based on those results the doctors decide whether to give chemotherapy to a patient.

Assume a patient comes to that hospital and tests positive for cancer based on the lab prediction, but he
doesn't actually have cancer. What will happen to him? (Assuming sensitivity is 1)

One more example might come from marketing. Let's say an e-commerce company decides to give a $1000
gift voucher to the customers it expects to purchase at least $5000 worth of items. It sends the
free voucher mail directly to 100 customers without any minimum purchase condition, because it
assumes it will make at least 20% profit on items sold above $5000.

Now what if the vouchers were sent to false positive cases?

62) Can you cite some examples where a false negative is more important than a false positive?
Assume there is an airport ‘A’ which has received high security threats, and based on certain
characteristics it identifies whether a particular passenger can be a threat or not. Due to a shortage of
staff, the airport decides to scan only the passengers predicted as risk positives by its predictive model.

What will happen if a passenger who is a true threat is flagged as a non-threat by the airport model?

Another example can be the judicial system. What if a jury or judge decides to let a criminal go free?

What if you refused to marry a very good person based on your predictive model, and you happen to
meet him/her after a few years and realize that you had a false negative?

63) Can you cite some examples where both false positives and false negatives are equally
important?
In the banking industry, giving loans is the primary source of making money, but at the same time if your
repayment rate is not good you will not make any profit; rather, you will risk huge losses.

Banks don't want to lose good customers, and at the same time they don't want to acquire bad
customers. In this scenario both the false positives and the false negatives become very important to
measure.

These days we hear of many cases of players using steroids during sport competitions. Every player has to go
through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and
a false negative can make the game unfair.

64) Can you explain the difference between a Test Set and a Validation Set?
The validation set can be considered a part of the training data, as it is used for hyperparameter selection and to
avoid overfitting of the model being built. The test set, on the other hand, is held out and used only for
evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as -

 Training set is used to fit the parameters, i.e. weights.

 Validation set is used to tune the hyperparameters.

 Test set is used to assess the performance of the model, i.e. its predictive power and generalization.
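The three-way split described above can be sketched in a few lines of Python. This is only an illustration: the function name train_val_test_split, the 60/20/20 proportions and the fixed seed are assumptions made for the example, not something prescribed by the answer.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows once, then cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # touched only once, for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used to compare hyperparameter settings
    train = shuffled[n_test + n_val:]        # used to fit the weights
    return train, val, test


# Hypothetical dataset of 100 row ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The key design point is that the three lists never share a row, so the test score is not influenced by any tuning decision.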
65) What makes a dataset gold standard?
66) What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.).
Sensitivity is nothing but "correctly predicted true events / total true events". True events here are the events
which were true and which the model also predicted as true. In hypothesis-testing terms this is the same
quantity as statistical power, 1 - P(Type II error).

Calculation of sensitivity is pretty straightforward -

Sensitivity = True Positives / Positives in Actual Dependent Variable

where true positives are positive events which are correctly classified as positives.
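As a rough sketch of the formula above, sensitivity can be computed directly from actual and predicted labels. The labels below are made-up illustration data, with 1 standing for an event and 0 for a non-event, and the function name is an assumption of this example.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (true positives + false negatives)."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_positives / (true_positives + false_negatives)


# Made-up labels: 4 actual events, of which the model catches 3.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
print(sensitivity(actual, predicted))  # 3 of 4 actual events found: 0.75
```

Note that the one false positive in the example (the fifth observation) does not affect sensitivity at all, which is why it is usually reported alongside a measure such as specificity or precision.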

67) What is the importance of having a selection bias?

Selection bias occurs when appropriate randomization is not achieved while selecting individuals,
groups or data to be analysed. Selection bias implies that the obtained sample does not exactly represent
the population that was actually intended to be analysed. Types of selection bias include sampling bias,
data, attrition and time interval.

68) Give some situations where you will use an SVM over a Random Forest machine learning
algorithm and vice-versa.
SVM and Random Forest are both used in classification problems.

a) If you are sure that your data is outlier free and clean then go for SVM. Conversely, if your
data might contain outliers then Random Forest would be the better choice.

b) Generally, SVM consumes more computational power than Random Forest, so if you are
constrained on memory go for the Random Forest machine learning algorithm.

c) Random Forest gives you a very good idea of variable importance in your data, so if you want to have
variable importance then choose the Random Forest machine learning algorithm.

d) Random Forest machine learning algorithms are preferred for multiclass problems.

e) SVM is preferred for high-dimensional problem sets, like text classification.

But as a good data scientist, you should experiment with both of them and test for accuracy, or rather you
can use an ensemble of many machine learning techniques.

69) What do you understand by feature vectors?

70) How do data management procedures like missing data handling make selection bias worse?
Missing value treatment is one of the primary tasks which a data scientist is supposed to do before
starting data analysis. There are multiple methods for missing value treatment. If not done properly, it
could potentially result in selection bias. Let us see a few missing value treatment examples and their impact
on selection -

Complete Case Treatment: Complete case treatment is when you remove an entire row of data even if only one
value is missing. You could introduce a selection bias if your values are not missing at random and they
have some pattern. Assume you are conducting a survey and a few people didn't specify their gender.
Would you remove all those people? Couldn't that tell a different story?

Available Case Analysis: Let's say you are trying to calculate a correlation matrix for the data, so you
remove the missing values only from the variables which are needed for that particular correlation coefficient. In
this case your values will not be fully comparable, as each coefficient is computed from a different subset of the data.

Mean Substitution: In this method missing values are replaced with the mean of the other available values. This
might make your distribution biased, e.g. standard deviation, correlation and regression are mostly
dependent on the mean value of variables.

Hence, various data management procedures might introduce selection bias in your data if not chosen
correctly.
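To make the effect of mean substitution concrete, here is a small sketch using only Python's standard library. The observed values and the number of missing entries are made up for illustration.

```python
import statistics

# Made-up sample: five observed values and three missing ones.
observed = [12.0, 15.0, 9.0, 20.0, 14.0]
n_missing = 3

# Mean substitution: every missing value is replaced by the observed mean.
fill_value = statistics.mean(observed)
imputed = observed + [fill_value] * n_missing

sd_before = statistics.stdev(observed)
sd_after = statistics.stdev(imputed)

print("mean before/after:", statistics.mean(observed), statistics.mean(imputed))
print("std dev before/after:", round(sd_before, 3), round(sd_after, 3))
```

The mean is unchanged, but the standard deviation shrinks because the filled-in values add observations without adding any spread. Anything downstream that relies on the variance (correlations, regression standard errors) inherits that distortion.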

71) What are the advantages and disadvantages of using regularization methods like Ridge
Regression?
72) What do you understand by long and wide data formats?
73) What do you understand by outliers and inliers? What would you do if you find them in your
dataset?
74) Write a program in Python which takes input as the diameter of a coin and weight of the coin
and produces output as the money value of the coin.
75) What are the basic assumptions to be made for linear regression?
Normality of error distribution, statistical independence of errors, homoscedasticity (constant variance of
errors), linearity and additivity.

76) Can you write the formula to calculate R-square?

R-Square can be calculated using the formula below -

1 - (Residual Sum of Squares/ Total Sum of Squares)
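The formula above translates directly into code. The sketch below is a plain-Python illustration; the function name and the sample actual/predicted values are assumptions made for the example.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    residual_ss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    total_ss = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - residual_ss / total_ss


# Made-up actual values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

# Residual SS = 4 * 0.25 = 1, total SS = 9 + 1 + 1 + 9 = 20, so R-square is about 0.95.
print(r_squared(y_true, y_pred))
```

A model that always predicts the mean of y_true gets an R-square of 0 under this formula, and a perfect model gets 1.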

77) What is the advantage of performing dimensionality reduction before fitting an SVM?

A Support Vector Machine tends to perform better in the reduced space. It is beneficial to
perform dimensionality reduction before fitting an SVM if the number of features is large compared
to the number of observations.

78) How will you assess the statistical significance of an insight, i.e. whether it is a real insight or just
due to chance?
The statistical significance of an insight can be assessed using hypothesis testing.

79) How would you create a taxonomy to identify key customer trends in unstructured data?
The best way to approach this question is to mention that it is good to check with the business owner and
understand their objectives before categorizing the data. Having done this, it is always good to follow an
iterative approach: pull new data samples, validate the model for accuracy by soliciting feedback from the
stakeholders of the business, and improve it accordingly. This helps ensure that your model
is producing actionable results and improving over time.

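80) How will you find the correlation between a categorical variable and a continuous variable?
You can use the analysis of variance (ANOVA) technique, or the analysis of covariance (ANCOVA) technique
when other continuous covariates have to be controlled for, to measure the association between a categorical
variable and a continuous variable.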
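One simple way to put a number on such an association is the correlation ratio (eta squared): the share of the total variation in the continuous variable that is explained by the group means. This is the same between-group versus total sum-of-squares split that a one-way analysis of variance works with; it is shown here as a swapped-in illustration, not as a full ANCOVA. The function name and the data are made up for the example.

```python
def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    groups = {}
    for category, value in zip(categories, values):
        groups.setdefault(category, []).append(value)

    grand_mean = sum(values) / len(values)
    between_ss = sum(
        len(members) * (sum(members) / len(members) - grand_mean) ** 2
        for members in groups.values()
    )
    total_ss = sum((value - grand_mean) ** 2 for value in values)
    return between_ss / total_ss


# Made-up data: a two-level categorical variable and a continuous variable.
segment = ["a", "a", "b", "b"]
spend = [1.0, 3.0, 5.0, 7.0]

# Group means are 2 and 6 around a grand mean of 4: between SS = 16, total SS = 20.
print(correlation_ratio(segment, spend))  # about 0.8
```

A value near 0 means the category tells you almost nothing about the continuous variable; a value near 1 means the group means account for nearly all of its variation.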

81) Can you tell if the equation given below is linear or not?

Emp_sal = 2000 + 2.5(emp_age)^2

Yes, it is a linear equation, as it is linear in its coefficients even though emp_age enters as a squared term.

What will be the output of the following R programming code?

var2<- c("I","Love,"DeZyre")

var2

It will give an error, because the closing quotation mark after Love is missing.

Data Science Puzzles - Brainstorming/Puzzle-based Data Science Interview Questions asked in Data
Scientist Job Interviews
1) How many Piano Tuners are there in Chicago?
To solve this kind of a problem, we need to know –

How many Pianos are there in Chicago?

How often would a Piano require tuning?

How much time does it take for each tuning?

We need to build these estimates to solve this kind of a problem. Let's assume Chicago has close
to 10 million people and on average there are 2 people in a house, which gives 5 million households. For
every 20 households there is 1 piano. Now the question of how many pianos there are can be answered: 1 in
20 households has a piano, so there are approximately 250,000 pianos in Chicago.

The next question is "How often would a piano require tuning?" There is no exact answer to this
question. It could be once a year or twice a year. The interviewer is trying to test whether you take
this into consideration or not. Let's suppose each piano
requires tuning once a year, so on the whole 250,000 piano tunings are required every year.

Let's suppose that a piano tuner works for 50 weeks in a year, with a 5 day week. Thus a piano
tuner works for 250 days in a year. Let's suppose tuning a piano takes 2 hours; then in an 8 hour workday
the piano tuner would be able to tune only 4 pianos. At this rate, a piano tuner can tune 1000
pianos a year.

Thus, 250 piano tuners are required in Chicago considering the above estimates.
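The chain of estimates above is easy to reproduce in code, which also makes it simple to see how the answer moves when an assumption changes. All the inputs below are the rough guesses used in the answer, not real statistics, and the variable names are chosen for this example.

```python
# Rough guesses from the estimate above, not real statistics.
population = 10_000_000
people_per_household = 2
households_per_piano = 20
tunings_per_piano_per_year = 1
hours_per_tuning = 2
hours_per_workday = 8
workdays_per_year = 50 * 5  # 50 weeks of 5 days

households = population // people_per_household
pianos = households // households_per_piano
tunings_needed = pianos * tunings_per_piano_per_year

tunings_per_tuner = (hours_per_workday // hours_per_tuning) * workdays_per_year
tuners_needed = tunings_needed / tunings_per_tuner

print(pianos, tunings_per_tuner, tuners_needed)  # 250000 1000 250.0
```

Setting tunings_per_piano_per_year to 2, for instance, doubles the estimate to 500 tuners, which is the kind of sensitivity an interviewer may ask about.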

2) There is a race track with five lanes. There are 25 horses of which you want to find out the three
fastest horses. What is the minimal number of races needed to identify the 3 fastest horses of those
25?
Divide the 25 horses into 5 groups of 5 horses each. A race within each of the 5 groups (5
races) will determine the winner of each group. A race between the 5 group winners (race 6) will determine
the winner of the winners, which must be the fastest horse. A final race (race 7) between the 2nd and 3rd
place horses from the fastest horse's group, the 1st and 2nd place horses from the group whose winner came
second in the winners' race, and the winner of the group whose winner came third will
determine the second and third fastest horses of the 25. Thus, 7 races are needed.
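The seven-race strategy can be checked with a short simulation. This is a sketch under the assumption that each horse has a fixed, distinct speed and always finishes in speed order; the function name and the random speeds are made up for the example.

```python
import random


def top_three_in_seven_races(speeds):
    """Return (indices of the 3 fastest horses in order, number of races used).

    speeds is a list of 25 distinct numbers; a higher number means a faster horse.
    """
    races = 0

    def race(horses):
        nonlocal races
        assert len(horses) <= 5  # the track has only five lanes
        races += 1
        return sorted(horses, key=lambda h: speeds[h], reverse=True)

    # Races 1-5: one race per group of five.
    groups = [race(list(range(start, start + 5))) for start in range(0, 25, 5)]
    # Race 6: the five group winners against each other.
    winners = race([group[0] for group in groups])

    group_of = {group[0]: group for group in groups}
    a, b, c = (group_of[w] for w in winners[:3])

    # Race 7: the only horses that can still be 2nd or 3rd overall.
    final = race([a[1], a[2], b[0], b[1], c[0]])
    return [a[0], final[0], final[1]], races


speeds = random.Random(1).sample(range(1000), 25)
podium, races_used = top_three_in_seven_races(speeds)
print(podium, races_used)
```

Every other horse is ruled out because at least three horses are already known to be faster than it, which is why five candidates are enough for the last race.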
3) Estimate the number of french fries sold by McDonald's every day.
4) How many times in a day do a clock's hands overlap?
5) You have two beakers. The first beaker contains 4 litres of water and the second one contains 5
litres of water. How can you pour exactly 7 litres of water into a bucket?
6) A coin is flipped 1000 times and 560 times heads show up. Do you think the coin is biased?
7) Estimate the number of tennis balls that can fit into a plane.
8) How many haircuts do you think happen in the US every year?
9) In a city where residents prefer only boys, every family in the city continues to give birth to
children until a boy is born. If a girl is born, they plan for another child. If a boy is born, they stop.
Find out the proportion of boys to girls in the city.
Probability Interview Questions for Data Science
1. There are two companies manufacturing electronic chips. Company A manufactures defective chips with a
probability of 20% and good quality chips with a probability of 80%. Company B manufactures defective chips
with a probability of 80% and good chips with a probability of 20%. If you get just one electronic chip, what is the
probability that it is a good chip?
2. Suppose that you now get a pack of 2 electronic chips coming from the same company, either A or B. When you
test the first electronic chip it appears to be good. What is the probability that the second electronic chip you
received is also good?
3. On a dating site, users can select 5 out of 24 adjectives to describe themselves. A match is declared between
two users if they match on at least 4 adjectives. If Brad and Angelina randomly pick adjectives, what is the
probability that they will form a match?
4. A coin is tossed 10 times and the results are 2 tails and 8 heads. How will you analyse whether the coin is fair or
not? What is the p-value for the same?
5. In continuation of the above question, if each coin is tossed 10 times (100 tosses are made in total), will you modify
your approach to testing the fairness of the coin or continue with the same?
6. An ant is placed on an infinitely long twig. The ant can move one step backward or one step forward with same
probability during discrete time steps. Find out the probability with which the ant will return to the starting point.
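For question 4 above, one possible approach is an exact two-sided binomial test under the null hypothesis that the coin is fair. The sketch below is an illustration of that approach, not a model answer; the function name is an assumption, and the two-sided convention used (summing every outcome that is no more likely than the observed one) is one of several in use.

```python
from math import comb


def two_sided_p_value(heads, tosses):
    """Exact two-sided p-value for a fair coin.

    Adds up the probabilities of all outcomes that are no more likely than the
    observed number of heads. Counts are compared as integers to avoid rounding.
    """
    observed_ways = comb(tosses, heads)
    extreme_ways = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if comb(tosses, k) <= observed_ways
    )
    return extreme_ways / 2 ** tosses


# 8 heads in 10 tosses: outcomes 0, 1, 2, 8, 9 and 10 heads are at least as extreme.
p_value = two_sided_p_value(8, 10)
print(p_value)  # 112 / 1024 = 0.109375
```

At the conventional 5% significance level this p-value would not be small enough to reject fairness, which is a useful talking point: 10 tosses carry very little evidence either way.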
Statistics Interview Questions for Data Science
1. Explain the central limit theorem.

2. What is the relevance of central limit theorem to a class of freshmen in the social sciences who hardly
have any knowledge about statistics?

3. Given a dataset, show me how Euclidean Distance works in three dimensions.

4. How will you prevent overfitting when creating a statistical model ?
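For question 3 above, the Euclidean distance between two points in three dimensions is the square root of the sum of the squared differences along each axis. A minimal sketch, with made-up points and a function name chosen for the example:

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Made-up points: the differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0).
point_a = (1, 2, 3)
point_b = (4, 6, 3)
print(euclidean_distance(point_a, point_b))  # 5.0
```

The same function works unchanged for any number of dimensions, since it simply sums over however many coordinates the points have.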

Frequently Asked Open Ended Machine Learning Interview Questions for Data
Scientists
1. Which is your favourite machine learning algorithm and why?
2. In which libraries for Data Science in Python and R, does your strength lie?
3. What kind of data is important for specific business requirements and how, as a data scientist will you go about
collecting that data?
4. Tell us about the biggest data set you have processed till date and for what kind of analysis.
5. Which data scientists do you admire the most and why?
6. Suppose you are given a data set, what will you do with it to find out if it suits the business needs of your project
or not?
7. What were the business outcomes or decisions for the projects you worked on?
8. What unique skills do you think you can add to our data science team?
9. Which are your favorite data science startups?
10. Why do you want to pursue a career in data science?
11. What have you done to upgrade your skills in analytics?
12. What has been the most useful business insight or development you have found?
13. How will you explain an A/B test to an engineer who does not know statistics?
14. When does parallelism help your algorithms run faster and when does it make them run slower?
15. How can you ensure that you don’t analyse something that ends up producing meaningless results?
16. How would you explain to the senior management in your organization as to why a particular data set is
important?
17. Is more data always better?
18. What are your favourite imputation techniques to handle missing data?
19. What are your favorite data visualization tools?
20. Explain the life cycle of a data science project.
Suggested Answers by Data Scientists for Open Ended Data Science Interview Questions
How can you ensure that you don’t analyse something that ends up producing meaningless results?
 Understand whether the model chosen is correct or not. Start from the point where you did
univariate or bivariate analysis, analysed the distribution of data and correlation of variables and built the linear
model. Linear regression has an inherent requirement that the data and the errors in the data should be normally
distributed. If they are not then we cannot use linear regression. This is an inductive approach to find out if the
analysis using linear regression will yield meaningless results or not.
 Another way is to build train and test data sets by sampling multiple times. Predict on all those datasets to find out
whether or not the resultant models are similar and are performing well.
 By looking at the p-value, the R-square value and the fit of the function, and by analysing
how the treatment of missing values could have affected the results, data scientists can tell whether something will
produce meaningless results or not.
- Gaganpreet Singh,Data Scientist
These are some of the more general questions around data, statistics and data science that can be asked in
the interviews. We will come up with more questions – specific to language, Python/ R, in the subsequent
articles, and fulfil our goal of providing a set of 100 data science interview questions and answers.
3 Secrets to becoming a Great Enterprise Data Scientist
• Keep on adding technical skills to your data scientist’s toolbox.

• Improve your scientific axiom

• Learn the language of business as the insights from a data scientist help in reshaping the entire
organization.

The most important tip for nailing a data science interview is to be confident with the answers without bluffing. If
you are well-versed in a particular technology, whether it is Python, R, Hadoop or any other big data
technology, ensure that you can back this up, but if you are not strong in a particular area do not mention
it unless asked about it. The above list of data scientist job interview questions is not an exhaustive one.
Every company has a different approach for interviewing data scientists. However, we do hope that the
above data science technical interview questions elucidate the data science interview process and provide
an understanding of the type of data scientist job interview questions asked when companies are hiring
data people.
We request industry experts and data scientists to chime in with their suggestions in the comments for open ended
data science interview questions, to help students understand the best way to approach the interviewer and
nail the interview. If you have any words of wisdom for data science students to ace a data
science interview, share them with us in the comments below!

Copyright 2018 Iconiq Inc. All rights reserved. All trademarks are property of their respective owners.
44 pages
Numpy Interview Questions: Click Here
No ratings yet
Numpy Interview Questions: Click Here
32 pages
MST 567 Quiz
No ratings yet
MST 567 Quiz
2 pages
Capsula Endoscopica
No ratings yet
Capsula Endoscopica
4 pages
Project ppt
No ratings yet
Project ppt
26 pages
Quality Control
No ratings yet
Quality Control
19 pages
Serology Template Test Developers
No ratings yet
Serology Template Test Developers
33 pages
Decision Tree Algorithms For Prediction of Heart Disease: Srabanti Maji and Srishti Arora
No ratings yet
Decision Tree Algorithms For Prediction of Heart Disease: Srabanti Maji and Srishti Arora
8 pages
Normas TADL-Q (Muñoz-Neira, 2012)
No ratings yet
Normas TADL-Q (Muñoz-Neira, 2012)
11 pages
Brazilian
No ratings yet
Brazilian
16 pages
Module 3.3 Classification Models, An Overview
No ratings yet
Module 3.3 Classification Models, An Overview
11 pages
Advanced Threat Detection and Response S
100% (1)
Advanced Threat Detection and Response S
28 pages
Assessment of Fracture Risk and Its Application To Screening For Postmenopausal Osteoporosis Synopsis of A WHO Report
No ratings yet
Assessment of Fracture Risk and Its Application To Screening For Postmenopausal Osteoporosis Synopsis of A WHO Report
14 pages
Simultaneous VFC y BDT M4
No ratings yet
Simultaneous VFC y BDT M4
7 pages
1.3-1.4 Scalar N Vector
No ratings yet
1.3-1.4 Scalar N Vector
54 pages
Veterinary Record - 2020 - Zoia - Discriminating Transudates and Exudates in Dogs With Pleural Effusion Diagnostic Utility
No ratings yet
Veterinary Record - 2020 - Zoia - Discriminating Transudates and Exudates in Dogs With Pleural Effusion Diagnostic Utility
8 pages
Statistics With GraphPad Prism
No ratings yet
Statistics With GraphPad Prism
53 pages
Validation of The Placenta Accreta Index (PAI) - Improving The Antenatal Diagnosis of The Morbidly Adherent Placenta PDF
No ratings yet
Validation of The Placenta Accreta Index (PAI) - Improving The Antenatal Diagnosis of The Morbidly Adherent Placenta PDF
2 pages
Vlas in Dogs
No ratings yet
Vlas in Dogs
8 pages
L-0025614809-pdf
No ratings yet
L-0025614809-pdf
16 pages
Skin Analysis API Functions
100% (1)
Skin Analysis API Functions
22 pages
Ep7.0 01 301
No ratings yet
Ep7.0 01 301
8 pages
05.fire Alarm System Using Microcontroller
No ratings yet
05.fire Alarm System Using Microcontroller
2 pages
An Enhanced Graph-Based Semi-Supervised Learning Algorithm To Detect Fake Users On Twitter
No ratings yet
An Enhanced Graph-Based Semi-Supervised Learning Algorithm To Detect Fake Users On Twitter
22 pages
Definition of Drug-Resistant Epilepsy: A Reappraisal Based On Epilepsy Types
No ratings yet
Definition of Drug-Resistant Epilepsy: A Reappraisal Based On Epilepsy Types
6 pages
Q1-What's The Trade-Off Between Bias and Variance?
100% (1)
Q1-What's The Trade-Off Between Bias and Variance?
5 pages
Test bank Advanced Health Assessment & Clinical Diagnosis in Primary Care 6th Edition Dains download
100% (2)
Test bank Advanced Health Assessment & Clinical Diagnosis in Primary Care 6th Edition Dains download
34 pages
Fin Irjmets1652378206
No ratings yet
Fin Irjmets1652378206
6 pages
CLSI C49 Analysis of Body Fluids in Clinical Chemistry, 2nd Edition
No ratings yet
CLSI C49 Analysis of Body Fluids in Clinical Chemistry, 2nd Edition
96 pages
Vereckei Criteria As A Diagnostic Tool Amongst Emergency Medicine Residents To Distinguish Between Ventricular Tachycardia and Supra-Ventricular Tachycardia With Aberrancy
No ratings yet
Vereckei Criteria As A Diagnostic Tool Amongst Emergency Medicine Residents To Distinguish Between Ventricular Tachycardia and Supra-Ventricular Tachycardia With Aberrancy
6 pages
Slope Stability Predictions On Spatially Variable Random Fields Using Machine Learning Surrogate Models
No ratings yet
Slope Stability Predictions On Spatially Variable Random Fields Using Machine Learning Surrogate Models
49 pages


Data Science Interview Questions and Answers

A wide term that

Data Science is a broad term for

AI Loosely integrated Machine learning A sub- field of

Data Science vs Machine Learning

3) Which technique is used to predict categorical responses?

5) What are Recommender Systems?

6) Why does data cleaning play a vital role in analysis?

7) Differentiate between univariate, bivariate and multivariate analysis.

8) What do you understand by the term Normal Distribution?

Image Credit : mathisfun.com

9) What is Linear Regression?

10) What is Interpolation and Extrapolation?

11) What is power analysis?

12) What is K-means? How can you select K for K-means?

14) What is the difference between Cluster and Systematic Sampling?

15) Are expected value and mean value different?

For Sampling Data

16) What does P-value signify about the statistical data?

• P-value = 0.05 is the marginal value, indicating it is possible to go either way.

17) Do gradient descent methods always converge to the same point?

18) What are categorical variables?

22) Explain the use of Combinatorics in data science.

25) What is an Eigenvalue and Eigenvector?

26) What is Gradient Descent?

1) To change the value and bring it within a range

2) To just remove the value.

28) How can you assess a good logistic model?

29) What are various steps involved in an analytics project?

• Explore the data and become familiar with it.

31) During analysis, how do you treat missing values?

32) Explain the Box-Cox transformation in regression models.

for i in range(len1+len2):

A similar function can be written in R by following similar steps.

return_union <- function(list_a,list_b)

#initializing counter variables

#Here goes our for loop

For example, the following image shows three different groups.

Next we have to create a logit function using independent variables, i.e.

Logit = L = β0 + β1*X1 + β2*X2

Probability = e^Logit / (1 + e^Logit)

e is the base of the natural logarithm, i.e. e ≈ 2.718281828
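
The logit and probability formulas above can be sketched in Python as follows. This is a minimal illustration, not the article's own code: the function names `logit` and `probability` and the coefficient values in the usage lines are assumptions chosen for the example. The probability is computed in a numerically stable way, which is algebraically the same as e^Logit / (1 + e^Logit).

```python
import math


def logit(x1, x2, b0, b1, b2):
    """Linear combination L = b0 + b1*X1 + b2*X2 (coefficients are hypothetical)."""
    return b0 + b1 * x1 + b2 * x2


def probability(logit_value):
    """Convert a logit into a probability: e^L / (1 + e^L).

    The two branches are algebraically identical; splitting on the sign
    avoids overflow in math.exp for large positive or negative logits.
    """
    if logit_value >= 0:
        return 1.0 / (1.0 + math.exp(-logit_value))
    z = math.exp(logit_value)
    return z / (1.0 + z)


# Usage with made-up coefficients: b0 = 0.5, b1 = 1.0, b2 = -1.0
L = logit(x1=1.0, x2=2.0, b0=0.5, b1=1.0, b2=-1.0)
p = probability(L)
print(L, p)
```

A logit of 0 maps to a probability of 0.5, and larger logits map to probabilities closer to 1.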

And this p can be calculated as-

Then we have to take the natural log of the above function -

Which turns out to be –

Yactual*ln[P(X)] + (1 - Yactual)*ln[1 - P(X)]
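
In its standard form, the log-likelihood contribution of one observation in logistic regression is Y*ln[P(X)] + (1 - Y)*ln[1 - P(X)], and the total is the sum over all observations. The sketch below computes that sum. The function name `log_likelihood` and the sample data are assumptions made for illustration only.

```python
import math


def log_likelihood(y_actual, p_predicted):
    """Sum of Y*ln(P) + (1 - Y)*ln(1 - P) over all observations.

    y_actual holds 0/1 labels; p_predicted holds probabilities strictly
    between 0 and 1 (ln is undefined at exactly 0).
    """
    total = 0.0
    for y, p in zip(y_actual, p_predicted):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical data: predictions closer to the actual labels
# yield a higher (less negative) log-likelihood.
labels = [1, 0]
print(log_likelihood(labels, [0.5, 0.5]))
print(log_likelihood(labels, [0.9, 0.1]))
```

Maximizing this quantity over the coefficients β0, β1, β2 is what an optimizer such as a spreadsheet solver is asked to do.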

Now click on Solve button at the bottom –

You will see a popup like below -

· Adjusted R^2

· RMSE, MAPE

· Assumptions of linear regression

54) What do you understand by Hypothesis in the context of Machine Learning?

We will explain this with a simple example for better understanding -

56) How will you find the right K for K-means?

59) In experimental design, is it necessary to do randomization? If yes, why?

Now what if they have sent it to false positive cases?

In simple terms, the differences can be summarized as -

• The training set is used to fit the parameters, i.e. the weights.

Calculation of sensitivity is pretty straightforward -

Sensitivity = True Positives / Positives in Actual Dependent Variable
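
As a small worked example of the sensitivity formula, the sketch below counts true positives and actual positives from two lists of 0/1 labels. The function name `sensitivity` and the sample labels are assumptions for illustration.

```python
def sensitivity(y_true, y_pred):
    """True positives divided by the number of actual positives."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(1 for t in y_true if t == 1)
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined when there are no actual positives")
    return true_positives / actual_positives


# Hypothetical labels: 3 actual positives, 2 of them predicted correctly.
actual = [1, 1, 1, 0, 0]
predicted = [1, 0, 1, 0, 1]
print(sensitivity(actual, predicted))
```

Note that the false positive in the last position does not affect sensitivity, because only actual positives appear in the denominator.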

67) What is the importance of having a selection bias?

e) SVM is preferred in multi-dimensional problem set - like text classification

69) What do you understand by feature vectors?

76) Can you write the formula to calculate R-square?

1 - (Residual Sum of Squares/ Total Sum of Squares)
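
The R-square formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `r_squared` and the sample values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """R-square = 1 - (Residual Sum of Squares / Total Sum of Squares)."""
    mean_y = sum(y_true) / len(y_true)
    residual_ss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total_ss = sum((t - mean_y) ** 2 for t in y_true)
    if total_ss == 0:
        raise ValueError("R-square is undefined when all actual values are equal")
    return 1 - residual_ss / total_ss


# Hypothetical values: a perfect fit, then a model that always predicts the mean.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```

A model that always predicts the mean of the actual values has a residual sum of squares equal to the total sum of squares, so its R-square is 0.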

77) What is the advantage of performing dimensionality reduction before fitting an SVM?

Can you tell if the equation given below is linear or not?

What will be the output of the following R programming code?

It will give an error.

How many Pianos are there in Chicago?

How often would a Piano require tuning?

How much time does it take for each tuning?

3. Given a dataset, show me how Euclidean Distance works in three dimensions.

4. How will you prevent overfitting when creating a statistical model?
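
For the Euclidean distance question in the list above, one possible way to demonstrate the calculation in three dimensions is sketched here. The function name `euclidean_distance_3d` and the sample points are assumptions for illustration; the distance is the square root of the sum of squared differences along each of the three axes.

```python
import math


def euclidean_distance_3d(point_a, point_b):
    """Distance between two (x, y, z) points: sqrt(dx^2 + dy^2 + dz^2)."""
    dx = point_a[0] - point_b[0]
    dy = point_a[1] - point_b[1]
    dz = point_a[2] - point_b[2]
    return math.sqrt(dx * dx + dy * dy + dz * dz)


# Hypothetical points: squared differences are 1 + 4 + 4 = 9, so the distance is 3.
print(euclidean_distance_3d((0, 0, 0), (1, 2, 2)))
```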
