Data Science Assignment

First data science assignment. It covers three questions on the statistics required for data science.

Uploaded by

Kanak 8064

Question 1: Why is the confusion matrix useful for evaluating the performance of a classifier?

Answer: In data science, a classifier is a machine learning algorithm that assigns a class label to
a data input.

The more accurately a classifier predicts, the better it is. To evaluate a classifier, we use a
confusion matrix.

A confusion matrix, also known as an error matrix, is a summary table used to assess the
performance of a classifier. The table compares the actual values with the values predicted by the
classifier. There are four types of results: true positives, true negatives, false positives and
false negatives.
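As a minimal sketch of how the four results are counted, the function below compares actual and predicted labels pair by pair. The function name `confusion_counts` and the six example labels are hypothetical and chosen only for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix results for a binary classifier.

    Both arguments are sequences of booleans (True = positive class).
    """
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1      # positive, predicted positive
        elif not a and not p:
            counts["tn"] += 1      # negative, predicted negative
        elif not a and p:
            counts["fp"] += 1      # negative, wrongly predicted positive
        else:
            counts["fn"] += 1      # positive, wrongly predicted negative
    return counts


# Hypothetical labels, for illustration only.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
print(counts)
```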

Suppose we have a classifier that predicts whether an input image is an image of a dog or not, and
that it produces the following confusion matrix.

                   Predicted: not dog     Predicted: dog
Actual: not dog    50 (true negatives)    10 (false positives)
Actual: dog         5 (false negatives)   100 (true positives)

From the table we can see that there are 50 true negatives, 10 false positives, 5 false negatives
and 100 true positives, for a total of 165 inputs. From these counts we can calculate the accuracy
of the classifier. We can also calculate other measures such as the misclassification rate, the true
positive rate and the precision. Together these measures show whether the classifier is a good
classifier or not, and which kinds of mistakes it makes.
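The measures named above follow directly from the four counts. The sketch below uses the counts from the dog example (100 true positives, 50 true negatives, 10 false positives, 5 false negatives). The function name `confusion_metrics` is an assumption made for illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute common evaluation measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,                # share of correct predictions
        "misclassification_rate": (fp + fn) / total,  # share of wrong predictions
        "true_positive_rate": tp / (tp + fn),         # share of actual dogs found
        "precision": tp / (tp + fp),                  # share of "dog" predictions that are right
    }


metrics = confusion_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 150/165 (about 0.909), the misclassification rate is 15/165 (about 0.091), the true positive rate is 100/105 (about 0.952) and the precision is 100/110 (about 0.909).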

Finally, we can say that the confusion matrix is very useful for evaluating a classifier: after the
evaluation we can clearly understand whether the classifier is good or not.

Question 2: What happens if two features correlate in a linear regression?

Answer: Regression analysis is a set of statistical methods used to estimate the relationship
between a dependent variable and one or more independent variables. Suppose we want to conduct
a regression analysis of the GDP of our country, and we use the following regression equation:

GDP = B0 + B1*Interest rate + B2*inflation rate + ei


Here GDP is the dependent variable, and the interest rate and inflation rate are the independent
variables, also called the features of the regression model. Given values of the interest rate and
inflation rate, we can estimate GDP. Classical linear regression models rest on certain
assumptions.

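One of these assumptions is that no feature is an exact linear function of the others, and in
practice the features should not be strongly correlated with each other. If two features are
correlated, the model cannot cleanly separate the effect of one from the effect of the other. The
estimated coefficients become unstable: their standard errors grow, and small changes in the data
can change their size or even their sign. The model may still predict GDP reasonably well, but the
individual coefficients can no longer be interpreted reliably. If the two features are perfectly
correlated, the coefficients cannot be determined uniquely at all.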
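A small simulation can illustrate the effect. This is a sketch under stated assumptions, not part of the original assignment: the two features `x1` and `x2` stand in for the interest rate and inflation rate, the "true" model y = 1 + 2*x1 + 3*x2 + noise is hypothetical, and `b1_spread` is an illustrative name. The code refits the regression on many simulated datasets and measures how much the fitted B1 varies.

```python
import math
import random


def b1_spread(rho, n_samples=100, n_trials=300, seed=0):
    """Standard deviation of the fitted B1 over many simulated datasets.

    rho is the population correlation between the two features.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        x1 = [rng.gauss(0, 1) for _ in range(n_samples)]
        # Build x2 so that its correlation with x1 is rho.
        x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
        # Hypothetical true model: y = 1 + 2*x1 + 3*x2 + noise.
        y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

        # Centre the data, then solve the two-feature normal equations for B1.
        m1, m2, my = (sum(v) / n_samples for v in (x1, x2, y))
        s11 = sum((a - m1) ** 2 for a in x1)
        s22 = sum((b - m2) ** 2 for b in x2)
        s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
        s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
        s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
        b1 = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)
        estimates.append(b1)

    mean = sum(estimates) / n_trials
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / n_trials)


spread_uncorrelated = b1_spread(rho=0.0)
spread_correlated = b1_spread(rho=0.95)
print(f"spread of B1, rho=0.00: {spread_uncorrelated:.3f}")
print(f"spread of B1, rho=0.95: {spread_correlated:.3f}")
```

The fitted B1 should scatter noticeably more widely when the features are strongly correlated, even though the true coefficient is the same in both runs. The denominator `s11 * s22 - s12 ** 2` approaches zero as the correlation approaches 1, which is why a perfect correlation leaves the coefficients undetermined.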

So it is important to avoid multicollinearity between the independent variables of a regression
model.

Question 3: Prove why Pearson's correlation coefficient is between -1 and 1.

Answer: A correlation coefficient is a measure of association between two variables. It indicates
both the strength of the association and its direction. Pearson’s product moment correlation
coefficient, written as ‘r’, describes the linear relationship between two variables. It is the
covariance of the two variables divided by the product of their standard deviations.
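As a minimal sketch, r can be computed from paired observations as the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` and the sample data are hypothetical, and the function assumes neither variable is constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (used in the denominator of r).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])     # points on an increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])   # points on a decreasing line
r_noisy = pearson_r(xs, [2, 1, 4, 3, 5])   # upward trend with scatter
print(r_up, r_down, r_noisy)
```

Points on an increasing line give 1, points on a decreasing line give -1, and the scattered data give 0.8, which lies strictly between the two.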

The value of the correlation coefficient lies between -1 and 1.

If the correlation coefficient is -1, it indicates a perfect negative linear relationship between
the variables.

If the correlation coefficient is 0, it indicates no linear relationship.

If the correlation coefficient is 1, it indicates a perfect positive linear relationship between
the variables.

The values cannot fall outside this range. The coefficient is the covariance of the two variables
divided by the product of their standard deviations, and by the Cauchy-Schwarz inequality the
absolute value of the covariance can never exceed that product. So r is at least -1 and at most 1,
and it reaches -1 or 1 only when the relationship is perfectly linear. Nothing can be more perfect
than a perfect correlation, so every other correlation value lies strictly between -1 and 1.
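One standard way to make this argument precise is sketched below. The notation is an assumption, since the original answer gives no formula: $x_i, y_i$ are the paired observations, $a_i$ and $b_i$ are their deviations from the means, and neither variable is constant.

```latex
Let $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$, so that
\[
  r = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} .
\]
For every real number $t$,
\[
  0 \le \sum_i (t a_i + b_i)^2
    = t^2 \sum_i a_i^2 + 2t \sum_i a_i b_i + \sum_i b_i^2 .
\]
A quadratic in $t$ that is never negative has a non-positive discriminant:
\[
  \Bigl(2 \sum_i a_i b_i\Bigr)^2 - 4 \sum_i a_i^2 \sum_i b_i^2 \le 0
  \quad\Longrightarrow\quad
  \Bigl(\sum_i a_i b_i\Bigr)^2 \le \sum_i a_i^2 \sum_i b_i^2 .
\]
Dividing both sides by $\sum_i a_i^2 \sum_i b_i^2 > 0$ gives
\[
  r^2 \le 1 \quad\Longleftrightarrow\quad -1 \le r \le 1 .
\]
Equality holds only if $t a_i + b_i = 0$ for all $i$ and some $t$,
that is, only if all points $(x_i, y_i)$ lie on a straight line.
```

The inequality in the third display is the Cauchy-Schwarz inequality, here derived directly from the fact that a sum of squares cannot be negative.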
