AIML Question Ans Part1

Name: Khan Adil Parvez

Enrollment No: A70466225003
Branch: CSE
Batch: Jan-2025

Q.1: Explain Linear Regression along with an example.


Ans: Linear regression is used to model the relationship between two variables
by fitting a linear equation to observed data. There are two types of variables:
the independent variable (predictor) and the dependent variable (response).
Linear regression is commonly used for predictive analysis. The main idea of
regression is to examine two things: first, does a set of predictor variables do
a good job of predicting an outcome (dependent) variable? Second, which
variables are significant predictors of the outcome variable?

Linear Regression Example


Example 1: Linear regression can predict house prices based on size.
For example, if the formula is:
Price = 50,000 + 100 × Size (sq. ft),
a 2,000 sq. ft. house would cost:
Price = 50,000 + 100 × 2,000 = 250,000.
It helps find relationships and make predictions.
Example 2: Linear regression can predict sales based on advertising spend. For
example, if the formula is:
Sales = 5,000 + 20 × Ad Spend (in $1,000s),
and a company spends $50,000 on ads (Ad Spend = 50):

Sales = 5,000 + 20 × 50 = 6,000.
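
To make the house-price example concrete, here is a minimal sketch using scikit-learn. The sizes and prices below are invented points generated from the stated formula, so the fitted intercept and slope should come out close to 50,000 and 100.

```python
# Illustrative sketch (invented data): fitting the Price = 50,000 + 100 x Size line
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[1000], [1500], [2000], [2500], [3000]])   # sq. ft (feature)
prices = 50_000 + 100 * sizes.ravel()                         # generated from the formula above

model = LinearRegression().fit(sizes, prices)
print(model.intercept_, model.coef_[0])     # approximately 50,000 and 100
print(model.predict([[2000]])[0])           # approximately 250,000 for a 2,000 sq. ft. house
```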

Q.2: What are the assumptions of linear regression?


Ans:
1. Linearity:

● Assumption: There is a linear relationship between the independent
variables (predictors) and the dependent variable (response).
● Explanation: This means that the change in the dependent variable should
be proportional to the change in the independent variable(s). If the
relationship is nonlinear, linear regression may not be the best model, and
alternative methods like polynomial regression might be needed.

2. Independence of Errors:

● Assumption: The residuals (errors) are independent of each other.


● Explanation: The residuals are the differences between the observed and
predicted values. For the model to be reliable, the residuals should not be
correlated with each other. This assumption is particularly important
when dealing with time series data, where autocorrelation of errors can
occur.

3. Homoscedasticity:

● Assumption: The variance of the residuals (errors) is constant across all
levels of the independent variables.
● Explanation: Homoscedasticity means that the spread (or variability) of
the residuals should be the same for all predicted values of Y. If the
variance of residuals changes as the predicted values change
(heteroscedasticity), it can indicate problems with the model and affect
the reliability of statistical tests.

4. Normality of Errors:

● Assumption: The residuals (errors) are normally distributed.

● Explanation: For the purposes of hypothesis testing (like t-tests for the
regression coefficients) and calculating confidence intervals, the residuals
should follow a normal distribution. While this assumption is important
for inference, linear regression can still provide unbiased predictions even
if this assumption is somewhat violated, though statistical significance
might be affected.

5. No Multicollinearity (for multiple linear regression):


● Assumption: The independent variables are not highly correlated with
each other.
● Explanation: In multiple linear regression, multicollinearity occurs when
two or more independent variables are highly correlated with each other.
This can make it difficult to isolate the effect of each individual predictor
on the dependent variable, leading to unreliable estimates of the
coefficients.

6. No Measurement Error in Independent Variables:

● Assumption: The independent variables are measured accurately with no
error.
● Explanation: Measurement error in the independent variables can cause
biased regression coefficients, leading to incorrect conclusions. In
practice, it’s challenging to ensure no error in measurement, but
minimizing this error is important for obtaining valid results.
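
As a rough illustration of how assumption 5 (no multicollinearity) can be checked in practice, the sketch below computes variance inflation factors (VIF) with statsmodels on invented data; it is only an assumption-checking aid, not part of the original answer.

```python
# Hedged sketch (invented data): variance inflation factors for multicollinearity
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)        # deliberately correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# A VIF well above roughly 5-10 for a predictor suggests problematic multicollinearity
for i in range(1, X.shape[1]):
    print(f"VIF for predictor {i}: {variance_inflation_factor(X, i):.1f}")
```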

Q.3: Explain Logistic Regression with an example.


Ans:
Logistic Regression is a statistical method used to predict the probability of a
binary outcome (yes/no, 0/1) based on one or more independent variables. The
outcome is modeled using a sigmoid function so that the predicted probability
falls between 0 and 1. Essentially, it helps determine the likelihood of a
specific event occurring given certain input factors.

Example:
● Predicting whether a customer will purchase a product online:
● Independent variables: Customer's age, income level, time spent
browsing the website, number of items added to cart.
● Dependent variable: Whether the customer purchases the product
(yes/no).
● How it works: The logistic regression model analyzes past
customer data to identify patterns between these variables and the
purchase decision, then calculates the probability of a new
customer making a purchase based on their individual
characteristics.
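
A minimal sketch of this customer-purchase example with scikit-learn is shown below; the feature values (age, income, browsing minutes, items in cart) and labels are invented purely for illustration.

```python
# Hedged sketch (invented customer data): probability of purchase via logistic regression
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: age, income (in $1,000s), minutes browsing, items added to cart
X = np.array([[25, 40,  5, 0],
              [34, 72, 22, 3],
              [41, 55,  2, 0],
              [29, 90, 35, 5],
              [52, 60, 15, 2],
              [23, 30,  1, 0]])
y = np.array([0, 1, 0, 1, 1, 0])            # 1 = purchased, 0 = did not purchase

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of purchase for a new customer
new_customer = [[30, 65, 18, 2]]
print(clf.predict_proba(new_customer)[0, 1])
```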

Q.4: What are the assumptions of Logistic Regression?


Ans:
Logistic regression does not make many of the key assumptions of linear
regression and general linear models that are based on ordinary least squares
algorithms – particularly regarding linearity, normality, homoscedasticity, and
measurement level.
First, logistic regression does not require a linear relationship between the
dependent and independent variables. Second, the error terms (residuals) do not
need to follow a normal distribution. Third, you do not require
homoscedasticity. Finally, logistic regression does not require you to measure
the dependent variable on an interval or ratio scale.

However, some other assumptions still apply.

First, binary logistic regression requires the dependent variable to be binary and
ordinal logistic regression requires the dependent variable to be ordinal.

Second, logistic regression requires the observations to be independent of each
other. In other words, the observations should not come from repeated
measurements or matched data.

Third, logistic regression requires there to be little or no multicollinearity
among the independent variables. This means that the independent variables
should not be too highly correlated with each other.

Fourth, logistic regression assumes linearity of independent variables and log
odds of the dependent variable. Although this analysis does not require the
dependent and independent variables to be related linearly, it requires that the
independent variables are linearly related to the log odds of the dependent
variable.

Finally, logistic regression typically requires a large sample size. A general
guideline is that you need a minimum of 10 cases with the least frequent
outcome for each independent variable in your model.

Q.5: Enlist and explain performance metrics of Regression.
Ans:

In regression analysis, various performance metrics are used to evaluate how
well the model predicts the continuous target variable. Here's a list of key
performance metrics for regression, along with an explanation of each:

1. Mean Absolute Error (MAE)

● Definition: MAE is the average of the absolute differences between the
actual and predicted values.
● Formula:

MAE = (1/n) Σ |yi − ŷi|

where yi is the actual value, ŷi is the predicted value, and n is the
number of observations.
● Interpretation: MAE gives a linear score, meaning that all errors are
weighted equally. It provides a simple interpretation of how far off, on
average, the predictions are from the true values.

2. Mean Squared Error (MSE)

● Definition: MSE is the average of the squared differences between the
actual and predicted values. It penalizes larger errors more than MAE.
● Formula:

MSE = (1/n) Σ (yi − ŷi)²

● Interpretation: Since MSE squares the errors, larger deviations from the
true values have a disproportionately large effect on the metric. This
makes MSE more sensitive to outliers than MAE.

3. Root Mean Squared Error (RMSE)

● Definition: RMSE is the square root of the MSE. It returns the error in the
same units as the target variable.
● Formula:

RMSE = √MSE = √[(1/n) Σ (yi − ŷi)²]

● Interpretation: RMSE is useful for measuring how spread out the
residuals are. It is more sensitive to larger errors than MAE, similar to
MSE, and is interpreted in the same units as the target variable.
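
A small hedged sketch of computing these three error metrics with scikit-learn; the actual and predicted arrays are invented for illustration.

```python
# Sketch (invented values): MAE, MSE and RMSE for a set of predictions
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
print(mae, mse, rmse)
```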

4. R-squared (R²)

● Definition: R-squared measures the proportion of the variance in the
target variable that is explained by the regression model.
● Formula:

R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²

where ȳ is the mean of the actual values.

● Interpretation: R² ranges from 0 to 1, with a higher value indicating a
better fit. An R² of 1 means the model perfectly explains the variance in
the target variable, while an R² of 0 means the model explains none of the
variance.

5. Adjusted R-squared

● Definition: Adjusted R-squared is a modified version of R-squared that
accounts for the number of predictors in the model. It is used to compare
models with a different number of predictors.
● Formula:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)

where n is the number of data points and p is the number of predictors.

● Interpretation: Unlike R-squared, the adjusted R-squared will decrease if
irrelevant predictors are added to the model, making it a better measure
for comparing models with different numbers of predictors.
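
A hedged sketch of computing R² with scikit-learn and adjusted R² directly from the formula above, using invented predictions and an assumed predictor count.

```python
# Sketch (invented values): R-squared and adjusted R-squared
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0, 6.5])
y_pred = np.array([2.8, 5.1, 3.0, 6.8, 4.2, 6.0])

n, p = len(y_true), 2                      # assume the model used 2 predictors
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)
```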

6. Mean Absolute Percentage Error (MAPE)

● Definition: MAPE measures the percentage difference between actual and
predicted values. It's useful for comparing models across different
datasets with different scales.
● Formula:

MAPE = (100% / n) Σ |(yi − ŷi) / yi|

● Interpretation: MAPE provides an intuitive percentage-based error, which
is easy to interpret. However, it is sensitive to small actual values and
may become undefined when actual values are zero.
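
A brief sketch with scikit-learn's mean_absolute_percentage_error (available in recent versions); note that it returns a fraction rather than a percentage, so it is multiplied by 100 here. The values are invented.

```python
# Sketch (invented values): MAPE, expressed as a percentage
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0, 150.0])
y_pred = np.array([110.0, 190.0, 160.0])
print(100 * mean_absolute_percentage_error(y_true, y_pred))   # percent error
```
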
7. Explained Variance Score

● Definition: This metric measures how much of the variance in the
dependent variable is explained by the model.
● Formula:

Explained Variance = 1 − Var(y − ŷ) / Var(y)

● Interpretation: A higher explained variance indicates that the model
explains more of the variation in the target variable. A score of 1 means
perfect prediction, and a score of 0 means the model explains none of the
variance.
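
A brief hedged sketch using scikit-learn's explained_variance_score on invented values.

```python
# Sketch (invented values): explained variance score
import numpy as np
from sklearn.metrics import explained_variance_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.1, 3.0, 6.8])
print(explained_variance_score(y_true, y_pred))
```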

8. F-statistic (ANOVA F-test)

● Definition: The F-statistic tests the overall significance of the regression
model. It checks whether the model is a good fit for the data.
● Interpretation: A higher F-statistic suggests that the model explains a
significant amount of variability in the target variable compared to the
residuals (unexplained variance).
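
As a rough illustration on invented data, the overall F-statistic and its p-value can be read off a fitted statsmodels OLS model.

```python
# Sketch (invented data): overall F-test from a statsmodels OLS fit
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=50)

fit = sm.OLS(y, X).fit()
print(fit.fvalue, fit.f_pvalue)            # F-statistic and its p-value
```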

9. Heteroscedasticity

● Definition: Heteroscedasticity refers to the situation where the variance of
the errors is not constant across all levels of the independent variable(s),
violating the homoscedasticity assumption.
● Test: Common tests like the Breusch-Pagan test or White's test are used to
detect heteroscedasticity.
● Interpretation: If heteroscedasticity is present, it means that the model's
error variance is not constant, which can lead to inefficient estimates of
model parameters. It's important to address this for accurate predictions
and inference.
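
A hedged sketch of the Breusch-Pagan test mentioned above, using statsmodels on invented data whose error variance grows with the predictor; a small p-value suggests heteroscedasticity.

```python
# Sketch (invented data): Breusch-Pagan test for heteroscedasticity
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=100)
y = 2 + 3 * x + rng.normal(scale=x, size=100)    # error variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)                                  # small p-value -> heteroscedasticity
```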

10. Residual Plots

● Definition: Residual plots are graphs that show the difference between the
actual and predicted values (residuals) against fitted values or predictor
values.
● Interpretation: In a well-fitted regression model, residuals should appear
randomly scattered with no obvious patterns. If there are patterns, this
may indicate that the model is not capturing important trends in the data.
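
A minimal sketch of such a plot with matplotlib on invented data: a random scatter of residuals around zero is what a well-fitted model should show.

```python
# Sketch (invented data): residuals-vs-fitted plot for a simple OLS model
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1 + 2 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid)          # residuals against fitted values
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```
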
Q.6: Enlist and explain performance metrics of Classifier.
Ans:

1. Accuracy

● Definition: Accuracy is the proportion of correctly predicted instances
(both true positives and true negatives) out of all instances.
● Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:
○ TP = True Positives
○ TN = True Negatives
○ FP = False Positives
○ FN = False Negatives
● Interpretation: Accuracy is a straightforward metric that is most useful
when the classes are balanced. However, it may be misleading in
imbalanced datasets, as it can be high even if the model performs poorly
on the minority class.

2. Precision

● Definition: Precision (also called Positive Predictive Value) measures the
proportion of positive predictions that are actually correct.
● Formula:

Precision = TP / (TP + FP)

● Interpretation: Precision tells us how many of the predicted positive
instances were truly positive. It is particularly useful when the cost of
false positives is high (e.g., spam detection, fraud detection).

3. Recall (Sensitivity or True Positive Rate)

● Definition: Recall (or Sensitivity) measures the proportion of actual
positive instances that are correctly identified by the model.
● Formula:

Recall = TP / (TP + FN)

● Interpretation: Recall is important when the cost of false negatives is
high (e.g., in medical diagnostics where missing a positive case could be
dangerous). A high recall means that the model captures most of the
actual positive instances.

4. F1-Score

● Definition: The F1-score is the harmonic mean of precision and recall,
providing a balance between the two metrics.
● Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

● Interpretation: F1-score is useful when you need a balance between
precision and recall, particularly in imbalanced datasets where both false
positives and false negatives are costly. It ranges from 0 (worst) to 1
(best).

5. Specificity (True Negative Rate)

● Definition: Specificity measures the proportion of actual negative
instances that are correctly identified by the model.
● Formula:

Specificity = TN / (TN + FP)

● Interpretation: Specificity is useful in contexts where correctly
identifying the negative class is important. It complements recall by
focusing on how well the model avoids false positives.
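
A hedged sketch computing all five metrics above from a confusion matrix with scikit-learn; the true and predicted labels are invented, and specificity is derived manually since scikit-learn has no dedicated function for it.

```python
# Sketch (invented labels): accuracy, precision, recall, F1 and specificity
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:   ", accuracy_score(y_true, y_pred))     # (TP + TN) / all instances
print("Precision:  ", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:     ", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1-score:   ", f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print("Specificity:", tn / (tn + fp))                     # TN / (TN + FP)
```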
