FDA Unit 5

The document discusses predictive analytics, focusing on various regression techniques such as linear least squares, multiple regression, and logistic regression. It explains the importance of goodness-of-fit tests, parameter estimation, and the use of statistical tools like StatsModels for regression analysis. Additionally, it covers concepts like spurious regression and the significance of residual analysis in validating regression models.

Unit 5 : Predictive Analytics

Syllabus : Linear least squares - implementation - goodness of fit - testing a linear model - weighted resampling. Regression using StatsModels - multiple regression - nonlinear relationships - logistic regression - estimating parameters. Time series analysis - moving averages - missing values - serial correlation - autocorrelation. Introduction to survival analysis.

Contents
5.1 Linear Least Squares
5.2 Regression using StatsModels
5.3 Multiple Regression
5.4 Logistic Regression
5.5 Time Series Analysis
5.6 Introduction to Survival Analysis
5.7 Two Marks Questions with Answers

5.1 Linear Least Squares

Least square method
• The method of least squares estimates parameters by minimizing the squared discrepancies between the observed data, on the one hand, and their expected values on the other.
• The Least Squares (LS) criterion states that the sum of the squares of the errors is a minimum. The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
• How do we draw such a line based on the observed data points ? Suppose an imaginary line y = a + bx, so that E(Y) = a + bx.

Fig. 5.1.1 : Data points and the imaginary line E(Y) = a + bx

• Imagine the vertical distance between the line and a data point, E = Y - E(Y). This error is the deviation of the data point from the imaginary line, the regression line. What then are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviations alone do not have good computational properties, so instead we find the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares; it minimizes the sum of squares of the errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β.
• The process of obtaining parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).

Disadvantages of least squares
1. Lack of robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. The decision boundary corresponds to the ML solution.

Example 5.1.4 : Fit a straight line y = mx + b to the points in the table. Compute m and b by least squares.

x : 3.00  4.25  5.50  8.00
y : 4.50  4.25  5.50  5.50

Solution : Represent the observations in matrix form, A X = L :

    A = | 3.00  1 |      X = | m |      L = | 4.50 |
        | 4.25  1 |          | b |          | 4.25 |
        | 5.50  1 |                         | 5.50 |
        | 8.00  1 |                         | 5.50 |

    X = (A^T A)^(-1) A^T L
      = | 121.3125  20.75 |^(-1)  | 105.8125 |  =  | 0.246 |
        |  20.75     4.00 |       |  19.75   |     | 3.663 |

so m = 0.246 and b = 3.663. The residuals are

    V = A X - L = | -0.10 |
                  |  0.46 |
                  | -0.48 |
                  |  0.13 |
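As a check on Example 5.1.4, here is a minimal NumPy sketch (an addition, not part of the original text) that solves the same least squares problem and reproduces m ≈ 0.246, b ≈ 3.663 and the residuals above.

    import numpy as np

    # Data points from Example 5.1.4.
    x = np.array([3.00, 4.25, 5.50, 8.00])
    y = np.array([4.50, 4.25, 5.50, 5.50])

    # Design matrix A = [x, 1] so that A @ [m, b] approximates y.
    A = np.column_stack([x, np.ones_like(x)])

    # Least squares solution of A X = L, i.e. X = (A^T A)^(-1) A^T L.
    (m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(round(m, 3), round(b, 3))                 # 0.246 3.663

    # Residuals V = A X - L.
    print(np.round(A @ np.array([m, b]) - y, 2))    # [-0.1   0.46  -0.48   0.13]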
5.1.1 Goodness of Fit
• A goodness-of-fit test, in general, refers to measuring how well the observed data correspond to the fitted (assumed) model. The goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.
• Goodness-of-fit tests are frequently applied in business decision making. For example, Fig. 5.1.2 depicts a linear regression function. The goodness-of-fit test here compares the actual observed values, denoted by dots, to the predicted values, denoted by the regression line.

Fig. 5.1.2 : Goodness of fit

• Broadly, goodness-of-fit tests can be categorized based on the distribution of the predicted variable of the dataset. Commonly used tests are : a) the chi-square test, b) the Kolmogorov-Smirnov test and c) the Anderson-Darling test.

5.1.2 Testing a Linear Model
• The following measures are used to validate simple linear regression models :
1. Co-efficient of determination (R²).
2. Hypothesis test for the regression coefficient β1.
3. Analysis of variance for overall model validity (relevant more for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.
• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R²) measures the percentage of variation in Y explained by the model (β0 + β1 X).

Characteristics of R² :
• Here are some basic characteristics of the measure :
1. Since R² is a proportion, it is always a number between 0 and 1.
2. If R² = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y.
3. If R² = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y.
• R² is a measure of fit in the linear regression setting. More specifically, R² indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by the regression and the predictor variable (X, also known as the independent variable).
• In general, a high R² value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of analysis. An R² of 0.35, for example, indicates that 35 percent of the variation in the outcome has been explained just by predicting the outcome using the covariates included in the model.
• That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R² to be much closer to 100 percent.
• The theoretical minimum R² is 0. However, since linear regression is based on the best possible fit, R² will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another.
• R² increases when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R² incorporates the same information as the usual R² but then also penalizes for the number of predictor variables included in the model.
• As a result, R² increases as new predictors are added to a multiple linear regression model, but the adjusted R² increases only if the increase in R² is greater than one would expect from chance alone. In such a model, the adjusted R² is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.

Spurious regression :
• The regression is spurious when we regress one random walk onto another independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship :
1. The coefficient estimate will not converge toward zero (the true value). Instead, in the limit the coefficient estimate will follow a non-degenerate distribution.
2. The t-value most often is significant.
3. R² is typically very high.
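To see the spurious regression phenomenon concretely, the following small sketch (an added illustration, not from the original text) regresses one simulated random walk on another, independent one using statsmodels. Typical runs show a high R², a large slope t-value and a Durbin-Watson statistic near zero, i.e. strongly serially correlated errors.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(0)
    n = 500

    # Two independent random walks (cumulative sums of white noise).
    y = np.cumsum(rng.normal(size=n))
    x = np.cumsum(rng.normal(size=n))

    # Regress one walk on the other even though they are unrelated.
    results = sm.OLS(y, sm.add_constant(x)).fit()

    print(results.rsquared)               # typically high
    print(results.tvalues[1])             # slope t-value, often far beyond +/- 2
    print(durbin_watson(results.resid))   # close to 0 : serially correlated errors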
• Spurious regression is linked to serially correlated errors. Granger and Newbold (1974) pointed out that serially correlated errors will appear in such regression analysis, and that a low value of the Durbin-Watson statistic combined with a high value of the t-statistic is strong evidence that the estimated relationship is not a true one.

Hypothesis test for regression co-efficient (t-test) :
• The regression co-efficient (β1) captures the existence of a linear relationship between the response variable and the explanatory variable. If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
• Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically significant. However, for a simple linear regression, the null and alternative hypotheses in ANOVA and the t-test are exactly the same and thus there will be no difference in the p-value.

Residual analysis :
• Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following :
1. The residuals are normally distributed.
2. The variance of the residuals is constant (homoscedasticity).
3. The functional form of the regression is correctly specified.
4. Whether there are any outliers.

5.2 Regression using StatsModels
• The statsmodels linear regression model helps us to predict and is used for fitting the scenario where one parameter is directly dependent on another parameter. Here, we have one variable that is dependent and another that is independent. Depending on the change in the value of the independent parameter, we need to predict the change in the dependent variable.
• The statsmodels library has more advanced statistical tools as compared to scikit-learn. Moreover, its regression analysis tools can give more detailed results.
• There are four available classes of regression models in statsmodels that help us carry out linear regression. The classes are as follows :
a) Ordinary Least Squares (OLS)
b) Weighted Least Squares (WLS)
c) Generalized Least Squares (GLS)
d) GLSAR - feasible generalized least squares with autocorrelated errors.
• The statsmodels linear regression model helps to predict or estimate the values of the dependent variable as and when there is a change in the independent quantities.
• statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
• statsmodels is built on top of NumPy, SciPy and matplotlib.

5.3 Multiple Regression
• The mothers of first babies are 3.59 years younger; let us see how big a difference in birth weight that makes. Running the linear model again, we get the change in birth weight as a function of mother's age. The slope is 0.0175 pounds per year. If we multiply the slope by the difference in ages, we get the expected difference in birth weight for first babies and others due to mother's age.
• The result is 0.063, just about half of the observed difference. So we conclude, tentatively, that the observed difference in birth weight can be partly explained by the difference in mother's age. We can use multiple regression to explore these relationships more systematically.
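The code listings in this section did not survive extraction, so the following is a hedged sketch of what the fits described above and below presumably look like, using the statsmodels formula API from section 5.2. The DataFrame name live and the column birthord are assumptions made for illustration; agepreg, totalwgt_lb and isfirst are the names used in the text.

    import statsmodels.formula.api as smf

    # 'live' is assumed to be a DataFrame of births with columns 'agepreg'
    # (mother's age), 'totalwgt_lb' (birth weight in pounds) and 'birthord'
    # (birth order; a hypothetical column name used only for illustration).

    # Simple model : birth weight as a function of mother's age.
    results_age = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
    print(results_age.params['agepreg'])       # slope, about 0.0175 lb per year

    # The first line below creates the boolean column 'isfirst'; the second
    # fits a model that uses isfirst as an explanatory variable.
    live['isfirst'] = live['birthord'] == 1
    results_first = smf.ols('totalwgt_lb ~ isfirst', data=live).fit()
    print(results_first.summary())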
• The first line creates a new column named isfirst that is True for first babies and False otherwise. Then we fit a model using isfirst as an explanatory variable and inspect the sample results.
• Because isfirst is a boolean, ols treats it as a categorical variable, which means that the values fall into categories, like True and False, and should not be treated as numbers. The estimated parameter is the effect on birth weight when isfirst is true, so the result, -0.125 lbs, is the difference in birth weight between first babies and others.
• The slope and the intercept are statistically significant, which means that they were unlikely to occur by chance, but the R² value for this model is small, which means that isfirst doesn't account for a substantial part of the variation in birth weight.
• The results are similar with agepreg as the explanatory variable : again, the parameters are statistically significant, but R² is low.
• These models confirm results we have already seen. But now we can fit a single model that includes both variables, with the formula totalwgt_lb ~ isfirst + agepreg.
• In the combined model, the parameter for isfirst is smaller by about half, which means that part of the apparent effect of isfirst is actually accounted for by agepreg. And the p-value for isfirst is about 2.5 %, which is on the border of statistical significance.

5.3.1 Nonlinear Relationship
• Remembering that the contribution of agepreg might be nonlinear, we might consider adding a variable to capture more of this relationship. One option is to create a column, agepreg2, that contains the squares of the ages; a sketch of this fit is given at the end of this subsection.
• The parameter of agepreg2 is negative, so the parabola curves downward, which is consistent with the shape of the lines in Fig. 5.3.1.

Fig. 5.3.1 : Residuals of the linear fit (percentile curves of birth weight residuals versus age in years)

• The quadratic model of agepreg accounts for more of the variability in birth weight; the parameter for isfirst is smaller in this model and no longer statistically significant.
• Using computed variables like agepreg2 is a common way to fit polynomials and other functions to data. This process is still considered linear regression, because the dependent variable is a linear function of the explanatory variables, regardless of whether some variables are nonlinear functions of others.
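Continuing the hedged sketch from above (same assumed live DataFrame and smf import), the quadratic fit described in this subsection might look as follows.

    # Add a squared-age column and fit the quadratic model described above.
    live['agepreg2'] = live['agepreg'] ** 2
    formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
    results_quad = smf.ols(formula, data=live).fit()

    # The coefficient of agepreg2 is expected to be negative (downward parabola).
    print(results_quad.params)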
5.4 Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
• Logistic regression is one of the supervised learning algorithms. The binary logistic regression model is given by

    P(Y) = e^Z / (1 + e^Z)

where Z = β0 + β1 X1 + β2 X2 + β3 X3 + ... + βn Xn and X1, X2, ..., Xn are the independent variables.
• The logistic regression function can be rewritten in logit form as

    ln( P(Y) / (1 - P(Y)) ) = β0 + β1 X1 + β2 X2 + ... + βn Xn

• Multiple linear regression is an extension of linear regression which allows a response variable y to be modeled as a linear function of two or more predictor variables. The logit function is similar to a multiple linear regression model.
• Such models are called Generalized Linear Models (GLM); in a GLM the errors do not follow the normal distribution, and there exists a transformation of the outcome variable that is a linear function of the predictors.

5.4.1 Estimation of Parameters in Logistic Regression
• Parameter estimates (also called coefficients) are the log odds ratios associated with a one-unit change of the predictor, all other predictors being held constant. For each term involving a categorical variable, a number of dummy predictor variables are created to predict the effect of each different level.
• Regression parameters in the case of logistic regression are estimated using the Maximum Likelihood Estimator (MLE). In binary logistic regression, the response variable Y takes only two values (Y = 0 and Y = 1). The unknown model parameters are estimated using maximum-likelihood estimation.
• A coefficient describes the size of the contribution of that predictor; a large coefficient indicates that the variable strongly influences the probability of that outcome, while a near-zero coefficient indicates that the variable has little influence on the probability of that outcome.
• A positive sign indicates that the explanatory variable increases the probability of the outcome, while a negative sign indicates that the variable decreases the probability of that outcome. A confidence interval for each parameter shows the uncertainty in the estimate.
• With Z = β0 + β1 X1 + β2 X2 + ... + βn Xn, the probability of the outcome is

    P(Y = 1) = π(Z) = e^Z / (1 + e^Z)

• The probability function of binary logistic regression for a specific observation Yi is given by

    P(Yi) = π(Zi)^Yi * (1 - π(Zi))^(1 - Yi)

• The log-likelihood function is given by

    ln(L) = LL = Σi [ Yi ln π(Zi) + (1 - Yi) ln(1 - π(Zi)) ]

5.4.2 Logistic Regression Model Diagnostics
• Regression models for categorical outcomes should be evaluated for fit and adherence to model assumptions. There are two main elements of such an assessment : discrimination and calibration. Discrimination measures the ability of the model to correctly classify observations into outcome categories. Calibration measures how well the model-estimated probabilities agree with the observed outcomes, and it is typically evaluated via a goodness-of-fit test.
• The (binary) logistic regression model describes the relationship between a binary outcome variable and one or more predictor variables. Here we discuss four tests of the model :
1. Omnibus test : The omnibus test is a likelihood-ratio chi-square test of the current model versus the null model. A significance value of less than 0.05 indicates that the current model outperforms the null model. Omnibus tests are generic statistical tests used for checking whether the variance explained by the model is more than the unexplained variance.
2. Wald's test : Wald's test is used for checking whether an individual explanatory variable is statistically significant. Wald's test is a chi-square test. A Wald test calculates a Z statistic, Z = β / SE(β). This value is squared, which yields a chi-square distribution, and is used as the Wald test statistic.
3. Hosmer-Lemeshow test : A chi-square goodness-of-fit test for binary logistic regression.
4. Pseudo R² : Pseudo R² is a measure of goodness of the model. It is called pseudo R² because it does not have the same interpretation as R² in the MLR model.
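As an illustration of fitting a binary logistic regression by maximum likelihood in Python, here is a small added sketch using the statsmodels formula API discussed earlier; the data frame df and the columns y, x1, x2 are invented placeholders, not from the text.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented example data : y is a 0/1 outcome, x1 and x2 are predictors.
    df = pd.DataFrame({
        'y':  [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
        'x1': [1.2, 2.6, 2.5, 0.8, 0.9, 1.4, 3.0, 2.2, 1.1, 2.1],
        'x2': [0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
    })

    # Logit model : ln(p / (1 - p)) = b0 + b1*x1 + b2*x2, fitted by MLE.
    results = smf.logit('y ~ x1 + x2', data=df).fit()

    print(results.params)       # estimated coefficients (log-odds per unit change)
    print(results.llf)          # maximized log-likelihood LL
    print(results.prsquared)    # McFadden's pseudo R-squared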
5.4.3 Variable Selection in Logistic Regression
• Variable selection is an important consideration when creating logistic regression models. Variables must be selected carefully so that the model makes accurate predictions, but without over-fitting the data.

Forward LR (Likelihood Ratio) :
• In forward LR, at each step one variable is added to the model. The following steps are used in building a logistic regression model using the forward LR selection method :
1. Start with no variables in the model.
2. For each independent variable, calculate the difference between the -2LL values of the model without and with that variable.
3. Repeat step 2 till all the variables are exhausted or the change in -2LL is not significant, that is, the p-value after adding a new variable is greater than 0.05.
• Method selection allows you to specify how independent variables are entered into the analysis. Using different methods, you can construct a variety of regression models from the same set of variables :
• Enter : A procedure for variable selection in which all variables in a block are entered in a single step.
• Forward selection (Conditional) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of a likelihood-ratio statistic based on conditional parameter estimates.
• Forward selection (Likelihood Ratio) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of a likelihood-ratio statistic based on the maximum partial likelihood estimates.
• Forward selection (Wald) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of the Wald statistic.
• Backward elimination (Conditional) : Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on conditional parameter estimates.
• Backward elimination (Likelihood Ratio) : Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on the maximum partial likelihood estimates.
• Backward elimination (Wald) : Backward stepwise selection. Removal testing is based on the probability of the Wald statistic.

5.5 Time Series Analysis
• First of all, we will create a scatter plot of dates and values in Matplotlib using plt.plot_date(). We will be using Python's built-in module datetime (datetime, timedelta) for parsing the dates. So, let us create a Python file called 'plot_time_series.py'.

Output : a scatter plot of the values against the dates (figure not reproduced).
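The listing of 'plot_time_series.py' is not legible in this copy, so the following is a minimal sketch consistent with the description above; the file name, the plt.plot_date() call and the datetime imports come from the text, while the sample dates and values are invented for illustration.

    # plot_time_series.py - scatter plot of dates versus values with Matplotlib.
    from datetime import datetime, timedelta
    import matplotlib.pyplot as plt

    # Invented sample data : one value per day for a week.
    start = datetime(2019, 7, 28)
    dates = [start + timedelta(days=i) for i in range(7)]
    values = [0.745, 0.739, 0.732, 0.741, 0.737, 0.744, 0.750]

    # plt.plot_date() draws the dates on the x-axis as a scatter plot.
    plt.plot_date(dates, values)
    plt.gcf().autofmt_xdate()      # tilt the date labels so they do not overlap
    plt.xlabel('Date')
    plt.ylabel('Value')
    plt.title('Time series scatter plot')
    plt.show()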
5.5.1 Missing Values
• Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption. Handling missing data is important, as many machine learning algorithms do not support data with missing values.
• We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
• In Python, specifically in Pandas, NumPy and scikit-learn, we mark missing values as NaN. Values with a NaN value are ignored by operations like sum, count, etc.
• Use the isnull() method to detect the missing values. The Pandas DataFrame function isnull() creates a new dataframe of the same size as the calling dataframe; it contains only True and False, with True at the places where the original dataframe holds NaN and False at other places.

Encoding missingness :
• The fillna() function is used to fill NA/NaN values using a specified method.
Syntax :

    DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

where,
1. value : A value that is used to fill the null values.
2. method : A method that is used to fill the null values.
3. axis : Takes an int or string value for rows/columns.
4. inplace : If it is True, it fills values in place.
5. limit : An integer value that specifies the maximum number of consecutive forward/backward NaN value fills.
6. downcast : Takes a dict that specifies what to downcast, e.g. float64 to int64.

5.5.2 Serial Correlation
• Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. It measures the relationship between a variable's current value and its past values.
• A variable that is serially correlated indicates that it may not be random. Technical analysts validate the profitable patterns of a security or group of securities and determine the risk associated with investment opportunities.
• The most common form of serial correlation is called first-order serial correlation, in which the error in time t is related to the previous period's (t - 1) error :

    ε_t = ρ ε_(t-1) + u_t,   -1 < ρ < 1

where ρ is the first-order serial correlation coefficient :
- ρ > 0 indicates positive serial correlation : the error terms will tend to have the same sign from one period to the next.
- ρ < 0 indicates negative serial correlation : the error terms will tend to have a different sign from one period to the next.

Impure serial correlation :
• This type of serial correlation is caused by a specification error such as an omitted variable or ignoring nonlinearities. Suppose the true regression equation is

    Y_t = β0 + β1 X_1t + β2 X_2t + ε_t

but X_2t is omitted from the estimated model; the error term ε_t will then capture the effect of X_2t. Since many economic variables exhibit trends over time, X_2t is likely to depend on X_2,t-1, X_2,t-2, ... This translates into a seeming correlation between ε_t and ε_(t-1), ε_(t-2), ..., and this serial correlation would violate the independence assumption.
• A specification error of the functional form can also cause this type of serial correlation. Suppose the true regression equation between Y and X is quadratic but we assume it is linear; the error term will then depend on X².

The consequences of serial correlation :
1. Pure serial correlation does not cause bias in the regression coefficient estimates.
2. Serial correlation causes OLS to no longer be a minimum variance estimator.
3. Serial correlation causes the estimated variances of the regression coefficients to be biased, leading to unreliable hypothesis testing.
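A quick way to check a residual series for first-order serial correlation is to estimate ρ as the correlation between the series and its one-period lag. The sketch below is an added illustration using pandas; the Series e of residuals is an assumption.

    import pandas as pd

    # e is assumed to be a pandas Series of regression residuals indexed by time.
    # The lag-1 autocorrelation estimates rho in e_t = rho * e_(t-1) + u_t.
    rho_hat = e.autocorr(lag=1)

    print(rho_hat)    # > 0 : positive serial correlation, < 0 : negative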
5.5.3 Autocorrelation
• Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series. The value of autocorrelation varies between +1 and -1.
• If the autocorrelation of a series is a very small value, that does not mean there is no correlation; the correlation could be non-linear.
• A value between -1 and 0 represents negative autocorrelation. A value between 0 and 1 represents positive autocorrelation.
• Autocorrelation gives information about the trend of a set of historical data, so it can be useful in technical analysis for the equity market.
• Fig. 5.5.1 shows positive and negative autocorrelation.

Fig. 5.5.1 : (a) Positive autocorrelation (b) Negative autocorrelation

• A technical analyst can learn, through autocorrelation, how the stock price of a particular day is affected by those of previous days. Thus, he or she can estimate how the price will move in the future.
• If the price of a stock with strong positive autocorrelation has been increasing for several days, the analyst can reasonably estimate that the price will continue to move upward in the coming days. The analyst may buy and hold the stock for a short period of time to profit from the upward price movement.
• Autocorrelation analysis only provides information about short-term trends and tells little about the fundamentals of a company. Therefore, it can only be applied to support trades with short holding periods.

5.6 Introduction to Survival Analysis
• Survival analysis is used to analyze data in which the time until the event is of interest. The response is often referred to as a failure time, survival time or event time.
• Originally, this branch of statistics developed around measuring the effects of medical treatment on patients' survival in clinical trials. For example, imagine a group of cancer patients who are administered a certain new form of treatment. Survival analysis can be used for analyzing the results of that treatment in terms of the patients' life expectancy.

Censoring :
• Censoring is present when we have some information about a subject's event time, but we do not know the exact event time. For the analysis methods we will discuss to be valid, the censoring mechanism must be independent of the survival mechanism.
• There are generally three reasons why censoring might occur :
a. A subject does not experience the event before the study ends.
b. A person is lost to follow-up during the study period.
c. A person withdraws from the study.
These are all examples of right-censoring.
• Types of right-censoring :
1. Fixed type I censoring occurs when a study is designed to end after C years of follow-up. In this case, everyone who does not have an event observed during the course of the study is censored at C years.
2. In random type I censoring, the study is designed to end after C years, but censored subjects do not all have the same censoring time. This is the main type of right-censoring we will be concerned with.
3. In type II censoring, a study ends when there is a pre-specified number of events.
• The survival function is a function of time (t) and can be represented as

    S(t) = Pr(T > t)

where Pr() stands for the probability and T for the time of the event of interest for a random observation from the sample. We can interpret the survival function as the probability of the event of interest (for example, the death event) not occurring by time t.
• The survival function takes values in the range between 0 and 1 (inclusive) and is a non-increasing function of t.
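To make the definition S(t) = Pr(T > t) concrete, here is a small added sketch that computes and plots an empirical survival function with NumPy. The survival times are invented and assumed to be fully observed (no censoring); handling censored observations would require an estimator such as Kaplan-Meier, which is beyond this illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented survival times (in months), all events observed (no censoring).
    T = np.array([2, 3, 3, 5, 8, 8, 9, 12, 15, 20])

    # Empirical survival function : S(t) = fraction of subjects with T > t.
    t_grid = np.arange(0, T.max() + 1)
    S = np.array([(T > t).mean() for t in t_grid])

    plt.step(t_grid, S, where='post')
    plt.xlabel('t (months)')
    plt.ylabel('S(t) = Pr(T > t)')
    plt.title('Empirical survival function')
    plt.show()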
5.7 Two Marks Questions with Answers

Q.1 What is logistic regression ?
Ans. : Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables. Logistic regression is one of the supervised learning algorithms.

Q.2 What is the omnibus test ?
Ans. : The omnibus test is a likelihood-ratio chi-square test of the current model versus the null model. A significance value of less than 0.05 indicates that the current model outperforms the null model. Omnibus tests are generic statistical tests used for checking whether the variance explained by the model is more than the unexplained variance.

Q.3 Define serial correlation.
Ans. : Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. It measures the relationship between a variable's current value and its past values.

Q.4 What are the consequences of serial correlation ?
Ans. :
1. Pure serial correlation does not cause bias in the regression coefficient estimates.
2. Serial correlation causes OLS to no longer be a minimum variance estimator.
3. Serial correlation causes the estimated variances of the regression coefficients to be biased, leading to unreliable hypothesis testing.

Q.5 Define autocorrelation.
Ans. : Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series.

Q.6 What are the reasons for censoring ?
Ans. : There are generally three reasons why censoring might occur :
a. A subject does not experience the event before the study ends.
b. A person is lost to follow-up during the study period.
c. A person withdraws from the study.

Q.7 Explain regression using statsmodels.
Ans. : The statsmodels linear regression model helps us to predict and is used for fitting the scenario where one parameter is directly dependent on another parameter. Here, we have one variable that is dependent and another that is independent. Depending on the change in the value of the independent parameter, we need to predict the change in the dependent variable.

Q.8 Why is residual analysis important ?
Ans. : Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following :
1. The residuals are normally distributed.
2. The variance of the residuals is constant (homoscedasticity).
3. The functional form of the regression is correctly specified.
4. Whether there are any outliers.

Q.9 What is spurious regression ?
Ans. : The regression is spurious when we regress one random walk onto another independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship.
