Understanding Binary Logistic Regression
The term 'logit' in logistic regression refers to the natural logarithm of the odds of the dependent event occurring. It is modeled linearly with respect to the explanatory variables. Specifically, it is expressed as logit(p) = log(p/(1-p)) = β0 + β1X1 + ... + βnXn, where p is the probability that the dependent variable equals 1, and the β coefficients represent the model parameters. The linear relationship between the logit of the response and the explanatory variables allows for predicting binary outcomes effectively.
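The logit and its inverse (the sigmoid, which maps a linear predictor back to a probability) can be sketched in a few lines. The coefficients below are hypothetical, chosen only to illustrate the round trip:

```python
import math

def logit(p: float) -> float:
    """Log-odds of probability p: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    """Inverse of the logit: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients beta0, beta1 for a single predictor x.
beta0, beta1 = -1.5, 0.8
x = 2.0
eta = beta0 + beta1 * x   # linear predictor, equal to logit(p)
p = sigmoid(eta)          # predicted probability that the outcome equals 1
assert abs(logit(p) - eta) < 1e-9  # logit and sigmoid are inverses
```

Note that the model is linear on the logit scale, not on the probability scale: p itself follows an S-shaped curve in x.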
The key assumptions of a logistic regression model include: observations should be independently distributed; binary logistic regression assumes a Bernoulli distribution of the response; and the relationship between the logit of the response and the explanatory variables is linear (not the relationship between the dependent and independent variables themselves). Additionally, independent variables may enter as nonlinear transformations, homogeneity of variance is not required, errors must be independent but need not be normally distributed, and parameters are estimated by maximum likelihood estimation (MLE) rather than ordinary least squares (OLS).
The concept of odds is central to interpreting the results in logistic regression because it provides an interpretable measure of association between an independent variable and the probability of occurrence of an event. The odds are calculated as the ratio of the probability that an event occurs to the probability that it does not. In logistic regression, the exponential of a regression coefficient gives the odds ratio, which quantifies how the expected odds change with a unit change in the predictor variable, allowing for understanding the relative likelihood of different outcomes.
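The odds calculation itself is a one-liner; the probability used below is an arbitrary illustrative value:

```python
def odds(p: float) -> float:
    """Odds of an event: probability it occurs over probability it does not."""
    return p / (1 - p)

# A probability of 0.8 corresponds to odds of 0.8 / 0.2 = 4 (four-to-one in favour).
event_odds = odds(0.8)
```

Note the asymmetry with probability: probabilities live in [0, 1], while odds range over [0, ∞), with odds of 1 corresponding to a 50% probability.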
The significance of individual regression coefficients in logistic regression is tested using Wald's test. This involves comparing the estimated coefficient to its standard error. Specifically, for large samples, under the null hypothesis H0: βj = βj0, the test statistic z = (β̂j - βj0)/SE(β̂j) follows an approximate normal distribution N(0,1). If z falls inside the critical (rejection) region for a given significance level, H0 is rejected, indicating that the coefficient is significant and contributes to the model.
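A minimal sketch of the Wald test, using the standard normal CDF via `math.erf`; the coefficient and standard error below are hypothetical values, not fitted results:

```python
import math

def wald_test(beta_hat: float, se: float, beta_null: float = 0.0):
    """Wald z statistic and two-sided p-value under the normal approximation."""
    z = (beta_hat - beta_null) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)  # two-sided p-value
    return z, p_value

# Hypothetical estimate and standard error: z = 0.9 / 0.3 = 3.0.
z, p = wald_test(beta_hat=0.9, se=0.3)
# |z| = 3.0 exceeds the 5% critical value of 1.96, so H0: beta = 0 is rejected.
```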
The Hessian matrix plays a critical role in estimating parameters in logistic regression. It is the matrix of second partial derivatives of the log-likelihood function with respect to the parameters, and it is used in the Newton-Raphson optimization method to find maximum likelihood estimates. The Hessian matrix provides curvature information that guides the adjustment of parameter estimates iteratively until convergence. The inverse of the negative Hessian (the observed information matrix) is used to estimate the variance-covariance matrix of the parameter estimates, enabling the calculation of standard errors.
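For an intercept-only model the Hessian is a scalar, which makes the Newton-Raphson iteration easy to show without matrix algebra. This is a sketch under that simplifying assumption, with a tiny made-up 0/1 sample:

```python
import math

def fit_intercept_only(y, iters=25):
    """Newton-Raphson MLE for an intercept-only logistic model.

    The score (first derivative of the log-likelihood) is sum(y_i) - n*p,
    and the Hessian (second derivative) is -n * p * (1 - p).
    """
    beta = 0.0
    n = len(y)
    for _ in range(iters):
        p = 1 / (1 + math.exp(-beta))
        score = sum(y) - n * p
        hessian = -n * p * (1 - p)
        beta -= score / hessian  # Newton update: beta - H^{-1} * score
    # The inverse of the negative Hessian gives the variance of beta-hat.
    p = 1 / (1 + math.exp(-beta))
    se = math.sqrt(1 / (n * p * (1 - p)))
    return beta, se

y = [1, 1, 1, 0]  # illustrative sample: 3 successes out of 4
beta_hat, se = fit_intercept_only(y)
# beta_hat converges to log(3/1), the logit of the sample proportion 0.75.
```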
Understanding the assumptions underlying logistic regression is important when interpreting model results because violations of these assumptions can lead to biased estimates, incorrect conclusions, and poor generalizability of the model. For example, the assumption of independent observations is crucial for the validity of significance tests, and an incorrectly assumed distribution of the response or errors can produce misleading conclusions about associations. A clear grasp of the underlying assumptions enables appropriate model diagnostics, adjustments, and improved predictions, ensuring robust inferences and practical applicability.
Logistic regression can be extended to handle multiple categories in the response variable through multinomial logistic regression. In this extension, the model simultaneously estimates separate regression equations for each category of the response variable, in comparison to a baseline category. This enables modeling scenarios where the response variable can take on more than two values, accommodating polychotomous outcomes. The multinomial extension provides odds ratios for each response category relative to the baseline, preserving the interpretability of the logistic model.
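The baseline-category structure can be sketched with a softmax over per-category linear predictors, fixing the baseline's predictor at zero. The coefficients below are hypothetical, purely for illustration:

```python
import math

def softmax_probs(x, betas):
    """Category probabilities for a multinomial logit with a baseline category.

    `betas` holds one hypothetical (intercept, slope) pair per non-baseline
    category; the baseline category's linear predictor is fixed at 0.
    """
    scores = [0.0] + [b0 + b1 * x for b0, b1 in betas]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # one probability per category

# Two non-baseline categories, so three probabilities in total.
probs = softmax_probs(x=1.0, betas=[(0.2, 0.5), (-0.3, 1.0)])
assert abs(sum(probs) - 1.0) < 1e-12  # probabilities sum to one
```

With two categories this reduces exactly to binary logistic regression: a single non-baseline equation whose score against the baseline is the logit.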
In logistic regression, the odds ratio is the ratio of two odds for different values of an independent variable. It can be interpreted as the multiplicative change in the odds for a one-unit change in the independent variable. If the independent variable is binary, the odds ratio compares the odds of the outcome occurring when the independent variable is at one level versus another. If the odds ratio equals 1, it suggests no effect. If it is greater than 1, the likelihood of the outcome increases as the independent variable increases, and less than 1 suggests a decrease.
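A quick numeric check, with hypothetical coefficients, that the odds ratio for a unit change equals exp(β1) no matter where on the x axis the change happens:

```python
import math

def odds_at(x, beta0, beta1):
    """Odds of the outcome at predictor value x under a fitted logistic model."""
    p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
    return p / (1 - p)

beta0, beta1 = -1.0, 0.7  # hypothetical coefficients
or_from_odds = odds_at(3.0, beta0, beta1) / odds_at(2.0, beta0, beta1)
or_from_coef = math.exp(beta1)
# The two agree: a one-unit increase in x multiplies the odds by exp(beta1),
# regardless of the starting value of x.
```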
Maximum likelihood estimation (MLE) is preferred over ordinary least squares (OLS) in logistic regression because MLE focuses on finding the parameter values that maximize the likelihood of observing the given sample data. OLS, which works well for linear regression, relies on minimizing the sum of squared errors and assumes normal distribution of errors. In contrast, logistic regression deals with binary outcomes and assumes Bernoulli distribution, where MLE provides statistically consistent and asymptotically efficient estimates suitable for large samples .
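The quantity MLE works with is the Bernoulli log-likelihood (here written as a negative log-likelihood to be minimized), rather than the sum of squared errors that OLS minimizes. The data and parameter values below are made up for illustration:

```python
import math

def neg_log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli negative log-likelihood that MLE minimizes for logistic regression."""
    nll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        # Each 0/1 outcome contributes -log of its predicted probability.
        nll -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return nll

xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]
# Parameters that track the data give a lower (better) value than a flat model.
nll_good = neg_log_likelihood(-1.5, 1.0, xs, ys)
nll_flat = neg_log_likelihood(0.0, 0.0, xs, ys)
```

The MLE is the (β0, β1) pair that minimizes this quantity; unlike OLS, there is no closed-form solution, which is why iterative methods such as Newton-Raphson are used.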
Logistic regression models do not directly handle nonlinear relationships between the predictor and response variables; instead, they assume a linear relationship between the logit of the response and the explanatory variables. However, nonlinear relationships can be addressed by including polynomial terms or transformations of the predictors as inputs in the model, effectively capturing nonlinearities within the framework of a linear logit relationship .
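For instance, adding a squared term keeps the logit linear in the coefficients while letting the probability curve bend in x. The coefficients below are hypothetical, chosen to produce a U-shaped probability curve:

```python
import math

def predict_quadratic(x, beta0, beta1, beta2):
    """Logistic model with a polynomial term: logit(p) = b0 + b1*x + b2*x^2.

    The logit is still linear in the coefficients, so the usual fitting
    machinery applies even though p is nonlinear in x.
    """
    eta = beta0 + beta1 * x + beta2 * x * x
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients: probability falls and then rises again in x.
probs = [predict_quadratic(x, beta0=1.0, beta1=-2.0, beta2=0.5)
         for x in (-1.0, 0.0, 2.0, 4.0)]
```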