Wednesday, March 15, 2017

Log-likelihood, logit, -2 log-likelihood (-2LL)

Analogous to the residual sum of squares in multiple regression, the -2 log-likelihood (-2LL) indicates how much total error remains in a logistic regression model. The larger the value of the -2LL, the less accurate the model's predictions are.

The deviance, or -2 log-likelihood (-2LL) statistic
The deviance measures how much unexplained variation there is in our logistic regression model: the higher the value, the less accurate the model.

Deviance (-2LL)
This is the log-likelihood multiplied by -2 and is commonly used to assess how well a logistic regression model fits the data. The lower this value is, the better your model is at predicting your binary outcome variable.
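To make the definition concrete, here is a minimal sketch that computes -2LL directly from observed 0/1 outcomes and a model's predicted probabilities. The function name and the small data lists are made up for illustration and are not taken from any particular package or dataset.

```python
import math

def neg2_log_likelihood(y, p):
    """Return -2LL for binary outcomes y (0 or 1) and predicted probabilities p.

    Assumes every probability lies strictly between 0 and 1.
    """
    log_likelihood = 0.0
    for yi, pi in zip(y, p):
        # Each case contributes log(p) if the event occurred, log(1 - p) otherwise.
        log_likelihood += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -2.0 * log_likelihood

# Hypothetical outcomes and two sets of hypothetical predictions.
y = [1, 0, 1, 1, 0]
confident_predictions = [0.9, 0.1, 0.8, 0.7, 0.2]
vague_predictions = [0.6, 0.5, 0.5, 0.4, 0.5]

print(neg2_log_likelihood(y, confident_predictions))
print(neg2_log_likelihood(y, vague_predictions))
```

The predictions that sit closer to the observed outcomes give the smaller -2LL, which matches the "lower is better" reading above.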

Multiplying it by -2 is a technical step: the difference in -2LL between two nested models (for example, a baseline model and one with added predictors) approximately follows a chi-square distribution, which is useful because it can then be used to ascertain statistical significance. Don't worry if you do not fully understand the technicalities of this.
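As a rough illustration of how the chi-square property gets used, the sketch below compares the -2LL of a baseline model with that of a model containing one extra predictor. It assumes exactly 1 degree of freedom, so the p-value can be obtained from the standard library's `math.erfc`; the function name and the -2LL values are hypothetical.

```python
import math

def likelihood_ratio_test_df1(neg2ll_baseline, neg2ll_model):
    """Chi-square statistic and p-value for a drop in -2LL, assuming df = 1."""
    chi_square = neg2ll_baseline - neg2ll_model
    if chi_square < 0:
        raise ValueError("the larger model should not have a higher -2LL")
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Hypothetical -2LL values for a baseline model and a one-predictor model.
chi_square, p_value = likelihood_ratio_test_df1(120.0, 112.5)
print(chi_square, p_value)
```

With more than one added predictor the degrees of freedom change, and a general chi-square routine from a statistics library would be needed instead of this shortcut.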

The deviance has little intuitive meaning because it depends on the sample size and the number of parameters in the model as well as on the goodness of fit.

R2 equivalents for logistic regression
The two versions most commonly used are:
Hosmer-Lemeshow’s R2
Nagelkerke’s R2

Both approximate the proportion of variance in the outcome that the model successfully explains. Like R2 in multiple regression, these values range between 0 and 1, with a value of 1 suggesting that the model accounts for 100% of the variance in the outcome and 0 that it accounts for none of the variance. Be warned: they are calculated differently and may provide conflicting estimates!
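For a sense of how one of these is calculated, the Hosmer-Lemeshow R2 is usually defined as the proportional reduction in -2LL when moving from the baseline (intercept-only) model to the fitted model. The sketch below assumes that definition; the function name and the -2LL values are illustrative only.

```python
def hosmer_lemeshow_r2(neg2ll_baseline, neg2ll_model):
    """Proportional reduction in -2LL relative to the baseline model."""
    return (neg2ll_baseline - neg2ll_model) / neg2ll_baseline

# Hypothetical -2LL values: the fitted model removes a quarter of the baseline deviance.
print(hosmer_lemeshow_r2(100.0, 75.0))
```

A model that does no better than the baseline gives 0, and a model with a -2LL of 0 gives 1.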

Hosmer-Lemeshow goodness of fit: this option provides a χ2 (chi-square) test of whether or not the model is an adequate fit to the data. The null hypothesis is that the model is a ‘good enough’ fit to the data, and we will only reject this null hypothesis (i.e. decide it is a ‘poor’ fit) if there are sufficiently strong grounds to do so (conventionally if p < .05). We will see that with very large samples, as we have here, there can be problems with this level of significance, but more on that later.
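The sketch below shows one common way the Hosmer-Lemeshow statistic is computed: cases are sorted by predicted probability, split into groups, and the observed number of events in each group is compared with the expected number. It is an illustration under assumptions, not the exact routine of any statistics package: grouping rules (especially for tied probabilities) differ between programs, the function names are hypothetical, and the p-value helper only handles an even number of degrees of freedom.

```python
import math

def hosmer_lemeshow_statistic(y, p, groups=10):
    """Return (chi-square statistic, degrees of freedom).

    Assumes at least one case per group and predicted probabilities
    strictly between 0 and 1.
    """
    pairs = sorted(zip(p, y))  # order cases by predicted probability
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)  # events actually seen
        expected = sum(pi for pi, _ in chunk)  # events the model expects
        statistic += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return statistic, groups - 2

def chi_square_p_value_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df >= 2."""
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total
```

With the default of 10 groups the test has 8 degrees of freedom, so a call such as `chi_square_p_value_even_df(statistic, 8)` gives the p-value to compare against .05.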

The Model Summary table contains the Cox & Snell R Square and Nagelkerke R Square values, which are both methods of estimating the explained variation. These values are sometimes referred to as pseudo R2 values (and will typically have lower values than R2 in multiple regression). They are interpreted in the same manner, but with more caution. In this example, the explained variation in the dependent variable based on our model ranges from 24.0% to 33.0%, depending on whether you reference the Cox & Snell R2 or Nagelkerke R2 methods, respectively. Nagelkerke R2 is a modification of Cox & Snell R2, the latter of which cannot achieve a value of 1. For this reason, it is preferable to report the Nagelkerke R2 value.
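The relationship between the two statistics can be sketched from the -2LL values and the sample size, assuming the usual textbook formulas: Cox & Snell R2 is 1 - exp(-(baseline -2LL - model -2LL) / n), and Nagelkerke R2 divides that by its maximum attainable value. The function names and numbers below are hypothetical and are not the values from the Model Summary table above.

```python
import math

def cox_snell_r2(neg2ll_baseline, neg2ll_model, n):
    """Cox & Snell pseudo R2 from the two -2LL values and the sample size n."""
    return 1.0 - math.exp(-(neg2ll_baseline - neg2ll_model) / n)

def nagelkerke_r2(neg2ll_baseline, neg2ll_model, n):
    """Cox & Snell R2 rescaled by its maximum so that a perfect model scores 1."""
    max_cox_snell = 1.0 - math.exp(-neg2ll_baseline / n)  # always below 1
    return cox_snell_r2(neg2ll_baseline, neg2ll_model, n) / max_cox_snell

# Hypothetical example: 100 cases with a 50/50 outcome split in the baseline model.
n = 100
neg2ll_baseline = 200 * math.log(2)
neg2ll_model = 100.0

print(cox_snell_r2(neg2ll_baseline, neg2ll_model, n))
print(nagelkerke_r2(neg2ll_baseline, neg2ll_model, n))
```

Because the divisor is below 1, Nagelkerke R2 is never smaller than Cox & Snell R2, which is why the two figures in a Model Summary table differ in the direction described above.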
