Logistic Regression

Logistic regression models the probabilities for classification problems with two possible outcomes. It is an extension of the linear regression model for classification problems.

What is Wrong with Linear Regression for Classification?

The linear regression model can work well for regression, but fails for classification. Why is that? In the case of two classes, you could label one of the classes with 0 and the other with 1 and use linear regression. Technically it works, and most linear model programs will spit out weights for you. But there are a few problems with this approach:

A linear model does not output probabilities, but treats the classes as numbers (0 and 1) and fits the best hyperplane (for a single feature, it is a line) that minimizes the distances between the points and the hyperplane. So it simply interpolates between the points, and you cannot interpret it as probabilities.

A linear model also extrapolates and gives you values below zero and above one. This is a good sign that there might be a smarter approach to classification.

Since the predicted outcome is not a probability, but a linear interpolation between points, there is no meaningful threshold at which you can distinguish one class from the other. A good illustration of this issue has been given on Stack Overflow.

Linear models do not extend to classification problems with multiple classes. You would have to start labeling the next class with 2, then 3, and so on. The classes might not have any meaningful order, but the linear model would force a weird structure on the relationship between the features and your class predictions. The higher the value of a feature with a positive weight, the more it contributes to the prediction of a class with a higher number, even if classes that happen to get a similar number are not closer than other classes.

FIGURE 5.5: A linear model classifies tumors as malignant (1) or benign (0) given their size. The lines show the prediction of the linear model. For the data on the left, we can use 0.5 as the classification threshold. After introducing a few more malignant tumor cases, the regression line shifts and a threshold of 0.5 no longer separates the classes. Points are slightly jittered to reduce over-plotting.
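
To make the first two problems concrete, here is a minimal sketch in R with simulated tumor data (the variable names and all numbers are made up for illustration, not taken from the book's dataset):

```r
# Simulated tumor data (illustrative only): size in cm, malignant as 0/1
set.seed(42)
size <- c(runif(25, 1, 5), runif(25, 3, 8))
malignant <- rep(c(0, 1), each = 25)

# Linear regression on the 0/1 labels
lin_mod <- lm(malignant ~ size)

# The "predictions" are not probabilities: they fall below 0 and above 1
predict(lin_mod, data.frame(size = c(0, 4, 12)))
```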

Theory

A solution for classification is logistic regression. Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as:

\[\text{logistic}(\eta)=\frac{1}{1+exp(-\eta)}\]

And it looks like this:

FIGURE 5.6: The logistic function. It outputs numbers between 0 and 1. At input 0, it outputs 0.5.
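
For reference, a minimal sketch of this function in R (base R's `plogis` computes the same thing):

```r
# The logistic (sigmoid) function as defined above
logistic <- function(eta) 1 / (1 + exp(-eta))

logistic(0)          # 0.5, as shown in the figure
logistic(c(-5, 5))   # close to 0 and close to 1
plogis(0)            # base R equivalent (logistic distribution CDF), also 0.5
```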

The step from linear regression to logistic regression is fairly straightforward. In the linear regression model, we modeled the relationship between outcome and features with a linear equation:

\[\hat{y}^{(i)}=\beta_{0}+\beta_{1}x^{(i)}_{1}+\ldots+\beta_{p}x^{(i)}_{p}\]

For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation in the logistic function. This forces the output to take only values between 0 and 1.

\[P(y^{(i)}=1)=\frac{1}{1+exp(-(\beta_{0}+\beta_{1}x^{(i)}_{1}+\ldots+\beta_{p}x^{(i)}_{p}))}\]
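
A minimal sketch of this equation in R, with made-up weights and feature values (none of these numbers come from a fitted model):

```r
# Hypothetical weights and feature values for one instance (purely illustrative)
beta <- c(-3, 0.8, 1.5)      # beta_0, beta_1, beta_2
x    <- c(2, 0.5)            # x_1, x_2

eta <- beta[1] + sum(beta[-1] * x)   # linear predictor
p   <- 1 / (1 + exp(-eta))           # P(y = 1), guaranteed to lie in (0, 1)
p                                    # about 0.34 for these made-up numbers
```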

Let us revisit the tumor size example. But instead of the linear regression model, we use the logistic regression model:

FIGURE 5.7: The logistic regression model finds the correct decision boundary between malignant and benign depending on tumor size. The line is the logistic function shifted and squeezed to fit the data.

Classification works better with logistic regression, and we can use 0.5 as a threshold in both cases. The inclusion of additional points does not really affect the estimated curve.
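
A minimal sketch of such a fit with R's `glm` function (the same kind of simulated data as before; not the book's data):

```r
# Simulated, overlapping tumor data (illustrative only)
set.seed(42)
size <- c(runif(25, 1, 5), runif(25, 3, 8))
malignant <- rep(c(0, 1), each = 25)

# Logistic regression: family = binomial uses the logit link by default
log_mod <- glm(malignant ~ size, family = binomial())

# type = "response" returns probabilities P(y = 1), always between 0 and 1
probs <- predict(log_mod, data.frame(size = c(0, 4, 8, 12)), type = "response")
round(probs, 2)

# Classify with a threshold of 0.5
as.integer(probs > 0.5)
```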

Interpretation

The interpretation of the weights in logistic regression differs from the interpretation of the weights in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The weights no longer influence the probability linearly. The weighted sum is transformed by the logistic function into a probability. Therefore we need to reformulate the equation for the interpretation so that only the linear term is on the right side of the formula.

\[ln\left(\frac{P(y=1)}{1-P(y=1)}\right)=log\left(\frac{P(y=1)}{P(y=0)}\right)=\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{p}x_{p}\]

We call the term in the ln() function "odds" (probability of event divided by probability of no event), and wrapped in the logarithm it is called the log odds.

This formula shows that the logistic regression model is a linear model for the log odds. Great! That does not sound helpful! With a little shuffling of the terms, you can figure out how the prediction changes when one of the features \(x_j\) is changed by 1 unit. To do this, we can first apply the exp() function to both sides of the equation:

\[\frac{P(y=1)}{1-P(y=1)}=odds=exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{p}x_{p}\right)\]

Then we compare what happens when we increase one of the feature values by 1. But instead of looking at the difference, we look at the ratio of the two predictions:

\[\frac{odds_{x_j+1}}{odds}=\frac{exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{j}(x_{j}+1)+\ldots+\beta_{p}x_{p}\right)}{exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{j}x_{j}+\ldots+\beta_{p}x_{p}\right)}\]

We apply the following rule:

\[\frac{exp(a)}{exp(b)}=exp(a-b)\]

And we remove many terms:

\[\frac{odds_{x_j+1}}{odds}=exp\left(\beta_{j}(x_{j}+1)-\beta_{j}x_{j}\right)=exp\left(\beta_j\right)\]

In the end, we have something as simple as exp() of a feature weight. A change in a feature by one unit changes the odds ratio (multiplicative) by a factor of \(\exp(\beta_j)\). We could also interpret it this way: A change in \(x_j\) by one unit increases the log odds ratio by the value of the corresponding weight. Most people interpret the odds ratio because thinking about the ln() of something is known to be hard on the brain. Interpreting the odds ratio already requires some getting used to. For example, if you have odds of 2, it means that the probability for y=1 is twice as high as y=0. If you have a weight (= log odds ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7) (approximately 2) and the odds change to 4. But usually you do not deal with the odds and interpret the weights only as the odds ratios. Because for actually calculating the odds you would need to set a value for each feature, which only makes sense if you want to look at one specific instance of your dataset.
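
The arithmetic in this example can be checked directly; the last (commented) line shows the usual way to turn fitted weights into odds ratios, assuming a hypothetical model object:

```r
# The arithmetic from the paragraph above
exp(0.7)       # roughly 2: a weight of 0.7 roughly doubles the odds per unit increase
2 * exp(0.7)   # odds of 2 become roughly 4

# For a fitted logistic regression model (here a hypothetical object `log_mod`),
# the odds ratios are obtained by exponentiating the weights:
# exp(coef(log_mod))
```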

These are the interpretations for the logistic regression model with different feature types:

  • Numerical feature: If you increase the value of feature \(x_{j}\) by one unit, the estimated odds change by a factor of \(\exp(\beta_{j})\).
  • Binary categorical feature: One of the two values of the feature is the reference category (in some languages, the one encoded as 0). Changing the feature \(x_{j}\) from the reference category to the other category changes the estimated odds by a factor of \(\exp(\beta_{j})\).
  • Categorical feature with more than two categories: One solution to deal with multiple categories is one-hot-encoding, meaning that each category has its own column. You only need L-1 columns for a categorical feature with L categories, otherwise it is over-parameterized. The L-th category is then the reference category. You can use any other encoding that can be used in linear regression. The interpretation for each category is then equivalent to the interpretation of binary features (see the encoding sketch after this list).
  • Intercept \(\beta_{0}\): When all numerical features are zero and the categorical features are at the reference category, the estimated odds are \(\exp(\beta_{0})\). The interpretation of the intercept weight is usually not relevant.
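
A minimal sketch of the L-1 column encoding in R, with a made-up feature; `model.matrix` shows the dummy columns that `glm` would build internally from a factor:

```r
# A categorical feature with L = 3 made-up categories
color <- factor(c("red", "green", "blue", "green", "red"))

# Treatment coding: an intercept plus L - 1 = 2 dummy columns;
# the first level ("blue", alphabetically) serves as the reference category
model.matrix(~ color)
```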

Example

We apply the logistic regression model to predict cervical cancer based on some risk factors. The following table shows the estimated weights, the associated odds ratios, and the standard errors of the estimates.

TABLE 5.2: The results of fitting a logistic regression model on the cervical cancer dataset. Shown are the features used in the model, their estimated weights and corresponding odds ratios, and the standard errors of the estimated weights.
| Feature | Weight | Odds ratio | Std. Error |
|---|---|---|---|
| Intercept | -2.91 | 0.05 | 0.32 |
| Hormonal contraceptives y/n | -0.12 | 0.89 | 0.30 |
| Smokes y/n | 0.26 | 1.30 | 0.37 |
| Num. of pregnancies | 0.04 | 1.04 | 0.10 |
| Num. of diagnosed STDs | 0.82 | 2.27 | 0.33 |
| Intrauterine device y/n | 0.62 | 1.86 | 0.40 |

Interpretation of a numerical feature ("Num. of diagnosed STDs"): An increase in the number of diagnosed STDs (sexually transmitted diseases) changes (increases) the odds of cancer vs. no cancer by a factor of 2.27, when all other features remain the same. Keep in mind that correlation does not imply causation.

Interpretation of a categorical feature ("Hormonal contraceptives y/n"): For women using hormonal contraceptives, the odds for cancer vs. no cancer are lower by a factor of 0.89, compared to women without hormonal contraceptives, given all other features stay the same.

As in the linear model, the interpretations always come with the clause that 'all other features stay the same'.
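
The odds ratio column of Table 5.2 is simply exp() applied to the weight column, which is easy to verify (the vector names below are shorthand, not the dataset's column names):

```r
# Weights from Table 5.2; exponentiating reproduces the odds ratio column
weights <- c(intercept = -2.91, hormonal_contraceptives = -0.12, smokes = 0.26,
             num_pregnancies = 0.04, num_diagnosed_stds = 0.82, iud = 0.62)
round(exp(weights), 2)
# -> 0.05 0.89 1.30 1.04 2.27 1.86
```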

Advantages and Disadvantages

Many of the pros and cons of the linear regression model also apply to the logistic regression model. Logistic regression has been widely used by many different people, but it struggles with its restrictive expressiveness (e.g. interactions must be added manually) and other models may have better predictive performance.

Another disadvantage of the logistic regression model is that the interpretation is more difficult because the interpretation of the weights is multiplicative and not additive.

Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge, because the optimal weight would be infinite. This is really a bit unfortunate, because such a feature is actually useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of weights.
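
As a sketch of the penalization workaround, here is one possible approach using the `glmnet` package with simulated, perfectly separated data (the package choice and all numbers are illustrative):

```r
library(glmnet)

# Simulated data where x1 perfectly separates the classes; an unpenalized
# glm() fit would warn and push the weight of x1 toward infinity
set.seed(1)
x1 <- c(runif(20, -2, -0.1), runif(20, 0.1, 2))  # negative for class 0, positive for class 1
x2 <- rnorm(40)                                   # uninformative second feature
y  <- rep(c(0, 1), each = 20)

# Ridge-penalized logistic regression (alpha = 0): the penalty keeps the
# weights finite despite the complete separation
fit <- glmnet(cbind(x1, x2), y, family = "binomial", alpha = 0, lambda = 0.1)
coef(fit)
```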

On the good side, the logistic regression model is not only a classification model, but also gives you probabilities. This is a big advantage over models that can only provide the final classification. Knowing that an instance has a 99% probability for a class compared to 51% makes a big difference.

Logistic regression can also be extended from binary classification to multi-class classification. Then it is called Multinomial Regression.
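
A minimal sketch of the multi-class case, using `nnet::multinom` as one common implementation (the iris example is purely illustrative):

```r
library(nnet)

# Multinomial logistic regression on the built-in iris data (three classes),
# just to illustrate the multi-class extension
multi_mod <- multinom(Species ~ Sepal.Length + Petal.Length, data = iris, trace = FALSE)

# Predicted class probabilities, one row per instance
head(predict(multi_mod, type = "probs"))
```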

Software

I used the glm function in R for all examples. You can find logistic regression in any programming language that can be used for performing data analysis, such as Python, Java, Stata, Matlab, …