
Understanding Logistic Regression

Published On:
Feb 21, 2020
Last Updated:
Mar 21, 2020

Logistic regression (or logit regression) is a very common and popular algorithm used in machine learning. It is used for making binary categorical predictions, such as “is it going to rain today?”, or more precisely, it can give a percentage chance of it raining today. Logistic regression can also be extended to make multi-class categorical predictions, such as “is it more likely to be sunny, overcast, or rainy today?”.

Probability and Odds

Before learning about logistic regression, it is wise to understand the terms probability and odds.

We’ll start with a really simple equation. The probability $P$ of a binary event occurring is:

$$P$$

The probability of a binary event not occurring must then be:

$$1 - P$$

The odds $O$ is defined as the ratio of the probability of success to the probability of failure:

$$O = \frac{P}{1 - P}$$

It should be clear from the above equation that if $P$ can vary in the range $[0, 1]$ then $O$ can vary in the range $[0, \infty]$. The higher the odds, the higher the chance of success. This should sound very familiar to gamblers, who use this term frequently.
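
As a quick numerical example, a probability of 0.75 corresponds to odds of 0.75 / 0.25 = 3. The small Python sketch below (the helper functions are purely illustrative, not from any library) converts between the two:

```python
def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1 - p)

def odds_to_probability(o):
    """Convert odds (o >= 0) back to a probability."""
    return o / (1 + o)

print(probability_to_odds(0.75))  # 3.0
print(odds_to_probability(3.0))   # 0.75
```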

Why Use The Logarithmic Function?

The reason we introduce the $\ln()$ function into the equation begins to make sense once you understand basic linear regression, which is used to predict the value of a continuous target variable. The basic equation defining linear regression involving just one predictor $x$ and the outcome $y$ is:

$$\begin{align} y = \beta_0 + \beta_1 x \\ -\infty \lt x \lt \infty \\ -\infty \lt y \lt \infty \\ \end{align}$$

where:
$\beta_n$ are the coefficients
$x$ is the predictor
$y$ is the outcome

The problem with using linear regression for making binary categorical predictions (i.e. true/false) is that $y$ can vary from $-\infty$ to $+\infty$. We really want $y$ to range from $0$ to $1$. When varying between $0$ and $1$, this tells us the probability of the target being true or false. For example, if $y = 0.7$ this would say there is a 70% chance of the target being true, and conversely a 30% chance of the target being false.

To make things less confusing, we will replace $y$, which represents a continuous target variable, with $P$, which represents a probability:

$$\begin{align} P = \beta_0 + \beta_1 x\\ -\infty \lt x \lt \infty\\ 0 \leq P \leq 1 \end{align}$$

Notice a problem? The limits of the LHS and RHS of the equation don’t match up! This is where we begin to understand why the log function is introduced. We will try and modify the LHS such that it has the same range as the RHS ($-\infty$ to $+\infty$). What if we replace the probability $P$ on the LHS with the odds $O$?

$$\begin{align} O = \beta_0 + \beta_1 x\\ -\infty \lt x \lt \infty\\ 0 \leq O \lt \infty \end{align}$$

where:
$O$ is the odds

We are getting closer! Now the LHS varies from $0$ to $\infty$ rather than from just $0$ to $1$. So how do we modify a number which ranges from $0$ to $\infty$ so that it ranges from $-\infty$ to $+\infty$? One way is to use the $\ln()$ function! (To recap some mathematics: the $\ln$ of values between $0$ and $1$ maps to the range $-\infty$ to $0$, and the $\ln$ of values from $1$ to $\infty$ maps to the range $0$ to $\infty$.)

$$\begin{align} \ln(O) = \beta_0 + \beta_1 x\\ -\infty \lt x \lt \infty\\ -\infty \lt \ln(O) \lt \infty \end{align}$$

The ranges on both sides of the equation now match! The base of the logarithm does not actually matter (changing the base just rescales the coefficients). We choose to use the natural logarithm ($\ln$), but you could use any other base such as base 10 (typically written as $\log_{10}$ or just $\log$).
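
To see this mapping numerically, here is a tiny illustrative sketch showing how $\ln$ stretches the $(0, \infty)$ range of the odds across the whole real line:

```python
import math

for odds in [0.001, 0.5, 1.0, 2.0, 1000.0]:
    # Odds below 1 map to negative log-odds, odds above 1 map to positive log-odds
    print(f"odds = {odds:>8}, ln(odds) = {math.log(odds):+.3f}")
# odds =    0.001, ln(odds) = -6.908
# odds =      0.5, ln(odds) = -0.693
# odds =      1.0, ln(odds) = +0.000
# odds =      2.0, ln(odds) = +0.693
# odds =   1000.0, ln(odds) = +6.908
```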

Now we can see why it’s called logistic regression, and why it is useful.

However, the equation is usually re-arranged with $P$ on the LHS.

$$\begin{align} \ln(O) &= \beta_0 + \beta_1 x\\ O &= e^{\beta_0 + \beta_1 x} &\text{Take the exponential of both sides}\\ \frac{P}{1 - P} &= e^{\beta_0 + \beta_1 x} &\text{Substitute the definition of the odds $O$}\\ P &= e^{\beta_0 + \beta_1 x}(1 - P) &\text{Multiply both sides by $(1 - P)$}\\ P &= e^{\beta_0 + \beta_1 x} - Pe^{\beta_0 + \beta_1 x} &\text{Expand the RHS}\\ P + Pe^{\beta_0 + \beta_1 x} &= e^{\beta_0 + \beta_1 x} &\text{Move all terms with $P$ to the LHS}\\ P(1 + e^{\beta_0 + \beta_1 x}) &= e^{\beta_0 + \beta_1 x} &\text{Factor out $P$}\\ P &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} &\text{Divide both sides by $1 + e^{\beta_0 + \beta_1 x}$}\\ P &= \frac{1}{\frac{1}{e^{\beta_0 + \beta_1 x}} + 1} &\text{Divide top and bottom of the RHS fraction by $e^{\beta_0 + \beta_1 x}$}\\ P &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} &\text{Use the rule $\frac{1}{e^x} = e^{-x}$} \end{align}$$

As you can see from above, $P$ now has the form of a sigmoid function.
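
As a quick sanity check of the algebra above (an illustrative sketch with arbitrary values for $\beta_0$, $\beta_1$ and $x$), the odds-based form and the sigmoid form give the same probability:

```python
import math

# Arbitrary example values (purely illustrative)
beta_0, beta_1, x = -1.5, 0.8, 2.0

log_odds = beta_0 + beta_1 * x              # ln(O) = -1.5 + 0.8*2.0 = 0.1
odds = math.exp(log_odds)                   # O = e^(beta_0 + beta_1*x)
p_from_odds = odds / (1 + odds)             # P = O / (1 + O)
p_sigmoid = 1 / (1 + math.exp(-log_odds))   # P = 1 / (1 + e^-(beta_0 + beta_1*x))

print(p_from_odds, p_sigmoid)  # Both ~0.525
```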

What Does The Logistic Function Look Like?

So we have the basic logistic function equation:

$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

where:
$\beta_0$ and $\beta_1$ are constants

What happens as we change $\beta_1$?

Animated graph showing the effect of changing $\beta_1$ in the logistic function.

It changes the shape of the curve, starting off looking like a straight line, and progressively getting closer to looking like a step function. This $\beta_1$ term is analogous to the slope $m$ in linear regression.

What happens as we change $\beta_0$?

Animated graph showing the effect of changing $\beta_0$ in the logistic function.

This is analogous to the y-intercept $c$ in linear regression, except that $\beta_0$ shifts the curve along the x-axis.
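
If you want to reproduce plots like the animations above yourself, here is a minimal sketch (assuming numpy and matplotlib are available) that draws the logistic function for a few values of $\beta_1$ and $\beta_0$:

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x, beta_0, beta_1):
    """The logistic function P = 1 / (1 + e^-(beta_0 + beta_1*x))."""
    return 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))

x = np.linspace(-10, 10, 500)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Varying beta_1 changes the steepness of the curve (slope-like behaviour)
for beta_1 in [0.3, 1.0, 3.0]:
    ax1.plot(x, logistic(x, 0.0, beta_1), label=f"$\\beta_1 = {beta_1}$")
ax1.set_title("Effect of $\\beta_1$ ($\\beta_0 = 0$)")
ax1.legend()

# Varying beta_0 shifts the curve along the x-axis
for beta_0 in [-3.0, 0.0, 3.0]:
    ax2.plot(x, logistic(x, beta_0, 1.0), label=f"$\\beta_0 = {beta_0}$")
ax2.set_title("Effect of $\\beta_0$ ($\\beta_1 = 1$)")
ax2.legend()

plt.show()
```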

Worked Example

We can use logistic regression to perform basic “machine learning” tasks. We will use the famous Iris dataset, and write the code in Python, leveraging sklearn’s logistic regression training class and various reporting tools. The Iris dataset is popular enough that it’s bundled with a number of Python libraries, including seaborn (which is where we will grab it from):

import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

data = sns.load_dataset('iris')  # Load the Iris dataset as a pandas DataFrame
print(data.shape[0])             # Number of rows (samples) in the dataset
# 150
data.head()                      # Show the first 5 rows
The first 5 rows of the Iris dataset.

Split the data into the features x and the target y:

x = data.iloc[:, 0:-1] # All columns except "species"
y = data.iloc[:, -1] # The "species" column
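
If you want to double-check what ended up in x and y: seaborn’s copy of the Iris dataset has four numeric feature columns and a species column with three classes. A quick, optional check:

```python
print(list(x.columns))
# ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
print(y.unique())
# ['setosa' 'versicolor' 'virginica']
```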

Now split the data into training and test data:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
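
With test_size=0.2, the 150 rows are split into 120 training samples and 30 test samples, which you can confirm with a quick check:

```python
print(x_train.shape, x_test.shape)
# (120, 4) (30, 4)
```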

Let’s train the model:

model = LogisticRegression()
model.fit(x_train, y_train) # Training the model
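
Because the Iris target has three classes, sklearn fits this as a multi-class problem, so instead of a single $\beta_0$ and $\beta_1$ the model learns one intercept and one coefficient per feature for each class. A quick, optional inspection:

```python
print(model.intercept_.shape)  # (3,)   -- one intercept (beta_0) per class
print(model.coef_.shape)       # (3, 4) -- one coefficient per feature, per class
```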

Make predictions:

predictions = model.predict(x_test)
print(predictions)
The predictions of the type of Iris for the test data.

Print a “classification report”:

print(classification_report(y_test, predictions))
The classification report for our logistic regression model.

For someone new to classification, the terms in the classification report can be confusing. This is what they mean1 (a short sketch showing how to compute them directly follows this list):

  • The precision is how good the classifier is at not labelling an instance positive when it is actually negative. It can be thought of as: “For all instances labelled positive, what percentage of them are actually correct?”

    $$precision = \frac{true\,positive}{true\,positive + false\,positive}$$
  • The recall is the ability of the classifier to find all the positive instances. It can be thought of as: “For all instances that were actually positive, what percentage were labelled correctly?”

    $$recall = \frac{true\,positive}{true\,positive + false\,negative}$$
  • The f1-score is the harmonic mean of the precision and recall. Personally I find this the most difficult metric to understand intuitively. It is a score which incorporates both the precision and recall, and varies between 0 and 1.

  • The support is the number of actual occurrences of a class in a specific dataset.
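
As mentioned above, if you want to compute these metrics directly rather than reading them off the report, a minimal sketch using sklearn’s individual metric functions looks like this (average=None returns one value per class, matching the per-class rows of the report):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# One value per class, in the same order as the rows of the classification report
print(precision_score(y_test, predictions, average=None))
print(recall_score(y_test, predictions, average=None))
print(f1_score(y_test, predictions, average=None))
```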

And let’s print the accuracy score:

print(accuracy_score(y_test, predictions))
# 0.9666666666666667

External Resources

https://towardsdatascience.com/logit-of-logistic-regression-understanding-the-fundamentals-f384152a33d1

Footnotes

  1. https://www.scikit-yb.org/en/latest/api/classifier/classification_report.html