A gentle introduction in plain English
Logistic Regression is very similar to Linear Regression except that Logistic Regression predicts whether something is true or false instead of predicting how big (or how much) something is. In other words: Where Linear Regression predicts the outcome of a numerical variable, Logistic Regression predicts the outcome of a binary categorical variable. The ability of Logistic Regression to provide probabilities and classify new samples makes it a popular machine learning method.
Remember that in Linear Regression, a straight line is fit to the data. However, in Logistic Regression, an “S”-shaped logistic function is used. The curve goes from 0 to 1, which means – in our example – that the curve tells you the probability of passing the (fictional) Flatiron Data Science exam based on the hours of daily studying:
Although Logistic Regression tells you the probability of someone passing an exam or not, like in our example, it’s usually used for classification. For example, if the probability of a transaction being fraudulent is greater than 50%, we’ll classify it as fraudulent, otherwise as genuine.
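The S-curve and the 50% cutoff can be sketched in a few lines of Python. Note that the intercept and slope below are made-up numbers for the fictional exam example, not values fitted to real data:

```python
import math

def sigmoid(x):
    """The logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_passing(hours_of_study, intercept=-4.0, slope=1.5):
    """Probability of passing the exam given daily hours of study.
    (The intercept and slope are made-up example values.)"""
    return sigmoid(intercept + slope * hours_of_study)

def classify(hours_of_study, threshold=0.5):
    """Classify as 'pass' if the predicted probability exceeds 50%."""
    return "pass" if prob_passing(hours_of_study) > threshold else "fail"

print(prob_passing(1))  # low probability for little study
print(prob_passing(5))  # high probability for lots of study
print(classify(5))
```

With these example coefficients, one hour of study gives a low probability and five hours a high one, so the 50% threshold classifies them as "fail" and "pass" respectively.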
Just like with Linear Regression, we can create simple models such as:
Passing the exam is predicted by the hours of study daily.
… or more complicated models such as:
Passing the exam is predicted by the hours of study + number of cups of coffee.
… or even more complicated models such as:
Passing the exam is predicted by the hours of study + number of cups of coffee + hair color + hours of sleep + astrological sign.
Logistic Regression can work with continuous predictors (e.g. hours of studying, cups of coffee, hours of sleep) as well as discrete predictors (e.g. hair color, astrological sign).
We can test if each variable is useful for predicting the outcome of the exam. However, unlike Linear Regression, we can’t simply compare a complicated model to the simple model. Instead, we just test if a variable’s effect on the prediction is significantly different from 0. If not, this variable is not helpful for our prediction. (More on this in a later post.)
In our case, the astrological sign and the hair color are completely useless – of course ;-). We can save time and space in our study by leaving them out.
As said above, a big difference between Logistic and Linear Regression is how the line is fit to the data. With Linear Regression, we fit the line using “Least Squares”, meaning we find the line that minimizes the sum of squares of the residuals. We also calculate R² to compare simple models to complicated ones. Logistic Regression doesn’t have the same concept of a “residual”, so it can use neither “Least Squares” nor R². Instead, it uses Maximum Likelihood.
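To make the contrast concrete, here is a small sketch (with made-up numbers) of the two fitting criteria: the residual sum of squares and R² that Linear Regression works with, and the log-likelihood that Logistic Regression uses to score a candidate curve. The coefficients of the candidate curve are assumptions for illustration, not fitted values:

```python
import numpy as np

# --- Linear Regression: minimize the sum of squared residuals (made-up data) ---
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
rss = np.sum(residuals ** 2)                       # "Least Squares" minimizes this
r_squared = 1 - rss / np.sum((y - y.mean()) ** 2)  # used to compare models

# --- Logistic Regression: no residuals in the same sense, so we score a ---
# --- candidate curve by the likelihood of the observed pass/fail outcomes ---
hours = np.array([1.0, 2.0, 4.0, 5.0])
passed = np.array([0, 0, 1, 1])                    # made-up exam outcomes
p = 1 / (1 + np.exp(-(-4.0 + 1.5 * hours)))        # candidate curve (assumed coefficients)
log_likelihood = np.sum(passed * np.log(p) + (1 - passed) * np.log(1 - p))
```

Maximum Likelihood searches for the curve that makes `log_likelihood` as large as possible, instead of making `rss` as small as possible.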
Here we use the observed status of students (whether they passed or failed) to calculate the likelihood of them passing – given the shape of the curve. We’ll start by calculating the likelihood of the hardworking students passing – given the shape of the curve – and then that of the less hardworking students. The likelihood is always the value on the y-axis where the data point intersects the curve.
In other words, this is the simple secret: The likelihood of a student passing the exam is the same as the predicted probability. In this case, the probability is not calculated as the area under the curve, but is instead the y-axis value. That’s why it’s the same as the likelihood. Lastly, we multiply all these likelihoods together, and that’s the overall likelihood of the data given this line.
Important: Although it is possible to calculate the overall likelihood as the product of the individual likelihoods, statisticians prefer to calculate the log of the likelihood instead. Either way works because the line that maximizes the likelihood is the same one that maximizes the log of the likelihood.
Then, with the “log-likelihood”, we add the logs of the individual likelihoods instead of multiplying them.
Likelihood of data GIVEN this line = 0.07 x 0.35 x 0.97 x …
log(Likelihood of data GIVEN this line) = log(0.07) + log(0.35) + log(0.97) + …
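We can check this equivalence directly, using the example likelihoods above (padded with a couple of made-up values to stand in for the rest of the class):

```python
import math

# The first three likelihoods come from the example above;
# 0.88 and 0.92 are made-up values for the rest of the class.
likelihoods = [0.07, 0.35, 0.97, 0.88, 0.92]

product = math.prod(likelihoods)                    # multiply the likelihoods ...
log_sum = sum(math.log(p) for p in likelihoods)     # ... or add their logs

print(product)
print(log_sum)
```

Since log(a × b) = log(a) + log(b), the log of the product equals the sum of the logs, so the line that maximizes one also maximizes the other.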
Then we shift the line and calculate a new likelihood of the data, then shift the line and calculate the likelihood again and again …
We will ultimately get a line that maximizes the likelihood. That’s the one chosen as having the best fit:
The algorithm that finds the line with the maximum likelihood is pretty smart: each time it shifts the line, it does so in the direction that increases the likelihood (equivalently, that decreases the negative log-likelihood). Thus, the algorithm can find the optimal fit after relatively few iterations. This smart algorithm is called Gradient Descent. (And yes, this would be a great topic for yet another gentle introduction in plain English..)
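As a rough sketch (not the exact algorithm of any particular library), here is gradient descent climbing the log-likelihood for the exam example. The study-hours data and the learning rate are made up for illustration:

```python
import numpy as np

# Made-up data: daily study hours and whether each student passed (1) or failed (0).
hours  = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0])
passed = np.array([0,   0,   0,   1,   0,   1,   1,   1  ])

intercept, slope = 0.0, 0.0   # start with a flat curve at 50%
learning_rate = 0.1

for _ in range(5000):
    # Current curve: predicted probability of passing for each student.
    p = 1 / (1 + np.exp(-(intercept + slope * hours)))
    # Gradient of the (mean) log-likelihood with respect to each parameter.
    error = passed - p
    # Step uphill in likelihood (= downhill in negative log-likelihood).
    intercept += learning_rate * error.mean()
    slope     += learning_rate * (error * hours).mean()

# After fitting, the curve should assign a low probability of passing
# to light studiers and a high probability to heavy studiers.
print(intercept, slope)
```

Each pass through the loop nudges the curve in whichever direction increases the likelihood of the observed passes and fails, which is exactly the "shift, re-score, repeat" process described above.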
. . . . . . . . . . . . . .
I hope you enjoyed reading this article, and I’m always happy to get critical and friendly feedback, or suggestions for improvement!