Logistic Regression — Idea and Application

Tanveer Hurra
Towards Data Science



This article will try to:

  • Discuss the idea behind logistic regression
  • Explain it further through an example

What you are already supposed to know

You are given the problem of predicting the gender of a person based on his/her height. To start with, you are provided the data of 10 people whose height and gender are known. You are asked to fit a mathematical model to this data in a way that will enable you to predict the gender of some other person whose height is known but whose gender is not. This type of problem comes under the classification domain of supervised machine learning. If a problem demands that you classify observations into categories, like True/False, or rich/middle class/poor, or Failure/Success, then you are dealing with a classification problem. Its counterpart in machine learning is the regression problem, where we are asked to predict a continuous value, like marks = 33.4% or weight = 60 kg. This article will discuss a classification algorithm called Logistic Regression.

Although there are a lot of classification algorithms out there, varying in complexity, like Linear Discriminant Analysis, Decision Trees, Random Forest, etc., Logistic Regression is the most basic one and is a perfect starting point for learning about classification models. Let’s jump to the above-stated problem and suppose we are given the data of ten people as shown below:
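Height (cm) | 132 | 134 | 133 | 139 | 145 | 144 | 165 | 160 | 155 | 140
Gender      |  F  |  M  |  F  |  F  |  M  |  M  |  M  |  M  |  M  |  F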

This problem is quite different from other mathematical prediction problems. The reason is that, on one hand, we have continuous values of height, but on the other hand, we have categorical values of gender. Our mathematical operations know how to deal with numbers, but dealing with categorical values poses a challenge. To overcome this challenge in classification problems, whether they are solved through logistic regression or some other algorithm, we always calculate the probability value associated with a class. In the given context, we will calculate the probability associated with either the male class or the female class. The probability of the other class need not be calculated explicitly; it can be obtained by subtracting the previously calculated probability from one.

In the given data set, we have height as the independent variable and gender as the dependent variable. For the time being, if we assume it to be a regression problem, it would be solved by calculating the parameters of the regression model given below:
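gender = Bo + B1*height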

In short, we would have calculated Bo and B1, and the problem would be solved. The classification problem cannot be solved in this manner. As stated, we cannot calculate the value of gender itself, only the probability associated with a particular gender class. In logistic regression, we take inspiration from linear regression and use the linear model above to calculate that probability. We just need a function that takes the above linear model as input and gives us a probability value as output. In mathematical form, we should have something like this:
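P(gender=male) = f(Bo + B1*height)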

The above model calculates the probability of the male class, but we could use either of the two classes here. The function f on the right side of the equation should satisfy the condition that it can take any real number as input but gives an output only in the range of 0 to 1, since the output must represent a probability. This condition is satisfied by a function called the Sigmoid, or logistic, function, shown below:
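f(x) = 1 / (1 + e^(-x))

A quick way to get a feel for this function is to evaluate it in NumPy and see how it squashes any real number into a value between 0 and 1 (a minimal sketch; the helper name sigmoid is just for illustration):

import numpy as np

def sigmoid(x):
    # logistic (Sigmoid) function: maps any real number into a value between 0 and 1
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# ≈ [0.00005  0.26894  0.5  0.73106  0.99995]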

The Sigmoid function has a domain of -inf to inf and a range of 0 to 1, which makes it perfect for probability calculation in logistic regression. If we plug the linear model into the Sigmoid function, we get the following:
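P(gender=male) = 1 / (1 + e^(-(Bo + B1*height)))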

The above equation can easily be rearranged into a simpler, more interpretable form, shown below:
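ln( P(gender=male) / (1 - P(gender=male)) ) = Bo + B1*height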

The right-hand side of the equation is exactly what we have in the linear regression model, and the left-hand side is the log of the odds, also called the logit. So the above equation can also be written as:

logit(gender=male) = Bo + B1*height

This is the idea behind logistic regression. Now let’s solve the problem given to us to see its application.

We will use Python code to train our model on the given data. Let’s first import the necessary modules: we need NumPy and the LogisticRegression class from sklearn.

import numpy as np
from sklearn.linear_model import LogisticRegression

Now that the modules are imported, we need to create an instance of the LogisticRegression class.

lg = LogisticRegression(solver='lbfgs')

The solver used is lbfgs. It’s now time to create the data set that we will use to train the model.

height = np.array([[132,134,133,139,145,144,165,160,155,140]])
gender = np.array([1,0,1,1,0,0,0,0,0,1])

Note that sklearn can only handle numerical values, so here we are representing the female class with 1 and the male class with 0. Using the above arrays, let’s train the model:

lg.fit(height.reshape(-1,1),gender.ravel())

Once the model is trained, the call returns the fitted estimator as confirmation. So now we have a trained model; let's check its parameters, the intercept (Bo) and the slope (B1).

lg.coef_
lg.intercept_

Running the above lines will show you the intercept value of 35.212 and the slope value of -0.252. Hence our trained model can be written as:
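logit(gender=female) = 35.212 - 0.252*height

Note that, because we coded female as 1, sklearn treats female as the positive class, so this fitted logit refers to the probability of the female class.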

We can use the above equation to predict the gender of any person given his/her height, or we can use the trained model directly, as shown below, to predict the gender of a person with, say, height = 140 cm:

lg.predict(np.array([[140]]))

Give the above lines of code a try and you will get the idea. Note that the model actually works with the probability value associated with a given class, and it is up to us to decide the threshold on that probability. The default is 0.5, i.e. an observation whose predicted probability of the male class is above 0.5 is classified as male, and one whose probability is below 0.5 is classified as female. Also, the separation boundary in logistic regression is linear, which can easily be confirmed graphically.
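To see those probability values directly rather than just the final class label, you can call the fitted model's predict_proba method (a quick check using the model trained above):

lg.predict_proba(np.array([[140]]))
# returns [[P(class 0), P(class 1)]], i.e. [[P(male), P(female)]], for a height of 140 cm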

Further Reading

Linear Discriminant Analysis

Decision Trees

That is all about Logistic Regression. For any queries regarding the article, you can reach me on LinkedIn.

Thanks,

Have a nice time 😊

Originally published at https://www.wildregressor.com on April 20, 2020.

