- February 18, 2020 – posted by @oluwole
**Logistic Regression** is a type of regression analysis used when the target (dependent) variable is categorical. It could be binary, multinomial, or ordinal. Binary logistic regression involves a target variable with only two possible outcomes. Multinomial and ordinal regression are similar in that their target variables involve at least three possible outcomes; the difference between the two is that the outcomes of the latter are ordered while those of the former are not.

In this post, we will see how to perform logistic regression analysis using Python in the JupyterLab environment. Let us begin with an introduction to our dataset.
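At its core, logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function, which squashes any real number into a probability between 0 and 1. A minimal sketch of that idea:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# A few linear-predictor values and their implied probabilities;
# the midpoint z = 0 corresponds to a probability of exactly 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```

Values of the linear predictor far below zero map to probabilities near 0, and values far above zero map to probabilities near 1, which is what makes the function suitable for binary classification.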

## The Datasets

The datasets are related to the admission status of students, given their SAT scores and gender. The classification goal is to predict whether a student will be given admission (Yes or No) based on their SAT scores and gender (Male or Female).

The **Train Dataset** is stored in a CSV file titled ‘Binary predictors.csv’. Our model's accuracy will also be tested with a **Test Dataset** titled ‘Test dataset.csv’. Both datasets can be downloaded here.

Let's begin!

### Importing the relevant libraries

The Statsmodels library is used to implement the logistic regression method while the other libraries are imported for data manipulation and visualisation purposes. The libraries are imported thus:

```python
# Importing relevant libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
```

### Loading the dataset

The dataset is read into the variable `raw_data` in this manner:

```python
# Loading the dataset
raw_data = pd.read_csv('Binary predictors.csv')

# Previewing the raw data
raw_data.head()
```

The shape of the data reveals that there are 168 entries and 3 fields.

```python
# Shape of the data
print(raw_data.shape)
```

The variables of the dataset are defined as follows:

**Input variables:**

- SAT – SAT score (numeric)
- Gender – Gender (Male or Female)

**Output variable:**

- Admitted – Admission status (Yes or No)

### Making a copy of the raw data

I always advise making a copy of the train dataset at the start of a project and at different checkpoints along the way. This ensures that if the need ever arises – say you make a mistake – you have the train dataset for reference without having to re-run cells or re-read files.

This is done as follows:

```python
# Making a copy of the raw data
data = raw_data.copy()
data.head()
```

### Data Exploration

Let's dig into our dataset to obtain some more insights.

Using the pandas `describe()` method, the mean SAT score is obtained as *1695.273810*, and the other summary statistics are displayed alongside it:

```python
# The mean SAT score and other summary statistics
data['SAT'].describe()
```

The SAT score distribution can be visualized using a histogram plot:

```python
# Histogram of SAT scores
data.SAT.hist()
plt.title('Histogram of SAT Scores')
plt.ylabel('Frequency')
plt.xlabel('SAT Score')
```

For the number and percentage of admitted students:

```python
# Number of admitted students
admitted = len(data[data['Admitted'] == 'Yes'])
not_admitted = len(data[data['Admitted'] == 'No'])
print('The numbers of Admitted and Not Admitted students are {} and {} respectively'.format(admitted, not_admitted))

# Percentage of admitted students
pct_admitted = (admitted / (admitted + not_admitted)) * 100
print('The percentage of Admitted students is', pct_admitted)
print('The percentage of Not Admitted students is', 100 - pct_admitted)
```

The above distribution can be visualized as shown below. Note that `plt.savefig()` must be called before `plt.show()`, otherwise an empty figure is saved:

```python
# Visualizing the number of admitted students
sns.countplot(x='Admitted', data=data)
plt.savefig('Admitted_fig')
plt.show()
```

For the Gender category, it is seen below that there are more males than females in the sample population:

```python
# Gender distribution
male = len(data[data['Gender'] == 'Male'])
female = len(data[data['Gender'] == 'Female'])
print('The numbers of Male and Female students are {} and {} respectively'.format(male, female))

# Percentage gender distribution
pct_male = (male / (male + female)) * 100
print('The percentage of Male students is', pct_male)
print('The percentage of Female students is', 100 - pct_male)
```

### Data Transformation

Our dataset has two fields with categorical data – **Gender** and **Admitted**. In order to implement our logistic regression, these values need to be transformed into dummy variables. We do this by mapping:

```python
# Dummy variables
df = data.copy()
df['Admitted'] = df['Admitted'].map({'Yes': 1, 'No': 0})
df['Gender'] = df['Gender'].map({'Female': 1, 'Male': 0})
df.head()
```

For the Gender field, we assign **1** to **Females** and **0** to **Males**, while for the Admitted field we assign **1** to **Admitted** students and **0** to **Not Admitted** students.

Next, we define our dependent and independent variables:

```python
y = df['Admitted']
x1 = df[['SAT', 'Gender']]
```

## Logistic Regression

First, we add a constant column of 1s for the intercept:

```python
x = sm.add_constant(x1)
x.shape
x.head()
```

Then we model the data using `sm.Logit()`. The model is fitted and summarized as follows:

```python
reg_log = sm.Logit(y, x)
results_log = reg_log.fit()
results_log.summary()
```

The **p-values** of each term are less than **0.05**, so we can conclude that each model term is relevant.

## Model Accuracy

Having developed the model, we will proceed to test it and assess its accuracy using the **Test Dataset**. First, let's see how well our model did with the **Train Dataset** by comparing the predicted values with the actual values.

### Model accuracy using Train Dataset

We obtain the predicted admission status of the sample population as follows:

```python
# Predicted admission status
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
results_log.predict()
```

The actual admission status of the population is obtained as follows:

```python
# Actual admission status
np.array(df['Admitted'])
```

Then we obtain a confusion matrix using the statsmodels `pred_table()` method:

```python
# Confusion matrix
results_log.pred_table()
```

A confusion matrix is a table that measures the performance of a classification model. The elements in the leading diagonal (i.e. top left to bottom right) represent the number of correctly predicted values.
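As a concrete illustration of that calculation, here is a small sketch using this model's train-set counts, with accuracy computed as the leading-diagonal sum divided by the total:

```python
import numpy as np

# Train-set confusion matrix: rows are actual, columns are predicted
cm = np.array([[69., 5.],
               [4., 90.]])

# Accuracy = correctly predicted / total observations
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(round(100 * accuracy, 3))  # → 94.643
```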

We can present the matrix more readably as follows:

```python
# Formatting the confusion matrix
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0', 1: 'Actual 1'})
cm_df
```

The interpretation of this result reveals that our model –

- Correctly predicted that 90 students were given admission
- Correctly predicted that 69 students were denied admission
- Wrongly predicted that 5 students were given admission
- Wrongly predicted that 4 students were denied admission

The accuracy of the model is obtained as follows:

```python
# Model accuracy
cm = np.array(cm_df)
accuracy_train = 100 * (cm[0, 0] + cm[1, 1]) / cm.sum()
print('The model accuracy is {:.5}'.format(accuracy_train))
```

Using the Train Dataset, our model gave a high accuracy of *94.643%*. We will now assess the model's accuracy using the Test Dataset.

### Model accuracy using Test Dataset

Again, we read the dataset and map dummy variables to the categorical fields as follows:

```python
# Read the dataset
test = pd.read_csv('Test dataset.csv')

# Map the test data
test['Admitted'] = test['Admitted'].map({'Yes': 1, 'No': 0})
test['Gender'] = test['Gender'].map({'Female': 1, 'Male': 0})
test.head()
```

Next, we obtain our dependent and independent variable datasets as follows:

```python
# Getting the variables
# Dependent variable
test_actual = test['Admitted']

# Independent variables
test_data = test.drop(['Admitted'], axis=1)
test_data = sm.add_constant(test_data)
test_data.head()
```

We will now create a function called `confusion_matrix` that takes three arguments – the independent variables, the dependent variable, and the model to be used. The function returns two outputs – the confusion matrix and the classification accuracy. This is done as follows:

```python
def confusion_matrix(data, actual_values, model):
    # Confusion matrix
    #
    # Parameters
    # data: data frame or array
    #     a data frame containing only the input data;
    #     the column order is important, e.g. const, var1, var2, etc.
    # actual_values: data frame or array
    #     the actual values from the test data, i.e. test_actual;
    #     a single column of 0s and 1s, since this is logistic regression
    # model: a LogitResults object
    #     the variable holding the fitted model, i.e. results_log
    # ----------

    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: predictions between 0 and 0.5 are counted as 0,
    # and predictions between 0.5 and 1 are counted as 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
```

Now, we create a confusion matrix for the test data using the function we just defined:

```python
# Create a confusion matrix with the test data
cm = confusion_matrix(test_data, test_actual, results_log)
cm
```

The value *0.8947* represents the model accuracy.

The confusion matrix can be formatted as follows:

```python
# Formatting the confusion matrix
cm_df = pd.DataFrame(cm[0])
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0', 1: 'Actual 1'})
cm_df
```

The interpretation of this result reveals that for our test dataset with a total of 19 entries (12 + 5 + 1 + 1), our model –

- Correctly predicted that 12 students were given admission
- Correctly predicted that 5 students were denied admission
- Wrongly predicted that 1 student was given admission
- Wrongly predicted that 1 student was denied admission

The model accuracy and the misclassification rate sum to 1, or 100%. The misclassification rate is obtained as follows:

```python
# The accuracy and misclassification rates
accuracy = cm[1]
m_rate = 1 - cm[1]
print('The model accuracy using the test data is {:.3%}'.format(accuracy))
print('The misclassification rate is {:.4} or {:.3%}'.format(m_rate, m_rate))
```

## Conclusion

We have seen how to handle datasets with categorical dependent and independent variables. Logistic regression employs dummy variables by transforming categorical values into binaries (1s and 0s) so that they can be included in the model. We also saw how to create a confusion matrix and obtain insights into our model's performance. Our model had an accuracy of about 95% and 89% with the train and test datasets respectively, indicating that the model is a good fit for the data.
