Logistic regression is a type of regression analysis used when the target (dependent) variable is categorical. The target could be binary, multinomial, or ordinal. Binary logistic regression involves a target variable with only two possible outcomes. Multinomial and ordinal logistic regression are similar in that their target variables involve at least three possible outcomes; the difference is that the outcomes of an ordinal target are ordered, while those of a multinomial target are not.
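Under the hood, binary logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function, which squashes any real number into a probability between 0 and 1. Here is a minimal sketch of that idea – the coefficient values below are hypothetical and chosen purely for illustration, not fitted values:

```python
import numpy as np

def sigmoid(z):
    # The logistic function maps any real number to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical intercept and SAT-score coefficient, for illustration only
b0, b1 = -69.0, 0.042
probability = sigmoid(b0 + b1 * 1700)  # interpreted as P(admitted = 1 | SAT = 1700)
print(probability)
```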
In this post, we will see how to perform logistic regression analysis using Python in the JupyterLab environment. Let us begin with an introduction to our datasets.
The Datasets
The datasets are related to the admission status of students, given their SAT scores and gender. The classification goal is to predict whether a student will be given admission (Yes or No) based on their SAT scores and gender (Male or Female). The Train Dataset is stored in a CSV file titled ‘Binary predictors.csv’. Our model accuracy will also be tested with a Test Dataset titled ‘Test dataset.csv’. Both datasets can be downloaded here.
Let’s begin!
Importing the relevant libraries
The Statsmodels library is used to implement the logistic regression method, while the other libraries are imported for data manipulation and visualization purposes. The libraries are imported thus:
```python
# Importing Relevant Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
```
Loading the dataset
The dataset is read into the variable `raw_data` in this manner:

```python
# Loading the Dataset
raw_data = pd.read_csv('Binary predictors.csv')
# Previewing raw data
raw_data.head()
```

The shape of the data reveals that there are 168 entries and 3 fields.
```python
# Shape of data
print(raw_data.shape)
```

The variables of the dataset are defined as follows:
Input variables:
- SAT – SAT score (numeric)
- Gender – Gender (Male or Female)
Output variable:
- Admitted – Admission status (Yes or No)
Making a copy of the raw data
I always advise making a copy of the train dataset at the start, and again at different checkpoints, when handling a project. This ensures that if the need ever arises – say you make a mistake – you have the original train dataset for reference without having to re-run cells or re-read files.
This is done thus:
```python
# Making a Copy of the raw data
data = raw_data.copy()
data.head()
```
Data Exploration
Let's dig into our dataset to obtain some more insights.
Using the pandas `describe()` method, the mean SAT score is obtained as 1695.273810, and the other summary statistics are displayed alongside it:

```python
# The mean SAT score
data['SAT'].describe()
```

The SAT score distribution can be visualized using a histogram plot:
```python
# Histogram of SAT scores
data.SAT.hist()
plt.title('Histogram of SAT Scores')
plt.ylabel('Frequency')
plt.xlabel('SAT Score')
```

For the number and percentage of admitted students:
```python
# Number of Admitted Students
admitted = len(data[data['Admitted'] == 'Yes'])
not_admitted = len(data[data['Admitted'] == 'No'])
print('The numbers of Admitted and Not Admitted students are {} and {} respectively'.format(admitted, not_admitted))
# Percentage of Admitted Students
pct_admitted = (admitted / (admitted + not_admitted)) * 100
print('The percentage of Admitted Students is', pct_admitted)
print('The percentage of Not Admitted Students is', 100 - pct_admitted)
```

The above distribution can be visualized as shown below.
```python
# Visualizing the Number of Admitted Students
sns.countplot(x='Admitted', data=data)
plt.savefig('Admitted_fig')  # save before plt.show(), which clears the figure
plt.show()
```

For the Gender category, it is seen below that there are more males than females in the sample population.
```python
# Gender distribution
male = len(data[data['Gender'] == 'Male'])
female = len(data[data['Gender'] == 'Female'])
print('The numbers of Male and Female students are {} and {} respectively'.format(male, female))  # fixed: previously printed the admission counts
# Percentage Gender Distribution
pct_male = (male / (male + female)) * 100
print('The percentage of Male Students is', pct_male)
print('The percentage of Female Students is', 100 - pct_male)
```
Data Transformation
Our dataset has two fields with categorical data – Gender and Admitted. In order to implement our logistic regression, these data will need to be transformed into dummy variables.
We do this by mapping:
```python
# Dummy variables
df = data.copy()
df['Admitted'] = df['Admitted'].map({'Yes': 1, 'No': 0})
df['Gender'] = df['Gender'].map({'Female': 1, 'Male': 0})
df.head()
```

For the Gender field, we assign 1 to females and 0 to males, while for the Admitted field we assign 1 to admitted students and 0 to those who were not admitted.
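As an aside, pandas also ships a built-in helper for this step. Here is a minimal sketch using `pd.get_dummies()`; note that its auto-generated column (e.g. `Gender_Male`) encodes males as 1, the opposite of the manual mapping above:

```python
# Alternative: let pandas generate the dummy columns automatically.
# drop_first=True keeps one column per category to avoid redundancy;
# dtype=int yields 0/1 integers rather than booleans.
df_alt = pd.get_dummies(data.copy(), columns=['Gender', 'Admitted'],
                        drop_first=True, dtype=int)
df_alt.head()
```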
Next we define our dependent and independent variables:
```python
y = df['Admitted']
x1 = df[['SAT', 'Gender']]
```
Logistic Regression
First we add a constant column of 1s for the intercept.
```python
# Adding the intercept constant
x = sm.add_constant(x1)
x.shape
x.head()
```

Then we model the data using `sm.Logit()`. This is then fitted and summarized thus:

```python
reg_log = sm.Logit(y, x)
results_log = reg_log.fit()
results_log.summary()
```

The p-value of each term is less than 0.05, so we can conclude that each model term is statistically significant.
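To pull those p-values out programmatically rather than reading them off the summary table, the fitted results object exposes a `pvalues` attribute (a standard statsmodels attribute):

```python
# P-values for the intercept and each regressor
print(results_log.pvalues)
# Boolean flags for significance at the 5% level
print(results_log.pvalues < 0.05)
```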
Model Accuracy
Having developed the model, we will proceed to assess its accuracy using the Test Dataset. First, let's see how well our model did with the Train Dataset by comparing the predicted values with the actual values.
Model accuracy using Train Dataset
We obtain the predicted admission status of the sample population as follows:

```python
# Predicted Admission Status
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
results_log.predict()
```

The actual admission status of the population is obtained as follows:
```python
# Actual Admission Status
np.array(df['Admitted'])
```

Then we obtain a confusion matrix using the statsmodels `pred_table()` method:

```python
# Confusion matrix
results_log.pred_table()
```

A confusion matrix is a table that summarizes the performance of a classification model. The elements on the leading diagonal (i.e. top left to bottom right) represent the number of correctly predicted values.
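It is worth noting that `pred_table()` converts predicted probabilities to 0/1 classes using a cut-off, which defaults to 0.5 in statsmodels; making that explicit:

```python
# Equivalent to the call above: predicted probabilities above the
# threshold are counted as class 1, the rest as class 0
results_log.pred_table(threshold=0.5)
```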
We can present the matrix more elaborately thus:
```python
# Formatting confusion matrix
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0', 1: 'Actual 1'})
cm_df
```

The interpretation of this result reveals that our model (each claim maps to one cell of the matrix, as the sketch after this list shows):
- Correctly predicted that 90 students were given admission
- Correctly predicted that 69 students were denied admission
- Wrongly predicted that 5 students were given admission
- Wrongly predicted that 4 students were denied admission
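The sketch below makes the mapping between the four counts above and the cells of `pred_table()` explicit, using statsmodels' convention that rows are actual values and columns are predicted values:

```python
# Unpacking the confusion matrix cells into named counts
cm = results_log.pred_table()
tn, fp = cm[0, 0], cm[0, 1]  # actual 0: correctly denied, wrongly admitted
fn, tp = cm[1, 0], cm[1, 1]  # actual 1: wrongly denied, correctly admitted
print('Correctly admitted: {:.0f}, correctly denied: {:.0f}'.format(tp, tn))
print('Wrongly admitted: {:.0f}, wrongly denied: {:.0f}'.format(fp, fn))
```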
The accuracy of the model is obtained as follows:
```python
# Model Accuracy
cm = np.array(cm_df)
accuracy_train = 100 * (cm[0, 0] + cm[1, 1]) / cm.sum()
print('The model accuracy is {:.5}%'.format(accuracy_train))
```

Using the Train Dataset, our model gave a high accuracy of 94.643%. We will now assess the model accuracy using the Test Dataset.
Model accuracy using Test Dataset
Again, we read the dataset and map dummy variables onto the categorical fields as follows:
```python
# Read the dataset
test = pd.read_csv('Test dataset.csv')
# Map the test data
test['Admitted'] = test['Admitted'].map({'Yes': 1, 'No': 0})
test['Gender'] = test['Gender'].map({'Female': 1, 'Male': 0})
test.head()
```

Next, we obtain our dependent and independent variable datasets as follows:
```python
# Getting the variables
# Get dependent variable
test_actual = test['Admitted']
# Get independent variables
test_data = test.drop(['Admitted'], axis=1)
test_data = sm.add_constant(test_data)
test_data.head()
```

We will now create a function called `confusion_matrix` that takes in three arguments – the independent variables, the dependent variable, and the model to be used. The function returns two outputs – the confusion matrix and the classification accuracy. This is done as follows:
```python
def confusion_matrix(data, actual_values, model):
    # Confusion matrix
    # Parameters
    # data: data frame or array
    #     a data frame containing only the input data;
    #     the order is important, e.g. const, var1, var2, etc.
    # actual_values: data frame or array
    #     the actual values from the test data, i.e. test_actual;
    #     a single column of 0s and 1s because this is logistic regression
    # model: a LogitResults object
    #     the variable holding the fitted model, i.e. results_log
    # ----------
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2-D histogram: predicted values between 0 and 0.5 are
    # counted as 0, and values between 0.5 and 1 are counted as 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
```

Now, we create a confusion matrix with the test data, using the function we just defined:
```python
# Create a confusion matrix with the test data
cm = confusion_matrix(test_data, test_actual, results_log)
cm
```

The value 0.8947 represents the model accuracy.
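Since the function returns a tuple, the two outputs can also be unpacked into named variables directly – a small stylistic alternative, not part of the original walkthrough:

```python
# Unpack the (matrix, accuracy) tuple returned by confusion_matrix
test_cm, test_accuracy = confusion_matrix(test_data, test_actual, results_log)
print('Test accuracy: {:.2%}'.format(test_accuracy))
```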
The confusion matrix can be formatted as follows:

```python
# Formatting Confusion matrix
cm_df = pd.DataFrame(cm[0])
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0', 1: 'Actual 1'})
cm_df
```

The interpretation of this result reveals that for our test dataset with a total of 19 entries, our model:
- Correctly predicted that 12 students were given admission
- Correctly predicted that 5 students were denied admission
- Wrongly predicted that 1 student was given admission
- Wrongly predicted that 1 student was denied admission
The model accuracy and the misclassification rate sum to 1, or 100%. The misclassification rate is obtained thus:
```python
# The accuracy and misclassification rates
accuracy = cm[1]
m_rate = 1 - cm[1]
print('The model accuracy using the test data is {:.3%}'.format(accuracy))
print('The misclassification rate is {:.4} or {:.3%}'.format(m_rate, m_rate))
```
Conclusion
We have seen how to handle datasets with categorical dependent and independent variables. Logistic regression accommodates categorical fields through dummy variables, transforming categorical values into binaries (1s and 0s) so that they can be included in the model. We also saw how to create a confusion matrix and obtain insights into our model's performance. Our model had an accuracy of about 95% and 89% on the train and test datasets respectively, indicating that it is a good fit for the data.