Logistic Regression With Python

Tagged: Data science, JupyterLab, Logistic Regression, Python, Regression Analysis

Posted by Oluwole (@oluwole) on February 18, 2020

Logistic regression is a type of regression analysis used when the target (dependent) variable is categorical. The target could be binary, multinomial, or ordinal. Binary logistic regression involves a target variable with only two possible outcomes. Multinomial and ordinal logistic regression are similar in that their target variables involve at least three possible outcomes; the difference is that ordinal outcomes are ordered while multinomial outcomes are not.
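
For orientation, here is a minimal sketch (an aside, not part of the original walkthrough) of the statsmodels entry points for each flavour. The dummy data below is purely illustrative, and OrderedModel assumes statsmodels 0.12 or later:

import numpy as np
import statsmodels.api as sm

# Purely illustrative dummy data: 100 observations, one predictor plus a constant
x = sm.add_constant(np.random.rand(100))
y_binary = np.random.randint(0, 2, 100)   # two outcomes -> binary
y_multi = np.random.randint(0, 3, 100)    # three unordered outcomes -> multinomial

binary_fit = sm.Logit(y_binary, x).fit()        # binary logistic regression
multinomial_fit = sm.MNLogit(y_multi, x).fit()  # multinomial logistic regression
# For ordered outcomes, statsmodels (0.12+) provides:
# from statsmodels.miscmodels.ordinal_model import OrderedModel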

    In this post, we will see how to perform logistic regression analysis using Python in the JupyterLab environment. Let us begin with an introduction to our dataset.

    The Datasets

    The datasets are related to the admission status of students, given their SAT scores and gender. The classification goal is to predict whether a student will be given admission (Yes or No) based on their SAT scores and gender (Male or Female). The Train Dataset is stored in a CSV file titled ‘Binary predictors.csv’. Our model accuracy will also be tested with a Test Dataset titled ‘Test dataset.csv’. Both datasets can be downloaded here.

    Let’s begin!

    Importing the relevant libraries

The Statsmodels library is used to implement the logistic regression method, while the other libraries are imported for data manipulation and visualization purposes. The libraries are imported thus:

    # Importing Relevant Libraries
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()

    Loading the dataset

    The dataset is read into the variable raw_data in this manner:

    # Loading the Dataset
    raw_data = pd.read_csv('Binary predictors.csv')
    # Previewing raw data
    raw_data.head()

    The shape of the data reveals that there are 168 entries and 3 fields.

    # Shape of data
    print(raw_data.shape)

    The variables of the dataset are thus defined:
    Input variables:

    • SAT – SAT score (numeric)
    • Gender – Gender (Male or Female)

    Output variable:

• Admitted – Admission status (Yes or No)

    Making a copy of the raw data

I always advise making a copy of the train dataset at the start, and at different checkpoints, when handling a project. This ensures that if the need ever arises (say you made a mistake) you still have the original train dataset for reference, without having to re-run cells or re-read files.

    This is done thus:

    # Making a Copy of the raw data
    data = raw_data.copy()
    data.head()

    Data Exploration

    Let’s pry into our data set to obtain some more insights.

Using the pandas describe() method, the mean SAT score is obtained as 1695.273810, and the other summary statistics are displayed alongside:

    #The mean SAT score
    data['SAT'].describe()

The SAT score distribution can be visualized using a histogram plot:

# Histogram of SAT scores
data.SAT.hist()
plt.title('Histogram of SAT Scores')
plt.ylabel('Frequency')
plt.xlabel('SAT Score')

To obtain the number and percentage of admitted students:

# Number of admitted students
admitted = len(data[data['Admitted']=='Yes'])
not_admitted = len(data[data['Admitted']=='No'])
print('The number of Admitted and Not Admitted students is {} and {} respectively'.format(admitted, not_admitted))
# Percentage of admitted students
pct_admitted = (admitted/(admitted + not_admitted))*100
print('The percentage of Admitted students is', pct_admitted)
print('The percentage of Not Admitted students is', 100-pct_admitted)

    The above distribution can be visualized as shown below.

# Visualizing the number of admitted students
sns.countplot(x='Admitted', data=data)
plt.savefig('Admitted_fig')  # save before plt.show(), which clears the current figure
plt.show()

    For the Gender category, it is seen below that there are more males than females in the sample population.

# Gender distribution
male = len(data[data['Gender']=='Male'])
female = len(data[data['Gender']=='Female'])
print('The number of Male and Female students is {} and {} respectively'.format(male, female))
# Percentage gender distribution
pct_male = (male/(male + female))*100
print('The percentage of Male students is', pct_male)
print('The percentage of Female students is', 100-pct_male)

    Data Transformation

    Our dataset has two fields with categorical data – Gender and Admitted. In order to implement our logistic regression, these data will need to be transformed into dummy variables.

We do this by mapping:

    # Dummy variables
    df = data.copy()
    df['Admitted']=df['Admitted'].map({'Yes': 1, 'No': 0})
    df['Gender']= df['Gender'].map({'Female': 1, 'Male': 0})
    df.head()

For the Gender field, we map Females to 1 and Males to 0, while for the Admitted field we map Admitted students to 1 and Not Admitted students to 0.

    Next we define our dependent and independent variables:

    y = df['Admitted']
    x1= df[['SAT','Gender']]

    Logistic Regression

    First we add a constant column of 1s for the intercept.

x = sm.add_constant(x1)
print(x.shape)
x.head()

    Then we model the data using sm.Logit(). This is then fitted and summarized thus:

    reg_log = sm.Logit(y,x)
    results_log = reg_log.fit()
    results_log.summary()

The p-value of each term is less than 0.05, so we can conclude that each model term is statistically significant.
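
If you want to pull those p-values out programmatically (a small aside, not in the original post), the fitted results object exposes them directly:

# p-values for the intercept and each predictor, as a pandas Series
print(results_log.pvalues)
# Flag the terms that are significant at the 5% level
print(results_log.pvalues < 0.05)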

    Model Accuracy

    Having developed the model, we will proceed to test the model and assess its accuracy using the Test Dataset. First, let’s see how well our model did with the Train Dataset by comparing the predicted values with the actual values.

    Model accuracy using Train Dataset

We obtain the predicted admission status of the sample population as follows:

    # Predicted Admission Status
    np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
    results_log.predict()

The actual admission status of the population is obtained as follows:

    # Actual Admission Status
    np.array(df['Admitted'])

Then we obtain a confusion matrix using the statsmodels pred_table() method:

    #Confusion matrix
    results_log.pred_table()

A confusion matrix is a table that measures the performance of a classification model. In the table returned by pred_table(), rows correspond to actual classes and columns to predicted classes; the elements in the leading diagonal (i.e. top left to bottom right) represent the number of correctly predicted values.

We can present the matrix more elaborately thus:

    # Formatting confusion matrix
    cm_df = pd.DataFrame(results_log.pred_table())
    cm_df.columns = ['Predicted 0', 'Predicted 1']
    cm_df = cm_df.rename(index={0: 'Actual 0',1: 'Actual 1'})
    cm_df

    The interpretation of this result reveals that our model –

    • Correctly predicted that 90 students were given admission
    • Correctly predicted that 69 students were denied admission
    • Wrongly predicted that 5 students were given admission
    • Wrongly predicted that 4 students were denied admission

The accuracy of the model is obtained as follows:

# Model accuracy
cm = np.array(cm_df)
accuracy_train = 100*(cm[0,0]+cm[1,1])/cm.sum()
print('The model accuracy is {:.5}%'.format(accuracy_train))

    Using the Train dataset, our model gave a high accuracy of 94.643%. We will now assess the model accuracy using the Test data set.

    Model accuracy using Test Dataset

Again, we read the dataset and map dummy variables to the categorical fields as follows:

# Read the test dataset
test = pd.read_csv('Test dataset.csv')
# Map the categorical fields to dummy variables
test['Admitted'] = test['Admitted'].map({'Yes': 1, 'No': 0})
test['Gender'] = test['Gender'].map({'Female': 1, 'Male': 0})
test.head()

Next, we obtain our dependent and independent variable datasets as follows:

    #Getting the variables
    # Get dependent variable
    test_actual = test['Admitted']
    # Get independent variable
    test_data = test.drop(['Admitted'], axis=1)
    test_data = sm.add_constant(test_data)
    test_data.head()

    We will now create a function called confusion_matrix that takes in three arguments – the independent variables, the dependent variable and the model to be used. The function returns two outputs – the confusion matrix and the classification accuracy.

This is done as follows:

def confusion_matrix(data, actual_values, model):
    """
    Confusion matrix

    Parameters
    ----------
    data : data frame or array
        A data frame containing only the input data;
        the column order is important, e.g. const, var1, var2, etc.
    actual_values : data frame or array
        The actual values from the test data, i.e. test_actual;
        a single column of 0s and 1s, because this is logistic regression
    model : a LogitResults object
        The variable holding the fitted model, i.e. results_log
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Build a 2D histogram: predicted values between 0 and 0.5 are
    # treated as 0, and values between 0.5 and 1 are treated as 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy from the leading diagonal
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy

Now, we create a confusion matrix for the test data using the function we just defined:

    # Create a confusion matrix with the test data
    cm = confusion_matrix(test_data,test_actual,results_log)
    cm

The second element of the returned tuple, 0.8947, represents the model accuracy.
The confusion matrix (the first element of the tuple) can be formatted as follows:

    # Formatting Confusion matrix
    cm_df = pd.DataFrame(cm[0])
    cm_df.columns = ['Predicted 0','Predicted 1']
    cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
    cm_df

The interpretation of this result reveals that for our test dataset with a total of 19 entries, our model –

    • Correctly predicted that 12 students were given admission
    • Correctly predicted that 5 students were denied admission
    • Wrongly predicted that 1 student was given admission
    • Wrongly predicted that 1 student was denied admission

The model accuracy and the misclassification rate sum to 1, or 100%. The misclassification rate is obtained thus:

# The accuracy and misclassification rates
accuracy = cm[1]
m_rate = 1 - cm[1]
print('The model accuracy using the test data is {:.3%}'.format(accuracy))
print('The misclassification rate is {:.4} or {:.3%}'.format(m_rate, m_rate))

    Conclusion

We have seen how to handle datasets with categorical dependent and independent variables. Logistic regression employs dummy variables, transforming categorical values into binaries (1s and 0s) so that they can be included in the model. We also saw how to create a confusion matrix and obtain insights into our model's performance. Our model had an accuracy of about 95% and 89% with the train and test datasets respectively, indicating that the model is a good fit for the data.
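
As a final aside (a minimal sketch, not part of the original walkthrough), the fitted model can also score a brand-new applicant. The row below is hypothetical and must follow the same column order as x, i.e. const, SAT, Gender:

# Hypothetical new applicant: constant = 1, SAT score = 1700, female (Gender = 1)
new_applicant = [[1, 1700, 1]]
# Predicted probability of admission according to the fitted model
print(results_log.predict(new_applicant))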

     
