Technical Notes Of
Ehi Kioya

Sklearn For Multiple Linear Regression (Part 2)

Tagged: JupyterLab, Linear Regression, Python, Sklearn

  • February 13, 2020 at 5:51 am #85722
    Oluwole
    Participant
    @oluwole

    Welcome to Part 2 of the Sklearn for Multiple Linear Regression series. In the first part of this series, we were able to accomplish the following:

    • Reading and describing the data
    • Making scatter plots of the dependent and independent variables using matplotlib
    • Fitting the dataset to a regression model
    • Obtaining descriptive statistics of the model

    We also observed that the Rand variable carried little weight in the model.

    Regression Analysis

    In this part, our objective is to improve the model. First, we will use feature selection to identify and remove any unnecessary variables. Then we will standardize the variables and obtain the regression statistics of the model. We will also make predictions and compare the results obtained before and after removing the unnecessary variable. We continue in the same notebook from Part 1.

    Feature Selection

    Feature selection is a data preprocessing technique that helps us identify unnecessary variables in our dataset. We will use f_regression, which runs a univariate linear regression test on each feature, to obtain the F-statistics and their p-values (Prob-F). This is achieved thus:

    # Import f_regression from the feature_selection module
    from sklearn.feature_selection import f_regression
    # Obtain the F-statistics and the p-values
    f_regression(x, y)

    The result is a tuple of two arrays: the first contains the F-statistic for each variable, while the second contains the corresponding p-values. Our primary concern is the p-values.
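As a minimal sketch of what f_regression returns, the snippet below runs it on synthetic data. The data, the seed, and the names sat, rand, gpa, f_stats and p_vals are assumptions made for illustration; they are not the dataset from Part 1.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (hypothetical, not the Part 1 dataset)
rng = np.random.default_rng(0)
n = 100
sat = rng.normal(1800, 100, n)               # informative feature
rand = rng.integers(1, 4, n).astype(float)   # pure noise feature
gpa = 0.0017 * sat + rng.normal(0, 0.05, n)  # target depends on sat only

X = np.column_stack([sat, rand])

# f_regression returns a tuple (F-statistics, p-values);
# each element is an array with one entry per feature
f_stats, p_vals = f_regression(X, gpa)

print(f_stats.shape, p_vals.shape)
print(p_vals.round(3))
```

Unpacking the tuple into two names is equivalent to indexing it with [0] and [1].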
    The p-values are extracted and rounded thus:

    # Obtain the p-values by indexing the second array
    p_values = f_regression(x, y)[1]
    # Round off the p-values to 3 decimal places
    p_values.round(3)

    It is observed that the p-value for SAT is approximately 0, while that of Rand is about 0.676. Typically, variables with p-values greater than 0.05 (i.e. p-value > 0.05) are considered statistically insignificant and are candidates for removal.

    Therefore, in our analysis, we will discard the Rand variable. Discarding an insignificant variable has little or no effect on the model, so there is no need to involve it in our analysis.
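The selection rule can be sketched as a boolean mask over the p-values. This is only an illustration: it assumes the conventional 0.05 significance level and reuses the rounded p-values quoted above; the names alpha, significant, kept and dropped are hypothetical.

```python
import numpy as np

# Rounded p-values reported above, in the same order as the features
features = np.array(['SAT', 'Rand'])
p_values = np.array([0.000, 0.676])

alpha = 0.05  # assumed significance level

# Boolean mask: True where a feature is significant at the chosen level
significant = p_values < alpha
kept = features[significant].tolist()
dropped = features[~significant].tolist()

print(kept)
print(dropped)
```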

    Create a Summary Table

    Let’s create a table that summarizes our observations so far, including the weights and the p-values:

    # Create a dataframe listing each feature with its coefficient and p-value
    # (reg is assumed to be the model fitted in Part 1)
    reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
    reg_summary['Coefficients'] = reg.coef_
    reg_summary['p-values'] = p_values.round(3)
    reg_summary

    Standardization

    Standardization is the process of transforming a variable by rescaling it to have a mean of zero and a standard deviation of one. It is a form of feature scaling and is sometimes loosely called normalization. We achieve this using the StandardScaler class.

    # Import the relevant class and create a scaler object
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()

    The fit method calculates the mean and standard deviation of each feature:

    # Fit calculates the mean and SD of each feature
    scaler.fit(x)

    The scaling mechanism is then applied to the data set thus:

    # This applies the scaling mechanism
    x_scaled = scaler.transform(x)
    x_scaled
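To see what fit and transform compute, the following sketch compares StandardScaler with the formula z = (x - mean) / SD applied by hand. The matrix x_demo and the names demo_scaler, scaled and manual are made up for illustration and are not the Part 1 data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative matrix (hypothetical values): two feature columns
x_demo = np.array([[1700.0, 2.0],
                   [1800.0, 1.0],
                   [1900.0, 3.0],
                   [1600.0, 2.0]])

demo_scaler = StandardScaler()
demo_scaler.fit(x_demo)                  # learns per-column mean and SD
scaled = demo_scaler.transform(x_demo)   # applies (x - mean) / SD

# Manual standardization with the population SD (ddof=0),
# which is the estimator StandardScaler uses
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(demo_scaler.mean_)
```

When the same data is fitted and transformed in one go, fit_transform(x_demo) is a shorthand for the two calls.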

    Regression with Scaled Features

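    Let’s fit the standardized data to a linear regression model. This is done thus:

    # LinearRegression was used in Part 1; re-import it if starting a fresh session
    from sklearn.linear_model import LinearRegression
    # Create regression with scaled features
    reg = LinearRegression()
    reg.fit(x_scaled, y)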

    The regression coefficients (weights) and the intercept are obtained as before:

    reg.coef_
    reg.intercept_

    We will also create a summary table to display the results obtained:

    reg_summary = pd.DataFrame([['Intercept'], ['SAT'], ['Rand']], columns=['Features'])
    # "Weights" is another term for coefficients
    reg_summary['Weights'] = [reg.intercept_, reg.coef_[0], reg.coef_[1]]
    reg_summary
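One way to read these weights: with standardized features, each weight is the change in the predicted value per one standard deviation of that feature, and the intercept equals the mean of the target. The sketch below checks this on synthetic data; the data, the seed, and the names X_demo, y_demo, X_std and model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Part 1 dataset)
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.normal(1800, 100, 50),
                          rng.integers(1, 4, 50).astype(float)])
y_demo = 0.0017 * X_demo[:, 0] + rng.normal(0, 0.1, 50)

X_std = StandardScaler().fit_transform(X_demo)
model = LinearRegression().fit(X_std, y_demo)

# Weights are directly comparable because both features share one scale;
# the intercept is the prediction at the feature means, i.e. the mean of y
print(model.coef_)
print(model.intercept_, y_demo.mean())
```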

    Making Predictions with Standardized Coefficients

    We will now use our model to predict the GPA scores of two students using the standardized coefficients. It is important to standardize the new data before applying the predict method; otherwise, the wrong results will be obtained. This is because the model was fitted on standardized independent variables, so any prediction it makes assumes that the input variables have been standardized in the same way.

    The new data is standardized thus:

    new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
    new_data_scaled = scaler.transform(new_data)
    new_data_scaled

    Then the predict() method is used to predict the GPA scores of the students:

    reg.predict(new_data_scaled)

    They are obtained as 3.09051403 and 3.26413803 respectively.
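As an optional alternative, sklearn's make_pipeline can chain the scaler and the regression so that new data is scaled automatically at prediction time. The sketch below uses synthetic data (an assumption, not the Part 1 dataset) to show that the pipeline matches the manual scale-then-predict route, and that skipping the scaling step gives different numbers. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (hypothetical)
rng = np.random.default_rng(2)
X_train = np.column_stack([rng.normal(1800, 100, 60),
                           rng.integers(1, 4, 60).astype(float)])
y_train = 0.0017 * X_train[:, 0] + rng.normal(0, 0.1, 60)
X_new = np.array([[1700.0, 2.0], [1800.0, 1.0]])

# Manual route: scale, fit, then scale the new data with the same scaler
sc = StandardScaler().fit(X_train)
manual_model = LinearRegression().fit(sc.transform(X_train), y_train)
manual_pred = manual_model.predict(sc.transform(X_new))

# Pipeline route: scaling is applied automatically on fit and predict
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

# Skipping the scaling step feeds raw values to a model fitted on z-scores
unscaled_pred = manual_model.predict(X_new)

print(manual_pred, pipe_pred, unscaled_pred)
```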

    Effect of the Insignificant Variable

    What if we removed the Rand variable, which we classified as insignificant? How much would the predicted GPA change?
    We create a new x data series containing only the SAT variable. The reshape method is called to turn the resulting one-dimensional array into the two-dimensional, single-column array that sklearn expects:

    # Creating new x data series
    x_simple_matrix = x_scaled[:, 0].reshape(-1, 1)
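A small sketch of what reshape(-1, 1) does to the array shape. The array x_demo below is made up for illustration; the -1 tells NumPy to infer the number of rows from the data.

```python
import numpy as np

# Hypothetical scaled matrix with two feature columns
x_demo = np.array([[0.5, -1.0],
                   [-0.5, 1.0],
                   [1.5, 0.0]])

first_col = x_demo[:, 0]               # 1-D array, shape (3,)
as_matrix = first_col.reshape(-1, 1)   # 2-D array, shape (3, 1)

print(first_col.shape)
print(as_matrix.shape)
```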

    The standardized variable is fitted to the regression model:

    # Fit a new model using only the standardized SAT variable
    reg_simple = LinearRegression()
    reg_simple.fit(x_simple_matrix, y)

    The prediction is made for the two students as before:

    reg_simple.predict(new_data_scaled[:,0].reshape(-1,1))

    The GPAs of the students are obtained as 3.08970998 and 3.25527879.

    Now we compare the predicted GPAs obtained before and after removing the unnecessary Rand variable:

    # Create a dataframe for the comparison and label the two scores to be
    # predicted as Student 1 and Student 2
    Comparison = pd.DataFrame([['Student 1'], ['Student 2']], columns=['Student'])
    # The GPAs before and after removing the Rand variable are added to the dataframe
    Comparison['Old GPA'] = reg.predict(new_data_scaled)
    Comparison['New GPA'] = reg_simple.predict(new_data_scaled[:, 0].reshape(-1, 1))
    # The difference between the GPAs is obtained for each student
    GPA_diff = Comparison['Old GPA'] - Comparison['New GPA']
    # This difference is converted to a percentage of the old GPA
    GPA_percent = 100 * (GPA_diff / Comparison['Old GPA'])
    Comparison['GPA difference(%)'] = GPA_percent
    Comparison

    We observe that the percentage difference is negligible (about 0.03% and 0.27%) for the two students. This is consistent with the Rand variable being insignificant.
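The two percentages can be reproduced directly from the predicted GPAs reported earlier in this post. The variable names old_gpa, new_gpa and percent_diff are chosen here for illustration.

```python
import numpy as np

# Predicted GPAs reported earlier in this post
old_gpa = np.array([3.09051403, 3.26413803])   # model with SAT and Rand
new_gpa = np.array([3.08970998, 3.25527879])   # model with SAT only

# Relative change, expressed as a percentage of the old prediction
percent_diff = 100 * (old_gpa - new_gpa) / old_gpa

print(percent_diff.round(2))
```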

    Conclusion

    We have successfully performed linear regression analysis on multivariable data using Python’s Sklearn library. In the process, we used feature selection to identify an unimportant variable and discarded it. We also saw how to standardize the variables and predict the value of an output variable given its input variables.

© 2021   ·   Ehi Kioya   ·   All Rights Reserved