Tagged: JupyterLab, Linear Regression, Python, Sklearn

- This topic has 0 replies, 1 voice, and was last updated 11 months ago by Oluwole.

- AuthorPosts
- February 13, 2020 at 5:51 am #85722 Participant @oluwole
Welcome to Part 2 of the Sklearn for Multiple Linear Regression series. In the first part of this series, we were able to accomplish the following:

- Reading and describing the data
- Making scatter plots of the dependent and independent variables using **matplotlib**
- Fitting the dataset to a regression model
- Obtaining descriptive statistics of the model

We also observed that the **Rand** variable bore little weight on the model.

## Regression Analysis

In this part, our objective is to improve the model. First, we will use feature selection to identify and remove any unnecessary variables. Then we will standardize the variables and obtain the regression statistics of the model. We will also make predictions and compare the results of the model before and after removing the unnecessary variable. We will continue using the same notebook from Part 1.

### Feature Selection

Feature selection is a data preprocessing technique that helps us to identify unnecessary variables in our dataset. We will achieve this by using **f_regression** to obtain the **Prob-F** (or p-values) and **F-statistics**.

This is achieved thus:

```python
# Import f_regression from the feature_selection module
from sklearn.feature_selection import f_regression

# Obtain the F-statistics and the p-values
f_regression(x, y)
```

The result is a pair of arrays: the first contains the F-statistic for each of the two variables, while the second contains the corresponding p-values. Our primary concern is the p-values.
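To see the shape of what **f_regression** returns without the Part 1 dataset, here is a minimal sketch on synthetic data. The names `x_demo`, `y_demo`, `informative` and `noise_feature` are made up for illustration, and the numbers are not the SAT/GPA data.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (an assumption, not the dataset from Part 1)
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)      # this feature drives the target
noise_feature = rng.normal(size=n)    # this feature is unrelated to the target
x_demo = np.column_stack([informative, noise_feature])
y_demo = 2.0 * informative + rng.normal(scale=0.5, size=n)

# f_regression returns a tuple of two 1-D arrays: (F-statistics, p-values),
# each with one entry per feature (column) of x_demo
f_stats, p_vals = f_regression(x_demo, y_demo)

print(f_stats.shape, p_vals.shape)
print(p_vals.round(3))
```

Unpacking the tuple into two names, as done here, is equivalent to indexing the result with `[0]` and `[1]`.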

The p-values are obtained and rounded thus:

```python
# Obtain the p-values by indexing the second element of the result
p_values = f_regression(x, y)[1]

# Round the p-values to 3 decimal places
p_values.round(3)
```

The p-value for **SAT** is approximately **0**, while that of **Rand** is about **0.676**. Typically, variables with p-values greater than 0.05 (i.e. p-value > 0.05) are considered statistically insignificant and are discarded.

Therefore, in our analysis, we will discard the **Rand** variable. Discarding insignificant variables has little or no effect on the model, because that is what they are: insignificant. There is no need to involve them in our analysis.

### Create a Summary Table

Let’s create a table that summarizes our observations so far, including the coefficients (weights) and the p-values:

```python
# Create a dataframe with one row per feature
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary
```

### Standardization

Standardization is the process of transforming a variable by rescaling it to have a mean of zero and a standard deviation of one. It is one form of feature scaling and is sometimes loosely called normalization. We achieve this using the StandardScaler class.
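To make the definition concrete, the sketch below standardizes a small made-up matrix by hand and checks that the result agrees with StandardScaler. The values in `x_demo` are illustrative only. Note that StandardScaler divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column feature matrix (illustrative values only)
x_demo = np.array([[1500.0, 1.0],
                   [1700.0, 3.0],
                   [1900.0, 2.0],
                   [2000.0, 2.0]])

# Manual z-score: subtract each column's mean and divide by its
# population standard deviation (np.std uses ddof=0 by default)
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(x_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```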

```python
# Import the relevant module
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
```

The fit method calculates the mean and standard deviation:

```python
# Fit calculates the mean and SD of each feature
scaler.fit(x)
```

The scaling mechanism is then applied to the data set thus:

```python
# This applies the scaling mechanism
x_scaled = scaler.transform(x)
x_scaled
```

### Regression with Scaled Features

Let’s fit the standardized data into a linear regression model. This is done thus:

```python
# Create regression with scaled features
reg = LinearRegression()
reg.fit(x_scaled, y)
```

The regression coefficients (weights) and the intercept are obtained as before:

```python
reg.coef_
reg.intercept_
```

We will also create a summary table to display the results obtained:

```python
reg_summary = pd.DataFrame([['Intercept'], ['SAT'], ['Rand']], columns=['Features'])
# "Weights" is another term for coefficients
reg_summary['Weights'] = [reg.intercept_, reg.coef_[0], reg.coef_[1]]
reg_summary
```

### Making Predictions with Standardized Coefficients

We will now use our model to predict the scores of two students using the standardized coefficients. It is important to standardize the new data before applying the predict method; otherwise, the wrong results will be obtained. This is because the model was fitted on standardized independent variables, so any prediction assumes that the input variables have been standardized in the same way.
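The sketch below illustrates the point on synthetic data: the same fitted model gives sensible predictions for inputs passed through the fitted scaler, and wildly wrong ones for raw inputs. All names ending in `_demo` and the generated values are assumptions for illustration, not the Part 1 data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumed values)
rng = np.random.default_rng(42)
sat = rng.uniform(1600, 2100, size=100)
rand = rng.integers(1, 4, size=100).astype(float)
x_demo = np.column_stack([sat, rand])
y_demo = 0.3 + 0.0016 * sat + rng.normal(scale=0.1, size=100)

# Fit the scaler on the training features, then fit the model on scaled features
scaler_demo = StandardScaler().fit(x_demo)
reg_demo = LinearRegression().fit(scaler_demo.transform(x_demo), y_demo)

new_demo = np.array([[1700.0, 2.0], [1800.0, 1.0]])

# Correct: reuse the scaler that was fitted on the training data
good = reg_demo.predict(scaler_demo.transform(new_demo))

# Incorrect: raw SAT values are treated as if they were z-scores
bad = reg_demo.predict(new_demo)

print(good)
print(bad)
```

The key design point is that the new data is transformed with the scaler fitted on the training data, rather than with a scaler fitted on the new data itself.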

The new data is standardized thus:

```python
new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
new_data_scaled = scaler.transform(new_data)
new_data_scaled
```

Then the predict() method is used to predict the **GPA** score of the students:

```python
reg.predict(new_data_scaled)
```

They are obtained as **3.09051403** and **3.26413803** respectively.

### Effect of the Insignificant Variable

What if we removed the **Rand** variable, which we classified as insignificant? How much would that change the predicted **GPA**?

To create a new x-data series containing only the **SAT** variable, we select the first column and call the reshape method so that the data stays two-dimensional:

```python
# Creating new x data series
x_simple_matrix = x_scaled[:, 0].reshape(-1, 1)
```

The standardized variable is fitted to the regression model:

```python
# Fitting the new data to the model
reg_simple = LinearRegression()
reg_simple.fit(x_simple_matrix, y)
```

The prediction is made as before on the students:

```python
reg_simple.predict(new_data_scaled[:, 0].reshape(-1, 1))
```

The **GPA** values of the students are obtained as **3.08970998** and **3.25527879**. We now compare the predicted **GPA** obtained before and after removing the unnecessary **Rand** variable:

```python
# Create a dataframe for the comparison and label the two scores to be
# predicted as Student 1 and Student 2
Comparison = pd.DataFrame([['Student 1'], ['Student 2']], columns=['Student'])

# Add the GPA before and after removing the Rand variable
Comparison['Old GPA'] = reg.predict(new_data_scaled)
Comparison['New GPA'] = reg_simple.predict(new_data_scaled[:, 0].reshape(-1, 1))

# Obtain the difference between the GPAs for each student
GPA_diff = Comparison['Old GPA'] - Comparison['New GPA']

# Convert this difference to a percentage
GPA_percent = 100 * (GPA_diff / Comparison['Old GPA'])
Comparison['GPA difference(%)'] = GPA_percent
Comparison
```

We observe that the percentage difference is negligible (**0.03%** and **0.27%**) for the two students. This is consistent with the insignificance of the **Rand** variable.

## Conclusion

We have successfully performed linear regression analysis on multivariable data using Python’s Sklearn library. In the process, we used feature selection to identify an unimportant variable and discarded it. We also saw how to standardize the variables and predict the value of an output variable given its input variables.
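For readers without the Part 1 notebook, the whole workflow can be sketched end to end on synthetic data. The generated values and the `_demo` names below are assumptions for illustration only, so the numbers will not match those reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the SAT/Rand/GPA dataset (assumed values)
rng = np.random.default_rng(1)
n = 100
x_demo = pd.DataFrame({
    "SAT": rng.uniform(1600, 2100, size=n),
    "Rand": rng.integers(1, 4, size=n).astype(float),
})
y_demo = 0.3 + 0.0016 * x_demo["SAT"] + rng.normal(scale=0.1, size=n)

# 1. Feature selection: one p-value per feature
_, p_vals = f_regression(x_demo, y_demo)
summary = pd.DataFrame({"Features": x_demo.columns, "p-values": p_vals.round(3)})

# 2. Standardization: fit the scaler once, on the training features
scaler_demo = StandardScaler().fit(x_demo)
x_scaled_demo = scaler_demo.transform(x_demo)

# 3. Regression with both features, and with SAT only (column 0)
full = LinearRegression().fit(x_scaled_demo, y_demo)
reduced = LinearRegression().fit(x_scaled_demo[:, [0]], y_demo)

# 4. Predictions for two new students, scaled with the same scaler
new_demo = pd.DataFrame([[1700, 2], [1800, 1]], columns=["SAT", "Rand"])
new_scaled = scaler_demo.transform(new_demo)

comparison = pd.DataFrame({
    "Student": ["Student 1", "Student 2"],
    "Full model": full.predict(new_scaled),
    "SAT only": reduced.predict(new_scaled[:, [0]]),
})
comparison["Difference (%)"] = (
    100 * (comparison["Full model"] - comparison["SAT only"]) / comparison["Full model"]
)

print(summary)
print(comparison)
```

Indexing with `[:, [0]]` keeps the result two-dimensional, which has the same effect as `[:, 0].reshape(-1, 1)` used earlier.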
