Tagged: JupyterLab, Linear Regression, Python, Sklearn
Welcome to Part 2 of the Sklearn for Multiple Linear Regression series. In the first part of this series, we were able to accomplish the following:
- Reading and describing the data
- Making scatter plots of the dependent and independent variables using matplotlib
- Fitting the dataset to a regression model
- Obtaining descriptive statistics of the model
We also observed that the Rand variable carried little weight in the model.
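For readers resuming here, the sketch below reconstructs the Part 1 setup that this section assumes. It is a minimal recap, not the original notebook: the CSV filename is hypothetical, and the column names SAT, Rand, and GPA are inferred from the series.

```python
# Minimal recap of the Part 1 setup assumed below.
# 'student_data.csv' is a hypothetical filename; adjust it to your dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('student_data.csv')
x = data[['SAT', 'Rand']]  # independent variables
y = data['GPA']            # dependent variable

# Fit the baseline model from Part 1
reg = LinearRegression()
reg.fit(x, y)
```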
Regression Analysis
In this part, our objective is to improve the model. First, we will use feature selection to identify and remove any unnecessary variables. Then we will standardize the variables and obtain the regression statistics of the model. We will also make predictions and compare the results of the model before and after removing the unnecessary variable. We shall continue using the same notebook from Part 1.
Feature Selection
Feature selection is a data preprocessing technique that helps us identify unnecessary variables in our dataset. We will achieve this by using f_regression to obtain the F-statistics and the corresponding p-values (Prob-F).
This is achieved thus:

```python
# Import f_regression from the feature_selection module
from sklearn.feature_selection import f_regression

# Obtain the F-statistics and the p-values
f_regression(x, y)
```

The first array in the resulting tuple contains the F-statistics for the two variables, while the second array contains the p-values. Our primary concern is the p-values.
The p-values are obtained and rounded thus:

```python
# Obtain the p-values by indexing into the tuple
p_values = f_regression(x, y)[1]

# Round off the p-values to 3 decimal places
p_values.round(3)
```

We observe that the p-value for the SAT is approximately 0, while that of the Rand is about 0.676. Typically, variables with p-values greater than 0.05 (i.e. p-value > 0.05) are discarded because they are considered insignificant.
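As a small illustrative sketch (assuming x is the pandas DataFrame of features), this rule can also be applied programmatically:

```python
# Flag features whose p-value exceeds the conventional 0.05 threshold
insignificant = x.columns[p_values > 0.05]
print(insignificant)  # expected: Index(['Rand'], dtype='object')
```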
Therefore, in our analysis, we will discard the Rand variable. Discarding insignificant variables has little or no effect on the model because that is exactly what they are: insignificant. There is no need to include them in our analysis.
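As an aside, scikit-learn can automate this kind of filtering. The sketch below uses SelectKBest, which is not part of this tutorial's workflow but performs the same selection under the hood:

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Keep the single best feature according to the F-test (here, SAT)
selector = SelectKBest(score_func=f_regression, k=1)
x_selected = selector.fit_transform(x, y)
print(x.columns[selector.get_support()])  # expected: Index(['SAT'], dtype='object')
```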
Create a Summary Table
Let's create a table that summarizes our observations so far, including the weights (coefficients) and p-values:

```python
# Create a dataframe summarizing the features
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary
```

Standardization
Standardization is the process of transforming a variable by rescaling it to have a mean of zero and a standard deviation of one. It is also referred to as feature scaling or, loosely, normalization. We achieve this using the StandardScaler module.
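Conceptually, standardization computes z = (x − mean) / standard deviation for each feature. The minimal numpy sketch below mirrors what StandardScaler (used next) does for us; note that it uses the population standard deviation (ddof=0):

```python
import numpy as np

# Manual standardization: z = (x - mean) / std, computed per feature (column)
x_arr = np.asarray(x, dtype=float)
x_manual = (x_arr - x_arr.mean(axis=0)) / x_arr.std(axis=0)  # np.std defaults to ddof=0
```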
```python
# Import the relevant module
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
```

The fit method calculates the mean and standard deviation:
```python
# fit() calculates the mean and standard deviation of each feature
scaler.fit(x)
```

The scaling mechanism is then applied to the dataset thus:
```python
# transform() applies the scaling mechanism
x_scaled = scaler.transform(x)
x_scaled
```

Note that the two steps can also be combined into a single call with scaler.fit_transform(x).

Regression with Scaled Features
Let's fit the standardized data to a linear regression model. This is done thus:

```python
# Create a regression with the scaled features
reg = LinearRegression()
reg.fit(x_scaled, y)
```

The regression coefficients (weights) and the intercept are obtained as before:
```python
reg.coef_
reg.intercept_
```

We will also create a summary table to display the results obtained:
```python
# 'Weights' is another term for coefficients
reg_summary = pd.DataFrame([['Intercept'], ['SAT'], ['Rand']], columns=['Features'])
reg_summary['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]
reg_summary
```

Because the features are now on the same scale, the weights are directly comparable: the closer a weight is to zero, the smaller that feature's impact, which again flags Rand as insignificant.

Making Predictions with Standardized Coefficients
We will now use our model to predict the scores of two students using the standardized coefficients. It is important to standardize the new data before applying the predict method; otherwise, the results will be wrong. This is because the model was fitted on standardized independent variables, so any prediction it makes assumes the input variables have been standardized.
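To see the pitfall, the short sketch below shows the mistake of passing raw values to the scaled-feature model (illustrative only; do not do this):

```python
# WRONG: raw SAT scores lie far outside the standardized range the model
# was trained on, so these predictions are meaningless
reg.predict([[1700, 2], [1800, 1]])
```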
The new data is standardized thus:
```python
new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
new_data_scaled = scaler.transform(new_data)
new_data_scaled
```

Then the predict() method is used to predict the GPA scores of the students:
```python
reg.predict(new_data_scaled)
```
The predicted GPAs are 3.09051403 and 3.26413803, respectively. Note that the predictions are on the original GPA scale, since only the input features were standardized, not the target.
Effect of the Insignificant Variable
What if we removed the Rand variable, which we classified as insignificant? How much would that change the predicted GPAs?
We create a new x data series containing only the SAT variable; the reshape method is called to ensure the two-dimensional shape that sklearn expects:

```python
# Create a new x data series from the scaled SAT column
x_simple_matrix = x_scaled[:, 0].reshape(-1, 1)
```

The standardized variable is then fitted to a new regression model:
```python
# Fit the new data to the model
reg_simple = LinearRegression()
reg_simple.fit(x_simple_matrix, y)
```

The prediction is made for the same two students as before:
```python
reg_simple.predict(new_data_scaled[:, 0].reshape(-1, 1))
```
The GPAs of the students are obtained as 3.08970998 and 3.25527879.
Now let's compare the predicted GPAs obtained before and after removing the unnecessary Rand variable:
```python
# Create a dataframe for the comparison and label the two students
Comparison = pd.DataFrame([['Student 1'], ['Student 2']], columns=['Student'])

# Add the GPAs predicted before and after removing the Rand variable
Comparison['Old GPA'] = reg.predict(new_data_scaled)
Comparison['New GPA'] = reg_simple.predict(new_data_scaled[:, 0].reshape(-1, 1))

# Obtain the difference between the GPAs for each student
GPA_diff = Comparison['Old GPA'] - Comparison['New GPA']

# Convert the difference to a percentage
GPA_percent = 100 * (GPA_diff / Comparison['Old GPA'])
Comparison['GPA difference(%)'] = GPA_percent
Comparison
```

We observe that the percentage difference is negligible (about 0.03% and 0.27%) for the two students. This confirms the insignificance of the Rand variable.
Conclusion
We have successfully performed linear regression analysis on multivariable data using Python's Sklearn library. In the process, we used feature selection to identify an unimportant variable and discarded it. We also saw how to standardize the variables and predict the value of an output variable from its input variables.