Sklearn For Multiple Linear Regression (Part 1)
Tagged: JupyterLab, Linear Regression, Python, Sklearn
 This topic has 0 replies, 1 voice, and was last updated 6 days, 10 hours ago by Oluwole.


Multiple Linear Regression (MLR) is an extension of simple linear regression. It is useful for establishing a linear relationship between a continuous dependent variable and two or more independent (explanatory) variables. The MLR model yields a linear equation alongside statistics such as R-squared, Prob(F-statistic), Log-Likelihood, Adjusted R-squared, and more. These statistics describe the model's properties and are useful indicators of how well it fits the data.
Scikit-learn (usually contracted to Sklearn) is perhaps the premier library for Machine Learning in Python. It has many learning algorithms for regression, classification, clustering, and dimensionality reduction. It is usually the go-to library for regression analysis among data scientists, and as such, we shall make use of it. The user interface of choice is JupyterLab.
The Dataset
The dataset contains three columns of values corresponding to the SAT score, GPA, and Rand (randomly assigned integers between 1 and 3). It is stored in a CSV file titled Multiple linear regression and can be downloaded here. The objective of our regression analysis is to establish a linear relationship between the GPA (dependent variable) and the SAT & Rand (independent variables).
Regression Analysis
The goal of any data analysis is to obtain the insights therein. In every project, knowing what to look for beforehand provides direction on the best approach to the analysis. Of course, this skill is developed over time with practice and experience. So before long, you should begin to see patterns in datasets that’d inform your approach to analysis.
At the end of this, we should be able to answer the following questions:
Is there a relationship between the GPA score of a student and the SAT and Rand variables?
If there is a relationship, how strong is it?
Can I predict a student’s GPA if I am given their SAT score and a Random integer?
We will perform the linear regression using the LinearRegression class in the sklearn.linear_model module. Let’s begin!
Importing the relevant modules
The LinearRegression class in sklearn is used for ordinary least squares linear regression. The other modules are imported for data manipulation and visualization.

    # Import the relevant libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    from sklearn.linear_model import LinearRegression

Reading the data
This is done with pandas pd.read_csv. The file is read into a variable called data, and data.head() shows the first five rows of the data:

    data = pd.read_csv('Multiple linear regression.csv')
    data.head()

Next, we use data.describe() to provide summary statistics of the data.

    data.describe()

Pay particular attention to the count metric. It shows how many non-missing values are present in each column. This value is 84 for all three variables, implying that there are no missing values, luckily for us. I find this metric useful for spotting missing values when starting with any dataset. Other metrics shown include the standard deviation (std), mean, minimums, maximums, and some percentile values.
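If you would rather check for missing values explicitly, here is a minimal sketch. It assumes a small made-up DataFrame (toy) in place of the tutorial's CSV, so the numbers are purely illustrative. It shows that count() reports the non-missing values per column, while isnull().sum() reports the missing ones directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tutorial's CSV (values are made up)
toy = pd.DataFrame({
    "SAT": [1714, 1664, np.nan, 1697],
    "GPA": [2.40, 2.52, 2.54, np.nan],
    "Rand": [1, 3, 3, 2],
})

# count() gives the number of non-missing values per column,
# which is the same figure describe() shows in its "count" row
non_missing = toy.count()

# isnull().sum() gives the number of missing values per column
missing = toy.isnull().sum()

print(non_missing)
print(missing)
```

On the real dataset, every entry of data.isnull().sum() should be 0 if the count is 84 for each column.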
Let’s define our dependent and independent variables using the following lines of code. You can denote the SAT score as x1, the Rand as x2 and the GPA as y.

    x = data[['SAT', 'Rand']]
    y = data['GPA']

Visualizing the Data
The data is visualized using the matplotlib.pyplot module:

    # Create scatter plot of GPA and SAT
    fig1 = plt.figure()
    plt.scatter(x['SAT'], y)
    fig1.suptitle('GPA and SAT')
    plt.xlabel('SAT')
    plt.ylabel('GPA')
    plt.show()

    # Create scatter plot of GPA and Rand
    fig2 = plt.figure()
    plt.scatter(x['Rand'], y)
    fig2.suptitle('GPA and Rand')
    plt.xlabel('Rand')
    plt.ylabel('GPA')
    plt.show()

Upon inspection, the scatter plot of GPA against Rand shows no apparent linear relationship, whereas the plot of GPA against SAT score does.
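A visual impression like this can also be backed up with numbers. The sketch below is only an illustration: it builds a synthetic dataset (made-up values, not the tutorial's CSV) in which GPA depends on SAT but not on Rand, then prints the Pearson correlations using pandas' corr() method.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the tutorial's dataset)
rng = np.random.default_rng(0)
n = 84
sat = rng.uniform(1600, 2100, size=n)            # made-up SAT scores
rand = rng.integers(1, 4, size=n)                # random integers 1, 2 or 3
gpa = 0.002 * sat + rng.normal(0, 0.1, size=n)   # GPA driven by SAT plus noise

synthetic = pd.DataFrame({"SAT": sat, "Rand": rand, "GPA": gpa})

# Pairwise Pearson correlation coefficients between all columns
corr = synthetic.corr()
print(corr["GPA"])
```

On the real data, data.corr() gives the same kind of table; a value near 0 in the Rand row would agree with what the scatter plot suggests.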
Fitting the Data
First, we instantiate the LinearRegression class and then fit the data to this model.

    # Instantiate the LinearRegression class
    reg = LinearRegression()
    # Fit the data with the linear regression model
    reg.fit(x, y)

Obtaining model statistics
Next, we obtain the coefficients of the independent variables, x1 and x2:

    # Coefficients of x1 and x2
    reg.coef_

As seen, x1 and x2 have coefficient values of 0.0016534 and -0.00826982 respectively.
We also obtain the bias or intercept value thus:

    # Constant value
    reg.intercept_

The intercept is obtained as 0.2960.
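To see where these numbers come from, here is a rough sketch that cross-checks LinearRegression against a plain least-squares solve with NumPy. It uses synthetic data (made-up values, not the tutorial's CSV), and the names reg_demo, A and beta are just for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the tutorial's dataset)
rng = np.random.default_rng(1)
n = 84
X = np.column_stack([rng.uniform(1600, 2100, n), rng.integers(1, 4, n)])
y = 0.3 + 0.0017 * X[:, 0] + rng.normal(0, 0.2, n)

# Fit with scikit-learn
reg_demo = LinearRegression().fit(X, y)

# Fit the same model by ordinary least squares:
# a leading column of ones lets the solver estimate the intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(reg_demo.intercept_, reg_demo.coef_)
print(beta)  # [intercept, coefficient of column 0, coefficient of column 1]
```

Both approaches should agree up to floating-point error, since each minimizes the sum of squared residuals.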
Now we calculate the R-squared value, which is the proportion of the variation in the dependent variable that is explained by the independent variables. The higher the R-squared value, the better the model fits the data.
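To make the definition concrete, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value returned by score(). The data is synthetic (made-up values) and the variable names are only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the tutorial's dataset)
rng = np.random.default_rng(2)
n = 84
X = np.column_stack([rng.uniform(1600, 2100, n), rng.integers(1, 4, n)])
y = 0.3 + 0.0017 * X[:, 0] + rng.normal(0, 0.2, n)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Variation left unexplained by the model
ss_res = np.sum((y - y_hat) ** 2)
# Total variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X, y)
print(r2_manual, r2_sklearn)
```

The two printed values should match up to floating-point error.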
The R-squared value is obtained as 0.40668 thus:

    # Obtain R-squared
    R2 = reg.score(x, y)
    R2

The adjusted R-squared is calculated to be 0.39203 as seen below, where n is the number of rows of data while p is the number of variables:

    # Calculating adjusted R-squared
    # n is the number of observations, p is the number of predictors
    n = x.shape[0]
    p = x.shape[1]
    # Using the adjusted R-squared formula
    adjusted_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    adjusted_r2

Predicting the GPA of Students
Having fitted the data to the model, let’s see how to predict a student’s GPA given their SAT score and a random integer.
First, we create a data set:

    # Creating new data
    new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
    new_data

Then we use the reg.predict() method to predict the GPA of both students.

    new_data['Predicted GPA'] = reg.predict(new_data)
    new_data

The predicted GPAs of the two students are 3.09 and 3.26.
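These predictions can be checked by hand from the fitted equation, GPA = intercept + coefficient of SAT × SAT + coefficient of Rand × Rand. The sketch below plugs in the rounded coefficient and intercept values reported earlier in this post; predict_gpa is a hypothetical helper written just for this check, not part of sklearn.

```python
# Rounded values reported earlier in this post
intercept = 0.2960
coef_sat = 0.0016534
coef_rand = -0.00826982


def predict_gpa(sat, rand):
    """Hypothetical helper: evaluate the fitted linear equation by hand."""
    return intercept + coef_sat * sat + coef_rand * rand


student_1 = predict_gpa(1700, 2)
student_2 = predict_gpa(1800, 1)
print(round(student_1, 2), round(student_2, 2))
```

Rounded to two decimal places, the results agree with the output of reg.predict().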
Observations
From the results obtained, the following insights can be derived:
 The coefficient of the Rand variable (-0.00826982) is small relative to the variable's range. Since Rand only takes values between 1 and 3, it can shift the predicted GPA by at most about 0.017, whereas a 100-point difference in SAT score shifts it by about 0.165. Rand therefore has little effect on the model, and we should consider removing it.
 The low R-squared value (about 0.41) tells us that the model explains less than half of the variation in GPA, so its predictions may not be very accurate. As a rule of thumb, an R-squared above 0.7 is preferred for more reliable predictive analysis.
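One way to judge whether a variable like Rand earns its place is to compare the model with and without it. The sketch below does this on synthetic data (made-up values, not the tutorial's CSV) in which the second predictor is pure noise; adjusted_r2 is a small helper defined here using the same formula as earlier in the post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Synthetic data for illustration only: GPA depends on SAT, not on Rand
rng = np.random.default_rng(3)
n = 84
sat = rng.uniform(1600, 2100, n)
rand = rng.integers(1, 4, n).astype(float)
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.2, n)

X_full = np.column_stack([sat, rand])   # SAT and Rand
X_reduced = sat.reshape(-1, 1)          # SAT only

r2_full = LinearRegression().fit(X_full, gpa).score(X_full, gpa)
r2_reduced = LinearRegression().fit(X_reduced, gpa).score(X_reduced, gpa)

print("SAT + Rand:", r2_full, adjusted_r2(r2_full, n, 2))
print("SAT only:  ", r2_reduced, adjusted_r2(r2_reduced, n, 1))
```

On the training data, plain R-squared never decreases when a predictor is added, even a useless one. The adjusted R-squared penalizes the extra variable, which makes it the fairer figure for this kind of comparison.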
Conclusion
We have seen how to create a linear regression model for a multivariable case using scikit-learn. The model statistics were obtained, including a low coefficient of determination (R-squared). We also observed that the Rand variable carries little weight in our model. In the second part of this series, we will investigate the model further and work on improving its accuracy.
