February 13, 2020 at 4:57 am · #85719 · Oluwole (Participant) @oluwole
Multiple Linear Regression (MLR) is an extension of simple linear regression. It is useful for establishing a linear relationship between a continuous dependent variable and two or more independent (explanatory) variables. The MLR model produces a linear equation alongside statistics such as R-squared, Adjusted R-squared, Prob(F-statistic), Log-Likelihood, and more. These statistics describe the model's properties and are useful indicators of its accuracy.
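For reference, an MLR model with two explanatory variables takes the general textbook form below (this is the standard formulation, not something specific to our dataset):

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2
```

where b_0 is the intercept and b_1, b_2 are the coefficients the regression estimates.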
Scikit-learn (usually shortened to sklearn) is perhaps the standard open-source library for Machine Learning in Python. It offers a wide range of algorithms for regression, classification, clustering, and dimensionality reduction. It is the go-to library for regression analysis among data scientists, so we shall make use of it here. Our interface of choice is JupyterLab.
The dataset contains three columns of values corresponding to the SAT score, GPA, and Rand (randomly assigned integers between 1 and 3). It is stored in a CSV file titled Multiple linear regression and can be downloaded here. The objective of our regression analysis is to establish a linear relationship between the GPA (dependent variable) and the SAT & Rand (independent variables).
The goal of any data analysis is to obtain the insights therein. In every project, knowing what to look for beforehand provides direction on the best approach to the analysis. Of course, this skill is developed over time with practice and experience. So before long, you should begin to see patterns in datasets that’d inform your approach to analysis.
At the end of this, we should be able to answer the following questions:
- Is there a relationship between a student's GPA and the SAT and Rand variables?
- If there is a relationship, how strong is it?
- Can I predict a student's GPA given their SAT score and a random integer?
We will perform the linear regression using sklearn's LinearRegression() method in the JupyterLab environment.
Importing the relevant modules
The LinearRegression module in sklearn is used for ordinary least squares Linear Regression. The other modules are imported for data manipulation and visualization.

```python
# Import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
```
Reading the data
This is done with pandas' pd.read_csv. The file is read into a variable called data, and data.head() shows the first five rows of the data:

```python
data = pd.read_csv('Multiple linear regression.csv')
data.head()
```
Next, we use data.describe() to provide summary statistics of the data.
Pay particular attention to the count metric: it shows how many row values are present in each column. This value is 84 for all the variables, implying that there are no empty rows, luckily for us. I find this metric useful for spotting missing values when starting with any dataset. Other metrics shown include the mean, standard deviation (std), minimum, maximum, and some percentile values.
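If you want to check for missing values directly rather than inferring them from the count, pandas' isnull() can be chained with sum(). The DataFrame below is a small made-up stand-in for the SAT/GPA data, with one gap inserted deliberately:

```python
import numpy as np
import pandas as pd

# Small made-up stand-in for the SAT/GPA dataset (not the real file),
# with one missing SAT value inserted on purpose
df = pd.DataFrame({
    'SAT':  [1714, 1664, 1760, np.nan],
    'Rand': [1, 3, 3, 2],
    'GPA':  [2.40, 2.52, 2.54, 2.74],
})

# Count missing values per column; any non-zero count flags incomplete rows
missing = df.isnull().sum()
print(missing)
```

On the tutorial's dataset, every count would come back 0, matching what describe() showed.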
Let’s define our dependent and independent variables with the following lines of code. We denote the SAT score as x1, the Rand as x2, and the GPA as y.

```python
x = data[['SAT', 'Rand']]
y = data['GPA']
```
Visualizing the Data
The data is visualized using the matplotlib.pyplot module:

```python
# Create scatter plot of GPA and SAT
fig1 = plt.figure()
plt.scatter(x['SAT'], y)
fig1.suptitle('GPA and SAT')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()

# Create scatter plot of GPA and Rand
fig2 = plt.figure()
plt.scatter(x['Rand'], y)
fig2.suptitle('GPA and Rand')
plt.xlabel('Rand')
plt.ylabel('GPA')
plt.show()
```
Upon inspection, you can observe that there is no apparent linear correlation in the scatter plot of GPA against Rand, in contrast to the plot of GPA against SAT.
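The visual impression can also be checked numerically with DataFrame.corr(). The sketch below uses a small synthetic dataset where the target tracks one predictor but not the other (an illustration of the idea, not the tutorial's actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sat = rng.uniform(1600, 2000, 100)             # stand-in for SAT scores
rand = rng.integers(1, 4, 100)                 # stand-in for the Rand column
gpa = 0.002 * sat + rng.normal(0, 0.1, 100)    # GPA driven by SAT only

df = pd.DataFrame({'SAT': sat, 'Rand': rand, 'GPA': gpa})

# Pearson correlation of each column with GPA:
# a strong predictor shows a value near 1, a noise column stays near 0
corr = df.corr()['GPA']
print(corr)
```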
Fitting the Data
First, we instantiate the LinearRegression class and then fit the data to this model.

```python
# Instantiate the LinearRegression class
reg = LinearRegression()

# Fit the data with the linear regression model
reg.fit(x, y)
```
Obtaining model statistics
Next, we obtain the coefficients of the independent variables x1 and x2:

```python
# Coefficients of x1 and x2
reg.coef_
```
As seen, x1 and x2 have coefficient values of 0.0016534 and –0.00826982 respectively.
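One way to build intuition for what coef_ and intercept_ report is to fit on synthetic data generated from known coefficients and confirm the model recovers them. This is a sketch with made-up values, separate from the tutorial's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))              # two synthetic predictors
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1]    # exact linear relationship, no noise

reg = LinearRegression()
reg.fit(X, y)

# With noiseless data, the true coefficients and intercept are recovered
print(reg.coef_)       # close to [2.0, -0.5]
print(reg.intercept_)  # close to 1.5
```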
We also obtain the bias or intercept value thus:

```python
# Constant value
reg.intercept_
```
The intercept is obtained as 0.2960.
Now we calculate the R-squared value, which describes the proportion of the variation in the dependent variable that is explained by the independent variables. The higher the R-squared value, the better the model fits the data.
The R-squared value is obtained as 0.40668 thus:

```python
# Obtain R-squared
R2 = reg.score(x, y)
R2
```
The adjusted R-squared is calculated to be 0.39203 as seen below, where n is the number of rows of data and p is the number of predictor variables:

```python
# Calculating Adjusted R-squared
# n is the number of observations, p is the number of predictors
n = x.shape[0]
p = x.shape[1]

# Using the adjusted R-squared formula
adjusted_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
adjusted_r2
```
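Because the adjusted R-squared penalises each extra predictor, it can never exceed the plain R-squared on the training data. The sketch below checks that relationship on synthetic data shaped like the tutorial's (84 rows, 2 predictors); the numbers themselves are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(84, 2))                 # 84 rows, 2 predictors, like the tutorial
y = 0.3 + 0.5 * X[:, 0] + rng.normal(0, 1, 84)   # noisy synthetic target

reg = LinearRegression().fit(X, y)
R2 = reg.score(X, y)

n, p = X.shape
adjusted_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

# The penalty term shrinks the score slightly
print(R2, adjusted_r2)
```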
Predicting the GPA of Students
Having fitted the data to the model, let’s see how to predict a student’s GPA given their SAT score and a random integer.
First, we create a new data set:

```python
# Creating new data
new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
new_data
```
Then we use the reg.predict() function to predict the GPA of both students.

```python
new_data['Predicted GPA'] = reg.predict(new_data)
new_data
```
The predicted GPAs of the two students are 3.09 and 3.26.
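Under the hood, predict() simply applies the fitted equation: the intercept plus the coefficients dotted with the inputs. The sketch below verifies this equivalence on synthetic data (not the tutorial's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
y = 0.5 + 1.0 * X[:, 0] + 2.0 * X[:, 1]     # synthetic linear target

reg = LinearRegression().fit(X, y)

x_new = np.array([[0.2, -0.3]])

# Manual application of the fitted equation: b0 + x . coefficients
manual = reg.intercept_ + x_new @ reg.coef_
predicted = reg.predict(x_new)

print(manual, predicted)
```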
From the results obtained, the following insights can be derived:
- The small absolute value of the coefficient of the Rand variable (0.00826982) indicates that it has little effect on the model. We should, therefore, consider removing this variable from our model.
- The low R-squared value (0.41) tells us that the model's predictive accuracy is likely limited. As a rule of thumb, R-squared values above 0.7 are generally preferred for reliable predictive analysis.
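Acting on the first insight is a one-line change: refit using only the remaining column and compare scores. The sketch below does this on synthetic data built so that GPA ignores Rand (an illustration of the workflow, not the tutorial's file):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
sat = rng.uniform(1600, 2000, 84)
rand = rng.integers(1, 4, 84)
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.25, 84)   # GPA ignores Rand by construction

data = pd.DataFrame({'SAT': sat, 'Rand': rand, 'GPA': gpa})

# Full model with both predictors
full = LinearRegression().fit(data[['SAT', 'Rand']], data['GPA'])
r2_full = full.score(data[['SAT', 'Rand']], data['GPA'])

# Reduced model without the weak Rand predictor
reduced = LinearRegression().fit(data[['SAT']], data['GPA'])
r2_reduced = reduced.score(data[['SAT']], data['GPA'])

# Dropping a near-useless predictor barely moves the training R-squared
print(r2_full, r2_reduced)
```

Note that training R-squared can never increase when a predictor is removed; the point is that here it barely decreases, which is the signature of a variable worth dropping.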
We have seen how to create a linear regression model for a multivariable case using scikit-learn. The model statistics were obtained, including a low coefficient of determination (R-squared). We also observed that the Rand variable carries little weight in our model. In the second part of this series, we will investigate the model further and work on improving its accuracy.