Tagged: JupyterLab, Linear Regression, Python, Sklearn

- This topic has 0 replies, 1 voice, and was last updated 11 months ago by Oluwole.

- AuthorPosts
- February 13, 2020 at 4:57 am #85719 Participant @oluwole
**Multiple Linear Regression** (MLR) is an extension of simple linear regression. It is used to model a linear relationship between a continuous dependent variable and two or more independent (explanatory) variables. An MLR model yields a linear equation along with statistics such as R-squared, Adjusted R-squared, Prob(F-statistic), and Log-Likelihood. These statistics describe the model and are useful indicators of how well it fits the data.

**Scikit-learn** (usually shortened to **sklearn**) is perhaps the standard library for machine learning in Python. It provides many algorithms for regression, classification, clustering, and dimensionality reduction. It is often the go-to library for regression analysis among data scientists, so we will use it here. The user interface of choice is JupyterLab.

## The DataSet

The dataset contains three columns of values corresponding to the SAT score, GPA, and Rand (randomly assigned integers between 1 and 3). It is stored in a CSV file titled **Multiple linear regression** and can be downloaded here. The objective of our regression analysis is to establish a linear relationship between the GPA (dependent variable) and the SAT and Rand (independent variables).

## Regression Analysis

The goal of any data analysis is to obtain the insights therein. In every project, knowing what to look for beforehand provides direction on the best approach to the analysis. Of course, this skill is developed over time with practice and experience. So before long, you should begin to see patterns in datasets that’d inform your approach to analysis.

At the end of this, we should be able to answer the following questions:

- Is there a relationship between the GPA score of a student and the SAT and Rand variables?
- If there is a relationship, how strong is it?
- Can I predict a student’s GPA if I am given their SAT score and a random integer?

We will perform the linear regression using the `LinearRegression` class from the `sklearn.linear_model` module.

Let’s begin!

### Importing the relevant modules

The `LinearRegression` class in sklearn performs ordinary least squares linear regression. The other modules are imported for data manipulation and visualization.

    # Import the relevant libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    from sklearn.linear_model import LinearRegression

### Reading the data

This is done with pandas `pd.read_csv`. The file is read into a variable called `data`, and `data.head()` shows the first five rows of the data:

    data = pd.read_csv('Multiple linear regression.csv')
    data.head()

Next, we use `data.describe()` to provide summary statistics of the data.

    data.describe()

Pay particular attention to the count metric. It shows you how many row values are present in each column. This value is 84 for the variables, implying that there are no empty rows, luckily for us. I find this metric useful for spotting missing values when starting with any dataset. Other metrics shown include the standard deviation (std), mean, minimums, maximums, and some percentile values.
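As a complement to reading the count row, missing values can also be counted directly. The sketch below is only an illustration on a small made-up DataFrame (the `toy` values are hypothetical and are not the tutorial's dataset); `count()` and `isna()` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing GPA value
toy = pd.DataFrame({
    "SAT": [1700, 1650, 1800, 1720],
    "GPA": [3.1, np.nan, 3.4, 3.2],
    "Rand": [1, 3, 2, 2],
})

# count() reports non-missing values per column, like the count row of describe()
non_missing = toy.count()

# isna().sum() reports the number of missing values per column
missing = toy.isna().sum()

print(non_missing)
print(missing)
```

On the real data, the same two calls on `data` should agree with the count row: a column with no missing values has a count equal to the number of rows.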

Let’s define our dependent and independent variables using the following line of code. You can denote the SAT score as x1, the Rand as x2 and the GPA as y.

    x = data[['SAT', 'Rand']]
    y = data['GPA']

### Visualizing the Data

The data is visualized using the `matplotlib.pyplot` module:

    # Create scatter plot of GPA and SAT
    fig1 = plt.figure()
    plt.scatter(x['SAT'], y)
    fig1.suptitle('GPA and SAT')
    plt.xlabel('SAT')
    plt.ylabel('GPA')
    plt.show()

    # Create scatter plot of GPA and Rand
    fig2 = plt.figure()
    plt.scatter(x['Rand'], y)
    fig2.suptitle('GPA and Rand')
    plt.xlabel('Rand')
    plt.ylabel('GPA')
    plt.show()

Upon inspection, you can observe that there is no apparent linear correlation in the scatter plot of GPA and Rand compared to the plot of GPA and SAT score.
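A visual impression like this can be backed up with a number by computing the Pearson correlation of each predictor with GPA. The following is a rough sketch on synthetic stand-in data, not the tutorial's CSV: the SAT range, slope, and noise level are assumptions chosen only so that GPA depends on SAT while Rand is unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in data: GPA depends on SAT plus noise, Rand is unrelated
n = 84
sat = rng.integers(1600, 2051, size=n)
rand = rng.integers(1, 4, size=n)
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.2, size=n)
toy = pd.DataFrame({"SAT": sat, "Rand": rand, "GPA": gpa})

# Pearson correlation of each predictor with GPA
corr_with_gpa = toy.corr()["GPA"].drop("GPA")
print(corr_with_gpa)
```

With the real data loaded, `data.corr()["GPA"]` would give the corresponding figures for the actual SAT and Rand columns.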

### Fitting the Data

First, we instantiate the LinearRegression class and then fit the data to this model.

    # Instantiate the LinearRegression class
    reg = LinearRegression()
    # Fit the data with the linear regression model
    reg.fit(x, y)

### Obtaining model statistics

Next, we obtain the coefficients of the independent variables, x1 and x2:

    # Coefficients of x1 and x2
    reg.coef_

As seen, x1 and x2 have coefficient values of **0.0016534** and **-0.00826982** respectively.

We also obtain the bias or intercept value thus:

    # Constant value
    reg.intercept_

The intercept is obtained as **0.2960**.

Now we calculate the R-squared value, which measures the proportion of the variation in the dependent variable that is explained by the independent variables. The higher the R-squared value, the better the model fits the data.
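To make the definition concrete, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compared with the value returned by `score()`. This is a minimal sketch on synthetic data; the arrays `X` and `y` and their generating coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: two predictors, only the first influences y
X = rng.normal(size=(84, 2))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=1.0, size=84)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# score() on a fitted LinearRegression returns the same quantity
print(r2_manual, model.score(X, y))
```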

The **R-squared** value is obtained as **0.40668** thus:

    # Obtain R-squared
    R2 = reg.score(x, y)
    R2

The adjusted R-squared is calculated to be **0.39203** as seen below, where n is the number of rows of data and p is the number of independent variables:

    # Calculating Adjusted R-squared
    # n is the number of observations, p is the number of predictors
    n = x.shape[0]
    p = x.shape[1]
    # Using the adjusted R-squared formula
    adjusted_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    adjusted_r2

### Predicting the GPA of Students

Having fitted the data to the model, let’s see how to predict a student’s GPA given their SAT score and a random integer.
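Before calling the model, it may help to see that a prediction is just the fitted linear equation evaluated by hand: the intercept plus each coefficient times its variable. The sketch below plugs in the intercept and coefficients reported above (rounded as printed) for two example students, SAT 1700 with Rand 2 and SAT 1800 with Rand 1; the variable names are illustrative.

```python
import numpy as np

# Values reported earlier in this post (rounded as printed)
intercept = 0.2960
coefs = np.array([0.0016534, -0.00826982])  # order: SAT, Rand

# Two example students, each row is [SAT, Rand]
students = np.array([[1700, 2], [1800, 1]])

# Prediction = intercept + sum(coefficient * feature value)
predicted_gpa = intercept + students @ coefs
print(predicted_gpa.round(2))
```

Because the printed coefficients are rounded, the hand-computed values should agree with the model's own predictions only up to rounding.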

First, we create a data set:

    # Creating new data
    new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand'])
    new_data

Then we use the `reg.predict()` method to predict the GPA of both students.

    new_data['Predicted GPA'] = reg.predict(new_data)
    new_data

The predicted GPAs of the students are **3.09** and **3.26**.

## Observations

From the results obtained, the following insights can be derived:

- The coefficient of the Rand variable (-0.00826982) is small relative to the variable's range: Rand only takes values between 1 and 3, so it shifts the predicted GPA by less than 0.02 points. This indicates that it has little effect on the model. We should, therefore, consider removing this variable from our model.
- The low R-squared value (about 0.41) tells us that the model explains only around 41% of the variation in GPA, so its predictions may not be accurate. As a rule of thumb, R-squared values above 0.7 are preferred for more reliable predictive analysis using the model.
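One way to check whether a variable like Rand earns its place is to fit the model with and without it and compare the adjusted R-squared. The sketch below does this on synthetic stand-in data, not the tutorial's dataset; `adjusted_r2` is a hypothetical helper that wraps the formula used earlier, and the data-generating numbers are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(42)

# Hypothetical stand-in data: y depends on the first column only
n = 84
signal = rng.normal(size=n)
noise_feature = rng.integers(1, 4, size=n).astype(float)
y = 3.0 + 0.5 * signal + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([signal, noise_feature])
X_reduced = signal.reshape(-1, 1)

full = LinearRegression().fit(X_full, y)
reduced = LinearRegression().fit(X_reduced, y)

print("R2 full:", full.score(X_full, y), "adjusted:", adjusted_r2(full, X_full, y))
print("R2 reduced:", reduced.score(X_reduced, y), "adjusted:", adjusted_r2(reduced, X_reduced, y))
```

Plain R-squared cannot decrease when a predictor is added, which is why the adjusted version is the fairer comparison. If the adjusted value does not drop when the variable is removed, that supports leaving it out.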

## Conclusion

We have seen how to create a linear regression model for a multivariable case using scikit-learn. The model statistics were obtained, including a low coefficient of determination (R-squared). We also observed that the Rand variable bears little weight on our model. In the second part of this series, we will investigate the model further and work on improving its accuracy.
