February 14, 2020 at 5:23 pm #85830
Oluwole, Participant (@oluwole)
When analysing data sets, we often aim to investigate the relationships between the variables they contain. One of the ways we accomplish this is through Regression Analysis. It is very popular among analysts because it not only helps in finding underlying relationships between the attributes of the data; it is also a useful predictive tool.
In this article, we will learn about Regression analysis – from its definition to the different model types and how to choose the best type for your project.
What is Regression Analysis?
Regression analysis refers to a set of statistical techniques that describe the relationship between a dependent variable and a set of independent variables. It takes two basic forms, simple regression and multiple regression, depending on the number of independent variables.
Simple regression analysis describes a case where there is only one independent variable influencing the dependent variable. Multiple regression models, on the other hand, involve more than one independent variable.
Here is a simple example of the two forms. Suppose you are in the umbrella business. A simple regression model could describe the relationship between revenue and rainfall, with revenue as the dependent variable. With a multiple regression model, you can investigate the relationship between revenue and several independent variables at once: rainfall, pricing, and the number of workers.
The example mentioned above highlights the use of regression analysis, which we shall further explore.
Why Use Regression Analysis?
The uses of regression analysis fall into two broad categories. The first is prediction and forecasting. Here, the analyst must be able to justify the predictive power of the relationships found in the data set, and this use often overlaps with the field of machine learning.
The second category of use is to infer causal relationships between the dependent variable and the set of independent variables. Here, the focus of the analyst is to justify why a relationship between the variables has a causal interpretation, especially in the analysis of observational data.
Generally, regression analysis is useful to businesses for predictive analytics, supporting decisions, error correction, discovering new insights, and optimizing operational efficiency. Having established the usefulness of regression analysis, we will now examine its various types and what characterizes each of them.
Types of Regression Analysis
The more common types of regression analysis are Linear and Logistic regressions, especially in the field of Machine Learning and Data Science. They’re great, no doubt, especially as they are easy to use and interpret. However, their simplicity makes them unsuitable for establishing certain kinds of causal relationships.
For this purpose, several other regression model types exist. We will look at four of the most common types and their properties.
Linear regression is a regression model in which the dependent variable is a linear combination of the independent variables. It is often referred to as Ordinary Least Squares (OLS) or Linear Least Squares, after the method most commonly used to fit it. Linear regression is used to understand how the mean of the dependent variable changes with a unit change in each independent variable.
The simple case of Linear Regression involves a single independent variable and a dependent variable. This is referred to as Single Variable Linear Regression. It is often the case that multiple independent variables have a relationship with the dependent variable. This more general case is called Multi-Variable Linear Regression.
Linear regression models have no non-linearities, as seen below:
$$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$
Where Y is the dependent variable, a is the intercept (bias), b_1, ..., b_n are the coefficients, and X_1, ..., X_n represent the independent variables. The parameters of the model are estimated by minimizing the Sum of Squared Errors (SSE).
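To make the idea of minimizing the SSE concrete, here is a minimal sketch for the single-variable case, using the standard closed-form least-squares solution. The function names and the rainfall and revenue figures are made up for illustration; they are not taken from any real data set.

```python
def fit_simple_ols(x, y):
    """Return (a, b) that minimize the SSE of the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


def sse(x, y, a, b):
    """Sum of squared errors of the line y = a + b*x on the data."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: rainfall (mm) and umbrella revenue
rainfall = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [27.0, 43.0, 66.0, 84.0, 105.0]

a, b = fit_simple_ols(rainfall, revenue)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"SSE at the fitted line: {sse(rainfall, revenue, a, b):.3f}")
```

Any other choice of a and b gives a larger SSE on the same data, which is what "least squares" means in practice.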
Linear regression models are generally based on the following assumptions:
- A linear relationship exists between the dependent and independent variables
- The residuals (errors) have a mean of zero and a constant variance across all observations
- The independent variable is not random
- The residual values follow a normal distribution
- The residuals are not correlated with one another across observations
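Some of the assumptions listed above can be checked on the residuals of a fitted model. The sketch below is only an illustration: it fits a line to made-up data, then computes the mean of the residuals and the Durbin-Watson statistic, one common check for correlation between consecutive residuals. The Durbin-Watson statistic is not mentioned in the article and is used here as an assumption about how one might run such a check.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def durbin_watson(res):
    """Sum of squared successive differences over the sum of squares.

    The statistic lies between 0 and 4; values near 2 suggest little
    first-order autocorrelation in the residuals.
    """
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(e * e for e in res)
    return num / den


# Hypothetical data, for illustration only
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [27.0, 43.0, 66.0, 84.0, 105.0]

a, b = fit_simple_ols(x, y)
res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_res = sum(res) / len(res)
dw = durbin_watson(res)

print(f"mean residual: {mean_res:.6f}")
print(f"Durbin-Watson statistic: {dw:.3f}")
```

With an intercept in the model, the least-squares residuals average to zero by construction, so the more informative checks are usually a residual plot (for constant variance) and a statistic like the one above (for correlation).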
Linear regression models are the most common and most intuitive. They are suitable for use if your data is not extremely complex and the dependent variable is continuous. The model accuracy is, however, very sensitive to outliers. The model is also prone to problems caused by overfitting, multicollinearity, and autocorrelation. Analysts must, therefore, be wary of these issues while implementing the model.
Logistic regression models are those that study the relationship between a categorical dependent variable and a set of independent variables. A categorical variable, as opposed to a continuous variable, is one that takes on a fixed number of possible values. Variables like gender, for instance, are considered to be categorical as you are typically either “Male” or “Female.”
Unlike linear regression models, they don’t require a linear relationship between the dependent and independent variables. Rather, they model the probability of each category through a transformation (the logit, or log-odds, function) and use Maximum Likelihood Estimation to estimate the parameters. Therefore, they are commonly used for classification problems.
Logistic regression could be binary, ordinal, or nominal depending on the nature of the dependent variable. Binary logistic regression is used when there are only two possible values of the dependent variable, such as yes or no, pass or fail, true or false, 0 or 1, etc.
Ordinal logistic regression models the relationship between the independent variables and an ordinal dependent variable. Ordinal variables have at least three possible values in natural order, such as big, medium or small; short, average, or tall.
Nominal logistic regression models are used when the dependent variable is a nominal one, i.e., there are at least three groups without any natural order such as football, basketball, or baseball.
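As a rough illustration of the binary case, the sketch below estimates the parameters of a one-variable logistic model by maximizing the log-likelihood. Plain gradient ascent is used here only because it is short to write; it is an assumption of this example, and statistical packages typically use faster optimizers. The study-hours data, function names, learning rate, and step count are all hypothetical.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(x, y, a, b):
    """Log-likelihood of binary outcomes y under P(y=1) = sigmoid(a + b*x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(a + b * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def fit_logistic(x, y, lr=0.1, steps=20000):
    """Maximize the log-likelihood by gradient ascent (illustrative only)."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [yi - sigmoid(a + b * xi) for xi, yi in zip(x, y)]
        grad_a = sum(errors) / n
        grad_b = sum(e * xi for e, xi in zip(errors, x)) / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b


# Hypothetical data: hours studied and pass (1) / fail (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(hours, passed)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"P(pass | 2 hours) = {sigmoid(a + b * 2):.3f}")
print(f"P(pass | 7 hours) = {sigmoid(a + b * 7):.3f}")
```

To classify a new observation, one common convention is to predict "pass" when the estimated probability exceeds 0.5, although the threshold is a modelling choice.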
Polynomial regression models are used when the relationship between the dependent and independent variables is not linear. Here, instead of a straight line, the line of best fit is a curve that fits the data points. It is used if the power of any of the independent variables is more than one.
For a single independent variable X, the polynomial regression equation of order n is as seen below:
$$Y = a + b_1 X + b_2 X^2 + \dots + b_n X^n$$
In polynomial regression models, the analyst has full control over the modelling of the feature variables. They can have some variables with exponents and others without. They can also decide the exact exponent to assign to each variable.
Polynomial regression is not as straightforward as linear regression, and so it can be tricky. It requires careful design as one might be tempted to reduce the errors in the model by increasing the order (exponent) of the polynomial. This might make the model overfit and provide low predictive accuracy.
As such, it is advisable to plot the relationships to see the fit and to look out for signs of overfitting, especially towards the ends of the curve, as higher-order polynomials often produce poor results on extrapolation.
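The temptation described above can be seen numerically. The following sketch fits polynomials of increasing order to a small made-up data set by solving the least-squares normal equations, then prints the training SSE for each order. All names and data are hypothetical, and the hand-written solver is only there to keep the example self-contained.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular (no zero-pivot handling).
    """
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_polynomial(x, y, degree):
    """Least-squares coefficients [a, b_1, ..., b_degree] via normal equations."""
    k = degree + 1
    A = [[sum(xi ** (r + c) for xi in x) for c in range(k)] for r in range(k)]
    rhs = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(k)]
    return solve(A, rhs)


def predict(coeffs, xi):
    return sum(c * xi ** p for p, c in enumerate(coeffs))


def sse(coeffs, x, y):
    return sum((yi - predict(coeffs, xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data that roughly follows a quadratic curve
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.1, 1.8, 2.6, 2.8, 3.1, 2.9, 2.4]

train_sse = {}
for degree in (1, 2, 4):
    coeffs = fit_polynomial(x, y, degree)
    train_sse[degree] = sse(coeffs, x, y)
    print(f"order {degree}: training SSE = {train_sse[degree]:.4f}, "
          f"prediction at x=5 -> {predict(coeffs, 5.0):.2f}")
```

The training SSE can only go down as the order increases, so it is not a reliable guide on its own. Comparing the predictions at a point outside the observed range, such as x = 5 here, is one quick way to see how differently the fitted curves behave when extrapolating.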
Non-linear regression models require a continuous dependent variable but provide more flexibility to fit curves than linear regression. They are not so different from linear regression in the sense that they also estimate the parameters by minimizing the SSE. However, the two differ in how the minimum is found. Linear models have a closed-form solution that can be obtained directly from matrix equations, whereas non-linear regression models rely on an iterative algorithm that starts from an initial guess and refines it step by step.
In most non-linear models, there is only one continuous independent variable, although it is possible to have more than one. Non-linear models are usually advised only when the linear models fail to provide an adequate fit.
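To show what an iterative algorithm looks like in this setting, here is a minimal sketch of the Gauss-Newton method applied to the model y = a * exp(b * x). Gauss-Newton is one such algorithm among several and is chosen here as an assumption for illustration; the model, starting values, and synthetic data (generated from a = 2 and b = 0.5) are all hypothetical.

```python
import math


def gauss_newton_exp(x, y, a, b, iterations=20):
    """Fit y = a * exp(b * x) by repeatedly solving a linearized SSE problem."""
    for _ in range(iterations):
        # Partial derivatives of the model with respect to a and b
        ja = [math.exp(b * xi) for xi in x]
        jb = [a * xi * math.exp(b * xi) for xi in x]
        # Residuals at the current parameter values
        r = [yi - a * math.exp(b * xi) for xi, yi in zip(x, y)]
        # Normal equations (J^T J) delta = J^T r, a 2x2 system
        s_aa = sum(u * u for u in ja)
        s_ab = sum(u * v for u, v in zip(ja, jb))
        s_bb = sum(v * v for v in jb)
        g_a = sum(u * e for u, e in zip(ja, r))
        g_b = sum(v * e for v, e in zip(jb, r))
        det = s_aa * s_bb - s_ab * s_ab
        # Cramer's rule for the parameter update
        a += (g_a * s_bb - s_ab * g_b) / det
        b += (s_aa * g_b - s_ab * g_a) / det
    return a, b


# Synthetic, noise-free data generated from a = 2.0 and b = 0.5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# The starting guess matters: a poor guess can make the iteration diverge
a_hat, b_hat = gauss_newton_exp(x, y, a=1.5, b=0.6)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

Unlike the linear case, the answer depends on the initial guess and on the algorithm converging, which is part of why non-linear models are usually a second resort.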
There are other, more specific types of regression models, although they are mostly variants of the ones already discussed. They include:
- Ridge regression
- Lasso regression
- Partial Least Squares (PLS) regression
- Poisson Regression
- Negative binomial regression
- Zero-inflated regression
- Stepwise Regression
- ElasticNet Regression
Choosing the Right Model
Life is simple when we have few options, say, to eat or not to eat. It gets harder when you are bombarded with lots of options, say, what to eat: pasta, burger, fruits, potatoes, bread, and so forth. In the same way, choosing the right regression model to apply to your data can be difficult. There is no one-size-fits-all rule to apply, but the following tips should prove useful:
- Explore your data as this may help you identify the relationships between variables.
- Use different metrics, such as R-squared, Adjusted R-squared, and the standard error of the residuals, to compare the goodness of fit of the different models.
- Cross-validate by dividing your dataset into two parts, a training set and a validation set. The mean squared difference between the predicted and observed values on the validation set is an indication of the prediction accuracy.
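The metrics and the train/validate split in the tips above can be sketched as follows. This is a minimal illustration on made-up data with a single-variable linear model; the alternating split and all function names are assumptions of the example, not a prescribed procedure.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of independent variables p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def mse(y, y_hat):
    """Mean squared difference between observed and predicted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Alternate observations between the training and validation sets
x_train, y_train = x[0::2], y[0::2]
x_val, y_val = x[1::2], y[1::2]

a, b = fit_simple_ols(x_train, y_train)
train_pred = [a + b * xi for xi in x_train]
val_pred = [a + b * xi for xi in x_val]

r2 = r_squared(y_train, train_pred)
adj_r2 = adjusted_r_squared(r2, n=len(y_train), p=1)
val_mse = mse(y_val, val_pred)

print(f"R-squared (train): {r2:.5f}")
print(f"Adjusted R-squared (train): {adj_r2:.5f}")
print(f"MSE (validation): {val_mse:.4f}")
```

When comparing candidate models, the same split would be reused for each of them, and the model with the lowest validation error would generally be preferred over the one with the highest training R-squared.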
Regression analysis is useful in describing causal relationships between dependent and independent variables. It exists in different forms and, thus, provides tremendous flexibility, making it useful under a variety of circumstances. Therefore, choosing the right regression model for your work is very important, as this determines the accuracy of your predictive analytics.