- February 21, 2020 at 3:39 pm #86402 Spectator @oluwole
Autocorrelation, Homoscedasticity and Multicollinearity are concepts that find relevance in data science and analysis. They are particularly involved in linear regression. These technical terms need to be understood for better predictive analysis and proper interpretation of correlation and regression results. As such, in this post, we will learn about them and their significance to data science operations.
Autocorrelation refers to sample or population observations or variables which are related to each other across space, time or other dimensions. It mathematically describes the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
A time series is a sequence of measurements of the same variable(s) made over time. Usually, these measurements are made at evenly spaced times, say daily, weekly or monthly. Autocorrelation measures the degree of similarity between a variable’s current value and its past value.
Autocorrelation characterises the relationship between variables in their original form and their lagged form.
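As a rough illustration of this definition, the lag-k autocorrelation of a series can be sketched in plain Python. The function name `autocorrelation` and the two sample series are hypothetical, chosen only for this example; the estimator is the usual sample autocorrelation (lagged covariance about the overall mean, divided by the overall sum of squares).

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` with a copy of itself shifted by `lag`."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Total variation of the series about its mean.
    denom = sum((value - mean) ** 2 for value in series)
    # Co-variation between each value and the value `lag` steps earlier.
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


# Hypothetical data: a steadily rising series and a strictly alternating one.
trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

print(autocorrelation(trend, lag=1))        # positive: neighbours are similar
print(autocorrelation(alternating, lag=1))  # negative: neighbours are opposite
```

A value near +1 means each observation closely resembles the one before it, and a value near -1 means each observation tends to be the opposite of the one before it.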
Serial correlation and lagged correlation are alternative terms for autocorrelation. Autocorrelation is common but problematic, primarily because it violates a basic statistical assumption about sample observations: independence across elements.
Regression models sometimes fail to capture time trends when the sample data have been collected over time. When that happens, the random errors in the model tend to become positively correlated over time, so that they depend on each other rather than being independent.
Types of Autocorrelation
Autocorrelation is of two types: positive autocorrelation and negative autocorrelation.
Positive autocorrelation refers to a relationship in which a time series and its lagged version move in the same direction. Thus, positive autocorrelation exists when an above-average value tends to be followed by another above-average value, and a below-average value by another below-average value.
In positive autocorrelation, an error of a given sign tends to be followed by an error of the same sign. Therefore, positive errors usually follow positive errors, and negative errors usually follow negative errors.
Positive autocorrelation is the more common form in practice. An autocorrelation value of +1 represents perfect positive autocorrelation, while a value of +0.1 represents weak positive autocorrelation.
Negative autocorrelation refers to a relationship in which a time series and its lagged version move in opposite directions. It is also referred to as inverse autocorrelation. Thus, negative autocorrelation exists when an above-average value tends to be followed by a below-average value, and vice versa.
In negative autocorrelation, an error of a given sign tends to be followed by an error of the opposite sign. Therefore, negative errors usually follow positive errors, and positive errors usually follow negative errors. An autocorrelation of -1 represents perfect negative autocorrelation, while a value of -0.1 represents weak negative autocorrelation.
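The sign behaviour described above can be seen in a small simulation. This is only a sketch under stated assumptions: the errors are generated as e_t = phi * e_(t-1) + noise (a first-order autoregressive process, which the post itself does not mention), and the function names, the values of `phi` and the random seed are all hypothetical.

```python
import random


def lag1_autocorrelation(values):
    """Sample autocorrelation between each value and the one before it."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - 1] - mean) for t in range(1, n))
    return numer / denom


def simulate_errors(phi, n=500, seed=0):
    """Assumed error process: each error is phi times the previous one plus noise."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        errors.append(phi * errors[-1] + rng.gauss(0.0, 1.0))
    return errors


def same_sign_share(values):
    """Fraction of consecutive pairs in which both errors have the same sign."""
    pairs = list(zip(values, values[1:]))
    return sum(1 for a, b in pairs if a * b > 0) / len(pairs)


positive_errors = simulate_errors(phi=0.8)   # errors tend to keep their sign
negative_errors = simulate_errors(phi=-0.8)  # errors tend to flip their sign

for label, errors in [("phi=+0.8", positive_errors), ("phi=-0.8", negative_errors)]:
    print(label, lag1_autocorrelation(errors), same_sign_share(errors))
```

With a positive `phi`, most consecutive errors should share a sign and the lag-1 autocorrelation should be positive; with a negative `phi`, the pattern reverses.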
Example of Autocorrelation
A good example of autocorrelation is the case of taking observations over time. For example, taking the humidity readings in an area for a month. One might expect that the humidity on the first and second day of the month would be more similar to each other compared to that of the 30th day. If the humidity values that occurred closer together in time, are, in fact, more similar than the humidity values that occurred farther apart in time, the data would very likely be autocorrelated.
How to detect Autocorrelation
Autocorrelation can be detected by plotting the model residuals over time. It can also be tested using the Durbin-Watson test, which checks for first-order autocorrelation and returns a test statistic that ranges from 0 to 4. Values close to the middle of the range (2) suggest little or no autocorrelation. Values closer to 0 indicate positive autocorrelation, and values closer to 4 indicate negative autocorrelation.
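For readers who want to see where the 0 to 4 range comes from, here is a minimal sketch of the Durbin-Watson statistic: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The function name and the two residual lists are hypothetical, and the sketch computes only the statistic, not the critical values needed for a formal test.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a list of regression residuals in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Squared changes from one residual to the next.
    numer = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Overall size of the residuals.
    denom = sum(e ** 2 for e in residuals)
    return numer / denom


# Hypothetical residuals: one run keeps its sign for a while, the other flips every step.
persistent = [1, 1, 1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1]

print(durbin_watson(persistent))   # below 2: points towards positive autocorrelation
print(durbin_watson(alternating))  # above 2: points towards negative autocorrelation
```

When consecutive residuals are similar, the differences in the numerator are small and the statistic falls towards 0; when they keep flipping sign, the differences are large and the statistic rises towards 4.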
Homoscedasticity is a central theme in linear regression. It essentially means “the same variance” or “same scatter”. It describes a situation in which the variance of the error term is the same across all the values of the predictor variables or attributes. The error term refers to the noise or random disturbance in the relationship between the dependent and independent variables.
Homoscedasticity (or homoskedasticity) is usually assumed in linear regression. It is violated when the variance of the error term differs across values of an independent variable. This opposite condition is referred to as heteroscedasticity (or heteroskedasticity). As such, the more the error variance changes across observations, the stronger the heteroscedasticity.
A common rule of thumb is that if the ratio of the largest group variance to the smallest group variance is less than 1.5, the data can be treated as homoscedastic.
The problem that heteroscedasticity poses to linear regression lies in the error term. Ordinary least squares (OLS) regression works by minimising the sum of squared residuals and gives equal weight to all observations. Under heteroscedasticity, observations with larger disturbances have a greater pull on the fit than other observations, and the OLS estimates no longer have the smallest possible variance.
This also biases the estimated standard errors, leading to incorrect conclusions about the statistical significance of the coefficients.
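The rule of thumb can be sketched as a quick check on grouped data. The function names, the three groups of values and the way the data are split into groups are all hypothetical; the 1.5 cut-off is the one quoted above, and the check is a rough screen rather than a formal test.

```python
from statistics import variance


def variance_ratio(groups):
    """Largest sample variance divided by the smallest, across groups."""
    variances = [variance(group) for group in groups]
    return max(variances) / min(variances)


def looks_homoscedastic(groups, threshold=1.5):
    """Apply the rule of thumb: a ratio below the threshold suggests equal variance."""
    return variance_ratio(groups) < threshold


# Hypothetical measurements for three groups (e.g. three ranges of a predictor).
group_a = [4, 5, 6, 5, 4]
group_b = [5, 6, 7, 6, 5]
group_c = [3, 5, 7, 5, 3]  # visibly more spread out than the other two

print(variance_ratio([group_a, group_b]), looks_homoscedastic([group_a, group_b]))
print(variance_ratio([group_a, group_b, group_c]),
      looks_homoscedastic([group_a, group_b, group_c]))
```

The first two groups have the same spread, so the ratio is 1. Adding the third group, whose values are more scattered, pushes the ratio well above 1.5.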
How to check Homoscedasticity
Various tests can be conducted to check if a dataset meets this assumption. These include:
- Bartlett’s Test
- Box’s Test
- Brown-Forsythe Test
- Hartley’s Fmax Test
- Levene’s Test
- Breusch-Pagan Test
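To give a feel for how one of these tests works, below is a minimal sketch of the Breusch-Pagan idea for a regression with a single predictor. It uses the studentised (Koenker) form: fit the regression, regress the squared residuals on the predictor, and compare n times the R-squared of that second regression with a chi-square distribution with one degree of freedom. The function names and the simulated data are hypothetical, and in practice a statistics library would normally be used instead.

```python
import math
import random


def ols_fit(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variation in y explained by a straight line in x."""
    intercept, slope = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if ss_tot == 0 else 1 - ss_res / ss_tot


def breusch_pagan(x, y):
    """LM statistic and p-value (single predictor, studentised form)."""
    intercept, slope = ols_fit(x, y)
    squared_residuals = [(b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)]
    lm = max(0.0, len(x) * r_squared(x, squared_residuals))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Hypothetical data: in y_hetero the noise grows with x, in y_homo it stays constant.
rng = random.Random(1)
x = [i / 10 for i in range(1, 201)]
y_hetero = [2 + 3 * a + rng.gauss(0, 0.5 * a) for a in x]
y_homo = [2 + 3 * a + rng.gauss(0, 2.0) for a in x]

lm_hetero, p_hetero = breusch_pagan(x, y_hetero)
lm_homo, p_homo = breusch_pagan(x, y_homo)
print("noise grows with x:", lm_hetero, p_hetero)
print("constant noise:    ", lm_homo, p_homo)
```

A small p-value is evidence against homoscedasticity, since it means the size of the squared residuals can be predicted from the predictor.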
Multicollinearity refers to the occurrence of high correlations between two or more independent variables. These variables are usually called multicollinear predictors or correlated predictor variables. Examples include a person’s salary and their academic qualification or a person’s age and their years of education.
The objective of regression analysis is to isolate the relationship between each predictor variable and the dependent variable. When interpreting regression coefficients, the idea is that they represent the mean change in the dependent variable for every 1 unit change in an independent variable, given that all the other independent variables are held constant.
However, when independent variables are correlated, it indicates that the changes in one variable are associated with changes in another variable. Therefore, the stronger the correlation, the more difficult it is to hold one variable constant while varying another. It becomes difficult for the model to estimate the effect of each predictor variable on the dependent variable because the independent variables tend to change in unison.
Multicollinearity can be problematic if it is severe, such that it increases the variance of the coefficient estimates and makes the estimates very sensitive to minor variations in the model. The effect of this is that it makes the coefficient estimates very unstable and difficult to use, thus draining the analysis of its statistical and predictive power.
Causes of Multicollinearity
Multicollinearity is often the fault of the researcher. It could occur as a result of:
- Insufficient data – this can be remedied by collecting more data
- Dummy variables – wrong use of dummy variables to transform categorical data can result in multicollinearity
- Poorly designed experiments
- Data that is completely observational
- Creating new predictor variables
- Including two identical (or almost identical) variables, say height in metres and centimetres
- Including a variable in the regression that is a combination of two other variables, e.g. including “Total Sales” when Total sales = online sales + offline sales
How to check for Multicollinearity
A very easy way to detect multicollinearity is to calculate the correlation coefficients for all pairs of independent variables. Correlation coefficient values of +1 or -1 indicate perfect multicollinearity. If the values are close to either extreme, one of the variables should be removed from the model if possible. Values close to zero (0) suggest a lack of correlation between the predictor variables.
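The pairwise check can be sketched as follows. The function names, the sales figures and the 0.8 cut-off are assumptions made for illustration (0.8 is a commonly quoted rule of thumb, not a hard limit). The data reuse the idea of a "total sales" column built from two other columns.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def highly_correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for every pair whose |r| exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical predictors; total_sales is the sum of the other two columns.
online_sales = [10, 20, 30, 40, 50]
offline_sales = [5, 9, 3, 8, 6]
predictors = {
    "online_sales": online_sales,
    "offline_sales": offline_sales,
    "total_sales": [a + b for a, b in zip(online_sales, offline_sales)],
}

flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(name_a, name_b, round(r, 3))
```

Pairwise correlations only catch relationships between two predictors at a time, so a predictor that is a combination of several others may not always show up this way.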
Some of the ways of checking if your model is affected by multicollinearity include:
- A bivariate correlation greater than 0.8
- Very high standard errors for regression coefficients
- Significant model, but non-significant coefficients
- High condition indices
- High Variance Inflation Factor (VIF) and Low Tolerance
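The last item can be illustrated with a short sketch. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others, and tolerance is simply 1 / VIF. The code below relies on the fact that these VIFs equal the diagonal of the inverse of the predictors' correlation matrix. The function names and the data are hypothetical, and the small matrix inversion is written out by hand only to keep the example self-contained.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Append the identity matrix to the right of the input matrix.
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfect multicollinearity: matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors: the first two move almost in step, the third does not.
age = [21, 22, 23, 24, 25, 26, 27, 28]
years_of_education = [11.1, 11.9, 13.1, 13.9, 15.1, 15.9, 17.1, 17.9]
unrelated = [2, 5, 5, 2, 2, 5, 5, 2]

vifs = variance_inflation_factors([age, years_of_education, unrelated])
for name, vif in zip(["age", "years_of_education", "unrelated"], vifs):
    print(name, "VIF:", vif, "tolerance:", 1 / vif)
```

A VIF of 1 means a predictor is not explained by the others at all. Commonly quoted warning levels are a VIF above 5 or 10, although these cut-offs are conventions rather than strict rules.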
Autocorrelation, Homoscedasticity and Multicollinearity are pertinent concepts in data science, especially when performing regression analysis. Autocorrelation refers to a correlation between the values of a variable and its own lagged values (in regression, between the error terms of successive observations), while multicollinearity refers to a correlation between two or more independent variables. Homoscedasticity is the case of constant error variance across the data.
In regression analysis, there is usually the assumption of homoscedasticity and an absence of multicollinearity and autocorrelation. A violation of these assumptions saps the regression model of its accuracy and predictive powers. It is, therefore, important to verify that these assumptions hold and several tests can help with that.