A linear relationship between two parameters
Linear regression is a statistical process of calculating a (usually) linear relation between two parameters. The dependent variable is plotted on the y-axis, and is determined by the independent variable on the x-axis. Linear regression, as the name implies, assumes a linear relationship between x and y.
Linear regression is a powerful tool
If used correctly, linear regression is a powerful tool for analyzing data. However, it can just as easily be used to manipulate the results. Everything depends on how well the process is understood, especially the conditions under which it can be used. Here, we’ll discuss the correct use of linear regression, and we’ll demonstrate how it can be misused in the context of predictive analytics.
The term “regression” in this sense goes back to the Victorian scientist Francis Galton. Galton’s research showed that the children of tall parents tend to be shorter than their parents, so that height moves back toward the average over the generations. He called this process regression because the decrease in height, the regression toward the average, happened in a more or less linear fashion.
When processing his measurements, Galton used thorough statistical descriptions. Later researchers adopted his approach to naming and statistics and applied it to other measurements, including measurements that increased, rather than decreased, over time.
The problem illuminated by linear regression
Linear regression can be easily demonstrated through a physics experiment in which the stretching of a spring is measured as a function of the gravitational force Fz on a small weight hanging from the spring. The force Fz depends on the mass m of the weight according to Fz = m·g, where g = 9.81 m/s² is the gravitational acceleration. The stretching x of the spring is then measured for various values of the gravitational force Fz, each set by choosing a different mass m.
Table: measurements of stretching x of the spring as a result of gravitational force Fz on a weight with mass m.
The above table demonstrates that the stretching x increases as the gravitational force Fz increases. But the numbers alone don’t make clear what kind of relationship this is. Plotting the data in a diagram suggests that the relationship between the force Fz and the stretching x may be linear.
Linear relationship? Answer 2 questions
Let’s say there is a linear relationship between the gravitational force Fz and the stretching x of the spring, so:
x=C1 Fz + C2
where C1 and C2 are constants. In that case, two questions need to be answered:
1. How can you find the most appropriate straight line through the measurements? In other words, is it possible to determine the values of C1 and C2?
2. How reliable is the equation? In other words, how accurately can you calculate x given Fz, and vice versa?
The solution? Linear regression!
Assume that the measurements in the table above lie on a straight line. This line is described by the equation x = C1 Fz + C2, so once C1 and C2 are determined, the relationship between the gravitational force Fz and the stretching x is determined as well. Before computers, the line through the measurements was drawn as accurately as possible using a ruler.
The trick was keeping the average distance between the various measurements and the line as small as possible. The reason the line was determined manually, along with C1 and C2, was the amount of work needed to calculate the line precisely. Although it wasn’t difficult, it took so much time that an experienced mathematician would need a whole day for just one diagram.
Minimal distance between measurements
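The principle of linear regression is drawing the line through a data set that lies at a minimal distance from the data points. In the usual least-squares form, that means minimizing the sum of the squared vertical distances between the points and the line. Most data processing software, like a spreadsheet, has this function. Determining a line is a purely mathematical process and can be applied to any data set.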
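To make the calculation concrete, here is a minimal Python sketch of a least-squares fit for the spring experiment. The masses and stretch values are invented for illustration (they are not the measurements from the table), and the function name `fit_line` is simply a name chosen for this sketch.

```python
from statistics import mean

G = 9.81  # gravitational acceleration in m/s^2


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line ys = slope * xs + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical measurements: masses in kg and spring stretch in m.
masses = [0.05, 0.10, 0.15, 0.20, 0.25]
stretch = [0.021, 0.039, 0.062, 0.079, 0.101]

forces = [m * G for m in masses]    # Fz = m * g, in newtons
c1, c2 = fit_line(forces, stretch)  # x = C1 * Fz + C2
print(f"C1 = {c1:.4f} m/N, C2 = {c2:.4f} m")
```

The slope plays the role of C1 and the intercept the role of C2. What once took a day by hand is now a handful of sums.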
When the result of linear regression is added to the original diagram, the relationship between gravitational force Fz and the stretching x is immediately apparent. As nice as a diagram like that may look, it remains to be seen how accurate an approach it is. In practice this problem is solved by determining the correlation coefficient r.
This coefficient indicates how closely the measurements follow the drawn line, where r=1 means all points lie exactly on a rising line, and r=-1 means all points lie exactly on a descending line. If r=0 there is no linear correlation between the data. The calculation of the correlation coefficient is a common procedure in data processing.
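As a rough illustration, the correlation coefficient can be computed from the same kind of sums used for the line itself. The sketch below uses the common Pearson definition. The helper name `pearson_r` and the three small data sets are made up for this example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_rising = pearson_r(xs, [2, 4, 6, 8, 10])   # points exactly on a rising line
r_falling = pearson_r(xs, [10, 8, 6, 4, 2])  # points exactly on a descending line
r_none = pearson_r(xs, [3, 1, 4, 1, 3])      # symmetric zigzag, no linear trend
print(r_rising, r_falling, r_none)
```

Up to floating-point rounding, the three cases land on the extremes described above: 1, -1, and 0.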
Plotting a line through a data set is a purely mathematical process. It can be applied to any data set. This is a strength of the method, but also its Achilles heel. The data has to meet certain conditions:
- Condition 1: The independent parameter has almost no spread. In other words, the exact value of this parameter is known.
- Condition 2: The dependent parameter is a linear combination of the independent parameter. That means the relationship can be described using a polynomial whose coefficients enter linearly; for a straight line, that is a polynomial of degree one.
- Condition 3: The variance in the dependent parameter is independent of the size of the independent parameter.
By making adjustments to the calculations, it’s possible to account for a spread in the independent parameter (condition 1) in the calculation. That doesn’t go for the other conditions.
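One informal way to probe conditions 2 and 3 is to look at the residuals, the vertical distances between each measurement and the fitted line. The sketch below fits a straight line to deliberately curved, made-up data (y = x²) and prints the sign of each residual. The helper names `fit_line` and `residuals` are assumptions of this example, and a sign pattern is only a quick diagnostic, not a formal test.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Vertical distance from each measurement to the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical, deliberately curved data: y = x^2 for x = 0..7.
xs = list(range(8))
ys = [x ** 2 for x in xs]

res = residuals(xs, ys)
signs = "".join("+" if r > 0 else "-" for r in res)
print(signs)
```

With noise around a truly linear relationship, the plus and minus signs would be mixed more or less at random. Here they form a block of positives, then negatives, then positives again, which hints that a straight line is the wrong model for this data.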
If one of these conditions is not met:
- Situation 1: Linear regression can’t be applied to the data set.
- Situation 2: Using linear regression will not provide reliable results.
- Situation 3: The problem that linear regression is applied to has not been accurately described.
Which of the above situations applies depends on (an analysis of) the problem. Even if the data does not have a linear relationship, you can still calculate a line using linear regression. A good example of this is Anscombe’s quartet, published in 1973 by the English statistician Frank Anscombe. This quartet consists of four data sets that produce practically the same line after applying linear regression: (a) normal (b) curved (c) peak (d) cluster.
Four data sets, identical statistics
Linear regression doesn’t just produce the same line in all four diagrams; the other summary statistics of the data sets are (nearly) identical as well. If you just look at the numbers, there’s no difference between the four data sets. Close examination of the diagrams reveals that the line in at least two of the four is meaningless.
- Mean of x: 9
- Sample variance of x: 11
- Mean of y: 7.50
- Sample variance of y: 4.125
- Correlation between x and y: 0.816
- Linear regression line: y = 3.00 + 0.500x
- Coefficient of determination of the linear regression: 0.67
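These numbers are easy to check yourself. The sketch below recomputes the summary statistics for all four data sets. The data values are Anscombe’s quartet as it is commonly tabulated, so verify them against a trusted source before reusing them. The function name `summarize` is an assumption of this example.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet as commonly tabulated; sets a, b and c share the same x values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "a": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Means, sample variances, correlation and least-squares line for one data set."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": x_bar,
        "var_x": variance(xs),  # sample variance (n - 1 in the denominator)
        "mean_y": y_bar,
        "var_y": variance(ys),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": y_bar - slope * x_bar,
    }


results = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows are indistinguishable, which is exactly the point: the numbers alone cannot tell you whether the line means anything. Only the plots can.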
Shortening the axes
Shortening the axes in a diagram, or only using part of the data set, can create the suggestion that the data set has a linear relationship, when the opposite is true. This is especially the case when part of the data set is used to deduce a certain relationship which is then used to categorize the remaining (or incomplete) data.
Altering the graph’s scale and the range of values shown can suggest a relationship in the data that isn’t there. This problem can also rear its head in related methods, like nearest neighbor. The data set tends to get altered in this way when linear regression is used to look for connections in the data with a desired result already in mind.
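The effect of using only part of a data set can be sketched with a small, made-up example: points on the curve y = -(x - 5)², which rises, peaks, and falls again. The helper name `pearson_r` and the data are assumptions of this sketch.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data with a clearly non-linear shape: y = -(x - 5)^2 for x = 0..10.
xs = list(range(11))
ys = [-(x - 5) ** 2 for x in xs]

r_full = pearson_r(xs, ys)          # the whole data set
r_part = pearson_r(xs[:4], ys[:4])  # only the first four points, a "shortened axis"
print(f"full data set:  r = {r_full:.3f}")
print(f"first 4 points: r = {r_part:.3f}")
```

Over the full range the correlation coefficient is zero, because the rising and falling halves cancel out. Restricted to the first four points it is close to 1, which would suggest a convincing straight line to anyone who never saw the rest of the data.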