One of the main goals of science is to investigate the relationship between different things, so we can predict their future evolution and better understand our reality.
The relationship between education level and income. Math exam results and hours studied. Interest rates and housing starts.
These things can be independent of each other and have no relationship between them. Or they can be deterministic.
Deterministic means there’s a perfect relationship between the pair of variables. 100% of the time one goes up, the other goes up/down in the same proportion.
Most of the variables we study will fall somewhere in the middle.
These relationships can involve two types of data:
- Quantitative (numerical): Represents numbers, counts, and frequencies. For example income, population, GDP, and stock prices. These data sets involve cross-sectional and time series data.
- Qualitative (categorical): Represents qualities and characteristics. For example, male or female, married or single, religion, and qualifications.
When relationships involve quantitative data, you can use statistical tools like Correlation Analysis and Regression Analysis to analyze and measure said relationship.
Correlation quantifies the relationship between two variables. It tells you how strong their connection is.
Simple correlation is the relationship between two variables.
Multiple correlation is the relationship between at least three variables.
The scatter plot is a graphical representation that helps you analyze correlation (and regression).
It works by plotting all the points of observed data on a graph. With that, you usually try to find the line that better represents all of those points.
For example, a straight line means a linear relationship. (More on this later.)
The scatter plot is an indicator of the intensity of the relationship, but it doesn’t offer an exact measure of the degree of linear association between the variables. To get that, we use:
Covariance tells you the direction of a relationship between two variables. In which direction does one variable move given the other moved as well?
A positive covariance means the variables move together. A negative covariance means they move inversely.
However, it’s not always the best tool to use. Why? Because it is influenced by the units of measurement of the variables.
Also, while covariance measures the directional relationship between two variables, it doesn’t answer the question of how strong that relationship is.
The coefficient of correlation is a more appropriate indicator of this strength:
Simple Correlation Coefficient
Also known as Pearson Coefficient.
It is better than covariance, as it’s a relative (unit-free) measure.
To calculate it, divide the covariance between the variables by the product of their standard deviations: r = cov(X, Y) / (sX × sY).
The value it gives you varies between -1 and 1.
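A minimal sketch of that calculation, with made-up data for hours studied and exam scores:

```python
# Pearson correlation: covariance divided by the product of the
# standard deviations. Toy data (hypothetical): hours studied vs. score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(score) / n

# Sample covariance (dividing by n - 1)
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(hours, score)) / (n - 1)

# Sample standard deviations
std_x = (sum((x - mean_x) ** 2 for x in hours) / (n - 1)) ** 0.5
std_y = (sum((y - mean_y) ** 2 for y in score) / (n - 1)) ** 0.5

r = cov / (std_x * std_y)  # close to 1: strong positive linear relationship
```

Note that the n − 1 divisors cancel out, so the result is the same whether you use the sample or population formulas throughout.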
A correlation coefficient of 1 (positive correlation) or -1 (negative correlation) indicates a perfect linear relationship. As strong as one can be.
On the other hand, the closer the coefficient is to zero, the weaker the relationship between the two variables.
If it is zero, there is no linear relationship between the variables.
But the coefficient you calculate comes from a sample, not from the whole population. That’s why you need to test the significance of the sample’s correlation coefficient. How so? Through hypothesis testing.
The null hypothesis is that the real correlation coefficient (from the population) is zero.
After running the test, you’ll get a p-value that tells you the probability that the sample’s correlation coefficient is the result of a sample taken from a population where the real correlation coefficient is equal to zero.
Therefore, we want the p-value to be small (usually, less than 5%) to reject the null.
Rejecting the null means the estimation for the coefficient is statistically significant, and that rho (population correlation) is not zero.
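One common way to run this test (a sketch, with hypothetical numbers) uses the t-statistic t = r·sqrt(n − 2) / sqrt(1 − r²), compared against a critical value from the t-distribution with n − 2 degrees of freedom:

```python
import math

r = 0.85   # hypothetical sample correlation coefficient
n = 30     # hypothetical sample size

# t-statistic for H0: population correlation (rho) = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The two-sided 5% critical value for 28 degrees of freedom is about 2.048.
# |t| above it means we reject the null: the correlation is significant.
reject_null = abs(t) > 2.048
```

Statistical packages report the p-value directly, so in practice you rarely look up the critical value yourself.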
What happens if you don’t reject the null? What does that mean?
Well, the formula for the correlation coefficient is only applicable to linear relationships.
This means that when the correlation coefficient is zero, you can only conclude the variables are not linearly correlated. It doesn’t mean they’re statistically independent. They can still have a non-linear relationship.
Also, correlation analysis can’t tell you the linear equation that describes the relationship between two variables.
On the other hand, regression can. But before we get to that:
There is a different, non-parametric type of correlation (Spearman) you can use when at least one of the variables is qualitative, as long as its values can be ranked.
What does non-parametric mean? It’s when the data at hand is qualitative, like interviews, characteristics, and observations (as opposed to numbers from which you can derive parameters—things like the mean, standard deviation, and variance).
The Spearman correlation differs from the Pearson correlation in the sense that:
It doesn’t look at the values in the observed data. Instead, it ranks that data based on its values and considers those rankings.
It studies the correlation between the rankings of the observed data, not the values the variables have.
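A minimal sketch of the idea, assuming made-up data with no ties (proper tie handling would average the ranks):

```python
# Spearman correlation: apply Pearson's formula to the ranks of the data.
x = [10, 20, 30, 40, 50]
y = [3, 1, 4, 2, 5]

def ranks(values):
    # Rank 1 for the smallest value, n for the largest (no tie handling).
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    return cov / (sa * sb)

rho = pearson(ranks(x), ranks(y))  # correlation of rankings, not raw values
```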
Now, let’s move on to regression:
Regression finds the relationship between a dependent variable and one or more independent (explanatory) variables.
Regression is more complete than correlation: it gives you a model capable of estimating the values of a (dependent) variable based on the (observed or expected) values of other (independent) variable(s).
Now, there can be:
- Simple regression: One variable explains one other variable.
- Multiple regression: Many independent explanatory variables explain one dependent variable.
There can also be:
- Linear regression: The model is a linear equation (straight line).
- Non-linear regression: The model is a curve. It requires a continuous dependent variable, but it provides greater flexibility to fit the data.
We’ll focus on simple linear regression for now. You can then extrapolate it to other types of regression.
So, how does it work?
Finding the Best Line
Imagine a scatter plot. You want to draw a straight line that best represents the points in the graph.
Usually, that straight line is the product of a linear equation:

y = ax + b

a is the slope of the line. b is the y-intercept. Basic math, right?
y is the dependent variable you want to explain with x.
You draw a line that is better adjusted to the cloud of points in the scatter plot.
This means the line that best describes the relationship between the two variables is as close as possible to all points in the scatter plot.
How do I go about doing that?
You must minimize the vertical distance of all points to the straight line. The distance of a point of observed data to the straight line you draw is the residual.
The most common method of minimizing the residuals is called the Ordinary Least Squares Method. (More on this later.)
If the points in the scatter plot are in a perfect straight line, there will be no residuals. It’s a perfect relationship. This is unlikely to happen in reality, as it’s rare that only one variable explains the other.
An imperfect relationship means there are other factors at play to explain the dependent variable, and that’s usually what you’ll deal with.
Simple Linear Regression
The simple linear regression model adds to the linear equation a random term (error) to account for the factors other than X2 that affect Y:

Y = B1 + B2X2 + ε
Y is the dependent variable and X2 is the independent variable (explanatory).
Why X2 and not X1? Because we consider X1 to always be equal to 1 and associated with B1. Therefore, you can ignore it.
B1 and B2 are the coefficients of regression. The value of these parameters is unknown. However, you can estimate it with the available data (sample) for each variable.
ε is a random variable with a specific probability distribution. It represents the error. Generally, you can say ε is all the other factors capable of influencing Y’s behavior, other than X2.
The existence of ε indicates that Y is linearly related to X2 but the relationship is not perfect.
Sample Regression Function and Population Regression Function
The goal is to estimate or predict the population mean of the dependent variable, calculated through the fixed or known values of the independent variable.
Correlation considers both variables are random. Regression considers only the dependent variable as random. The independent X2 is fixed and non-stochastic.
In most cases, we only have data for the sample. We don’t have the function of the population‘s regression. What can we do?
Estimate it with the sample regression function (assuming a linear relationship between variables):

^Yi = ^B1 + ^B2X2i

You can read ^Y as Y “hat.”
Sometimes, Y values are in a range for each value of X2. Instead of being a single number, it varies from so-and-so to so-and-so.
That range has a certain distribution (usually normal). For each conditional distribution of Y, you can calculate its mean: the conditional expected value. Right?
We say “conditional” because there’s only a range of values if you fix each X2 value. You can still look at the “general” distribution of Y without considering each different value of X2.
All this to say what?
The ^Yi in the equation above is actually E(Yi|X2i), an estimator for the expected value of Yi given that the X2 from the population is equal to X2i from the sample.
i is an index (i = 1, 2, …, N) and is used for sectional data. t is used for chronological data.
^B1 and ^B2 are estimators for parameters B1 and B2.
The sample regression function is an approximation of the population’s regression line, and there can be as many of these functions as there are samples.
The residual is now the difference between Yi and ^Yi. It’s an estimate of the error, and it represents the portion of Yi not explained by X2i.
Remember, we never have access to the population regression function. But here’s a method to make the sample regression function as close to that as possible:
Ordinary Least Squares Method (OLS)
OLS is a method to get a straight line that can be used either in simple or multiple regression, as long as it’s linear regression (as opposed to non-linear).
We know that the residual is the difference between the actual data from the sample and the value predicted by the straight line in the scatter plot.
OLS finds the line that yields the smallest sum of squared residuals.
Why not just add up all the residuals?
Because some are positive, and some are negative. They would counteract, and likely add up to zero, making it look like there’s no residual.
Why not use the absolute value? Because using squares is better as it puts more weight on significant outliers.
So, the goal is to find the values of ^B1 and ^B2 that minimize the sum of the squares of the residuals. That is the sample regression line.
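For simple linear regression, minimizing the sum of squared residuals has a closed-form solution. A sketch with made-up data:

```python
# OLS estimates for simple linear regression: ^B2 (slope) and ^B1 (intercept)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: covariance of x and y divided by the variance of x
b2 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))

# Intercept: forces the fitted line through the point of means
b1 = mean_y - b2 * mean_x

# Residuals: observed minus fitted values; with OLS they sum to zero
residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
```

Any other choice of slope and intercept would produce a larger sum of squared residuals than these estimates do.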
If the goal was to simply estimate parameters B1 and B2, this is it. Our job would be done here. But there are two more things we need to tackle:
- Is the sample regression line a “good” approximation of the population’s regression line? In other words, is it possible to know how close ^B1 and ^B2 are to the real parameters?
- Is there a way to examine if the estimated values for Y are close to E(Y|X2)—the expected value of Y given that X2 from the population is the same as the sample?
To test this and validate the conclusions we take from the regression estimations that resulted from applying OLS, we need to formulate some assumptions about the model. Then, we can evaluate it:
Statistical Inference on the Model
The most common hypothesis to test is that:

H0: B2 = 0

If it’s true, X2 is statistically insignificant and irrelevant to explain the behavior of Y.
The p-value gives you the probability of getting an estimate at least this far from zero if B2 really were zero, so the lower it is, the better. Why? Because a low p-value means stronger evidence against the null, and therefore in favor of the relevance of the explanatory variable associated with B2, which is X2.
If B2 is zero, the term B2X2 drops out of the equation. Therefore, the independent variable doesn’t explain the behavior of the dependent variable.
We also need to test the functional form of the equation.
We can do this with a RESET Test:
Test whether the quadratic and cubic terms (raising the fitted values to the power of 2 and/or 3, which results in non-straight lines) have any power in explaining the dependent variable. If they do, the function may not be linear.
You can then use logarithms to correct non-linearities in your model. Remember, a non-linear function is one in which the slope is not constant.
To confirm the non-linearities were dealt with, use a RESET Test again. This time on the logarithmic function.
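To see why logarithms can fix a non-linearity, here is a sketch with data generated from a hypothetical power function (y = 2·x^1.5): non-linear in x, but exactly linear after taking logs of both sides.

```python
import math

# Data generated (hypothetically) from y = 2 * x ** 1.5: a curved relationship
x = [1, 2, 4, 8, 16]
y = [2 * xi ** 1.5 for xi in x]

# Taking logs: log(y) = log(2) + 1.5 * log(x), a straight line
lx = [math.log(xi) for xi in x]
ly = [math.log(yi) for yi in y]

# OLS slope on the log-log data recovers the exponent 1.5
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
```

Real data won’t line up this perfectly, but if the log-log fit passes the RESET Test, the transformation has done its job.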
One last important thing to mention:
Correlation and regression tell you nothing about whether one variable causes the other. They only measure the statistical relationship between variables, by comparing their variation.
The cause-effect conclusion can only come from the theory in the field you’re studying. For example, if you’re analyzing consumption and income, economic theory is what justifies cause-effect here. Correlation and regression will only verify if there’s also a statistical relationship.
Even the strongest of correlations does not imply causality. It only reflects tendencies in the observed data.
To go ahead and say there’s cause-effect relation (no matter how strong the correlation or regression), you need a theory to support it, along with common sense.
References and Further Reading
Curto, José Dias (2017) Potenciar os negócios? A Estatística dá uma ajuda! (3rd edition). Guide – Artes Gráficas.