One of the main goals of science is to investigate the relationship between different things, so we can predict their future evolution and better understand our reality.

The relationship between education level and income. Math exam results and hours studied. Interest rates and housing starts.

These things can be independent of each other and have no relationship between them. Or they can be *deterministic*.

Deterministic means there’s a perfect relationship between the pair of variables: 100% of the time, when one goes up, the other goes up (or down) in the same proportion.

Most of the variables we study will fall somewhere in the middle.

These relationships can involve two types of data:

- **Quantitative** (numerical): Represents numbers, counts, and frequencies. For example, income, population, GDP, and stock prices. These data sets involve cross-sectional and time series data.
- **Qualitative** (categorical): Represents qualities and characteristics. For example, male or female, married or single, religion, and qualifications.

When relationships involve **quantitative data**, you can use statistical tools like *Correlation Analysis* and *Regression Analysis* to analyze and measure said relationship.


## Correlation Analysis

Correlation *quantifies* the relationship between two variables. It tells you how strong their connection is.

*Simple* correlation is the relationship between two variables.

*Multiple* correlation is the relationship between at least three variables.

### Scatter Plot

The scatter plot is a graphical representation that helps you analyze correlation (and regression).

It works by plotting all the points of observed data on a graph. With that, you usually try to find the line that best represents all of those points.

For example, a straight line means a *linear* relationship. (More on this later.)
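As a rough sketch, here’s how you might build such a scatter plot in Python and draw a fitted straight line through it (the hours-studied and exam-result numbers below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: exam results vs. hours studied
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 50)                        # hours studied
scores = 50 + 4 * hours + rng.normal(0, 8, 50)        # exam results with some noise

plt.scatter(hours, scores)                            # the cloud of observed points
slope, intercept = np.polyfit(hours, scores, deg=1)   # straight line that best fits the points
xs = np.linspace(0, 10, 2)
plt.plot(xs, slope * xs + intercept, color="red")
plt.xlabel("Hours studied")
plt.ylabel("Exam result")
plt.show()
```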

It’s an indicator of the intensity of the relationship, but it doesn’t offer an exact measure for the degree of linear association between variables. To get that we use:

### Covariance

Covariance tells you the direction of a relationship between two variables. In which direction does one variable move given the other moved as well?

A **positive** covariance means the variables move together. A **negative** covariance means they move inversely.

However, it’s not always the best tool to use. Why? Because it is influenced by the units of measurement of the variables.

Also, while covariance measures the directional relationship between two variables, it doesn’t answer the question of how strong that relationship is.
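A quick sketch of that unit problem, using made-up height and weight data:

```python
import numpy as np

# Hypothetical data: height (in meters) and weight (in kg)
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80, 1.85])
weight   = np.array([55.0, 61.0, 64.0, 70.0, 76.0, 82.0])

print(np.cov(height_m, weight)[0, 1])         # positive: the variables move together
print(np.cov(height_m * 100, weight)[0, 1])   # same data in centimeters: covariance is 100x larger
print(np.corrcoef(height_m, weight)[0, 1])    # the correlation coefficient is unaffected by the units
print(np.corrcoef(height_m * 100, weight)[0, 1])
```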

The *coefficient of correlation* is a more appropriate indicator of this strength:

### Simple Correlation Coefficient

Also known as the Pearson coefficient.

It is better than covariance, as it’s a *relative* measure.

To calculate it, divide the covariance between the variables by the product of their standard deviations.
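In symbols, where cov(X, Y) is the covariance and σ_X, σ_Y are the standard deviations of X and Y:

$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$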

The value it gives you varies between -1 and 1.

A correlation coefficient of either 1 (positive correlation) or -1 (negative correlation) indicates a super strong relationship. As strong as one can be.

**On the other hand, the closer the coefficient is to zero, the weaker the relationship between the two variables.**

If it *is* zero, there is no relationship between the variables. They’re independent.

Keep in mind that the coefficient you calculate comes from a sample, so you need to test its significance. How so? Through hypothesis testing.

The *null *hypothesis is that the real correlation coefficient (from the population) is zero.

After running the test, you’ll get a p-value: the probability of getting a sample correlation coefficient at least as extreme as the one you observed if the sample came from a population where the real correlation coefficient is equal to zero.

Therefore, we want the p-value to be small (usually, less than 5%) to reject the null.

Rejecting the null means the estimation for the coefficient is **statistically significant**, and that *rho* (the population correlation) is not zero.
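As a minimal sketch of this test in Python (again with made-up data), scipy reports both the coefficient and its p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical data: exam results vs. hours studied
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, 30)
scores = 50 + 4 * hours + rng.normal(0, 8, 30)

r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.2f}, p-value = {p_value:.4f}")
# If the p-value is below 0.05, reject the null (rho = 0): the coefficient is statistically significant.
```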

*What happens if you don’t reject the null? What does that mean?*

Well, the formula for the correlation coefficient is only applicable to **linear** relationships.

This means that when the correlation coefficient is zero, you can only conclude the variables are not *linearly correlated*. It doesn’t mean they’re statistically independent. They can still have a non-linear relationship.

Also, correlation analysis can’t tell you the **linear equation** that describes the relationship between two variables.

On the other hand, *regression* can. But before we get to that:

### Non-Parametric Correlation

There is a different type of correlation (Spearman) you can use when at least one of the variables is *qualitative*, as long as its values can be ranked (ordinal data).

What does non-parametric mean? It’s when the data at hand is qualitative, like interviews, characteristics, and observations (as opposed to numbers from which you can derive **parameters**—things like the mean, standard deviation, and variance).

The *Spearman* correlation differs from the *Pearson* correlation in the sense that:

It doesn’t look at the values in the observed data. Instead, it ranks that data based on its values and considers those rankings.

It studies the correlation between the rankings of the observed data, not the values the variables have.
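A minimal sketch in Python, using made-up rankings:

```python
from scipy import stats

# Hypothetical rankings: two judges ranking the same eight candidates
judge_a = [1, 2, 3, 4, 5, 6, 7, 8]
judge_b = [2, 1, 4, 3, 6, 5, 8, 7]

rho, p_value = stats.spearmanr(judge_a, judge_b)
print(f"Spearman rho = {rho:.2f}, p-value = {p_value:.4f}")
```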

Now, let’s move on to regression:

## Regression Analysis

Regression finds the relationship between a *dependent *variable and one or more *independent *(explanatory) variables.

As opposed to correlation, **regression** is more complete: it gives you a model capable of estimating the values of a (dependent) variable based on the (observed or expected) values of other (independent) variable(s).

Now, there can be:

- *Simple* regression: One variable explains one other variable.
- *Multiple* regression: Many independent explanatory variables explain one dependent variable.

There can also be:

- *Linear* regression: The model is a linear equation (straight line).
- *Non-linear* regression: Non-linear regression requires a continuous dependent variable, but it provides greater flexibility to fit *curves*.

We’ll focus on simple linear regression for now. You can then extrapolate it to other types of regression.

*So, how does it work?*

### Finding the Best Line

Imagine a scatter plot. You want to draw a straight line that best represents the points in the graph.

Usually, that straight line is the product of a linear equation:

y = ax + b

- **a** is the slope of the line.
- **b** is the y-intercept. Basic math, right?
- **y** is the dependent variable you want to explain with **x**.

You draw a line that is **better adjusted** to the cloud of points in the scatter plot.

*“Better adjusted?”*

This means **the line that best describes the relationship between the two variables is as close as possible to all points in the scatter plot**.

*How do I go about doing that?*

You must minimize the vertical distance of all points to the straight line. The distance of a point of observed data to the straight line you draw is the **residual**.

The most common method of minimizing the residuals is called the *Ordinary Least Squares Method*. (More on this later.)

If the points in the scatter plot are in a perfect straight line, there will be no residuals. It’s a perfect relationship. This is unlikely to happen in reality, as it’s rare for only one variable to explain the other.

An imperfect relationship means there are other factors at play to explain the dependent variable, and that’s usually what you’ll deal with.

### Simple Linear Regression

The **simple linear regression model** adds a random term (the error) to the *linear equation* to account for the factors other than X_{2} that affect Y:

Y_{i} = B_{1} + B_{2}X_{2i} + ε_{i}

- **Y** is the dependent variable and **X_{2}** is the independent (explanatory) variable. Why X_{2} and not X_{1}? Because we consider X_{1} is always equal to 1 and associated with B_{1}. Therefore you can ignore it.
- **B_{1}** and **B_{2}** are the **coefficients of regression**. The value of these parameters is unknown. However, you can estimate it with the available data (sample) for each variable.
- **ε** is a random variable with a specific probability distribution. It represents the **error**. Generally, you can say ε is all the other factors capable of influencing Y’s behavior, other than X_{2}.

The existence of ε indicates that Y is *linearly related* to X_{2}, but the relationship is not *perfect*.

### Sample Regression Function and Population Regression Function

The goal is to estimate or predict the mean of the population: the mean of the dependent variable, calculated through the fixed or known values of the independent variable.

Correlation considers both variables are random. Regression considers only the dependent variable as random. The independent X_{2} is fixed and non-stochastic.

In most cases, we only have data for the sample. We don’t have the *population’s regression function*. What can we do?

Estimate it with the **sample regression function** (assuming a linear relationship between variables):

^Y_{i} = ^B_{1} + ^B_{2}X_{2i}

You can read ^Y as Y “hat.”

Sometimes, *Y* values are in a range for each value of X_{2}. Instead of being a single number, it varies from so-and-so to so-and-so.

That range has a certain distribution (usually normal). For each conditional distribution of Y, you can calculate its *mean*, the conditional expected value. Right?

We say “conditional” because that range of values only appears when you fix each X_{2} value. You can still look at the “general” distribution of Y without considering each different value of X_{2}.
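In symbols, this conditional mean is usually called the population regression function (using the same notation as the model above):

$$E(Y_i \mid X_{2i}) = B_1 + B_2 X_{2i}$$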

*All this to say what?*

The ^Y_{i} in the equation above is actually E(Y_{i}|X_{2i}): an estimator for the expected value of **Y_{i}** given that the X_{2} from the population is equal to the X_{2i} from the sample.

- **i** is an index (i = 1, 2, …, N) and is used for *cross-sectional* data.
- **t** is used for chronological data.

^B_{1} and ^B_{2} are estimators for parameters B_{1} and B_{2}.

**The sample regression function is an approximation of the population’s regression line**, and there can be as many of these functions as there are samples.

Residue is now the difference between Y_{i} and ^Y_{i}. It’s an estimate for the error and it represents the portion of Y_{i} *not* explained by X_{2i}.

Remember, we never have access to the population regression function. But here’s a method to make the sample regression function as close to that as possible:

### Ordinary Least Squares Method (OLS)

OLS is a method to get a straight line that can be used either in simple or multiple regression, as long as it’s linear regression (as opposed to non-linear).

We know that the **residual** is the difference between the actual data from the sample and the corresponding point on the predicted straight line in the scatter plot.

OLS finds the line that produces the smallest sum of the residuals after you *square* them.

*Why not just add up all the residuals?*

Because some are positive, and some are negative. They would counteract, and likely add up to *zero*, making it look like there’s no residual.

*Why not use the absolute value?* Because using squares is better as it puts more weight on significant outliers.

So, the goal is to find the values of ^B_{1} and ^B_{2} that minimize the sum of the squares of the residuals. That is the **sample regression line**.
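Here’s a rough sketch of OLS for simple linear regression in Python, using the closed-form estimators and made-up data:

```python
import numpy as np

# Hypothetical data generated from Y = 3 + 2*X2 + error
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1.5, 100)

# Closed-form OLS estimators for simple linear regression
b2_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope: cov(X2, Y) / var(X2)
b1_hat = y.mean() - b2_hat * x.mean()             # intercept

residuals = y - (b1_hat + b2_hat * x)
print(b1_hat, b2_hat)              # should land near 3 and 2
print(residuals.sum())             # ~0: positive and negative residuals cancel out
print((residuals ** 2).sum())      # the sum of squared residuals, the quantity OLS minimizes
```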

If the goal was to simply estimate parameters B_{1} and B_{2}, this is *it*. Our job would be done here. But there are two more things we need to tackle:

- Is the sample regression line a “good” approximation of the population’s regression line? In other words, is it possible to know how close ^B_{1} and ^B_{2} are to the real parameters?
- Is there a way to examine if the estimated values for Y are close to E(Y|X_{2}), the expected value of Y given that the X_{2} from the population is the same as in the sample?

To test this and validate the conclusions we take from the regression estimations that resulted from applying OLS, we need to formulate some assumptions about the model. Then, we can evaluate it:

### Statistical Inference on the Model

The most common hypothesis to test is that:

H0: B_{2}=0

If it’s true, X_{2} is statistically insignificant and irrelevant to explain the behavior of Y.

The p-value gives you the probability of getting an estimate at least as extreme as the one in your sample if H0 were true, so the lower it is, the better. Why? Because a low p-value means there’s little support for H0 and, therefore, more evidence for the relevance of the explanatory variable associated with B_{2}, which is X_{2}.

If Beta is zero, the term B_{2}X_{2} drops out of the equation no matter the value of X_{2}. Therefore, the independent variable doesn’t explain the behavior of the dependent variable.
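As a sketch, assuming statsmodels is available, you can read the p-value of each coefficient straight from a fitted OLS model (same made-up data as before):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1.5, 100)

X = sm.add_constant(x)          # adds the column of 1s (the X1 = 1 term associated with B1)
results = sm.OLS(y, X).fit()

print(results.params)           # estimates for B1 (const) and B2 (x1)
print(results.pvalues)          # p-values for H0: coefficient = 0
```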

We also need to test the functional form of the equation.

We can do this with a RESET Test:

Test if the cubic and quadratic versions of the equation—meaning raise the independent variables to the power of 2 and/or 3, which results in non-straight lines—have any power in explaining the dependent variable. If they do, the function may not be linear.

You can then use logarithms to correct non-linearities in your model. Remember, a non-linear function is one in which the slope is not constant.

To confirm the non-linearities were dealt with, use a RESET Test again. This time on the logarithmic function.
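Recent versions of statsmodels ship a version of the RESET test (`linear_reset`); note that, by default, it adds powers of the *fitted values* rather than of each independent variable, which is the standard Ramsey form of the idea described above:

```python
from statsmodels.stats.diagnostic import linear_reset

# `results` is the fitted OLS model from the previous snippet
reset = linear_reset(results, power=3, use_f=True)  # adds squared and cubed terms of the fitted values
print(reset.pvalue)   # a small p-value suggests the linear functional form is not adequate
```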

One last important thing to mention:

## Causality

Correlation and regression tell you nothing about whether one variable causes the other. They only measure the statistical relationship between variables by comparing how they vary.

The cause-effect conclusion can only come from the theory in the field you’re studying. For example, if you’re analyzing consumption and income, economic theory is what justifies cause-effect here. Correlation and regression will only verify if there’s also a *statistical *relationship.

Even the strongest of correlations does not mean causality. Just the tendencies in the observed data.

To go ahead and say there’s a cause-and-effect relation (no matter how strong the correlation or regression), you need a theory to support it, along with common sense.

## References and Further Reading

Curto, José Dias (2017) *Potenciar os negócios? A Estatística dá uma ajuda!* (3rd edition). Guide – Artes Gráficas.
