What Is Correlation, Really?

A word that means something precise in statistics and something much vaguer in most business conversations — and why the gap matters.

Sit in enough business meetings and you will hear the word regularly.

"We see a correlation between customer tenure and upsell rate." "There's a strong correlation between delivery time and churn." "The data shows correlation between ad spend and revenue."

Everyone nods. The conversation moves on.

What almost nobody in the room asks — and what matters enormously — is what kind of relationship is actually there. Because "correlation" in the statistical sense means something very specific, and that specific meaning is often not what the speaker intends.

The gap between the two is not a minor technicality. It leads to wrong model choices, misleading analyses, and confident conclusions built on shaky ground.

What Pearson's r actually measures

When statisticians say "correlation," they usually mean Pearson's correlation coefficient — a number between -1 and +1.

Here is what that number actually captures: the strength and direction of a linear relationship between two variables.

That is all.

r = 1 means a perfect positive line. r = -1 means a perfect negative line. r = 0 means no linear relationship.
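Those endpoints are easy to verify directly. A minimal sketch in Python, hand-rolling the coefficient (the helper name `pearson_r` is ours):

```python
def pearson_r(xs, ys):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = list(range(10))
print(pearson_r(xs, [2 * x + 1 for x in xs]))   # perfect positive line: 1.0
print(pearson_r(xs, [-3 * x + 5 for x in xs]))  # perfect negative line: -1.0
```

Note that the scaling makes r unit-free: multiplying either variable by a constant, or shifting it, leaves the coefficient unchanged.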

It says nothing about causation. Nothing about nonlinear relationships. Nothing about outliers, clusters, or whether the pattern holds outside the observed range.

Yet in most business usage, "correlation" is shorthand for something much broader: there is a pattern worth paying attention to. This is not wrong — but it is imprecise in ways that quietly cause problems downstream.

Anscombe's Quartet

In 1973, the statistician Francis Anscombe constructed four small datasets specifically to illustrate this problem. The collection is now known as Anscombe's Quartet.

All four datasets have nearly identical summary statistics: the same means and variances for both x and y, the same Pearson correlation coefficient (approximately 0.816), and even the same fitted regression line.

Plot them, and the similarity disappears entirely.

The quartet has been reproduced, extended, and animated many times since. The most memorable descendant is the Datasaurus Dozen, which starts from a dataset shaped like a dinosaur and perturbs it into twelve further datasets, all sharing the same summary statistics to two decimal places.

The point is always the same:

The number hides what the plot reveals.
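The quartet is small enough to check directly. A sketch using Anscombe's published values (the `pearson_r` helper is ours; datasets I through III share the same x column):

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Anscombe's published data (x is shared by datasets I-III).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]  # the parabola
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for xs, ys in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(pearson_r(xs, ys), 2))  # all four: 0.82
```

Four wildly different shapes, one number.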

The nonlinear trap

Dataset II from Anscombe's Quartet deserves special attention, because the failure mode it represents is not exotic. It is common.

Consider an inverted-U relationship. Performance as a function of stress, for example. At low stress, performance is poor — not enough pressure to focus. At moderate stress, performance peaks. At high stress, it collapses.

The relationship is real. It is strong. It is deterministic. And Pearson's r on this data would be approximately zero — because the upward slope on one side and the downward slope on the other cancel each other out.

Price and demand often follow a similar shape at extremes. Quality and speed frequently do too. Many human behavioral and organizational variables live in this territory.

If you compute Pearson's r on an inverted-U relationship and get something close to zero, the natural interpretation is: no relationship. But the correct interpretation is: the wrong tool.

The variable may be highly predictive. The linear measure simply cannot see it.
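A minimal sketch of the trap, using a noiseless inverted-U (variable names like `stress` are illustrative, not real data):

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# "Stress" from -5 to 5, "performance" peaking exactly in the middle.
stress = [i / 10 for i in range(-50, 51)]
performance = [25 - s * s for s in stress]  # perfect inverted U

print(pearson_r(stress, performance))            # ~0.0: r is blind to this
print(pearson_r(stress[:51], performance[:51]))  # rising half alone: strongly positive
```

The relationship is perfectly deterministic, yet r reports nothing, because the rising half and the falling half cancel. Restrict the range to either half and the correlation reappears.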

What to do instead

The fix is not complicated. It just requires changing the default workflow.

Plot first, always. A scatter plot takes thirty seconds and reveals structure that no single number can summarize. Nonlinearity, outliers, clusters, truncated ranges — all visible to the eye, all invisible to r. If you are making decisions based on correlation coefficients without having looked at the scatter plot, you are skipping the most important step.

Use Spearman's rank correlation for monotonic relationships. Spearman measures whether the relationship is consistently increasing or decreasing, without requiring that it be linear. It is more robust to outliers and works on ordinal data. For many business relationships — where you care about direction more than precise shape — Spearman is simply the better default.
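Spearman's coefficient is simply Pearson's r computed on ranks rather than raw values. A hand-rolled sketch (helper names are ours; ties are not handled, for simplicity):

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    return pearson_r(ranks(xs), ranks(ys))

x = list(range(1, 11))
y = [2 ** v for v in x]  # monotonic, but sharply nonlinear

print(pearson_r(x, y))    # well below 1 (~0.8): linearity assumption bites
print(spearman_rho(x, y)) # 1.0: the ordering is perfect
```

For exponential growth the ordering is flawless, so Spearman reports a perfect monotonic relationship while Pearson is dragged down by the curvature.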

Use mutual information for general associations. Mutual information captures any statistical dependence — linear, nonlinear, symmetric, asymmetric — without assuming a specific shape. It is harder to interpret and requires more data, but it is the right choice when you genuinely do not know what form the relationship takes.
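To sketch the idea, here is a crude plug-in estimator using equal-width binning (real work would reach for a library such as scikit-learn's `mutual_info_regression`; the helper names below are ours):

```python
import math
from collections import Counter

def binned(values, bins=4):
    """Map each value to an equal-width bin index in [0, bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys, bins=4):
    """Plug-in mutual information estimate (in nats) from a joint histogram."""
    n = len(xs)
    bx, by = binned(xs, bins), binned(ys, bins)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

x = [i / 10 for i in range(-50, 51)]
y = [-(v * v) for v in x]  # the inverted U where Pearson's r is ~0

print(mutual_information(x, y))  # clearly positive: the dependence is detected
```

On exactly the data that defeats Pearson's r, the estimate comes out well above zero, because mutual information only asks whether knowing x reduces uncertainty about y, not what shape the link takes.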

Be precise about what you mean. When you say "correlation," specify: linear association? Monotonic association? A pattern worth investigating? The word can mean all of these, and the distinction matters when choosing a model.

Back to the boardroom

None of this is an argument against using the word "correlation" in non-technical settings. Language simplifies, and that is fine.

It is an argument for knowing what question you are actually answering — and for not letting a convenient shorthand smuggle in assumptions you have not made deliberately.

When someone reports a correlation of 0.7 between two business variables, the right follow-up questions are: Have you plotted it? Is the relationship linear across the whole range, or only in part of it? Is that Pearson or Spearman? Are a handful of outliers doing most of the work?

These are not pedantic questions. They are the difference between a finding that holds up and one that dissolves on closer inspection.

And when a model is later built on that finding — a linear regression, a feature engineering decision, a product hypothesis — the assumptions baked into the correlation coefficient come along for the ride, whether anyone intended them to or not.

Closing thoughts

Correlation is not a bad concept. It is a precise one.

The problem is not that people use it — it is that the word has been stretched to cover a much wider range of ideas than it can technically support. And when that stretched meaning quietly shapes an analysis or a model, the resulting errors are hard to spot because they were never made explicit.

Plot your data. Choose your measure deliberately. And when someone says "there's a correlation," ask what kind.

The answer will tell you more than the number ever could.