Seeing things that aren't there, as it so often happens
We can't tell whether two variables are related without their joint distribution, but sometimes we are given separate distributions and told that's enough. It isn't.
I want to dedicate this post to the many —oh so many!— and useless —oh so useless!— faculty meetings I was forced to attend. —JCS.
See any patterns?
Recently a plot appeared on Twitter with some politically and emotionally charged variables but the same numbers as the following, asking "see any patterns?"1
The implication is that there’s an apparent relationship between absinthe and psychotic breaks: that increasing incidence of absinthe-drinking is associated with increasing incidence of psychotic breaks. But can we say that from that chart?
(Spoiler: no.)
Say there’s such a relationship and Bob and Nina are Haardvark professors; if we know that Bob drinks absinthe and Nina doesn't, then we should predict that Bob is more likely than Nina to have a psychotic break.
But the data in that plot doesn't support that conclusion. We have two numbers for Haardvark and we would need four.
Information about relationships needs joint distributions
What we know is that 12.6% of Haardvark faculty drink absinthe (Bob is in those 12.6%) and 1 faculty member has a psychotic break per 100 meetings (but we don't know whether that member is more likely to be Bob or Nina).
These distributions where each variable is described separately are called marginal distributions. We need something called a joint distribution, a distribution over the pairs of occurrences.
What we need to know, to understand the relationship between absinthe and psychotic breaks, is the probability (or incidence) of each of the following pairs of outcomes:
Faculty member X drinks absinthe and faculty member X has a psychotic break in 100 meetings;
Faculty member X drinks absinthe and faculty member X doesn't have a psychotic break in 100 meetings;
Faculty member X doesn’t drink absinthe and faculty member X has a psychotic break in 100 meetings;
Faculty member X doesn’t drink absinthe and faculty member X doesn't have a psychotic break in 100 meetings.
Having the distributions of those who drink (12.6%) or don't (87.4%) and of those who have psychotic breaks (1 in 100 meetings) or don't (99 in 100 meetings) doesn't give us enough information. For example, consider the following three possible cases for Haardvark, with 1000 faculty; that is, three ways to arrange those 1000 faculty into the four pairs of occurrences above such that the resulting marginal distributions (one variable at a time) are the same as in the plot.
Note that in all three cases we have 126 drinkers (12.6%) and 874 non-drinkers (87.4%) and 10 faculty with psychotic breaks (1%) and 990 faculty without psychotic breaks (99%) in 100 meetings. In other words, all have the distributions for Haardvark characterized in the plot above.
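The three cases can be written out in code. The counts below are the ones consistent with the cases as described in this post (Case I: all breaks among drinkers; Case II: all breaks among non-drinkers; Case III: independence); a small sketch verifying that all three share the same marginals:

```python
# Three hypothetical joint distributions for 1000 Haardvark faculty.
# Keys are (drinking status, break status); values are faculty counts.
# The numbers are illustrative, matching the cases described in the text.

cases = {
    "I":   {("drink", "break"): 10,     ("drink", "no break"): 116,
            ("no drink", "break"): 0,    ("no drink", "no break"): 874},
    "II":  {("drink", "break"): 0,      ("drink", "no break"): 126,
            ("no drink", "break"): 10,   ("no drink", "no break"): 864},
    "III": {("drink", "break"): 1.26,   ("drink", "no break"): 124.74,
            ("no drink", "break"): 8.74, ("no drink", "no break"): 865.26},
}

for name, joint in cases.items():
    drinkers = joint[("drink", "break")] + joint[("drink", "no break")]
    breaks = joint[("drink", "break")] + joint[("no drink", "break")]
    # Every case reproduces the same marginals: 126 drinkers, 10 breaks.
    assert abs(drinkers - 126) < 1e-9 and abs(breaks - 10) < 1e-9
```

Three very different relationships, one set of marginals: that's the whole problem in twenty lines.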
Case I captures the relationship many people infer from the chart: drinking absinthe is necessary for psychotic breaks in faculty (but not sufficient). In this case, if we know Bob had a psychotic break, that tells us Bob is an absinthe drinker; if we know that Nina doesn't drink absinthe, that tells us Nina didn't have a psychotic break.
We can make a narrative for Case I very easily: absinthe is known to have psychoactive effects and in excess may lead to neurological damage, which would explain the psychotic breaks.
Case II has the opposite relationship: only those who don't drink absinthe suffer the psychotic breaks. In this case, if we know that Nina had a psychotic break, we know that Nina doesn't drink absinthe. We also know that if Bob is an absinthe drinker, Bob didn't have a psychotic break.
A narrative for Case II takes a little more creativity, but it's not hard: faculty meetings are terrible mind-warping events, so faculty are somewhat prone to have psychotic breaks; but since absinthe helps faculty space out during the meeting, those who drink it are immune to the mind-warping.
Case III is the most interesting, because knowing that Bob drinks absinthe gives us no information about the likelihood that Bob had a psychotic break: it's exactly the same probability that we started with, the probability that any faculty member had a psychotic break: 1%. (There are 10 total psychotic breaks, of which 1.26 or 12.6% are from drinkers, who are 12.6% of the faculty. No information is gained from learning that Bob drinks absinthe, which means the two variables, drinking absinthe and having psychotic breaks, are orthogonal, which is an expensive word for unrelated.)
The narrative for Case III is basically that these things are unrelated to each other at least within the set of Haardvark faculty.
So, that takes care of Haardvark. We can easily find joint distributions of (drink absinthe, psychotic break) where those variables aren't informative about each other (knowing that Bob drinks absinthe doesn't help us determine whether Bob is likely to have a psychotic break beyond the base rate for all faculty, i.e. Case III). Here they are for the other three universities, in percentages:
The numbers we have can't say, for each university, whether there's any relationship between drinking absinthe and having a psychotic break, so putting non-informative marginal distributions together can't give us information about that relationship.
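One way to construct such a non-informative joint distribution from any pair of marginals is to multiply them; independence then holds by construction. The helper below is a generic sketch of that construction, not the actual numbers from the chart:

```python
# Build a joint distribution in which the two variables are unrelated:
# each cell is the product of the corresponding marginal probabilities.

def independent_joint(p_drink, p_break):
    """Joint distribution (as probabilities) with no relationship between variables."""
    return {
        ("drink", "break"): p_drink * p_break,
        ("drink", "no break"): p_drink * (1 - p_break),
        ("no drink", "break"): (1 - p_drink) * p_break,
        ("no drink", "no break"): (1 - p_drink) * (1 - p_break),
    }

haardvark = independent_joint(0.126, 0.01)
print(haardvark[("drink", "break")])  # about 0.00126, i.e. roughly 1.26 per 1000 faculty
```

Feeding in each university's marginals gives a Case III for every one of them.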
This is a well-known and often overlooked point in statistics.
It's so common an occurrence that the plot that motivated this post appeared on Twitter the day after I had posted an abstract example of the error. That wasn't so much a coincidence as a consequence of the relentless, almost daily barrage of examples of this bad argument.
Many of them by people who should know better.
Still, there’s something there, no?
That plot looks so persuasive!
The technical reason for that persuasiveness is called "perfect rank correlation." When we rank the universities by either variable, we get the same ranking: Haardvark is the smallest, followed by Coulombia, then Princetoon, and finally Stentford.
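Perfect rank correlation in miniature: the per-university values below are made up (only Haardvark's are given in this post), because the magnitudes don't matter. All that matters is that both variables order the universities the same way:

```python
# "Perfect rank correlation": both variables induce the same ordering of the
# universities. The non-Haardvark values are invented for illustration.

universities = ["Haardvark", "Coulombia", "Princetoon", "Stentford"]
absinthe = [12.6, 15.0, 18.3, 22.1]   # % of faculty who drink (illustrative)
breaks = [1.0, 1.4, 2.1, 3.0]         # breaks per 100 meetings (illustrative)

def ranks(xs):
    """Rank of each element: 0 for the smallest, up to len(xs) - 1."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

# Identical rankings under both variables -> Spearman correlation of exactly 1.
print(ranks(absinthe) == ranks(breaks))  # True
```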
And the narrative makes sense; whichever narrative we choose to believe. But herein lies the problem: narratives and data don't mix. The logic of the numbers doesn't care about the meaning of the variables (in the world outside of math). So, if the second variable were "number of coffee spills in the student coffeehouses per 100 cups served," the numbers would tell the same story, but the narrative would be a little strained.
Yes, human creativity is strong enough that we could come up with a narrative for that new variable. The point is that the persuasiveness of the narrative results from our minds working with preconceptions that have nothing to do with the cold hard reality of the numbers.
And yet, perfect rank correlation.
We could hide behind the limitations of having only four data points, but there's a more elegant alternative: that correlation may come from hidden factors.
(There's an entire chapter on hidden factors in this intergalactic best-seller: Data to Information to Decision. Click for free sample. Free with Kindle Unlimited.)
The idea is that these universities have different characteristics and it’s those characteristics that are causing both faculty to drink absinthe and to have psychotic breaks in faculty meetings, but drinking and psychotic breaks are unrelated within each of the universities.
A little like: more people have both snowshoes and ear muffs in Boston than in Honolulu, but owning snowshoes doesn't cause owning ear muffs, as we can test by giving someone in Honolulu a pair of snowshoes and checking whether that makes them get ear muffs. The hidden factor in this case is the climate in these cities.
So, maybe there's a factor that is different across these universities, call it Administrative Insanity, such that: the higher the Administrative Insanity the higher the percentage of faculty who drink absinthe; the higher the Administrative Insanity the higher the number of psychotic breaks in faculty; and Haardvark has the lowest Administrative Insanity and Stentford has the highest Administrative Insanity.2
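The hidden-factor story can be simulated. In the sketch below (every parameter is invented for illustration), Administrative Insanity drives both rates, yet within each university the two variables are independent by construction:

```python
import random

# Simulate four universities where a hidden factor ("Administrative Insanity")
# raises both the drinking rate and the break rate, while drinking and breaks
# are independent within each university. All parameters are illustrative.

random.seed(0)
insanity = {"Haardvark": 0.2, "Coulombia": 0.4, "Princetoon": 0.6, "Stentford": 0.8}

for uni, level in insanity.items():
    p_drink = 0.1 + 0.2 * level      # drinking rises with insanity
    p_break = 0.005 + 0.03 * level   # breaks rise with insanity
    # Draw each faculty member's two traits independently.
    faculty = [(random.random() < p_drink, random.random() < p_break)
               for _ in range(100_000)]
    drinkers = [b for d, b in faculty if d]
    everyone = [b for _, b in faculty]
    # Within each university, P(break | drink) is close to P(break):
    # no relationship, despite the across-university correlation.
    print(uni, round(sum(drinkers) / len(drinkers), 3),
          round(sum(everyone) / len(everyone), 3))
```

Pooled across universities, both rates rise together (the rank correlation from above); conditioned on a university, absinthe tells us nothing.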
A careful analyst would now measure Administrative Insanity and rerun the analysis of the other variables controlling for it. But even a careless analyst should know that correlations between marginal distributions across discrete categories need to be checked for hidden factors.
But one thing we can always know for sure is that we can't assess the relationship between two variables without their joint distribution. So if we see a chart with marginal distributions for some variables (i.e. one variable at a time), we can be sure it isn't evidence for a relationship between those variables.
But many are used as if they were!
The purpose of the chart posted to Twitter is to participate in a discussion of social and policy matters, so its variables are adequate for that purpose. That tweet is about those substantive matters and therefore the variables are the point.
This is a post about technical matters of statistics. The purpose of this post is to show that the numbers present in that chart can't say more than (in the variables of this post) "universities have different characteristics along these two variables, possibly driven by some university-level factor."
Because of that, the political and emotionally charged variables of the original chart would be a distraction. And we’d miss out on an opportunity to make fun of university bureaucracy.
Note that these diagrams are stronger than the relationships in Cases I and II above: they are causal, so their narratives are more elaborate. Cases I and II just say that there's a positive or negative correlation.
Both illustrative narratives for Case I and II were part of the leftmost example: in Case I drinking led to more psychotic breaks via neurological damage caused by absinthe and in Case II drinking led to fewer psychotic breaks by serving as a defense against the mind-warping effects of faculty meetings.
We can also create illustrative narratives for Cases I and II that belong in the second type of causality: the more faculty suffer psychotic breaks in meetings, the more that drives other faculty to drink (Case I) or the more it serves as a warning to live a healthier life and quit drinking (Case II).