Always visualise your data: Simpson's Paradox in action

Data Visualisation
Applied Statistics (Beginners)
What if the data you’re analyzing tells one story in aggregate—but the exact opposite when you break it down?
Author

Conor O’Driscoll

Published

July 30, 2025

If you’ve ever opened a dataset and jumped straight into statistical testing, you’re not alone. It’s tempting to rush toward a result — an effect, a relationship, a difference — and get to work writing it up. But what if the “result” you find hides a deeper contradiction? What if the truth is visible in your data, but only if you look at it the right way?

This is where data visualization comes in. Visualization is how we make our data legible not just to others, but to ourselves. Indeed, I would make the case that you cannot fully understand what is going on in your data without some form of data visualization: it helps us detect patterns, check assumptions, and avoid being misled.

To illustrate this, let’s explore one of the most famous examples of misleading aggregate data producing erroneous conclusions: Simpson’s Paradox, illustrated here with the classic UC Berkeley admissions dataset from 1973.

Setting The Scene

Imagine that you’re interested in studying gender bias in university admissions. You obtain real administrative data from UC Berkeley’s graduate programs from 1973 and start with what seems like a straightforward question:

Were men more likely to be admitted than women?

You begin by exploring what type of data you have available in your dataset.

In this post, we will use the UCBAdmissions dataset: a well-known built-in R dataset derived from real administrative records of graduate admissions at UC Berkeley in 1973.

#Load the relevant packages: dplyr for data wrangling, ggplot2 for plotting
#(tidyr is used later via tidyr::pivot_wider)
library(dplyr)
library(ggplot2)

#Load the built-in data and convert it to a data frame
data(UCBAdmissions)
ucb_df <- as.data.frame(UCBAdmissions)

#What are the variable names?
names(ucb_df)
[1] "Admit"  "Gender" "Dept"   "Freq"  
#A broad overview of the structure these variables take
glimpse(ucb_df)
Rows: 24
Columns: 4
$ Admit  <fct> Admitted, Rejected, Admitted, Rejected, Admitted, Rejected, Adm…
$ Gender <fct> Male, Male, Female, Female, Male, Male, Female, Female, Male, M…
$ Dept   <fct> A, A, A, A, B, B, B, B, C, C, C, C, D, D, D, D, E, E, E, E, F, …
$ Freq   <dbl> 512, 313, 89, 19, 353, 207, 17, 8, 120, 205, 202, 391, 138, 279…

Ok. These commands already tell us a lot about the structure of the dataset. There are four variables (columns) and 24 rows. More specifically, we can see that the dataset counts applications to six departments, broken down by admission status and gender. To make sure you really understand the structure of this dataset, try answering the following questions:

  1. What type of variable is Admit?

  2. What type of variable is Gender?

  3. What type of variable is Freq?

  4. What type of variable is Dept?

  5. What type of variable is labelled as fct?

  6. What type of variable is labelled as dbl?
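
Before digging into departments, it is worth a first look at the aggregate picture: the overall admission rate for each gender, pooling all six departments together. A minimal sketch of one way to build such a table (here called overall_admit, the name used in a comparison question later in this post) looks like this:

#Overall admission rates by gender, pooling all departments
overall_admit <- ucb_df %>%
  group_by(Gender, Admit) %>%
  summarise(Freq = sum(Freq), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = Admit,
                     values_from = Freq) %>%
  mutate(Total = Admitted + Rejected,
         AdmitRate = Admitted / Total)
print(overall_admit)

On these data, the pooled rates work out to roughly 45% for men and 30% for women, which at first glance looks like clear evidence that men were more likely to be admitted.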

Digging A Bit Deeper: Department-Specific Heterogeneity?

You might stop here and think your work is done. But this only tells us what is going on in the aggregate. Unless you are a macroeconomist, you should know better than to trust aggregate data; something interesting probably lies beneath the surface. So, with this in mind, we shall dig a bit deeper.

Let’s start by breaking this down by department. The data includes six departments, labeled A–F. What happens when we examine admission rates within departments?

#Admission rates within each department, by gender
dept_admit <- ucb_df %>%
  group_by(Dept, Gender, Admit) %>%
  summarise(Freq = sum(Freq)) %>%
  tidyr::pivot_wider(names_from = Admit, 
                     values_from = Freq) %>%
  mutate(Total = Admitted + Rejected,
         AdmitRate = Admitted / Total)
`summarise()` has grouped output by 'Dept', 'Gender'. You can override using
the `.groups` argument.
print(dept_admit)
# A tibble: 12 × 6
# Groups:   Dept, Gender [12]
   Dept  Gender Admitted Rejected Total AdmitRate
   <fct> <fct>     <dbl>    <dbl> <dbl>     <dbl>
 1 A     Male        512      313   825    0.621 
 2 A     Female       89       19   108    0.824 
 3 B     Male        353      207   560    0.630 
 4 B     Female       17        8    25    0.68  
 5 C     Male        120      205   325    0.369 
 6 C     Female      202      391   593    0.341 
 7 D     Male        138      279   417    0.331 
 8 D     Female      131      244   375    0.349 
 9 E     Male         53      138   191    0.277 
10 E     Female       94      299   393    0.239 
11 F     Male         22      351   373    0.0590
12 F     Female       24      317   341    0.0704

  1. Which of the following statements best describes the core difference between dept_admit and overall_admit?

Now the story flips: in most departments, women have higher admission rates than men. So how can the overall numbers suggest the opposite? This reversal is a textbook example of Simpson’s Paradox - a phenomenon where a trend appears in different groups but reverses when the groups are combined.

In this case, women were more likely to apply to departments with lower overall admission rates (e.g., departments C, D, E, F), while men applied more to departments with higher admission rates (departments A and B). The aggregate numbers hide this because they mix different denominators across departments.
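
To see the arithmetic behind the reversal, note that each gender's overall admission rate is a weighted average of the department-level rates, with the weights given by where that gender's applications actually went. The toy numbers below are hypothetical, chosen only to make the mechanism obvious:

#Hypothetical two-department illustration of Simpson's Paradox:
#women out-admit men within each department, yet trail overall
men   <- data.frame(Applied = c(900, 100), Rate = c(0.60, 0.05))
women <- data.frame(Applied = c(100, 900), Rate = c(0.65, 0.08))

weighted.mean(men$Rate, men$Applied)     #pooled male rate   ~ 0.55
weighted.mean(women$Rate, women$Applied) #pooled female rate ~ 0.14

Even though women have the higher rate in both departments, most of their applications go to the department where almost everyone is rejected, so their pooled rate ends up lower.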

This illustrates a broader lesson: data summaries without disaggregation can obscure the underlying structure of your data. And without visualization, this kind of paradox is hard to detect.

Let’s visualize the department-level admission rates by gender.

ggplot(dept_admit, aes(x = Dept, 
                       y = AdmitRate, 
                       fill = Gender)) +
  geom_col(position = "dodge") +
  labs(title = "Admission Rates by Gender and Department (UC Berkeley, 1973)",
       y = "Admission Rate", 
       x = "Department") +
  scale_fill_manual(values = c("Male" = "#377eb8", 
                               "Female" = "#e41a1c")) +
  theme_minimal()

  1. Which of the following chart types best describes the chart displayed above?

  2. What do the letters x and y refer to, in statistical terms?

  3. Which of the following best describes why we have put AdmitRate on the y-axis in this chart?

This plot shows that in nearly all departments, women had similar or higher admission rates than men. The illusion of bias in the aggregate comes from differences in application patterns, not unfair decisions within departments. To confirm this, let’s also show how department choice varied by gender.

#Total number of applicants to each department, by gender
applicants <- ucb_df %>%
  group_by(Dept, Gender) %>%
  summarise(Applicants = sum(Freq))
`summarise()` has grouped output by 'Dept'. You can override using the
`.groups` argument.
ggplot(applicants, aes(x = Dept, 
                       y = Applicants, 
                       fill = Gender)) +
  geom_col(position = "dodge") +
  labs(title = "...",
       y = "...", 
       x = "Department") +
  scale_fill_manual(values = c("Male" = "#377eb8", 
                               "Female" = "#e41a1c")) +
  theme_minimal()

This second chart reveals that men were more likely to apply to departments A and B, which had higher acceptance rates, while women applied more often to departments C through F, where competition was steeper.

  1. What is the core difference between the two charts presented above?
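
If you prefer numbers to bars, one further optional step is to express each department's share of applications within each gender, building on the applicants table created above. A minimal sketch:

#Share of each gender's applications going to each department
applicants %>%
  group_by(Gender) %>%
  mutate(Share = Applicants / sum(Applicants)) %>%
  arrange(Gender, Dept)

Departments A and B account for the bulk of male applications, while most female applications go to the more selective departments C through F.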

Wrapping Up

This example is more than a historical curiosity—it’s a powerful reminder of what can go wrong when we skip visual exploration. Without breaking down the data into meaningful subgroups or visualizing it, we risk drawing misleading conclusions from aggregated numbers—a mistake that can easily obscure important patterns or biases hidden within the data. Here are a few key takeaways:

  1. Never Rely on Aggregates Alone. Always ask: What groups might I be collapsing? Can different subgroups tell different stories?

  2. Use the Right Visualization for the Question. Tables are great for precision, but bar plots, dot plots, and faceted graphics help reveal structure. In this case, side-by-side bar charts made the paradox visible in seconds.

  3. Visuals Help You Understand Your Own Data. Good graphics aren’t just for presentations. They’re how you, as a researcher or analyst, come to understand the texture of the data you’re working with.

  4. Tabulation Has Added Value When It Reveals Structure. A well-designed cross-tab or grouped summary tells you what’s driving a result. Don’t just count things—count them strategically.

Hungry For More?

For more information on the exact data used in this post, check out the full paper here. Alternatively, type help(UCBAdmissions) into the R console if you wish to replicate or extend the analyses conducted here.