A case-control study compares patients who have a disease or outcome of interest (cases) with patients who do not (controls), and looks back to compare how frequently exposure to a risk factor occurred in each group, in order to assess the relationship between the risk factor and the disease.
Case-control studies are observational: no intervention is attempted and no attempt is made to alter the course of the disease. The goal is to determine retrospectively the exposure to the risk factor of interest in each of the two groups, cases and controls. Because the investigator fixes the number of cases and controls, these studies cannot estimate disease risk directly; they are designed to estimate odds ratios.
Case-control studies are also known as “retrospective studies” and “case-referent studies.”
In Breslow and Day's classic textbook on data analysis in cancer research, this is the table from the study of risk factors for oesophageal cancer:

From N. E. Breslow and N. E. Day, ch. 4.
We will use dplyr and ggplot2 to graph this data.
In this project, we will recreate this table the tidyverse way.
First, we load the tidyverse meta-package, which contains dplyr for data wrangling, among other packages.
library(tidyverse)
The dataset from the book can be found here:
df <- read_csv("http://bit.ly/data_esoph", col_names = FALSE)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## X2 = col_double(),
## X3 = col_double(),
## X4 = col_double()
## )
Let’s see the first six rows of the dataset:
head(df)
## # A tibble: 6 x 4
## X1 X2 X3 X4
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2 1 0
## 2 1 1 1 0
## 3 1 2 4 0
## 4 1 2 2 0
## 5 1 2 1 0
## 6 1 2 1 0
The data come without column names. The variables are:
COL | VAR | RANGE/VALUES |
---|---|---|
1 | Age group (years) | 1 = 25-34; 2 = 35-44; 3 = 45-54; 4 = 55-64; 5 = 65-74; 6 = 75+ |
2 | Alcohol (gms/day) | 1 = 0-39; 2 = 40-79; 3 = 80-119; 4 = 120+ |
3 | Tobacco (gms/day) | 1 = 0- 9; 2 = 10-19; 3 = 20-29; 4 = 30+ |
4 | Case or Control | 1 = case; 0 = control |
Now, we will add the column names:
colnames(df) <- (c("age", "alc", "tob", "cc"))
Check:
head(df)
## # A tibble: 6 x 4
## age alc tob cc
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2 1 0
## 2 1 1 1 0
## 3 1 2 4 0
## 4 1 2 2 0
## 5 1 2 1 0
## 6 1 2 1 0
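As an alternative to renaming afterwards, read_csv accepts a character vector in col_names, so the names can be supplied at read time. This is a minimal sketch, assuming readr 2.0 or later (needed for the I() literal-data wrapper and show_col_types); the three inline rows are made-up stand-ins for the real file.

```r
library(readr)

# Toy stand-in for the remote file: three rows, no header line.
csv_text <- "1,2,1,0\n1,1,1,0\n3,4,2,1\n"

# col_names takes the names directly, so no separate renaming step is needed.
toy <- read_csv(
  I(csv_text),                               # I() marks the string as literal data
  col_names = c("age", "alc", "tob", "cc"),
  show_col_types = FALSE                     # silence the column specification message
)

names(toy)
```

With the real data, the same col_names argument would simply replace col_names = FALSE in the call above.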
Since we know the codes, we will recode the whole dataset using the dplyr functions mutate and case_when.
Age groups: 1 = 25-34, 2 = 35-44, 3 = 45-54, 4 = 55-64, 5 = 65-74, 6 = 75+
df <- df %>%
  mutate(
    age_grp =
      case_when(
        age == 1 ~ "25-34",
        age == 2 ~ "35-44",
        age == 3 ~ "45-54",
        age == 4 ~ "55-64",
        age == 5 ~ "65-74",
        TRUE ~ "75+" # any remaining code (6) is the oldest group
      )
  )
head(df)
## # A tibble: 6 x 5
## age alc tob cc age_grp
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 2 1 0 25-34
## 2 1 1 1 0 25-34
## 3 1 2 4 0 25-34
## 4 1 2 2 0 25-34
## 5 1 2 1 0 25-34
## 6 1 2 1 0 25-34
OK!
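Because the codes are consecutive integers, base R's factor() could do the recoding and fix the level order in a single step. This is a sketch of that alternative, not the approach used in this post; the vector age_code is a made-up example.

```r
# Hypothetical numeric codes, as they appear in the age column.
age_code <- c(1, 6, 3, 2, 6)

# levels = the codes found in the data, labels = the text each code stands for.
# The resulting factor keeps the labels in this order.
age_grp <- factor(
  age_code,
  levels = 1:6,
  labels = c("25-34", "35-44", "45-54", "55-64", "65-74", "75+")
)

levels(age_grp)
as.character(age_grp)
```

One difference from a catch-all TRUE branch is that an unexpected code, such as 7, becomes NA here instead of being labelled "75+".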
The same for the rest of the variables.
Alcohol:
df <- df %>%
  mutate(
    alc_grp =
      case_when(
        alc == 1 ~ "0-39",
        alc == 2 ~ "40-79",
        alc == 3 ~ "80-119",
        TRUE ~ "120+"
      )
  )
Tobacco:
df <- df %>%
  mutate(
    tob_grp =
      case_when(
        tob == 1 ~ "0- 9",
        tob == 2 ~ "10-19",
        tob == 3 ~ "20-29",
        TRUE ~ "30+"
      )
  )
Case or control group:
df <- df %>%
  mutate(
    cc_grp =
      case_when(
        cc == 0 ~ "control",
        TRUE ~ "case"
      )
  )
Now, drop the original numeric columns:
df <- df %>%
  select(age_grp:cc_grp)
And now we have to set the level order for the ordinal variables age, alcohol and tobacco:
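df <- df %>%
  mutate(
    # assign the result back to df so the factor levels persist;
    # every level must match the recoded labels exactly ("75+", "120+"),
    # otherwise the unmatched values become NA
    age_grp = factor(age_grp, levels = c("25-34", "35-44", "45-54", "55-64", "65-74", "75+")),
    alc_grp = factor(alc_grp, levels = c("0-39", "40-79", "80-119", "120+")),
    tob_grp = factor(tob_grp, levels = c("0- 9", "10-19", "20-29", "30+"))
  )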
Now we have the data ready for the analysis!
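factor() silently turns any value that is not listed in levels into NA, so a mistyped level (for example "75" instead of "75+") loses data without a warning. A small sanity check, sketched here on a made-up vector, is to compare the number of missing values before and after the conversion.

```r
# Made-up labels standing in for a recoded column.
x <- c("65-74", "75+", "75+", "25-34")

# A typo in the levels ("75" instead of "75+") produces NA for the unmatched values.
bad  <- factor(x, levels = c("25-34", "65-74", "75"))
good <- factor(x, levels = c("25-34", "65-74", "75+"))

sum(is.na(bad))   # both "75+" values were lost
sum(is.na(good))  # nothing was lost

# Guard that stops the script if the conversion introduced missing values.
stopifnot(sum(is.na(good)) == sum(is.na(x)))
```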
Let’s make Table 1. We have two options; first, the traditional table:
table(df$age_grp, df$cc_grp)
##
## case control
## 25-34 1 115
## 35-44 9 190
## 45-54 46 167
## 55-64 76 166
## 65-74 55 106
## 75+ 13 31
Here we can add the margins with addmargins:
addmargins(table(df$age_grp, df$cc_grp))
##
## case control Sum
## 25-34 1 115 116
## 35-44 9 190 199
## 45-54 46 167 213
## 55-64 76 166 242
## 65-74 55 106 161
## 75+ 13 31 44
## Sum 200 775 975
or make a proportional table with prop.table:
options(digits = 2) # limit the digits to two decimals
prop.table(table(df$age_grp, df$cc_grp))*100
##
## case control
## 25-34 0.10 11.79
## 35-44 0.92 19.49
## 45-54 4.72 17.13
## 55-64 7.79 17.03
## 65-74 5.64 10.87
## 75+ 1.33 3.18
Since the groups differ in size, percentages of the grand total are not very useful. Instead, we can compute the percentages within each column, that is, separately for cases and controls:
prop.table(table(df$age_grp, df$cc_grp), 2)*100 # the 2 is the margin argument: percentages by column
##
## case control
## 25-34 0.5 14.8
## 35-44 4.5 24.5
## 45-54 23.0 21.5
## 55-64 38.0 21.4
## 65-74 27.5 13.7
## 75+ 6.5 4.0
This is better.
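The margin argument of prop.table decides which total each cell is divided by. The sketch below rebuilds the age table from the counts printed above (the object names are ours) and checks the three options.

```r
# Counts copied from the age-by-group table above.
tab <- matrix(
  c(1, 9, 46, 76, 55, 13,          # cases
    115, 190, 167, 166, 106, 31),  # controls
  ncol = 2,
  dimnames = list(
    c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
    c("case", "control")
  )
)

p_all <- prop.table(tab)      # no margin: each cell / grand total
p_row <- prop.table(tab, 1)   # margin = 1: each cell / its row total
p_col <- prop.table(tab, 2)   # margin = 2: each cell / its column total

sum(p_all)       # the whole table sums to 1
rowSums(p_row)   # every row sums to 1
colSums(p_col)   # every column sums to 1
```

For a case-control table, the column version is the one that describes the age distribution of cases and of controls separately.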
The same table in the dplyr way:
df %>%
group_by(age_grp, cc_grp) %>%
summarise(n = n()) %>%
spread(cc_grp, n)
## `summarise()` regrouping output by 'age_grp' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups: age_grp [6]
## age_grp case control
## <chr> <int> <int>
## 1 25-34 1 115
## 2 35-44 9 190
## 3 45-54 46 167
## 4 55-64 76 166
## 5 65-74 55 106
## 6 75+ 13 31
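In recent versions of tidyr, spread() is superseded by pivot_wider(), and count() can stand in for the group_by() plus summarise() pair, which also avoids the regrouping message. This is a sketch of the same step, assuming tidyr 1.0 or later; the small tibble is typed in from the counts above rather than computed from df.

```r
library(dplyr)
library(tidyr)

# Counts for two age groups, copied from the table above.
counts <- tibble(
  age_grp = c("25-34", "25-34", "35-44", "35-44"),
  cc_grp  = c("case", "control", "case", "control"),
  n       = c(1L, 115L, 9L, 190L)
)

# Expand to one row per person so the example starts from row-level data like df.
rows <- counts %>% uncount(n)

# count() tallies each combination; pivot_wider() gives one column per study group.
wide <- rows %>%
  count(age_grp, cc_grp) %>%
  pivot_wider(names_from = cc_grp, values_from = n)

wide
```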
or as a table of percentages of the grand total:
df %>%
count(age_grp, cc_grp ) %>%
mutate(prop = prop.table(n)*100) %>%
select(-n) %>%
spread(cc_grp, prop)
## # A tibble: 6 x 3
## age_grp case control
## <chr> <dbl> <dbl>
## 1 25-34 0.103 11.8
## 2 35-44 0.923 19.5
## 3 45-54 4.72 17.1
## 4 55-64 7.79 17.0
## 5 65-74 5.64 10.9
## 6 75+ 1.33 3.18
The dplyr version of the table with proportions by column is:
options(digits = 3)
df %>%
count(age_grp, cc_grp) %>%
group_by(cc_grp) %>%
mutate(prop = n / sum(n)) %>%
select(-n) %>%
spread(cc_grp, prop, fill = 0)
## # A tibble: 6 x 3
## age_grp case control
## <chr> <dbl> <dbl>
## 1 25-34 0.005 0.148
## 2 35-44 0.045 0.245
## 3 45-54 0.23 0.215
## 4 55-64 0.38 0.214
## 5 65-74 0.275 0.137
## 6 75+ 0.065 0.04
Also, there is a package called janitor, full of nice functions. One of them, tabyl, makes such a cross-table with a simple syntax:
df %>%
  janitor::tabyl(age_grp, cc_grp) # cross-table of counts; pipe the result into janitor::adorn_percentages("col") (or "row") to get proportions
## age_grp case control
## 25-34 1 115
## 35-44 9 190
## 45-54 46 167
## 55-64 76 166
## 65-74 55 106
## 75+ 13 31
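In janitor, percentages are added to a tabyl with the adorn_* helpers rather than through an argument of tabyl itself. This is a sketch assuming the janitor package is installed; the row-level data frame people is rebuilt here from the age counts above so that the block runs on its own.

```r
library(janitor)

age_levels <- c("25-34", "35-44", "45-54", "55-64", "65-74", "75+")
n_case     <- c(1, 9, 46, 76, 55, 13)
n_control  <- c(115, 190, 167, 166, 106, 31)

# One row per person, rebuilt from the counts printed above.
people <- data.frame(
  age_grp = c(rep(age_levels, n_case), rep(age_levels, n_control)),
  cc_grp  = rep(c("case", "control"), c(sum(n_case), sum(n_control)))
)

counts   <- tabyl(people, age_grp, cc_grp)     # plain cross-tabulation of counts
col_prop <- adorn_percentages(counts, "col")   # proportions within each column

# Format the proportions as percentages with one decimal for display.
adorn_pct_formatting(col_prop, digits = 1)
```

Changing "col" to "row" or "all" switches the denominator to the row totals or the grand total.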
For alcohol:
df %>%
  # keep 120+ as the last level; sorted as text it would land between 0-39 and 40-79
  mutate(alc_grp = fct_relevel(alc_grp, "120+", after = Inf)) %>%
  group_by(alc_grp, cc_grp) %>%
  summarise(n = n()) %>%
  spread(cc_grp, n)
## `summarise()` regrouping output by 'alc_grp' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: alc_grp [4]
## alc_grp case control
## <fct> <int> <int>
## 1 0-39 29 386
## 2 40-79 75 280
## 3 80-119 51 87
## 4 120+ 45 22
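Counts like these are what the odds ratio of a case-control study is computed from. As an illustration, the sketch below collapses alcohol into two levels, treating 80 g/day or more as exposed (a cut-off chosen here for the example), and computes the odds ratio with a Woolf (log-based) 95% confidence interval. The variable names are ours.

```r
# Counts from the alcohol table above, collapsed at 80 g/day.
exposed_cases      <- 51 + 45     # 80-119 and 120+
exposed_controls   <- 87 + 22
unexposed_cases    <- 29 + 75     # 0-39 and 40-79
unexposed_controls <- 386 + 280

# Odds ratio: odds of exposure among cases / odds of exposure among controls.
odds_ratio <- (exposed_cases * unexposed_controls) /
  (unexposed_cases * exposed_controls)

# Woolf's method: standard error of log(OR) from the four cell counts.
se_log_or <- sqrt(1 / exposed_cases + 1 / exposed_controls +
                  1 / unexposed_cases + 1 / unexposed_controls)
ci_95 <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(odds_ratio, 2)   # 5.64
round(ci_95, 2)
```

This is a crude estimate: it ignores age and tobacco, which a full analysis would adjust for.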
and tobacco:
df %>%
group_by(tob_grp, cc_grp) %>%
summarise(n = n()) %>%
spread(cc_grp, n)
## `summarise()` regrouping output by 'tob_grp' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: tob_grp [4]
## tob_grp case control
## <chr> <int> <int>
## 1 0- 9 78 447
## 2 10-19 58 178
## 3 20-29 33 99
## 4 30+ 31 51
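The same pipeline is repeated for age, alcohol and tobacco, so it could be wrapped in a small function. This is a sketch, assuming dplyr 1.0 or later and tidyr 1.1 or later; cc_table is a hypothetical helper name and toy is made-up data with the same column names as df.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: cross-tabulate any grouping column against cc_grp.
# {{ var }} forwards the unquoted column name to count().
cc_table <- function(data, var) {
  data %>%
    count({{ var }}, cc_grp) %>%
    pivot_wider(names_from = cc_grp, values_from = n, values_fill = 0L)
}

# Made-up rows, only to show the call.
toy <- tibble(
  tob_grp = c("0- 9", "0- 9", "10-19", "30+", "30+", "30+"),
  cc_grp  = c("case", "control", "control", "case", "case", "control")
)

res <- cc_table(toy, tob_grp)
res
```

With the real data the calls would be cc_table(df, age_grp), cc_table(df, alc_grp) and cc_table(df, tob_grp). The values_fill argument writes 0 where a combination never occurs, instead of NA.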
So, in this post we have recreated the table from the case-control study of (o)esophageal cancer in Ille-et-Vilaine, France, presented in the Breslow and Day textbook.