Calculating incidence or prevalence requires first identifying an
appropriate denominator population. To find such a denominator
population (or multiple denominator populations) we can use the
generateDenominatorCohortSet() function. This function identifies the
time during which people in the database satisfy a set of criteria
related to the study period and to individuals' age, sex, and amount of
prior observed history.
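When using generateDenominatorCohortSet(), individuals will enter a
denominator population on the latest of the following dates: the study
start date, the date on which they have sufficient prior observation,
and the date on which they reach the minimum age.
They will then exit on the earliest of the following dates: the study end date, the date on which their observation period ends, and the last day on which they are at the maximum age.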
Let's go through a few examples to make this logic a little more concrete.
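As a first, minimal sketch of the entry and exit rule in base R (not the package's implementation), the entry date is the latest of the candidate start dates and the exit date is the earliest of the candidate end dates. The person, the dates, and the `add_years()` helper are all hypothetical, and treating the prior observation date as the observation start plus the required number of days is an assumption.

```r
# Hypothetical helper: shift a date by a whole number of years (base R only)
add_years <- function(date, years) {
  seq(date, by = paste(years, "years"), length.out = 2)[2]
}

# Hypothetical person and study settings
date_of_birth <- as.Date("1950-06-15")
obs_start     <- as.Date("2007-09-01")
obs_end       <- as.Date("2018-03-31")
study_start   <- as.Date("2008-01-01")
study_end     <- as.Date("2019-12-31")
days_prior    <- 365
age_range     <- c(18, 65)

# Entry: the latest of study start, sufficient prior observation, minimum age
entry <- max(c(
  study_start,
  obs_start + days_prior,
  add_years(date_of_birth, age_range[1])
))

# Exit: the earliest of study end, end of observation, last day at maximum age
exit <- min(c(
  study_end,
  obs_end,
  add_years(date_of_birth, age_range[2] + 1) - 1
))

entry
exit
```

For this hypothetical person the prior history requirement determines entry (2008-08-31) and the age upper limit determines exit (2016-06-14).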
The simplest case is that no study start and end dates are specified, no prior history requirement is imposed, and no age or sex criteria are applied. In this case individuals will enter the denominator population once they have entered the database (start of observation period) and will leave when they exit the database (end of observation period). Note that in some databases a person can have multiple observation periods, in which case their contribution of person-time would look like that of the last person below.
If we specify a study start and end date then only observation time during this period will be included.
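A small base R sketch of this restriction, using a hypothetical table of observation periods (person 3 has two of them), clips each period to the study window and drops periods that fall entirely outside it. The column names and dates are made up for illustration.

```r
# Hypothetical observation periods; person 3 has two separate periods
obs <- data.frame(
  person_id = c(1, 2, 3, 3),
  obs_start = as.Date(c("2005-03-01", "2010-02-01", "2006-01-01", "2009-06-01")),
  obs_end   = as.Date(c("2012-07-31", "2015-01-01", "2008-05-31", "2011-12-31"))
)
study_start <- as.Date("2008-01-01")
study_end   <- as.Date("2009-12-31")

# Clip every observation period to the study window
obs$start <- pmax(obs$obs_start, study_start)
obs$end   <- pmin(obs$obs_end, study_end)

# Keep only periods that overlap the study window
included <- obs[obs$start <= obs$end, ]

# Days contributed, counting both the start and the end day
included$days <- as.integer(included$end - included$start) + 1L
included[, c("person_id", "start", "end", "days")]
```

Person 2 has no observation time during the study period and so contributes nothing, while person 3 contributes two separate rows.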
If we also add some requirement of prior history then somebody will only contribute time at risk once this is reached.
Lastly we can also impose age and sex criteria, and now individuals will only contribute time when they also satisfy these criteria. A person's sex is not shown in the figure below, but we could also stratify a denominator population by it.
generateDenominatorCohortSet() is the function we use to identify a set
of denominator populations. To demonstrate its use, let's load the
IncidencePrevalence package (along with a couple of packages to help
with subsequent plots) and generate 500 example patients using the
mockIncidencePrevalenceRef() function.
library(IncidencePrevalence)
library(ggplot2)
library(tidyr)
library(dplyr) # needed for filter(), collect(), mutate(), left_join() and glimpse() below
cdm <- mockIncidencePrevalenceRef(sampleSize = 500)
We can get a denominator population without including any particular requirements like so:
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
cohortDateRange = as.Date(c(NA,NA)),
ageGroup = list(c(0, 150)),
sex = "Both",
daysPriorObservation = 0
)
cdm$denominator
#> # Source: table<main.denominator> [?? x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 1986-05-02 2060-06-10
#> 2 1 3 1994-09-05 2048-01-09
#> 3 1 4 1961-02-23 2040-11-30
#> 4 1 6 1973-10-02 2040-12-18
#> 5 1 7 1959-05-19 2027-10-28
#> 6 1 8 1989-04-22 2073-11-28
#> 7 1 9 1983-10-23 2039-05-09
#> 8 1 10 1974-04-19 2063-08-04
#> 9 1 11 1954-12-16 2043-04-22
#> 10 1 12 1969-04-18 2065-01-23
#> # ℹ more rows
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5"))
#> # Source: SQL [5 x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 1986-05-02 2060-06-10
#> 2 1 3 1994-09-05 2048-01-09
#> 3 1 4 1961-02-23 2040-11-30
#> 4 1 5 1986-08-04 2062-12-11
#> 5 1 1 1976-03-17 2039-11-26
Let's have a look at the included time of the first five patients.
We can also plot a histogram of start and end dates of the 500 simulated patients.
We can specify a study period like so:
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
cohortDateRange = c(as.Date("2008-01-01"), as.Date("2009-12-31")),
ageGroup = list(c(0, 150)),
sex = "Both",
daysPriorObservation = 0
)
cdm$denominator
#> # Source: table<main.denominator> [?? x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 2008-01-01 2009-12-31
#> 2 1 3 2008-01-01 2009-12-31
#> 3 1 4 2008-01-01 2009-12-31
#> 4 1 6 2008-01-01 2009-12-31
#> 5 1 7 2008-01-01 2009-12-31
#> 6 1 8 2008-01-01 2009-12-31
#> 7 1 9 2008-01-01 2009-12-31
#> 8 1 10 2008-01-01 2009-12-31
#> 9 1 11 2008-01-01 2009-12-31
#> 10 1 12 2008-01-01 2009-12-31
#> # ℹ more rows
cohortCount(cdm$denominator)
#> # A tibble: 1 × 3
#> cohort_definition_id number_records number_subjects
#> <int> <int> <int>
#> 1 1 339 339
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5"))
#> # Source: SQL [5 x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 2008-01-01 2009-12-31
#> 2 1 3 2008-01-01 2009-12-31
#> 3 1 4 2008-01-01 2009-12-31
#> 4 1 5 2008-01-01 2009-12-31
#> 5 1 1 2008-01-01 2009-12-31
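Now everyone's time is restricted to the study period, and anyone without observation time during the study period is no longer included. In the output above, 339 of the original 500 simulated patients are included, and the first five patients all remain because their observation periods span the whole study period.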
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5")) %>%
collect() %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
ggplot() +
geom_point(aes(x = value, y = subject_id)) +
geom_line(aes(x = value, y = subject_id)) +
theme_minimal() +
xlab("Year")
We can also plot a histogram of start and end dates and we can see that now most people enter at the start of the study period and leave at the end.
We can add a requirement of prior history:
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
cohortDateRange = c(as.Date("2008-01-01"), as.Date("2009-12-31")),
ageGroup = list(c(0, 150)),
sex = "Both",
daysPriorObservation = 365
)
cdm$denominator
#> # Source: table<main.denominator> [?? x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 2008-01-01 2009-12-31
#> 2 1 3 2008-01-01 2009-12-31
#> 3 1 4 2008-01-01 2009-12-31
#> 4 1 6 2008-01-01 2009-12-31
#> 5 1 7 2008-01-01 2009-12-31
#> 6 1 8 2008-01-01 2009-12-31
#> 7 1 9 2008-01-01 2009-12-31
#> 8 1 10 2008-01-01 2009-12-31
#> 9 1 11 2008-01-01 2009-12-31
#> 10 1 12 2008-01-01 2009-12-31
#> # ℹ more rows
cohortCount(cdm$denominator)
#> # A tibble: 1 × 3
#> cohort_definition_id number_records number_subjects
#> <int> <int> <int>
#> 1 1 339 339
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5"))
#> # Source: SQL [5 x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 1 2 2008-01-01 2009-12-31
#> 2 1 3 2008-01-01 2009-12-31
#> 3 1 4 2008-01-01 2009-12-31
#> 4 1 5 2008-01-01 2009-12-31
#> 5 1 1 2008-01-01 2009-12-31
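Now people only start contributing time once they have 365 days of prior observation. In the output above the count is unchanged at 339, and the first five patients are all still included, since each of them entered the database well before the study period.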
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5")) %>%
collect() %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
ggplot() +
geom_point(aes(x = value, y = subject_id)) +
geom_line(aes(x = value, y = subject_id)) +
theme_minimal() +
xlab("Year")
The histograms of start and end dates now look like this:
In addition to all the above we could also add some requirements around age and sex. One thing to note is that the age upper limit is inclusive: a person contributes time up to the day before they reach the age upper limit + 1 year. For instance, when the upper limit is 65, we will include time from a person up to and including the day before their 66th birthday.
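To check this reading of the upper limit, a small base R sketch can compute a person's age in completed years on a given date. The `age_on()` helper and the date of birth are hypothetical and are not part of the package.

```r
# Hypothetical helper: age in completed years on a given date
age_on <- function(date_of_birth, date) {
  d <- as.POSIXlt(date)
  b <- as.POSIXlt(date_of_birth)
  # Has the birthday already happened in the calendar year of `date`?
  had_birthday <- (d$mon > b$mon) | (d$mon == b$mon & d$mday >= b$mday)
  (d$year - b$year) - ifelse(had_birthday, 0, 1)
}

date_of_birth <- as.Date("1950-06-15")

# Last day included with an upper limit of 65: the day before the 66th birthday
age_on(date_of_birth, as.Date("2016-06-14"))
# First day excluded: the 66th birthday itself
age_on(date_of_birth, as.Date("2016-06-15"))
```

The person is still 65 on 2016-06-14 and turns 66 on the following day, so time up to and including 2016-06-14 falls within an age group of 18 to 65.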
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
cohortDateRange = c(as.Date("2008-01-01"), as.Date("2009-12-31")),
ageGroup = list(c(18, 65)),
sex = "Female",
daysPriorObservation = 365
)
cdm$denominator %>%
glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <chr> "6", "20", "24", "38", "39", "44", "54", "61", "6…
#> $ cohort_start_date <date> 2008-01-01, 2008-01-01, 2008-01-01, 2008-01-01, …
#> $ cohort_end_date <date> 2009-12-31, 2009-12-31, 2009-12-31, 2009-12-31, …
cohortCount(cdm$denominator)
#> # A tibble: 1 × 3
#> cohort_definition_id number_records number_subjects
#> <int> <int> <int>
#> 1 1 109 109
cdm$denominator %>%
filter(subject_id %in% c("1", "2", "3", "4", "5"))
#> # Source: SQL [0 x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> # ℹ 4 variables: cohort_definition_id <int>, subject_id <chr>,
#> # cohort_start_date <date>, cohort_end_date <date>
Now none of the original first five are included, and the cohort count above shows that we are including 109 of the original 500 simulated patients.
The histograms of start and end dates now look like this:
More than one age, sex and prior history requirement can be specified at the same time. First, we can take a look at having two age groups. We can see below that individuals who have their 41st birthday during the study period will move from the first cohort (age_group: 0;40) to the second (age_group: 41;100) on this day.
cdm <- mockIncidencePrevalenceRef(sampleSize = 500,
earliestObservationStartDate = as.Date("2000-01-01"),
latestObservationStartDate = as.Date("2005-01-01"),
minDaysToObservationEnd = 10000,
earliestDateOfBirth = as.Date("1960-01-01"),
latestDateOfBirth = as.Date("1980-01-01"))
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
ageGroup = list(
c(0, 40),
c(41, 100)
),
sex = "Both",
daysPriorObservation = 0
)
cdm$denominator %>%
filter(subject_id %in% !!as.character(1:30)) %>%
collect() %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
mutate(cohort_definition_id = as.character(cohort_definition_id),
subject_id = factor(as.numeric(subject_id))) %>%
ggplot(aes(x = subject_id, y = value, colour = cohort_definition_id)) +
geom_point(position = position_dodge(width = 0.5)) +
geom_line(position = position_dodge(width = 0.5)) +
theme_minimal() +
theme(legend.position = "top") +
ylab("Year") +
coord_flip()
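The hand-over between the two age groups can also be sketched for a single hypothetical person in base R. This only illustrates the date arithmetic and is not how the package computes it; the `add_years()` helper, the dates, and the label format are assumptions.

```r
# Hypothetical helper: shift a date by a whole number of years
add_years <- function(date, years) {
  if (years == 0) return(date)
  seq(date, by = paste(years, "years"), length.out = 2)[2]
}

# Hypothetical person
date_of_birth <- as.Date("1968-05-20")
obs_start     <- as.Date("2000-01-01")
obs_end       <- as.Date("2020-12-31")

age_groups <- list(c(0, 40), c(41, 100))

# One row per age group: time between reaching the lower limit and the
# day before exceeding the upper limit, restricted to the observation period
rows <- lapply(age_groups, function(g) {
  data.frame(
    age_group         = paste(g[1], "to", g[2]),
    cohort_start_date = max(c(obs_start, add_years(date_of_birth, g[1]))),
    cohort_end_date   = min(c(obs_end, add_years(date_of_birth, g[2] + 1) - 1))
  )
})
result <- do.call(rbind, rows)
result
```

This person leaves the first cohort on 2009-05-19, the day before their 41st birthday, and enters the second cohort on 2009-05-20.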
We can also generate sex-specific denominator cohorts.
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
ageGroup = list(
c(0, 40),
c(41, 100)
),
sex = c("Male", "Female", "Both"),
daysPriorObservation = 0
)
cdm$denominator %>%
filter(subject_id %in% !!as.character(1:15)) %>%
collect() %>%
left_join(cohortSet(cdm$denominator), by = "cohort_definition_id") %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
mutate(cohort_definition_id = as.character(cohort_definition_id),
subject_id = factor(as.numeric(subject_id))) %>%
ggplot(aes(x = subject_id, y = value, colour = cohort_definition_id)) +
facet_grid(sex ~ ., space = "free_y") +
geom_point(position = position_dodge(width = 0.5)) +
geom_line(position = position_dodge(width = 0.5)) +
theme_bw() +
theme(legend.position = "top") +
ylab("Year") +
coord_flip()
We could also specify multiple prior history requirements:
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
ageGroup = list(
c(0, 40),
c(41, 100)
),
sex = c("Male", "Female", "Both"),
daysPriorObservation = c(0, 365)
)
cdm$denominator %>%
filter(subject_id %in% !!as.character(1:15)) %>%
collect() %>%
left_join(cohortSet(cdm$denominator), by = "cohort_definition_id") %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
mutate(cohort_definition_id = as.character(cohort_definition_id),
subject_id = factor(as.numeric(subject_id))) %>%
ggplot(aes(x = subject_id, y = value, colour = cohort_definition_id)) +
facet_grid(sex + days_prior_observation ~ ., space = "free_y") +
geom_point(position = position_dodge(width = 0.5)) +
geom_line(position = position_dodge(width = 0.5)) +
theme_bw() +
theme(legend.position = "top") +
ylab("Year") +
coord_flip()
Note that setting requirementInteractions to FALSE means that, for each characteristic, only the first value of the other requirements (age, sex, and prior history) is considered. In this case the order of the values matters: generally the first values will be the primary analysis settings, while subsequent values are for secondary analyses.
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
ageGroup = list(
c(0, 40),
c(41, 100)
),
sex = c("Male", "Female", "Both"),
daysPriorObservation = c(0, 365),
requirementInteractions = FALSE
)
cdm$denominator %>%
filter(subject_id %in% !!as.character(1:15)) %>%
collect() %>%
left_join(cohortSet(cdm$denominator), by = "cohort_definition_id") %>%
pivot_longer(cols = c(
"cohort_start_date",
"cohort_end_date"
)) %>%
mutate(cohort_definition_id = as.character(cohort_definition_id),
subject_id = factor(as.numeric(subject_id))) %>%
ggplot(aes(x = subject_id, y = value, colour = cohort_definition_id)) +
facet_grid(sex + days_prior_observation ~ ., space = "free_y") +
geom_point(position = position_dodge(width = 0.5)) +
geom_line(position = position_dodge(width = 0.5)) +
theme_bw() +
theme(legend.position = "top") +
ylab("Year") +
coord_flip()
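To see how requirementInteractions changes the number of cohorts, the two sets of combinations can be sketched in base R. This reflects one reading of the description above rather than the package's internal code, and the object names are illustrative.

```r
age_group              <- c("0 to 40", "41 to 100")
sex                    <- c("Male", "Female", "Both")
days_prior_observation <- c(0, 365)

# With interactions: every combination of the three requirements
full <- expand.grid(
  age_group = age_group,
  sex = sex,
  days_prior_observation = days_prior_observation,
  stringsAsFactors = FALSE
)

# Without interactions: vary one requirement at a time, holding the
# other two at their first value, then drop the repeated first-value row
one_at_a_time <- unique(rbind(
  expand.grid(age_group = age_group, sex = sex[1],
              days_prior_observation = days_prior_observation[1],
              stringsAsFactors = FALSE),
  expand.grid(age_group = age_group[1], sex = sex,
              days_prior_observation = days_prior_observation[1],
              stringsAsFactors = FALSE),
  expand.grid(age_group = age_group[1], sex = sex[1],
              days_prior_observation = days_prior_observation,
              stringsAsFactors = FALSE)
))

nrow(full)
nrow(one_at_a_time)
```

Under this reading the settings above give 12 cohorts with interactions and 5 without, with the first values (age 0 to 40, Male, 0 days) acting as the reference combination.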
generateDenominatorCohortSet() will generate a table with the
denominator population, which includes information on all the
individuals who fulfill the given criteria at any point during the
study period. It also includes the specific start and end dates between
which individuals contributed to the denominator population
(cohort_start_date and cohort_end_date). Each patient is recorded in a
separate row. For databases that allow individuals to have multiple
non-overlapping observation periods, there is one row for each patient
and observation period.
Considering the following example, we can see:
cdm <- generateDenominatorCohortSet(
cdm = cdm,
name = "denominator",
cohortDateRange = c(as.Date("2008-01-01"), as.Date("2009-12-31")),
ageGroup = list(
c(0, 18),
c(19, 100)
),
sex = c("Male", "Female"),
daysPriorObservation = c(0, 365)
)
head(cdm$denominator, 8)
#> # Source: SQL [8 x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <chr> <date> <date>
#> 1 5 2 2008-01-01 2009-12-31
#> 2 5 3 2008-01-01 2009-12-31
#> 3 5 4 2008-01-01 2009-12-31
#> 4 5 9 2008-01-01 2009-12-31
#> 5 5 10 2008-01-01 2009-12-31
#> 6 5 12 2008-01-01 2009-12-31
#> 7 5 14 2008-01-01 2009-12-31
#> 8 5 15 2008-01-01 2009-12-31
The output table will have several attributes. With
CDMConnector::cohortSet() we can see the options used when defining the
set of denominator populations. More than one age, sex and prior
history requirement can be specified at the same time, and each
combination of these variables results in a different cohort, each
with a corresponding cohort_definition_id. In the above example,
we identified 8 different cohorts:
cohortSet(cdm$denominator) %>%
glimpse()
#> Rows: 8
#> Columns: 9
#> $ cohort_definition_id <int> 1, 2, 3, 4, 5, 6, 7, 8
#> $ cohort_name <chr> "denominator_cohort_1", "denominator_cohor…
#> $ age_group <chr> "0 to 18", "0 to 18", "0 to 18", "0 to 18"…
#> $ sex <chr> "Male", "Male", "Female", "Female", "Male"…
#> $ days_prior_observation <dbl> 0, 365, 0, 365, 0, 365, 0, 365
#> $ start_date <date> 2008-01-01, 2008-01-01, 2008-01-01, 2008-0…
#> $ end_date <date> 2009-12-31, 2009-12-31, 2009-12-31, 2009-1…
#> $ target_cohort_definition_id <int> NA, NA, NA, NA, NA, NA, NA, NA
#> $ target_cohort_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA
With cohortCount() we can see the number of individuals who entered
each study cohort:
cohortCount(cdm$denominator) %>%
glimpse()
#> Rows: 8
#> Columns: 3
#> $ cohort_definition_id <int> 1, 2, 3, 4, 5, 6, 7, 8
#> $ number_records <int> 0, 0, 0, 0, 233, 233, 267, 267
#> $ number_subjects <int> 0, 0, 0, 0, 233, 233, 267, 267
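The difference between number_records and number_subjects matters when a person has more than one row in a cohort, for example because of multiple observation periods. A base R sketch on a small, made-up cohort table shows how the two counts could be derived; it is not the package's own code.

```r
# Made-up cohort table: subject "2" has two rows in cohort 1
cohort <- data.frame(
  cohort_definition_id = c(1, 1, 1, 2, 2),
  subject_id           = c("1", "2", "2", "1", "3")
)

# Rows per cohort
number_records <- tapply(
  cohort$subject_id, cohort$cohort_definition_id, length
)
# Distinct people per cohort
number_subjects <- tapply(
  cohort$subject_id, cohort$cohort_definition_id,
  function(x) length(unique(x))
)

data.frame(
  cohort_definition_id = as.integer(names(number_records)),
  number_records       = as.integer(number_records),
  number_subjects      = as.integer(number_subjects)
)
```

Here cohort 1 has three records but only two subjects, whereas in the output above the two counts coincide for every cohort.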
With CDMConnector::cohortAttrition() we can see the number of
individuals in the database who were excluded from entering a given
denominator population, along with the reason (such as missing crucial
information or not satisfying the sex or age criteria required, among
others):
cohortAttrition(cdm$denominator) %>%
glimpse()
#> Rows: 72
#> Columns: 7
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ number_records <int> 500, 500, 500, 500, 500, 500, 500, 233, 0, 500, 5…
#> $ number_subjects <int> 500, 500, 500, 500, 500, 500, 500, 233, 0, 500, 5…
#> $ reason_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 10, 1, 2, 3, 4, 5, 6, 7, …
#> $ reason <chr> "Starting population", "Missing year of birth", "…
#> $ excluded_records <int> NA, 0, 0, 0, 0, 0, 0, 267, 233, NA, 0, 0, 0, 0, 0…
#> $ excluded_subjects <int> NA, 0, 0, 0, 0, 0, 0, 267, 233, NA, 0, 0, 0, 0, 0…
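The excluded counts in the attrition table are the step-by-step drops in the remaining counts. Using the number_subjects values shown above for the first cohort, this relationship can be reproduced in base R:

```r
# number_subjects for cohort_definition_id 1, copied from the output above
number_subjects <- c(500, 500, 500, 500, 500, 500, 500, 233, 0)

# Each exclusion is the drop from the previous step; the starting
# population has no previous step, hence NA
excluded_subjects <- c(NA, -diff(number_subjects))
excluded_subjects
```

This gives NA, six zeros, 267 and 233, matching the excluded_subjects column shown above for that cohort.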