Help for package waou

Title:

Weighting All of Us

Version:

0.1.0

Description:

Utilities for using a probability sample to reweight prevalence estimates calculated from the All of Us Research Program. Weighted estimates will still not be representative of the general U.S. population. However, they provide an early indication of how sampling bias in the All of Us sample may distort unweighted estimates.

License:

AGPL (≥ 3)

Encoding:

UTF-8

RoxygenNote:

7.3.2

Suggests:

testthat (≥ 3.0.0)

Imports:

glmnet, dplyr, stringr, stats, glue, mice, nonprobsvy, survey, ggplot2, purrr

Config/testthat/edition:

Depends:

R (≥ 3.5)

LazyData:

true

NeedsCompilation:

Packaged:

2025-09-10 20:51:53 UTC; mbrannock

Author:

Daniel Brannock [aut, cre], Mahmoud Elkasabi [aut]

Maintainer:

Daniel Brannock <mbrannock@rti.org>

Repository:

CRAN

Date/Publication:

2025-09-15 09:10:02 UTC

NHIS Adult Data 2023

Description

Raw survey results from adults for the 2023 National Health Interview Survey (NHIS). This is public-use data. Documentation for the dataset can be found at the source link. NHIS is conducted by the National Center for Health Statistics within the Centers for Disease Control and Prevention.

Usage

adult2023

Format

`adult2023`

A data frame with 29,522 rows and 647 columns.

Source

https://www.cdc.gov/nchs/nhis/documentation/2023-nhis.html

Synthetic All of Us Data

Description

Synthetic data intended to show how NHIS survey results can be used to generate weights for All of Us.

Usage

aou_synthetic

Format

Data frame with columns

SEX_A_R_I: Sex: 0 (female), 1 (male)
AGEP_A_R_I: Age in years: 1 (18-29), 2 (30-39), 3 (40-49), ..., 6 (70+)
HISPALLP_A_R_I: Race/ethnicity: 1 (Hispanic), 2 (White), 3 (Black/African American), 4 (Other)
ORIENT_A_R_I: Sexual orientation: 0 (Bisexual, Gay, or Lesbian), 1 (Straight)
HICOV_A_R_I: Health insurance: 0 (Not insured), 1 (Insured)
EDUCP_A_R_I: Education: 1 (Less than HS), 2 (HS or GED), 3 (Some college), 4 (College graduate), 5 (Advanced degree)
REGION_R_I: Region: 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
EMPLASTWK_A_R_I: Employment: 0 (Unemployed), 1 (Employed)
HOUTENURE_A_R_I: Home ownership: 0 (Does not own home), 1 (Owns home)
MARITAL_A_R_I: Marital status: 0 (Not married), 1 (Married)
DEPEV_A_R_I: Depression: 0 (No diagnosis of depression), 1 (Has diagnosis of depression)
DEMENEV_A_R_I: Dementia: 0 (No diagnosis of dementia), 1 (Has diagnosis of dementia)
DIBTYPE_A_R_I: Type 2 diabetes: 0 (No diagnosis of type 2 diabetes), 1 (Has diagnosis of type 2 diabetes)

Source

Generated from data-raw/aou_synthetic.R.

Calculate Weights

Description

Calculate weights using three methods: IPW, Calibration, and Calibration+IPW

Usage

calculate_weights(
  sample_a,
  sample_b,
  method,
  aux_variables,
  study_variables,
  weight,
  strata,
  psu
)

Arguments

sample_a

data.frame with representative sample

sample_b

data.frame with All of Us sample

method

string or string vector specifying the weighting method(s) to use: one or more of "ipw", "cal", and "ipw+cal"

aux_variables

character vector with names of calibration variables

study_variables

character vector with names of study variables

weight

character vector with name of the weight variable in sample_a

strata

character vector with name of the strata variable in sample_a

psu

character vector with name of the primary sampling units variable in sample_a

Details

Calculates weights intended to reduce the sampling bias present in All of Us. Three versions of weights are calculated from different reweighting strategies: IPW, Calibration, and Calibration+IPW.
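
As a rough illustration of the IPW idea only (a sketch under stated assumptions, not the package's implementation): stack the reference sample and the convenience sample, model membership in the convenience sample with a weighted logistic regression, and weight each convenience-sample record by the inverse odds of its estimated propensity. The toy data, the names x, d, z, and ipw, and the inverse-odds weight formula are all assumptions chosen for this example.

```r
set.seed(42)

# Hypothetical toy data with one binary auxiliary variable x
ref <- data.frame(x = rbinom(500, 1, 0.5), d = 100)  # reference sample, design weight d
vol <- data.frame(x = rbinom(500, 1, 0.8), d = 1)    # convenience sample, over-represents x = 1

# Stack both samples; z flags membership in the convenience sample
stacked <- rbind(cbind(ref, z = 0), cbind(vol, z = 1))

# Weighted logistic regression for the propensity of being in the convenience sample
fit <- glm(z ~ x, family = quasibinomial(), data = stacked, weights = d)
p <- predict(fit, newdata = vol, type = "response")

# Inverse-odds pseudo-weights (assumed formula for this sketch)
vol$ipw <- (1 - p) / p

unweighted <- mean(vol$x)                     # share of x = 1 before reweighting
reweighted <- weighted.mean(vol$x, vol$ipw)   # share of x = 1 after reweighting
reference  <- mean(ref$x)                     # share of x = 1 in the reference sample
```

Because the propensity model is saturated in this toy case, the reweighted share of x = 1 matches the reference sample's share, while the unweighted share does not.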

Value

list of data.frame with added (or replaced) weight columns and survey designs

Examples

# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)

# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)

# Calculate IPW weights using NHIS data and applied to All of Us
weights_df <- calculate_weights(
  nhis_dummied, 
  aou_dummied, 
  'ipw',
  paste0(calVars, '_I'), 
  paste0(stuVars, '_I'), 
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)

Create Dummy Variables

Description

Create dummy variables of factors and character vectors in a data frame

Usage

dummies(input, vars)

Arguments

input

data.frame with calibration variables

vars

character vector with names of variables requiring dummy encoding

Value

data.frame with the new dummy variables
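
For readers unfamiliar with dummy encoding, the following base R sketch shows one common way to expand a factor into one indicator column per level. It uses model.matrix from the stats package on a hypothetical toy data frame; whether dummies() works this way internally is an assumption, not something this documentation states.

```r
# Hypothetical toy data: an age-group factor with three levels
toy <- data.frame(AGEP_A_R_I = factor(c(1, 2, 3, 1)))

# Dropping the intercept (- 1) yields one indicator column per factor level
dummy_cols <- model.matrix(~ AGEP_A_R_I - 1, data = toy)

# Column names are the variable name followed by the level
colnames(dummy_cols)
```

Each row has exactly one indicator set to 1, and the resulting names (variable name plus level) follow the same pattern as names such as AGEP_A_R_I1 used in the plot_prevalence example.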

Examples

calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")

# First impute
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)

# Then create dummy variables
nhis_vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_dummied <- dummies(nhis_imputed, vars=paste0(nhis_vars_dummies, '_I'))

Extract population totals

Description

Extract weighted population totals for the calibration variables from a representative sample

Usage

extract_totals(sample, vars, weight)

Arguments

sample

data.frame with representative sample

vars

character vector with names of calibration variables

weight

character vector with name of the weight variable

Details

Computes weighted totals for each of the variables in vars, using the weights named in weight. When sample is a representative sample, these totals estimate the corresponding population totals.

Value

weighted population totals for the variables named in vars
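
To make the notion of a weighted population total concrete, here is a minimal base R sketch on hypothetical toy data. It does not call extract_totals and makes no claim about that function's return format; the column names SEX and w are assumptions for illustration.

```r
# Hypothetical toy sample: a two-level factor and a survey weight
toy <- data.frame(
  SEX = factor(c(0, 1, 1, 1)),
  w   = c(10, 20, 30, 40)
)

# The estimated population total for each level is the sum of the weights in that level
totals <- tapply(toy$w, toy$SEX, sum)
totals[["0"]]  # 10
totals[["1"]]  # 90
```

The level totals add up to the sum of all weights, which is the estimated population size.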

Impute Data

Description

Add imputed data columns to existing data.frame

Usage

impute_data(
  input,
  vars,
  keep_vars = c(),
  return_mice = FALSE,
  impute_constant = NULL
)

Arguments

input

data.frame with calibration variables

vars

character vector with names of variables to be imputed

keep_vars

character vector with names of additional variables that should be retained

return_mice

boolean for whether to return mice object (for looking at logged events)

impute_constant

numeric if not NULL will impute with provided constant

Details

For each of the specified variables, use all variables to predict missing values. Populate actual (when available) and imputed values into new columns whose names are the original variable names with _I appended.

If you choose to return the mice object with return_mice, the function output will be a list that includes the final data.frame and the mice output.
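
The _I naming convention can be illustrated with a small base R sketch of constant imputation, loosely mirroring what the impute_constant argument describes. This is an assumption-laden illustration on hypothetical toy data, not the package's code, and it does not use mice.

```r
# Hypothetical toy data with missing values
toy <- data.frame(SEX = c(0, NA, 1), AGE = c(2, 3, NA))
vars <- c("SEX", "AGE")
constant <- 0  # value used to fill missing entries (assumed for this sketch)

# Keep the original columns and append an imputed copy named <variable>_I
for (v in vars) {
  toy[[paste0(v, "_I")]] <- ifelse(is.na(toy[[v]]), constant, toy[[v]])
}

names(toy)  # "SEX" "AGE" "SEX_I" "AGE_I"
```

Observed values are carried over unchanged, so the _I columns differ from the originals only where a value was missing.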

Value

data.frame with imputed versions of variables

Examples

calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")

nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)

Processed NHIS Data

Description

Survey data from NHIS that has been sampled down, recoded, and subset.

Usage

nhis_processed

Format

Data frame with columns

SEX_A_R_I: Sex: 0 (female), 1 (male)
AGEP_A_R_I: Age in years: 1 (18-29), 2 (30-39), 3 (40-49), ..., 6 (70+)
HISPALLP_A_R_I: Race/ethnicity: 1 (Hispanic), 2 (White), 3 (Black/African American), 4 (Other)
ORIENT_A_R_I: Sexual orientation: 0 (Bisexual, Gay, or Lesbian), 1 (Straight)
HICOV_A_R_I: Health insurance: 0 (Not insured), 1 (Insured)
EDUCP_A_R_I: Education: 1 (Less than HS), 2 (HS or GED), 3 (Some college), 4 (College graduate), 5 (Advanced degree)
REGION_R_I: Region: 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
EMPLASTWK_A_R_I: Employment: 0 (Unemployed), 1 (Employed)
HOUTENURE_A_R_I: Home ownership: 0 (Does not own home), 1 (Owns home)
MARITAL_A_R_I: Marital status: 0 (Not married), 1 (Married)
DEPEV_A_R_I: Depression: 0 (No self-reported depression), 1 (Has self-reported depression)
DEMENEV_A_R_I: Dementia: 0 (No self-reported dementia), 1 (Has self-reported dementia)
DIBTYPE_A_R_I: Type 2 diabetes: 0 (No self-reported type 2 diabetes), 1 (Has self-reported type 2 diabetes)
PPSU: Person-level ID
PSTRAT: Stratification to be used as part of the survey design
WTFA_A: Weights used to assure representativeness of U.S. population (may not be valid for sampled data)

Source

Generated from data-raw/nhis_processed.

Visualize Prevalence Estimates

Description

Visualize prevalence estimates for calibration or outcome variables using different weighting methods.

Usage

plot_prevalence(df, mean, mean_se, method, cal_vars, cal_levels)

Arguments

df

data.frame with prevalence estimates to plot (e.g., the output of summarize_results_by_group)

mean

character name of mean prevalence estimate variable

mean_se

character name of the variable holding the standard error of the mean prevalence estimate

method

character name of the weighting method variable

cal_vars

character name of the variable with calibration variable names

cal_levels

character name of the variable with calibration variable levels

Details

Specify columns and weighting methodologies of interest to visualize.

Value

ggplot object

Examples

library(dplyr)
library(stringr)

# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)

# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)

# Calculate IPW weights using NHIS data and applied to All of Us
weights_df <- calculate_weights(
  nhis_dummied, 
  aou_dummied, 
  'ipw',
  paste0(calVars, '_I'), 
  paste0(stuVars, '_I'), 
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)

# Get IPW results by group
ipw_outcome_df <- summarize_results_by_group(
  weights_df, 
  paste0(stuVars, '_I'), 
  paste0(calVars, '_I'), 
  weight_col='ipw_weight', 
  label='AoU: IPW'
)

# Process data prior to plotting to make labels more readable
plot_df <- ipw_outcome_df %>%
  mutate(
    Name = case_when(
      group_var == 'SEX_A_R_I' & level_var == 1 ~ 'Sex: Male',
      group_var == 'SEX_A_R_I' & level_var == 0 ~ 'Sex: Female',
      group_var == 'AGEP_A_R_I1' & level_var == 1 ~ 'Age: 18-29',
      group_var == 'AGEP_A_R_I2' & level_var == 1 ~ 'Age: 30-39',
      group_var == 'AGEP_A_R_I3' & level_var == 1 ~ 'Age: 40-49',
      group_var == 'AGEP_A_R_I4' & level_var == 1 ~ 'Age: 50-59',
      group_var == 'AGEP_A_R_I5' & level_var == 1 ~ 'Age: 60-69',
      group_var == 'AGEP_A_R_I6' & level_var == 1 ~ 'Age: 70+',
      group_var == 'HISPALLP_A_R_I1' & level_var == 1 ~ 'Race/Eth: Hispanic',
      group_var == 'HISPALLP_A_R_I2' & level_var == 1 ~ 'Race/Eth: White',
      group_var == 'HISPALLP_A_R_I3' & level_var == 1 ~ 'Race/Eth: Black',
      group_var == 'HISPALLP_A_R_I4' & level_var == 1 ~ 'Race/Eth: Other',
      TRUE ~ group_var
    )
  ) %>%
  filter(str_detect(group_var, "SEX|AGEP|HISPALLP")) %>%
  filter(!str_detect(Name, "_")) %>%
  mutate(
    condition = case_when(
      outcome_var == 'DIBTYPE_A_R_I' ~ "Diabetes"
    ),
    VAR = case_when(
      str_detect(group_var, "SEX") ~ "Sex",
      str_detect(group_var, "AGE") ~ "Age",
      str_detect(group_var, "HISPALL") ~ "Race",
      str_detect(group_var, "EDUC") ~ "Educ"
    )
  )

# Plot
plot_prevalence(
  plot_df,
  'WMEAN',
  'SEMEAN',
  'Method',
  'VAR',
  'Name'
)

Select Variables

Description

Select variables relevant to propensity for inclusion in All of Us

Usage

select_variables(sample_a, sample_b, aux_variables)

Arguments

sample_a

data.frame of the reference probability sample (i.e., NHIS)

sample_b

data.frame of the All of Us sample

aux_variables

character vector with names of auxiliary variables

Details

Chooses which variables are meaningful in modeling propensity for inclusion in All of Us (sample_b) as compared to the general US population as represented by a reference probability sample (sample_a). This function assumes that variable names in both sample_a and sample_b are harmonized (i.e., definitions and names are the same across the two sources).

Value

character vector with selected variable names

Examples

# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)

# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)

# Define base variable names of auxiliary variables
aux_variables <- c(
  "SEX_A_R_I","AGEP_A_R_I", "HISPALLP_A_R_I","EDUCP_A_R_I",
  "REGION_R_I","ORIENT_A_R_I","HICOV_A_R_I",
  "EMPLASTWK_A_R_I","HOUTENURE_A_R_I","MARITAL_A_R_I"
)

# Provide All of Us and NHIS data to select variables
selected_base_vars <- select_variables(nhis_dummied, aou_dummied, aux_variables)

Summarize Results

Description

Get adjusted totals and prevalence for provided variables.

Usage

summarize_results(
  df,
  vars,
  weight_col = NULL,
  id_col = 1,
  strata_col = NULL,
  label = NULL
)

Arguments

df

data.frame with sample and weights (if using a survey design)

vars

string vector of variables to calculate prevalences for

weight_col

string specifying the column with weights or NULL for unweighted

id_col

string specifying the column with IDs for cluster-aware standard error (SE) calculations

strata_col

string specifying the column with strata for cluster-aware SE calculations

label

string label for weighting method

Value

data.frame with totals, means, and standard errors (if using a survey design)
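
As a minimal sketch of what an adjusted total and prevalence mean for a binary variable, the following base R example compares unweighted and weighted estimates on hypothetical toy data. It does not call summarize_results and omits standard errors; the column names y and w are assumptions.

```r
# Hypothetical toy data: binary outcome y and a weight w
toy <- data.frame(y = c(1, 0, 1, 0), w = c(1, 1, 3, 5))

# Weighted total: sum of the weights of records with y = 1
weighted_total <- sum(toy$w * toy$y)               # 4

# Weighted prevalence: weighted total divided by the sum of all weights
weighted_prev <- weighted_total / sum(toy$w)       # 0.4

# Unweighted prevalence for comparison
unweighted_prev <- mean(toy$y)                     # 0.5
```

Records with larger weights count for more, which is why the weighted prevalence can differ from the simple sample proportion.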

Examples

# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)

# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)

# Calculate IPW weights using NHIS data and applied to All of Us
weights_df <- calculate_weights(
  nhis_dummied, 
  aou_dummied, 
  'ipw',
  paste0(calVars, '_I'), 
  paste0(stuVars, '_I'), 
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)

results_ipw <- summarize_results(
  weights_df,
  c(paste0(calVars, '_I'), paste0(stuVars, '_I')), 
  weight_col='ipw_weight', 
  label='AoU: IPW'
)

Summarize Results by Group

Description

Get adjusted totals and prevalences for provided variables, grouped by specified variables.

Usage

summarize_results_by_group(
  df,
  vars,
  group_vars,
  weight_col = NULL,
  id_col = NULL,
  strata_col = NULL,
  label = NULL
)

Arguments

df

data.frame with sample and weights (if using a survey design)

vars

string vector of variables to calculate prevalences for

group_vars

string vector of variables to group by

weight_col

string specifying the column with weights, "nhis" or nhis survey design, or NULL for unweighted

id_col

string specifying the column with IDs for cluster-aware standard error (SE) calculations

strata_col

string specifying the column with strata for cluster-aware SE calculations

label

string label for weighting method

Details

TODO: Merge into regular summarize_results function

Value

data.frame with totals, means, and standard errors (if using a survey design)

Examples

# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)

# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)

# Calculate IPW weights using NHIS data and applied to All of Us
weights_df <- calculate_weights(
  nhis_dummied, 
  aou_dummied, 
  'ipw',
  paste0(calVars, '_I'), 
  paste0(stuVars, '_I'), 
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)

# Get IPW results by group
ipw_outcome_df <- summarize_results_by_group(
  weights_df, 
  paste0(stuVars, '_I'), 
  paste0(calVars, '_I'), 
  weight_col='ipw_weight', 
  label='AoU: IPW'
)