This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis with `colocboost`.
Related resources: `pecotmr`, `colocboost_pipeline`, and `xqtl_protocol`.

`colocboost_analysis_pipeline` function

This function harmonizes the input data and prepares it for colocalization analysis.

In this section, we introduce how to load the regional data required for the ColocBoost analysis using the `load_multitask_regional_data` function. This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, and covariate data) or summary statistics (sumstats and LD). It does so by running `load_regional_univariate_data` and `load_rss_data` once for each dataset.

Below are the input parameters for this function:

- `region` (required): A string of the form `chr:start-end` for the phenotype region you want to analyze.
- `genotype_list`: A vector of paths to PLINK bed files containing genotype data.
- `phenotype_list`: A vector of paths to phenotype files.
- `covariate_list`: A vector of paths to covariate files corresponding to the phenotype file vector.
- `conditions_list_individual`: A vector of strings representing different conditions or groups.
- `match_geno_pheno`: A vector giving, for each phenotype, the index of the genotype file it is matched to. Needed when multiple genotype PLINK files are used.
- `maf_cutoff`: Minimum minor allele frequency (MAF) cutoff. Default is 0.
- `mac_cutoff`: Minimum minor allele count (MAC) cutoff. Default is 0.
- `xvar_cutoff`: Minimum variance cutoff. Default is 0.
- `imiss_cutoff`: Maximum individual missingness cutoff. Default is 0.
- `association_window`: A string of the form `chr:start-end` for the association analysis window (cis or trans). If not provided, all genotype data will be loaded.
- `extract_region_name`: A list of vectors of strings (e.g., gene ID `ENSG00000269699`) to subset the information when there are multiple regions available. Default is `NULL`.
- `region_name_col`: Column name containing the region name. Default is `NULL`.
- `keep_indel`: Logical indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`.
- `keep_samples`: A vector of sample names to keep. Default is `NULL`.

Illustrated example

The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts, where the first cohort has 2 phenotypes and the second cohort has 1 phenotype.

```r
# Example of loading individual-level data
region = "chr1:1000000-2000000"
genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed")
phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz")
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2")
match_geno_pheno = c(1,1,2) # for each phenotype, the index of its genotype file in genotype_list
association_window = "chr1:1000000-2000000" # same as region for cis-analysis
# The following parameters need to be set according to your data
maf_cutoff = 0.01
mac_cutoff = 10
xvar_cutoff = 0
imiss_cutoff = 0.9
# For more advanced parameters, see pecotmr::load_multitask_regional_data()
```

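
The setup above can be sanity-checked before any files are read. The sketch below is a minimal illustration in base R. `parse_region()` is a hypothetical helper, not a pecotmr function, and the length checks assume (as in the example) one covariate file, one condition name, and one `match_geno_pheno` entry per phenotype file. The final call uses only the argument names listed above and is skipped unless pecotmr is installed and the example files exist.

```r
# Hypothetical helper (not part of pecotmr): split "chr:start-end" into its parts.
parse_region <- function(region) {
  parts <- regmatches(region, regexec("^([^:]+):([0-9]+)-([0-9]+)$", region))[[1]]
  if (length(parts) != 4) {
    stop("region must look like 'chr:start-end', got: ", region)
  }
  list(chrom = parts[2], start = as.integer(parts[3]), end = as.integer(parts[4]))
}

# Inputs repeated from the example above so that this block runs on its own.
region <- "chr1:1000000-2000000"
genotype_list <- c("plink_cohort1.1.bed", "plink_cohort1.2.bed")
phenotype_list <- c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz",
                    "phenotype1_cohort2.bed.gz")
covariate_list <- c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz",
                    "covariate1_cohort2.bed.gz")
conditions_list_individual <- c("phenotype1_cohort1", "phenotype2_cohort1",
                                "phenotype1_cohort2")
match_geno_pheno <- c(1, 1, 2)

# Assumed from the example: every per-phenotype vector has the same length,
# and each match_geno_pheno entry points at an existing genotype file.
n_pheno <- length(phenotype_list)
stopifnot(
  length(covariate_list) == n_pheno,
  length(conditions_list_individual) == n_pheno,
  length(match_geno_pheno) == n_pheno,
  all(match_geno_pheno %in% seq_along(genotype_list))
)

# The region string must parse and describe a non-empty interval.
window <- parse_region(region)
stopifnot(window$start < window$end)

# Show which genotype file each condition is paired with.
pairing <- data.frame(
  condition = conditions_list_individual,
  genotype = genotype_list[match_geno_pheno],
  stringsAsFactors = FALSE
)
print(pairing)

# Collect the arguments under the parameter names documented above.
args_individual <- list(
  region = region,
  genotype_list = genotype_list,
  phenotype_list = phenotype_list,
  covariate_list = covariate_list,
  conditions_list_individual = conditions_list_individual,
  match_geno_pheno = match_geno_pheno,
  association_window = region,
  maf_cutoff = 0.01, mac_cutoff = 10, xvar_cutoff = 0, imiss_cutoff = 0.9
)

# Only attempt the real call when pecotmr and all input files are available.
files_ok <- all(file.exists(c(genotype_list, phenotype_list, covariate_list)))
if (requireNamespace("pecotmr", quietly = TRUE) && files_ok) {
  region_data <- do.call(pecotmr::load_multitask_regional_data, args_individual)
}
```

Printing `pairing` makes the role of `match_geno_pheno` visible: the first two phenotypes share the first genotype file, and the third uses the second.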

The following input parameters apply to summary statistics:

- `sumstat_path_list`: A vector of file paths to the summary statistics.
- `column_file_path_list`: A vector of file paths to the column mapping files.
- `LD_meta_file_path_list`: A vector of paths to LD metadata files. LD metadata is a data frame specifying LD blocks with columns “chrom”, “start”, “end”, and “path”. “start” and “end” denote the positions of LD blocks, and “path” is the path of each LD block, optionally including bim file paths.
- `match_LD_sumstat`: Logical indicating whether to match LD blocks with summary statistics.
- `conditions_list_sumstat`: A vector of strings representing different sumstats.
- `n_samples`: User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file.
- `n_cases`: User-specified number of cases.
- `n_controls`: User-specified number of controls.
- `region`: The region where tabix is used to subset the input dataset.
- `extract_sumstats_region_name`: User-specified gene/phenotype name used to further subset the phenotype data.
- `sumstats_region_name_col`: The column to filter for the `extract_sumstats_region_name`.
- `comment_string`: Comment sign in the column mapping file. Default is `#`.
- `extract_coordinates`: Optional data frame with columns “chrom” and “pos” for specific coordinates extraction.

Illustrated example

The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.

```r
# Example of loading summary statistics
sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml")
LD_meta_file_path_list = "ld_meta_file.tsv"
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_sumstat = c("sumstat_1", "sumstat_2")
# The following parameters need to be set according to your data
n_samples = c(0, 0)
n_cases = c(10000, 20000)
n_controls = c(20000, 40000)
# For more advanced parameters, see pecotmr::load_multitask_regional_data()
```

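
The LD metadata layout described above can be sketched as a small data frame. Everything in this block is illustrative: the block boundaries and paths are made up, `blocks_overlapping()` is a hypothetical helper rather than a pecotmr function, and the overlap rule (blocks that only touch the region boundary are not counted) is an assumption, since the text does not say how boundaries are handled. The length checks likewise assume one mapping file, one condition name, and one sample-size entry per summary statistics file, as in the example.

```r
# Inputs repeated from the example above so that this block runs on its own.
sumstat_path_list <- c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list <- c("mapping_columns_1.yml", "mapping_columns_2.yml")
conditions_list_sumstat <- c("sumstat_1", "sumstat_2")
n_samples <- c(0, 0)
n_cases <- c(10000, 20000)
n_controls <- c(20000, 40000)

# Assumed from the example: one entry per summary statistics file.
n_sumstat <- length(sumstat_path_list)
stopifnot(
  length(column_file_path_list) == n_sumstat,
  length(conditions_list_sumstat) == n_sumstat,
  length(n_samples) == n_sumstat,
  length(n_cases) == n_sumstat,
  length(n_controls) == n_sumstat
)

# Hypothetical LD metadata with the four documented columns.
ld_meta <- data.frame(
  chrom = c("chr1", "chr1", "chr1"),
  start = c(0L, 1500000L, 3000000L),
  end   = c(1500000L, 3000000L, 4500000L),
  path  = c("ld/chr1_block1.ld.gz", "ld/chr1_block2.ld.gz", "ld/chr1_block3.ld.gz"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pecotmr): LD blocks that overlap a region.
blocks_overlapping <- function(ld_meta, chrom, start, end) {
  keep <- ld_meta$chrom == chrom & ld_meta$start < end & ld_meta$end > start
  ld_meta[keep, , drop = FALSE]
}

# Blocks needed for the region used in the individual-level example.
hits <- blocks_overlapping(ld_meta, "chr1", 1000000L, 2000000L)
print(hits$path)
```

Here the region spans the boundary between the first two blocks, so both are selected and the third is left out.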

`colocboost_analysis_pipeline` function

In this section, we perform the colocalization analysis using the `colocboost_analysis_pipeline` function. Below are the input parameters for this function:

- `region_data`: The output of the `load_multitask_regional_data` function.
- `focal_trait`: Name of the trait if performing disease-prioritized ColocBoost.
- `event_filters`: A list of patterns for filtering events based on context names. Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:")`.
- `maf_cutoff`: A scalar to remove variants with MAF < `maf_cutoff`. Default is 0.005.
- `pip_cutoff_to_skip_ind`: A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0.
- `pip_cutoff_to_skip_sumstat`: A vector of cutoff values for skipping analysis based on PIP values for each sumstat. Default is 0.
- `qc_method`: Quality control method to use. Options are “rss_qc”, “dentist”, or “slalom”. Default is “rss_qc”.
- `impute`: Logical; if `TRUE`, performs imputation for outliers identified in the analysis. Default is `TRUE`.
- `impute_opts`: A list of imputation options including `rcond`, `R2_threshold`, and `minimum_ld`. Default is `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`.
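
To get a feel for the sQTL `event_filters` example above, the three patterns can be tried on a few context names with base R. The context names below are hypothetical and made up for illustration, and this sketch only shows what each regular expression matches. How the pipeline combines the three patterns internally is not described here, so no filtering logic is implied.

```r
# The sQTL patterns quoted in the parameter list above.
event_filters <- list(
  type_pattern    = ".*clu_(\\d+_[+-?]).*",
  valid_pattern   = "clu_(\\d+_[+-?]):PR:",
  exclude_pattern = "clu_(\\d+_[+-?]):IN:"
)

# Hypothetical context names, made up for illustration.
contexts <- c(
  "chr1:100:200:clu_123_+:PR:ENSG00000269699",
  "chr1:300:400:clu_123_+:IN:ENSG00000269699",
  "eQTL_ENSG00000269699"
)

# perl = TRUE keeps \\d and the bracket range consistent across platforms.
# Note that [+-?] is read as a character range from "+" to "?", which
# includes both strand signs "+" and "-".
matches <- data.frame(
  context  = contexts,
  is_event = grepl(event_filters$type_pattern, contexts, perl = TRUE),
  valid    = grepl(event_filters$valid_pattern, contexts, perl = TRUE),
  excluded = grepl(event_filters$exclude_pattern, contexts, perl = TRUE),
  stringsAsFactors = FALSE
)
print(matches)

# The capture group of type_pattern isolates the cluster identifier.
cluster_id <- ifelse(
  matches$is_event,
  sub(event_filters$type_pattern, "\\1", contexts, perl = TRUE),
  NA_character_
)
print(cluster_id)
```

The first two names share the same cluster identifier and differ only in the `:PR:` versus `:IN:` tag, which is what separates `valid_pattern` from `exclude_pattern`. The third name contains no `clu_` segment and matches none of the patterns.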