--- title: "Interpret ColocBoost Output" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Interpret ColocBoost Output} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 80 ) ``` This vignette demonstrates how to interpret the output of ColocBoost, specifically to get the summary of colocalization and focusing only on strong colocalization events. ```{r setup} library(colocboost) ``` ## 1. Summarize ColocBoost results ### Causal variant structure The dataset features two causal variants with indices 194 and 589. - Causal variant 194 is associated with traits 1, 2, 3, and 4. - Causal variant 589 is associated with traits 2, 3, and 5. ```{r run-colocboost} # Loading the Dataset data(Ind_5traits) # Run colocboost res <- colocboost(X = Ind_5traits$X, Y = Ind_5traits$Y) cos_summary <- res$cos_summary names(cos_summary) ``` The `cos_summary` object contains the colocalization summary for all colocalization events, with each row representing a single colocalization event. The summary includes the following columns: - **focal_outcome**: The focal outcome being analyzed, or `FALSE` if no focal outcome exists. - **colocalized_outcomes**: Traits that are colocalized within the 95% colocalization confidence set (CoS). - **cos_id**: A unique identifier for each 95% colocalization confidence set (CoS). - **purity**: The minimum absolute correlation of variants within the 95% colocalization confidence set (CoS). - **top_variable**: The variable with the highest variant colocalization probability (VCP). - **top_variable_vcp**: The variant colocalization probability for the top variable. - **cos_npc**: The normalized probability of colocalization for the 95% confidence set, providing empirical evidence in favor of colocalization over a trait-specific configuration. - **min_npc_outcome**: The minimum normalized probability among colocalized traits. - **n_variables**: The number of variables in the 95% colocalization confidence set (CoS). - **colocalized_index**: The indices of colocalized variables. - **colocalized_variables**: A list of colocalized variables. - **colocalized_variables_vcp**: The variant colocalization probabilities for all colocalized variables. To obtain the summary of colocalization with a specific focus on traits of interest, you can use the `get_cos_summary`, see the detailed usage of this function in [link](https://statfungen.github.io/colocboost/reference/get_cos_summary.html). This function allows you to filter the colocalization summary based on a particular outcome of interest, making it easier to interpret the results for specific traits. For example, if you are interested in the colocalization events involving the traits `Y1` and `Y2`, you can use the following code: ```{r summary-colocboost} # Get summary table of colocalization cos_interest_outcome <- get_cos_summary(res, interest_outcome = c("Y1", "Y2")) ``` ## 2. Filter colocalization events by relative strength of evidence In `cos_summary`, for each 95% CoS, the `cos_npc` column provides a normalized probability of colocalization and `min_npc_outcome` column provides the minimum normalized probability among colocalized traits. Those two metrics are measured as an empirical evidence of colocalization both in CoS-level and in trait-level. To obtain the best minimal colocalization configuration can be defined by using both `cos_npc` and `npc_outcome`. See the detailed usage of this function in [link](https://statfungen.github.io/colocboost/reference/get_robust_colocalization.html). ```{r run-strong-colocalization} filter_res <- get_robust_colocalization(res, cos_npc_cutoff = 0.5, npc_outcome_cutoff = 0.2) ``` - The output from `get_robust_colocalization` is the same as output from `colocboost`, which can be directly used in any post inference and visualization. - `npc=0.5` or `npc_outcome = 0.2` maintains robust colocalization signals for cases when many traits are evaluated. Higher thresholds can be specified if users want to focus only on strong colocalization events. ## 3. More details on ColocBoost output The entire colocalization output from `colocboost` is stored in the `colocboost` object, which contains several components: - **`cos_summary`**: A summary table for colocalization events (see details in above Section 1). - **`vcp`**: The variable colocalized probability for each variable. - **`cos_details`**: A object with all information for colocalization results. - **`data_info`**: A object with the detailed information from input data. - **`model_info`**: A object with the detailed information for colocboost model. In this section, we will provide a detailed explanation of the components for deepening into ColocBoost result using a mixed dataset. ```{r load-mixed-data} # Load example data data(Ind_5traits) data(Sumstat_5traits) # Create a mixed dataset X <- Ind_5traits$X[1:4] Y <- Ind_5traits$Y[1:4] sumstat <- Sumstat_5traits$sumstat[5] LD <- get_cormat(Ind_5traits$X[[1]]) # Run colocboost res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD) ``` ### 3.1. Variant colocalization probability (**`vcp`**) - **`vcp`** is the probability of a variant being colocalized with at least one traits, serving as analogs of posterior inclusion probabilities (PIPs) in single-trait fine-mapping. To plot the VCP for the variants within at least one CoS, you can use the `colocboost_plot` function with the `y` argument set to `"vcp"`. ```{r vcp-plot} colocboost_plot(res, y = "vcp") ``` Please visit our documentation portal at [Visualization of ColocBoost Results](https://statfungen.github.io/colocboost/articles/Visualization_ColocBoost_Output.html) for more details on the `colocboost_plot` function ### 3.2. Analyzed data information (**`data_info`**) - **`n_variables`**: number of variants being included. - **`variables`**: vector of variant names across all traits being included in colocalization analysis. - **`coef`**: regression coefficients estimated from the colocboost model for each trait. - **`z`**: z-scores from marginal associations for each trait. - **`n_outcomes`**: the number of traits being included in colocalization analysis. - **`outcome_info`** contains information of analyzed data, including sample size and data type. ```{r analyzed-data-info} res$data_info$outcome_info ``` ### 3.3. Colocalization details (**`cos_details`**) **`cos_details`** provides a detailed information for colocalization events identified by `colocboost`. This section will provide a detailed explanation of the components in `cos_details`. ```{r cos-details} names(res$cos_details) ``` #### 3.3.1. Colocalized variants for each CoS (**`cos`**) - **`cos`**: A list with a detailed information of colocalized variants for each CoS. - **`cos_index`**: Indices of colocalized variables with unique identifier for each CoS. - **`cos_variables`**: Names of colocalized variables with unique identifier for each CoS. - Note that variants are ordered by their VCP in descending order. ```{r cos} res$cos_details$cos ``` #### 3.3.2. Colocalized traits for each 95% CoS (**`cos_outcomes`**) - **`cos_outcomes`**: A list with a detailed information of colocalized traits for each CoS. - **`outcome_index`**: Indices of colocalized traits with unique identifier for each CoS. - **`outcome_name`**: Names of colocalized traits with unique identifier for each CoS. ```{r cos-outcome} res$cos_details$cos_outcomes ``` - **`cos_npc`**: normalized probability of colocalization for CoS, providing empirical evidence in favor of colocalization over a trait-specific configuration. - **`cos_outcomes_npc`**: normalized probability for each colocalized trait in order with evidence strength. - These two metrics could be used to filter the colocalization events by relative strength of evidence, see details in Section 2. ```{r cos-npc} res$cos_details$cos_npc res$cos_details$cos_outcomes_npc ``` - **`cos_purity`**: includes three lists, for each list, it contains $S \times S$ matrix, where $S$ is the number of CoS. - `min_abs_cor`: the minimum absolute correlation of variants within (diagonal) CoS or in-between (off-diagonal) different CoS. - `median_abs_cor`: the median absolute correlation of variants within (diagonal) CoS or in-between (off-diagonal) different CoS. - `max_abs_cor`: the maximum absolute correlation of variants within (diagonal) CoS or in-between (off-diagonal) different CoS. ```{r cos-purity} res$cos_details$cos_purity ``` - **`cos_top_variables`**: indices and names of the top variant for each CoS, which is the variant with the highest VCP. - Note that there may exist multiple variants in perfect LD with the same highest VCP. ```{r cos-top} res$cos_details$cos_top_variables ``` - **`cos_weights`**: the integrative weights for each colocalized trait in the CoS. This is used to recalibrate CoS when some traits are filtered out.. - **`cos_vcp`**: the single-effect VCP for each CoS. ### 3.4. Model information (**`model_info`**) - **`model_coveraged`**: if the model is converged. - **`outcome_model_coveraged`**: if the trait-specific model is converged. - **`n_updates`**: number of boosting rounds - **`outcome_n_updates`**: number of boosting rounds for each trait. - **`jk_update`**: indices of the variants being updated in the model at each boosting round. ```{r jk_update} # Pick arbitrary SEC updates, see entire update in advance res$model_info$jk_star[c(5:10,36:38), ] ``` - **`profile_loglik`**: joint profile log-likelihood changes over boosting rounds. - **`outcome_profile_loglik`**: trait-specific profile log-likelihood changes over boosting rounds. ```{r profile_loglik} # Plotting joint profile log-likelihood (blue) and trait-specific profile log-likelihood (red). par(mfrow=c(2,3),mar=c(4,4,2,1)) plot(res$model_info$profile_loglik, type="p", col="#3366CC", lwd=2, xlab="", ylab="Joint Profile") for(i in 1:5){ plot(res$model_info$outcome_profile_loglik[[i]], type="p", col="#CC3333", lwd=2, xlab="", ylab=paste0("Profile (Trait ", i, ")")) } ``` - **`outcome_proximity_obj`**: trait-specific proximity smoothed objective for each trait. - **`outcome_coupled_best_update_obj`**: objective at the (coupled) best update variant for each outcome. ```{r objective-proximity} # Save to restore default options oldpar <- par(no.readonly = TRUE) # Plotting trait-specific proximity objective par(mfrow=c(2,3), mar=c(4,4,2,1)) for(i in 1:5){ plot(res$model_info$outcome_proximity_obj[[i]], type="p", col="#3366CC", lwd=2, xlab="", ylab="Trait-specific Objective", main = paste0("Trait ", i)) } par(oldpar) ``` ```{r objective-best} # Save to restore default options oldpar <- par(no.readonly = TRUE) # Plotting trait-specific objective at the best update variant par(mfrow=c(2,3), mar=c(4,4,2,1)) for(i in 1:5){ plot(res$model_info$outcome_coupled_best_update_obj[[i]], type="p", col="#CC3333", lwd=2, xlab="", ylab=paste0("Objective at best update variant"), main = paste0("Trait ", i)) } par(oldpar) ``` ### 3.5. Trait-specific effects information (**`ucos_details`**) There is `ucos_details` in ColocBoost output when setting `output_level = 2`, including the trait-specific (uncolocalized) information from the single-effect learner (SEL). ```{r ucos-details} # Create a mixed dataset data(Ind_5traits) data(Heterogeneous_Effect) X <- Ind_5traits$X[1:3] Y <- Ind_5traits$Y[1:3] X1 <- Heterogeneous_Effect$X Y1 <- Heterogeneous_Effect$Y[,1,drop=F] res <- colocboost(X = c(X, list(X1)), Y = c(Y, list(Y1)), output_level = 2) names(res$ucos_details) ``` #### 3.5.1. Trait-specific (uncolocalized) confidence sets (**`ucos`**) - **`ucos`**: A list containing a detailed information about trait-specific (uncolocalized) variants for each uCoS. - **`ucos_index`**: Indices of trait-specific (uncolocalized) variants. - **`ucos_variables`**: Names of trait-specific (uncolocalized) variants. ```{r ucos} res$ucos_details$ucos ``` #### 3.5.2. Trait-specific (uncolocalized) outcomes (**`ucos_outcomes`**) - **`ucos_outcomes`**: A list with a detailed information about trait-specific (uncolocalized) outcomes for each uCoS. - **`outcome_index`**: Indices of trait-specific (uncolocalized) outcomes. - **`outcome_name`**: Names of trait-specific (uncolocalized) outcomes. ```{r ucos-outcomes} res$ucos_details$ucos_outcomes ``` #### 3.5.3. Purity across CoS and uCoS (**`cos_ucos_purity`**) - **`cos_ucos_purity`**: Includes three lists, each containing an $S \times uS$ matrix, where $S$ is the number of CoS and $uS$ is the number of uCoS: - `min_abs_cor`: Minimum absolute correlation of variables across each pair of CoS and uCoS. - `median_abs_cor`: Median absolute correlation of variables across each pair of CoS and uCoS. - `max_abs_cor`: Maximum absolute correlation of variables across each pair of CoS and uCoS. ```{r cos-ucos-purity} res$ucos_details$cos_ucos_purity ``` #### 3.5.4. Other components - **`ucos_weight`**: Integrative weights for each trait-specific (uncolocalized) trait, used to recalibrate uCoS when traits are filtered out. - **`ucos_top_variables`**: Indices and names of the top variable for each uCoS, which is the variable with the highest VCP. - **`ucos_purity`**: Includes three lists, each containing an $uS \times uS$ matrix, where $uS$ is the number of uCoS: - `min_abs_cor`: Minimum absolute correlation of variables within (diagonal) uCoS or between (off-diagonal) different uCoS. - `median_abs_cor`: Median absolute correlation of variables within or between uCoS. - `max_abs_cor`: Maximum absolute correlation of variables within or between uCoS. By analyzing these components, you can gain a deeper understanding of trait-specific (uncolocalized) effects that are not colocalized, providing additional insights into the data. ### 3.6. Diagnostic details (**`diagnostic_details`**) There is `diagnostic_details` in ColocBoost output when setting `output_level = 3`: ```{r diagnostic-details} # Loading the dataset data(Ind_5traits) X <- Ind_5traits$X Y <- Ind_5traits$Y res <- colocboost(X = X, Y = Y, output_level = 3) ``` - **`cb_model`**: trait-specific proximity gradient boosting model, including proximity weight at each iteration, residual after gradient boosting, etc. - **`weights_paths`**: individual trait-specific weights for each iteration. ```{r cb-model} names(res$diagnostic_details$cb_model) names(res$diagnostic_details$cb_model$ind_outcome_1) ``` - **`cb_model_para`**: parameters used in fitting ColocBoost model. ```{r cb-model-para} names(res$diagnostic_details$cb_model_para) ```