4 mvBIMBAM

4.1 Overview

Note this code is dependent upon code from the data preparation section which creates the mvBIMBAM input files detailed below.

4.1.1 Acknowledgements

Please note that this code was adapted from previous work led by Jerry Zhang, Lacey Heinsberg, and Daniel Weeks:

Zhang JZ, Heinsberg LW, Krishnan M, Hawley NL, Major TJ, Carlson JC, Harré Hindmarsh J, Watson H, Qasim M, Stamp LK, Dalbeth N, Murphy R, Sun G, Cheng H, Naseri T, Reupena MS, Kershaw EE, Deka R, McGarvey ST, Minster RL, Merriman TR, Weeks DE. Multivariate analysis of a missense variant in CREBRF reveals associations with measures of adiposity in people of Polynesian ancestries. Genet Epidemiol. 2023 Feb;47(1):105-118. doi: 10.1002/gepi.22508. Epub 2022 Nov 9. PMID: 36352773; PMCID: PMC9892232.

GitHub Repository

4.1.2 mvBIMBAM

mvBIMBAM is a command line program for multivariate genetic association analysis of multiple related phenotypes. In the mvBIMBAM framework, a global null model representing no association between phenotypes and genotypes is compared with an exhaustive combination of alternative models in which all different combinations of phenotype-genotype associations are considered (see the mph 2 analysis parameter details below). For the alternative models, the mvBIMBAM methodology splits phenotype-genotype associations into all possible partitions of U, D, and I, each representing ‘unassociated’, ‘directly’, and ‘indirectly’ associated. Details about the analyses and interpretation can be found below.

NOTE: The software executable is called ‘bimbam’, but we refer to the program as ‘mvBIMBAM’ throughout.

4.2 Load libraries

Load the libraries needed to run the code.

library(tidyverse)
select = dplyr::select
library(ggplot2)
# The 'preprocessCore' is a Bioconductor package
library(preprocessCore)
library(pander)
library(stringr)

4.3 Data

Most of our data (created in the data preparation section) are directly analyzed by the mvBIMBAM command line program (see below) and not formerly loaded into R. Here though, we will read in our stored lists of phenotypes/variants of interest. See the introduction section for information about phenotypes and SNPs of interest.

load("./data/TraitsGenes.RData", verbose=TRUE) ###CUSTOMIZE** (if input file name was changed in 00_ workflow)

Loading objects:
  traits
  genes
  trait_mapping
  custom_labels
  custom_labels2
  df_vertex_table

Could also hard code these variables as shown below.

traits <- c("EMO_tscore", "bdito", "FAT_tscore", "paohcif", "EPSscore", "pain") ###CUSTOMIZE**
trait_mapping <- c("Anxiety", "Depression", "Fatigue", "Cognitive function", "Sleepiness", "Pain") ###CUSTOMIZE**
custom_labels <- setNames(trait_mapping, traits)
custom_labels2 <- c(custom_labels, setNames(genes, genes))
genes <- c("rs4880", "rs5746136", "rs1041740", "rs10432782", "rs4135225", "rs7522705") ###CUSTOMIZE**

4.4 mvBIMBAM analyses

We will now apply mvBIMBAM to the synthetic data set we are working with. Rather than navigating the analyses in the terminal, we can make our code reproducible in R and call mvBIMBAM through the R using call system(). More information about the program can be found under the original mvBIMBAM documentation on GitHub. In brief, the command we need contains the following parameters:

-g: location of the genotype data (for our example the location is: ./inputs/geno_bimbam.txt)
-p: location of the phenotype data (for our example the location is: ./inputs/pheno_bimbam.txt)
-o: prefix for output files (for our example we will use: bimbam_out)
-f: number of phenotypes to analyze (length(traits), for our example this number is 6)
-A: indicate multiple values for sigma (see mvBIMBAM manual for more details; for our example we will use a range of values from 0.05 to 0.4)
-mph 1 vs. mph 2: analysis settings (see mvBIMBMAM manual for more details and our note below; for this example we will run mph 2)
-r: random seed for reproducibility of results**

**Random seed: The mvBIMBAM article states: “We maximized the likelihood using an EM algorithm; four independent runs of the algorithm from different starting points produced essentially identical results.” This appears to be implemented in a random way, and so depends on the random number seed. For the purposes of this tutorial, we set a seed number so that the results would be identical each time. In practice, this step is not necessary, so the “-r 5254” section can be deleted from the code below.

A note about mph 1 vs. mph 2: In brief, -mph 1 performs a simple multivariate test of the null hypothesis vs a general multivariate alternative. Because it does not consider all partitions of the alternative hypothesis, it is quite fast, can handle a moderately large number of phenotypes, and be run on a whole-genome scale. However, it was originally intended to be used only as needed in an initial filtering step before running the more computationally-intensive -mph 2 analyses. Because of the way -mph 1 functions, it does not allow you to ask interesting questions such as which phenotypes are associated with each SNP. To do this more detailed analysis use the -mph 2 option. Note that if you have only a handful of phenotypes, you may be able to run the -mph 2 analysis genome wide on all variants and skip -mph 1 all together.

A quick warning from the creators of mvBIMBAM: the prior used in mvBIMBAM is based on the idea that if a variant is associated with one phenotype, then it is likely associated with multiple phenotypes. That is, it is relatively permissive of associations with multiple phenotypes, and does not attempt to be “skeptical” of additional associations. In this sense it is more suited to hypothesis generation than of hypothesis testing. The prior can be changed to be more skeptical, but this adjustment has not been explored in detail as it is beyond the scope of this tutorial.

We now use the ‘system’ command to run the mvBIMBAM program on our data set, which outputs the following messages:

# Call system() to run mvBIMBAM by specifying settings for analysis using 
# the following settings:  
call <- paste0("bimbam -g ./inputs/geno_bimbam.txt -p ./inputs/pheno_bimbam.txt -o bimbam_out -mph 2 -f ", length(traits), " -A 0.05 -A 0.1 -A 0.2 -A 0.4 -r 5254") 
###CUSTOMIZE** (call object, do so if input file names (i.e., geno_bimbam.txt or pheno_bimbam.txt) were changed in 00_ workflow)
###CUSTOMIZE** (optional, can change output file name from bimbam_out to a name of your choice)
print(call)

[1] "bimbam -g ./inputs/geno_bimbam.txt -p ./inputs/pheno_bimbam.txt -o bimbam_out -mph 2 -f 6 -A 0.05 -A 0.1 -A 0.2 -A 0.4 -r 5254"

# Please note that this can take a few minutes
system(call, intern = TRUE)

[1] "-bimbam: file 0 has 763 individual and 6 snps"   
[2] "-bimbam: read file 0 again "                     
[3] "-bimbam: number of phenotypes = 6"               
[4] "total = 729"                                     
[5] "output/bimbam_out.mph.txt has been created."     
[6] "output/bimbam_out.mph.prob.txt has been created."
[7] "output/bimbam_out.mph.BFs.txt has been created." 
attr(,"status")
[1] 1

bimbam -g ./inputs/geno_bimbam.txt -p ./inputs/pheno_bimbam.txt 
-o bimbam_out -mph 2 -f 6 -A 0.05 -A 0.1 -A 0.2 -A 0.4 -r 5254

Once we run the chunk above, we can then find our output at ./output/bimbam_out.mph.BFs.txt.

4.5 Results

4.5.1 Bayes factors (BF)

The evidence against the null hypothesis is the sum of Bayes factors (BF) (log10 scale) of all partitions weighted by a diffuse prior. Let’s read in our BF factor results file to take a look.

options(pillar.sigfig = 10)
m1 <- read_delim("./output/bimbam_out.mph.BFs.txt", col_names = FALSE, show_col_types = FALSE) ###CUSTOMIZE** (if output file names are changed in the call object above)
m1 <- m1 %>% select_if(~any(!is.na(.)))
colnames(m1) <- c("SNP", "BF", "BF_All_Partition", trait_mapping)
m1 <- data.frame(m1)
m1 <- m1 %>% select(-BF_All_Partition)

# View results
pander(m1, digits = 4, caption = "Bayes Factors")

Bayes Factors (continued below)
SNP	BF	Anxiety	Depression	Fatigue	Cognitive.function
rs4880	11.64	-0.2122	0.9347	-0.405	2.041
rs5746136	8.751	0.1084	-0.3045	0.6435	0.2681
rs1041740	13.69	12.52	0.6702	4.629	6.156
rs10432782	4.878	1.619	1.85	-0.3186	0.05509
rs4135225	32.48	0.6325	1.603	25.11	5.586
rs7522705	1.672	-0.2782	-0.2978	-0.2944	0.5971

Sleepiness	Pain
0.02955	3.437
6.421	0.6853
0.00889	-0.2024
-0.05979	-0.1757
4.684	11.63
-0.3565	0.2123

Interpretation: The above table presents the overall log10 BF for the network, and individual log10 BFs for each trait. Strong evidence of association is defined as log10 BF > 5; suggestive evidence is defined as 1.5 < log10 BF < 5; and negligible evidence is defined as log10 BF < 1.5.

In this synthetic data set, we see that all SNPs (rs7522705 less so and rs4135225 more so) appear to play a role in this network. rs4880 has suggestive associations with cognitive function and pain; rs5746136 has significant associations with sleepiness; rs1041740 has significant associations with anxiety and cognitive function and a suggestive association with fatigue, etc.

4.5.2 Bayesian posterior probabilities of association

In addition to log10 BFs, probabilities for no association, direct association, and indirect association are output while marginal posterior probabilities of association (MPPA) are calculated by summing the marginal posterior probabilities of direct and indirect association.

# Read in second result file
s <- readLines("./output/bimbam_out.mph.prob.txt")

# Initialize an empty list to store the matrix for each rs#
matrix_list <- vector("list", length(s))

# Process each line
for (i in 1:length(s)) {
  # Split the line by spaces
  split_values <- strsplit(s[i], "\\s+")[[1]]
  # Store the rs# value
  rs <- split_values[1]
  
  m2 <- matrix(na.omit(as.numeric(str_split(s, " ")[[i]])), nrow = 2)
  m2 <- rbind(m2, 1 - (m2[1,] + m2[2,]))
  
  # Set the column and row names
  colnames(m2) <- trait_mapping
  rownames(m2) <- c(paste0(rs, "_Unassociated"), paste0(rs, "_Directly"), paste0(rs, "_Indirectly"))

  # Store the matrix in the list, using rs# as the name
  matrix_list[[i]] <- list(rs = rs, m2 = m2)
}

# Combine all matrices into a single matrix
m2 <- do.call(rbind, lapply(matrix_list, function(x) x$m))
for (i in 1:ncol(m2)) {
  m2[, i] <- as.numeric(gsub("\"", "", m2[, i]))
}

pander(m2, digits = 4, caption = "Bayesian Probabilities")

Bayesian Probabilities (continued below)
	Anxiety	Depression	Fatigue
rs4880_Unassociated	0.3751	0.0881	0.3875
rs4880_Directly	0.3415	0.9115	0.2468
rs4880_Indirectly	0.2834	0.00035	0.3657
rs5746136_Unassociated	0.1947	0.3647	0.07658
rs5746136_Directly	0.8037	0.5952	0.9207
rs5746136_Indirectly	0.00154	0.04015	0.00272
rs1041740_Unassociated	0	0.07304	0
rs1041740_Directly	1	0.9007	0.4676
rs1041740_Indirectly	0	0.02628	0.5324
rs10432782_Unassociated	0.00504	0.00493	0.3862
rs10432782_Directly	0.9831	0.9789	0.3531
rs10432782_Indirectly	0.01188	0.01618	0.2607
rs4135225_Unassociated	0.08721	0.01237	0
rs4135225_Directly	0.9128	0.532	1
rs4135225_Indirectly	0	0.4556	0
rs7522705_Unassociated	0.4168	0.54	0.4684
rs7522705_Directly	0.458	0.4192	0.5191
rs7522705_Indirectly	0.1253	0.04077	0.01252

	Cognitive function	Sleepiness	Pain
rs4880_Unassociated	0.00077	0.2343	0.00035
rs4880_Directly	0.9992	0.358	0.9997
rs4880_Indirectly	0	0.4077	0
rs5746136_Unassociated	0.1642	0	0.05097
rs5746136_Directly	0.8279	0.9999	0.5256
rs5746136_Indirectly	0.00791	9e-05	0.4234
rs1041740_Unassociated	0	0.2551	0.2203
rs1041740_Directly	0.9235	0.7228	0.3356
rs1041740_Indirectly	0.0765	0.02212	0.4441
rs10432782_Unassociated	0.3096	0.2886	0.3648
rs10432782_Directly	0.6901	0.4881	0.4547
rs10432782_Indirectly	0.00032	0.2233	0.1805
rs4135225_Unassociated	0	0	0
rs4135225_Directly	0.361	0.3242	0.9953
rs4135225_Indirectly	0.639	0.6758	0.00466
rs7522705_Unassociated	0.1034	0.5502	0.3012
rs7522705_Directly	0.8893	0.272	0.6018
rs7522705_Indirectly	0.00732	0.1778	0.09697

Interpretation: The numbers above can be interpreted as the probability of no association, direct association, or indirect association, which together sum to 1 (i.e., 100%). Both directly and indirectly associated phenotypes are associated with a given genotype, but indirectly associated phenotypes are conditionally independent of the genotype given the presence of a directly associated phenotype in the model.

For example, in this synthetic data set, there is a suggested 100% probability that there is a direct association between rs1041740 and anxiety. In addition, while there is a 0% probability of no association between rs4135225 and sleepiness, the likelihood of direct vs. indirect effects is more split at 32.417% and 67.583%, respectively. While mvBIMBAM doesn’t inform of us what variable indirect associations are conditional on, we can look back at our bnlearn network for more information.

4.6 Save data and results

To conclude, save the Bayes factors from the mvBIMBAM results.

4.6.1 mvBIMBAM BFs

save(m1, m2, file = "data/mvBimBam_Results.RDdata") ###CUSTOMIZE** (optional, can change file name of results)

4.7 Discussion

Note that mvBIMBAM’s default priors are “loose” (the phrase used in the software documentation), meaning associations are detected more liberally. This aligns with our results, where mvBIMBAM identified rs7522705 as suggestively important to the network (1.5 < BF < 5), but none of the individual phenotypes themselves were even suggestive (e.g., all BFs < 1.5), while bnlearn omitted it completely. It’s worth mentioning that you can modify the default priors to be more conservative (see software documentation), but this is beyond the scope of this guide.

4.8 Conclusion

And with that, we conclude our mvBIMBAM tutorial! We hope that this code is documented in enough detail so that you can easily adapt it for your own projects, but feel free to reach out with any questions! -Lacey