2  Data preparation

2.1 Overview

A fully synthetic example data set has been created to accompany this repository so that we can freely and safely document the analytical steps and interpretation related to our two multivariate Bayesian approaches. The real data from which the synthetic data were generated were collected via a nurse-led study from the University of Pittsburgh (PI: Dr. Cathy Bender) focused on individuals with breast cancer. The study collected extensive data related to symptoms, and a sub-study (PI: Dr. Tara Davis) collected candidate gene single nucleotide polymorphism (SNP) data. Candidate symptoms selected for inclusion in this tutorial included anxiety, depression, fatigue, daytime sleepiness, cognitive function, and pain. Candidate SNPs selected for examination in this study included rs4880, rs5746136, rs1041740, rs10432782, rs4135225, and rs7522705. The original data set consisted of 110 participants. For the purposes of our examples, a larger, synthetic data set mirroring the statistical properties of the original data set was created using the synthpop R package. R code demonstrating how this can be done is available in the mvNUR GitHub repository that complements this website. The synthetic data set used for demonstration purposes contains 770 participants.

2.2 Load libraries

Install and/or load the libraries needed to run the code.

library(tidyverse) 
library(pander) 
library(preprocessCore) 
library(corrplot) 

2.3 Prepare data for multivariate analyses

The programs we will use for multivariate analyses (bnlearn and mvBIMBAM) require the data to be in a specific format for analysis. Here we are preparing and formatting the data for analysis.

2.3.1 Read in the synthetic data set

Please see the introduction for information regarding the example synthetic data set.

# Read in the synthetic data set created for use with this example analysis code
df_synth <- read.csv("data/BrCa_synthetic.csv") ###CUSTOMIZE**
head(df_synth)
    ID EMO_tscore bdito FAT_tscore paohcif EPSscore pain age education race
1 1000       47.8     8       46.9       0        5    0  60        18    0
2 1001       47.8     8       49.2       4       12    3  51        21    0
3 1002       49.4    18       50.4      16       12    1  57        20    0
4 1003       45.9     3       38.5       6        2    0  51        16    0
5 1004       50.8     6       46.9       0        7    1  63        16    0
6 1005       37.1     0       38.5       0        2    1  59        15    0
  rs4880 rs5746136 rs1041740 rs10432782 rs4135225 rs7522705
1      1         0         1          1         1         1
2      2         1         0          1         0         2
3      2         1         0          0         0         0
4      1         1         0          0         1         1
5      0         1         1          0         1         0
6      1         0         0          0         1         1
(n <- nrow(df_synth))
[1] 770
# Define the phenotypes (in this case, symptoms) of interest and mapping names 
traits <- c("EMO_tscore", "bdito", "FAT_tscore", "paohcif", "EPSscore", "pain") ###CUSTOMIZE**
# Define a trait mapping object to create custom labels for figures 
# (e.g., EMO_tscore variable represents our measure of Anxiety; bdito represents our measure of depression; etc.)
trait_mapping <- c("Anxiety", "Depression", "Fatigue", "Cognitive function", "Sleepiness", "Pain") ###CUSTOMIZE**

# Create custom labels 
custom_labels <- setNames(trait_mapping, traits)

# Define variants of interest 
genes <- c("rs4880", "rs5746136", "rs1041740", "rs10432782", "rs4135225", "rs7522705") ###CUSTOMIZE**

# Create expanded custom labels 
custom_labels2 <- c(custom_labels, setNames(genes, genes))

# Combine traits and genes
vertex.names <- c(traits, genes)

# Create type vector
types <- c(rep("Symptom", length(traits)), rep("Genotype", length(genes)))

# Create the dataframe
df_vertex_table <- data.frame(
  vertex.names = vertex.names,
  type = types,
  stringsAsFactors = FALSE
)

# Replace vertex.names with the corresponding custom labels
df_vertex_table$vertex.names <- custom_labels2[match(df_vertex_table$vertex.names, names(custom_labels2))]

Check that all variables are numeric or integer. NOTE: The code below assumes that all SNP names begin with “rs” and no trait names begin with “rs”.

rs_columns <- grep("^rs", names(df_synth), value = TRUE)
non_numeric_rs_columns <- rs_columns[!sapply(df_synth[rs_columns], is.numeric)]
if (length(non_numeric_rs_columns) > 0) {
  cat("SAFETY CHECK WARNING: The following variables starting with 'rs' are not numeric:", paste(non_numeric_rs_columns, collapse = ", "), "\n")
} else {
  cat("SAFETY CHECK PASSED: All 'rs' variables are numeric.\n")
}
SAFETY CHECK PASSED: All 'rs' variables are numeric.

In brief, the data set focuses on 6 symptoms (EMO_tscore, bdito, FAT_tscore, paohcif, EPSscore, pain) and 6 candidate variants of interest (rs4880, rs5746136, rs1041740, rs10432782, rs4135225, rs7522705) and contains 770 participants.

# Read in a mapping of variable name to informative label
dict <- read.csv("data/BrCa_synthetic_SimpleDict.csv") ###CUSTOMIZE** (OPTIONAL)
pander(dict, "Data dictionary") 
Data dictionary (continued below)
Variable Label Description
Emo_tscore Anxiety PROMIS Emotional Distress Anxiety (Short Form 8a) - 8 items measuring emotional distress and anxiety (panic, fearfulness, worry, dread, tension,nervousness, restlessness, and somatic symptoms including racing heart and dizziness) in the last 7 days. This instrument generates T-scores, which are standard scores with a mean of 50 and standard deviation of 10 in the U.S. general population. Higher T-scores indicate worse distress and anxiety. T-scores range from 10 to 90.
bdito Depression The Beck Depression Inventory-II - 21-items measuring severity of depression (i.e., grief, loss, flat affect, withdrawal, mania) over a two-week period. Scores range from 0 to 63, where high scores indicate higher levels or more severe depression.
FAT_tscore Fatigue PROMIS Fatigue (Short Form 8a) - 8 items measuring the experience of fatigue and the interference of fatigue on daily activities over the past 7 days. This instrument generates T-scores, which are standard scores with a mean of 50 and standard deviation of 10 in the U.S. general population. Higher T-scores indicate worse fatigue. T-scores range from 10 to 90.
paohcif Cognitive function Patient assessment of own cognitive functioning - 33 items measuring self-reported ‘recent’ cognitive function consisting of 5 dimensions: memory, language and communication, use of hands, sensory-perceptual, higher level cognitive and intellectual functioning. Higher scores indicate worse perceived cognitive functioning. Scores range from 0 to 155.
EPSscore Sleepiness Epworth Sleepiness Scale - 8 item scale measuring self-reported ‘recent’ daytime sleepiness. Higher score indicate greater daytime sleepiness. Scores range from 0 to 24.
pain Pain Brief Pain Inventory (Short Form) - 9 items a widely used clinical tool for assessing pain (worst pain in the last 24 hours [0-10], least pain in the last 24 hours [0-10], pain on average [0-10], paing at the time of the interview [0-10], pain relief from treatment(s) in the last 24 hours [0%-100%]). Higher scores indicate higher levels of pain, each item ranges from 0-10.
rs4880 Variant 1 SOD2 gene, hg19 postion chr6: 160113872
rs5746136 Variant 2 SOD2 gene, hg19 postiion chr6: 160103084
rs1041740 Variant 3 SOD1 gene, hg19 postiion chr21: 33040162
rs10432782 Variant 4 SOD1 gene, hg19 postiion chr21: 33036391
rs4135225 Variant 5 TXN gene, hg19 postiion chr9: 113006691
rs7522705 Variant 6 PRDX1 gene, hg19 postiion chr1: 45992300
race Race Self-identified race, 0=White, 1=Black
age Age Age in years
education Education Self-reported years of education
Type Citation
Numeric Pilkonis, P.A. et al. (2011) Item banks for measureing emotional distress from the patient-reported outcomes measurement information system (PROMIS): Depression, Anxiety, and Anger, Assesssment, 18(3), 263-283
Numeric Beck, A.T. et al. (1996) Beck Depression Inventory-II. San Antonio: The Psychological Corporation
Numeric Cell, D. et al. (2016) PROMIS fatigue item bank had clinical validity across diverse chronic conditions. Journal of Clinical Epidemiology, 73, 128-134
Numeric Chelune, G. J. et al. (1986) Neuropsychological and personality correlates of patients’ complaints of disability. Advances in clinical Neuropsychology, 1986:95-126
Numeric Johns MW. A new method for measuring daytime sleepiness: The Epworth Sleepiness Scale. Sleep 1991; 14(6):540-5
Numeric Cleeland, C.S. et al. (2009). Pain assessment: Global use of the Brief Pain Inventory. Annals, Academy of Medicine, Sigapore, 23(2), 129-138.
Numeric
Numeric
Numeric
Numeric
Numeric
Numeric
Numeric, encoded value
Numeric
Numeric

2.3.2 Examine correlation structure of the symptoms

# Subset the dataset to include only the selected traits
subset_df <- df_synth[traits]
# Calculate the correlation matrix
cor_matrix <- cor(subset_df, use = "complete.obs")
# Define custom labels for rows and columns
# Update row and column names using custom labels
rownames(cor_matrix) <- custom_labels[rownames(cor_matrix)]
colnames(cor_matrix) <- custom_labels[colnames(cor_matrix)]
# Create a correlation plot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

2.3.3 Examine correlation structure of SNPs

# Subset the dataset to include only the selected traits
subset_df <- df_synth[genes]
# Calculate the correlation matrix
cor_matrix <- cor(subset_df, use = "complete.obs")
# Create a correlation plot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

2.3.4 Summarize missing data

missing_data_table <- df_synth %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>% # Count NA values in each column
  gather(key = "Column", value = "Number_of_NAs") %>% # Convert to long format
  arrange(desc(Number_of_NAs)) # Sort by the number of NAs
pander(missing_data_table)
Column Number_of_NAs
EMO_tscore 7
ID 0
bdito 0
FAT_tscore 0
paohcif 0
EPSscore 0
pain 0
age 0
education 0
race 0
rs4880 0
rs5746136 0
rs1041740 0
rs10432782 0
rs4135225 0
rs7522705 0

2.3.5 Clean up data

First, let’s prepare our data sets. Currently, the software we are using does not allow missing data for multivariate phenotype analysis, so select complete cases only. We will also set up a few data frames in preparation to regress out variation related to our covariates of interest (in this example, age and race).

  1. df_i1_regress: contains all variables of interest ordered and filtered for complete cases
  2. df_i1: contains all variables except the covariates that will be regressed out below (in this example, age and race)
  3. df_synth: Recode 0, 1, 2 genotypes to AA, AB, BB for later use in mvBIMBAM
# The first data frame (df_i1_regress) contains all variables of interest ordered 
# and filtered for complete cases
df_i1_regress <- df_synth %>% 
  select(all_of(genes), 
         all_of(traits),
         race, ###CUSTOMIZE** (covariate names, replacing age and race)
         age) %>% ###CUSTOMIZE** (covariate names, replacing age and race)
  filter(complete.cases(.))
(n <- nrow(df_i1_regress))
[1] 763
# The second data frame (df_i1) contains all variables except the covariates that
# will be regressed out (in this example, age and race)
df_i1 <- df_i1_regress %>% 
  select(all_of(genes), all_of(traits)) %>% 
  filter(complete.cases(.))

# Note, if covariates have a unique pattern of missing data, df_i1 could 
# be smaller/different than df_i1_regress. Printing number of rows in df_i1 to check
print(dim(df_i1)[1])
[1] 763
print(dim(df_i1_regress)[1])
[1] 763
# Recode 0, 1, 2 genotypes to AA, AB, BB for later use in mvBIMBAM 
df_synth <- df_synth %>% 
  mutate_at(
    .vars = vars(starts_with("rs")),
    .funs = list(~ case_when(
      . == 2 ~ "BB",
      . == 1 ~ "AB",
      . == 0 ~ "AA"
    ))
  )

# Create Genotype (G) and Phenotype (Y) matrices
G <- as.matrix(df_i1 %>% select(all_of(genes)))
head(G)
     rs4880 rs5746136 rs1041740 rs10432782 rs4135225 rs7522705
[1,]      1         0         1          1         1         1
[2,]      2         1         0          1         0         2
[3,]      2         1         0          0         0         0
[4,]      1         1         0          0         1         1
[5,]      0         1         1          0         1         0
[6,]      1         0         0          0         1         1
Y <- as.matrix(df_i1 %>% select(all_of(traits)))
head(Y)
     EMO_tscore bdito FAT_tscore paohcif EPSscore pain
[1,]       47.8     8       46.9       0        5    0
[2,]       47.8     8       49.2       4       12    3
[3,]       49.4    18       50.4      16       12    1
[4,]       45.9     3       38.5       6        2    0
[5,]       50.8     6       46.9       0        7    1
[6,]       37.1     0       38.5       0        2    1

There are 763 participants with complete data that we will retain for our analyses.

2.3.6 Normalize and adjust data for covariates

In this example, we are adjusting our phenotypes of interest for the covariates age and race using ordinary linear regression models. If adapting this code for your own work, manually edit the covariate names in the f_quatile_norm_resid() function. Note that the sensitivity of the Bayesian multivariate mvBIMBAM framework to outlier values and non-normality also necessitates the normalization of phenotypes. As shown below, residualized phenotypes (i.e., adjusted for age/race) are quantile-normalized.

2.3.6.1 Create adjustment/normalization functions

We now create a function to perform residual adjustment for covariates (in this example, we are adjusting for age and race):

f_quantile_norm_resid <- function(Y, df) {
  {o <- apply(Y, 2, function(x) resid(lm(x ~ age + race, data = df)))} ###CUSTOMIZE** (covariate names, replacing age and race)
  return(o)
}

If adapting this code above for your own work, edit the x ~ age + race regression formula to adjust for covariates of interest, replacing age and race with your covariate variable names.

We now create function to ‘super quantile normalize’ the data:

f_quantile_normalize_adjust <- function(Y, data, ...) {
  # Quantile normalize
  Y_qn <- normalize.quantiles(Y)
  # Fit Y ~ age + race, extra residual (using function created above)
  Y_qn_resid <- f_quantile_norm_resid(Y = Y_qn, df = data, ...) 
  # Quantile normalize the residual
  Y_qn_resid_qn <- data.frame(normalize.quantiles(Y_qn_resid))
  return(Y_qn_resid_qn)
}

2.3.6.2 Apply functions to perform normalization and covariate adjustment

# Create a quantile normalized adjusted Y data frame (i.e., quantile normalization 
# and covariate adjustment is performed in one fell swoop)
qn_resid_Y <- f_quantile_normalize_adjust(Y, data = df_i1_regress)
# Create a copy of this data frame for use later in this workflow 
qn_resid_Y_b <- qn_resid_Y 
# Rename the columns of the quantile normalized data frame to match the 
# phenotypes of interest  
names(qn_resid_Y) <- traits
head(qn_resid_Y)
  EMO_tscore       bdito FAT_tscore   paohcif  EPSscore      pain
1 -0.8851885  2.30362054  -3.330215 -5.645429 -3.414409 -6.012957
2 -1.9542886  0.09188708  -3.221254 -1.577718  7.196097  1.066255
3 -0.1561693  9.13047401  -2.133030 13.281734  8.530244 -3.592499
4 -3.4144091 -6.10527417  -7.413173  1.066255 -8.598853 -7.199561
5  1.8913640  0.83681075  -2.840231 -4.802835  1.891364 -1.847911
6 -6.1052742 -8.14287258  -6.446351 -6.057697 -8.142873 -2.818185

Note: For one test user (DM), the above chunk threw the error:

Error in normalize.quantiles(Y) : 
ERROR; return code from pthread_create() is 22

This issue was resolved by updating RStudio to the latest version.

2.3.7 Remove outliers

Observations in violation of multivariate normality at an alpha=0.01 level based on Mahalanobis distance-based test statistics are now removed to avoid spurious conclusions.

2.3.7.1 Write a function to calculate Mahalanobis distance

# Create a function to calculate Mahalanobis distance
getMD <- function(x) {
  Sx <- cov(x)
  m <- mahalanobis(x, colMeans(x), Sx)
  return(m)
}

2.3.7.2 Apply function to identify outliers

# Drop individuals with data violating multivariate normality at alpha = 0.01
i_keep <- which(pchisq(getMD(qn_resid_Y_b), df = dim(Y)[2]) > 0.01)

2.3.8 Create a summary

# Record sample sizes in a summary table
table1 <- data.frame(study=rep(NA,1),N.traits=NA,N.variants=NA,N.total=NA,n.complete=NA,n.used=NA)
i <- 1
table1[i,"study"] <- "Study Name" ###CUSTOMIZE** (study name)
table1[i,"N.total"] <- nrow(df_synth)
table1[i,"n.complete"] <- nrow(qn_resid_Y)
table1[i,"n.used"]  <- nrow(qn_resid_Y[i_keep, ])
table1[i,"N.traits"] <- ncol(qn_resid_Y)
table1[i,"N.variants"] <- length(genes)
table1[i,]
       study N.traits N.variants N.total n.complete n.used
1 Study Name        6          6     770        763    763
# Print number of observations due to violation of multivariate normality 
cat(dim(Y)[1] - length(i_keep), " Obs removed due to violation of MV-Normality")
0  Obs removed due to violation of MV-Normality
# Add to summary table
table1$n.removed <- table1$N.total - table1$n.used
table1$percent.removed <- round(100*table1$n.removed/table1$N.total,2)
pander(table1,caption="Sample sizes")
Sample sizes (continued below)
study N.traits N.variants N.total n.complete n.used n.removed
Study Name 6 6 770 763 763 7
percent.removed
0.91

Less than 1% of participants were removed in data processing.

2.3.9 Prepare and save final files

2.3.9.1 Data used in both programs

# Check for data directory; if not present, create it 
if (!dir.exists("./data")) {
  dir.create("./data")
}

# Write data 
save(traits, genes, trait_mapping, custom_labels, custom_labels2, df_vertex_table, file = "./data/TraitsGenes.RData") ###CUSTOMIZE** (optional, file name)

2.3.9.2 mvBIMBAM

# Write phenotypes to a text file for use in mvBIMBAM 
if (!dir.exists("./inputs")) {
  dir.create("./inputs")
}
write.table(round(qn_resid_Y[i_keep,], 8), 
            "./inputs/pheno_bimbam.txt", sep = " ", ###CUSTOMIZE** (optional, file name)
            row.names = F, col.names = F)

# Refine genotype data for mvBIMBAM and write file
Geno_write <- df_synth %>% select(all_of(genes), all_of(traits)) %>%
  filter(complete.cases(.)) %>%
  select(all_of(genes)) %>%
  {.[i_keep,]} # Apply i_keep matrix here to retain non-outlying participants

# Grep rs column numbers to create the geno_string file required for mvBIMBAM format
rs_cols <- grep("^rs", colnames(df_synth), value = TRUE)

# Create geno_string format required for mvBIMBAM, filtering data set for only i_keep participants
Geno_String <- map(rs_cols, ~ {
  Geno_write <- df_synth %>%
    select(all_of(.x), all_of(traits)) %>%
    filter(complete.cases(.)) %>%
    select(all_of(.x))%>%
    {.[i_keep,]}
  
  # Creating the Geno_String for each SNP rsID  
  Geno_String <- paste0(unlist(c(Geno_write)), collapse = ",")
  Geno_String <- paste0(.x, ",", Geno_String)
  
  Geno_String
}) 

# Polish geno_string for mvBIMBAM
final_Geno_String <- paste0(unlist(Geno_String), collapse = "\n")
M <- length(unlist(strsplit(Geno_String[[1]], ","))) - 1
N <- length(Geno_String)
final_Geno_String <- paste0(M, "\n", N, "\n", final_Geno_String)

# Write it out 
writeLines(final_Geno_String, con = "./inputs/geno_bimbam.txt", sep = "") ###CUSTOMIZE** (optional, file name)

2.3.9.3 bnlearn

# Curate bnlearn data (convert AA/BB coding back to additive) 
Geno_write2 <- Geno_write %>%
    mutate_at(
    .vars = vars(starts_with("rs")),
    .funs = list(~ case_when(
      . ==  "BB" ~ 2,
      . == "AB" ~ 1,
      . ==  "AA" ~ 0
    ))
  )

# Create merged data frame
bnlearn_data <- data.frame(Geno_write2, round(qn_resid_Y[i_keep,], 8))

# The package to learn the Bayesian networks (bnlearn) does not support integer data,
# so convert integer columns to numeric
bnlearn_data[sapply(bnlearn_data, class) == "integer"] <- 
  sapply(bnlearn_data[sapply(bnlearn_data, class) == "integer"], as.numeric)

# Write data for bnlearn
saveRDS(bnlearn_data, file = "data/QuantNorm.rds") ###CUSTOMIZE** (optional, file name)