In this vignette, we:

- Briefly recap the data used in the analysis
- Fit a series of nested models to the data using our base model
- Repeat the model fitting with the Multiple Outside Transmissions (MOT) model
- Examine standard errors
The data can be accessed using `data(tb_clean)`. The clusters are uniquely identified by the variable `group`. The data consist of 159 clusters, where each cluster contains between 1 and 25 individuals. These individuals have covariates corresponding to their smear status (+/-/NA), HIV status (+/-/unknown), date of sputum collection, and race (Asian/Black/White). Please see `?tb_clean` for more information.
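For example, assuming the package that provides `tb_clean` is already loaded, we can load and inspect the data:

```r
data(tb_clean)   # attach the TB cluster data
str(tb_clean)    # inspect the covariates described above
```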
Only a few individuals have smear status `NA`, and in this analysis we impute those to be smear negative. We transform the date of sputum collection into a variable we call relative time, which is the time in years between an individual's sputum collection date and the first observed sputum collection within their cluster. As a result, every individual in a singleton cluster has a relative time of 0.
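As a toy illustration of this transformation (a hypothetical data frame and column names, not the actual data; the real formatting code is shown below):

```r
library(dplyr)

## hypothetical toy data: cluster 2 is a singleton
toy <- data.frame(cluster_id = c(1, 1, 2),
                  sputum_date = as.Date(c("2000-01-01", "2000-07-01",
                                          "2001-03-15")))
toy %>%
  group_by(cluster_id) %>%
  mutate(rel_time = as.numeric(sputum_date - min(sputum_date)) / 365) %>%
  ungroup()
## the singleton's rel_time is 0 by construction
```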
We consider HIV status and race to be categorical variables and use “HIV+” and “White” as the reference groups, respectively.
We use the code below to format `tb_clean` for use in our model fitting.
```r
library(dplyr)
library(forcats)

clusters <- tb_clean %>%
  ## binary smear indicator (1 = smear positive)
  mutate(smear = ifelse(spsmear == "Positive", 1, 0),
         cluster_id = group,
         hiv_f = ifelse(hivstatus == "Negative", "neg",
                        ifelse(hivstatus == "Positive", "pos",
                               "unk"))) %>%
  ## HIV indicators, with HIV+ as the reference group
  mutate(hiv_neg_pos = ifelse(hiv_f == "neg", 1, 0),
         hiv_unk_pos = ifelse(hiv_f == "unk", 1, 0)) %>%
  group_by(cluster_id) %>%
  ## relative time, converted from days to years
  mutate(rel_time = as.numeric(rel_time / 365)) %>%
  mutate(cluster_size = n()) %>%
  ungroup() %>%
  ## collapse race into three levels, with White as the reference group
  mutate(race_f = fct_collapse(race,
                               white = "White",
                               black = "Black or African American",
                               asian = "Asian")) %>%
  mutate(race_asian_white = ifelse(race_f == "asian", 1, 0),
         race_black_white = ifelse(race_f == "black", 1, 0)) %>%
  select(cluster_id, smear,
         hiv_neg_pos,
         hiv_unk_pos,
         rel_time,
         race_asian_white,
         race_black_white,
         cluster_size)
```
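As a quick check (an illustrative snippet, not part of the original analysis), we can confirm the formatted data match the description above: 159 clusters of size 1 to 25.

```r
## one row per cluster, then summarize the size distribution
clusters %>%
  distinct(cluster_id, cluster_size) %>%
  summarize(n_clusters = n(),
            min_size = min(cluster_size),
            max_size = max(cluster_size))
```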
We analyze the following series of nested models:

- Model 1: \(\text{logit}(p_i) = \beta_0\)
- Model 2: \(\text{logit}(p_i) = \beta_0 + \beta_1 x_{i,\text{smear pos}}\)
- Model 3: \(\text{logit}(p_i) = \beta_0 + \beta_1 x_{i,\text{smear pos}} + \beta_2 x_{i,\text{HIV neg}} + \beta_3 x_{i,\text{HIV unk}}\)
- Model 4: \(\text{logit}(p_i) = \beta_0 + \beta_1 x_{i,\text{smear pos}} + \beta_2 x_{i,\text{HIV neg}} + \beta_3 x_{i,\text{HIV unk}} + \beta_4 x_{i,\text{rel. time}}\)
- Model 5: \(\text{logit}(p_i) = \beta_0 + \beta_1 x_{i,\text{smear pos}} + \beta_2 x_{i,\text{HIV neg}} + \beta_3 x_{i,\text{HIV unk}} + \beta_4 x_{i,\text{rel. time}} + \beta_5 x_{i,\text{race Asian}} + \beta_6 x_{i,\text{race Black}}\)
We first fit our base model described in the model overview, where we assume that the infections within a cluster can be traced back to a root individual within the cluster.
We use the code below to fit each of the models, using \(K = 1000\) Monte Carlo (MC) samples for each cluster in the data. We then report the log likelihood and AIC for each model. Note that for the full results we use \(K = 10000\), which takes about three hours to run on a PC.
```r
K <- 1000
my_seed <- 6172020
set.seed(my_seed)

## Covariate sets for the five nested models
covariate_list <- vector(mode = "list", length = 5)
covariate_list[[1]] <- NA
covariate_list[[2]] <- "smear"
covariate_list[[3]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos")
covariate_list[[4]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos",
                         "rel_time")
covariate_list[[5]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos",
                         "rel_time",
                         "race_asian_white",
                         "race_black_white")

## Set up outputs
loglike_df <- data.frame(model = 1:length(covariate_list),
                         n_params = c(1, 2, 4, 5, 7),
                         loglike = 0,
                         aic = 0)
beta_mat1 <- matrix(0, nrow = 1, ncol = 4)
rownames(beta_mat1) <- c("Intercept")
colnames(beta_mat1) <- c("Est.", "lower", "upper", "SE")
beta_list <- vector(mode = "list", length = length(covariate_list))
beta_list[[1]] <- beta_mat1
for (ii in 2:length(covariate_list)) {
  mat <- matrix(0, nrow = length(covariate_list[[ii]]) + 1,
                ncol = 4)
  rownames(mat) <- c("Intercept", covariate_list[[ii]])
  colnames(mat) <- c("Est.", "lower", "upper", "SE")
  beta_list[[ii]] <- mat
}

t_init <- proc.time()[3]

## Sample all MC trees at once; the covariates of the largest model
## cover all five models
t0 <- proc.time()[3]
mc_trees <- sample_mc_trees(clusters,
                            B = K,
                            multiple_outside_transmissions = FALSE,
                            covariate_names = covariate_list[[length(covariate_list)]])
print(proc.time()[3] - t0)
```
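Before fitting, we can sanity-check the sampled trees (an illustrative snippet; the exact columns of `mc_trees` depend on the package's internals):

```r
dim(mc_trees)   # the number of rows grows with B = K samples per cluster
head(mc_trees)  # peek at the sampled-tree representation
```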
```r
## Fit each of the models
for (jj in 1:length(covariate_list)) {
  covariate_names <- covariate_list[[jj]]
  print("Model:")
  print(covariate_names)
  if (is.na(covariate_names[1])) {
    init_params <- 0  # intercept-only model
  } else {
    init_params <- rep(0, length(covariate_names) + 1)
  }

  ## Optimize over a bounded box (tighter bounds for the largest model)
  print("Optimizing")
  bds <- rep(-5, length(init_params))
  if (length(covariate_names) > 5) {
    bds <- rep(-4, length(init_params))
  }
  lower_bds <- bds
  upper_bds <- -bds
  cov_mat <- covariate_df_to_mat(mc_trees,
                                 cov_names = covariate_names)
  t1 <- proc.time()[3]
  best_params <- optim(par = init_params,
                       fn = general_loglike,
                       mc_trees = data.table::as.data.table(mc_trees),
                       return_neg = TRUE,
                       cov_mat = cov_mat,
                       cov_names = covariate_names,
                       use_outsider_prob = FALSE,
                       multiple_outside_transmissions = FALSE,
                       method = "L-BFGS-B",
                       lower = lower_bds,
                       upper = upper_bds,
                       hessian = TRUE)
  t2 <- proc.time()[3] - t1
  print(paste("Optimization time:", round(t2 / 60, 3),
              "min"))
  beta_list[[jj]][, 1] <- best_params$par
  beta_list[[jj]][, 4] <- sqrt(diag(solve(best_params$hessian)))  ## SE from Fisher info
  print("best params:")
  print(beta_list[[jj]])
  loglike_df$loglike[jj] <- -best_params$value
  print(paste("Total time:", round((proc.time()[3] - t_init) / 3600, 3),
              "hrs"))
}
```
```r
library(knitr)
library(kableExtra)

loglike_df <- loglike_df %>%
  mutate(aic = -loglike + 2 * n_params,
         model = 1:5) %>%
  select(model, everything())

loglike_df %>%
  kable(digits = 2,
        col.names = c("Model", "# params.", "Log like.", "AIC")) %>%
  kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"),
                full_width = FALSE, position = "center")
```
| Model | # params. | Log like. | AIC |
|---|---|---|---|
| 1 | 1 | -409.40 | 411.40 |
| 2 | 2 | -409.26 | 413.26 |
| 3 | 4 | -407.46 | 415.46 |
| 4 | 5 | -396.28 | 406.28 |
| 5 | 7 | -395.93 | 409.93 |
Note that the log likelihood increases with the model number, as it should since the models are nested. The best model according to AIC is Model 4, which corresponds to the variables `smear`, `hiv_neg_pos`, `hiv_unk_pos`, and `rel_time`.
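As a quick illustrative check (not part of the vignette's reported output), we can rank the models by their AIC difference from the best model:

```r
loglike_df %>%
  mutate(delta_aic = aic - min(aic)) %>%  # 0 for the AIC-best model
  arrange(delta_aic)
```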
The estimated parameters for this model are:
```r
beta_list[[4]] %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"),
                full_width = FALSE, position = "center")
```
| | Est. | lower | upper | SE |
|---|---|---|---|---|
| Intercept | -0.76 | 0 | 0 | 0.37 |
| smear | -0.10 | 0 | 0 | 0.31 |
| hiv_neg_pos | -0.32 | 0 | 0 | 0.34 |
| hiv_unk_pos | -0.52 | 0 | 0 | 0.48 |
| rel_time | 0.35 | 0 | 0 | 0.07 |
where `lower` is the lower boundary of the 95% likelihood profiling CI and `upper` is the upper boundary. The variable `SE` is the estimated standard error, computed from the Hessian returned by the optimization as an estimate of the Fisher information. Using these boundaries, we see that HIV- compared to HIV+ and relative time are both significant at the \(\alpha = .05\) level, because 0 is not included in the likelihood profiling CI.
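For comparison with the profiling CIs, here is a minimal sketch of Wald-style 95% intervals built from the Hessian-based SEs (illustrative only; these are not the vignette's reported intervals):

```r
est <- beta_list[[4]][, "Est."]
se  <- beta_list[[4]][, "SE"]
## normal-approximation interval: estimate +/- 1.96 * SE
wald_ci <- cbind(lower = est - 1.96 * se,
                 upper = est + 1.96 * se)
wald_ci
```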
Fitting the multiple outside transmissions (MOT) model is as easy as fitting the base model: we only need to set `multiple_outside_transmissions = TRUE` in `sample_mc_trees()` and in the `general_loglike()` call passed to `optim()`. We fit the same five models as above. Here we again use \(K = 1000\) MC samples, but for our full results we use \(K = 10000\).
```r
K <- 1000
my_seed <- 24
set.seed(my_seed)

## Covariate sets for the five nested models
covariate_list <- vector(mode = "list", length = 5)
covariate_list[[1]] <- NA
covariate_list[[2]] <- "smear"
covariate_list[[3]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos")
covariate_list[[4]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos",
                         "rel_time")
covariate_list[[5]] <- c("smear",
                         "hiv_neg_pos", "hiv_unk_pos",
                         "rel_time",
                         "race_asian_white",
                         "race_black_white")

## Set up outputs
loglike_df <- data.frame(model = 1:length(covariate_list),
                         n_params = c(1, 2, 4, 5, 7),
                         loglike = 0,
                         aic = 0)
beta_mat1 <- matrix(0, nrow = 1, ncol = 4)
rownames(beta_mat1) <- c("Intercept")
colnames(beta_mat1) <- c("Est.", "lower", "upper", "SE")
beta_list <- vector(mode = "list", length = length(covariate_list))
beta_list[[1]] <- beta_mat1
for (ii in 2:length(covariate_list)) {
  mat <- matrix(0, nrow = length(covariate_list[[ii]]) + 1,
                ncol = 4)
  rownames(mat) <- c("Intercept", covariate_list[[ii]])
  colnames(mat) <- c("Est.", "lower", "upper", "SE")
  beta_list[[ii]] <- mat
}

t_init <- proc.time()[3]

## Sample all MC trees at once, now under the MOT model
t0 <- proc.time()[3]
mc_trees <- sample_mc_trees(clusters,
                            B = K,
                            multiple_outside_transmissions = TRUE,
                            covariate_names = covariate_list[[length(covariate_list)]])
print(proc.time()[3] - t0)
```
```r
## Fit each of the models
for (jj in 1:length(covariate_list)) {
  covariate_names <- covariate_list[[jj]]
  print("Model:")
  print(covariate_names)
  if (is.na(covariate_names[1])) {
    init_params <- 0  # intercept-only model
  } else {
    init_params <- rep(0, length(covariate_names) + 1)
  }

  ## Optimize over a bounded box (tighter bounds for the largest model)
  print("Optimizing")
  bds <- rep(-5, length(init_params))
  if (length(covariate_names) > 5) {
    bds <- rep(-4, length(init_params))
  }
  lower_bds <- bds
  upper_bds <- -bds
  cov_mat <- covariate_df_to_mat(mc_trees,
                                 cov_names = covariate_names)
  t1 <- proc.time()[3]
  best_params <- optim(par = init_params,
                       fn = general_loglike,
                       mc_trees = data.table::as.data.table(mc_trees),
                       return_neg = TRUE,
                       cov_mat = cov_mat,
                       cov_names = covariate_names,
                       use_outsider_prob = FALSE,
                       multiple_outside_transmissions = TRUE,
                       method = "L-BFGS-B",
                       lower = lower_bds,
                       upper = upper_bds,
                       hessian = TRUE)
  t2 <- proc.time()[3] - t1
  print(paste("Optimization time:", round(t2 / 60, 3),
              "min"))
  beta_list[[jj]][, 1] <- best_params$par
  beta_list[[jj]][, 4] <- sqrt(diag(solve(best_params$hessian)))  ## SE from Fisher info
  print("best params:")
  print(beta_list[[jj]])
  loglike_df$loglike[jj] <- -best_params$value
  print(paste("Total time:", round((proc.time()[3] - t_init) / 3600, 3),
              "hrs"))
}
```
```r
loglike_df <- loglike_df %>%
  mutate(aic = -loglike + 2 * n_params,
         model = 1:5) %>%
  select(model, everything())

loglike_df %>%
  kable(digits = 2,
        col.names = c("Model", "# params.", "Log like.", "AIC")) %>%
  kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"),
                full_width = FALSE, position = "center")
```
| Model | # params. | Log like. | AIC |
|---|---|---|---|
| 1 | 1 | -473.20 | 475.20 |
| 2 | 2 | -473.06 | 477.06 |
| 3 | 4 | -467.33 | 475.33 |
| 4 | 5 | -449.05 | 459.05 |
| 5 | 7 | -449.02 | 463.02 |
Again, the log likelihood increases with the model number, as it should since the models are nested. The best model according to AIC is Model 4, which corresponds to the variables `smear`, `hiv_neg_pos`, `hiv_unk_pos`, and `rel_time`. The estimated parameters for this model are:
```r
beta_list[[4]] %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"),
                full_width = FALSE, position = "center")
```
| | Est. | lower | upper | SE |
|---|---|---|---|---|
| Intercept | 0 | 0 | 0 | 0.48 |
| smear | 0 | 0 | 0 | 0.41 |
| hiv_neg_pos | 0 | 0 | 0 | 0.33 |
| hiv_unk_pos | 0 | 0 | 0 | 0.76 |
| rel_time | 0 | 0 | 0 | 0.08 |
where `lower` is the lower boundary of the 95% likelihood profiling CI and `upper` is the upper boundary. The variable `SE` is the estimated standard error, computed from the Hessian returned by the optimization as an estimate of the Fisher information. Using these boundaries, we see that HIV- compared to HIV+, HIV unknown compared to HIV+, and relative time are all significant at the \(\alpha = .05\) level, because 0 is not included in the likelihood profiling CI. Here smear also appears significant, but this significance is lost when we use \(K = 10000\) MC samples instead of \(K = 1000\).
We see that the standard errors are large compared to the widths of the likelihood profiling CIs. If we multiply a standard error by \(2 \times 1.96\), the width of a 95% CI for a normally distributed variable, the resulting width is much larger than the likelihood profiling estimate.
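For instance (an illustrative snippet using the Model 4 fit from above):

```r
## width of a normal-approximation 95% CI for each coefficient
wald_width <- 2 * 1.96 * beta_list[[4]][, "SE"]
wald_width
```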
To see which estimate is closer to the truth, we can bootstrap our data to obtain a second standard error and a third CI estimate. We provide the function `bootstrap_clusters()` to resample clusters from our data; the analysis can then be repeated on these bootstrap data sets. Below, we see that the resampled data set contains the same number of clusters as the original data but a different number of individuals than the original 389.
```r
bootstrap_data <- bootstrap_clusters(clusters)
dim(bootstrap_data)
## [1] 371 9
```
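A minimal sketch of the full bootstrap procedure (assumptions: we refit Model 4 on each replicate by repeating the `sample_mc_trees()` and `optim()` steps above; `B_boot` and the placeholder loop body are illustrative):

```r
B_boot <- 100
boot_est <- matrix(NA, nrow = B_boot, ncol = 5)  # Model 4 has 5 parameters
for (b in 1:B_boot) {
  boot_data <- bootstrap_clusters(clusters)
  ## ... repeat the sample_mc_trees() + optim() steps above on boot_data,
  ## ... then store the fitted coefficients:
  ## boot_est[b, ] <- best_params$par
}
## bootstrap SEs: apply(boot_est, 2, sd)
```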