vignettes/not-built-vignettes/tb-data-base-pos-imputation.Rmd
tb-data-base-pos-imputation.Rmd
In this vignette, we repeat the prior analysis but impute the smear covariates to be positive instead.
Briefly recap the data used in the analysis
Fit a series of nested models to the data using our base model
Repeat the model fitting with the Multiple Outside Transmission model
Examine standard errors
The data can be accessed using data(tb_clean)
. The clusters are uniquely ID’d with the variable group
The data consists of 159 clusters where each cluster contains between 1 and 25 individuals. These individuals have covariates corresponding to their smear status (+/-/NA), HIV status (+/-/unknown), date of sputum collection, and race (Asian/Black/White). Please see ?tb_clean
for more information.
There are only individuals who have smear status NA, and in this analysis we impute those to be smear -. We transform date of sputum collection into a variable we call relative time, which is the time in years between the reference sputum collection date and the first observed sputum within a cluster. Therefore, all relative time values of singleton clusters will have the value 0.
We consider HIV status and race to be categorical variables and use “HIV+” and “White” as the reference groups, respectively.
We use the below code to format tb_clean
to use in our model fitting.
clusters <- tb_clean %>% dplyr::mutate(smear = ifelse(spsmear == "Positive", 1, 0), cluster_id = group, hiv_f = hivstatus) %>% dplyr::mutate(hiv_neg_pos = ifelse(hiv_f == "neg", 1, 0), hiv_unk_pos = ifelse(hiv_f == "unk", 1, 0)) %>% dplyr::group_by(cluster_id) %>% dplyr::mutate(rel_time = rel_time / 365) %>% dplyr::mutate(cluster_size = dplyr::n()) %>% dplyr::ungroup() %>% mutate(smear = ifelse(spsmear == "Negative", 0, 1), cluster_id = group) %>% mutate(hiv_neg_pos = ifelse(hiv_f == "neg", 1, 0), hiv_unk_pos = ifelse(hiv_f == "unk", 1, 0)) %>% group_by(cluster_id) %>% mutate(cluster_size = n()) %>% ungroup() %>% mutate(race_f = fct_collapse(race, white = "White", black = "Black or African American", asian = "Asian")) %>% mutate(race_asian_white = ifelse(race_f == "asian", 1, 0), race_black_white = ifelse(race_f == "black", 1, 0)) %>% select(cluster_id, smear, hiv_neg_pos, hiv_unk_pos, rel_time, race_asian_white, race_black_white, cluster_size)
We analyze the series of nested models:
Model 1: \(logit(p_i) = \beta_0\)
Model 2: \(logit(p_i) = \beta_0 + \beta_{1}x_{i,{smear pos}}\)
Model 3: \(logit(p_i) = \beta_0 + \beta_{1}x_{i,smear pos} + \beta_{2}x_{i,{HIV neg}}+ \beta_{3}x_{i,{HIV unk}}\)
Model 4: \(logit(p_i) = \beta_0 + \beta_{1}x_{i,{smear pos}} + \beta_{2}x_{i,{HIV neg}} + \beta_{3}x_{i,{HIV unk}} + \beta_{4}x_{i,{rel. time}}\)
Model 5: \(logit(p_i) = \beta_0 + \beta_{1}x_{i,{smear pos}} + \beta_{2}x_{i,{HIV neg}} + \beta_{3}x_{i,{HIV unk}} + \beta_{4}x_{i,{rel. time}} + \beta_{5}x_{i,{race Asian}} + \beta_{6}x_{i,{race Black}}\)
We first fit our base model described in the model overview, where we assume that the infections within a cluster can be traced back to a root individual within the cluster.
We use the below code to fit each of the models where we use \(K=1000\) MC samples for each cluster in the data. We then report the log likelihood and AIC for each of the models.
Note that for the full results, we use \(K=10000\), which takes about ~3 hour to run on a PC.
K <- 1000 my_seed <- 42 set.seed(my_seed) ## MODELS ## models covariate_list <- vector(mode = "list", length = 5) covariate_list[[1]] <- NA covariate_list[[2]] <- "smear" covariate_list[[3]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos") covariate_list[[4]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos", "rel_time") covariate_list[[5]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos", "rel_time", "race_asian_white", "race_black_white") ## Set up outputs loglike_df <- data.frame(model = 1:length(covariate_list), n_params = c(1, 2, 4, 5, 7), loglike = 0, aic = 0) beta_mat1 <- matrix(0, nrow = 1, ncol = 4) rownames(beta_mat1) <- c("Intercept") colnames(beta_mat1) <- c("Est.", "lower", "upper", "SE") beta_list <- vector(mode = "list", length = length(covariate_list)) beta_list[[1]] <- beta_mat1 for(ii in 2:length(covariate_list)){ mat <- matrix(0, nrow = length(covariate_list[[ii]]) + 1, ncol = 4) rownames(mat) <- c("Intercept", covariate_list[[ii]]) colnames(mat) <- c("Est.", "lower", "upper", "SE") beta_list[[ii]] <- mat }
t_init <- proc.time()[3] ## Sample MC trees all at once t0 <- proc.time()[3] ## Sample all MC clusters at once for each of the 5 models mc_trees <- sample_mc_trees(clusters, B = K, multiple_outside_transmissions = FALSE, covariate_names = covariate_list[[length(covariate_list)]]) print(proc.time()[3] - t0) ## Fit each of the models for(jj in 1:length(covariate_list)){ covariate_names <- covariate_list[[jj]] print("Model:") print(covariate_names) if(is.na(covariate_names[1])){ init_params <- 0 } else{ init_params <- rep(0, length(covariate_names) + 1) } ## Optimize print("Optimizing") bds <- rep(-5, length(init_params)) if(length(covariate_names) > 5){ bds <- rep(-4, length(init_params)) } lower_bds <- bds upper_bds <- -bds cov_mat <- covariate_df_to_mat(mc_trees, cov_names = covariate_names) t1 <- proc.time()[3] best_params <- optim(par = init_params, fn = general_loglike, mc_trees = data.table::as.data.table(mc_trees), return_neg = TRUE, cov_mat = cov_mat, cov_names = covariate_names, use_outsider_prob = FALSE, multiple_outside_transmissions = FALSE, method = "L-BFGS-B", lower = lower_bds, upper = upper_bds, hessian = TRUE ) t2 <- proc.time()[3] - t1 print(paste("Optimization time:", round( t2 / 60, 3), "min")) beta_list[[jj]][, 4] <- sqrt(diag(solve(best_params$hessian))) ## SE from Fisher info ## Likelihood Profile ests print("Likelihood profiling") lp_ests <- likelihood_profs(best_pars = best_params$par, max_loglike = -best_params$value, mc_trees = mc_trees, covariate_names = covariate_names, alpha = .05, multiple_outside_transmissions = FALSE) beta_list[[jj]][,-4] <- lp_ests print("best params:") print(beta_list[[jj]]) loglike_df$loglike[jj] <- -best_params$val print(paste("Total time:", round( (proc.time()[3] - t_init) / 3600, 3), "hrs")) }
loglike_df <- loglike_df %>% mutate(aic = -loglike + 2 * n_params, model = 1:5) %>% select(model, everything()) loglike_df %>% kable(digits = 2, col.names = c("Model", "# params.", "Log like.", "AIC")) %>% kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"), full_width = FALSE, position = "center")
Model | # params. | Log like. | AIC |
---|---|---|---|
1 | 1 | -409.40 | 411.40 |
2 | 2 | -409.23 | 413.23 |
3 | 4 | -407.44 | 415.44 |
4 | 5 | -396.22 | 406.22 |
5 | 7 | -395.82 | 409.82 |
Note that log likelihood increases as the model number increases, which should be the case since the models are nested. The best model according to AIC is model 4 which corresponds to the variables smear, hiv_neg_pos, hiv_unk_pos, rel_time.
The estimated paramters for this model are
beta_list[[4]] %>% kable(digits = 2) %>% kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"), full_width = FALSE, position = "center")
Est. | lower | upper | SE | |
---|---|---|---|---|
Intercept | -0.72 | -0.89 | -0.55 | 0.38 |
smear | -0.11 | -0.37 | 0.13 | 0.31 |
hiv_neg_pos | -0.36 | -0.59 | -0.14 | 0.36 |
hiv_unk_pos | -0.56 | -1.41 | 0.06 | 0.49 |
rel_time | 0.35 | 0.27 | 0.42 | 0.07 |
where lower
is the lower boundary for 95% likelihood profiling CI and upper
is the upper boundary. The variable SE
is the estimated standard error using the Hessian from the optimization process as an estimate for the Fisher Information. Using these boundaries, we see that HIV- compared to HIV+ and relative time are both significant at the \(\alpha = .05\) level because 0 is not included in the likelihood profiling CI.
Fitting the multiple outside transmissions (MOT) model is as easy as fitting with the base model, we only need to change one argument in two different functions. We fit the above 5 models. Here we use \(K=1000\) MC samples but for our full results, we us \(10000\).
K <- 1000 my_seed <- 24 set.seed(my_seed) ## MODELS ## models covariate_list <- vector(mode = "list", length = 5) covariate_list[[1]] <- NA covariate_list[[2]] <- "smear" covariate_list[[3]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos") covariate_list[[4]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos", "rel_time") covariate_list[[5]] <- c("smear", "hiv_neg_pos", "hiv_unk_pos", "rel_time", "race_asian_white", "race_black_white") ## Set up outputs loglike_df <- data.frame(model = 1:length(covariate_list), n_params = c(1, 2, 4, 5, 7), loglike = 0, aic = 0) beta_mat1 <- matrix(0, nrow = 1, ncol = 4) rownames(beta_mat1) <- c("Intercept") colnames(beta_mat1) <- c("Est.", "lower", "upper", "SE") beta_list <- vector(mode = "list", length = length(covariate_list)) beta_list[[1]] <- beta_mat1 for(ii in 2:length(covariate_list)){ mat <- matrix(0, nrow = length(covariate_list[[ii]]) + 1, ncol = 4) rownames(mat) <- c("Intercept", covariate_list[[ii]]) colnames(mat) <- c("Est.", "lower", "upper", "SE") beta_list[[ii]] <- mat }
t_init <- proc.time()[3] ## Sample MC trees all at once t0 <- proc.time()[3] ## Sample all MC clusters at once for each of the 5 models mc_trees <- sample_mc_trees(clusters, B = K, multiple_outside_transmissions = TRUE, covariate_names = covariate_list[[length(covariate_list)]]) print(proc.time()[3] - t0) ## Fit each of the models for(jj in 1:length(covariate_list)){ covariate_names <- covariate_list[[jj]] print("Model:") print(covariate_names) if(is.na(covariate_names[1])){ init_params <- 0 } else{ init_params <- rep(0, length(covariate_names) + 1) } ## Optimize print("Optimizing") bds <- rep(-5, length(init_params)) if(length(covariate_names) > 5){ bds <- rep(-4, length(init_params)) } lower_bds <- bds upper_bds <- -bds cov_mat <- covariate_df_to_mat(mc_trees, cov_names = covariate_names) t1 <- proc.time()[3] best_params <- optim(par = init_params, fn = general_loglike, mc_trees = data.table::as.data.table(mc_trees), return_neg = TRUE, cov_mat = cov_mat, cov_names = covariate_names, use_outsider_prob = FALSE, multiple_outside_transmissions = TRUE, method = "L-BFGS-B", lower = lower_bds, upper = upper_bds, hessian = TRUE ) t2 <- proc.time()[3] - t1 print(paste("Optimization time:", round( t2 / 60, 3), "min")) beta_list[[jj]][, 4] <- sqrt(diag(solve(best_params$hessian))) ## SE from Fisher info ## Likelihood Profile ests print("Likelihood profiling") lp_ests <- likelihood_profs(best_pars = best_params$par, max_loglike = -best_params$value, mc_trees = mc_trees, covariate_names = covariate_names, alpha = .05, multiple_outside_transmissions = TRUE) beta_list[[jj]][,-4] <- lp_ests print("best params:") print(beta_list[[jj]]) loglike_df$loglike[jj] <- -best_params$val print(paste("Total time:", round( (proc.time()[3] - t_init) / 3600, 3), "hrs")) }
loglike_df <- loglike_df %>% mutate(aic = -loglike + 2 * n_params, model = 1:5) %>% select(model, everything()) loglike_df %>% kable(digits = 2, col.names = c("Model", "# params.", "Log like.", "AIC")) %>% kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"), full_width = FALSE, position = "center")
Model | # params. | Log like. | AIC |
---|---|---|---|
1 | 1 | -473.20 | 475.20 |
2 | 2 | -473.03 | 477.03 |
3 | 4 | -467.47 | 475.47 |
4 | 5 | -449.24 | 459.24 |
5 | 7 | -449.20 | 463.20 |
Note that log likelihood increases as the model number increases, which should be the case since the models are nested. The best model according to AIC is model 4 which corresponds to the variables smear, hiv_neg_pos, hiv_unk_pos, rel_time.
The estimated paramters for this model are
beta_list[[4]] %>% kable(digits = 2) %>% kable_styling(bootstrap_options = c("condensed", "hover", "striped", "responsive"), full_width = FALSE, position = "center")
Est. | lower | upper | SE | |
---|---|---|---|---|
Intercept | -1.72 | -1.91 | -1.54 | 0.48 |
smear | 0.54 | 0.30 | 0.77 | 0.41 |
hiv_neg_pos | -0.60 | -0.85 | -0.37 | 0.33 |
hiv_unk_pos | -1.34 | -3.72 | -0.32 | 0.77 |
rel_time | 0.52 | 0.46 | 0.59 | 0.08 |
where lower
is the lower boundary for 95% likelihood profiling CI and upper
is the upper boundary. The variable SE
is the estimated standard error using the Hessian from the optimization process as an estimate for the Fisher Information. Using these boundaries, we see that HIV- compared to HIV+, HIV unknown compared to HIV+, and relative time are all significant at the \(\alpha = .05\) level because 0 is not included in the likelihood profiling CI. Here we see smear is significant, but this is lost when we use \(K= 10000\) MC samples instead of \(K=1000\) MC samples.
We see that the standard errors are large compared to the likelihood profiling CI widths. If we multiplied the standard errors by \(2 \times 1.96\), the width of a 95% CI for a normal distributed variable, then this width is much larger than the likelihood profiling estimate.
To see which estimate is closer to the truth, we can bootstrap our data to get a second standard error and third CI estimate. We provide the function bootstrap_clusters()
to resample clusters from our data. The analysis can be repeated on these bootstrap data sets. Below, we see that our new sampled data set contains the same amount of clusters as the original data but now has a different number of individuals compared to the original 389.
bootstrap_data <- bootstrap_clusters(clusters) dim(bootstrap_data)
## [1] 371 9