Overview

In this vignette, we show how to run simulations that assess how well our base model performs.

  1. We briefly review how the data are generated.

  2. We provide code (but do not run it) for a large-scale simulation to check whether our model works.

Data simulation

Specifically, we generate data under our branching process model described in the modeling section, where the probability of onward transmission \(p_i\) is a function of a single binary covariate \(X_i \sim \text{Bernoulli}\left(\gamma\right)\) and \[ p_i = \text{logit}^{-1}\left(\beta_0 + \beta_1 X_i \right), \] where \(\gamma\), \(\beta_0\), and \(\beta_1\) are fixed parameters. Here we assume the base model, in which every transmission within a cluster can be traced directly back to an observed individual within the cluster.
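
As a quick illustration of the covariate model, the per-individual transmission probability can be computed with plogis(), R's built-in inverse logit. This is a minimal sketch with illustrative values, not the package implementation:

## Minimal sketch (not package code): transmission probability for one individual
gamma <- 0.5                               # illustrative value of P(X_i = 1)
x_i <- rbinom(1, size = 1, prob = gamma)   # binary covariate X_i
p_i <- plogis(-2 + 1 * x_i)                # logit^{-1}(beta_0 + beta_1 X_i), with beta_0 = -2 and beta_1 = 1
p_i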

Below, we simulate a single data set from the branching process and then summarize the resulting clusters by size.

set.seed(2020)
## Generate possible covariates 
sample_covariates_df <- data.frame(x = c(1,0))

beta0 <- -2   # intercept on the logit scale
beta1 <- 1    # effect of the binary covariate
M <- 1000     # number of clusters to simulate
data <- simulate_bp(K = M,
                    inf_params = c(beta0, beta1),
                    sample_covariates_df = sample_covariates_df,
                    covariate_names = "x",
                    max_size = 50)

data %>%
  group_by(cluster_id) %>%
  summarize(cluster_size = n()) %>%
  ungroup() %>%
  group_by(cluster_size) %>%
  summarize(freq = n()) %>%
  ungroup() %>%
  kable(type = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")
cluster_size freq
1 818
2 118
3 38
4 9
5 11
6 3
7 2
13 1

A single binary covariate is a special case for our model because the likelihood for a single transmission tree reduces to \[ L(T;C) = (1-p_+)^{n_+}(1-p_-)^{n_-}p_+^{N_+}p_-^{N_-}, \] where \(p_+ = \text{logit}^{-1}(\beta_0 + \beta_1)\), \(p_- = \text{logit}^{-1}(\beta_0)\), \(n_+\) is the number of positive-\(X_i\) individuals, \(n_-\) is the number of negative-\(X_i\) individuals, \(N_+\) is the number of transmissions from positive individuals, and \(N_-\) is the number of transmissions from negative individuals. As a consequence, the observed clusters can be stored as the following table of cluster size, number of positive individuals (\(x_{pos}\)), and number of negative individuals (\(x_{neg}\)), along with the frequency of each combination, instead of as the 1304-row data frame we sampled.
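
To make this reduction concrete, below is a minimal sketch (an illustration only, not the package's bp_loglike_binary_cov()) of the log of \(L(T;C)\) for a single tree given the four counts:

## Sketch: log-likelihood of one tree under the binary-covariate reduction
single_tree_loglike <- function(beta0, beta1, n_pos, n_neg, N_pos, N_neg) {
  p_pos <- plogis(beta0 + beta1)   # logit^{-1}(beta_0 + beta_1)
  p_neg <- plogis(beta0)           # logit^{-1}(beta_0)
  n_pos * log(1 - p_pos) + n_neg * log(1 - p_neg) +
    N_pos * log(p_pos) + N_neg * log(p_neg)
}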

binary_cluster_summaries <- summarize_binary_clusters(data)
binary_cluster_summaries %>%
  kable(type = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                position = "center", full_width = FALSE)
freq cluster_size x_pos x_neg
461 1 0 1
357 1 1 0
27 2 0 2
64 2 1 1
27 2 2 0
4 3 0 3
9 3 1 2
15 3 2 1
10 3 3 0
1 4 0 4
2 4 1 3
3 4 2 2
3 4 3 1
1 5 1 4
3 5 2 3
5 5 3 2
2 5 4 1
2 6 3 3
1 6 4 2
1 7 2 5
1 7 5 2
1 13 9 4

This consequence extends to Monte Carlo sampling of trees as well: we do not have to store the individual sampled trees, only a summary of them.
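
For a fixed cluster composition, a sampled tree only enters the likelihood through how many of its cluster_size - 1 within-cluster transmissions came from positive individuals, so the sampled trees can simply be tabulated. A toy illustration (a hypothetical set of sampled trees, not the package's sampler; assumes dplyr is loaded, as elsewhere in this vignette):

## Toy illustration: five hypothetical sampled trees for a cluster of size 4,
## tabulated by the number of transmissions from positive individuals;
## x_neg_trans is then (cluster_size - 1) - x_pos_trans
toy_trees <- data.frame(x_pos_trans = c(2, 1, 2, 3, 2))
toy_trees %>%
  count(x_pos_trans, name = "mc_freq") %>%
  mutate(x_neg_trans = 3 - x_pos_trans)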

We sample \(B = 1000\) MC trees for each row of the summarized cluster data. The MC trees for the cluster with 9 positive and 4 negative individuals are summarized as follows:

B <- 1000
mc_trees <- sample_mc_binary_cov(B = B,
                                 observed_cluster_summaries = binary_cluster_summaries)

mc_trees %>%
  filter(cluster_size == 13) %>%
  kable(type = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                position = "center", full_width = FALSE)
freq cluster_size x_pos x_neg mc_freq x_pos_trans x_neg_trans
1 13 9 4 2 0 12
1 13 9 4 1 1 11
1 13 9 4 1 2 10
1 13 9 4 7 3 9
1 13 9 4 19 4 8
1 13 9 4 53 5 7
1 13 9 4 101 6 6
1 13 9 4 161 7 5
1 13 9 4 185 8 4
1 13 9 4 178 9 3
1 13 9 4 166 10 2
1 13 9 4 109 11 1
1 13 9 4 17 12 0

We can then estimate the MLE from the MC sampled trees with our likelihood function bp_loglike_binary_cov(), the binary-covariate counterpart to the general-purpose bp_loglike_general().

inf_params_0 <- c(0, 0)  # starting values for (beta_0, beta_1)

best_params <- optim(inf_params_0,
                     fn = bp_loglike_binary_cov,
                     mc_samples_summary = mc_trees %>%
                       dplyr::select(-freq),
                     obs_data_summary = mc_trees %>%
                       group_by(cluster_size, x_pos, x_neg) %>%
                       summarize(freq = sum(freq)),
                     return_neg = TRUE,
                     method = "L-BFGS-B",
                     lower = c(-5, -5),
                     upper = c(5, 5),
                     hessian = TRUE)

round(best_params$par, 2)
## [1] -1.27  0.78
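
As an optional, rough check before profiling (an aside, not part of the package workflow), Wald-type intervals can be read off the Hessian returned by optim(); since the objective was the negative log-likelihood, the inverse Hessian approximates the covariance of the estimates:

## Rough Wald-type 95% intervals from the observed information (sketch only)
se <- sqrt(diag(solve(best_params$hessian)))
round(best_params$par + 1.96 * cbind(lower = -se, upper = se), 2)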

We can use likelihood profiling to get individual \((1-\alpha)\)% CIs for our estimates.
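
For reference, the profile CI for one coefficient keeps every value whose profiled log-likelihood stays within the usual chi-square cutoff of the maximum; below is a minimal sketch of that cutoff (the binary_likelihood_profs() call that follows performs the actual profiling):

## Standard chi-square cutoff for a (1 - alpha) profile-likelihood CI:
## keep values whose profiled log-likelihood is within `cutoff` of the maximum
alpha <- 0.05
cutoff <- qchisq(1 - alpha, df = 1) / 2
cutoff  # approximately 1.92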

lp_ests <- binary_likelihood_profs(best_pars = best_params$par,
                                   max_loglike = -best_params$value,
                                   mc_samples_summary = mc_trees %>%
                                     select(-freq),
                                   obs_data_summary = mc_trees %>%
                                     group_by(cluster_size, x_pos, x_neg) %>%
                                     summarize(freq = sum(freq)),
                                   alpha = .05,
                                   multiple_outside_transmissions = FALSE)

lp_ests %>%
  as.data.frame() %>%
  kable(type = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                position = "center", full_width = FALSE)
est. lower upper
-1.27 -1.34 -1.19
0.78 0.67 0.88

The CIs for these estimates include neither \(\beta_0 = -2\) nor \(\beta_1 = 1\), but this is not unexpected, as we are looking at the MLE from a single data set.