Overview

In this vignette, we show to perform simulations to assess how well our multiple outside transmission model with a single binary covariate performs..

  1. We briefly review how data is generated.

  2. Provide code (but do not run) for a large-scale simulation to check whether our model works.

Data simulation

Specifically, we generate data under our branching process model described in the modeling section where the probability of onward transmission \(p_i\) is a function of a single binary covariate \(X_i \sim Binomial \left (\gamma \right )\) and \[ p_i = logit^{-1}\left (\beta_0 + \beta_1 X_i \right ), \] where \(\gamma\), \(\beta_0\), and \(\beta_1\) are fixed parameters. Here, we are assuming we are working with the base model where all transmissions within a cluster can be directly traced back to an observed individual within the cluster.

Below is how we simulate a single data set from the branching process. We have summarized the clusters by size.

set.seed(2020)
## Generate possible covariates 
sample_covariates_df <- data.frame(x = c(1,0))

beta0 <- -2
beta1 <- 1
M <- 1000
data <- simulate_bp(K = M,
                                  inf_params = c(beta0, beta1),
                                  sample_covariates_df = sample_covariates_df,
                                  covariate_names = "x",
                                  max_size = 50)

Now the above are clusters with the outsider individual. To mimic realistic multiple outside transmission data, we simply exclude the individuals in generation one and shrink the cluster size by 1. This means we no longer observe 1000 clusters.

data <- data %>% filter(gen != 1) %>%
  mutate(cluster_size = cluster_size - 1)

 data %>% group_by(cluster_id) %>%
  summarize(cluster_size = n()) %>%
  ungroup() %>%
   group_by(cluster_size) %>%
   summarize(freq = n()) %>%
   ungroup() %>%
  kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F,
                 position = "center")
cluster_size freq
1 118
2 38
3 9
4 11
5 3
6 2
12 1

The case where there is a single binary covariate is a special case for our model because the likelihood for a single transmission tree reduces to \[ L(T;C) = (1-p_+)^{n_+}(1-p_-)^{n_-}p_+^{N_+}p_-^{N_-}, \] where \(p_+ = logit^{-1}(\beta_0 + \beta_1)\), \(p_- = logit^{-1}(\beta_0)\), \(n_+\) is the number of positive \(X_i\) individuals, \(n_-\) is the number of negative \(X_i\) individuals, \(N_+\) is the number of transmissions from positive individuals and \(N_-\) is the number of transmissions from negative individuals. The consequence of this is that the observed clusters may be stored as the following table with the cluster size, number of positive individuals (\(x_{pos})\), and number of negative individuals (\(x_{neg}\)), along with their frequency instead of a 304 row data frame, like the one we sampled.

The consequence of this also extends to Monte Carlo sampling of trees. We do not have to store the individual trees, but only need a summary of them.

We sample \(B= 1000\) MC trees for each row in the summarized clusters data. This step will take a few minutes.

B <- 1000
mc_trees <- sample_mc_trees(B = B,
                                 observed_data = data,
                            multiple_outside_transmissions = TRUE)

mc_trees %>% filter(cluster_size == 6) %>% head() %>%
     kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                 position = "center", full_width = FALSE)
gen n_in_gen inf_id id n_inf cluster_size cluster_id x orig_id
2 1 1-1 2-1 0 6 n6-g2-1 1 126
2 2 1-1 2-2 0 6 n6-g2-1 0 126
2 3 1-1 2-3 0 6 n6-g2-1 1 126
2 4 1-1 2-4 0 6 n6-g2-1 1 126
2 5 1-1 2-5 0 6 n6-g2-1 0 126
1 1 NA 1-1 5 6 n6-g2-1 1 126

We can then estimate the MLE from the MC sampled trees with our specialized likelihood function bp_loglike_binary_cov() which is the specialized counterpart to bp_loglike_general().

inf_params_0 <- c(0,0)

best_params <- optim(inf_params_0, fn = general_loglike,
                      mc_trees = mc_trees,
                                  return_neg = TRUE,
                     multiple_outside_transmissions = TRUE,
                     cov_names = "x",
                     method = "L-BFGS-B",
                     lower = c(-5, -5),
                     upper = c(5, 5),
                     hessian = TRUE)

round(best_params$par, 2)
## [1] -2.64  1.86