Multiple covariate model simulations • InfectionTrees

Overview

In this vignette, we show to perform simulations to assess how well our base model using multiple covariates is performing.

We briefly review how data is generated.
Provide code of how to simulate data from this model and estimate the MLE.

Data simulation

Specifically, we generate data under our branching process model described in the modeling section where the probability of onward transmission \(p_i\) is a function of multiple covariates, a vector \(X_i \sim F\), which are covariates generated from some distribution \(F\) and the probability of transmission is given by \[ p_i = logit^{-1}\left (X_i\beta \right ), \] where \(F\) and \(\beta\) are fixed parameters. Here, we are assuming we are working with the base model where all transmissions within a cluster can be directly traced back to an observed individual within the cluster.

For \(F\), we sample with replacement from a data frame where each row corresponds to a possible set of covariates for an individual.

Below is how we simulate a single data set from the branching process. We have summarized the clusters by size.

set.seed(2020)
## Generate possible covariates 
sample_covariates_df <- data.frame(x = c(1, 1, 0, 0),
                                   y = c(1, 0, 1, 0),
                                   z = c(-1, 0, .5, 1))

beta0 <- -2
beta1 <- 1
beta2 <- .5
beta3 <- -1.2
M <- 1000
data <- simulate_bp(K = M,
                                  inf_params = c(beta0, beta1, beta2, beta3),
                                  sample_covariates_df = sample_covariates_df,
                                  covariate_names = c("x", "y", "z"),
                                  max_size = 50)

 data %>% group_by(cluster_id) %>%
  summarize(cluster_size = n()) %>%
  ungroup() %>%
   group_by(cluster_size) %>%
   summarize(freq = n()) %>%
   ungroup() %>%
  kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F,
                 position = "center")

cluster_size	freq
1	769
2	77
3	39
4	27
5	12
6	9
7	6
8	9
9	3
10	6
11	2
12	6
13	4
14	4
15	1
16	3
17	1
19	2
20	2
21	1
22	1
23	2
25	1
26	2
27	1
28	1
31	1
42	2
46	1
50	5

Unlike the binary covariate case, we need to keep the entire data set in order to compute our likelihood, and so this general covariate is more expensive in terms of both computational time and memory.

We sample \(B= 1000\) MC trees for each row in the summarized clusters data.

B <- 1000
mc_trees <- sample_mc_trees(observed_data = data,
                            B = B,
                            covariate_names = c("x", "y", "z"))

mc_trees %>% head() %>%
     kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                 position = "center", full_width = FALSE)

cluster_size	gen	n_in_gen	id	cluster_id	z	orig_id	inf_id
1	1	1	1-1	n1-g1-1	1	1	NA
1	1	1	1-1	n1-g1-2	1	1	NA
1	1	1	1-1	n1-g1-3	1	1	NA
1	1	1	1-1	n1-g1-4	1	1	NA
1	1	1	1-1	n1-g1-5	1	1	NA
1	1	1	1-1	n1-g1-6	1	1	NA

We can then estimate the MLE from the MC sampled trees with general_loglike(). This will take a minute or two to run.

inf_params_0 <- c(0,0, 0, 0)


best_params <- optim(inf_params_0, fn = general_loglike,
                     mc_trees = mc_trees,
                                  return_neg = TRUE,
                     cov_names = c("x", "y", "z"),
                     method = "L-BFGS-B",
                     lower = c(-4, -4, -4, -4),
                     upper = c(4, 4, 4, 4),
                     hessian = TRUE)

round(best_params$par, 2)

## [1]  0.12 -1.03 -0.54 -1.92

We can use likelihood profiling to get individual \((1-\alpha\))% CIs for our estimates. NOTE: In practice, we have found that likelihood profiling has low coverage and instead prefer bootstrapping to get an estimate of the SE.

lp_ests <- likelihood_profs(best_pars = best_params$par,
                                    max_loglike = -best_params$value,
                                    mc_trees = mc_trees,
                                    covariate_names = c("x", "y", "z"),
                                    alpha = .05,
                                    multiple_outside_transmissions = FALSE)

lp_ests %>% as.data.frame() %>%
      kable(type = "html", digits = 2) %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                 position = "center", full_width = FALSE)

est.	lower	upper
0.12	0.04	0.19
-1.03	-1.12	-0.94
-0.54	-0.63	-0.44
-1.92	-2.02	-1.80

cluster_size	freq
1	769
2	77
3	39
4	27
5	12
6	9
7	6
8	9
9	3
10	6
11	2
12	6
13	4
14	4
15	1
16	3
17	1
19	2
20	2
21	1
22	1
23	2
25	1
26	2
27	1
28	1
31	1
42	2
46	1
50	5

cluster_size	freq
1	769
2	77
3	39
4	27
5	12
6	9
7	6
8	9
9	3
10	6
11	2
12	6
13	4
14	4
15	1
16	3
17	1
19	2
20	2
21	1
22	1
23	2
25	1
26	2
27	1
28	1
31	1
42	2
46	1
50	5

cluster_size	freq
1	769
2	77
3	39
4	27
5	12
6	9
7	6
8	9
9	3
10	6
11	2
12	6
13	4
14	4
15	1
16	3
17	1
19	2
20	2
21	1
22	1
23	2
25	1
26	2
27	1
28	1
31	1
42	2
46	1
50	5