InfectionTrees
is a software package used to sample and analyze branching processes that model the spread of infectious diseases.
Currently, InfectionTrees
is available at https://github.com/skgallagher/InfectionTrees, and can be installed in R
via the following command,
devtools::install_github("skgallagher/InfectionTrees")
Below is a true quick start to simulate data, sample trees based on that data, and finally compute the log likelihood.
The other vignettes cover:
The Model
Simulations
Tuberculosis in Maryland
We can simulate from a branching process to form a data set of transmission trees or clusters (if we remove infector information from the transmission trees). We simulate \(K\) branching processes where the primary infector in each tree is the root of each transmission tree and the probability of infection of another individual is where \(\bf{X}\) is a covariate matrix and \(\beta\) is a vector of coefficients, \[ p_i = logit^{-1} (\bf{X}_i\bf{\beta}). \] Then the number of new infectious individuals infected by individual \(i\) is \[ N_i \sim geometric(p_i) \]
It is possible to produce transmission trees that do not naturally terminate, so a maximal number of infections is used as an argument. The output is a data frame with \(M\) transmission trees which contains the cluster ID, the person ID within a cluster, the generation of the person (i.e. the length of the path from the primary infector to the person + 1), the infector ID (i.e. who infected this person), the number of infections caused directly by this person if any, the total cluster size, the covariates, and an indicator telling us whether the cluster was artificially terminated or not (i.e. censored = TRUE
).
set.seed(2020)
## Generate possible covariates
sample_covariates_df <- data.frame(x = c(1,0))
beta0 <- -2
beta1 <- 1.5
M <- 100
data <- simulate_bp(K = M,
inf_params = c(beta0, beta1),
sample_covariates_df = sample_covariates_df,
covariate_names = "x",
max_size = 50)
data %>% head(., 10) %>%
kable(type = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
cluster_id | person_id | gen | inf_id | n_inf | cluster_size | x | censored |
---|---|---|---|---|---|---|---|
1 | C1-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
2 | C2-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
3 | C3-G1-N1 | 1 | NA | NA | 1 | 1 | FALSE |
4 | C4-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
5 | C5-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
6 | C6-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
7 | C7-G1-N1 | 1 | NA | 1 | 2 | 0 | FALSE |
7 | C7-G2-N1 | 2 | C7-G1-N1 | NA | 2 | 0 | FALSE |
8 | C8-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
9 | C9-G1-N1 | 1 | NA | NA | 1 | 0 | FALSE |
Although our data above are true transmission trees (i.e. we can trace a direct path of infections), generally the infector ID is unknown and hence the data is a group of individuals where the transmission tree is not known.
We estimate the parameters \(\beta_0\) and \(\beta_1\) by summing over all possible transmission trees for each cluster, which includes the number of individuals in each cluster and their covariates.
The likelihood for a given transmission tree \(T\) with \(n\) individuals is
\[ L(\beta; T) = \prod_{i=1}^n (1-p_i)p_i^{N_i} \] and the total likelihood of a single cluster \(C\) over all trees of size \(|C|=n\) is \[ \mathcal{L}( \beta; C) = \sum_{T \in \mathcal{T}_C} L(\beta;T). \]
We instead maximize the approximate likelihood, which is obtained from sampling transmission trees from \(\mathcal{T}_n\). That is, we approximate \(\frac{\mathcal{L}(C, \beta)}{|\mathcal{T}_C|}\) averaging the likelihood from Monte Carlo sampled trees \(T_1, \dots, T_K\),
\[ \hat{\mathcal{L}}_K(\beta; C) = \frac{1}{K} \sum_{i=1}^K L(\beta, T_i). \]
The estimates for the parameters are then found by maximizing the likelihood over the approximate likelihood over all clusters \(C_1, \dots, C_M\) \[ \left (\hat{\beta} \right)^{(MC)} = \arg \max_{\beta} \prod_{m=1}^M \hat{\mathcal{L}}_K(\beta; C_m). \]
If we want to sample \(B\) transmission trees for each cluster in our data, we use the function sample_mc_trees
. Here, B
is the number of MC samples. Typically this number is \(\ge5000\), but we use 10 below for demonstrative purposes.
mc_trees <- sample_mc_trees(observed_data = data, B = 10,
covariate_names = "x")
mc_trees %>% head(., 4) %>%
kable(type = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
cluster_size | gen | n_in_gen | id | n_inf | cluster_id | x | orig_id | inf_id |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1-1 | 0 | n1-g1-1 | 0 | 1 | NA |
1 | 1 | 1 | 1-1 | 0 | n1-g1-2 | 0 | 1 | NA |
1 | 1 | 1 | 1-1 | 0 | n1-g1-3 | 0 | 1 | NA |
1 | 1 | 1 | 1-1 | 0 | n1-g1-4 | 0 | 1 | NA |
The samples are linked to the original cluster by the variable orig_id
.
Once the MC trees are sampled, it is possible to compute the log likelihood using the function general_loglike()
<- general_loglike(inf_params = c(-2, 1),
loglike mc_trees = mc_trees,
return_neg = FALSE,
cov_names = "x")
print(loglike)
1] -107.5961 [