Getting started with `InfectionTrees`

Downloading the package

Currently, InfectionTrees is available at https://github.com/skgallagher/InfectionTrees, and can be installed in R via the following command,

devtools::install_github("skgallagher/InfectionTrees")

library(InfectionTrees)
library(knitr)
library(kableExtra)
library(tidyr)
library(dplyr)
library(data.table)

About

Below is a true quick start to simulate data, sample trees based on that data, and finally compute the log likelihood.

The other vignettes cover:

The Model

Simulations

Tuberculosis in Maryland

Simulating branching processes

We can simulate from a branching process to form a data set of transmission trees or clusters (if we remove infector information from the transmission trees). We simulate \(K\) branching processes where the primary infector in each tree is the root of each transmission tree and the probability of infection of another individual is where \(\bf{X}\) is a covariate matrix and \(\beta\) is a vector of coefficients, \[ p_i = logit^{-1} (\bf{X}_i\bf{\beta}). \] Then the number of new infectious individuals infected by individual \(i\) is \[ N_i \sim geometric(p_i) \]

It is possible to produce transmission trees that do not naturally terminate, so a maximal number of infections is used as an argument. The output is a data frame with \(M\) transmission trees which contains the cluster ID, the person ID within a cluster, the generation of the person (i.e. the length of the path from the primary infector to the person + 1), the infector ID (i.e. who infected this person), the number of infections caused directly by this person if any, the total cluster size, the covariates, and an indicator telling us whether the cluster was artificially terminated or not (i.e. censored = TRUE).

set.seed(2020)
## Generate possible covariates 
sample_covariates_df <- data.frame(x = c(1,0))

beta0 <- -2
beta1 <- 1.5
M <- 100
data <- simulate_bp(K = M,
                                  inf_params = c(beta0, beta1),
                                  sample_covariates_df = sample_covariates_df,
                                  covariate_names = "x",
                                  max_size = 50)

data %>% head(., 10) %>%
  kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

cluster_id	person_id	gen	inf_id	n_inf	cluster_size	x	censored
1	C1-G1-N1	1	NA	NA	1	0	FALSE
2	C2-G1-N1	1	NA	NA	1	0	FALSE
3	C3-G1-N1	1	NA	NA	1	1	FALSE
4	C4-G1-N1	1	NA	NA	1	0	FALSE
5	C5-G1-N1	1	NA	NA	1	0	FALSE
6	C6-G1-N1	1	NA	NA	1	0	FALSE
7	C7-G1-N1	1	NA	1	2	0	FALSE
7	C7-G2-N1	2	C7-G1-N1	NA	2	0	FALSE
8	C8-G1-N1	1	NA	NA	1	0	FALSE
9	C9-G1-N1	1	NA	NA	1	0	FALSE

Estimate the maximum likelihood

Although our data above are true transmission trees (i.e. we can trace a direct path of infections), generally the infector ID is unknown and hence the data is a group of individuals where the transmission tree is not known.

We estimate the parameters \(\beta_0\) and \(\beta_1\) by summing over all possible transmission trees for each cluster, which includes the number of individuals in each cluster and their covariates.

The likelihood for a given transmission tree \(T\) with \(n\) individuals is

\[ L(\beta; T) = \prod_{i=1}^n (1-p_i)p_i^{N_i} \] and the total likelihood of a single cluster \(C\) over all trees of size \(|C|=n\) is \[ \mathcal{L}( \beta; C) = \sum_{T \in \mathcal{T}_C} L(\beta;T). \]

We instead maximize the approximate likelihood, which is obtained from sampling transmission trees from \(\mathcal{T}_n\). That is, we approximate \(\frac{\mathcal{L}(C, \beta)}{|\mathcal{T}_C|}\) averaging the likelihood from Monte Carlo sampled trees \(T_1, \dots, T_K\),

\[ \hat{\mathcal{L}}_K(\beta; C) = \frac{1}{K} \sum_{i=1}^K L(\beta, T_i). \]

The estimates for the parameters are then found by maximizing the likelihood over the approximate likelihood over all clusters \(C_1, \dots, C_M\) \[ \left (\hat{\beta} \right)^{(MC)} = \arg \max_{\beta} \prod_{m=1}^M \hat{\mathcal{L}}_K(\beta; C_m). \]

How to get sampled transmission trees

If we want to sample \(B\) transmission trees for each cluster in our data, we use the function sample_mc_trees. Here, B is the number of MC samples. Typically this number is \(\ge5000\), but we use 10 below for demonstrative purposes.

mc_trees <- sample_mc_trees(observed_data = data, B = 10,
                            covariate_names = "x")

mc_trees %>% head(., 4) %>%
  kable(type = "html") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

cluster_size	gen	n_in_gen	id	cluster_id	orig_id	inf_id
1	1	1	1-1	n1-g1-1	1	NA
1	1	1	1-1	n1-g1-2	1	NA
1	1	1	1-1	n1-g1-3	1	NA
1	1	1	1-1	n1-g1-4	1	NA

The samples are linked to the original cluster by the variable orig_id.

Compute the log likelihood

Once the MC trees are sampled, it is possible to compute the log likelihood using the function general_loglike()