generalized method to convert raw agent based data to aggregate data

This function converts data on an agent-based level (1 row = 1 agent) relative when an agent is in each state and aggregates it, so that the user can know how many agents are in each state at a given time point (integer based). This function can take standard data.frames and grouped_df data.frames (from dplyr). For the later, this function aggregates within grouping parameters and also provides the columns associated with the grouping.

agents_to_aggregate(
  agents,
  states,
  death = NULL,
  birth = NULL,
  min_max_time = c(0, NA),
  integer_time_expansion = TRUE
)

Arguments

agents	data frame style object (currently either of class `data.frame` or `grouped_df`)
states	time entered state. Do not include column for original state. These need to be ordered, for example: for an SIR model, with columns "`tI`" and "`tR`" expressing the time the individual became infected and recovered (respectively), we want "`states = c("tI", "tR")`". See details for more information.
death	string for column with death time information (default `NULL`)
birth	string for column with birth time information (default `NULL`)
min_max_time	vector (length 2) of minimum and maximum integer time, the second value can be `NA` - and if so, we estimate maximum time from the data. Note that if this is done relative to a `grouped_df` then `NA` means we take the maximum from all groups and give that to each subgroup.
integer_time_expansion	boolean if every integer time point in the range of `min_max_time` should be presented in the aggregation output. If `FALSE` (default is `TRUE`), then lines will only include those time points where

Value

dataset with aggregated information, we label classes X{i} for i in 0:(length(states)). Potentially calculated per group of a grouped_df (and retains grouping columns).

Details

D.1. What each column should have (NAs, orderings, births & deaths,...)

The parameters state, death, birth, and min_max_time provide the user with the flexibility to capture any potential structure related to agent's progression to through the epidemic (and life).

As mentioned in the states parameter details, we expect a set of column names X1, X2, ..., XK that contain information on when an individual enters each state. Also mentioned in the parameter details is that the function assumes that each agent is in the initial state X0 until X1 (except if min_max_time[1] >= X1, which means the agent starts out at state X1).

This function expects transition in an ordered fashion, i.e. X(I+1) >= X(I), but does allow agents to jump states. This can either be recorded with a value at the jumped state the same as the next non-jumped state or an NA (and the authors of this package believe this is a cleaner approach - and matches expectation in birth and death).

Specifically, birth and death can contain NA values, which the function interprets as an individual not being born (or dying respectively) in the given time interval.

The time interval (defined by min_max_time) can be moved, which abstractly just shifts the rows (or time points) the user gets at the end.

D.2. Changing time points

Beyond defining the time interval with min_max_time, if a user wishes to have more minute (smaller) time steps than integers, we recommend they just multiple all values by \(1/s\) where \(s\) is the length of the desired time steps. A transformation of the output's t column by \(s\) would get the time back to the standard time.

Examples


###
# for standard data.frame objects (agents_to_aggregate.grouped_df)
###
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
agents <- EpiCompare::hagelloch_raw
# making babies
set.seed(5)
babies <- sample(nrow(agents),size = 5)
agents$tBIRTH <- NA
agents$tBIRTH[babies] <- agents$tI[babies] - 5

aggregate_b <- agents_to_aggregate(agents, states = c(tI, tR),
                                   death = NULL, birth = tBIRTH)

# looking at when babies where born:
agents %>% dplyr::filter(!is.na(.data$tBIRTH)) %>%
  dplyr::pull(.data$tBIRTH) %>% ceiling() %>% sort
#> [1] 23 26 29 29 32
# vs:
data.frame(counts = 1:nrow(aggregate_b),
           num_people = aggregate_b %>% select(-t) %>% apply(1, sum))
#>    counts num_people
#> 1       1        183
#> 2       2        183
#> 3       3        183
#> 4       4        183
#> 5       5        183
#> 6       6        183
#> 7       7        183
#> 8       8        183
#> 9       9        183
#> 10     10        183
#> 11     11        183
#> 12     12        183
#> 13     13        183
#> 14     14        183
#> 15     15        183
#> 16     16        183
#> 17     17        183
#> 18     18        183
#> 19     19        183
#> 20     20        183
#> 21     21        183
#> 22     22        183
#> 23     23        183
#> 24     24        184
#> 25     25        184
#> 26     26        184
#> 27     27        185
#> 28     28        185
#> 29     29        185
#> 30     30        187
#> 31     31        187
#> 32     32        187
#> 33     33        188
#> 34     34        188
#> 35     35        188
#> 36     36        188
#> 37     37        188
#> 38     38        188
#> 39     39        188
#> 40     40        188
#> 41     41        188
#> 42     42        188
#> 43     43        188
#> 44     44        188
#> 45     45        188
#> 46     46        188
#> 47     47        188
#> 48     48        188
#> 49     49        188
#> 50     50        188
#> 51     51        188
#> 52     52        188
#> 53     53        188
#> 54     54        188
#> 55     55        188
#> 56     56        188
#> 57     57        188
#> 58     58        188
#> 59     59        188
#> 60     60        188
#> 61     61        188
#> 62     62        188
#> 63     63        188
#> 64     64        188
#> 65     65        188
#> 66     66        188
#> 67     67        188
#> 68     68        188
#> 69     69        188
#> 70     70        188
#> 71     71        188
#> 72     72        188
#> 73     73        188
#> 74     74        188
#> 75     75        188
#> 76     76        188
#> 77     77        188
#> 78     78        188
#> 79     79        188
#> 80     80        188
#> 81     81        188
#> 82     82        188
#> 83     83        188
#> 84     84        188
#> 85     85        188
#> 86     86        188
#> 87     87        188
#> 88     88        188
#> 89     89        188
#> 90     90        188
#> 91     91        188
#> 92     92        188
#> 93     93        188
#> 94     94        188


# including death
aggregate_d <- agents_to_aggregate(agents, states = c(tI, tR),
                                   death = tDEAD, birth = NULL)

# looking at when people died:
agents %>% dplyr::filter(!is.na(.data$tDEAD)) %>%
  dplyr::pull(.data$tDEAD) %>% ceiling() %>% sort
#>  [1] 20 41 44 44 44 45 46 47 47 47 49 60
# vs:
data.frame(counts = 1:nrow(aggregate_d),
           num_people = aggregate_d %>% select(-t) %>% apply(1, sum))
#>    counts num_people
#> 1       1        188
#> 2       2        188
#> 3       3        188
#> 4       4        188
#> 5       5        188
#> 6       6        188
#> 7       7        188
#> 8       8        188
#> 9       9        188
#> 10     10        188
#> 11     11        188
#> 12     12        188
#> 13     13        188
#> 14     14        188
#> 15     15        188
#> 16     16        188
#> 17     17        188
#> 18     18        188
#> 19     19        188
#> 20     20        188
#> 21     21        187
#> 22     22        187
#> 23     23        187
#> 24     24        187
#> 25     25        187
#> 26     26        187
#> 27     27        187
#> 28     28        187
#> 29     29        187
#> 30     30        187
#> 31     31        187
#> 32     32        187
#> 33     33        187
#> 34     34        187
#> 35     35        187
#> 36     36        187
#> 37     37        187
#> 38     38        187
#> 39     39        187
#> 40     40        187
#> 41     41        187
#> 42     42        186
#> 43     43        186
#> 44     44        186
#> 45     45        183
#> 46     46        182
#> 47     47        181
#> 48     48        178
#> 49     49        178
#> 50     50        177
#> 51     51        177
#> 52     52        177
#> 53     53        177
#> 54     54        177
#> 55     55        177
#> 56     56        177
#> 57     57        177
#> 58     58        177
#> 59     59        177
#> 60     60        177
#> 61     61        176
#> 62     62        176
#> 63     63        176
#> 64     64        176
#> 65     65        176
#> 66     66        176
#> 67     67        176
#> 68     68        176
#> 69     69        176
#> 70     70        176
#> 71     71        176
#> 72     72        176
#> 73     73        176
#> 74     74        176
#> 75     75        176
#> 76     76        176
#> 77     77        176
#> 78     78        176
#> 79     79        176
#> 80     80        176
#> 81     81        176
#> 82     82        176
#> 83     83        176
#> 84     84        176
#> 85     85        176
#> 86     86        176
#> 87     87        176
#> 88     88        176
#> 89     89        176
#> 90     90        176
#> 91     91        176
#> 92     92        176
#> 93     93        176
#> 94     94        176

###
# for grouped_df objects (agents_to_aggregate.grouped_df)
###


max_time <- 100
agents_g <- hagelloch_raw %>%
  filter(SEX %in% c("female", "male")) %>% group_by(SEX)
sir_group <- agents_to_aggregate(agents_g, states = c(tI, tR),
                                 min_max_time = c(0, max_time))
agents <- agents_g %>%
  filter(SEX == "female") %>% ungroup()
sir_group1 <- agents_to_aggregate(agents, states = c(tI, tR),
                                 min_max_time = c(0, max_time))
sir_group_1 <- sir_group %>% filter(SEX == "female")
assertthat::are_equal(sir_group1,
                      sir_group_1 %>% ungroup %>% select(t, X0, X1, X2))
#> [1] TRUE