/sir-hawkes

SIR-Hawkes: Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations

Primary LanguageJupyter Notebook

SIR-Hawkes: Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations

This repository contains:

  • Scripts for SIR-Hawkes project: simulation, modeling with SIR and HawkesN.
  • Three datasets consist of tweet cascades.
  • A hands-on tutorial to walk you through some main components of the project: simulation, modeling and population size distribution.

Citation

The project was introduced in this paper:

Marian-Andrei Rizoiu, Swapnil Mishra, Quyu Kong, Mark Carman, Lexing Xie. 2018. SIR-Hawkes: 
Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations. . 
In WWW 2018: The 2018 Web Conference, April 23–27, 2018, Lyon, France. ACM, New York, NY, 
USA, 10 pages. https://doi.org/10.1145/3178876.3186108

License

Both dataset and code are distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, a copy of which can be obtained following this link. If you require a different license, please contact us at Marian-Andrei@rizoiu.eu or Lexing.Xie@anu.edu.au.

SIR-Hawkes tutorial

required packages:

- nloptr
- parallel
- data.table

1. Preliminary

We need to first load all required packages for simulation and modeling cascades.

library(parallel)
source('scripts/functions-SIR-HawkesN.R')
source('scripts/functions-size-distribution.R')

2. Stochastic R simulation

We then simulate 20 stochastic SIR realizations. In this step, we chose a set of parameters ($N = 1300, I_0 = 300, \gamma = 0.2, \beta = 1$) for simulation. Given those simulated events, we are going to fit them with both SIR model and our proposed HawkesN model to see their modeling performance.

params.S <- c(N = 1300, I.0 = 300, gamma = 0.2, beta = 1)
nsim <- 20
simdat <- replicate(
    n = nsim,
    generate.stochastic.sir(params = params.S, Tmax = 11, hide.output = T)    
)

Let's take a look at the simulated data (only the first 20 events of the first simulation were shown).

One simulation is identified as four components: - relative times - susceptible population size at each time - infected population size at each time - recovered population size at each time

as.data.frame(simdat[,1])[1:20,]
timeSIRC
0.0000000001000 300 0 300
0.001861149 999 301 0 301
0.002054043 998 302 0 302
0.002297054 997 303 0 303
0.003679539 996 304 0 304
0.005308822 995 305 0 305
0.010011856 995 304 1 305
0.013833165 995 303 2 305
0.015425608 994 304 2 306
0.018386120 993 305 2 307
0.018890458 992 306 2 308
0.019177649 992 305 3 308
0.032191757 991 306 3 309
0.035561494 990 307 3 310
0.037340979 989 308 3 311
0.038834789 988 309 3 312
0.041421224 987 310 3 313
0.041912680 986 311 3 314
0.043324751 985 312 3 315
0.051393662 984 313 3 316

3. Fit Stochastic SIR on simulated cascades

We fit stochastic SIR in the following steps:

  • Choose a starting point for all parameters.
  • Apply LBFGS algorithm for optimizing the likelihood function of SIR model (This step might take quite a lot of time).
# initial fitting point for each execution
params.fit.start <- c(N = 0.1, I.0 = 0.1, gamma = 0.1, beta = 0.1)

.cl <- makeCluster(spec = min(nsim, detectCores()), type = 'FORK')
results <- parSapply(cl = .cl, X = 1:nsim, FUN = function(i) {
    mysim <- as.data.frame(simdat[, i])
    return(fit.stochastic.sir(mysim, params.fit.start))
})
stopCluster(.cl)

# reconstruct result data format
res <- as.data.frame(results[1,])
names(res) <- 1:nsim             
res <- as.data.frame(t(res))     
res$ll <- unlist(results[2,])
complete_res <- res

In the following block, we show how well parameters were retreived by our fitting process. Generally, the median values are closed to the theoretical values with a small standard deviation.

# let's see how well parameters were retreived
prnt <- rbind(params.S[c('N', 'I.0', 'gamma', 'beta')], 
              apply(X = complete_res[, c('N', 'I.0', 'gamma', 'beta')], MARGIN = 2, FUN = median),
              apply(X = complete_res[, c('N', 'I.0', 'gamma', 'beta')], MARGIN = 2, FUN = sd))
rownames(prnt) <- c('theoretical', 'median', 'sd')
print(prnt[, c('N', 'I.0', 'gamma', 'beta')], digits = 2)
                 N I.0 gamma  beta
theoretical 1300.0 300 0.200 1.000
median      1283.5 300 0.200 0.986
sd             3.6   0 0.006 0.026

4. Fit HawkesN on simulated cascades

We need first to pull out the infective events for HawkesN model.

# get the means at given time points, to be able to compare to deterministic
simhistory <- sapply(X = 1:nsim, FUN = function(i) {
  history.S <- SIR2HAWKES.stochastic.event.series(state = simdat[,i])  
})

We model HawkesN on the simulated data following same steps as modeling SIR:

  • Choose a starting point for all parameters.
  • Apply LBFGS algorithm for optimizing the likelihood function of HawkesN model (this step might take quite a lot of time).
# start point 
params.fit.start <- c(K = 1, c = 0.1, theta = 0.1, N = 1000)

# fit the event series with HawkesN
.cl <- makeCluster(spec = min(20, detectCores()), type = 'FORK')
results <- parSapply(cl = .cl, X = 1:nsim, FUN = function(i) {
  history.S <- as.data.frame(simhistory[,i])
  fitted.model <- fitSeries(history = history.S, params.fit.start)
})
stopCluster(.cl)
res <- as.data.frame(results['par',])
names(res) <- 1:nsim
res <- data.frame(t(res))

We compare the fitted HawkesN parameters with theoretical HawkesN parameters.

# these are the theoretical parameters
params.H <- c(K = 5, c = 0.001, theta = 0.2, N = 1300)

prnt <- rbind(params.H, 
              apply(X = res, MARGIN = 2, FUN = median, na.rm = T),
              apply(X = res, MARGIN = 2, FUN = sd, na.rm = T))
rownames(prnt) <- c('theoretical', 'median', 'sd')
print(prnt[, c('K', 'theta', 'c', 'N')], digits = 2)
              K theta     c    N
theoretical 5.0  0.20 0.001 1300
median      5.5  0.18 0.100 1292
sd          2.1  8.52 0.000   11

We then compare the fitted HawkesN parameters with theoretical SIR parameters. Theorem 3.1 in our paper reveals the link between HawkesN and SIR models that:

Suppose the new infections in a stochastic SIR process of parameters ${\beta, \gamma, N}$ follow a point process of intensity $\lambda^I(t)$. Suppose also the events in a HawkesN process with parameters ${\mu, \kappa, \theta, N}$ have the intensity $\lambda^H(t)$. Let $\mathcal{T} = {\tau_1, \tau_2, \ldots}$ be the set of the times to recovery of the infected individuals in SIR. The expectation of $\lambda^I(t)$ over $\mathcal{T}$ is equal $\lambda^H(t)$: $E_\mathcal{T}[ \lambda^I(t)] = \lambda^H(t)$, when $\mu = 0$, $\beta = \kappa \theta$, $\gamma = \theta$.

Given this theorem, we are able to convert HawkesN parameters into SIR parameters and compare them with theoretical SIR parameters.

res$gamma <- res$theta
res$beta <- res$K * res$theta
prnt <- rbind(params.S[c('N', 'gamma', 'beta')], 
              apply(X = res[, c('N', 'gamma', 'beta')], MARGIN = 2, FUN = mean, na.rm = T),
              apply(X = res[, c('N', 'gamma', 'beta')], MARGIN = 2, FUN = sd, na.rm = T))
rownames(prnt) <- c('theoretical', 'median', 'sd')
print(prnt[, c('N', 'gamma', 'beta')], digits = 2)
               N gamma beta
theoretical 1300   0.2  1.0
median      1295   3.5  6.4
sd            11   8.5 13.7

5. Plotting size distribution

In this section, we will study the probability distribution of population size for a given set of parameters.

# theoretical parameters shown previously
params.S <- c(N = 1300, I.0 = 300, gamma = 0.2, beta = 1)
params.H <- c(K = 5, c = 0.001, theta = 0.2, N = 1300)

Let's calculate the transition matrix and the states mapping matrix for an SIR model with the given parameters.

.transition <- construct.transition.matrix.and.states(params = params.S)

Then we construct the event history in required format for HawkesN and we choose the first cascade from our simulated data.

seen_perc <- 0.5
history <- as.data.frame(simhistory[,1])
seenEvents <- round(nrow(history) * seen_perc)

Then we get two size distributions for two scenarios:

  • when only 1 event is observed (apriori probability size distribution)
  • when half of events are observed (aposteriori probability size distribution)
# compute size probs at event 1 and current event
size.est.at.zero <- get.size.distribution(params = params.H, .transition = .transition)
size.est.at.end <- get.size.distribution(params = params.H, .transition = .transition, history = history[seq(seenEvents),])

We then plot both apriori probability size distribution and aposteriori probability size distribution.

# plot our cascade
matplot(cbind(size.est.at.zero$final.state, size.est.at.end$final.state), 
        col = c('black', 'blue'), type = 'l', log = 'y', lty = c(1, 1), lwd = 3,
        xlab = 'Cascade final size', ylab = 'Probability',
        main = sprintf('Probability distribution of cascade final size\nfitted on %d seen events (N = %.2f)\n(seen %%: %.2f)', 
                       seenEvents, params.H['N'], seen_perc) )
abline(v = seenEvents, lty = 3, col = 'gray40')
abline(v = size.est.at.end$theo.mean, lty = 2, col = 'darkmagenta')
abline(v = nrow(history), lty = 1, col = 'red')
legend('bottomleft', 
        legend = c('Stoch. size distribution (apriori)', 
                   'Stoch. size distribution (aposteriori)',
                   sprintf('Observed events (%d)', seenEvents), 
                   sprintf('Deterministic size (%.2f)', size.est.at.zero$theo.mean), 
                   sprintf('Observed final size (%d)', nrow(history)) ), 
        lty = c(1, 1, 3, 2, 1), lwd = c(3, 3, 1, 1, 1), col = c('black', 'blue', 'gray40', 'darkmagenta', 'red'), bty = 'n')
Warning message in xy.coords(x, y, xlabel, ylabel, log = log):
“642 y values <= 0 omitted from logarithmic plot”Warning message in xy.coords(x, y, xlabel, ylabel, log):
“1 y value <= 0 omitted from logarithmic plot”

png

Several observations (Sec 6.3 in our paper):

  • The apriori probability size distribution shows two maxima. This provides the following explanation for the general perceived unpredictability of online popularity. For cascades showing a bi-modal apriori size distribution, there are two likely outcomes: either it dies out early or it reaches a large size compared to the maximum population $N$. At time $t = 0$ is it impossible to di erentiate between the two outcomes.
  • The aposteriori probability distribution reflects the information gained from the observed events and it shows a single maximum towards the higher size values. The more events we observe, the higher the likelihood of the true value of cascade size.