When is a 'cluster' a real cluster?

Six cyclists were killed in London within a 2-week period, between 5th and 13th November 2013. Should we be surprised that this happened at some point in recent history?

In a paper in Significance, Aberdein and Spiegelhalter show that it is reasonable to assume that, over the 8-year period between 2005 and 2012, cycle deaths in London occurred as a 'Poisson process', and that the expected number of deaths in any 2-week period is 108/208 = 0.57, which we shall denote by $m$. This means that the probability of exactly $x$ deaths in a single 2-week period has a Poisson form
$$ f_x = \frac{e^{-m} m^x}{x!},$$
and cumulative distribution function
$$ F_x = \sum_{i=0}^{i=x} \frac{e^{-m} m^i}{i!}.$$

Setting $m=108/208=0.57$, the probability of getting at least 6 deaths in a particular 2-week period is $1 - F_5 = 0.00003$, or 1-in-35,000. If we consider the whole 8 years, there are 208 disjoint 2-week periods, and so the chance of seeing at least 6 deaths in at least one of these periods is 1 - (the chance of seeing 5 or fewer in all 208 periods) $= 1 - F_5^{208} = 0.006$, or 1-in-168.
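The two calculations above can be sketched in a few lines of Python. This is only an illustration: the helper names `poisson_pmf` and `poisson_cdf` are made up for this example, and the rate is assumed to be the rounded value $m = 0.57$ quoted in the text, so the last digits may differ slightly from the figures given above.

```python
import math


def poisson_pmf(x, m):
    """f_x = exp(-m) * m^x / x!, taken to be 0 for x < 0."""
    if x < 0:
        return 0.0
    return math.exp(-m) * m**x / math.factorial(x)


def poisson_cdf(x, m):
    """F_x = f_0 + ... + f_x, which is 0 for x < 0 (empty sum)."""
    return sum(poisson_pmf(i, m) for i in range(x + 1))


m = 0.57         # assumption: the rounded rate quoted in the text
n_periods = 208  # disjoint 2-week periods in 8 years

# P(at least 6 deaths in one particular 2-week period) = 1 - F_5
p_single = 1.0 - poisson_cdf(5, m)

# P(at least 6 deaths in at least one of the disjoint periods) = 1 - F_5^208
p_any_disjoint = 1.0 - poisson_cdf(5, m) ** n_periods

print(f"one particular period: {p_single:.6f} (about 1 in {1 / p_single:,.0f})")
print(f"any of {n_periods} disjoint periods: {p_any_disjoint:.4f} "
      f"(about 1 in {1 / p_any_disjoint:,.0f})")
```

Changing `m` shows how strongly both probabilities depend on the assumed rate.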

However, what we need is the chance of getting at least 6 deaths in any moving 2-week window, not just in one of a fixed set of disjoint periods, since a cluster can straddle the boundary between two adjacent periods. This requires the use of so-called 'scan statistics'. The exact distribution theory is extremely tricky, but fortunately Naus (1982) has provided some accurate approximations.

Let $X$ be the maximum count in any moving window, where the number of occurrences within each window has a Poisson($m$) distribution, and the overall length of time of interest comprises $L$ disjoint windows; in our case $m \approx 0.6$ and $L=208$. Let $P_n(L) = P(X < n \mid L)$ be the probability of $X$ being less than $n$ given $L$. Naus shows the following exact identities (assuming $F_i = 0$ if $i < 0$).
$$ P_n(2) = F^2_{n-1} - (n-1)f_nf_{n-2} - (n-1-m)f_nF_{n-3}$$

$$ P_n(3) = F^3_{n-1} -A_1 + A_2 +A_3-A_4, $$
where
\begin{eqnarray*}
A_1 & = & 2 f_n F_{n-1} ((n-1) F_{n-2} - m F_{n-3}) \\
A_2 & = & 0.5 f^2_n ((n-1) (n-2) F_{n-3} - 2 (n-2) m F_{n-4} + m^2 F_{n-5}) \\
A_3 & = & \sum_{i=1}^{n-1} f_{2n-i} F^2_{i-1} \\
A_4 & = & \sum_{i=1}^{n-2} f_{2n-i} f_i ((i-1) F_{i-2} - mF_{i-3})
\end{eqnarray*}

In the case of the cyclists we find $P_6(2) = 0.99983$ and $P_6(3) = 0.99971$, and so the probability of getting at least 6 deaths in any 2-week window over 4 weeks ($L=2$) is $1 - P_6(2) = 0.00017$, while the probability of at least 6 deaths in any 2-week window over 6 weeks ($L=3$) is $1 - P_6(3) = 0.00029$.
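For readers who want to evaluate the identities for $P_n(2)$ and $P_n(3)$ themselves, here is a minimal Python sketch that transcribes them term by term as written above. The function names (`f`, `F`, `naus_p2`, `naus_p3`) are invented for this illustration. The rate $m$ is left as a parameter because the text quotes both 0.57 and 0.6, and the results are quite sensitive to that choice.

```python
import math


def f(x, m):
    """Poisson probability f_x, taken to be 0 for x < 0."""
    if x < 0:
        return 0.0
    return math.exp(-m) * m**x / math.factorial(x)


def F(x, m):
    """Poisson cumulative probability F_x, which is 0 for x < 0."""
    return sum(f(i, m) for i in range(x + 1))


def naus_p2(n, m):
    """P_n(2): probability that the maximum moving-window count is < n when L = 2."""
    return (F(n - 1, m) ** 2
            - (n - 1) * f(n, m) * f(n - 2, m)
            - (n - 1 - m) * f(n, m) * F(n - 3, m))


def naus_p3(n, m):
    """P_n(3), using the terms A_1..A_4 with the summation limits as written in the text."""
    a1 = 2 * f(n, m) * F(n - 1, m) * ((n - 1) * F(n - 2, m) - m * F(n - 3, m))
    a2 = 0.5 * f(n, m) ** 2 * ((n - 1) * (n - 2) * F(n - 3, m)
                               - 2 * (n - 2) * m * F(n - 4, m)
                               + m ** 2 * F(n - 5, m))
    # A_3: sum over i = 1, ..., n-1
    a3 = sum(f(2 * n - i, m) * F(i - 1, m) ** 2 for i in range(1, n))
    # A_4: sum over i = 1, ..., n-2
    a4 = sum(f(2 * n - i, m) * f(i, m) * ((i - 1) * F(i - 2, m) - m * F(i - 3, m))
             for i in range(1, n - 1))
    return F(n - 1, m) ** 3 - a1 + a2 + a3 - a4


# Evaluate for n = 6 at both rates quoted in the text.
for rate in (0.57, 0.6):
    print(f"m = {rate}: P_6(2) = {naus_p2(6, rate):.7f}, P_6(3) = {naus_p3(6, rate):.7f}")
```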

Naus then argues for the approximation
$$ P_n(L) \approx P_n(2) \left[ \frac{P_n(3)}{P_n(2)} \right]^{L-2} .$$

Using the formulae above, we obtain $P_6(208) \approx 0.99983 \left[ \frac{0.99971}{0.99983} \right]^{206} = 0.9761$, and so the chance of getting at least 6 deaths in any 2-week window over 8 years is estimated to be 2.4%. For whatever reason, this is therefore a surprising cluster.
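The extrapolation step is simple enough to check directly. The sketch below (the function name `naus_approximation` is made up for this example) plugs in the rounded values of $P_6(2)$ and $P_6(3)$ quoted in the text. Because the ratio is raised to the power 206, rounding in the fifth decimal place of the inputs is enough to shift the last digits of the answer, so a small difference from the figure above is to be expected.

```python
def naus_approximation(p2, p3, L):
    """Approximate P_n(L) by P_n(2) * (P_n(3) / P_n(2)) ** (L - 2)."""
    return p2 * (p3 / p2) ** (L - 2)


# Assumption: inputs are the values quoted in the text, rounded to 5 decimal places.
p2 = 0.99983  # P_6(2)
p3 = 0.99971  # P_6(3)
L = 208       # number of disjoint 2-week windows in 8 years

p_no_cluster = naus_approximation(p2, p3, L)
p_cluster = 1.0 - p_no_cluster

print(f"P_6({L}) is approximately {p_no_cluster:.4f}")
print(f"chance of at least 6 deaths in some 2-week window: {100 * p_cluster:.1f}%")
```

By construction the formula returns $P_n(2)$ at $L=2$ and $P_n(3)$ at $L=3$, which is a quick sanity check on any implementation.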

References

Naus JI (1982) Approximations for distributions of scan statistics. Journal of the American Statistical Association, 77, 177-183.

Comments

COOLSerdash

Dear Mr. Spiegelhalter, first of all, thank you very much for this interesting article; I greatly enjoyed reading it. I have tried to reproduce your result using Naus' approximation formulae, but I get different values. The first thing I noticed: shouldn't $108/208$ be $0.52$ rather than $0.57$? Using $n = 6$ and $m = 0.6$, I got a value of $0.9997658$ for $P_{6}(2)$ and $0.9995738$ for $P_{6}(3)$. So my final probability is $3.9\%$ (using $L = 208$). I re-checked my implementation in R but can't see where I made a mistake. CS