When is a 'cluster' a real cluster?

As of the 23rd May 2022 this website is archived and will receive no further updates.

understandinguncertainty.org was produced by the Winton programme for the public understanding of risk based in the Statistical Laboratory in the University of Cambridge. The aim was to help improve the way that uncertainty and risk are discussed in society, and show how probability and statistics can be both useful and entertaining.

Six cyclists were killed in London within a 2-week period, between 5th and 13th November 2013. Should we be surprised that such a cluster happened at some point in recent history?

In a paper in Significance, Aberdein and Spiegelhalter show that it is reasonable to assume that, over the 8-year period between 2005 and 2012, cycle deaths in London occurred as a 'Poisson process', and that the expected number of deaths in any 2-week period is 108/208 = 0.57, which we shall denote by $m$. This means that the probability of exactly $x$ deaths in a single 2-week period has a Poisson form
$$ f_x = \frac{e^{-m} m^x}{x!},$$
and cumulative distribution function
$$ F_x = \sum_{i=0}^{i=x} \frac{e^{-m} m^i}{i!}.$$

Setting $m=108/208=0.57$, the probability of getting at least 6 deaths in a particular 2-week period is $1 - F_5 = 0.00003$, or 1-in-35,000. If we consider the whole 8 years, there are 208 disjoint 2-week periods, and so the chance of seeing at least 6 deaths in at least one of these periods is 1 - (the chance of seeing 5 or fewer in all 208 periods) $= 1 - F_5^{208} = 0.006$, or 1-in-168.
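These two tail probabilities are easy to check numerically. What follows is a minimal Python sketch using only the standard library. The helper names `poisson_pmf` and `poisson_cdf` are illustrative, and the value $m = 0.57$ is taken as quoted in the text; note that $108/208$ itself evaluates to about $0.52$, so the output depends on which value is fed in.

```python
import math


def poisson_pmf(x, m):
    """f_x: probability of exactly x events when the mean count is m."""
    if x < 0:
        return 0.0
    return math.exp(-m) * m**x / math.factorial(x)


def poisson_cdf(x, m):
    """F_x: probability of at most x events (an empty sum, i.e. 0, when x < 0)."""
    return sum(poisson_pmf(i, m) for i in range(x + 1))


m = 0.57        # mean deaths per 2-week period, as quoted in the text
periods = 208   # disjoint 2-week periods in 8 years

# At least 6 deaths in one particular 2-week period.
p_single = 1 - poisson_cdf(5, m)

# At least 6 deaths in at least one of the 208 disjoint periods.
p_any = 1 - poisson_cdf(5, m) ** periods

print(f"single period: {p_single:.6f} (about 1 in {1 / p_single:,.0f})")
print(f"any of {periods} periods: {p_any:.4f} (about 1 in {1 / p_any:,.0f})")
```

Changing `m` in the script shows how strongly both probabilities depend on the assumed fortnightly rate.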

However, the disjoint periods have arbitrary start dates, and a cluster can straddle two of them. What we need is the chance of getting at least 6 deaths in any moving 2-week window, and this requires the use of so-called 'scan statistics'. The exact distribution theory is extremely tricky, but fortunately Naus (1982) has provided some useful, accurate approximations.

Let $X$ be the maximum count in any moving window, where the number of occurrences within each window has a Poisson($m$) distribution, and the overall length of time of interest comprises $L$ disjoint windows; in our case $m \approx 0.57$ and $L=208$. Let $P_n(L) = P(X < n \mid L)$ be the probability of $X$ being less than $n$ given $L$. Naus shows the following exact identities (assuming $F_i = 0$ if $i < 0$).
$$ P_n(2) = F^2_{n-1} - (n-1)f_nf_{n-2} - (n-1-m)f_nF_{n-3}, $$

$$ P_n(3) = F^3_{n-1} -A_1 + A_2 +A_3-A_4, $$
where
\begin{eqnarray*}
A_1 & = & 2 f_n F_{n-1} ((n-1) F_{n-2} - m F_{n-3}) \\
A_2 & = &0.5 f^2_n ((n-1) (n-2) F_{n-3} - 2 (n-2) m F_{n-4} + m^2 F_{n-5}) \\
A_3 & = & \sum_{i=1}^{n-1} f_{2n-i} F^2_{i-1} \\
A_4 & = & \sum_{i=1}^{n-2} f_{2n-i} f_i ((i-1) F_{i-2} - mF_{i-3})
\end{eqnarray*}

In the case of the cyclists we find $P_6(2) = 0.99983$ and $P_6(3) = 0.99971$, and so the probability of getting at least 6 deaths in any 2-week window over 4 weeks ($L=2$) is $1 - P_6(2) = 0.00017$, while the probability of at least 6 deaths in any 2-week window over 6 weeks ($L=3$) is $1 - P_6(3) = 0.00029$.
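The identities for $P_n(2)$ and $P_n(3)$ can be transcribed almost term by term. The sketch below does this in Python, following the formulae exactly as written above, including the summation limits for $A_3$ and $A_4$. The function names `f`, `F`, `naus_p2` and `naus_p3` are illustrative choices, and $m = 0.57$ is an assumed input taken from the rate stated earlier. The results are sensitive to $m$, so the printed figures need not agree with the quoted ones to the last decimal place.

```python
import math


def f(x, m):
    """Poisson probability of exactly x events; taken as 0 for x < 0."""
    return math.exp(-m) * m**x / math.factorial(x) if x >= 0 else 0.0


def F(x, m):
    """Poisson probability of at most x events; an empty sum (0) for x < 0."""
    return sum(f(i, m) for i in range(x + 1))


def naus_p2(n, m):
    """P_n(2): chance that no moving window holds n or more events when L = 2."""
    return (F(n - 1, m) ** 2
            - (n - 1) * f(n, m) * f(n - 2, m)
            - (n - 1 - m) * f(n, m) * F(n - 3, m))


def naus_p3(n, m):
    """P_n(3): the same probability when L = 3, built from A_1 to A_4."""
    a1 = 2 * f(n, m) * F(n - 1, m) * ((n - 1) * F(n - 2, m) - m * F(n - 3, m))
    a2 = 0.5 * f(n, m) ** 2 * ((n - 1) * (n - 2) * F(n - 3, m)
                               - 2 * (n - 2) * m * F(n - 4, m)
                               + m ** 2 * F(n - 5, m))
    # A_3 sums over i = 1, ..., n-1.
    a3 = sum(f(2 * n - i, m) * F(i - 1, m) ** 2 for i in range(1, n))
    # A_4 sums over i = 1, ..., n-2, as written in the text.
    a4 = sum(f(2 * n - i, m) * f(i, m)
             * ((i - 1) * F(i - 2, m) - m * F(i - 3, m))
             for i in range(1, n - 1))
    return F(n - 1, m) ** 3 - a1 + a2 + a3 - a4


m = 0.57  # assumed fortnightly rate, as quoted earlier in the text
print(f"P_6(2) = {naus_p2(6, m):.5f}")
print(f"P_6(3) = {naus_p3(6, m):.5f}")
```

Keeping `f` and `F` as plain functions that return 0 for negative arguments mirrors the convention $F_i = 0$ for $i < 0$, so the low-index terms drop out without special cases.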

Naus then argues for the approximation
$$ P_n(L) \approx P_n(2) \left[ \frac{P_n(3)}{P_n(2)} \right]^{L-2} .$$

Using the formulae above, we obtain $P_6(208) \approx 0.99983 \left[ \frac{0.99971}{0.99983} \right]^{206} = 0.9761$, and so the chance of getting at least 6 deaths in any 2-week window over 8 years is estimated to be 2.4%. Whatever its cause, this is therefore a surprising cluster.
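The extrapolation step itself is a one-liner, sketched here in Python. The function name `naus_extrapolate` is hypothetical, and the inputs are the rounded values of $P_6(2)$ and $P_6(3)$ quoted in the text. Because the ratio is raised to the 206th power, rounding in the fifth decimal place of the inputs can move the result in the third decimal place, so the output from rounded inputs should only be expected to agree approximately with the quoted figure.

```python
def naus_extrapolate(p2, p3, L):
    """Approximate P_n(L) from P_n(2) and P_n(3): p2 * (p3 / p2) ** (L - 2)."""
    return p2 * (p3 / p2) ** (L - 2)


p2, p3 = 0.99983, 0.99971   # rounded values quoted in the text
L = 208                     # number of disjoint 2-week windows in 8 years

p_no_cluster = naus_extrapolate(p2, p3, L)
print(f"P_6({L}) is approximately {p_no_cluster:.4f}")
print(f"chance of at least one 6-death window: {1 - p_no_cluster:.1%}")
```

By construction the function returns `p2` at `L = 2` and `p3` at `L = 3`, which is a quick sanity check on any implementation.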

References

Naus JI (1982) Approximations for distributions of scan statistics. Journal of the American Statistical Association, 77, 177-183.

Comments

Dear Mr. Spiegelhalter,

First of all, thank you very much for this interesting article; I've greatly enjoyed reading it. I have tried to reproduce your result using Naus' approximation formulae, but I get different values. The first thing I noticed: shouldn't $108/208$ be $0.52$ rather than $0.57$? Using $n = 6$ and $m = 0.6$, I got a value of $0.9997658$ for $P_{6}(2)$ and $0.9995738$ for $P_{6}(3)$. So my final probability is $3.9\%$ (using $L = 208$). I re-checked my implementation in R but can't see where I made a mistake.

CS