Predicting the Premier League results

As of the 23rd May 2022 this website is archived and will receive no further updates.

understandinguncertainty.org was produced by the Winton programme for the public understanding of risk based in the Statistical Laboratory in the University of Cambridge. The aim was to help improve the way that uncertainty and risk are discussed in society, and show how probability and statistics can be both useful and entertaining.


Here is the spreadsheet showing the way in which my predictions were made. I hope it is comprehensible, at least for enthusiasts! I discussed this on the Today programme the day before the matches.

The statistical method is basically the same as the one we used last year, when we did well and got 9 win/draw/lose results and 2 exact scores correct. For each team we work out an expected number of goals that they will score: this is based on the average for a home or away side (1.69 and 1.09 this season, a strong home advantage of over 50%), adjusted by the 'attack strength' of the team and the 'defence weakness' of their opponents. The expected numbers of goals for the two teams are then taken as the means of two independent Poisson distributions, and the probability of each goal combination is calculated. Adding up the relevant probabilities then gives the assessed chances of a home win, draw, or away win.
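As a rough illustration of the calculation just described, here is a minimal Python sketch. The league averages 1.69 and 1.09 come from the text; the multiplicative form of the adjustment, the function names, the cap of 10 goals per side, and the example strengths are all assumptions made for illustration and are not taken from the spreadsheet.

```python
import math

HOME_AVG = 1.69  # average goals for a home side this season (from the text)
AWAY_AVG = 1.09  # average goals for an away side this season (from the text)


def poisson_pmf(k, mean):
    """Probability of exactly k goals when the expected number is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_goals(league_avg, attack_strength, defence_weakness):
    """League average scaled by the team's attack strength and the opponents'
    defence weakness (a multiplicative adjustment is assumed here)."""
    return league_avg * attack_strength * defence_weakness


def score_matrix(home_mean, away_mean, max_goals=10):
    """matrix[h][a] = P(home scores h, away scores a), assuming independence.
    Scores above max_goals are ignored; their probability is negligible."""
    return [[poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
             for a in range(max_goals + 1)]
            for h in range(max_goals + 1)]


def outcome_probabilities(matrix):
    """Add up the score probabilities for home win, draw and away win."""
    home = draw = away = 0.0
    for h, row in enumerate(matrix):
        for a, p in enumerate(row):
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away


# Hypothetical strengths, chosen only to show how the pieces fit together.
home_mean = expected_goals(HOME_AVG, attack_strength=1.2, defence_weakness=0.9)
away_mean = expected_goals(AWAY_AVG, attack_strength=0.8, defence_weakness=1.1)
print(outcome_probabilities(score_matrix(home_mean, away_mean)))
```

Setting every strength and weakness to 1 gives the probabilities for a match between two exactly average teams, which shows the effect of home advantage on its own.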

Last year we used a very simple model for attack strength and defence weakness, based on the total goals scored and conceded during the season. This year we have allowed the attack strength to depend on whether the team is playing home or away. The easiest way to do this is to consider goals scored home and away entirely independently, but we have 'smoothed' the resulting estimates by giving some weight to away goals when estimating home attack strength. (In formal statistical terms, we are fitting an approximate Poisson regression model with main effects for home/away, team and opposing team, and a mixed effect interaction term.)
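The smoothing idea can be sketched as a weighted average of a team's home and away scoring ratios. This is only one plausible reading: the exact formula, the function name and the weight of 0.33 used below are illustrative assumptions, not a description of the spreadsheet.

```python
def smoothed_home_attack(home_goals, home_games, away_goals, away_games,
                         league_home_avg=1.69, league_away_avg=1.09,
                         away_weight=0.33):
    """Home attack strength with some weight given to away goals.

    Each ratio compares the team's goals per game with the league average
    for that venue, so an exactly average team scores 1.0 on both.
    away_weight=0 treats home and away form as entirely independent;
    away_weight=1 pools the two ratios equally. (Assumed formula.)
    """
    home_ratio = (home_goals / home_games) / league_home_avg
    away_ratio = (away_goals / away_games) / league_away_avg
    return (home_ratio + away_weight * away_ratio) / (1 + away_weight)


# Hypothetical team: 40 goals in 19 home games, 20 goals in 19 away games.
print(smoothed_home_attack(40, 19, 20, 19))
```

The same pattern would apply to away attack strength, with the roles of the home and away ratios swapped.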

A real problem occurs when the 'most likely' exact score is a draw, but the most likely overall result is a win for one team. In this case we have gone for the most likely overall outcome, although this means we have not predicted any draws.
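One way to express this rule in code is to pick the most likely overall outcome first, and then the most likely exact score consistent with that outcome. The sketch below does this; the function names and the example means of 1.4 and 1.1 goals are hypothetical, chosen so that the single most likely score is a 1-1 draw even though a home win is the most likely result.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when the expected number is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def predict(home_mean, away_mean, max_goals=10):
    """Return (outcome, outcome probability, score, score probability).

    The outcome is the most likely of home/draw/away; the score is the
    most likely exact score among those that give that outcome.
    """
    totals = {"home": 0.0, "draw": 0.0, "away": 0.0}
    best = {"home": (0.0, None), "draw": (0.0, None), "away": (0.0, None)}
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            outcome = "home" if h > a else "away" if h < a else "draw"
            totals[outcome] += p
            if p > best[outcome][0]:
                best[outcome] = (p, (h, a))
    outcome = max(totals, key=totals.get)
    score_prob, score = best[outcome]
    return outcome, totals[outcome], score, score_prob


# Hypothetical expected goals: the prediction is a 1-0 home win, not 1-1.
print(predict(1.4, 1.1))
```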

The final predictions are as follows:

The most likely results for each match, with their % probability.

Home         Away         Most likely  Probability of result  Probability of  Actual
                          score        (win/draw/lose)        exact score     result
Arsenal      Fulham       2-0          73%                    15%             4-0
Aston Villa  Blackburn    1-0          69%                    16%             0-1
Bolton       Birmingham   1-0          39%                    11%             2-1
Burnley      Tottenham    0-2          64%                    11%             4-2
Chelsea      Wigan        4-0          96%                    11%             8-0
Everton      Portsmouth   2-0          75%                    14%             1-0
Hull         Liverpool    0-1          61%                    14%             0-0
Man U        Stoke        2-0          80%                    18%             4-0
West Ham     Man C        1-2          55%                    10%             1-1
Wolves       Sunderland   0-1          37%                    15%             2-1

Importantly, by adding up the probabilities we can work out how many predictions we expect to get right: 6.5 results and 1.3 exact scores. Anything more than this is luck!

By multiplying the probabilities we can assess the chance that all the predictions will be correct: this comes to around a 1 in 100 chance that all the results will be right, and around 1 in 700 million chance that all the exact scores will be correct. That's why I don't bet on these predictions.
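The adding and multiplying can be reproduced with a few lines of Python. The probabilities below are the rounded percentages from the table, so the answers only approximately match the figures quoted in the text, which were presumably worked out from unrounded values. Multiplying also assumes that the ten matches are independent of one another.

```python
import math

# Rounded probabilities from the table, in match order.
result_probs = [0.73, 0.69, 0.39, 0.64, 0.96, 0.75, 0.61, 0.80, 0.55, 0.37]
score_probs = [0.15, 0.16, 0.11, 0.11, 0.11, 0.14, 0.14, 0.18, 0.10, 0.15]

# Expected number correct = sum of the individual probabilities.
expected_results = sum(result_probs)  # about 6.5
expected_scores = sum(score_probs)    # about 1.35 from the rounded values

# Chance that every prediction is right = product of the probabilities.
p_all_results = math.prod(result_probs)  # roughly 1 in 110
p_all_scores = math.prod(score_probs)    # of the order of 1 in 600 million

print(f"Expected correct results: {expected_results:.2f}")
print(f"Expected correct scores:  {expected_scores:.2f}")
print(f"All results right: 1 in {1 / p_all_results:,.0f}")
print(f"All scores right:  1 in {1 / p_all_scores:,.0f}")
```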

Added at 6pm on Sunday 9th May

A fairly pathetic result. Only 5 results right and no correct exact scores: fewer than expected, but not incompatible with the probabilities given. Just goes to show that uncertainty does not always play out as desired. Mark Lawrenson for the BBC did better: 6 results and 2 exact scores, so he gets his own back for last year when he only got 7 results and 1 exact score. Oh well, back to the day job.
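To see how compatible such a result is with the stated probabilities, one can work out the full distribution of the number of correct predictions. The sketch below does this for the rounded probabilities in the table, assuming the ten matches are independent; the function name is hypothetical.

```python
def count_distribution(probs):
    """dist[k] = probability that exactly k of the predictions are correct,
    treating each prediction as an independent event."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - p)   # this prediction is wrong
            new[k + 1] += q * p     # this prediction is right
        dist = new
    return dist


# Rounded probabilities from the table, in match order.
result_probs = [0.73, 0.69, 0.39, 0.64, 0.96, 0.75, 0.61, 0.80, 0.55, 0.37]
score_probs = [0.15, 0.16, 0.11, 0.11, 0.11, 0.14, 0.14, 0.18, 0.10, 0.15]

p_five_or_fewer_results = sum(count_distribution(result_probs)[:6])
p_no_exact_scores = count_distribution(score_probs)[0]

print(f"P(5 or fewer results right): {p_five_or_fewer_results:.2f}")
print(f"P(no exact scores right):    {p_no_exact_scores:.2f}")
```

Neither probability is small, which is what "not incompatible" means here.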

Attachment: PREMIER-0910-BBC-DJS.xls (74.5 KB)

Comments

Enjoyed your spot on the Today programme this morning. Have you thought of running a season-long shoot-out with Daniel Finkelstein's "fink tank"? PS Your CAPTCHA is too hard for an old man with failing eyesight.

I think the biggest problem with the model proposed above is the assumed independence of the goals scored by the two teams. What makes football a more exciting sport to watch than tennis or golf, for example, is that once a team concedes a goal (and this in itself can happen for the most crazy and unlikely reasons, e.g. elementary goalkeeping mistakes, own goals etc.) the opposing team is instantly put at a big advantage, being in the position of only having to sit back and defend to win the game. There also seems to be an effect of goals being scored more easily against a side which has already conceded a goal, i.e. the confidence boost to the attacking team and the "shock" effect on the defending team. Thus a better model would be to condition the result on the first goal scored, although this is clearly not easy to predict at all. In tennis or golf, each set or hole respectively is more or less completely independent of the previous one, so, for example, a stroke of luck in getting a hole in one on a particular hole will not significantly help you in winning any major title. Consistency in these sports is then key.

Conditioning the prediction on which side scores the first goal sounds like Bayesian probability to me. Not difficult at all to do. An easier way is to apply the maxim that he who scores first is more likely to win (an alternative would be to say that he who scored the most recent goal wins) and use spread betting to net your gains. I do have concerns about the accuracy of prediction algorithms, having seen claims of accuracy in excess of 50% for academics' models of annual datasets. Repeating the same method on week-by-week data does not result in a regular level of accuracy above 50% on the database system that I have used. However, just predicting a home win for every match results in accuracies of 54% plus. David has pointed out that, with the exception of top-flight clubs, the result of any match owes more to luck than skill. Can anyone confirm whether academic models and predictions are based on calculations of annual results (for goal averages and attack/defence factors)? If that were the case then applying a calculation based on annual factors to any one match would be wrong, and the model would be flawed and inapplicable for week-by-week predictions.

Since first hearing the predictions in 2009 I've played with occasional spreadsheets not dissimilar to yours (perhaps not as colourful!). Googling around a bit recently, there seem to be various assessments suggesting that a negative binomial distribution is a better fit to football events than a Poisson distribution (for example, www.physik.uni-leipzig.de/~janke/CompPhys06/Folien/Nussbaumer.pdf ). Are you going to stick with Poisson for next May, or expand the thinking to an alternative approach?

Hi David, you have computed figures for "home attack advantage" and "defence home advantage" in cols I and S in the strengths spreadsheet, though these do not appear to be used in the formulas that calculate the weighted strengths and weaknesses. Should these figures be used to specify the weight given to away results depending on the teams playing? The weighting of 0.33 appears arbitrary. I cannot see how it is derived. Please forgive me if I have fundamentally misunderstood the spreadsheet, but I would be grateful for an explanation as I am currently trying to develop a similar model. I would also be indebted to anyone who can give a simple explanation of how the negative binomial distribution can be used in forecasting the outcome of a football match. In terms of failure and success, what exactly is being measured? Goals? The result (home, away or draw)? And what probability of a success is used? This too makes little sense to me. Many thanks.

Hello. I've discovered this just today! (13th April 2011) I dream of being able to mathematically predict the best fantasy football 11 going into the week on Yahoo fantasy football. I've been working on a model to predict team scores first, then will move on to players. The problem with the model above is that I don't understand why you apply the blanket home/away advantage from across the league. Why not use the home/away advantage specifically for each team, as this would seem to be more accurate? I used this model to predict next weekend's (16th April 2011) La Liga results and got the following:
Sat, Apr 16 MGA 1-1 MAL
Sat, Apr 16 MAD 1-3 BAR
Sat, Apr 16 ALM 0-2 VAL
Sat, Apr 16 GET 2-2 SEV
Sun, Apr 17 OSA 1-1 BIL
Sun, Apr 17 RSO 1-0 SPO
Sun, Apr 17 DEP 0-0 RAC
Sun, Apr 17 LEV 1-0 HER
Sun, Apr 17 ESP 1-1 AMR
Mon, Apr 18 VIL 1-0 RZG
http://fantasyfootballsuper.blogspot.com

This is a great article for noobs like me. I was trying to access the link (xls) provided but it seems to be dead. Thanks.

Like the content of your article! I have been doing some research lately on calculating fair odds from expected goals, the Poisson distribution and placing value bets, and would like to recommend the following articles (for beginners like myself): footytradingposts.blogspot.co.uk/2012/07/calculating-goal-expectancy.html footytradingposts.blogspot.co.uk/2012/01/poisson-for-dummies.html soccerwidow.com/betting-maths/tutorial/calculation-of-odds-probability-and-deviation/

Dear David, if the prior probabilities come from the Poisson calculations above, could they be updated using Bayes' Rule (aka a Bayesian update)? The new information would be the goal difference from the past 6 matches, which could then be used as evidence to calculate a posterior probability. Have you tried running a Bayesian model for football predictions, David? My effort is still under construction and a long way from being accurate. Results tend to be too home biased. Regards, Bonnevillet100