Have you ever wondered about the data science behind sports betting and odds making? Yes, that’s right… odds, betting and data science are one family. Before we dive in, there is one piece of background we have to cover first: what odds really are.
This is going to be quick and painless, so please bear with me… Imagine you roll a die. There is a 1/6 (16.666%) chance of landing on any of the numbers 1-6, and converting that into decimal odds gives 1 ÷ (1/6) = 6.00. So, what are the odds? Odds are just a probability expressed in a betting-friendly way.
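The die example above can be sketched in a couple of lines. This is just the textbook probability-to-decimal-odds conversion, nothing specific to any bookmaker's method:

```python
def decimal_odds(probability: float) -> float:
    """Decimal odds are simply the inverse of the probability."""
    return 1 / probability

# Rolling any single number on a fair die:
print(decimal_odds(1 / 6))  # → 6.0

# A fair coin flip:
print(decimal_odds(0.5))  # → 2.0
```

Note that a real bookmaker would then shave these "fair" odds down to build in a margin, which is where the profit comes from.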
Is that it? Not really. Together with rowdie.co.uk we are going to look at the complicated part, and hear what the experts have to say about it.
There are many sports out there. To keep it simple, let’s talk about a sport most of us know: football (in the US they use the name soccer). However, even within football there are dozens of possible bets (markets) a punter can place. Besides the most common ones like the 3-way result, over/under and HT/FT, there are handicaps, first team to score, correct score, fouls, cards, corners and much more. Calculating probabilities for those events is not as simple as one might think. Events such as a dice roll or a coin flip have their probabilities written into their DNA, but that is not the case for football.
A practical example: on our page we ran a survey asking 50 verified punters to estimate the chances of the Premier League match between Liverpool and Chelsea back in 2019. Even after we excluded the extreme answers, the margin of error was still over 20%.
You might say: sure, that’s why the bookmakers earn so much… and you have a point. So we went further and asked 3 mathematicians to do the same job. Without adding more drama to it, I will reveal that the margin of error dropped only slightly, and nobody in the experiment would have been able to run a bookmaking business on their estimates.
The goal is to distinguish between real data and the noise
All the participants in our experiment had the data they needed. They had statistics going 20 years back. They had details on every fixture, split into ball possession, shots on/off target, corners, fouls, passes and dozens more, and they could see the same breakdown for each player. Nothing helped. Where did they fail? Being slammed with huge amounts of data causes one thing: you try to look everywhere, take everything into account, and fail to focus on what actually matters. This is what we call data noise, and the main goal of a bookie when creating odds is to distinguish the real signal from the noise.
OK, how do we do that? Well, if there were a simple answer, bookies wouldn’t be making billions… but I will try to answer it. There are 2 main ways to approach the data.
Data analysis based on past results
What was the most common approach, the one most of the participants took? The most obvious one. Most of them looked at the past results of each team (the more thorough ones used head-to-head statistics) and estimated the expected goal count, mostly with the arithmetic mean. Some went 5 games back, while others decided data from 10+ years back was still relevant. With the expected goal counts in hand, we gave them a little help and used the Poisson distribution to calculate the probabilities.
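The Poisson step can be sketched as follows. This is the standard textbook simplification, treating each team's goals as an independent Poisson variable and summing over scorelines; the expected-goal figures in the example are invented for illustration, not taken from the survey:

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of scoring exactly k goals, given an expected goal count lam."""
    return lam ** k * exp(-lam) / factorial(k)

def match_probabilities(home_xg: float, away_xg: float, max_goals: int = 10):
    """Home win / draw / away win probabilities, summing over all scorelines
    up to max_goals goals per side (the tail beyond that is negligible)."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Hypothetical expected goal counts: 1.6 for the home side, 1.1 for the away side.
hw, d, aw = match_probabilities(1.6, 1.1)
print(hw, d, aw)
```

Dividing 1 by each of these probabilities would give the fair decimal odds for the three-way market, before any bookmaker margin is applied.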
The problem with this approach is that sometimes it works and sometimes it doesn’t. Sometimes a data span of half a year is enough; sometimes you get more accurate predictions with 5 years of data. In other words: to claim your data span is “correct”, you have to verify the outcome with statistical methods.
This led us at Rowdie to divide the data outcomes into 4 groups, from 1 (great results) to 4 (poor results). Can a bookmaker afford leagues where the data analysis is done the wrong way? Nope. That would be costly, and that is why there is a second approach.
Data analysis broken down to individual players
None of the participants used this approach, and the reason is the complexity of the multi-level formulas and variables. The advantage of this detailed approach is best explained with an example. Imagine a great team, e.g. FC Barcelona, where 2 key players (say Messi and Griezmann) are not in the line-up. This is the moment where an approach based on recent results falls apart. A team is made of players, and if the best ones are missing, an analysis based on past results is doomed to produce wrong outputs.
What do you do? To keep it simple, imagine the squad broken down into individual players, each ending up with a number called “impact”. In fact, each player gets 2 numbers: defence and offence. When a player with a high offensive number is missing, the scoring ability of the team goes down. Similarly, it drops if a strong defender who is supposed to take care of Messi is missing. Even with this procedure the need for retroactive verification remains, but the model produces more accurate probabilities.
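A toy sketch of the player-impact idea might look like this. Every impact number below is invented for illustration, and a real model would be far more elaborate, but it shows how a missing attacker pulls the team's expected goal count down:

```python
# Hypothetical per-player impact numbers (offence and defence).
squad = {
    "Messi":     {"offence": 0.40, "defence": 0.05},
    "Griezmann": {"offence": 0.25, "defence": 0.05},
    "Busquets":  {"offence": 0.05, "defence": 0.30},
    # ... rest of the squad
}

def adjusted_expected_goals(base_xg: float, missing: list[str]) -> float:
    """Subtract the offensive impact of each absent player from the
    team's baseline expected goal count."""
    return base_xg - sum(squad[p]["offence"] for p in missing if p in squad)

# Full-strength baseline of 2.1 expected goals, with both key attackers out:
print(adjusted_expected_goals(2.1, ["Messi", "Griezmann"]))
```

The adjusted figure would then feed into the same Poisson machinery as before, which is exactly why this approach survives a weakened line-up where the pure past-results model fails.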
Predicting the outcome of a football match is not a static task. Your approach to data analysis needs to be dynamic, and with the constantly changing variables the outcomes keep shifting too.