pi Predictions always throws up a few interesting outcomes for discussion, and all of a sudden other employees in the office seem surprisingly nervous about getting the correct result. There was even talk earlier in the season that some potential jiggery-pokery (technical term) had taken place: “Sheffield United to win again?!” Just jealous Leeds United and Bradford City fans, really…
I should start with a recap for those who are unaware of what I’m blabbering on about, though.
Using pi Analytics to predict the Premier League
In short, I took a load of freely available historical football data (results, shots, expected goals etc), played around with it a little bit (calculating different metrics for the last two games, four games, six games and so on – I can’t give the whole game away, no pun intended) and then threw all that data into our pi Analytics model as characteristics, with the objective of predicting football matches. Simples.
Well, it is actually. We’re trying to make decision trees fun, easy, and not just for data geeks. It’s a way of trying to bridge that gap between data science and the people with the domain knowledge. YES, I have a background in data and YES, I bloody love football, but this could be about something you love or are passionate about. It could even be about work…
What is this ‘Machine Learning’ that everyone keeps showing off about, I hear you ask?
Well to put it simply, it’s just more data. As the games take place and the results appear, the model has more data to ‘learn’ from. It’s like a human; the more information we have, the more informed decisions we can make about something. That all said, this is football and by its very wonderful nature is unpredictable; that’s why we love it, well, I do anyway (other sports are available).
This ‘learning’ data can then provide the basis for a predictive model to be applied to new data as we receive it. In our pi Predictions scenario that would be the upcoming fixtures and the potential outcomes of those games, but let’s forget football for a minute (honestly it won’t be long)…
Other scenarios could be our fantastic NHS and predicting the quality of an x-ray before it has taken place. In the insurance world it may be the likelihood of a claim before an accident has even happened. In the fascinating world of finance it may be the chance that a customer will actually pay an invoice on time.
You hopefully get the idea… or erm, the invoice.
So, in the Panintelligence world, this ‘prediction’ against a new set of data is what we call pi Decision: the ability to save a model and apply it to a new data set. Here is an example (screenshot time!):
My Home Win Model
Now let’s break it down and think of the decision tree as one of those penny coin-drop machines you find at a seaside arcade. We put the coin in the top (our new data record, or fixture) and follow its route until it finds a resting place and gives us the potential outcome.
The grey ‘node’ at the top tells us the probability of a Home Win (or whatever your objective may be) without any characteristics being applied. From this data set of 1,413 records, there is a 47% chance of a Home Win in the Premier League (this is data from around the last four seasons, for those interested).
This means that if you chose the home team to win every game in the Premier League across the last four seasons, you’d be correct 47% of the time.
Now this is already interesting to people with domain knowledge because there are three outcomes in football: home win, away win, and draw.
We’d therefore expect the probability to tell us it’s a 1 in 3 chance or 33.3%, but straight away, historical data shows us it isn’t the case. Now home advantage in sport is a WHOLE different area (I actually did my dissertation on it back in the day, don’t you know?) so I won’t bore you with that…
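For the data-curious, the base-rate calculation above is nothing fancy. Here is a minimal sketch using a toy list of full-time results (the real model works on the ~1,413 historical records mentioned earlier; the list below is illustrative only):

```python
# Count outcomes in a toy set of full-time results:
# "H" = home win, "A" = away win, "D" = draw.
from collections import Counter

results = ["H", "A", "D", "H", "A", "H", "D", "H", "A", "H"]  # toy data

counts = Counter(results)
home_win_rate = counts["H"] / len(results)
print(f"Home win base rate: {home_win_rate:.1%}")
```

On the real four-season data set this comes out at roughly 47%, comfortably above the naive 1-in-3 guess.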
Interpreting the model
As we follow the coin down to the first split, we see that due to some nifty linear regression modelling (we worked all this out for you, so don’t worry), our analytics model has found that non-shot expected goals – NSXG (basically teams playing in areas close to the goal without a shot taking place) – is the most significant characteristic to determine the outcome of the game.
It’s now telling me that if the home team has NSXG of greater than or equal to 1.762, then this home win probability jumps to 69%. Great stuff. We’re on to something.
This split then carries on at each node until we find that applying multiple characteristics results in a 78% probability of a home win. So, if the characteristics of a new record match the path below, then (based on historical data) our model will predict a 78% chance of a home win. We only had a 47% probability at the start, but now we’re at 78%. Even better!
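The penny-drop idea can be sketched in a few lines of Python. To be clear about what’s real and what isn’t: the NSXG threshold (1.762) and the 69%/78% probabilities come from the article above, but the tree structure, the second split on `home_shots`, and the 30% low branch are purely hypothetical, invented to illustrate how a fixture falls through the splits:

```python
# A toy "penny-drop" decision tree. Threshold 1.762 and the 69%/78%
# leaves echo the article; everything else is illustrative.
tree = {
    "feature": "home_nsxg", "threshold": 1.762,
    "low":  {"prob_home_win": 0.30},               # NSXG < 1.762 (made up)
    "high": {
        "feature": "home_shots", "threshold": 15,  # hypothetical next split
        "low":  {"prob_home_win": 0.69},
        "high": {"prob_home_win": 0.78},           # the 78% leaf
    },
}

def drop_coin(node, fixture):
    """Follow a new fixture down the tree until it lands in a leaf."""
    while "feature" in node:  # internal nodes have a split; leaves don't
        branch = "high" if fixture[node["feature"]] >= node["threshold"] else "low"
        node = node[branch]
    return node["prob_home_win"]

fixture = {"home_nsxg": 1.9, "home_shots": 18}     # a hypothetical new record
print(f"Predicted home-win probability: {drop_coin(tree, fixture):.0%}")
```

The model-building itself (choosing which characteristic to split on, and where) is what pi Analytics works out for you; the sketch only shows the prediction step, i.e. dropping a new fixture through an already-built tree.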
What does the outcome look like?
An example of this in action is my ‘fixture table’ below, which shows how the models above feed into the new data to give me the weekly predictions. N.B. Remember I mentioned that football is unpredictable…?
So that’s it really. pi Analytics, machine learning, Premier League, football, home advantage, predictive models, Sheffield United being amazing, arcade machines… erm… data and a potential European tour. Did I mention Sheffield United!?
Oh, I suppose you’re all wondering how #pi Prediction is actually doing?
Well, pi Analytics has predicted 137 results correctly so far this season, while our Panintelligence employees have managed only 123 correct predictions.
Still, lots more football to be played, so the title isn't quite wrapped up yet! (Although it likely is for Liverpool...)
Please note – These are the words of Alex Paramore and sometimes he waffles a bit. Panintelligence accepts no responsibility for the word count.