# Tennis Model (Betfair Australian Open Datathon)

**What**: A data science contest run by Betfair to predict the results of the 2019 Australian Open (tennis) using historical data. I won first place ($5000 cash prize) and also beat Betfair's internal models, which were published as benchmarks.

**When**: 2019

**Who**: Me

## Approach

### Competition

Entrants had to predict the outcome of every match in the Australian Open using historical data. Since the head-to-head matchups are unknown beyond the first round, submissions needed to include a prediction for every *possible* matchup in the men's and women's draws.
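To give a sense of the size of that submission, here is a minimal sketch that enumerates every possible pairing in one draw. It assumes the standard 128-player Grand Slam singles draw; the player names and the `all_matchups` helper are placeholders for illustration, not part of the competition's actual submission format.

```python
from itertools import combinations


def all_matchups(players):
    """Return every unordered pair of distinct players, in sorted order."""
    return list(combinations(sorted(players), 2))


# Hypothetical 128-player draw; the names are placeholders.
draw = [f"player_{i:03d}" for i in range(128)]
matchups = all_matchups(draw)

print(len(matchups))  # 128 * 127 / 2 = 8128 pairs per draw
```

Under that assumption, each draw needs 8128 predictions, even though only 127 matches are actually played in it.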

### Player Embeddings

I used a number of techniques for this task but one of the most successful strategies was to generate embeddings to represent players. Embeddings are a deep learning technique for representing categorical variables in a continuous space. Traditionally this method is used in NLP (natural language processing), but I used it to capture relationships between players as well as information about how external factors (such as conditions) affect match results.
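As a rough illustration of the idea (not the model I actually trained), the sketch below stores one vector per player, looks up the two vectors for a match, and feeds them together with a condition feature through a single logistic layer. The court-surface feature, the weights and the `win_probability` helper are all assumptions made for the example, and everything is randomly initialised rather than learned.

```python
import math
import random

random.seed(0)

EMBED_DIM = 32  # embedding size; an illustrative choice for this sketch
SURFACES = ["hard", "clay", "grass"]  # assumed stand-in for "conditions"
PLAYERS = ["Roger Federer", "Rafael Nadal", "Novak Djokovic"]

# One vector per player. Randomly initialised here; in a real model these
# values would be learned from historical match results.
embeddings = {
    name: [random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
    for name in PLAYERS
}

# Hypothetical output layer: one weight per input feature.
n_features = 2 * EMBED_DIM + len(SURFACES)
weights = [random.gauss(0.0, 0.1) for _ in range(n_features)]


def win_probability(player_a, player_b, surface):
    """Untrained forward pass: P(player_a beats player_b) on a surface."""
    surface_one_hot = [1.0 if s == surface else 0.0 for s in SURFACES]
    features = embeddings[player_a] + embeddings[player_b] + surface_one_hot
    logit = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


print(win_probability("Rafael Nadal", "Roger Federer", "clay"))
```

During training, the embedding values are updated along with the other weights, so players who produce similar results end up with similar vectors.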

The accompanying plot shows the embeddings of all players who competed in 200+ matches between 1991 and 2020. The raw embeddings are 32-dimensional and can't be visualised directly, so I used a dimensionality reduction technique called UMAP (Uniform Manifold Approximation and Projection) to project them into 2 dimensions.

While a lot of information is lost in the dimensionality reduction process, we can see a number of patterns such as:

- Roger Federer, Rafael Nadal and Novak Djokovic (the 3 greatest players of the open era) are clustered together and separated from the rest of the players.
- There is another cluster with some of the next best players (Andy Murray, Pete Sampras, Andre Agassi, Andy Roddick and Juan Martín del Potro).
- Clay specialists have been grouped together (Gustavo Kuerten, Thomas Muster, Carlos Moyá etc.).

My primary reason for using embeddings rather than a simple measure of "form" was my hypothesis that head-to-head matchups cannot be captured with a single number. Much like in a game of "Scissors Paper Rock", just because Player X is historically stronger than Player Y and Player Y is historically stronger than Player Z, we can't assume that Player X is better than Player Z. That is, "form" is not transitive, so a single ranking cannot describe it and we need a more sophisticated way of representing player strength.
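A toy example makes the point concrete. With one rating per player, X beating Y and Y beating Z forces X's rating above Z's, so a cycle is impossible. With vectors it is not. The sketch below uses hand-picked 2-dimensional embeddings and an antisymmetric score (the 2-D cross product); both are assumptions chosen to show that a cycle can be represented, not the scoring function my model used.

```python
import math

# Hypothetical 2-D embeddings placed 120 degrees apart on the unit circle.
angles = {"X": 0.0, "Y": 120.0, "Z": 240.0}
embeddings = {
    name: (math.cos(math.radians(deg)), math.sin(math.radians(deg)))
    for name, deg in angles.items()
}


def win_probability(a, b, scale=3.0):
    """P(a beats b) from an antisymmetric score: the 2-D cross product."""
    ax, ay = embeddings[a]
    bx, by = embeddings[b]
    score = ax * by - ay * bx  # score(a, b) == -score(b, a)
    return 1.0 / (1.0 + math.exp(-scale * score))


for a, b in [("X", "Y"), ("Y", "Z"), ("X", "Z")]:
    print(f"P({a} beats {b}) = {win_probability(a, b):.2f}")
```

Here X is favoured over Y and Y over Z (each at roughly 0.93), yet X is the underdog against Z (roughly 0.07), which is exactly the "Scissors Paper Rock" pattern.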

## Results

Predictions were evaluated using log loss over the 254 matches that actually took place. My model for the men's draw performed better than Betfair's market odds, which is notable given that I had to submit my predictions before the tournament began, whereas the market odds could factor in earlier matches in the tournament, injuries etc.
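For reference, here is a minimal sketch of the metric, assuming the standard binary log loss (the mean negative log-likelihood of the observed outcomes). The exact scoring script Betfair used is not reproduced here, and the clipping constant `eps` and the sample numbers are illustrative choices.

```python
import math


def log_loss(outcomes, probabilities, eps=1e-15):
    """Mean binary log loss.

    outcomes: 1 if the first-listed player won, else 0.
    probabilities: predicted chance that the first-listed player wins.
    """
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(outcomes)


# Made-up predictions for three matches.
print(log_loss([1, 0, 1], [0.8, 0.3, 0.6]))
```

Lower is better: always predicting 0.5 scores ln 2 (about 0.693), and confident wrong predictions are penalised heavily.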