Learning Based Soccer Scouting System

This repo is used strictly for maintaining and developing the Final Year Project.

Abstract

With the advent of sports sciences and data analytics, a few teams from the major leagues of different sports have adapted to make the best use of the available resources and optimize their performances and results. This has been particularly visible in football, where the sport has become increasingly system-based, with a greater emphasis on occupying the right spaces, whether to attack, maintain possession or defend, and less reliance on individual quality. The objective of this project is to develop a system that can recommend the players best suited to a given team, i.e., they should be the right fit not only qualitatively, but also tactically. Ideally, the system should be powerful enough to report findings that improve on traditional scouting methods, the most commonly used and trusted of which is the eye test, and to surface accurate recommendations that those traditional methods miss. To quote Mikhail Zhilkin, Data Scientist at Arsenal Football Club and author of Data Science Without Makeup, “The end goal of data science is to change opinions.”

Introduction

The first task in any Machine Learning project is to gather data. Data collection therefore began with the English Premier League, chosen for its global exposure and coverage and thus presumably extensive data availability. Unfortunately, most websites only offer limited data (5-15 features), whereas the ones with extensive data do not make it available for free and are notoriously difficult to extract data from. At first, a dataset available on Kaggle with 10+ features was used to build a basic statistical model that outputs a score for the suitability of each player in a certain role, based on a pre-specified set of weights.

While no ‘learning’ was involved here, it helped build some familiarity with working with football statistics when limited data was provided.
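The weighted scoring idea can be illustrated with a minimal sketch. The feature names, player values and role weights below are hypothetical placeholders, not the ones used in the project, and the features are assumed to be on comparable scales already.

```python
# Hypothetical per-90 statistics for two players (illustrative values only).
players = {
    "Player A": {"goals": 0.6, "assists": 0.2, "tackles": 0.5},
    "Player B": {"goals": 0.1, "assists": 0.3, "tackles": 2.5},
}

# Hypothetical pre-specified weights describing what each role values.
role_weights = {
    "forward": {"goals": 0.6, "assists": 0.3, "tackles": 0.1},
    "defender": {"goals": 0.1, "assists": 0.2, "tackles": 0.7},
}


def role_score(stats, weights):
    """Weighted sum of a player's statistics for one role."""
    return sum(weight * stats.get(feature, 0.0) for feature, weight in weights.items())


def rank_players(players, weights):
    """Return player names ordered from most to least suitable for the role."""
    return sorted(players, key=lambda name: role_score(players[name], weights), reverse=True)


forward_ranking = rank_players(players, role_weights["forward"])
defender_ranking = rank_players(players, role_weights["defender"])
```

Because the weights are fixed by hand, nothing is learned from the data; the score only encodes prior beliefs about what matters for each role.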

Progress and Challenges

Later, a website called fbref.com was discovered, from which 100+ features were gathered for both players and clubs. Two datasets were generated from it for the 2022-23 Premier League season, one for player statistics and the other for team statistics. The intention was to train a regression model that learns how the 10 outfield players with the most appearances contribute to the team’s cumulative statistics and can predict those statistics, and a second model that predicts the number of points won by the team at the end of the season from the predicted team statistics. This would help with the scouting problem: for a given team, take each of the 10C9 = 10 combinations of nine players from its 10 most-used players, add one player from the rest of the league, predict the total number of points, and display the best results.

However, that turned out to be impractical (at least until this juncture) since:

- Team statistics would have to be predicted from the player statistics, from which points would then be predicted. Formulating a model on such indirect and heavily coupled relationships is too complex.
- Working with 40+ features usually leads to inaccurate predictions.
- Applying feature selection becomes futile, since certain features hold more significance for certain positions on the pitch (e.g. shots for forwards, tackles for defenders, passes for midfielders), and establishing a different set of weights across a homogeneous dataset becomes a challenge in and of itself.
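To make the intended search concrete, here is a rough sketch of the lineup enumeration: keep nine of the ten most-used players, add one outside candidate, and rank the results by predicted points. The function names are hypothetical, and `predict_points` is a stand-in for the two-stage model (player statistics to team statistics to points), which was never completed.

```python
from itertools import combinations


def candidate_lineups(core_players, league_pool):
    """Yield every lineup made of all but one core player plus one outside candidate."""
    core = list(core_players)
    outsiders = [player for player in league_pool if player not in core]
    for kept in combinations(core, len(core) - 1):
        for newcomer in outsiders:
            yield kept + (newcomer,)


def best_lineups(core_players, league_pool, predict_points, top_n=3):
    """Score every candidate lineup and return the top_n (points, lineup) pairs.

    predict_points is a hypothetical callable mapping a lineup to predicted points.
    """
    scored = [
        (predict_points(lineup), lineup)
        for lineup in candidate_lineups(core_players, league_pool)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

With ten core players, each outside candidate produces ten lineups, so the number of predictions grows linearly with the size of the candidate pool.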

What can be done is to compartmentalize this massively challenging problem into small problems and then proceed from there.

The first problem is as follows: a team is underperforming its xG (expected goals), i.e. it generates a high xG but scores fewer goals than that figure suggests, and it wants to address the issue by signing a new forward.
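As a small illustration of what underperforming xG means numerically, the sketch below flags teams whose goals scored fall short of their expected goals. The team names and figures are made up for the example.

```python
# Hypothetical season totals: team -> (goals scored, expected goals).
teams = {
    "Team A": (40, 52.3),
    "Team B": (60, 55.1),
    "Team C": (30, 33.0),
}


def xg_difference(goals, xg):
    """Goals minus expected goals; a negative value means underperformance."""
    return goals - xg


# Teams scoring fewer goals than expected, worst underperformer first.
underperformers = sorted(
    (name for name, (goals, xg) in teams.items() if xg_difference(goals, xg) < 0),
    key=lambda name: xg_difference(*teams[name]),
)
```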

First, forwards who played more than 0 minutes in the Premier League were selected from the player dataset, keeping only shot-related features. Then a dictionary was built with each team as a key and, as the corresponding value, a pandas DataFrame containing the shot-related data of that team’s forwards. X_train (the feature part of the training dataset) is generated by totalling most of the attributes while taking a weighted average of the remaining ones (like Shots per 90 mins and Shots on Target per 90 mins). y_train is generated from the team dataset. Regression models can then be trained on X_train and y_train.
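The aggregation step can be sketched as follows. To keep the example dependency-free, plain Python dictionaries stand in for the pandas DataFrames; the column names and values are hypothetical, goals scored is assumed as the target, and weighting the per-90 features by minutes played is an assumption, since the weights used are not specified above.

```python
from collections import defaultdict

# Hypothetical shot-related rows for forwards (illustrative values only).
forwards = [
    {"team": "Team A", "minutes": 1800, "shots": 60, "shots_on_target": 24, "shots_per_90": 3.0},
    {"team": "Team A", "minutes": 900, "shots": 20, "shots_on_target": 6, "shots_per_90": 2.0},
    {"team": "Team B", "minutes": 2700, "shots": 90, "shots_on_target": 30, "shots_per_90": 3.0},
    {"team": "Team B", "minutes": 0, "shots": 0, "shots_on_target": 0, "shots_per_90": 0.0},
]
# Hypothetical target taken from the team dataset (goals scored is assumed here).
team_goals = {"Team A": 45, "Team B": 60}

SUM_FEATURES = ["shots", "shots_on_target"]  # totalled across forwards
RATE_FEATURES = ["shots_per_90"]             # averaged, weighted by minutes (assumption)


def group_by_team(rows):
    """Map each team to the rows of its forwards who played more than 0 minutes."""
    grouped = defaultdict(list)
    for row in rows:
        if row["minutes"] > 0:
            grouped[row["team"]].append(row)
    return dict(grouped)


def aggregate(rows):
    """Collapse one team's forwards into a single feature vector."""
    total_minutes = sum(row["minutes"] for row in rows)
    features = [sum(row[name] for row in rows) for name in SUM_FEATURES]
    features += [
        sum(row[name] * row["minutes"] for row in rows) / total_minutes
        for name in RATE_FEATURES
    ]
    return features


by_team = group_by_team(forwards)
team_names = sorted(by_team)
X_train = [aggregate(by_team[name]) for name in team_names]
y_train = [team_goals[name] for name in team_names]
```

The resulting lists have one row per team and could be passed to any regression model, for instance scikit-learn's `LinearRegression().fit(X_train, y_train)`.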

The limitations of this method are:

- Using an aggregation of the forwards as the reference is not representative of how a team plays. Usually, 1-3 forwards are on the field at a given time, so the cumulative stats are not specific to any given situation.
- While goals scored depend directly on shots, shots depend on possession, passes, chances created, etc., which in turn depend on how efficient the team is at winning back possession, i.e., defensive stats. This approach cuts out those indirect relationships entirely. However, this is still better than dealing with a very large number of features.
- Taking an aggregation doesn’t take into account player interactions and chemistry, which are fundamental aspects of team-building, squad selection and expected performances.
- Only 20 data points have been generated so far (1 season of football, 20 teams), which is low. The priority is to gather data for more leagues and across more seasons to generate more data points for better training and subsequent testing.

Conclusions and Future Work

In the work done so far, it has become clear that the only way forward is to build simple models, analyze the results and then introduce further complexity to better represent real-world football. A proposed way of testing is to take historical data, predict the transfers that would theoretically have improved the team’s performance the most, and show real-world examples of those same transfers taking place and the team doing better thereafter. The converse can be tested as well: transfers that the system rates poorly should correspond to real-world signings after which the team did worse.

Every team has a certain philosophy and vision, and thereby the formations, tactics and profiles of players they prefer can be quite different. In fact, player roles and responsibilities in the same position in two different formations often vary significantly. For example, wingbacks in a five-man defence have far fewer defensive responsibilities than their counterparts in a four-man defence, and are therefore free to join the attack more frequently. Ideally, the project should be sensitive enough to identify the context and accurately scout players who can slot into a given system very well. For that reason, gathering data on players’ relative positions throughout games and on player-to-player interactions (passing, swapping positions, simultaneous pressing, etc.), supposing such data is freely available, and incorporating it would presumably lead to higher accuracy and more insightful results.

It must also be noted that players are often signed based on factors such as injury record, languages known, feelings towards the club, its manager, players and fans, likelihood of settling in well at a new club and their psychological state and mentality. Obtaining the data for some of these attributes and preparing an equivalent numerical measure for the others to be used by the system becomes a challenge.

References and Important Links

- https://www.kaggle.com/datasets/rajatrc1705/english-premier-league202021
- https://fbref.com/en/comps/9/2022-2023/stats/2022-2023-Premier-League-Stats
- https://www.guidetofootball.com
- GitHub Repo - https://github.com/sambuddharay/FinalYearProject
- Google docs - https://docs.google.com/document/d/15_9UxefSb8FwaqIMTKnxEtd_1cRLF0o5SlbDxFPEAxc/edit?usp=sharing