A-Machine-Learning-Approach-to-Analyze-the-Statistics-of-Football-Players

In this project I tried to predict which players should be sold and which should be retained by their team at the end of each season, based on their performance and other factors. The predictions use SVM and KNN classifiers and reach approximately 80% accuracy.

Dataset

The data was collected from two websites, whoscored.com and transfermarkt.co.uk, by web scraping and stored in Excel sheets. We collected the players’ performance statistics, their market values, and the teams’ transfer records for both buying and selling of players. The records cover the last 5 years of Europe’s top 5 leagues.

Preprocess

After the data is collected, the first step is to preprocess it. Because the data was collected for 5 different leagues, the per-league records are combined into a single dataset. After combining, there are four datasets: player performance statistics, market value details, selling details, and buying details. From the player statistics dataset, three attributes are selected: ‘players name’, ‘minutes played’, and ‘rating’. From the market value dataset, two attributes are selected: ‘players name’ and ‘market value’. From the buying details dataset, three attributes are selected: ‘players name’, ‘transfer fee’, and ‘market value’. From the selling details dataset, only the ‘players name’ attribute is selected.

The ‘players name’ attribute of the player statistics dataset was a multivalued attribute, so everything except the player’s name has been stripped off. The price columns of the buying and selling details datasets and the ‘market value’ column of the market value dataset contained the Euro sign and units such as ‘m’ and ‘k’; these have been stripped off. In addition, many entries in these columns contained values such as ‘Free Transfer’, ‘-’, and ‘?’. All of these have been replaced by zeros.

The player statistics dataset and the buying details dataset are then merged on ‘players name’. This gives the buyout clause of the players bought by clubs in that season; for the remaining players the value is null, and these nulls are replaced by zero. The resulting dataset is then joined with the market value dataset on ‘players name’. Some entries in the result have a market value and the rest have a null value for that attribute. A linear regression model, fitted on the rows that have valid market values, is used to predict the market value for the entries that contain null. The result is stored in a new dataset named ‘dataset1’.
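
The cleaning, merging, and imputation steps above can be sketched roughly as follows. This is a minimal illustration with pandas and scikit-learn, not the project's actual code: the toy rows, the helper name parse_money, the choice to convert ‘k’ values into millions (the text only says the units were stripped), and the predictors used for the regression are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder strings that the text says were replaced by zeros.
PLACEHOLDERS = {"Free Transfer", "-", "?", ""}

def parse_money(raw):
    """Hypothetical helper: turn '€12.5m' or '€900k' into a float in millions of euros."""
    if pd.isna(raw):
        return np.nan
    text = str(raw).strip()
    if text in PLACEHOLDERS:
        return 0.0
    text = text.replace("€", "").strip()
    if text.endswith("m"):
        return float(text[:-1])
    if text.endswith("k"):
        # Assumption: scale thousands to millions so all values share one unit.
        return float(text[:-1]) / 1000.0
    return float(text)

# Toy stand-ins for the scraped sheets (invented values).
stats = pd.DataFrame({
    "players name": ["A", "B", "C", "D", "E"],
    "minutes played": [2700, 1800, 900, 2400, 1200],
    "rating": [7.4, 6.9, 6.3, 7.1, 6.6],
})
buying = pd.DataFrame({
    "players name": ["B", "D"],
    "transfer fee": ["€20m", "Free Transfer"],
})
market = pd.DataFrame({
    "players name": ["A", "B", "C", "D"],
    "market value": ["€40m", "€18m", "€900k", "€25m"],
})

buying["transfer fee"] = buying["transfer fee"].map(parse_money)
market["market value"] = market["market value"].map(parse_money)

# Left-merge so players who were not bought stay in the data, then zero-fill.
merged = stats.merge(buying, on="players name", how="left")
merged["transfer fee"] = merged["transfer fee"].fillna(0.0)
merged = merged.merge(market, on="players name", how="left")

# Fit on rows with a known market value, predict the missing ones.
# The predictor columns are an assumption; the text does not list them.
predictors = ["minutes played", "rating", "transfer fee"]
known = merged["market value"].notna()
model = LinearRegression().fit(merged.loc[known, predictors],
                               merged.loc[known, "market value"])
merged.loc[~known, "market value"] = model.predict(merged.loc[~known, predictors])

dataset1 = merged
print(dataset1)
```

A left merge is used here so that players missing from the buying or market value sheets are kept with null values, which matches the null handling described above.
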
Next, dataset1 is joined with the selling details dataset, keeping only the players whose names appear in the selling dataset. An extra column named ‘Class’ is added to this dataset with value 0, so it contains the sold players with class value 0. This dataset is then joined back with dataset1, so some rows of the new dataset have ‘Class’ value 0 and the rest have a null value. These null values are set to 1, which signifies that the player has been retained by their team. Finally, the ‘Class’ value for players with a rating below 6.5 is set to 2, which signifies that the player has not performed well this year but the team has kept them to give them another chance. The final dataset therefore consists of the features ‘Playing Time’, ‘Rating’, ‘Buyout Clause’, and ‘Market Value’, and the target variable ‘Class’.
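
The labelling rule can be sketched as below. This is only an illustration on invented rows: it uses a membership test (isin) instead of the two joins described above, and it assumes that class 2 is applied only to players who were not sold, since class 2 is described as players the team has kept.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for dataset1 and the selling details sheet (invented values).
dataset1 = pd.DataFrame({
    "players name": ["A", "B", "C", "D"],
    "rating": [7.4, 6.9, 6.3, 6.1],
})
selling = pd.DataFrame({"players name": ["B", "D"]})

# Class 0: the player appears in the selling details; class 1: retained.
sold = dataset1["players name"].isin(selling["players name"])
dataset1["Class"] = np.where(sold, 0, 1)

# Class 2: retained despite a rating below 6.5.
# Assumption: sold players keep class 0 even if their rating is low.
low_rated_and_kept = (~sold) & (dataset1["rating"] < 6.5)
dataset1.loc[low_rated_and_kept, "Class"] = 2

print(dataset1)
```
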

Classification

In this paper we have used machine learning techniques to classify whether a player should be retained by his team or sold, based on the features we have extracted. First, the dataset is divided into two parts in an 80 to 20 ratio using the train_test_split method available in Python's sklearn library: 80% of the data is used to train the models and 20% is used for testing. Two supervised machine learning algorithms are used for classification, namely the SVM and KNN implementations provided by sklearn. Each model is trained on the training set and then evaluated on the test set to measure its accuracy. Finally, the attributes of a player are passed to the trained model, which predicts the class for that player.
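
A minimal sketch of this workflow is shown below. It runs on randomly generated stand-in data rather than the project's dataset, uses the sklearn default hyperparameters, and the feature ranges, random seeds, and the example player are all invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the final dataset (assumed value ranges).
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(0, 3420, n),    # Playing Time (minutes)
    rng.uniform(5.5, 8.5, n),   # Rating
    rng.uniform(0, 80, n),      # Buyout Clause
    rng.uniform(0, 100, n),     # Market Value
])
y = rng.integers(0, 2, size=n)  # 0 = sold, 1 = retained (random labels)
y[X[:, 1] < 6.5] = 2            # 2 = low rating, as in the labelling rule

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

svm = SVC().fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)

svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print("SVM accuracy:", svm_accuracy)
print("KNN accuracy:", knn_accuracy)

# Predict the class of a single (hypothetical) player from his attributes.
player = [[2500, 7.2, 0.0, 30.0]]
predicted_class = int(svm.predict(player)[0])
print("Predicted class:", predicted_class)
```

Because the labels here are random, the printed accuracies say nothing about the real results; the sketch only shows the order of the steps.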

Results

On testing, the SVM model gives 82% accuracy whereas the KNN model gives 80% accuracy, so in this case SVM outperforms KNN by a slight margin. On our dataset, the SVM model gives the best accuracy with the ‘Linear’ kernel, regularization value (C) 1.0, and gamma value 0.01. The KNN model gives the best accuracy with the ‘Hamming’ distance and K-value = 100.
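
For reference, the reported settings could be passed to sklearn roughly as follows. This is a sketch on synthetic data from make_classification, so the accuracies it prints are not the 82% and 80% reported above. Note that in sklearn the gamma parameter has no effect when the kernel is linear.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 4-feature, 3-class stand-in for the project's dataset.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters as reported in the results.
svm = SVC(kernel="linear", C=1.0, gamma=0.01)  # gamma is ignored by the linear kernel
knn = KNeighborsClassifier(n_neighbors=100, metric="hamming")

svm_accuracy = svm.fit(X_train, y_train).score(X_test, y_test)
knn_accuracy = knn.fit(X_train, y_train).score(X_test, y_test)
print("SVM accuracy:", svm_accuracy)
print("KNN accuracy:", knn_accuracy)
```

With K = 100 the training set must contain at least 100 rows, which is why the sketch generates 500 samples.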

Future Target

Future targets include increasing the size of the dataset by adding more seasons, and adding more features, such as international performance, to make better predictions. Another target is to build a front end so that the tool can be used directly in the industry.

Publication

The paper has been published in the IJSREM journal, in the December 2019 edition. Paper Link: http://ijsrem.com/download/a-machine-learning-approach-to-analyze-the-statistics-of-football-players/?wpdmdl=1975&masterkey=5df7645b66cc1