This test was developed with Colab on Google Drive. If you want access to the drive to run these notebooks, ask at ediboc@gmail.com (test evaluators only).
The data was transformed with the pandas library, and the final tables were imported into MySQL. The import can be done manually or with the Python library mysql.
Below is a summary of the steps taken. All the code was developed in Jupyter notebooks, where you will find comments with further details.
In this notebook we cleaned the data (deleting repeated rows, fixing numeric variables and wrong values) and then joined the 3 CSV files into one table.
Resulting files:
- CompleteCSVDataset.csv: the result of joining PlayerPersonalData.csv, PlayerAttributeData.csv and PlayerPersonalData.
- ColumnDataIndex.csv: a table (dataframe) that lists every column name of CompleteCSVDataset.csv in the first column and the original table each column came from in the second column.
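The clean-then-join step and the column index can be sketched roughly as follows. This is only an illustration under stated assumptions: the three small dataframes, their column names, and the use of `ID` as the join key are hypothetical stand-ins, and the notebooks may clean and join the real files differently.

```python
import pandas as pd

# Hypothetical stand-ins for the three source tables (real columns differ).
frames = {
    "personal": pd.DataFrame(
        {"ID": [1, 2, 2], "Name": ["A", "B", "B"], "Nationality": ["X", "Y", "Y"]}
    ),
    "attributes": pd.DataFrame({"ID": [1, 2], "Acceleration": ["70", "65+2"]}),
    "positions": pd.DataFrame({"ID": [1, 2], "Preferred Positions": ["ST", "GK"]}),
}

# Delete repeated rows in every table.
cleaned = {name: df.drop_duplicates() for name, df in frames.items()}

# Fix a numeric variable: values that are not plain numbers become NaN.
cleaned["attributes"] = cleaned["attributes"].assign(
    Acceleration=pd.to_numeric(cleaned["attributes"]["Acceleration"], errors="coerce")
)

# Join the three tables on the shared key (assumed here to be "ID").
complete = cleaned["personal"]
for name in ("attributes", "positions"):
    complete = complete.merge(cleaned[name], on="ID", how="inner")

# Column index: one row per column, with the table it came from.
rows = [
    (col, name)
    for name, df in cleaned.items()
    for col in df.columns
    if col != "ID"
]
column_index = pd.DataFrame(rows, columns=["column", "source_table"])
```

Values such as `"65+2"` end up as missing in this sketch; whether to drop, parse, or impute them is a separate decision.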
In this notebook the 3 JSON files were joined into one table. Each JSON file was transformed into a list of documents, and the function json_normalize was used to turn those lists into dataframes.
Resulting file: CompleteJSONDataset.csv: the result of joining languages.json, countries.json and continents.json into one dataframe.
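As a rough sketch of this JSON step, the example below turns nested documents into lists, flattens them with `pd.json_normalize`, and merges the results. The document shapes and the column names (`code`, `continent`, `language_code`, and so on) are assumptions made for illustration; the real files may be structured differently.

```python
import pandas as pd

# Hypothetical miniature versions of the three JSON files.
countries = [
    {"code": "BR", "name": "Brazil", "continent": "SA", "languages": ["pt"]},
    {"code": "CH", "name": "Switzerland", "continent": "EU", "languages": ["de", "fr"]},
]
languages = {"pt": {"name": "Portuguese"}, "de": {"name": "German"}, "fr": {"name": "French"}}
continents = {"SA": "South America", "EU": "Europe"}

# Countries: one row per (country, language) pair.
country_df = (
    pd.json_normalize(countries)
    .explode("languages")
    .rename(columns={"languages": "language_code", "name": "country"})
)

# Languages: dict keyed by code -> list of documents -> dataframe.
language_docs = [{"language_code": code, **doc} for code, doc in languages.items()]
language_df = pd.json_normalize(language_docs).rename(columns={"name": "language"})

# Continents: simple code -> name lookup table.
continent_df = pd.DataFrame(list(continents.items()), columns=["continent", "continent_name"])

# Join everything into one dataframe.
combined = country_df.merge(language_df, on="language_code").merge(continent_df, on="continent")
```

A country with several languages appears once per language, which is convenient when metrics are later grouped by language.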
This notebook joins the files CompleteCSVDataset.csv and CompleteJSONDataset.csv and analyzes the metrics by language.
Resulting file: CompleteDataLanguage.csv: the join of the player files (CSV) and the language files (JSON).
Here we import CompCsvClusters.csv as the table players (primary key ID) and CompleteJSONDataset as the table country_languages (primary key ID_ctlg).
A structured (relational) database was chosen because each observation can be stored in a single row, the tables have primary keys, and there are no sub-documents.
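Loading a dataframe into a table with a primary key can be sketched as below. To keep the example runnable without a database server, it uses an in-memory SQLite database as a stand-in for MySQL; with MySQL the same `to_sql` call would normally go through a SQLAlchemy engine and a MySQL driver. The sample rows and column types are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical sample of the players table.
players = pd.DataFrame(
    {"ID": [1, 2], "Name": ["A", "B"], "Cluster_countries": [1, 0]}
)

# SQLite in memory stands in for the MySQL server used in the project.
conn = sqlite3.connect(":memory:")

# Create the table first so the primary key is declared explicitly.
conn.execute(
    "CREATE TABLE players ("
    "ID INTEGER PRIMARY KEY, Name TEXT, Cluster_countries INTEGER)"
)

# Append the dataframe rows to the existing table.
players.to_sql("players", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
```

Creating the table before the import, rather than letting `to_sql` create it, is what keeps the primary key constraint in place.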
Objective 2: Build a data visualization of the FIFA game data by country and language, with clustering to help us make the best decisions for localization.
It is important to highlight that the number of observations differs greatly between countries and between languages, so we decided to cluster using only the top 20 players from each country or language.
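One way to build the per-group input for such a clustering is sketched below. It assumes that "top" means highest `Overall` and that the group column is called `Nationality`; both are assumptions, and the clustering algorithm itself is not shown because it is not specified here.

```python
import pandas as pd


def top_n_means(players, group_col, metrics, n=20):
    """Average the metrics over the n best players (by Overall) in each group."""
    ranked = players.sort_values("Overall", ascending=False)
    top = ranked.groupby(group_col).head(n)  # keep at most n rows per group
    return top.groupby(group_col)[metrics].mean()


# Hypothetical data: country X has three players, country Y has two.
players = pd.DataFrame(
    {
        "Nationality": ["X", "X", "X", "Y", "Y"],
        "Overall": [90, 80, 70, 60, 50],
        "Potential": [92, 85, 75, 65, 60],
    }
)

# n=2 here only to keep the example small; the project uses 20.
summary = top_n_means(players, "Nationality", ["Overall", "Potential"], n=2)
```

Capping every group at the same number of players keeps countries with thousands of players from being compared against countries with only a handful on unequal terms.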
Clustering by country and top 20 players
We get 5 clusters. Cluster 1 contains the countries that produce the best players and whose performance is similar. Cluster 2 contains countries with similar performance whose metrics, such as Overall and Potential, are slightly lower than those in cluster 1. The same reading applies to clusters 3, 4 and 5.
Resulting file: CompCsvCLusters.csv: CompleteCSVDataset.csv with the column Cluster_countries added. For players not in the top 20, or otherwise left out of the clustering, Cluster_countries is 0.
Clustering by language and top 20 players
We get 7 clusters, where cluster 1 contains the languages that produce the best players and whose performance metrics are similar.
Resulting file: languageCLusters.csv: a dataframe of the average metrics per language and the Cluster_language column.
To display metrics of players, teams and countries based on country clusters.
To display metrics by language clusters.
Note: there is a file 'dashboard print screen.doc'
The sentiment analysis has three steps:
- First, build a classifier that labels comments as positive or negative.
- Get comments from a social network.
- Classify the comments and review the results.
In the folder NLP_movie we create a classifier based on the nltk movie_reviews data. The notebook NLP_preprocess.ipynb preprocesses the documents, and NLP_Models.ipynb builds the models (LogisticRegression, XGBClassifier, SVC from Sklearn and a neural network). We chose the neural network because it obtained the highest accuracy.
Resulting file: NN_model.h5: the neural network model used to classify comments.
The notebook Get_tweets.ipynb collects tweets related to FIFA20 together with their location.
Resulting file: DFtweets.csv: a dataframe of tweets with user info, location, creation date and text.
The notebook 3-1tweets_comments_classifier.ipynb preprocesses the tweet texts, classifies each one as a positive or negative comment, and presents a table of results by location.
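A results table by location could be assembled roughly like this once every tweet has a predicted label. The dataframe below is made-up sample data, and the column names `location` and `label` are assumptions rather than the notebook's actual names.

```python
import pandas as pd

# Hypothetical classifier output: one row per tweet.
classified = pd.DataFrame(
    {
        "location": ["Mexico", "Mexico", "Spain", "Spain", "Spain"],
        "label": ["positive", "negative", "positive", "positive", "negative"],
    }
)

# Count positive and negative tweets for each location.
results = pd.crosstab(classified["location"], classified["label"])

# Share of positive tweets per location (row totals are taken before the new column exists).
results["positive_share"] = results["positive"] / results.sum(axis=1)
```

Reporting the share alongside the raw counts makes locations with very different tweet volumes easier to compare.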
The YouTube comments have no location information.
The notebook Get_youtube_comments.ipynb collects comments on videos related to FIFA20 and saves them in the file ytComments_translated.csv.
The notebook 3-2youtube_comments_classifier.ipynb preprocesses the video comments, classifies each one as a positive or negative comment, and presents a table of results by video.
These results can also be found in a Word document with the same name.