Write Python crawlers to collect thousands of game review scores from websites, and use Steam's RESTful APIs to get user ratings. Use natural language processing tools to match similar game names, so that game nicknames do not affect the results. Build a machine learning model to predict the user rating of a game; the model achieved an accuracy of 0.86.
A ProxyError may occur while crawling IGN and PC Gamer, and while getting JSON from the Steam API. The program includes a restart system: when a ProxyError occurs, it restarts the spiders and retries from the last checkpoint instead of running the whole program again. However, it cannot handle exceptions such as ConnectionError, which means the computer is completely disconnected from the Internet.
Please make sure the computer has a stable network connection.
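The restart idea can be sketched as follows. This is only a minimal illustration, not the project's actual spider code: `crawl_with_restart`, `fetch`, and `retry_on` are hypothetical names, and the caller is assumed to pass in its own download function and the exception class to retry on (for example the ProxyError raised by its HTTP library).

```python
import time


def crawl_with_restart(items, fetch, retry_on, start=0, max_restarts=5, wait_seconds=0.0):
    """Fetch every item, restarting from the last checkpoint on a retryable error.

    `fetch` and `retry_on` are hypothetical parameters: pass your own download
    function and the exception class (or tuple of classes) to retry on.
    """
    results = {}
    checkpoint = start          # index of the next item that still needs fetching
    restarts = 0
    while checkpoint < len(items):
        try:
            for i in range(checkpoint, len(items)):
                results[items[i]] = fetch(items[i])
                checkpoint = i + 1      # advance only after a successful fetch
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise                   # give up; the caller sees the last error
            time.sleep(wait_seconds)    # short pause before restarting
    return results
```

Any exception that is not listed in `retry_on` (such as a ConnectionError) is not caught and stops the crawl, which mirrors the limitation described above.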
Finding similar names in the Steam appid list is not easy. Each game name has to be compared with every name in appid_list.csv, which has about 100,000 lines of data. For 2000 games from IGN and 2000 games from PC Gamer, that is on the order of 400,000,000 comparisons (4000 × 100,000). This takes about 20 hours on a single core of an Intel i9-9900K overclocked to 4.7 GHz. To reduce the running time, I use the multiprocessing package. When you run the Python program, you can enter the number of threads you want to use.
Please enter an appropriate number of threads to enable for this project.
Warning: with a single thread, this program may need more than 20 hours to run.
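As a rough sketch of how the comparison can be spread over several workers, the example below uses the standard library's multiprocessing.Pool and difflib.SequenceMatcher. The function names `best_match` and `match_all` and the use of SequenceMatcher as the similarity measure are assumptions made for illustration; the project's own matching agent may work differently. Note that multiprocessing starts worker processes rather than threads.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def best_match(name, candidates):
    """Return (name, closest candidate, similarity ratio in [0, 1])."""
    best, best_score = None, -1.0
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return name, best, best_score


def match_all(names, candidates, processes=1):
    """Match every name against all candidates, optionally in parallel."""
    worker = partial(best_match, candidates=candidates)
    if processes <= 1:
        return [worker(n) for n in names]      # serial path, no pool needed
    with Pool(processes=processes) as pool:
        # Each worker process handles a share of the names independently.
        return pool.map(worker, names)
```

The names are independent of each other, so the work divides cleanly: with N worker processes the wall-clock time should drop roughly by a factor of N, minus the overhead of sending the candidate list to each worker. On Windows and macOS, call `match_all` with more than one process only from code guarded by `if __name__ == "__main__":`.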
An environment.yml file is included in the directory.
To install imbalanced-learn, open a terminal in the Python environment and run conda install -c glemaitre imbalanced-learn
To install xgboost, open a terminal in the Python environment, run anaconda search -t conda xgboost to search for xgboost, then run conda install -c anaconda py-xgboost to install it.
nltk
The first time you import nltk, run nltk.download() in the Python console.
selenium
The Firefox web browser is used to load IGN, which is a dynamic web page. You should download geckodriver for your operating system, such as macOS or Windows 10.
The following link has a basic tutorial for setting up selenium.
https://stackoverflow.com/questions/42204897/how-to-set-up-a-selenium-python-environment-for-firefox
If the following cells fail to run, run conda install ipykernel in the terminal, then run python -m ipykernel install --user --name final_project_510 --display-name "env510"
(the value passed to --name should be the name of your virtual environment). Then click Kernel at the top of the page and change the kernel.
Orange blocks are scripts.
Because Cloudflare upgraded its anti-crawler mechanism in October, it is hard to get data from SteamDB (a website with game history data). I therefore have to use the reviews on Steam to label the games.
This project builds a model that predicts the popularity of a game from early data collected around its release. The model classifies the reviews of a game into 3 classes (Good, Fair, or Bad) with about 80% accuracy, which is a reasonable score for a 3-class classification problem. A model like this can help developers decide whether to increase investment in a game, for example through advertising or new DLCs that improve game quality.
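To make the three classes and the accuracy figure concrete, here is a minimal sketch of how a Steam review summary could be turned into a label and how accuracy is computed. The thresholds 0.8 and 0.5 and the function names are assumptions chosen for illustration only; they are not the cut-offs used in this project.

```python
def label_from_reviews(positive, total, good=0.8, fair=0.5):
    """Map the share of positive reviews to Good / Fair / Bad.

    The default thresholds are hypothetical, not the project's real cut-offs.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    share = positive / total
    if share >= good:
        return "Good"
    if share >= fair:
        return "Fair"
    return "Bad"


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For comparison, a classifier that guesses uniformly among three classes is right about one time in three, so accuracy should always be read against the class balance of the labeled data.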
You can run python TUO_SUN_proj2.py --source=local
to get the accuracy,
or run python Tuo_SUN_proj2.py --source=remote
to run the whole project. However, the remote option needs many hours to finish.
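A --source switch like the one above can be handled with the standard argparse module. The sketch below is an assumption about how such a flag might be wired up, not a copy of the script: the default value, the help text, and the placeholder return strings are all hypothetical.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Predict game ratings.")
    parser.add_argument(
        "--source",
        choices=["local", "remote"],
        default="local",    # assumed default; the real script may require the flag
        help="local: reuse saved data; remote: crawl everything again (slow)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    if args.source == "remote":
        return "crawl websites, then train"     # placeholder for the slow path
    return "load local files, then train"       # placeholder for the fast path
```

Restricting the flag with `choices` makes argparse reject a mistyped value with a usage message before any long-running crawl starts.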
The goal of the project is to get review data from IGN and PC Gamer, get game-tag data from the Steam store, and use these data to predict the popularity of a game. We will use the dataframe below to predict the review score on Steam.
IGN is a game information website with dynamic web pages. I have to simulate page turning with selenium
to crawl the HTML for 2000 games on IGN.
The same game may have different names on IGN and on PC Gamer. Even when those two agree, the name may still differ from the one in the Steam appid list. For example, 'Sid Meier's Civilization VI' is the name of a game on Steam, but on PC Gamer it is named 'Civilization 6'. I have to write an agent that recognizes similar names across these three game name lists while minimizing both Type I and Type II errors.
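One simple way to handle a case like the Civilization example is to normalize both names before comparing them. The sketch below is an illustration under stated assumptions, not the agent used in this project: the Roman numeral table, the token-overlap score, and the function names are all hypothetical choices.

```python
import re

# Small lookup table for Roman numerals that commonly appear in game titles.
ROMAN = {"ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6",
         "vii": "7", "viii": "8", "ix": "9", "x": "10"}


def normalize(name):
    """Lowercase, drop punctuation, and turn Roman numerals into digits."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower().replace("-", " "))
    return [ROMAN.get(tok, tok) for tok in cleaned.split()]


def similarity(a, b):
    """Share of the shorter name's tokens that also appear in the longer one."""
    ta, tb = set(normalize(a)), set(normalize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))
```

With this score, 'Sid Meier's Civilization VI' and 'Civilization 6' match fully, while 'Civilization 5' only matches halfway. The trade-off is visible here too: a containment score like this also gives a full match to a base game and its sequel when one title contains the other, which is a false positive (Type I error), so in practice it would need a stricter rule or a second check.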
After designing the recognition agents, it took a long time to create similar_relation.csv and similar_relation_pc.csv. I use multiprocessing to handle this problem.
I gained a better ability to design and complete a fairly large project. This project has more than 3000 lines of code (including comments). It is the biggest Python project that I have ever attempted.
To be honest, the agents I wrote to recognize similar names are far from efficient. They compare names one by one. There must be better approaches to this problem, for example clustering, or training a recognition model on a bigger dataset.
There are more game information websites. Review scores from them may increase the accuracy of the model.
This model uses the review scores and game tags of a game, but I think more kinds of features should be included, for example the price or the publisher of the game.