/Steam_statistics_ETL

This pipeline can be used to collect statistical information about all games, distributed through the Steam platform.

Primary LanguagePythonBSD Zero Clause License0BSD

Steam statistics ETL

Disclaimer:

⚠️Using some or all of the elements of this code, You assume responsibility for any consequences!

⚠️The licenses for the technologies on which the code depends are subject to change by their authors.

Description of the pipeline:

This pipeline is used to collect statistical information about all games,
distributed through the Steam platform, including:

  • 💰Price
  • 🏷️Tags
  • 🌐Publisher
  • 🛠️Developer
  • 📅Steam release date

Unfortunately, the Steam Web API does not provide such information when requested directly through its methods at the moment.
To solve this problem, this pipeline is being developed.
The pipeline also receives directly from the Steam Web API:

  • 🔤Application names
  • 🆔And their id on Steam.

🗓️Additionally, the pipeline remembers the scan date.


Required:

The application code is written in python and obviously depends on it.
Python version 3.6 [Python Software Foundation License / (with) Zero-Clause BSD license (after 3.8.6 version Python)]:

Required Packages:

Used to Luigi tasks conveyor.
Luigi [Apache License 2.0]:

Used to work with tabular data.
Pandas [BSD-3-Clause license]:

Used to create a random user agent.
fake-useragent [Apache-2.0 license]:

Used to send requests and receive responses.
Requests [Apache-2.0 license]:

Used for scraping.
BeautifulSoup4 [MIT]:

Used to bring the table cells to the desired value.
NumPy [BSD-3-Clause license]:

Used to save data in parquet format.
PyArrow [Apache-2.0 license]:

Used to monitor the progress of certain tasks from the terminal while they are running.
tqdm [MIT/ (with) Mozilla v. 2.0]:

Installing the Required Packages:

pip install luigi
pip install pandas
pip install fake_useragent
pip install requests
pip install beautifulsoup4
pip install numpy
pip install pyarrow
pip install tqdm

Launch:

If Your OS has a bash shell the ETL pipeline can be started using the bash script:

./start_steam_statistics_ETL.sh

At the beginning of this script, the values of variables are described,
by changing the values of which you can change this pipeline.
File location:
./:open_file_folder:Steam_statistics_ETL
   └── 📄start_steam_statistics_ETL.sh
The script contains an example of all the necessary arguments to run.
To launch the pipeline through this script, do not forget to make it executable.

chmod +x ./start_steam_statistics_ETL.sh

The script can also be run directly with 'python' command.
Example of one task with 'python' command:

python3 -B -m steam_statistics_luigi_ETL AllSteamProductsData.AllSteamProductsData \
\
--AllSteamProductsData.AllSteamProductsData-landing-path-part $all_steam_products_data_path \
--AllSteamProductsData.AllSteamProductsData-date-path-part $date_path_part \
--AllSteamProductsData.AllSteamProductsData-file-mask $all_steam_products_data_file_mask \
--AllSteamProductsData.AllSteamProductsData-ancestor-file-mask $all_steam_products_data_ancestor_file_mask \
--AllSteamProductsData.AllSteamProductsData-file-name $all_steam_products_data_file_name \
--AllSteamProductsData.AllSteamProductsData-logfile-path $all_steam_products_logfile_path \
--AllSteamProductsData.AllSteamProductsData-loglevel $all_steam_products_loglevel \

The example above shows the launch of one task.

Also note that the task pipeline itself is described in the 'steam_statistics_luigi_ETL.py' script.
File location:
./:open_file_folder:Steam_statistics_ETL
   └── 📄steam_statistics_luigi_ETL.py

Launch in Docker:

To run a docker container with etl, you can use a ready-made dockerfile.
To do this, run the build command with administrator rights in Windows, or with sudo privileges in Unix-like systems.
Example docker build command:

docker build -t luigi-steam -f С:\Git\Steam_statistics_ETL\Docker\luigi_ETL.df С:\Git\Steam_statistics_ETL\Docker\

Wait for the image to build.

Start building the docker image with a shell command.
Example docker run command:

docker run -d -t --name Steam_Statistics luigi-steam

Description of tasks:

AllSteamProductsData

  • Retrieves a list of applications from steam Web-API.
  • If the launch is not the first time, saves the difference with the previous launch as a result.

GetSteamProductsDataInfo

  • Requests application pages received from the last task.
  • Masquerades as a new user every request and waits for a random value of seconds between 3 and 6 before a new request.
  • Scraping data on these pages.
  • Filters out everything that is not applications and additions to them.
  • Sifts DLC and saves them to a separate file from applications.
  • Separately saves applications and add-ons that are not available to the request in this region.
  • Saves each request to a local cache, in case the pipeline crashes.
  • Reads the local cache of applications and DLC every unsuccessful instances.
  • Deletes the local cache of applications and add-ons after successful instances.

Result features:

  • Some apps require registration to scrape. Information about them cannot be collected.
    All columns in the row about this application will be empty, except for the 'id' and 'name'.
  • The fields of the 'rating_30d_percent' and 'rating_30d_count' columns can be empty if no one has left a review in the last 30 days.
  • 'not_available_in_steam_now' is set in the 'price' column value if the app is no longer available in the Steam store.
    Usually this situation is adjacent to the previous point.
  • Some apps and DLCs may be missing tags. Most often this applies to various OSTs, in which case the cell will be empty.
  • If the release of the application, or DLC has not yet occurred, then the cells of the steam_release_date column will be marked "in the pipelane".

[SteamAppInfoCSVJoiner, SteamDLCInfoCSVJoiner]

  • Collects the results of all successful instances of the past task and merges them into a new file containing statistics about applications.
  • Fills in the empty cells 'nan'.

Known Problems:

FakeUserAgentError:

When a task tries to send a request to a page, a number of fake-useragent errors appear:

FakeUserAgentError:
Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.11
...
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
...
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
...

This is due to the outdated version of the fake-useragent.
Solution:
Update the fake-useragent with the terminal, or command line.

pip install fake-useragent --upgrade

bs4.FeatureNotFound:

raise FeatureNotFound:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

It looks like you don't have the lxml library installed.
Solution:
Install the lxml library with terminal, or command line.

pip install lxml

Known Bugs:

  • Applications that do not have a price receive as a value not 0, but literally emptiness.
  • Sometimes it is not possible to scrape information about the publisher and developer of the application. The value in the cell will be empty.

Thank you for your interest in my work.