Extracts data from websites and creates datasets for ML or analysis purposes.
Setup • Configuration • Usage • Table schemes
This repository is dedicated to gathering and organizing datasets for machine-learning-based StarCraft II bots. The project has two aims: first, it provides a tool for collecting replay data that can be used in supervised training; second, it creates datasets suitable for training value functions in reinforcement learning algorithms.
Available functionality:
- Collect replays from two websites (sc2rep and spawningtool)
- Preprocess data into a human readable form
- Transform data and load it into the DB.
Limitations to consider:
- The only available game mode is 1v1.
- Made for game versions 5.0.0 to 5.0.11.
Requirements:
- Python <= 3.9 (the latest version of the sc2replay library is available for Python 3.9).
- Access to a configured PostgreSQL database.
- Packages listed in requirements.txt.
- Optionally: jupyter notebook
- Create a new database (using psql):
create database sc2replays;
\c sc2replays
- Clone the repository by running
git clone https://github.com/dvarkless/sc2_replay_converter.git
- Create a python virtual environment:
cd sc2_replay_converter
python -m venv venv
- If you are using Linux or Mac:
source ./venv/bin/activate
If you are using Windows:
./venv/Scripts/activate.ps1
- Install packages:
pip install -r requirements.txt
- Download the submodule:
git submodule update --init --recursive
Configuration files can be found in the ./configs directory.
File ./configs/secrets.yml
db_host: localhost # Database url address
db_name: sc2replays # Database name
db_user: dvarkless # Username that can interact with the DB
db_password: password # Password for this user; set to `None` if the user has no password
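As a rough illustration of how these values fit together, the sketch below turns the contents of secrets.yml (assumed to be already loaded into a dict by a YAML parser) into a PostgreSQL connection URI. The function name `build_postgres_uri` is hypothetical and not part of this repository, which may connect differently. Note that a YAML parser reads a bare `None` as the string "None", so the sketch treats both that string and a real null as "no password".

```python
from urllib.parse import quote


def build_postgres_uri(secrets: dict) -> str:
    """Build a libpq-style connection URI from the secrets.yml fields.

    Hypothetical helper for illustration only.
    """
    user = quote(str(secrets["db_user"]), safe="")
    host = secrets["db_host"]
    name = quote(str(secrets["db_name"]), safe="")
    password = secrets.get("db_password")
    # Treat both a real null and the literal string "None" as "no password".
    if password is None or str(password) == "None":
        credentials = user
    else:
        credentials = user + ":" + quote(str(password), safe="")
    return f"postgresql://{credentials}@{host}/{name}"


example = {
    "db_host": "localhost",
    "db_name": "sc2replays",
    "db_user": "dvarkless",
    "db_password": "password",
}
print(build_postgres_uri(example))
```

Percent-encoding the user, password, and database name keeps characters such as `@` or `/` from breaking the URI.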
File ./configs/downloader_config.yml
The only setting you are likely to need to change here is the user agent:
headers:
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
  # Chrome on a Windows device
If you want to add another site, add it to the config and write another method in the ReplayDownloader class (def name_yield: ...).
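The signature that ReplayDownloader expects from such a method is not shown here, so the following is only a guess at the general shape: a generator that walks listing pages and yields replay links. Every name in it (`example_site_yield`, `fetch_page`, `REPLAY_LINK`) is hypothetical, and the page-fetching step is passed in as a callable so the sketch runs without network access.

```python
import re
from typing import Callable, Iterator

# Matches href values that end in ".SC2Replay" (the StarCraft II replay file extension).
REPLAY_LINK = re.compile(r'href="([^"]+\.SC2Replay)"', re.IGNORECASE)


def example_site_yield(fetch_page: Callable[[int], str], max_pages: int) -> Iterator[str]:
    """Yield replay links page by page (hypothetical; not the real ReplayDownloader API).

    `fetch_page(n)` is assumed to return the HTML of listing page `n`.
    """
    for page in range(1, max_pages + 1):
        links = REPLAY_LINK.findall(fetch_page(page))
        if not links:
            return  # stop at the first page without replay links
        yield from links


def fake_fetch(page: int) -> str:
    """Stand-in for an HTTP request: two pages with one replay each, then nothing."""
    if page > 2:
        return "<html></html>"
    return f'<a href="/files/game{page}.SC2Replay">replay</a>'


print(list(example_site_yield(fake_fetch, max_pages=5)))
```

Writing the method as a generator lets the caller stop as soon as it has collected enough replays, without fetching the remaining pages.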
Example code is provided in the download_and_process.ipynb notebook.
- Collect replays:
from replay_downloader import ReplayDownloader
REPLAY_DIR = "../replays"
DOWNLOADER_CONFIG = "./configs/downloader_config.yml"
downloader = ReplayDownloader(REPLAY_DIR, DOWNLOADER_CONFIG, max_count=500, jupyter=True)
downloader.start_download("sc2rep")
# downloader.start_download("spawningtool")
- Preprocess files
from replay_process import ReplayProcess, ReplayFilter
from datetime import datetime
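REPLAY_DIR = "../replays"
SECRETS = "./configs/secrets.yml"
# Assumption: the original example does not define DATABASE_CONFIG.
# The path below is a placeholder; point it at your database config file.
DATABASE_CONFIG = "./configs/database.yml"
GAME_INFO_FILE = "./starcraft2_replay_parse/game_info.csv"
processor = ReplayProcess(
    SECRETS,
    DATABASE_CONFIG,
    GAME_INFO_FILE,
    jupyter=True
)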
# Setup filter
replay_filter = ReplayFilter()
replay_filter.is_1v1 = True # Select only 1v1 games
replay_filter.game_len = [1920, 38400] # Games with length from 2 to 40 mins
replay_filter.time_played = datetime(2021, 1, 1) # Earliest allowed game
# Process replays (this may take a while)
processor.process_replays(REPLAY_DIR, filt=replay_filter)
- Create dataset tables
from itertools import product
from pipeline import PipelineComposer
MINS_PER_SAMPLE = 4 # Take a first sample every 4 minutes on average
PRED_STEP = 1 # Take each second sample 1 minute after the first
MIN_LEAGUE = 3 # Min league is Gold
r_pairs = product("ZTP", repeat=2) # ((Z, Z), (Z, T), ...)
matchups = ["v".join((r1, r2)) for r1, r2 in r_pairs] # ['ZvZ', 'ZvT', ...]
composer = PipelineComposer("ZvZ", tick_step=32)
# Create pipelines for each table type
for matchup in matchups:
    composer.change_matchup(matchup)
    comp_pipeline = composer.get_compositon(MINS_PER_SAMPLE, PRED_STEP, MIN_LEAGUE)
    comp_pipeline.run()
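The tick values used in these examples follow from the comment on `game_len`: 1920 ticks for 2 minutes implies 16 ticks per second. A small helper that assumes this rate makes the numbers easier to derive. The constant and function names below are illustrative and not part of the repository.

```python
TICKS_PER_SECOND = 16  # assumption derived from "1920 ticks = 2 minutes" in the filter example


def minutes_to_ticks(minutes: float) -> int:
    """Convert game minutes to ticks at the assumed tick rate."""
    return int(minutes * 60 * TICKS_PER_SECOND)


def ticks_to_seconds(ticks: int) -> float:
    """Convert ticks back to seconds at the assumed tick rate."""
    return ticks / TICKS_PER_SECOND


game_len = [minutes_to_ticks(2), minutes_to_ticks(40)]
print(game_len)              # [1920, 38400]
print(ticks_to_seconds(32))  # 2.0
```

Under the same assumption, `tick_step=32` corresponds to one step every 2 seconds of game time.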
Table schemes can be found in ./queries/create_*.sql
Dataset tables are created dynamically.
PRIMARY KEYS: tick, game_id.
FOREIGN KEY: game_id REFERENCES game_info.
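Because the dataset tables are generated rather than defined in static SQL files, a statement of roughly the following shape has to be assembled at runtime. The sketch below is an assumption about what such generation could look like. The function name, table name, and column list are hypothetical, and only the key constraints are taken from the description above.

```python
from typing import Sequence


def build_create_table(table: str, int_columns: Sequence[str], out_columns: Sequence[str]) -> str:
    """Assemble a CREATE TABLE statement with the documented key constraints.

    Hypothetical helper: identifiers are assumed to be trusted, valid SQL names.
    """
    lines = ["game_id INTEGER REFERENCES game_info", "tick INTEGER"]
    lines += [f"{name} INTEGER" for name in int_columns]
    lines += [f"{name} NUMERIC(4, 3)" for name in out_columns]
    lines.append("PRIMARY KEY (tick, game_id)")
    body = ",\n    ".join(lines)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n    {body}\n);"


print(build_create_table(
    "example_comp_zvt",  # hypothetical table name
    ["player_minerals_available", "player_vespene_available"],
    ["out_unit"],
))
```

The column names are interpolated directly, so this approach is only safe when they come from a trusted source such as the game info file, never from user input.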
The dataset tables have the following structure:
[NOTE] These tables are used to train the agent to choose which unit to build next based on army composition and scouting info.
player_unit: INTEGER,
...
player_building: INTEGER,
...
player_minerals_available: INTEGER,
player_vespene_available: INTEGER,
enemy_unit: INTEGER,
...
out_unit: NUMERIC(4, 3) # 0.001 # player's units in 1 minute from current tick
...
[NOTE] These tables are used to train agents to predict the game outcome based on the available information.
game_id: INTEGER,
tick: INTEGER,
player_unit: INTEGER,
...
player_building: INTEGER,
...
player_upgrade: INTEGER,
...
player_minerals_available: INTEGER,
player_vespene_available: INTEGER,
enemy_unit: INTEGER,
...
enemy_building: INTEGER,
...
out_winprob: NUMERIC(4, 3) # 0.001 # probability that this game ends in 1 minute
# with 1 = player's win
# or 0 = player's defeat
[NOTE] These tables are used to train agents to predict the enemy composition based on scouted buildings.
game_id: INTEGER,
tick: INTEGER,
enemy_building: INTEGER,
...
out_unit: NUMERIC(4, 3) # 0.001 # enemy units in 1 minute from now
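In all three layouts the `out_` prefix appears to mark the values to be predicted, while the remaining non-key columns serve as inputs. A minimal sketch of splitting a fetched row along that convention follows. The function name and the example column names are hypothetical, and the row is assumed to arrive as a dict keyed by column name.

```python
from typing import Dict, Tuple

KEY_COLUMNS = ("game_id", "tick")  # identifiers, not model inputs


def split_row(row: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split one table row into (features, targets) using the `out_` prefix."""
    features = {
        key: value
        for key, value in row.items()
        if key not in KEY_COLUMNS and not key.startswith("out_")
    }
    targets = {key: value for key, value in row.items() if key.startswith("out_")}
    return features, targets


row = {
    "game_id": 7,
    "tick": 1920,
    "enemy_barracks": 2,  # hypothetical building column
    "out_marine": 0.25,   # hypothetical unit column
}
features, targets = split_row(row)
print(features)  # {'enemy_barracks': 2}
print(targets)   # {'out_marine': 0.25}
```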
The first letter of a matchup is the player's race and the last letter is the enemy's race.
For example, 'ZvT' means player = 'Zerg', enemy = 'Terran'.
This affects the table's unit, building and upgrade columns. The columns can be found in ./starcraft2_replay_parse/data/game_info.csv.
[NOTE] Mirror matchups are counted twice, with the player and the enemy swapping places.
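The naming rule can be expressed as a small helper. This is an illustrative sketch rather than code from the repository; the `RACES` mapping and the `parse_matchup` and `is_mirror` names are assumptions.

```python
from typing import Tuple

RACES = {"Z": "Zerg", "T": "Terran", "P": "Protoss"}


def parse_matchup(matchup: str) -> Tuple[str, str]:
    """Return (player_race, enemy_race) for a matchup string such as 'ZvT'."""
    player, sep, enemy = matchup.partition("v")
    if sep != "v" or player not in RACES or enemy not in RACES:
        raise ValueError(f"unexpected matchup: {matchup!r}")
    return RACES[player], RACES[enemy]


def is_mirror(matchup: str) -> bool:
    """A mirror matchup has the same race on both sides."""
    player, enemy = parse_matchup(matchup)
    return player == enemy


print(parse_matchup("ZvT"))  # ('Zerg', 'Terran')
print(is_mirror("ZvZ"))      # True
```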
Distributed under the MIT License. See LICENSE.txt for more information.