
parsing data from discogs

Discogs Data Extractor


Two main features:

  1. Loading Discogs monthly dump data into a Postgres db.
  2. Extracting additional data (sellers, sale history, etc.) not available via the API or the XML dump.


  • Low-memory XML loading (under 500 MB of memory for a 13 GB xml.gz file)
  • Proxy rotation to manage request limits.
  • Multi-threaded requests for efficient data extraction.
  • PostgreSQL for data storage.

Getting Started


  • Python 3.9+
  • PostgreSQL


  1. Clone the repository and navigate into it:
    git clone git@github.com:rezaisrad/discogs.git
    cd discogs
  2. (Optional) Set up a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:
    pip install -r requirements.txt
  4. Set up environment variables by copying .env.example to .env and adjusting values:
    cp .env.example .env
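A minimal sketch of how a script might read numeric settings from the environment once .env has been loaded. It assumes that constants used later in this README, such as BATCH_SIZE and MAX_WORKERS, are supplied as environment variables; the helper names `read_int_setting` and `load_settings` and the default values are hypothetical and not taken from .env.example.

```python
import os


def read_int_setting(name, default, env=os.environ):
    """Return a positive integer setting, falling back to `default` when unset."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def load_settings(env=os.environ):
    """Collect the numeric settings in one place (defaults are assumptions)."""
    return {
        "BATCH_SIZE": read_int_setting("BATCH_SIZE", 100, env),
        "MAX_WORKERS": read_int_setting("MAX_WORKERS", 8, env),
    }
```

Failing early on a malformed value keeps a typo in .env from surfacing later as a confusing error in the middle of a long extraction run.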

Database Setup

Run SQL scripts located in db/ to set up and initialize your database.


1. XML Data Loading

Load data from Discogs monthly dumps using load.py. This script downloads an XML.gz file, parses relevant fields, and loads data into PostgreSQL. Currently, it supports loading data for releases and artists, storing each record with a primary key and a JSONB column named data.

Example usage for loading artist data:

handler = XMLDataHandler(DATA_URL, 
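The low-memory loading described above can be approximated with a streaming parse. The sketch below is not the XMLDataHandler implementation; it only illustrates the general technique using the standard library, assuming a simplified dump layout in which each record is a direct child of the root element and carries its id either as an attribute or as an `<id>` child. The function names `iter_records` and `to_row` are hypothetical.

```python
import gzip
import json
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield (id, data) for each top-level <tag> element, one at a time.

    `source` is a path or a binary file object holding gzip-compressed XML.
    """
    with gzip.open(source, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root
        depth = 0  # depth relative to the root element
        for event, elem in context:
            if event == "start":
                depth += 1
                continue
            depth -= 1
            # depth == 0 means a direct child of the root has just closed
            if depth == 0 and elem.tag == tag:
                data = {child.tag: child.text for child in elem}
                record_id = elem.get("id") or elem.findtext("id")
                yield int(record_id), data
                root.clear()  # discard parsed records so memory stays flat


def to_row(record_id, data):
    """Build an (id, json_text) pair suitable for an (id, data JSONB) table."""
    return record_id, json.dumps(data)
```

Because records are yielded one at a time and the root is cleared after each, memory use depends on the size of a single record rather than on the size of the dump. The rows produced by `to_row` could then be handed to a database driver in batches.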

2. Extracting Additional Information

  1. Use main.py to fetch additional information from Discogs based on a set of release IDs. Example query from QUERY_PATH:
    SELECT e.id
    FROM releases e
    JOIN release_formats f ON f.release_id = e.id
    WHERE format_name = 'Vinyl'
    AND release_date BETWEEN '2000-01-01' AND '2002-01-01'
  2. Iterate through the set of id values using the scraper object:
    scraper = Scraper(URL, max_workers=MAX_WORKERS)
  3. Insert into your Postgres table in batches of BATCH_SIZE:
    for i in range(0, len(release_ids), BATCH_SIZE):
        batch_ids = release_ids[i : i + BATCH_SIZE]
        try:
            releases = scraper.run(batch_ids)
            write_to_postgres(p, releases)
        except Exception as e:
            logging.error(f"Error processing batch {i//BATCH_SIZE}: {e}")

Session and Proxy Management

The SessionManager and ProxyManager classes keep extraction efficient and reliable:

  • SessionManager maintains a session for each thread, utilizing proxies from ProxyManager.
  • ProxyManager handles proxy rotation, selecting a new proxy if the current one fails.


proxy_manager = ProxyManager(PROXIES_URL)
session_manager = SessionManager(proxy_manager)
scraper = Scraper(proxy_list_url=PROXIES_URL, max_workers=MAX_WORKERS)

Each thread created by the Scraper uses a unique session and proxy, managed by SessionManager. I have had success setting MAX_WORKERS=32.
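The per-thread session and proxy rotation idea can be sketched with the standard library alone. The classes below are hypothetical stand-ins, not the actual SessionManager and ProxyManager: a "session" here is a plain dict holding a proxy, and `fetch` is any callable supplied by the caller. The retry-once-on-ConnectionError policy is also an assumption made for illustration.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RotatingProxies:
    """Hypothetical stand-in for ProxyManager: hands out proxies round-robin."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()  # itertools.cycle is shared across threads

    def next_proxy(self):
        with self._lock:
            return next(self._cycle)


class PerThreadSessions:
    """Hypothetical stand-in for SessionManager: one session per thread."""

    def __init__(self, proxies):
        self._proxies = proxies
        self._local = threading.local()

    def get(self):
        # Create the session lazily the first time a thread asks for it.
        if not hasattr(self._local, "session"):
            self._local.session = {"proxy": self._proxies.next_proxy()}
        return self._local.session

    def rotate(self):
        """Switch the calling thread's session to the next proxy."""
        session = self.get()
        session["proxy"] = self._proxies.next_proxy()
        return session


def fetch_all(release_ids, sessions, fetch, max_workers=4):
    """Run fetch(release_id, session) across a thread pool, preserving order."""

    def worker(release_id):
        session = sessions.get()
        try:
            return fetch(release_id, session)
        except ConnectionError:
            # Assumed policy: rotate to a new proxy and retry once.
            return fetch(release_id, sessions.rotate())

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, release_ids))
```

Keeping the session in `threading.local` means worker threads never share connection state, while the lock around the proxy cycle keeps rotation consistent when several threads fail at the same time.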


Run unit tests using pytest:

pytest tests/