Two main features:
- Loading Discogs monthly dump data into a PostgreSQL database.
- Extracting additional data (sellers, sale history, etc.) that is not available via the API or the XML dump.

Highlights:
- Low-memory XML loading (under 500 MB of memory for a 13 GB xml.gz file; see the sketch below).
- Proxy rotation to manage request limits.
- Multi-threaded requests for efficient data extraction.
- PostgreSQL for data storage.
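That memory figure suggests the XML is streamed rather than parsed into a full tree. A minimal sketch of the general technique using `lxml`'s `iterparse` (illustrative only; the element name and file handling are assumptions, not the repository's actual code):

```python
import gzip
from lxml import etree

def stream_artists(path):
    """Yield one <artist> element at a time from a gzipped Discogs dump."""
    with gzip.open(path, "rb") as f:
        # Elements are emitted as their closing tags arrive, so the
        # multi-gigabyte document is never fully resident in memory.
        for _, elem in etree.iterparse(f, tag="artist"):
            yield elem
            # Release the element and any already-processed siblings.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
```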
Requirements:
- Python 3.9+
- PostgreSQL
- Clone the repository and navigate into it:

  ```
  git clone git@github.com:rezaisrad/discogs.git
  cd discogs
  ```

- (Optional) Set up a virtual environment:

  ```
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Set up environment variables by copying `.env.example` to `.env` and adjusting values:

  ```
  cp .env.example .env
  ```
Run the SQL scripts located in `db/` to set up and initialize your database.
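For example, with `psql` (the database name and script name here are placeholders; run whichever scripts `db/` actually contains):

```
psql -d discogs -f db/create_tables.sql
```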
Load data from Discogs monthly dumps using `load.py`. This script downloads an XML.gz file, parses the relevant fields, and loads the data into PostgreSQL. It currently supports releases and artists, storing each record with a primary key and a JSONB column named `data`.
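Given that description, the target tables are minimal. An illustrative sketch of the DDL (the real scripts live in `db/`; the table name is an assumption):

```sql
CREATE TABLE IF NOT EXISTS artists (
    id   BIGINT PRIMARY KEY,  -- Discogs ID
    data JSONB NOT NULL       -- full parsed record
);
```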
Example usage for loading artist data:
```python
handler = XMLDataHandler(DATA_URL, DESTINATION_DIR, data_store, ArtistParser())
```
- Use `main.py` to fetch additional information from Discogs based on a set of release IDs. Example query from `QUERY_PATH` (see the sketch after this list for one way to load these IDs):

  ```sql
  SELECT id
  FROM releases e
  JOIN release_formats f ON f.release_id = e.id
  WHERE format_name = 'Vinyl'
    AND release_date BETWEEN '2000-01-01' AND '2002-01-01'
  ```
- Iterate through the set of `id` values using the `scraper` object:

  ```python
  scraper = Scraper(URL, max_workers=MAX_WORKERS)
  ```
- Insert into your Postgres table in chunks of the `BATCH_SIZE` constant (a hypothetical sketch of `write_to_postgres` follows this list):

  ```python
  for i in range(0, len(release_ids), BATCH_SIZE):
      batch_ids = release_ids[i : i + BATCH_SIZE]
      try:
          releases = scraper.run(batch_ids)
          write_to_postgres(p, releases)
      except Exception as e:
          logging.error(f"Error processing batch {i // BATCH_SIZE}: {e}")
  ```
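Neither `release_ids` nor `write_to_postgres` is shown above; a minimal sketch of both, assuming `psycopg2` and a hypothetical `release_details` table (the repository's actual implementation, and the `p` handle it passes, may differ):

```python
import psycopg2
from psycopg2.extras import Json, execute_values

conn = psycopg2.connect("dbname=discogs")  # illustrative connection string

# Run the query stored at QUERY_PATH to collect the release IDs.
with conn, conn.cursor() as cur:
    with open(QUERY_PATH) as f:
        cur.execute(f.read())
    release_ids = [row[0] for row in cur.fetchall()]

def write_to_postgres(conn, releases):
    """Hypothetical batched upsert of scraped records into a JSONB column."""
    rows = [(r["id"], Json(r)) for r in releases]
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO release_details (id, data)
            VALUES %s
            ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data
            """,
            rows,
        )
```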
The `SessionManager` and `ProxyManager` classes keep extraction efficient and reliable:
- `SessionManager` maintains a session for each thread, using proxies from `ProxyManager`.
- `ProxyManager` handles proxy rotation, selecting a new proxy if the current one fails.

Example:

```python
proxy_manager = ProxyManager(PROXIES_URL)
session_manager = SessionManager(proxy_manager)
scraper = Scraper(proxy_list_url=PROXIES_URL, max_workers=MAX_WORKERS)
```
Each thread created by the `Scraper` uses a unique session and proxy, managed by `SessionManager`. I have had success setting `MAX_WORKERS=32`.
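The per-thread session idea can be sketched as follows (an illustration of the pattern using `requests` and `threading.local`, not the repository's actual code; `get_proxy` is a hypothetical `ProxyManager` method):

```python
import threading
import requests

class ThreadLocalSessions:
    """Illustrative stand-in for SessionManager: one Session per thread."""

    def __init__(self, proxy_manager):
        self._local = threading.local()
        self._proxy_manager = proxy_manager

    def get_session(self):
        # Lazily create a session the first time each thread asks for one.
        if not hasattr(self._local, "session"):
            session = requests.Session()
            proxy = self._proxy_manager.get_proxy()  # hypothetical API
            session.proxies = {"http": proxy, "https": proxy}
            self._local.session = session
        return self._local.session
```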
Run the unit tests with pytest:

```
pytest tests/
```