pile_allpoetry
A tool for scraping poems from allpoetry.com
Usage
To scrape the first 100000 poems on the site (the default range, poem IDs 1 to 100000):
python scrape_poems.py
To scrape all poems up to the latest one on the site:
python scrape_poems.py -a
To scrape from poem 500000 to poem 1000000:
python scrape_poems.py --start_id 500000 --latest_id 1000000
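The range options combine with the --chunk_size option (listed under the usage options) to decide how much work each multiprocessing batch receives. The following is only a minimal sketch of how such a split might look, not the code in scrape_poems.py: the helper name chunk_ids is hypothetical, and treating both start_id and latest_id as inclusive is an assumption.

```python
from typing import Iterator, List


def chunk_ids(start_id: int, latest_id: int, chunk_size: int = 500) -> Iterator[List[int]]:
    """Yield consecutive lists of poem IDs covering start_id..latest_id.

    Hypothetical helper: both bounds are assumed to be inclusive.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for lo in range(start_id, latest_id + 1, chunk_size):
        # The final chunk may be shorter than chunk_size.
        hi = min(lo + chunk_size - 1, latest_id)
        yield list(range(lo, hi + 1))
```

Under these assumptions, the default run (IDs 1 to 100000 with a chunk size of 500) would produce 200 chunks, each of which could be handed to a worker pool.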
All usage options:
    usage: scrape_poems.py [-h] [--latest_id LATEST_ID] [--start_id START_ID]
                           [--chunk_size CHUNK_SIZE] [-a] [-v] [-c]

    CLI for allpoetry dataset - A tool for scraping poems from allpoetry.com

    optional arguments:
      -h, --help            show this help message and exit
      --latest_id LATEST_ID
                            scrape from start_id to latest_id poems (default:
                            100000)
      --start_id START_ID   scrape from start_id to latest_id poems (default: 1)
      --chunk_size CHUNK_SIZE
                            size of multiprocessing chunks (default: 500)
      -a, --all             if this flag is set *all poems* up until the latest
                            poem will be scraped
      -v, --verbose         if this flag is set a poem will be printed out every
                            chunk
      -c, --checkpoint      if this flag is set the scraper will resume from the
                            poem id in out/checkpoint.txt
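For readers who want to see how these options fit together, here is a rough reconstruction of a parser with the same flags and defaults, plus one way the --checkpoint flag could pick the starting ID. It is a sketch under stated assumptions, not the actual source of scrape_poems.py: the function names build_parser and resolve_start_id are hypothetical, and out/checkpoint.txt is assumed to contain a single integer poem ID.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (reconstruction, not the original)."""
    parser = argparse.ArgumentParser(
        prog="scrape_poems.py",
        description="CLI for allpoetry dataset - A tool for scraping poems from allpoetry.com",
    )
    parser.add_argument("--latest_id", type=int, default=100000,
                        help="scrape from start_id to latest_id poems (default: 100000)")
    parser.add_argument("--start_id", type=int, default=1,
                        help="scrape from start_id to latest_id poems (default: 1)")
    parser.add_argument("--chunk_size", type=int, default=500,
                        help="size of multiprocessing chunks (default: 500)")
    parser.add_argument("-a", "--all", action="store_true",
                        help="scrape all poems up until the latest poem")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print out a poem every chunk")
    parser.add_argument("-c", "--checkpoint", action="store_true",
                        help="resume from the poem id in out/checkpoint.txt")
    return parser


def resolve_start_id(args: argparse.Namespace,
                     checkpoint_path: Path = Path("out/checkpoint.txt")) -> int:
    """Return the ID to start from.

    Assumption: the checkpoint file holds one integer. If --checkpoint is not
    set, or the file does not exist, fall back to --start_id.
    """
    if args.checkpoint and checkpoint_path.exists():
        return int(checkpoint_path.read_text().strip())
    return args.start_id
```

With this sketch, build_parser().parse_args(["-c"]) followed by resolve_start_id(args) would resume from the saved ID when a checkpoint exists and start from 1 otherwise. How the real scraper handles a missing checkpoint file is not documented here.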