A tool for scraping poems from allpoetry.com, originally implemented by EleutherAI and modified to include comments associated with each poem.
To scrape the first 100000 poems on the site:

```sh
python scrape_poems.py
```

To scrape all poems:

```sh
python scrape_poems.py -a
```

To scrape from poem 500000 to 1000000:

```sh
python scrape_poems.py --start_id 500000 --latest_id 1000000
```
All usage options:

```
usage: scrape_poems.py [-h] [--latest_id LATEST_ID] [--start_id START_ID]
                       [--chunk_size CHUNK_SIZE] [-a] [-v] [-c]

CLI for allpoetry dataset - A tool for scraping poems from allpoetry.com

optional arguments:
  -h, --help            show this help message and exit
  --latest_id LATEST_ID
                        scrape from start_id to latest_id poems (default:
                        100000)
  --start_id START_ID   scrape from start_id to latest_id poems (default: 1)
  --chunk_size CHUNK_SIZE
                        size of multiprocessing chunks (default: 500)
  -a, --all             if this flag is set *all poems* up until the latest
                        poem will be scraped
  -v, --verbose         if this flag is set a poem will be printed out every
                        chunk
  -c, --checkpoint      if this flag is set the scraper will resume from the
                        poem id in out/checkpoint.txt
```
The scraped data is written in `lm_dataformat`'s compressed `.jsonl.zst` format and can be read back with its `Reader`:

```python
from lm_dataformat import Reader

reader = Reader('input_dir_or_file')  # e.g., "out/data_0_time1679533961_default.jsonl.zst" or "out/"
for doc in reader.stream_data(get_meta=True):
    # with get_meta=True each doc carries its metadata alongside the text
    print(doc)
```
The original implementation is by EleutherAI and is released under the MIT license. The current author waives all rights to the modified scripts.