
pile_allpoetry

A tool for scraping poems from allpoetry.com, originally implemented by EleutherAI and modified to also collect the comments associated with each poem.

Scraping

To scrape the first 100000 poems on the site:

python scrape_poems.py

To scrape all poems:

python scrape_poems.py -a

To scrape from poem 500000 to poem 1000000:

python scrape_poems.py --start_id 500000 --latest_id 1000000

All usage options:

usage: scrape_poems.py [-h] [--latest_id LATEST_ID] [--start_id START_ID]
                       [--chunk_size CHUNK_SIZE] [-a] [-v] [-c]

CLI for allpoetry dataset - A tool for scraping poems from allpoetry.com

optional arguments:
  -h, --help            show this help message and exit
  --latest_id LATEST_ID
                        scrape from start_id to latest_id poems (default:
                        100000)
  --start_id START_ID   scrape from start_id to latest_id poems (default: 1)
  --chunk_size CHUNK_SIZE
                        size of multiprocessing chunks (default: 500)
  -a, --all             if this flag is set *all poems* up until the latest
                        poem will be scraped
  -v, --verbose         if this flag is set a poem will be printed out every
                        chunk
  -c, --checkpoint      if this flag is set the scraper will resume from the
                        poem id in out/checkpoint.txt
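
The flags can be combined. For example, the following (illustrative) invocation resumes from the poem id stored in out/checkpoint.txt, prints one poem per chunk, and uses smaller multiprocessing chunks:

python scrape_poems.py -c -v --chunk_size 250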

Loading scraped data

from lm_dataformat import Reader

reader = Reader('input_dir_or_file')  # e.g., "out/data_0_time1679533961_default.jsonl.zst" or "out/"

for doc in reader.stream_data(get_meta=True):
    print(doc)
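
As a minimal sketch of working with the output (assuming, as lm_dataformat typically does with get_meta=True, that each yielded item is a (text, metadata) pair; the lm_dataformat package is available on PyPI as lm-dataformat, and the 'out/' path below is illustrative), the loop collects poem texts and their metadata into separate lists:

from lm_dataformat import Reader

reader = Reader('out/')  # directory produced by the scraper

texts, metas = [], []
for text, meta in reader.stream_data(get_meta=True):
    # text is the poem body; meta is whatever the scraper stored alongside it
    texts.append(text)
    metas.append(meta)

print(f'loaded {len(texts)} documents')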

License

The original implementation is by EleutherAI and is released under the MIT license. The current author waives any rights to the modified scripts.