pgcorpus/gutenberg

Pipeline to generate the Standardized Project Gutenberg Corpus

Python, GPL-3.0

Issues

Is this repo still actively maintained?
#49 opened 2 months ago by d-kleine
0
include size of processed corpus in README
#43 opened 2 years ago by erikfredner
3
Not windows-friendly things
#37 opened 4 years ago by fontclos
2
skipped due to duplication
#48 opened 8 months ago by Felix-liu0989
0
no data stored in bookshelves_ebooks_dict.pkl and bookshelves_categories_dict.pkl after successful running
#45 opened a year ago by kaapivalli
3
Storing raw data in a compressed format
#44 opened 2 years ago by PadLex
0
"Connection refused"
#40 opened 2 years ago by danielplatt
1
Bookshelves
#38 opened 2 years ago by nofreewill42
2
Getting info about the data before download
#29 opened 6 years ago by edilsonacjr
6
pandas
#36 opened 4 years ago by iandoug
3
rsync command fails on Windows 10
#30 opened 4 years ago by andreluizgit
13
Allow for retrieving epubs files?
#31 opened 4 years ago by hneutr
1
File not found on Windows 10
#34 opened 4 years ago by luigiusai
3
get_data.py fails: ReadError
#32 opened 4 years ago by maxbry
5
indicate in README that SPGC-2018-07-18 doesn't contain full texts
#33 opened 4 years ago by bpshaver
1
Processing fails when locale.getpreferredencoding() does not return UTF-8
#27 opened 6 years ago by fontclos
0
"Copyright Renewal" text
#24 opened 6 years ago by martingerlach
2
parse_bookshelves() fails due to encoding issue
#25 opened 6 years ago by fontclos
1
Add a LICENSE
#22 opened 6 years ago by fontclos
0
Simplify requirements files
#19 opened 6 years ago by fontclos
2
python get_data: 'metadata/bookshelves' is not a directory
#21 opened 6 years ago by martingerlach
5
remove notebooks and all jupyter stuff
#20 opened 6 years ago by fontclos
1
Getting error when running `python get_data.py`
#9 opened 6 years ago by fontclos
3
Create lists of counts
#10 opened 7 years ago by fontclos
1
Metadata contains books that are not in data/
#14 opened 6 years ago by martingerlach
1
ValueError: The specified mirror directory does not exist when running 'python get_data.py'
#1 opened 6 years ago by martingerlach
5
Missing newline at the end of counts files
#2 opened 7 years ago by fontclos
1
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 12, column 65
#3 opened 7 years ago by fontclos
1
Bookshelves metadata is not automatically generated
#16 opened 6 years ago by fontclos
4
NLTK tokenizer always using english trained model
#6 opened 6 years ago by fontclos
3
Duplicates detection
#11 opened 7 years ago by fontclos
1
NLTK tokenizer missing on fresh run
#5 opened 7 years ago by fontclos
2