sgraaf/Replicate-Toronto-BookCorpus

Smashwords' new limits

Opened this issue · 10 comments

Some testing using a VPN connection to a number of points of presence (POPs) around the world shows the limit is now 50 books per IP per day in practice, making the effort to recompile the Toronto Corpus a slow, painful process.

From Smashwords:
You are unable to download this book right now
We are currently throttling downloads for users who download more than 100 per day, and you are currently above this limit. We recommend that you limit your downloads to a more reasonable amount per day.

In other words, the process is capped at fewer than 100 books/day, which is a pain. Using a VPN can step around this, but that means swapping VPN connections every 100 (or fewer) books, and most VPN providers don't have all that many POPs to choose from, so you end up burning through the available POPs and then repeating for several days at 100 (or fewer) book downloads/day.


Python 3.7.6 throws a string/bytes type error in download_books.py (lines 28 & 67); the traceback below is from line 28.

(venv) ~ $ python src/download_books.py
Traceback (most recent call last):
  File "src/download_books.py", line 76, in <module>
    main()
  File "src/download_books.py", line 28, in main
    book_download_urls = [url for url in book_download_urls if not (data_dir / f'{get_book_id(url)}.txt').exists()]
  File "src/download_books.py", line 28, in <listcomp>
    book_download_urls = [url for url in book_download_urls if not (data_dir / f'{get_book_id(url)}.txt').exists()]
  File "/Users/byron/Documents/_nlpprojects/GANs/TorontoBookCorpus/src/utils.py", line 77, in get_book_id
    return url.split('/download/')[1].split('/')[0]
TypeError: a bytes-like object is required, not 'str'

The offending line:

book_download_urls = [url for url in book_download_urls if not (data_dir / f'{get_book_id(url)}.txt').exists()]

The f'{get_book_id(url)}.txt' part is where it blows up: url is bytes, so the str-based split inside get_book_id fails.

The fix:
book_download_urls = [url for url in book_download_urls if not (data_dir / f'{get_book_id(url.decode("utf-8"))}.txt').exists()]
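To avoid scattering .decode() calls, another option is to normalise the whole list to str right after it is read (a minimal sketch, assuming the URLs come back as bytes because the source file was opened in binary mode):

book_download_urls = [
    url.decode('utf-8') if isinstance(url, bytes) else url
    for url in book_download_urls
]
book_download_urls = [url for url in book_download_urls if not (data_dir / f'{get_book_id(url)}.txt').exists()]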

There is also an error at line 113 of get_book_urls.py:

Traceback (most recent call last):
  File "src/get_book_urls.py", line 130, in <module>
    main()
  File "src/get_book_urls.py", line 113, in main
    formats = _json['formats']
KeyError: 'formats'

which is odd, given that when I inspect the JSON in question, the requested paths and data are present. Still looking into it.
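In the meantime, one defensive workaround would be to skip catalog entries that don't carry a 'formats' key rather than letting the KeyError abort the whole run (a sketch; the surrounding loop in get_book_urls.py is paraphrased here, not copied):

formats = _json.get('formats')
if formats is None:
    # this entry exposes no download formats; skip it and keep going
    continue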

I'm thinking hard about how one could (easily) "circumvent" this new hard limit of 100 downloads per day. Someone else suggested using (free) proxies, which requests does support. Free proxies, however, are very unstable, and often very slow, too. Premium proxies, on the other hand, are rather expensive. Would you have any suggestions on this front?
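For reference, the requests proxy support mentioned above looks roughly like this (a sketch only; the proxy addresses are placeholders, not working proxies):

import random
import requests

PROXIES = [
    'http://203.0.113.10:8080',   # placeholder address
    'http://198.51.100.23:3128',  # placeholder address
]

def fetch(url):
    # pick a proxy at random per request, and fall back to a direct
    # connection if the proxy refuses or times out
    proxy = random.choice(PROXIES)
    try:
        return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)
    except requests.RequestException:
        return requests.get(url, timeout=30)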

I've been testing today, and yes, the free proxies are slow, and they don't seem to make a difference, even with small batches of 50 books. The site responds with 403 errors as often as not.

I did additional testing using a VPN and could not download more than 50 books at a time from a given IP address. Interestingly, about half the time, after changing connections (from India to New York, for example), the downloads would fail entirely. It seems the IP wasn't refreshing properly or being flushed from cache; it's hard to tell without running Wireshark.
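A lighter-weight check than Wireshark is to echo the public IP before and after switching locations, e.g. via httpbin.org/ip, which simply returns the address a request arrived from (a quick sketch):

import requests

def public_ip():
    # httpbin.org/ip echoes the caller's IP, i.e. the VPN exit address
    return requests.get('https://httpbin.org/ip', timeout=10).json()['origin']

print(public_ip())  # run once before and once after switching the VPN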

The gutenberg.org project has no such limits and supports wget for mirroring & downloading books:

https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

This may be as expedient & productive, or more so. The download is slow, but there are many thousands of books in TXT format, so a wget command like:

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

would return all books in English that are in TXT format; it also returns UTF-8 and ASCII versions.

Thoughts?

One drawback to the Gutenberg books is the presence of pages of legal boilerplate at the start and end of each book. There's enough of it that it would contaminate the model, so it needs to be removed. Not hard, just annoying.


I'm running a test on gutenberg.org, using wget to mirror the English TXT books. It's a large corpus, but it will require additional processing: first removing the UTF-8 files (leaving only ASCII), and then removing the legal boilerplate from the start and end of each file, to avoid contaminating the resulting model with recurrent legal bumfodder.
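For the first step (dropping the UTF-8 duplicates), a hedged sketch that keeps only files whose contents decode cleanly as ASCII, rather than guessing the encoding from the filename:

from pathlib import Path

for path in Path('.').glob('*.txt'):
    try:
        path.read_text(encoding='ascii')
    except UnicodeDecodeError:
        path.unlink()  # file contains non-ASCII bytes; drop it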

sed in a terminal is your friend. To drop everything from the "End of Project Gutenberg" line onward:

ls | grep '.txt$' | while read f; do sed '/End of Project Gutenberg/,$d' "$f" >tmpfile; mv tmpfile "$f"; done

And to drop the first 27 lines (the header boilerplate):

ls | grep '.txt$' | while read f; do sed 1,27d "$f" >tmpfile; mv tmpfile "$f"; done

For 17,000 files it's not fast, but it's efficient.

Result: 17,000 books covering a broad range of styles and subjects, ready for processing into a large corpus.
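One caveat on the 1,27d step above: it assumes the header is exactly 27 lines, which won't hold for every file. Many (not all) Gutenberg texts carry marker lines such as "*** START OF THIS PROJECT GUTENBERG EBOOK ..." and a matching "*** END OF ...", so cutting on those markers where present is a bit more robust (a sketch, not tested against the full harvest):

from pathlib import Path

def strip_boilerplate(text):
    # keep only the span between the '*** START OF' and '*** END OF' markers,
    # falling back to the full text if the markers are missing
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith('*** START OF'):
            start = i + 1
        elif line.startswith('*** END OF'):
            end = i
            break
    return '\n'.join(lines[start:end])

for path in Path('.').glob('*.txt'):
    path.write_text(strip_boilerplate(path.read_text(errors='ignore')))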

Maybe I was too quick to judge (ruling out the viability of Project Gutenberg wholesale). I'll try my hand at this myself today and report back.

I agree, it's not the most modern writing. It lacks sloppy colloquialisms & slang, but it's still potentially an interesting, grammatically and lexically rich & diverse source.

Do you have any clue as to how many English-language books they have in their catalog? I have been downloading their books for more than 24 hours now (almost 30k books thus far), but have no idea yet how many more I have to go.