philbot9/youtube-comment-scraper-cli

Investigate "unknown error" on koPmuEyP3a0

spiralofhope opened this issue · 5 comments

Note that this issue is now blocked by #47


Prompted by "Suggestion: Comment limiting + Don't discard on fail", I've been experimenting with downloading the comments for koPmuEyP3a0 using --stream.

I am using youtube-comment-scraper 1.0.1 and node v10.19.0 in a Debian 64bit (stable) environment within a VirtualBox guest on a Windows 10 host.

All attempts have failed with "unknown error", and all attempts have resulted in a file with a different size.

Either I need advice on how to troubleshoot further, or debugging functionality would need to be added to learn more. Perhaps a counter of elapsed time and amount of data collected would help; I might be able to implement that on the user side of things (perhaps with some combination of watch and du for manual logging? I don't know.)
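As a minimal sketch of that user-side idea: a small shell function could sample the output file's size while the scraper runs, so a failed run at least leaves a growth log behind. The names here (monitor_growth, growth.log, INTERVAL) are hypothetical, not part of the scraper:

```shell
# Sample FILE's size every INTERVAL seconds (default 5) until process
# PID exits, appending "<elapsed>s <kB> kB" lines to growth.log.
INTERVAL=${INTERVAL:-5}
monitor_growth() {
  pid=$1; file=$2; start=$(date +%s)
  : > growth.log
  while kill -0 "$pid" 2>/dev/null; do
    size=$(du -k "$file" 2>/dev/null | cut -f1)
    printf '%ss %s kB\n' "$(( $(date +%s) - start ))" "${size:-0}" >> growth.log
    sleep "$INTERVAL"
  done
}

# Hypothetical usage with the scraper command from this issue:
# youtube-comment-scraper --format csv --stream -- koPmuEyP3a0 > output.csv &
# monitor_growth $! output.csv
```

Comparing growth.log across failed runs might show whether the stream stalls before dying or stops abruptly at full speed.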


Suspicions

  • Flaky internet connection somewhere along the chain
  • Throttling or limitations by YouTube
    • Note that I'm running this on a non-proxy IP which also has a browser logged into an account, so IP-based spam protection shouldn't be the issue, but usage still might.
  • Invalid data? - I doubt the stream contains something like invalid data (an unusual character, say), because each attempt fails at a different file size, and nothing looks wrong when I inspect the tail of the data.

CSV tests

youtube-comment-scraper --format csv --stream -- koPmuEyP3a0 | tee output.csv
✕ unknown error
  • Test 1 - a 13,652 kB file
  • Test 2 - a 1,244 kB file (using just a redirect instead of tee)
  • Test 3 - a 36,632 kB file

JSON tests

youtube-comment-scraper --stream -- koPmuEyP3a0 | tee output.json
✕ unknown error
  • Test 1 - a 2,440 kB file
  • Test 2 - a 57,436 kB file

Test 3 used an increased Node.js heap size:

node --max-old-space-size=10000 /usr/local/bin/youtube-comment-scraper --stream -- koPmuEyP3a0 | tee output3.json
  • Test 3 - a 68,644 kB file

I believe the root cause is A/B testing of a changed YouTube video page.

See philbot9/youtube-comments-task#26

@spiralofhope This should be fixed in 1.0.2 of youtube-comment-scraper-cli. I ran the scraper for a while and it never failed when previously it did.

Please update and let me know if it's working for you.

$ npm install -g youtube-comment-scraper-cli@1.0.2

Thanks for your efforts!

sudo npm install -g youtube-comment-scraper-cli@1.0.2
output
npm WARN npm npm does not support Node.js v10.19.0
npm WARN npm You should probably upgrade to a newer version of node as we
npm WARN npm can't make any promises that npm will work with this version.
npm WARN npm Supported releases of Node.js are the latest release of 4, 6, 7, 8, 9.
npm WARN npm You can find the latest version at https://nodejs.org/
npm WARN deprecated request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
/usr/local/bin/youtube-comment-scraper -> /usr/local/lib/node_modules/youtube-comment-scraper-cli/bin/youtube-comment-scraper
+ youtube-comment-scraper-cli@1.0.2
removed 1 package and updated 18 packages in 9.124s
youtube-comment-scraper --format csv --stream -- koPmuEyP3a0 | tee output4.csv
✕ API response does not contain a "content_html" field

The following appeared to work for some time, but ended up with the same error:

youtube-comment-scraper --outputfile output4.json -- koPmuEyP3a0
✕ API response does not contain a "content_html" field

I'll continue with some other tests, using --stream, for example:

youtube-comment-scraper --stream -- koPmuEyP3a0 | tee output4.json

I have spent some time running tests with and without --stream, and with JSON vs. CSV output; every attempt produces a file of a different size and ends with the same error:

✕ API response does not contain a "content_html" field
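To make those repeated runs easier to compare, something like the following could record each attempt's exit status and final file size in one log. This is a sketch; run_trials and trials.log are made-up names:

```shell
# run_trials N CMD...: run CMD N times, redirecting output to trialI.out
# and logging each run's exit status and resulting size to trials.log.
run_trials() {
  n=$1; shift
  : > trials.log
  i=1
  while [ "$i" -le "$n" ]; do
    out="trial$i.out"
    "$@" > "$out" 2>/dev/null
    echo "trial $i: exit=$? size=$(du -k "$out" | cut -f1) kB" >> trials.log
    i=$((i + 1))
  done
}

# Hypothetical usage, mirroring the tests above:
# run_trials 5 youtube-comment-scraper --stream -- koPmuEyP3a0
```

If the failure point were deterministic, the logged sizes would cluster; the wide spread I'm seeing suggests something timing- or server-dependent instead.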

Note that this issue is now blocked by #47

I tested another downloader to determine whether the problem was YouTube throttling.

https://github.com/egbertbouman/youtube-comment-downloader

./downloader.py --youtubeid=koPmuEyP3a0 --output=koPmuEyP3a0.json

I did eventually get an error, but the download seems to have completed successfully. (The file is huge.)

This other downloader doesn't seem to have self-throttling, and I don't think YouTube disconnected me during the process.