philbot9/youtube-comment-scraper

Suggestion: Comment limiting + Don't discard on fail.

HT-7 opened this issue · 12 comments

HT-7 commented

Because videos with more than 20,000 comments tend to overwhelm the online comment scraper, causing it to fail, I have two suggestions:

  • An option to limit the number of captured comments to _____.
  • If the comment capture fails, let the user download the comments scraped so far instead of discarding them.

Don't forget: the CSV and JSON files should include the video's total comment count.

That way the user can see the total number of comments on a video even from a comment file that does not contain every comment (due to manual limiting or a failed capture).
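
For instance, the JSON output could carry the total count alongside the comments. The field names and numbers below are only illustrative, not the scraper's actual schema:

{
  "videoId": "foo",
  "totalCommentCount": 50000,
  "scrapedCommentCount": 20000,
  "comments": ["..."]
}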

More metadata:

  • Has the comment ever been edited?
  • Comment rank (the top comment is rank 1). The same for comment replies.

Perhaps dump the comments every x comments by default, and have a parameter to customize that.

For example, generate:

comment-foo-1.csv
comment-foo-2.csv
...

This could also trivially be tied to a simple pause mechanism between dumps to throttle scraper use.
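
A rough sketch of that idea in Node follows. This is not the scraper's actual API, just an illustration that assumes some async source of comment objects; the chunk size, pause length, and file-name prefix are placeholders.

const fs = require('fs')

// Hypothetical helper (not part of the current CLI): writes comments to
// comment-foo-1.json, comment-foo-2.json, ... and pauses between dumps.
async function dumpInChunks (comments, chunkSize = 5000, pauseMs = 2000, prefix = 'comment-foo') {
  let chunk = []
  let fileIndex = 1

  const flush = () => {
    if (chunk.length === 0) return
    fs.writeFileSync(`${prefix}-${fileIndex}.json`, JSON.stringify(chunk, null, 2))
    fileIndex += 1
    chunk = []
  }

  try {
    for await (const comment of comments) {
      chunk.push(comment)
      if (chunk.length >= chunkSize) {
        flush()
        // simple pause between dumps to throttle scraper use
        await new Promise(resolve => setTimeout(resolve, pauseMs))
      }
    }
  } finally {
    // whatever was scraped before the source ended (or failed) is not discarded
    flush()
  }
}

The finally block would also cover the "don't discard on fail" suggestion above: a crash mid-scrape still leaves the chunks written so far on disk.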


Regarding more metadata, that would be best mentioned in its own ticket, but it occurs to me that it would be nigh incredible if it were made easy to scrape multiple times and essentially generate an "edit history". I suppose that's best left for another tool though.

Is there a way to limit the number of captured comments yet?

> Perhaps dump the comments every x comments by default, and have a parameter to customize that.
> This could also trivially be tied to a simple pause mechanism between dumps to throttle scraper use.

Is there any chance of a follow-up on this?

I've downloaded large comment sets without seeing this problem myself. Perhaps it's something on my end (my network connection?), or YouTube has improved/changed in some way.

I can help with testing if anyone can give an example or two (even with inconsistent failures).

I was trying to get the comments for this video https://www.youtube.com/watch?v=koPmuEyP3a0 in order to extract information for a research project.

After realizing on multiple occasions that I was not able to run the scraper from beginning to end, I tried to gather as many comments as possible using --stream, but the results are very unstructured.

The scraper crashes with an unknown error.
I suspect the number of comments is simply too large. That's why I was looking for a way to limit the number of comments scraped so that I get results I can interpret.

Thank you for your help!

I have it running on that URL and so far it hasn't crashed.

I wonder if there is a limitation in the brute-force way the youtube-comment-scraper code runs under node, like some sort of memory-use limit. I don't think so, because memory usage for that process seems fairly stable; I can see that it keeps trying, though.

I'll let the process continue, and report back later.

For reference, my command is:

node /usr/local/bin/youtube-comment-scraper --format csv --outputFile foo.csv -- koPmuEyP3a0

Maybe my choice to use csv matters; I'm not sure.

Wouldn't you know it, my hunch was right. It did end up crashing, and the output mentions memory garbage collection.

I don't know that this will help the youtube-comment-scraper author directly, but it does give me a hint as to how I might fumble around on my own. Maybe node has memory options.

Crash output:
<--- Last few GCs --->

[29965:0xf2b9f0]  3958876 ms: Mark-sweep 1272.2 (1455.5) -> 1272.3 (1426.5) MB, 427.2 / 0.0 ms  (average mu = 0.180, current mu = 0.003) last resort GC in old space requested
[29965:0xf2b9f0]  3959309 ms: Mark-sweep 1272.3 (1426.5) -> 1272.1 (1425.5) MB, 432.8 / 0.0 ms  (average mu = 0.100, current mu = 0.000) last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x376cb3eaee11 <JSObject>
    0: builtin exit frame: parse(this=0x376cb3ebc8a9 <Object map = 0x35f42ce0cac9>,0x2cb24a4025d9 <undefined>,0x10de1cb02201 <Very long string[161650]>,0x376cb3ebc8a9 <Object map = 0x35f42ce0cac9>)

    1: /* anonymous */ [0x324445377089] [/usr/local/lib/node_modules/youtube-comment-scraper-cli/node_modules/request/request.js:1152] [bytecode=0x1ac57ffeb19 offset=396](this=0x3fe0e05657e1 <Request...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0x7f1e762bd2a8 node::Abort() [/lib/x86_64-linux-gnu/libnode.so.64]
 2: 0x7f1e762bd2f1  [/lib/x86_64-linux-gnu/libnode.so.64]
 3: 0x7f1e7649def2 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.64]
 4: 0x7f1e7649e148 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.64]
 5: 0x7f1e7682cdc2  [/lib/x86_64-linux-gnu/libnode.so.64]
 6: 0x7f1e76840967 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.64]
 7: 0x7f1e7680eed9 v8::internal::Factory::AllocateRawWithImmortalMap(int, v8::internal::PretenureFlag, v8::internal::Map*, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.64]
 8: 0x7f1e768162d1 v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/lib/x86_64-linux-gnu/libnode.so.64]
 9: 0x7f1e768e99b6 v8::internal::Handle<v8::internal::String> v8::internal::JsonParser<false>::SlowScanJsonString<v8::internal::SeqTwoByteString, unsigned short>(v8::internal::Handle<v8::internal::String>, int, int) [/lib/x86_64-linux-gnu/libnode.so.64]
10: 0x7f1e768ec038 v8::internal::JsonParser<false>::ParseJsonValue() [/lib/x86_64-linux-gnu/libnode.so.64]
11: 0x7f1e768eb1c5 v8::internal::JsonParser<false>::ParseJsonObject() [/lib/x86_64-linux-gnu/libnode.so.64]
12: 0x7f1e768ecbdd v8::internal::JsonParser<false>::ParseJson() [/lib/x86_64-linux-gnu/libnode.so.64]
13: 0x7f1e76566a70  [/lib/x86_64-linux-gnu/libnode.so.64]
14: 0x19b0d62d464b 
Aborted

I experimented with some parameters, to no avail:

node --max-old-space-size=10000 /usr/local/bin/youtube-comment-scraper --format csv --outputFile output.csv -- koPmuEyP3a0      
✕ unknown error
node --experimental-modules --experimental-repl-await --experimental-vm-modules --experimental-worker  --max-old-space-size=10000 /usr/local/bin/youtube-comment-scraper --format csv --outputFile output.csv -- koPmuEyP3a0
(node:20921) ExperimentalWarning: The ESM module loader is experimental.
✕ unknown error

Thank you for your effort!

I was getting similar results while increasing the max-old-space-size.

I checked the Node.js issue tracker for "mark-sweep" and there might be some related items there; I don't know.

I do want to see this issue worked around, but at this point I'm in way over my head. Hopefully @philbot9 will understand things better.

Using the --stream flag is your best option. That way the program won't be storing everything in memory.
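
Something along these lines should do it, assuming --stream emits each comment to standard output as it is scraped, so that redirecting to a file preserves everything collected up to the point of a crash (the output file name is just an example):

node /usr/local/bin/youtube-comment-scraper --stream --format csv -- koPmuEyP3a0 > koPmuEyP3a0-stream.csv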

If there are further issues regarding the youtube-comment-scraper-cli tool, please post them over on that project's repo.

I have created "Investigate 'unknown error' on koPmuEyP3a0" for continued efforts using --stream.

@ftomasin I was able to use another download tool for your video of interest.

https://github.com/egbertbouman/youtube-comment-downloader

./downloader.py --youtubeid=koPmuEyP3a0 --output=koPmuEyP3a0.json

https://mega.nz/file/PVQVmSwD#sjIg_cPIBBZHeb6_FOOCVyOrJGvncm5B5fQql5kyfz4