icy/google-group-crawler

Scraper running forever, not generating tables

spacewaffle opened this issue · 5 comments

So I've got the cookie set up, and I've successfully used the scraper before. I've been trying to replicate what I did then, but I'm running into some issues: the scraper runs forever, with logs like the ones below.

2016-12-21 12:26:09 (1.27 MB/s) - written to stdout [60251]

:: Creating './workbarbiz//threads/t.129' with 'forum/workbarbiz'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz'...
--2016-12-21 12:26:09--  https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz
Resolving groups.google.com... 74.125.192.139, 74.125.192.101, 74.125.192.138, ...
Connecting to groups.google.com|74.125.192.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1 [following]
--2016-12-21 12:26:09--  https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1
Resolving accounts.google.com... 216.58.219.237, 2607:f8b0:4006:80b::200d
Connecting to accounts.google.com|216.58.219.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                       [ <=>                                                             ]  58.68K  --.-KB/s    in 0.04s   

2016-12-21 12:26:09 (1.39 MB/s) - written to stdout [60091]

:: Creating './workbarbiz//threads/t.130' with 'forum/workbarbiz'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz'...
--2016-12-21 12:26:09--  https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz
Resolving groups.google.com... 74.125.192.139, 74.125.192.101, 74.125.192.138, ...
Connecting to groups.google.com|74.125.192.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1 [following]
--2016-12-21 12:26:09--  https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1
Resolving accounts.google.com... 216.58.219.237, 2607:f8b0:4006:80b::200d
Connecting to accounts.google.com|216.58.219.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

I eventually killed the process because it was taking multiple hours, even though our Google Groups forum doesn't have that many posts. There were thousands of thread files generated, but nothing in msgs, nothing in mbox, and no db file after scraping. Every thread file had the same single line of text:

https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz

Any idea what's going on here? I'm also not sure if this changes things, but the cookie I'm using is pretty old.
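The symptom described above (every thread file holding only the forum index URL) can be checked without reading the logs. The following is a minimal sketch, not part of the crawler itself: the function name `find_stuck_thread_files` is hypothetical, and the `t.*` file pattern is assumed from the `threads/t.129` paths in the log output.

```python
from pathlib import Path


def find_stuck_thread_files(threads_dir, forum_url):
    """Return thread files whose only content is the forum index URL.

    Hypothetical helper: assumes thread files are named 't.<number>'
    as seen in the log output above.
    """
    stuck = []
    for path in sorted(Path(threads_dir).glob("t.*")):
        # Ignore blank lines and surrounding whitespace.
        lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
        if lines == [forum_url]:
            stuck.append(path)
    return stuck
```

If nearly every file under `./workbarbiz/threads/` is reported, the crawler was most likely being served the login page rather than the group index.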

icy commented

It seems the service requires login and your cookies don't work. Let me figure it out with a local test.

icy commented

I've tested with a private group and I don't see a similar problem. Maybe your cookies are expired. Could you please confirm that your cookies are still valid? Thx
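One quick way to look for expired cookies is to inspect the expiry timestamps in the cookie file. The sketch below assumes the cookies are stored in a Netscape-format file (the kind wget reads) and uses Python's standard `http.cookiejar`; the function name `expired_cookie_names` is hypothetical. A cookie that has not expired by timestamp can still be rejected by the server, so this only rules out the obvious case.

```python
import time
from http.cookiejar import MozillaCookieJar


def expired_cookie_names(cookie_file, now=None):
    """Return the sorted names of cookies whose expiry time has passed.

    Assumes a Netscape-format cookie file. Session cookies without an
    expiry time are never reported as expired.
    """
    jar = MozillaCookieJar()
    # Load everything, including cookies that would normally be skipped.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    now = time.time() if now is None else now
    return sorted(c.name for c in jar if c.is_expired(now))
```

An empty result means no cookie is past its expiry timestamp; it does not confirm that the session is still accepted by Google Groups.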

BTW, I have the same issue, perhaps because both of these are groups on Google Apps (Business) accounts and not simply private groups.

icy commented

@rhukster @spacewaffle It seems there is a problem with the cookie file generated by the browser extension. See also #24 (comment). I have updated README.md accordingly.

Thanks a lot.

icy commented

The problem was probably that the script couldn't detect the loop when an invalid cookie is provided. That should be fixed now. Moreover, the new version uses curl with better cookie string settings instead of a Netscape cookie file with wget. Please try it out.

Thanks a lot.
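For readers curious about the idea behind the loop detection, here is a minimal sketch. It is not the script's actual fix (the crawler is a shell script using curl); it only illustrates one way to spot the failure from the logs above, where each request ends at `accounts.google.com/ServiceLogin` instead of the group page. The names `is_login_redirect` and `LOGIN_HOST` are hypothetical.

```python
from urllib.parse import urlsplit

# Host seen in the redirect 'Location:' lines of the log output above.
LOGIN_HOST = "accounts.google.com"


def is_login_redirect(final_url):
    """Return True if the final URL after redirects is the Google login page.

    A crawler could stop with an error when this is True, instead of
    saving the login page and requesting the same index URL again.
    """
    parts = urlsplit(final_url)
    return parts.hostname == LOGIN_HOST and parts.path.startswith("/ServiceLogin")
```

Checking the final URL once per request is enough to turn the endless loop into an immediate "cookies rejected" failure.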