icy/google-group-crawler

Cookies don't seem to be working..

Closed this issue · 15 comments

I'm trying to grab the contents of a private Google group we've been using as a group inbox, and create an mbox file so we can import the messages back into an IMAP account.

I've followed the instructions, and even when I grab the cookies in multiple ways (Firefox with a cookie exporter, Chrome with the cookies.txt plugin) and then set my wget options, I always get the same response from wget:

: Creating './devs//threads/t.3' with 'forum/devs'
:: Fetching data from 'https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs'...
--2018-03-22 19:40:27--  https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs
Resolving groups.google.com (groups.google.com)... 108.177.112.113, 108.177.112.139, 108.177.112.102, ...
Connecting to groups.google.com (groups.google.com)|108.177.112.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/AccountChooser?continue=https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_%3Dforum/devs&hl=en&service=groups2&hd=mycompany.com [following]
...

It gets stuck in this loop because it isn't authenticating and keeps getting redirected to the AccountChooser page.

I can access the https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs URL in my browser, but I can't with wget, even directly on the command line (same error).

Any ideas would be appreciated!

icy commented

Hi @rhukster ,

I'm sorry for any inconvenience. Did you use the _GROUP variable to specify your company information? (e.g., export _GROUP=mycompany.com)

I will run some tests with a private group in an organization today.

Thanks

No, I used _ORG for that:

export _GROUP="devs"
export _ORG="mycompany.com"
export _WGET_OPTIONS="--load-cookies /my/path/to/cookies.txt --keep-session-cookies --verbose"
icy commented

@rhukster You're right. Please make sure _ORG's value is in lowercase. (See also #22.)
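One way to guard against a wrong-case value is to normalize it before running the crawler. A minimal sketch (the _ORG variable comes from this thread; the tr pipeline is just one portable way to lowercase):

```shell
# Normalize _ORG to lowercase before running the crawler,
# so a value like "MyCompany.com" doesn't trip the check.
export _ORG="MyCompany.com"
export _ORG="$(printf '%s' "$_ORG" | tr '[:upper:]' '[:lower:]')"
echo "$_ORG"   # mycompany.com
```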

I have some problem setting up the business plan for my org, which is required for the test. Stay tuned.

Thanks

icy commented

I can reproduce the problem now (_ORG's value is lowercase). I am taking a further look at this issue. Thanks for your patience.

icy commented

I'm pretty sure the script will not work with (new) Organization groups: they are written in a new web framework (a single-page application). This is similar to the issue reported in #14. Let me see if there is any workaround.

icy commented

@rhukster Good news for you. The add-on https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/?src=search generates some weird output. You can fix it as follows:

  1. Generate the cookie file using that add-on (cookies-txt).
  2. Open the file and remove all #HttpOnly_ strings.
  3. Remove the temporary directory (the script creates a devs directory in your working directory), and try again.

I have tested this and it's working well on my side. Hope this helps you too :)
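Steps 2-3 above can be sketched in shell (file and directory names are illustrative; sed -i here is the GNU form):

```shell
# Step 2: strip the #HttpOnly_ prefix in place, keeping the cookie lines.
sed -i -e 's/#HttpOnly_//g' cookies.txt
# Step 3: drop the partially crawled output before re-running the script.
rm -rf devs/
```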

icy commented

Changes:

  • Detect looping issue
  • Improved documentation (remove #HttpOnly_ strings from the cookie file)

Feel free to reopen the ticket if there is any looping issue. Thanks a lot.

I seem to be encountering this issue as well as of this morning, which is strange since I was able to get this to work without error on 10/2/18.

It seems like this was just an issue with my cookies.txt file: I was missing the groupsloginpref cookie for some reason, and that seemed to be the source of my issue (which was more or less identical to the first code block in this issue).

It might be worth mentioning in the README the exact cookies that are needed for private group scraping to work; according to here, these are SID, HSID, SSID, and groupsloginpref.
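A hypothetical sanity check, assuming that list of cookies and a standard Netscape-format cookies.txt (where the cookie name is the 6th tab-separated field):

```shell
# Warn about any of the reportedly required cookies missing from cookies.txt.
for c in SID HSID SSID groupsloginpref; do
  awk -v c="$c" -F '\t' '$6 == c { found = 1 } END { exit !found }' cookies.txt \
    || echo "missing cookie: $c"
done
```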

icy commented

Thanks a lot for your very useful feedback, @jpellman. I will update the README accordingly.

Ach, I finally figured out what this was. Basically, my issue was that I wasn't reading the instructions properly. I somehow misconstrued "When you have the file, please open it and remove all #HttpOnly_ strings." in the README to mean "remove all lines starting with #HttpOnly_", when it actually means "find all instances of #HttpOnly_ and replace them with an empty string". It might be worth adding a sed command under there to reinforce that you're doing string replacement, not line removal. Maybe something like:

sed -i -e 's/#HttpOnly_//g' cookies.txt

Sorry for any noise.

icy commented

Never mind, @jpellman. English is not my primary language, so I may confuse anyone ;) I've updated the README as you suggested :) Thx again.

icy commented

Cookies don't seem to be working... Google has now denied the crawler lolz