icy/google-group-crawler

HTTP Error 413

want-to-export-group opened this issue · 8 comments

I am trying to archive messages from a large private group. The script seems to run fine until the "Fetching data" step. Here is the output (the group name has been changed to "group"):

:: Downloading all topics (thread) pages...
:: Creating './group//threads/t.0' with 'categories/group'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=categories/group'...
--2019-12-20 13:16:16-- https://groups.google.com/forum/?_escaped_fragment_=categories/group
Resolving groups.google.com (groups.google.com)... 2607:f8b0:400d:c0f::8a, 172.217.197.102, 172.217.197.113, ...
Connecting to groups.google.com (groups.google.com)|2607:f8b0:400d:c0f::8a|:443... connected.
HTTP request sent, awaiting response... 413 Request Entity Too Large
2019-12-20 13:16:16 ERROR 413: Request Entity Too Large.

As you can see, there is an Error 413. What is causing this, and how can it be fixed?

The test script works fine for "google-group-crawler-public" but fails for "google-group-crawler-public2" due to HTTP error 500. Could something be going wrong with the cookies?
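One way to check whether the cookies are involved could be to replay the failing request by hand, once with the exported cookie file and once without it (I'm assuming the browser cookies were exported to a cookies.txt file; adjust the path for your setup). If the request without cookies comes back as a login redirect or a 403 instead of a 413, that would point at an oversized Cookie header; a 413 in both cases would point away from the cookie file:

# replay the failing URL with the exported cookies, roughly as the script would
wget --load-cookies cookies.txt -O /dev/null \
  "https://groups.google.com/forum/?_escaped_fragment_=categories/group"

# replay the same URL without any cookies for comparison
wget -O /dev/null \
  "https://groups.google.com/forum/?_escaped_fragment_=categories/group"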

icy commented

@want-to-export-group Were you able to resolve the issue?

icy commented

I haven't seen that issue. Maybe it's a temporary network problem; you could look at the wget command the script runs and retry it to see if that helps.
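If it is transient, something along these lines could be used to retry the fetch by hand (a rough sketch, not the script's exact command; --retry-on-http-error needs wget 1.19.4 or newer, and cookies.txt stands in for your exported cookie file):

wget --tries=5 --waitretry=10 --retry-on-http-error=500,503 \
  --load-cookies cookies.txt -O t.0 \
  "https://groups.google.com/forum/?_escaped_fragment_=categories/group"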

icy commented

The test script works fine for "google-group-crawler-public" but fails for "google-group-crawler-public2" due to HTTP error 500. Could something be going wrong with the cookies?

Yes, I can confirm this issue. Google has changed something that prevents our script from working :(

icy commented

:( It used to work. Now accessing it from the web browser also generates an error: https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public2

icy commented

By mistake, google-group-crawler-public2 was set to private mode. Now it's fine. Btw, I have rewritten the script using curl; hopefully that will help resolve a few strange issues. Stay tuned.
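As a rough idea of what the curl-based fetch looks like (a sketch only, not the exact command in the new script; cookies.txt stands in for the exported browser cookie file):

curl --silent --show-error --fail --location \
  --cookie cookies.txt \
  --output t.0 \
  "https://groups.google.com/forum/?_escaped_fragment_=categories/group"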

icy commented

The problem should be fixed in the latest version, 2.0.0 (which uses curl). Please have a look and see if it's better. Thanks.