icy/google-group-crawler

wget.sh generated but nothing follows

Closed this issue · 22 comments

Hi,

Would you mind adding some notes on how to troubleshoot the script?

I'm trying to download this list with the following parameters:

export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate"

The next commands then generate the wget.sh file and try to run it, but running the file does not seem to do anything:

./crawler.sh -sh > wget.sh
bash wget.sh

Thanks in advance for any pointers. The wget.sh file I get is copied below.

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:---no-check-certificate}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
icy commented

Would you mind adding some notes on how to troubleshoot the script?

I will. Basically, it's about adding a wget option or two to get more verbose messages.
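For now, one way to get that extra detail yourself is to pass a flag through _WGET_OPTIONS, something like this (--debug is a standard wget option, though it only prints debug output when your wget build has debug support compiled in):

# sketch: extra wget flags go through _WGET_OPTIONS and end up on every download call
export _WGET_OPTIONS="--no-check-certificate --debug"
./crawler.sh -sh > wget.sh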

The next commands then generate the wget.sh file and try to run it, but running the file does not seem to do anything:

What OS are you running? Do you have any output from the command ./crawler.sh -sh > wget.sh?

I've tried running it exactly as you did (except I don't need export _WGET_OPTIONS="--no-check-certificate" on my Arch Linux machine), and I get a very good result (as below).

I suggest you remove the temporary directory (the ggplot2 directory in the place where you run the crawler.sh command) and start again. You may also record all logs for future debugging (crawler.sh > test.log 2>&1).
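Concretely, something like this (run from the directory where the ggplot2 directory was created; here stdout and stderr are kept separate so the generated script doesn't end up mixed into the log):

rm -rf ./ggplot2/
./crawler.sh -sh > wget.sh 2> crawler.log    # keep stderr in a log for debugging
bash wget.sh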

Hope this helps

Result on my machine

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:-}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
__wget__ "./ggplot2//mbox/m.0cgvmtmwmac.kmLcl5JnAwAJ" \
  "https://groups.google.com/forum/message/raw?msg=ggplot2/0cgvmtmwmac/kmLcl5JnAwAJ"
__wget__ "./ggplot2//mbox/m.40Qd5d_OTpg.8Cw2WxXsGgAJ" \ 
  "https://groups.google.com/forum/message/raw?msg=ggplot2/40Qd5d_OTpg/8Cw2WxXsGgAJ"

## a lot more commands

I'm running Mac OS X 10.9.5, and here's the requested output:

:: Creating './ggplot2//threads/t.0' with 'forum/ggplot2'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2'...
--2015-10-07 17:03:37--  https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2
Resolving groups.google.com... 64.233.166.139, 64.233.166.101, 64.233.166.138, ...
Connecting to groups.google.com|64.233.166.139|:443... connected.
WARNING: cannot verify groups.google.com's certificate, issued by '/C=US/O=Google Inc/CN=Google Internet Authority G2':
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'STDOUT'

    [ <=>                                   ] 6,188       --.-K/s   in 0.008s  

2015-10-07 17:03:38 (714 KB/s) - written to stdout [6188]

cat: ./ggplot2//msgs/m.*: No such file or directory

Anything weird in that output?

I have tried removing the ggplot2 folder completely and starting over, to no avail.
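One quick sanity check, using paths inferred from the log above (so treat them as assumptions):

ls ./ggplot2/threads/                      # thread index files created by crawler.sh
ls ./ggplot2/msgs/ 2>/dev/null | wc -l     # how many message stubs exist (0 here, per the cat error)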

More pointers to my config:

  • GNU Wget 1.15 built on darwin13.1.0.
  • awk version 20070501

(I had to install wget through Homebrew.)

Okay, I just ran your script on Xubuntu, and it works fine.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I suppose that the ones that got scraped are the most recent ones.

Thanks again for your help!

icy commented

Okay, I just ran your script on Xubuntu, and it works fine.

Perfect. I don't have a Mac to test with; I'll ask someone to help improve the script.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I suppose that the ones that got scraped are the most recent ones.

By default, crawler.sh will get all threads and messages from your Google group archive. When you use the -rss option (as in crawler.sh -rss), it will read the group's Atom feed for the latest messages.

For example, I can fetch a 4-year archive of my group (http://l.archlinuxvn.org/archlinuxvn/). After I fetch all messages, I only need to run crawler.sh -rss once every hour to keep an exact mirror of my group.
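For instance, a crontab entry along these lines would do the hourly refresh (the path and group name below are placeholders, not something the project ships):

# hypothetical cron job: pull the newest messages from the group's Atom feed every hour
0 * * * * cd /path/to/google-group-crawler && _GROUP="archlinuxvn" ./crawler.sh -rss >> rss.log 2>&1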

Hmm, I have run the following commands, and my mbox folder has only 95 messages… Is Google limiting the number of messages that I can retrieve?

export _GROUP="ggplot2"
./crawler.sh -sh > wget.sh
bash wget.sh

Similarly, I get only one file in threads/, called t.0, with only 23 lines.

Sorry if my questions are very basic. I'm struggling to understand how this all works.

icy commented

Sorry if my questions are very basic. I'm struggling to understand how this all works.

Let me check. There may be something wrong with the script!

icy commented

I've fixed the regular expression issue in the last two commits. Please try to run wget -sh again (you don't need to remove the current temporary directory).

Thanks a lot!

The scraper has been running for some time now, and everything seems to be all right with crawler.sh. I have not yet tested wget.sh, but I guess it will go fine.

Thanks a lot!

icy commented

Ah, my bad, it's not wget -sh; what I meant was crawler.sh.

Thanks again for your patience. I would reopen this ticket, though, because there is still a problem with Mac support.

As far as I can tell, it's not your fault: it must have to do with the versions of sed / awk / bash / wget that are installed as part of Mac OS X. My best guess is that the issue is either with awk or with wget.

Also note that I am using Mac OS X 10.9.5, which is quite old by now (the current OS X is 10.11).

What versions of awk and wget are you running?
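For reference, here is one way to dump the relevant versions in one go (--version works on the GNU builds; BSD sed on OS X has no such flag and will just print a usage error):

bash --version | head -1
wget --version | head -1
awk --version 2>/dev/null | head -1
sed --version 2>&1 | head -1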

icy commented

I understand.

My versions are GNU awk 4.1.3 and GNU wget 1.16.3. It's possibly the order of options that matters. (Similar issue: icy/pacapt#59.)

@icy @briatte: I bet that it's not an awk problem.

awk '{print $NF}' works in all known awk variants, including the oawk from the Heirloom Toolchest and Brian Kernighan's own one.
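A throwaway one-liner shows the behaviour:

echo "a b c" | awk '{print $NF}'    # prints "c": $NF is always the last field of the line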

icy commented

@Gnouc I thought it was due to a wget issue. I used the -O (output) option at the end of the argument list, as below:

wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";

As I recall, that won't work on a FreeBSD system. It's similar to what you said about the grep foo -q issue in the pacapt project.
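If option order really is the culprit, the defensive fix would simply be to move -O in front of the URL, something like this (a sketch, not what crawler.sh emits today):

# same command with -O before the URL, the more conventional ordering; identical behaviour with GNU wget
wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS -O "$1" "$2";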

@icy: If you're using GNU tools, you're fine with that.

wget google.com -O /tmp/test works fine on my FreeBSD 11.

Confirmed working on FreeBSD 10.2 as well, with GNU wget from FreshPorts.

icy commented

Thanks @Gnouc and @cmpitg (happy to see you again ;))

Me too :-).

@icy: I just did a quick test on OS X 10.10.5.
The problem is that the BSD sed version on OS X doesn't interpret \n as a newline, so it breaks the _links_dump() function.
Replacing sed -e "s#['\"]#\n#g" with tr "['\"]" "[\r\n]" worked for me.

@luk4hn What's the point of "[\r\n]"? It will replace ' with \r and " with \n.

If you're worried about the newline character on OS X, then insert it literally:

sed -e "s#['\"]#\
#g"

or use bash quoting: $'\n'.
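A small self-contained sketch of that bash-quoting idea (assuming bash; the doubled backslash is there so sed itself still sees a backslash before the inserted newline, which is what POSIX requires):

nl=$'\n'
printf "foo'bar\"baz\n" | sed -e "s#['\"]#\\${nl}#g"
# prints:
# foo
# bar
# baz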

@Gnouc: Hehe, I just wanted to point out the problem.
Thank you for the bash quoting tip 👍

icy commented

@luk4hn Can you please send a pull request?