wget.sh generated but nothing follows
Closed this issue · 22 comments
Hi,
Would you mind adding some notes on how to troubleshoot the script?
I'm trying to download this list with the following parameters:
export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate"
The next commands then generate the wget.sh file and try to run it, but the file itself does not seem to do anything:
./crawler.sh -sh > wget.sh
bash wget.sh
Thanks in advance for any pointers. The wget.sh file I get is copied below.
#!/usr/bin/env bash
export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:---no-check-certificate}"
__wget_hook ()
{
:
}
__wget__ ()
{
if [[ ! -f "$1" ]]; then
wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
__wget_hook "$1" "$2";
fi
}
Would you mind adding some notes on how to troubleshoot the script?
I will. Basically, it's about adding a wget option to get more verbose output.
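A minimal sketch, assuming the extra flags are passed through _WGET_OPTIONS before regenerating wget.sh (--debug is a standard wget flag, though it may be unavailable if wget was built without debug support; the other values just repeat your settings):

export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate --debug"
./crawler.sh -sh > wget.sh
bash wget.sh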
The next commands then generate the wget.sh file and try to run it, but the file itself does not seem to do anything:
What OS are you running? Do you have any output from the command ./crawler.sh -sh > wget.sh?
I've tried to run it (exactly as you did, except I don't need export _WGET_OPTIONS="--no-check-certificate" on my ArchLinux machine), and I get a very good result (as below).
I suggest you remove the temporary directory (the ggplot2 directory in the place where you ran the crawler.sh command) and start again. You may record all logs for future debugging (crawler.sh > test.log 2>&1).
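A minimal sketch of such a clean re-run, assuming you start in the directory that contains ./ggplot2/ (test.log is just an example name):

rm -rf ./ggplot2/                        # drop the temporary directory
./crawler.sh -sh > wget.sh 2> test.log   # regenerate wget.sh; test.log keeps the crawler's messages
bash wget.sh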
Hope this helps
Result on my machine
#!/usr/bin/env bash
export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:-}"
__wget_hook ()
{
:
}
__wget__ ()
{
if [[ ! -f "$1" ]]; then
wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
__wget_hook "$1" "$2";
fi
}
__wget__ "./ggplot2//mbox/m.0cgvmtmwmac.kmLcl5JnAwAJ" \
"https://groups.google.com/forum/message/raw?msg=ggplot2/0cgvmtmwmac/kmLcl5JnAwAJ"
__wget__ "./ggplot2//mbox/m.40Qd5d_OTpg.8Cw2WxXsGgAJ" \
"https://groups.google.com/forum/message/raw?msg=ggplot2/40Qd5d_OTpg/8Cw2WxXsGgAJ"
## a lot more commands
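As a side note, __wget_hook in the generated file is a no-op placeholder. A minimal sketch of how it could be overridden to post-process each download (the empty-file check is just a hypothetical example, not something the crawler does by itself):

__wget_hook ()
{
  # "$1" is the local output file, "$2" is the source URL (see __wget__ above).
  # Hypothetical: delete empty downloads so they will be retried on the next run.
  if [[ ! -s "$1" ]]; then
    rm -f "$1"
  fi
}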
I'm running Mac OS X 10.9.5, and here's the requested output:
:: Creating './ggplot2//threads/t.0' with 'forum/ggplot2'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2'...
--2015-10-07 17:03:37-- https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2
Resolving groups.google.com... 64.233.166.139, 64.233.166.101, 64.233.166.138, ...
Connecting to groups.google.com|64.233.166.139|:443... connected.
WARNING: cannot verify groups.google.com's certificate, issued by '/C=US/O=Google Inc/CN=Google Internet Authority G2':
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'STDOUT'
[ <=> ] 6,188 --.-K/s in 0.008s
2015-10-07 17:03:38 (714 KB/s) - written to stdout [6188]
cat: ./ggplot2//msgs/m.*: No such file or directory
Anything weird in that output?
I have tried refreshing the ggplot2 folder completely, to no avail.
More details on my config:
- GNU Wget 1.15 built on darwin13.1.0
- awk version 20070501
(I had to install wget through Homebrew.)
Okay, just ran your script with Xubuntu, and it works fine.
Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I suppose that those that got scraped are the most recent ones.
Thanks again for your help!
Okay, just ran your script with Xubuntu, and it works fine.
Perfect. I don't have a Mac to test on; I'll ask someone to help improve the script.
Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I suppose that those that got scraped are the most recent ones.
By default, crawler.sh will fetch all threads and messages from your Google group archive. When you use the -rss option (as in crawler.sh -rss), it will read the group's Atom feed for the latest messages.
For example, I was able to fetch a 4-year archive of my group (http://l.archlinuxvn.org/archlinuxvn/). After fetching all messages, I only need to run crawler.sh -rss once every hour to keep an exact mirror of my group.
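A minimal sketch of how that hourly refresh could be scheduled with cron (the group name, script path, and log file below are assumptions; adjust them to your setup):

# crontab entry: run the -rss update at the start of every hour
0 * * * * _GROUP="archlinuxvn" /path/to/crawler.sh -rss >> /var/log/crawler-rss.log 2>&1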
Hmm, I have run the following commands, and my mbox folder has only 95 messages… Is Google limiting the number of messages that I can retrieve?
export _GROUP="ggplot2"
./crawler.sh -sh > wget.sh
bash wget.sh
Similarly, I get only one file in threads/, called t.0, with only 23 lines.
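One quick way to check those counts, assuming the default _D_OUTPUT of ./ggplot2/:

ls ./ggplot2/mbox/ | wc -l        # number of downloaded raw messages
wc -l ./ggplot2/threads/t.0       # number of lines in the only thread index file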
Sorry if my questions are very basic. I'm struggling to understand how this all works.
Sorry if my questions are very basic. I'm struggling to understand how this all works.
Let me check. There may be something wrong with the script!
I've fixed the regular expression issue in the last two commits. Please try to run wget -sh again (you don't need to remove the current temporary directory).
Thanks a lot!
The scraper has been running for some time now; everything seems to be alright with crawler.sh. I have not yet tested wget.sh, but I guess it will go fine.
Thanks a lot!
Ah, my bad, it's not wget -sh; what I meant was crawler.sh.
Thanks again for your patience. I would reopen this ticket because there was a problem with Mac support.
As far as I can tell, it's not your fault: it must have to do with the versions of sed / awk / bash / wget that ship with Mac OS X. My best guess is that the issue is either with awk or with wget.
Also note that I am using Mac OS X 10.9.5, which is quite old by now (the current OSX is 10.11).
What versions of awk and wget are you running?
I understand.
My versions are GNU awk 4.1.3 and GNU wget 1.16.3. It's possibly the order of options that matters. (Similar issue: icy/pacapt#59)
@Gnouc I thought it was due to a wget issue. I used the -O (output) option at the end of the argument list, as below:
wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
As I recall, that won't work on a FreeBSD system, similar to what you said about the grep foo -q issue in the pacapt project.
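If the option order turns out to be the problem, a more defensive form of the same call would keep every option before the URL (just a sketch; not what the generated script currently emits):

wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS -O "$1" "$2";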
@icy: If you use GNU tools, you're fine with that. wget google.com -O /tmp/test works fine on my FreeBSD 11.
Confirmed working on FreeBSD 10.2 as well, with GNU wget from FreshPorts.
Me too :-).
@icy: I just did a quick test on OSX 10.10.5.
The problem is that the BSD sed version on OSX doesn't interpret \n as a newline, so it breaks the _links_dump() function.
Replacing sed -e "s#['\"]#\n#g" with tr "['\"]" "[\r\n]" worked.
@luk4hn What's the point of "[\r\n]"? It will replace ' with \r and " with \n.
If you worry about the newline character on OSX, then insert it literally:
sed -e "s#['\"]#\
#g"
or use bash quoting: $'\n'.
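For reference, a small sketch of the two portable forms side by side (the data.txt file name is just a placeholder); both replace single or double quotes with a newline and behave the same under GNU and BSD sed:

# 1. Literal newline embedded in the sed script (a backslash followed by a real newline):
sed -e "s#['\"]#\
#g" data.txt

# 2. Newline produced by bash quoting, expanded into the replacement:
nl=$'\n'
sed -e "s#['\"]#\\${nl}#g" data.txt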