How to determine how many messages were pulled in?
ricks03 opened this issue · 6 comments
My number of files in mbox, and my number of messages in the google group, aren't the same. What's the best way to determine why?
Could you share the size of the difference? How many lines did you get in the output script? My sample script is below:
#!/usr/bin/env bash
export _ORG="${_ORG:-}"
export _GROUP="${_GROUP:-bbedit}"
export _D_OUTPUT="${_D_OUTPUT:-./bbedit/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
export _CURL_OPTIONS="${_CURL_OPTIONS:-}"
__curl_hook ()
{
  :
}
__curl__ ()
{
  if [[ ! -f "$1" ]]; then
    echo ":: Downloading '$1'..." 1>&2;
    curl -Ls -A "$_USER_AGENT" $_CURL_OPTIONS "$2" -o "$1";
    __curl_hook "$1" "$2";
  else
    echo ":: Skipping '$1'..." 1>&2;
  fi
}
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.6kyK1BoUizkJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/6kyK1BoUizkJ"
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.fiOWi-cJqykJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/fiOWi-cJqykJ"
# ...
If there is a mismatch between the number of __curl__ calls in the output script and the number of messages in the Google group, I'd suggest rerunning the process: delete all local (cached) files before you start.
I haven't seen that issue so far. The best approach is to run the script in verbose mode: rerun the whole process and try bash -x output-script.sh.
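To make the comparison suggested above concrete, here is a minimal sketch that counts the __curl__ calls in the output script and the files already downloaded. The function names count_curl_calls and count_files are hypothetical helpers, not part of the crawler, and the sketch assumes each download is a single line starting with `__curl__ "`, as in the sample script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of google-group-crawler.

# Count the download commands in a generated script.
# Only lines starting with `__curl__ "` are calls; the function
# definition line (`__curl__ ()`) is deliberately not counted.
count_curl_calls() {
  grep -c '^__curl__ "' "$1"
}

# Count regular files directly under a directory.
# `tr` strips the padding that some `wc` implementations add.
count_files() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Run the comparison only when both paths are given on the command line.
if [[ $# -ge 2 ]]; then
  echo "download commands: $(count_curl_calls "$1")"
  echo "downloaded files:  $(count_files "$2")"
fi
```

Saved under a name of your choice (say count.sh, also an assumption), it could be run as bash count.sh output-script.sh ./bbedit/mbox. If the two numbers differ, some downloads were probably skipped or failed.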
My curl file shows 17625 lines all told. My Google group shows 5861 messages. I have 17593 files in the folder on the server, which is about right for the number of lines in the curl file.
My best guess is that each curl line is a message, but the google group shows the number as threads.
Right, the output script contains curl commands to download messages (emails). Each thread (topic?) in your Google group may contain multiple messages. I wrote down what I know about Google Groups in the code too:
google-group-crawler/crawler.sh
Line 28 in c183ffd
Hope this helps to explain your issue.
Oh, how many files did you see in the threads/ folder? Basically there are three folders (threads, msgs, mbox). Maybe one of them matches your expected number...
Threads is 293. mbox is 17593. msgs is 5880. Aha! So that's where it is, in messages. Thx.
Great you've found that ;)
threads is an attempt to find the IDs of all threads (topics). So if you count the lines in all files in threads, you get almost the right number.
Each file in msgs contains all links to the messages within one thread. Threads are paginated, so the number may vary, and it's often greater than the number of threads you have. (5861 threads with pagination --> 5880, I guess.)
The last one, mbox, contains all individual emails/messages in the whole group, and that's often a lot.
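The three counts discussed above can be gathered in one go. This is only a sketch: summarize_folders and count_thread_lines are hypothetical names, and it assumes the threads, msgs and mbox folders sit directly under the output directory and that each line in a threads file corresponds to one thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of google-group-crawler.

# Print the number of regular files in each of the three folders
# (threads, msgs, mbox) under the given output directory.
summarize_folders() {
  local root="$1" dir
  for dir in threads msgs mbox; do
    printf '%s %s\n' "$dir" \
      "$(find "$root/$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')"
  done
}

# Total number of lines across all files in threads/, which should
# be close to the number of threads if each line is one thread.
count_thread_lines() {
  find "$1/threads" -maxdepth 1 -type f -exec cat {} + | wc -l | tr -d ' '
}

# Run only when an output directory is given, e.g. ./bbedit
if [[ $# -ge 1 ]]; then
  summarize_folders "$1"
  echo "lines in threads files: $(count_thread_lines "$1")"
fi
```

For the group in this thread, the first three lines would be expected to show the 293, 5880 and 17593 reported above.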