icy/google-group-crawler

How to determine how many messages were pulled in?

ricks03 opened this issue · 6 comments

My number of files in mbox, and my number of messages in the google group, aren't the same. What's the best way to determine why?

icy commented

> My number of files in mbox, and my number of messages in the google group, aren't the same. What's the best way to determine why?

Could you share the size of the difference? How many lines did you get in the output script? My sample script is below:

#!/usr/bin/env bash

export _ORG="${_ORG:-}"
export _GROUP="${_GROUP:-bbedit}"
export _D_OUTPUT="${_D_OUTPUT:-./bbedit/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
export _CURL_OPTIONS="${_CURL_OPTIONS:-}"

__curl_hook () 
{ 
    :
}
__curl__ () 
{ 
    if [[ ! -f "$1" ]]; then
        echo ":: Downloading '$1'..." 1>&2;
        curl -Ls -A "$_USER_AGENT" $_CURL_OPTIONS "$2" -o "$1";
        __curl_hook "$1" "$2";
    else
        echo ":: Skipping '$1'..." 1>&2;
    fi
}
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.6kyK1BoUizkJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/6kyK1BoUizkJ"
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.fiOWi-cJqykJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/fiOWi-cJqykJ"
# ...

If there is any mismatch (the number of __curl__ calls in the output script vs. the number of messages in the Google group), I'd suggest rerunning the whole process, i.e., deleting all local (cache) files before you start.

I haven't seen that issue so far. The best approach is to run the script in verbose mode: for example, rerun the whole process and try bash -x output-script.sh.
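To compare the number of __curl__ calls in the output script with the number of files actually downloaded, something like the following could be used. This is only a minimal sketch: the helper names count_curl_calls and count_files are hypothetical (they are not part of the crawler), and it assumes every download is a single line starting with `__curl__ "`, as in the sample script above.

```bash
#!/usr/bin/env bash

# Count the download commands in a generated script ($1).
# Assumption: each download is one line starting with `__curl__ "`;
# the function definition line `__curl__ ()` does not match.
count_curl_calls() {
    grep -c '^__curl__ "' "$1"
}

# Count the regular files directly inside a directory ($1).
# Hidden files (names starting with a dot) are not counted.
count_files() {
    local n=0 f
    for f in "$1"/*; do
        if [[ -f "$f" ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, count_curl_calls output-script.sh and count_files ./bbedit/mbox print the two numbers, which can then be compared directly.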


My curl file has 17625 lines all told. My Google group shows 5861 messages. I have 17593 files in the folder on the server, which is about right for the number of lines in the curl file.

My best guess is that each curl line is a message, but the google group shows the number as threads.

icy commented

Right, the output script contains curl commands to download messages (emails). Each thread (topic?) in your Google group may contain multiple messages. I wrote down what I knew about Google Groups in the code too:

# For your hack ;)

Hope this helps to explain your issue.


icy commented

> My best guess is that each curl line is a message, but the google group shows the number as threads.

Oh, how many files did you see in the threads/ folder? Basically there are three folders (threads, msgs, mbox). Maybe one of them matches the number you expect...

Threads is 293. mbox is 17593. msgs is 5880. Aha! So that's where it is, in messages. Thx.

icy commented

Great, you've found it ;)

threads is an attempt to find the IDs of all threads (topics). So if you count the lines in all files in threads, you get almost the right number.

Each file in msgs contains the links to all messages within one thread. The listing is paginated, so the number of files may vary, and it's often greater than the number of threads you have. (5861 threads with pagination --> 5880, I guess.)

The last one, mbox, contains all individual emails/messages in the whole group, and that's often a lot of files.
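The counts discussed above can be collected in one pass. The sketch below is an illustration, not part of the crawler: summarize_output_dir is a hypothetical helper, and it assumes the threads, msgs and mbox folders sit directly under the output directory (as in ./bbedit/ in the sample script).

```bash
#!/usr/bin/env bash

# Print "<folder> <file count>" for threads, msgs and mbox under the
# output directory ($1), then the total number of lines in threads/.
# Folders that do not exist are skipped.
summarize_output_dir() {
    local base="${1%/}" name dir files lines
    for name in threads msgs mbox; do
        dir="$base/$name"
        if [[ -d "$dir" ]]; then
            files=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
            printf '%s %s\n' "$name" "$files"
        fi
    done
    if [[ -d "$base/threads" ]]; then
        # Concatenate every file in threads/ and count the lines.
        lines=$(find "$base/threads" -type f -exec cat {} + | wc -l | tr -d '[:space:]')
        printf 'threads-lines %s\n' "$lines"
    fi
}
```

Running summarize_output_dir ./bbedit/ prints one line per folder, so the file count of msgs and the line total of threads can each be checked against the number the Google group reports.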