icy/google-group-crawler

Help to import e-mails at a Closed Google Group...

marceliogp opened this issue · 17 comments

Hi Friend,

I have a problem: I need to export all of a Google Group's messages to mbox, but this group requires me to log in with my Google account and password. The account name contains the special characters '@' and '.', and my password contains several more special characters.

When I try to use the "crawler.sh" script, it doesn't ask for a username and password, and it returns HTTP Error 403 Forbidden.

Do you know how I can resolve this problem? How can I pass my Google username and password (which contain special characters like !@#$%&* and others) to the "crawler.sh" script?

Congratulations on your "crawler.sh" script and your work, and thanks for your help.

Please, if you have a little time, could you take a look and send me an answer to this authentication problem with a closed Google Group?

Thanks for everything,

Marcélio G. Pereira
marceliogp@gmail.com

icy commented

@marceliogp If you are using Firefox, please install an add-on to manage your cookies. Then log in to Google Groups, use the add-on to export your cookies to a file, and specify wget options so that the script loads that file. All basic steps are described here:

https://github.com/icy/google-group-crawler#private-group

Google authentication is complex and the script can't handle it. Please try with cookies and let me know if you have any problems.
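A minimal sketch of the cookie setup, assuming the cookies were exported in the Netscape cookies.txt format that wget reads. The file path is hypothetical; `--load-cookies` and `--keep-session-cookies` are standard wget options, and `_WGET_OPTIONS` is the variable the script passes to wget.

```shell
# Hypothetical path: wherever the browser add-on saved the exported cookies.
COOKIE_FILE="$HOME/google-groups-cookies.txt"

# The script expands $_WGET_OPTIONS unquoted, so the path must not contain spaces.
export _WGET_OPTIONS="--load-cookies $COOKIE_FILE --keep-session-cookies"
```

With this exported in the shell that runs crawler.sh, every wget call made by the script should send the saved login cookies.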

Thanks,

@icy, thank you very much for your help.

I downloaded all topics from my private Google Group (about 7800 messages). Your script makes a directory structure like:

  • ../ group name /mbox
  • ../ group name /msgs
  • ../ group name /threads

I used a new add-on in Thunderbird to import the mbox files, but it didn't work well. Thunderbird showed a lot of folders (with the same names as the files inside the 'mbox' folder) without any messages.

Do I need some script or software to convert this structure to PST format (or another format) that I can import into another offline e-mail client (like Thunderbird, Outlook or other)?

Thunderbird add-on: ImportExportTools (on Mozilla's website)

Congratulations on your great work on 'google-group-crawler', and thank you very much for your help.

Best regards,

Marcélio G. Pereira
marceliogp@gmail.com

icy commented

@marceliogp Sorry for the confusion. For some reason I used the name mbox, but the files are in RFC 822 format instead.

If you can write some scripts, you will see it's quite trivial to convert the RFC 822 files to mbox files. That is exactly what I did for my groups, but unfortunately I can't find the commands in my terminal's history :(

You may take a look at http://askubuntu.com/questions/13967/importing-mail-files-of-type-message-rfc822 ; it seems Thunderbird can import those files. Please give it a try and let me know if you still need some support.

Thanks a lot!

icy commented

As I recall, all I needed to do was add a header line, as below.

Original file

Received: by 10.68.228.227 with SMTP id sl3mr645728pbc.5.1345774109533;
        Thu, 23 Aug 2012 19:08:29 -0700 (PDT)
...
Date: Fri, 24 Aug 2012 09:08:26 +0700
From: "Nguyen Vu Hung (vuhung)" <vuhung...@gmail.com>

Now insert a line built from the From and Date fields at the top of the file, and keep all other lines unchanged:

From vuhung...@gmail.com Fri, 24 Aug 2012 09:08:26 +0700
Received: by 10.68.228.227 with SMTP id sl3mr645728pbc.5.1345774109533;
        Thu, 23 Aug 2012 19:08:29 -0700 (PDT)
...

Now you have a valid file that your mbox importer can read. I should have added some notes / guidelines about this!
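The header-line step described above can be sketched as a small bash function. This is not the original conversion script (which was lost), only an illustration: the function name `rfc822_to_mbox` and the paths in the usage comment are hypothetical. It copies the Date header verbatim, as in the example; stricter mbox readers expect an asctime-style date such as `Fri Aug 24 09:08:26 2012`, so the separator line may need adjusting for some importers.

```shell
#!/usr/bin/env bash
# Print one RFC 822 message file as an mbox entry on stdout.
rfc822_to_mbox() {
  local file="$1" from date
  # First "From:" header; keep only the address inside <...> when present.
  from="$(grep -m1 '^From:' "$file" | sed -e 's/^From:[[:space:]]*//' -e 's/.*<\(.*\)>.*/\1/')"
  # First "Date:" header, copied verbatim as in the example above.
  date="$(grep -m1 '^Date:' "$file" | sed -e 's/^Date:[[:space:]]*//')"
  printf 'From %s %s\n' "$from" "$date"
  # Escape lines starting with "From " so they are not read as separators.
  sed -e 's/^From />From /' "$file"
  # Blank line between messages.
  printf '\n'
}

# Usage (hypothetical paths): concatenate every downloaded message into one file.
#   for f in mygroupname/mbox/*; do rfc822_to_mbox "$f"; done > mygroupname.mbox
```

Writing all messages into a single file gives the importer one mailbox instead of one folder per message.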

icy commented

@marceliogp Were you able to solve your problem? Thanks a lot.

Hi friend,

I wasn't able to resolve this problem. I used your script to download all messages, but now I don't know how to search for a subject inside this structure.

I couldn't find any tool that converts the structure your script creates into something e-mail clients like Thunderbird, Outlook or others can use.

If you find any tools that could help me, I'd really appreciate it.

Regards,

Marcelio G. Pereira
Systems Analyst
E-mail: marceliogp@gmail.com


icy commented

I'm sorry to hear that. I will take a look at your problem. Stay tuned.

Thanks for all...

Regards,

Marcelio G. Pereira

icy commented

Sorry for my late response. Were you able to resolve your problem? Would you mind taking a look at a similar problem in #16? Thanks.

Hey @icy, thanks for your script. I tried it and three empty folders were quickly generated (mbox, msgs, threads), except that threads contains a file called t.0, which is also empty. The command line returns:

:: Skipping './mygroupname//threads/t.0' (downloaded with 'forum/mygroupname')

Anything I have missed there?

Thanks for your help.

icy commented

hi @cryptoque,

Are you working with closed groups? Then the problem may be due to a wrong cookie. I will run more tests with my closed group to see whether anything has changed on Google's side.

@icy thanks for the quick reply! It is an open group:

Anyone from the xx organization can view content.
Anyone can apply to join.
Only members can post.
Anyone from the xx organization can view the list of members.

Initially, without the cookie file, I got a 403 Forbidden. However, after exporting the cookie file as instructed and rerunning, the 403 error message was gone and the following was returned:


__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}

icy commented

Oh I see. For an organization's group, I think you need to set the environment variable _ORG, as described here https://github.com/icy/google-group-crawler/blob/ee4cfe61ee83270fadef08c87acbe5876d77ff24/README.md#group-on-google-apps . Please try again and see if that helps. You may need to delete all directories generated by the script before trying.
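As a rough sketch of those two steps: the domain `example.com` is a placeholder for the organization's own domain, and `./mygroupname` stands for the output directory shown in the "Skipping" message earlier in this thread. The cleanup matters because the script does not download a file again once it already exists on disk, even if that file is empty.

```shell
# Placeholder: replace with the organization's real domain.
export _ORG="example.com"

# Output directory from the earlier run (name taken from the "Skipping" message).
out_dir="./mygroupname"

# Remove the earlier, empty output so nothing is skipped on the next run.
rm -rf -- "$out_dir"
```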

@icy Yes, that is what I tried initially, with _ORG set. I will try to locate the problem with more tests, hopefully.

icy commented

I'm sorry that didn't help. I will test today to see if there is any problem. In the meantime, you may want to run some wget commands manually using your cookie file: crawler.sh generates a bash script, and if you look at that file, you can see the full wget commands. Adding the --verbose option to $_WGET_OPTIONS also helps.
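One way to try this by hand, sketched under assumptions: the helper name `debug_fetch`, the scratch output path and the fallback user agent are hypothetical, and the URL to test has to be copied from the generated bash script. The wget call mirrors the one in the `__wget__` function quoted earlier in this thread.

```shell
#!/usr/bin/env bash
# Append --verbose to whatever options (e.g. the cookie options) are already set.
export _WGET_OPTIONS="${_WGET_OPTIONS:-} --verbose"

# Fetch a single URL the same way the script does, writing to a scratch file.
debug_fetch() {
  local url="$1" out="${2:-/tmp/ggc-debug.html}"
  # Fallback user agent is an assumption; the script sets its own $_USER_AGENT.
  wget --user-agent="${_USER_AGENT:-Mozilla/5.0}" $_WGET_OPTIONS "$url" -O "$out"
}

# Usage: debug_fetch "<URL copied from the generated script>"
```

The verbose output shows the HTTP status and any redirects, which helps tell a rejected cookie apart from a wrong URL.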

icy commented

This topic contains a few different issues. As the script has now switched to curl, please try it out and see if you still have the same issues with the output. To process the resulting mbox files, please look at, for example, https://github.com/icy/google-group-crawler#what-to-do-with-your-local-archive.

Thanks a lot.