
Web Scraping Telegram | [Text] [Content] [Message] [Reactions] [Replies] [Comments] [Channels] [Groups] [Chats]


Web Scraping Telegram Posts and Content

This code [code here] scrapes data from selected Telegram channels, groups, or chats using the Telethon library. It also integrates Google's gspread library to write the results to a Google Spreadsheet in real time.

In summary, you can set 'Periods' (date range), 'Keywords' (search term), and 'ID' (channel, group, or chat) to scrape the desired content. Each scraped message returns: 'Scraping ID', 'Group', 'Author ID', 'Content', 'Date', 'Message ID', 'Author', 'Views', 'Reactions', 'Shares', 'Media', 'Comments'.
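As a rough illustration of how the returned fields could map to one spreadsheet row, here is a minimal sketch. The column names follow the field list above; the `to_row` helper and the sample record are hypothetical and are not taken from the original code.

```python
# Column names follow the field list in this README.
# The helper below is a hypothetical illustration, not the project's own code.
COLUMNS = [
    "Scraping ID", "Group", "Author ID", "Content", "Date", "Message ID",
    "Author", "Views", "Reactions", "Shares", "Media", "Comments",
]


def to_row(record: dict) -> list:
    """Order one scraped record by COLUMNS, using '' for any missing field."""
    return [record.get(name, "") for name in COLUMNS]


# Made-up sample record with only a few fields filled in.
example = {
    "Scraping ID": 1,
    "Group": "@jairbolsonarobrasil",
    "Content": "hello",
    "Views": 10,
}
row = to_row(example)
```

Keeping the column order in a single list means the header row and every data row stay aligned, even when a message lacks some fields (for example, a post without media).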

To limit the impact of the code breaking mid-scrape, each scraped message is inserted into the spreadsheet one at a time, rather than collecting everything and writing the output only at the end.
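The row-by-row approach described above can be sketched as follows. This is a simplified stand-in rather than the project's actual loop: `write_incrementally` and `flaky_source` are hypothetical names, and a plain Python list plays the role of the spreadsheet so the example runs without network access.

```python
from typing import Callable, Iterable, List


def write_incrementally(rows: Iterable[List], append_row: Callable[[List], None]) -> int:
    """Send each row to the sink as soon as it is produced; return the count written."""
    written = 0
    for row in rows:
        append_row(row)  # one write per scraped message
        written += 1
    return written


def flaky_source():
    """Stand-in scraper that fails after yielding two rows."""
    yield [1, "first message"]
    yield [2, "second message"]
    raise ConnectionError("simulated break mid-scrape")


sheet = []  # stand-in for the spreadsheet
try:
    write_incrementally(flaky_source(), sheet.append)
except ConnectionError:
    pass  # the rows written before the failure are already in `sheet`
```

With gspread, the `append_row` argument could presumably be a worksheet's `append_row` method. The trade-off is one API call per message, which is slower than a single batch write but preserves everything scraped before an interruption.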

Output Example:

As an example, Jair Bolsonaro's Telegram channel was scraped between January 1st, 2019 and January 1st, 2023, which returned 5241 posts: image

It can be accessed at: [Worksheet of Scraped Telegram Bolsonaro's Posts].

Recommendation

[Data] [Academic Research] [Scientific Research] [Public Policy] [Political Science] [Data Science]

This tool is recommended for academic and scientific research, including content, sentiment, and speech analysis. It is free and open, and academic use is encouraged. Responsibility for how the data is used rests solely with those who adapt and manipulate it.


!Pip Before Coding

pip install telethon
pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client

Set Up the Code

First Time Only:

Attention: If you don't have the necessary credentials, you can create them for free on the official Telegram for Developers website: https://my.telegram.org/apps. There you can get your 'api_id' and 'api_hash'.

# setup / change only the first time you use it
username = 'username' # the username of your Telegram account
phone = '+5511999999999'  # the phone number of your Telegram account, in international format
api_id = '12345678' # your api_id from https://my.telegram.org/apps
api_hash = '12ab12ab12ab12ab12ab12ab12ab12ab' # your api_hash from https://my.telegram.org/apps
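Before the first login, it can help to sanity-check these values. The sketch below is not part of the original code: `check_credentials` and `build_client` are hypothetical helpers. The format rules are assumptions based on the placeholders above (a 32-character hexadecimal `api_hash`, a phone number with a leading `+` and 8 to 15 digits). `build_client` imports Telethon lazily, so the check itself runs even where Telethon is not installed.

```python
import re


def check_credentials(api_id, api_hash: str, phone: str) -> int:
    """Return api_id as an int, or raise ValueError if a value looks malformed."""
    api_id_int = int(api_id)  # raises ValueError for non-numeric input
    # Assumption: api_hash is 32 hexadecimal characters, like the placeholder above.
    if not re.fullmatch(r"[0-9a-fA-F]{32}", api_hash):
        raise ValueError("api_hash should be 32 hexadecimal characters")
    # Assumption: phone is in international format, '+' followed by 8 to 15 digits.
    if not re.fullmatch(r"\+\d{8,15}", phone):
        raise ValueError("phone should look like +5511999999999")
    return api_id_int


def build_client(username: str, api_id, api_hash: str, phone: str):
    """Create a Telethon client using `username` as the session name."""
    from telethon import TelegramClient  # lazy import: only needed when called

    return TelegramClient(username, check_credentials(api_id, api_hash, phone), api_hash)


# Checking the placeholder values from this README.
api_id_int = check_credentials(
    "12345678", "12ab12ab12ab12ab12ab12ab12ab12ab", "+5511999999999"
)
```

A client built this way still has to be started (for example with `client.start(phone=phone)`) before it can read messages. That is the point where the login prompts appear.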

To Scrape:

# setup / change on every run to define what to scrape
channel = '@jairbolsonarobrasil' # name of the channel or group you want to scrape (e.g. '@jairbolsonarobrasil' or 'https://t.me/jairbolsonarobrasil/'; not 'https://web.telegram.org/z/#-1273465589' or '-1273465589')
worksheet_name = 'Telegram Teste' # name of the output file; it will be created on your Google Drive home screen
d_min = 1 # start day (this date is included)
m_min = 1 # start month
y_min = 2022 # start year
d_max = 2 # end day (this date is excluded; the last day included is the day before)
m_max = 1 # end month
y_max = 2022 # end year
key_search = '' # keyword to search for; leave as '' to scrape without a keyword filter
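The date parameters describe a half-open window: the start date is included and the end date is excluded. The sketch below shows one way to express that rule together with the optional keyword. `build_window` and `in_scope` are hypothetical helpers. Treating the dates as UTC and matching the keyword case-insensitively are assumptions, and the original code may instead delegate the keyword search to Telegram.

```python
from datetime import datetime, timezone


def build_window(d_min, m_min, y_min, d_max, m_max, y_max):
    """Return (start, end) as UTC datetimes; start is inclusive, end is exclusive."""
    start = datetime(y_min, m_min, d_min, tzinfo=timezone.utc)
    end = datetime(y_max, m_max, d_max, tzinfo=timezone.utc)
    if end <= start:
        raise ValueError("the end date must be later than the start date")
    return start, end


def in_scope(message_date, text, start, end, key_search=""):
    """True if the message falls inside [start, end) and contains the keyword, if any."""
    in_window = start <= message_date < end
    has_keyword = key_search == "" or key_search.lower() in (text or "").lower()
    return in_window and has_keyword


# With the example values above, only messages dated January 1st, 2022 qualify.
start, end = build_window(1, 1, 2022, 2, 1, 2022)
late_on_first = datetime(2022, 1, 1, 23, 59, tzinfo=timezone.utc)
midnight_on_second = datetime(2022, 1, 2, 0, 0, tzinfo=timezone.utc)
first_included = in_scope(late_on_first, "Bom dia", start, end)
second_excluded = not in_scope(midnight_on_second, "Bom dia", start, end)
```

To scrape a full calendar year such as 2022, set the start to January 1st, 2022 and the end to January 1st, 2023.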

Done? You can run it!


Run the Code (10 easy steps! Just your first run!)

First Time Only:

01.) You should be asked 'Allow this notebook to access your Google credentials?' This will allow code running in this notebook to access your Google Drive and Google Cloud data. Review the code before allowing access, then click 'Allow':

image

02.) Choose an account to proceed to Colaboratory Runtimes. To continue, Google will share your name, email address, preferred language, and profile picture with the Colaboratory Runtimes app. Please review the Colaboratory Runtimes app's Privacy Policy and Terms of Service before using it:

image

03.) Next, you will be prompted to configure Telegram. Enter your phone number:

image

04.) You will receive a code:

image

05.) Come back and enter the code you received:

image

06.) Enter the password for your Telegram account:

image

07.) You will be notified that the Login was successful:

image

image

On subsequent runs, execution will start from here!
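Later runs can skip the login because Telethon stores the authorized session in a file named after the session name, with a `.session` extension. The sketch below checks for that file. It assumes the `username` value is used as the session name and that the file lives in the working directory. `has_saved_session` is a hypothetical helper, not part of the original code.

```python
import tempfile
from pathlib import Path


def has_saved_session(session_name: str, directory: str = ".") -> bool:
    """True if a Telethon session file for this name already exists."""
    return (Path(directory) / f"{session_name}.session").is_file()


# Demonstration in a temporary directory, with an empty stand-in session file.
with tempfile.TemporaryDirectory() as tmp:
    before = has_saved_session("username", tmp)
    (Path(tmp) / "username.session").touch()
    after = has_saved_session("username", tmp)
```

If the session file is deleted, or the runtime's storage is reset, the login steps above will be requested again.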

08.) Scraping will start using the parameters you entered earlier. Note that progress is also updated in the panel:

image

09.) Your file will be automatically generated on the home screen of your logged-in Google Drive:

image

10.) At the end, you will receive a message stating how many messages were scraped, based on the loop performed:

image

It's done!

The output can be found in this format:

image




Author Info:

Ergon Cugler de Moraes Silva, from Brazil. Contact: contato@ergoncugler.com / Master's Program in Public Administration and Government, Getulio Vargas Foundation (FGV) / Funded Researcher by the National Council for Scientific and Technological Development (CNPq) / Center of Bureaucratic Studies (Núcleo de Estudos da Burocracia, NEB).

How to Cite it:

SILVA, Ergon Cugler de Moraes. Web Scraping Telegram Posts and Content. (feb) 2023. Available at: https://github.com/ergoncugler/web-scraping-telegram/.