/telegram-tracker-light

A version of telegram-tracker with less storage consumption and less execution time

Primary LanguagePython

telegram-tracker-light: a Python-based open-source tool for Telegram(fork de Telegram-tracker)


GitHub forks GitHub stars License Open Source Made-with-python Twitter estebanpdl


:::

Overview

About this version of Telegram-tracker

telegram-tracker is an excellent implementation for extracting information from Telegram channels. However, it does not retain context of what has been downloaded, which makes it challenging to automatically update channels with the latest messages. The lack of context can lead to message repetitions if a channel is downloaded more than once.

This version of telegram-tracker aims to download messages from Telegram channels and keep them updated with new messages, avoiding duplicates. To achieve this, it maintains context for each download by recording the date and time, the most recent message, and the total number of downloaded messages. It also stores context for all downloaded channels and their related channels.

Usually, a download can take many hours or even days. If the scripts are interrupted, following this procedure ensures a safer restart, preventing the storage of repeated messages.

This version stores only the most relevant data directly in CSV. This dramatically reduces data storage by not storing JSON. It also reduces execution time by not having to convert JSON to CSV, which was a very time-consuming process.

The directory structure is modified so that channels are in a common repository. Datasets are formed by joining a list of channels. If a channel in the list does not exist in the repository, it will be downloaded and if it exists, its latest posts will be updated. This avoids downloading a channel more than once, as happened in telegram-tracker.

Changes introduced in this Telegram-tracker fork

  • Corrections:

    • Fix a warning about using a deprecated parameter
    • Fixed some download errors with exceptions
  • Changes

    • Store data directly in csv, selecting only the relevant data

    • Added a download progress bar. Some channels can have millions of messages and it was not known how much was left to download. This progress bar takes as the number of messages the number of the last published message (the message ids are consecutive numbers starting from 1). The channel may have deleted messages, the download percentage will rarely reach 100%, but it gives an idea of ​​the percentage of deleted messages

    • The file related_channels.csv with the names of the related channels has been added to the download directory

    • The script build-datasets.py has been removed since the data is collected directly in csv. Instead, a build_dataset.py has been created which joins the channel data (in csv) into a dataset. If the channel does not exist in the repository, it will be downloaded. If it exists, it will be updated.

    • The --min-id parameter has been removed as it is no longer necessary, given that updates with new messages are now automatic.

    • The --batch-file parameter has been removed to ensure that channels are downloaded one by one using main.py. When you want to download a list of channels, you can do so using:

      • A shell script menu.py
      • The telegram-tracker.ipynb notebook included in this repository, prepared to be used in Google's Colab environment
  • Added context:

    • The context directory has been created to store the different download logs
    • The {channel}_log_download.csv file has been added to the context directory. For each download, the most recent message number, the number of downloaded messages, and the date and time of the download are stored. This allows automatic updating of new messages without having to use the --min-id parameter which has been removed.
    • For snowball downloading, context is saved in a file {channel}_n2_log.csv with the operations carried out by the channels in order to resume the download at the point it left off after an interruption
    • To download a list of channels, context is saved in a file {name_list}_log.csv with the operations carried out by the channels in order to resume the download at the point it left off after an interruption
    • In the context directory of the data, two files have been added:
      • collected_channels_all.csv: contains all the downloaded channels
      • related_channel_all.csv: includes all channels related to the downloaded channels. These channels may appear in multiple datasets, which is why we also store the number of times they appear in other downloads and the directories where they can be found.
  • Limitations: There are two types of channels:

    • Broadcasts that publish and do not allow comments. It only has one owner
    • Chat channels allow conversations and usually have several administrators

    Chat channels can have millions of messages and downloading them requires a lot of time and storage. It is possible to limit the number of messages with the parameter --max_msgs and the most recent ones are obtained.

This fork includes a notebook to run these scripts in the colab environment.

Requirements:

  • Have a Google account
  • Have a Telegram account
  • Create an App in Telegram https://my.telegram.org/auth?to=apps and follow the steps to obtain the Id and API KEY
  • Fill the telegram-tracker configuration file (config/config.ini) with the App data
  • Upload the repository to drive
  • Open telegram-tracker.ipynb in the colab environment (clicking on the notebook will open it in the environment)

The telegram-tracker.ipynb notebook will mount drive so that the scripts can be executed from there

Local enviroment

This tool connects to Telegram's API. It generates csv files containing channel's data, including channel's information and posts. You can search for a specific channel, or a set of channels provided in a text file (one channel per line.)

You can access it in command mode, but with the menu.py script you can download in a controlled way

  • A channel

  • Snowball from a channel

  • A group of channels

Software required

Python required libraries

Installing

  • Via git clone
https://github.com/congosto/telegram-tracker-t-hoarder_tg.git

This will create a directory called telegram-tracker which contains the Python scripts. Cloning allows you to easily upgrade and switch between available releases.

  • From the github download button

Download the ZIP file from github and use your favorite zip utility to unpack the file telegram-tracker-t-hoarder_tg.zip on your preferred location.

After cloning or downloding the repository, install the libraries from requirements.txt.

pip install -r requirements.txt

or

pip3 install -r requirements.txt

Once you obtain an API ID and API hash on my.telegram.org, populate the config/config.ini file with the described values.

[Telegram API credentials]
api_id = api_id
api_hash = api_hash
phone = phone

Note: Your phone must be included to authenticate for the first time. Use the format +<code><number> (e.g., +19876543210). Telegram API will send you a code via Telegram app that you will need to include.



Example usage by commands

main.py

This Python script will connect to Telegram's API and handle your API request.

Options

  • --telegram-channel Specifies Telegram Channel to download data from.
  • --limit-download-to-channel-metadata Will collect channels metadata only, not channel's messages. (default = False)
  • --output, -o Specifies a folder to save collected data. If not given, script will generate a default folder called ./output/data

Structure of output data

├──🗂 data
||       └──🗂 context
|           └──<channel>_log.csv
|           └──related_channel_log.csv
|           └──collected_channel_log.csv
|       └──chats.txt // TM channels, groups, or users' IDs found in data.
|       └──collected_chats.csv // TM channels or groups found in data (e.g., forwards)
|       └──counter.csv // TM channels, groups or users found in data (e.g., forwards)
|       └──user_exceptions.txt // From collected_chats, these are mostly TM users' which 
|                                   metadata was not possible to retrieve via the API
|       └──msgs_dataset.csv // Posts and messages from the requested channels
|──🗂 dataset
|       └──🗂 context
|           └──<dataset>_log.csv
|       └──collected_chats.csv // TM channels or groups found in data (e.g., forwards)
|       └──msgs_dataset.csv // Posts and messages from the requested channels

Examples


Basic request

python main.py --telegram-channel channelname`

Expected output

  • Files of collected channels:
    • chats.txt
    • collected_chats.csv
    • user_exceptions.txt
    • counter.csv
    • collected_chanels

Request using a text file containing a set of channels

Download the channels one by one with the same output directory

for chanel in channels:
   python main.py  --telegram-channel channel --output './path/to/channels'

Expected output

  • Files of collected channels:
    • chats.txt
    • collected_chats.csv
    • user_exceptions.txt
    • counter.csv
    • collected_chanels

These examples will retrieve all posts available through the API from the requested channel. If you want to collect channel's information only, without posts, you can run:


Limit download to channel's metadata only

python main.py --telegram-channel channelname --limit-download-to-channel-metadata

Updating channel's data

It is automatic because it has context

python main.py --telegram-channel channelname --output './path/to/channels'

Expected output

  • Files of collected channels:
    • chats.txt
    • collected_chats.csv
    • user_exceptions.txt
    • counter.csv
    • collected_chanels

Specify output folder

The script allows you to specify a specific output directory to save collected data. The sxcript will create those folders in case do not exist.

python main.py --telegram-channel channelname --output ./path/to/chosen/directory`

The expected output is the same a described above but data will be save using the chosen directory.



channels-to-network.py

This Python script builds a network graph. By default, the file will be located in the output folder. The script also saves a preliminary graph: network.png using the modules matplotlib, networkx, and python-louvain, which implements community detection. You can import the GEFX Graph File using different softwares, including Gephi.

Options

  • --data-path Path were data is located. Will use ./output/data if not given.

If a specific directory was not provided in main.py, run:

python channels-to-network.py

If you provided a specific directory using the option --output in main.py, run:

python channels-to-network.py --data-path ./path/to/chosen/directory