/messaging-chat-parser

WhatsApp and Telegram chat parsers

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

📲 Messaging parser

Use what you have written

What is this repo?

This repository provides Python scripts to parse WhatsApp and Telegram messages.
The goal is to produce well-structured text files for machine learning purposes. [4]

📥 Inputs

Data to provide:

  • WhatsApp data
    • .txt files exported from one or more chats - how
      • place all the .txt files in ./data/chat_raw/whatsapp/*.txt
  • Telegram data
    • .json file with the Telegram dump - how [5]
      • copy the JSON file to ./data/chat_raw/telegram/telegram_dump.json, renaming it if needed
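
Before running the parsers, it can help to confirm that the inputs sit where the scripts expect them. The snippet below is only a sketch of such a check: `find_inputs` is a hypothetical helper, not part of this repository, and it relies only on the two paths listed above.

```python
from pathlib import Path


def find_inputs(root="."):
    """Return (whatsapp_txt_files, telegram_dump_or_None) found under root.

    Hypothetical helper: it only checks the paths named in this README.
    """
    raw = Path(root) / "data" / "chat_raw"
    # All WhatsApp exports: ./data/chat_raw/whatsapp/*.txt
    whatsapp_files = sorted((raw / "whatsapp").glob("*.txt"))
    # The single Telegram dump: ./data/chat_raw/telegram/telegram_dump.json
    telegram_dump = raw / "telegram" / "telegram_dump.json"
    return whatsapp_files, (telegram_dump if telegram_dump.is_file() else None)


if __name__ == "__main__":
    wa_files, tg_dump = find_inputs()
    print(f"WhatsApp exports found: {len(wa_files)}")
    print(f"Telegram dump found: {tg_dump is not None}")
```

Run it from the repository root; an empty WhatsApp list or a missing Telegram dump means the corresponding parser has nothing to read.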

⚙ Usage

  • Install the dependencies listed in requirements.txt
  • WhatsApp [1]

    python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4 --user_name <user_name>

  • Telegram [2]

    python ./src/telegram_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4

  • Join files and extract user messages

    python ./src/joiner.py
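
To give an idea of what the join step produces, here is a minimal sketch. It is not the code in joiner.py: `join_and_extract` is a hypothetical function, and stripping the `[me]` tag from the extracted user messages is an assumption.

```python
def join_and_extract(chat_texts, user_tag="[me]"):
    """Concatenate parsed chat texts and pull out the user's own messages.

    chat_texts: list of strings, each holding the content of one parsed file.
    Assumption: user messages are the lines starting with user_tag, and the
    tag itself is dropped in the extracted list.
    """
    all_lines = []
    for text in chat_texts:
        all_lines.extend(text.splitlines())
    user_lines = [
        line[len(user_tag):].strip()
        for line in all_lines
        if line.startswith(user_tag)
    ]
    return all_lines, user_lines


if __name__ == "__main__":
    telegram = "[me] hi\n[others] hello\n<|endoftext|>\n"
    whatsapp = "[me] bye\n"
    all_lines, user_lines = join_and_extract([telegram, whatsapp])
    print(all_lines)
    print(user_lines)
```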

📤 Outputs

  • telegram-chats.txt and wa-chats.txt
    • Both files have this structure:
      [me] bla bla bla
      [others] bla bla bla
      [others] bla bla bla
      <|endoftext|>
      [me] bla bla bla
      ...
    • The three tags mean:
      • [me]: placed as a prefix of text written by the user [3]
      • [others]: placed as a prefix of text written by others
      • <|endoftext|>: added when the time elapsed between two sequential messages is > 4 hours (the --delta_h_threshold value used above)
  • all-messages.txt
    • A single file containing the rows of both telegram-chats.txt and wa-chats.txt.
  • user-messages.txt
    • One line per message written by the user [3]
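
The tagging and session-splitting rules described above can be sketched as follows. This is an illustration under stated assumptions, not the parsers' actual code: `format_chat` and its `(timestamp, is_me, text)` input tuples are hypothetical.

```python
from datetime import datetime, timedelta


def format_chat(messages, session_token="<|endoftext|>", delta_h_threshold=4):
    """Turn (timestamp, is_me, text) tuples, sorted by time, into tagged lines.

    A session token line is emitted whenever the gap between two sequential
    messages exceeds delta_h_threshold hours. Passing session_token=None
    disables session splitting.
    """
    lines = []
    previous = None
    for timestamp, is_me, text in messages:
        gap_exceeded = (
            previous is not None
            and timestamp - previous > timedelta(hours=delta_h_threshold)
        )
        if session_token is not None and gap_exceeded:
            lines.append(session_token)
        tag = "[me]" if is_me else "[others]"
        lines.append(f"{tag} {text}")
        previous = timestamp
    return lines


if __name__ == "__main__":
    chat = [
        (datetime(2019, 12, 12, 8, 40), True, "bla bla bla"),
        (datetime(2019, 12, 12, 9, 0), False, "bla bla bla"),
        (datetime(2019, 12, 12, 14, 0), True, "bla bla bla"),
    ]
    print("\n".join(format_chat(chat)))
```

In this example the third message arrives five hours after the second, so a session token line is placed between them.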

📝 Notes

  • [1] How to find the <user_name> value?
    • Look in the exported WhatsApp chat text, e.g. in a line like:
      12/12/19, 08:40 - <user_name>: bla bla bla
  • [2] Check that the Telegram dump is named telegram_dump.json and is located at
    ./data/chat_raw/telegram/telegram_dump.json
  • [3] user = the owner of the messages (ideally the same person who runs these scripts)
    • for Telegram, the account that made the data dump
    • for WhatsApp, the value passed to --user_name
  • [4] It is always better not to run random scripts on personal information (like chat messages)
    • You can check this code
    • Keep in mind beforehand:
      • This is a free-time project; I make no guarantees about efficiency or good programming practice
      • I'm not so good at writing English
      • Good luck
  • [5] Be sure to select the "Account information" checkbox in the Telegram dump dialog window
  • Neither the Telegram nor the WhatsApp parser has been tested on group chat data, and neither is intended to handle it.
  • The chat session behavior can be changed
    • --session_token sets the session-splitting token; if the argument is not provided, session splitting is disabled.
    • --delta_h_threshold sets the time window, in hours, that must elapse between two sequential messages before a session_token is inserted
  • 📅 Parsing data with custom values:
    • Both the WhatsApp and Telegram parsers use an Italian datetime format by default
    • You can always parse with a custom format by passing the --time_format parameter:
      • WhatsApp:

      python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4 --user_name <user_name> --time_format "%d/%m/%y, %H:%M"

      • Telegram:

      python ./src/telegram_parser.py --session_token "<|endoftext|>" --time_format "%Y-%m-%dT%H:%M:%S"
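
The --time_format values above are standard Python `strptime` format strings. As a rough illustration of how such a format applies to an exported WhatsApp line like the one in note [1], here is a minimal sketch; `parse_whatsapp_line` is a hypothetical helper and the sender name `Alice` is made up, so the real parser may split lines differently.

```python
from datetime import datetime


def parse_whatsapp_line(line, time_format="%d/%m/%y, %H:%M"):
    """Split 'timestamp - sender: text' into (datetime, sender, text).

    Returns None when the line does not match the expected layout or the
    timestamp does not match time_format.
    """
    stamp, dash, rest = line.partition(" - ")
    sender, colon, text = rest.partition(": ")
    if not dash or not colon:
        return None
    try:
        return datetime.strptime(stamp, time_format), sender, text
    except ValueError:
        return None


if __name__ == "__main__":
    print(parse_whatsapp_line("12/12/19, 08:40 - Alice: bla bla bla"))
    # The Telegram format shown above parses ISO-like timestamps:
    print(datetime.strptime("2019-12-12T08:40:00", "%Y-%m-%dT%H:%M:%S"))
```

If a custom --time_format does not match the timestamps in your export, `strptime` raises `ValueError`, so a quick test like this on one line of your own data is a cheap way to validate the format string.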