xenova/chat-downloader

[BUG] Twitch emote names are parsed incorrectly if the message text contains non-ASCII characters

Basic information

  • Program version: 0.2.8
  • Python version: 3.11.9
  • Operating system: Linux

Describe the bug

If a chat message in a VOD contains a non-ASCII character (for example, any 2-byte UTF-8 character), the emotes[].name field in the message JSON produced by the library is parsed incorrectly. In the example below it comes out as an empty string.

Command/Code used

  1. The command used (including the verbose tag, -v):
chat_downloader --start_time 05:58:28 --end_time 05:58:30 --output test.jsonl --testing 'https://www.twitch.tv/videos/2184933543'
  2. Output from the above command:

(I've temporarily patched the library with debug prints to show the raw GQL content passed to the message mapper, chat_downloader.sites.twitch.TwitchChatDownloader._parse_message_info().)

[DEBUG] Python version: 3.11.9 (main, Jul  3 2024, 00:12:48) [GCC 12.2.0]
[DEBUG] Program version: 0.2.8
[DEBUG] Initialisation parameters: {'headers': None, 'cookies': None, 'proxy': None}
[DEBUG] Created TwitchChatDownloader session.
[INFO] Site: twitch.tv
[DEBUG] Program parameters: {'url': 'https://www.twitch.tv/videos/2184933543', 'start_time': '05:58:28', 'end_time': '05:58:30', 'max_attempts': 15, 'retry_timeout': None, 'interruptible_retry': True, 'timeout': None, 'inactivity_timeout': None, 'max_messages': None, 'message_groups': ['messages'], 'message_types': None, 'output': 'test.jsonl', 'overwrite': True, 'sort_keys': True, 'indent': 4, 'format': 'twitch', 'format_file': None, 'chat_type': 'live', 'ignore': None, 'message_receive_timeout': 0.1, 'buffer_size': 4096}
[DEBUG] Starting new HTTPS connection (1): gql.twitch.tv:443
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 880
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
[DEBUG] Match found: "<re.Match object; span=(0, 39), match='https://www.twitch.tv/videos/2184933543'>". Running "_get_chat_by_vod_id" function in "TwitchChatDownloader".
[DEBUG] Chat information: {'chat': <generator object TwitchChatDownloader._get_chat_messages_by_vod_id at 0x7f8ce80acf40>, 'title': 'DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9', 'duration': 23578, 'status': 'past', 'video_type': 'video', 'start_time': None, 'id': '2184933543', '_output_writer': <chat_downloader.output.continuous_write.ContinuousWriter object at 0x7f8ce7fc0250>, '_output_callback': None, 'format': <function ChatDownloader.get_chat.<locals>.<lambda> at 0x7f8ce7f83880>, 'site': <chat_downloader.sites.twitch.TwitchChatDownloader object at 0x7f8ce8c21e50>}
[INFO] Retrieving chat for "DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9".
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
...
message={'fragments': [{'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}, {'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}, {'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}], 'userBadges': [], 'userColor': '#FF69B4', '__typename': 'VideoCommentMessage'}
fragment={'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}
[DEBUG] Writing to file: test.jsonl
5:58:29 | NIKI_ORNIS: Спасибо за стрим TwitchUnity Удовольствия от игры
...
[INFO] Finished retrieving chat messages.
[DEBUG] Session closed.

Actual content of test.jsonl (prettified)

{
  "author": {
    "colour": "#FF69B4",
    "display_name": "NIKI_ORNIS",
    "id": "458636669",
    "name": "niki_ornis"
  },
  "emotes": [
    {
      "id": "196892",
      "images": [
        {
          "height": 28,
          "id": "28x28-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
          "width": 112
        },
        {
          "height": 28,
          "id": "28x28-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
          "width": 112
        }
      ],
      "locations": "31-41",
      "name": ""
    }
  ],
  "message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
  "message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
  "message_type": "text_message",
  "time_in_seconds": 21509,
  "time_text": "5:58:29",
  "timestamp": 1719705721803000
}

Expected content of test.jsonl (prettified)

The name field of the emote should be filled in:

{
  "author": {
    "colour": "#FF69B4",
    "display_name": "NIKI_ORNIS",
    "id": "458636669",
    "name": "niki_ornis"
  },
  "emotes": [
    {
      "id": "196892",
      "images": [
        {
          "height": 28,
          "id": "28x28-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
          "width": 112
        },
        {
          "height": 28,
          "id": "28x28-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
          "width": 112
        }
      ],
      "locations": "31-41",
      "name": "TwitchUnity"
    }
  ],
  "message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
  "message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
  "message_type": "text_message",
  "time_in_seconds": 21509,
  "time_text": "5:58:29",
  "timestamp": 1719705721803000
}

Additional context/information

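Twitch GQL reports the start and end of an emote code within the chat text as byte offsets (UTF-8, judging by the offsets in the debug output above), not character offsets. For messages with non-ASCII characters, the locations must therefore be applied to the encoded bytes of the Python string rather than to the string itself.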
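To check the byte-offset claim against the debug output above, here is a minimal sketch using only values copied from the logged fragments. The variable names are mine and are not taken from the library.

```python
# Values copied from the debug output in this issue.
prefix = 'Спасибо за стрим '      # text fragment before the emote
emote_text = 'TwitchUnity'        # text of the emote fragment
begin, end = 31, 41               # from the emote id '196892;31;41'

chars_before = len(prefix)                  # length in code points
bytes_before = len(prefix.encode('utf-8'))  # length in UTF-8 bytes

print(chars_before)   # 17
print(bytes_before)   # 31, which matches 'from': 31

# The end offset is inclusive and is also counted in bytes.
print(bytes_before + len(emote_text.encode('utf-8')) - 1)  # 41
```

Each Cyrillic letter takes two bytes in UTF-8, so the emote starts at character 17 but at byte 31.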

The fix is straightforward:

'name': message_text.encode("utf-8")[begin:end + 1].decode("utf-8")

instead of

'name': message_text[begin:end + 1]
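
As a standalone illustration of the proposed change, the sketch below wraps the byte-based slice in a helper and runs it on the message from this issue. The helper name slice_by_utf8_bytes is hypothetical and does not exist in chat_downloader. It assumes the offsets are inclusive UTF-8 byte positions, as the debug output suggests.

```python
def slice_by_utf8_bytes(text: str, begin: int, end: int) -> str:
    """Return the part of text between UTF-8 byte offsets begin..end (inclusive)."""
    return text.encode('utf-8')[begin:end + 1].decode('utf-8')


# Full message text and emote location copied from the issue.
message_text = 'Спасибо за стрим TwitchUnity Удовольствия от игры'
begin, end = 31, 41

# Character-based slicing lands in the Cyrillic text after the emote.
by_chars = message_text[begin:end + 1]

# Byte-based slicing recovers the emote code.
by_bytes = slice_by_utf8_bytes(message_text, begin, end)

print(repr(by_bytes))  # 'TwitchUnity'
```

The exact wrong value the library produces depends on which string it slices at that point, so the character-based result here need not match the empty name seen in test.jsonl. If an offset ever fell in the middle of a multi-byte character, decode() would raise UnicodeDecodeError; passing errors='replace' is one way to tolerate that.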