Refactor telegram parser
Closed this issue · 3 comments
I am currently adding typing to the files to make it easier to use chat-miner. While doing this I read (for the first time) the code for the Telegram parser. The _read_raw_messages_from_file
method is really good (other than a potential spot for a tqdm
) but the _parse_message
method is littered with assert
and isinstance
. In addition, I don't really understand what the code is doing.
For reference, here is the current code:
def _parse_message(self, mess: dict[str, Any]):
if "from" in mess and "text" in mess:
if isinstance(mess["text"], str):
body = mess["text"]
elif isinstance(mess["text"], list):
assert all(
[
(isinstance(m, dict) and "text" in m) or isinstance(m, str)
for m in mess["text"]
]
)
body = " ".join(
map(
lambda m: m["text"] if isinstance(m, dict) else m,
mess["text"],
)
)
else:
raise ValueError(f"Unable to parse type {type(mess['text'])} in {mess}")
time = dt.datetime.fromtimestamp(int(mess["date_unixtime"]))
author = mess["from"]
return ParsedMessage(time, author, body)
# NOTE: I changed the default return value to None instead of False to standardize the failure case
return None
I propose that someone who knows how the Telegram JSON file is structured tries to rewrite this method (perhaps @galatolofederico) so that it is more readable, type-safe (not actually but it makes it harder to make mistakes), and pythonic.
@BlueishTint I agree that this needs some refactoring. The available formats of telegram exports are referenced in #26
@BlueishTint fyi I refactored the telegram parser just now.
I didn't see this until now but thanks! It looks much better.