joweich/chat-miner

Wordcloud also displays non-printable characters (Workaround)

Closed issue · 5 comments

I tried this on the chat with my girlfriend. Since we are Slovak (Š, Č, Á, ô, ...) and also use a lot of emojis, these characters got into the wordcloud and caused a lot of weird characters and numbers.

I made a small "dirty" fix using a regex.

Edit visualisations.py line 108 to:

words = ["".join(re.findall(r'\b[a-zA-Z0-9]+\b', word)).lower() for sublist in messages for word in sublist]

This is a really cheap workaround, as I did not have a lot of time to check the inner workings of the code, but it gets rid of all annoying characters other than A-Z, a-z, 0-9. If a word has an emoji in the middle, it removes it and joins the two halves together.
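For reference, here is a standalone sketch of what that one-liner does, assuming `messages` is a list of per-message word lists (which is how the comprehension iterates over it); the sample data is made up:

```python
import re

# Made-up stand-in for the parsed messages: one list of words per message.
messages = [["HI😀", "uuuu😮ok"], ["me🙂", "2023"]]

# Keep only chunks of ASCII letters/digits; an emoji inside a word is
# dropped and the remaining halves are joined back together.
words = [
    "".join(re.findall(r"\b[a-zA-Z0-9]+\b", word)).lower()
    for sublist in messages
    for word in sublist
]
print(words)  # ['hi', 'uuuuok', 'me', '2023']
```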
I am not sure whether this code would be ready to be added to the project, which is why I am writing this as an issue.
EDIT: It may be that this is a problem because I use Linux and the behaviour is not present on Windows/Mac, as I have had a few problems with special characters before.

Hey @UntriexTv, thank you for raising this! Are the Slovak characters also resolved into their Unicode notation (I assume this is what you mean by "lots of numbers") in the dataframe before visualization? I am trying to understand whether the issue is caused by our parser or might be related to the wordcloud module.
On another note: which parser are you using?
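If it helps narrowing it down: one quick way to check whether the escapes are already present in the exported JSON itself, before the parser or the wordcloud module ever touch the text, would be something along these lines (the file name is just a placeholder for your export):

```python
# "message_1.json" is a placeholder for the exported Messenger conversation.
with open("message_1.json", encoding="utf-8") as f:
    raw = f.read()

# \u00c3 / \u00c5 are typical escapes for bytes of accented Latin letters,
# \u00f0 for the first byte of an emoji. Non-zero counts would mean the
# escaping is already in the export, not introduced by chat-miner.
print(raw.count(r"\u00c3"), raw.count(r"\u00f0"))
```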

Okay @joweich, I found this issue on the internet that seems to talk about Unicode escapes.
Yes, they are in Unicode notation. Even when I open the JSON file in Krita (and try all of the encodings) or run it through an online translator, I just get garbage.
I think I have already had this kind of problem on my Linux machine before. I will look into it in more depth and maybe send some edited JSON so you can try to reproduce the problem. Also, some of them could be emojis, but I guess you are filtering those out?

This is taken after disabling my filter, with a breakpoint on line 113 in the wordcloud function:
[screenshot: debugger output]
If I am not mistaken (it's hard to be sure), it should be rozmýšľam.
Does that mean the notation is getting translated, but with bad encoding?
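If so, this could be the known behaviour of the Facebook JSON export, where every non-ASCII byte of the UTF-8 text is escaped individually as \u00XX and the JSON decoder then produces Latin-1 mojibake. A small round-trip sketch (the sample string is only an illustration):

```python
import json

# In the export, the UTF-8 bytes of "rozmýšľam" appear escaped one byte at a time:
raw = r'"rozm\u00c3\u00bd\u00c5\u00a1\u00c4\u00beam"'
garbled = json.loads(raw)
print(garbled)  # rozmÃ½Å¡Ä¾am – mojibake instead of the original word

# Re-encoding as Latin-1 recovers the original UTF-8 bytes:
print(garbled.encode("latin-1").decode("utf-8"))  # rozmýšľam
```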

And I am using FacebookMessengerParser.
Edit:

I also noticed these now:
[screenshot: sticker entries]
Are you aware of them? If some people use the same stickers often, these could end up in the statistics.

Also, some of the problems could be caused by there being no space between a word and an emoji (like "HI:smile:"); I think I see some cases of that in the render. "ð" is also repeated a lot, and given that it is usually the only character in a word, and given the link I sent earlier, I guess it represents emojis.
Some examples:
Probably something like "uuuu:open_mouth:"
[screenshot]
"Nemusí"
[screenshot]
**you ❤️** (Probably?)
[screenshot]
me"some emoji", And others I have no idea. Also there Is lot of big characters like this "ð" In this render it is around 10, but they are the only character in the word (Not sure how to say it), so emoji
[screenshot]
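That would fit the same encoding picture: emoji code points sit above U+FFFF, so their UTF-8 encoding is four bytes starting with 0xF0, and 0xF0 read as Latin-1 is "ð" (for 😀 the remaining three bytes are invisible control characters), which would explain the big lone "ð" words. A quick check:

```python
emoji = "😀"
utf8 = emoji.encode("utf-8")
print(utf8)                    # b'\xf0\x9f\x98\x80'
print(utf8.decode("latin-1"))  # 'ð' followed by three invisible control characters
```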

@UntriexTv I think I fixed the issue in #77 by picking up the information from the stackoverflow post you linked.
Would you mind testing the fix and checking if it resolved the issue? The "ð" character should definitely be gone.

@joweich It doesn't seem to change anything. They are also still present in the visualisation:
[screenshot]