qwertyquerty/pypresence

Presence update code should be able to limit input string length

afontenot opened this issue · 2 comments

Many presence payload fields are limited in size (see Discord documentation).

One downstream application (a plugin for Quod Libet, the music player) ran into an issue where trying to send the title of a song in the "details" field caused an exception. See the issue for that here: quodlibet/quodlibet#4168

From our discussion, it became clear that while we could fix the issue, it would probably make more sense for pypresence to add some behavior to its updating function to automatically trim strings to the Discord limit. That way, all pypresence users would benefit from the change.

I recommend checking out my comment for more details, but to simplify, I believe that Discord allows 256 bytes of text for the "details" field, when encoded as UTF-16, and not counting the byte order mark. If all characters are ASCII, that's 128 characters. Other fields are probably similar but might have different limits.
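
To make the byte arithmetic concrete, here is a minimal sketch of how the encoded size can be measured. The helper name `utf16_size` is made up for illustration, and the 256-byte figure is only my reading of the limit as described above, not a confirmed number.

```python
def utf16_size(s: str) -> int:
    """Return the number of bytes s occupies as UTF-16, without a BOM."""
    # The "_le" codec variant does not prepend a byte order mark.
    return len(s.encode("utf_16_le"))


# 128 ASCII characters at 2 bytes each.
print(utf16_size("a" * 128))

# The generic "utf_16" codec prepends a 2-byte BOM, which would skew the count.
print(len("a".encode("utf_16")))

# A code point outside the Basic Multilingual Plane (here U+1F308, a rainbow)
# is stored as a surrogate pair, so it takes 4 bytes.
print(utf16_size("\U0001F308"))
```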

Strings in Python are indexed by code point, not by byte. The naive approach s[:128] won't work as expected because code points outside the Basic Multilingual Plane (most emoji, for example) take 4 bytes when encoded in UTF-16 rather than 2, so 128 code points can exceed 256 bytes. There are two possible approaches.

  1. If you're willing to add a small dependency, pyicu, you can do "nice" grapheme based splitting, which avoids breaking some characters if they come at the end of the string. Example code:
from icu import BreakIterator, Locale, UnicodeString

break_iter = BreakIterator.createCharacterInstance(Locale())

def get_str_by_grapheme(s):
    icu_string = UnicodeString(s)
    break_iter.setText(icu_string)
    start = break_iter.first()
    for end in break_iter:
        yield str(icu_string[start:end])
        start = end

def trim_text_utf16(s, max_bytes=256):
    # use _le tagged encoding to avoid BOM insertion
    if len(s.encode("utf_16_le")) <= max_bytes:
        return s
    result = ""
    byte_size = 0
    # each item is a whole grapheme cluster, possibly several code points
    for grapheme in get_str_by_grapheme(s):
        byte_size += len(grapheme.encode("utf_16_le"))
        # 2 = len("…".encode("utf_16_le"))
        if byte_size > max_bytes - 2:
            return result + "…"
        result += grapheme
  2. If no added dependency is acceptable, you can trim the string by code points instead:
def trim_text_utf16(s, max_bytes=256):
    # use _le tagged encoding to avoid BOM insertion
    if len(s.encode("utf_16_le")) <= max_bytes:
        return s
    result = ""
    byte_size = 0
    for cp in s:
        byte_size += len(cp.encode("utf_16_le"))
        # 2 = len("…".encode("utf_16_le"))
        if byte_size > max_bytes - 2:
            return result + "…"
        result += cp

The downside of the latter approach is that it can split a character made up of multiple code points (a flag or other emoji sequence, for example) when that character falls across the max_bytes boundary, so the trimmed text may end in a broken glyph.
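
A small sketch of that boundary problem, using the code-point version of `trim_text_utf16` from above. The test string is made up for illustration: a flag emoji is two regional-indicator code points of 4 bytes each in UTF-16, so an odd number of them can be left before the ellipsis.

```python
def trim_text_utf16(s, max_bytes=256):
    # use _le tagged encoding to avoid BOM insertion
    if len(s.encode("utf_16_le")) <= max_bytes:
        return s
    result = ""
    byte_size = 0
    for cp in s:
        byte_size += len(cp.encode("utf_16_le"))
        # 2 = len("…".encode("utf_16_le"))
        if byte_size > max_bytes - 2:
            return result + "…"
        result += cp


# Two regional indicators (D + E) that render together as one flag.
flag = "\U0001F1E9\U0001F1EA"
text = flag * 40  # 80 code points, 320 bytes in UTF-16

trimmed = trim_text_utf16(text)

# The result fits the limit, but the cut lands in the middle of a flag:
# the last code point before the ellipsis is a lone regional indicator.
print(len(trimmed.encode("utf_16_le")))
print(trimmed.count(flag))
print(trimmed[-2] == "\U0001F1E9")
```

The grapheme-based version would drop the whole flag instead of leaving half of it behind.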

Trimming a string with 1000 rainbow flags to the 256 byte limit with the full grapheme splitting code takes less than 0.1 ms on my test system (a 10 year old laptop).

pypresence has repeatedly opted not to transform received input, and instead leaves things like this up to whoever is using the library.
However, I wouldn't be opposed to a PR that adds this behaviour to the Presence class, preferably without an added dependency.

Happy to reopen if anyone is up to the task or something changes which makes this a necessity.
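
For anyone picking this up, a rough, dependency-free sketch of what trimming before an update might look like. This is not existing pypresence behaviour: `ASSUMED_LIMITS` and `trim_payload_fields` are hypothetical names, and the per-field limits are placeholders based on the "details" estimate above, not values checked against the Discord documentation.

```python
# Assumption: placeholder limits in UTF-16 bytes. Only "details" was
# investigated in this thread; the "state" value is a guess.
ASSUMED_LIMITS = {"details": 256, "state": 256}


def trim_text_utf16(s, max_bytes=256):
    # Code-point based trimming, as in approach 2 above.
    if len(s.encode("utf_16_le")) <= max_bytes:
        return s
    result = ""
    byte_size = 0
    for cp in s:
        byte_size += len(cp.encode("utf_16_le"))
        # Reserve 2 bytes for the trailing ellipsis.
        if byte_size > max_bytes - 2:
            return result + "…"
        result += cp


def trim_payload_fields(fields, limits=ASSUMED_LIMITS):
    """Return a copy of fields with over-long string values trimmed.

    Keys without a known limit and non-string values pass through unchanged.
    """
    trimmed = {}
    for key, value in fields.items():
        if key in limits and isinstance(value, str):
            trimmed[key] = trim_text_utf16(value, limits[key])
        else:
            trimmed[key] = value
    return trimmed


# Possible use with an already connected pypresence Presence object:
#   presence.update(**trim_payload_fields({"details": title, "state": artist}))
```

Keeping the trimming in a separate helper would let it stay opt-in, which seems closer to the library's stance of not transforming input by default.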