Strings decoded are left as byte strings

Question

jackm opened this issue 8 years ago · 3 comments

I am uncertain if this is a bug or not, but I noticed that when decoding bencoded data, any decoded strings are left as byte strings even though they're not really binary data, but actual human-readable strings.

For example:

bdecode(b'8:a string')
# b'a string'

If you wanted to use this as a string, you would also have to run it through decode('utf-8').

From what I understand, the bencode standard does not make a distinction between strings and binary data. Would it be a good idea to try to decode any byte strings to regular strings and, if that fails, leave them as-is, assuming they are actually binary data? I just know that having to append decode('utf-8') to every decoded dict key gets very repetitive.

Answer 1 · 2018-03-23T05:45:28.000Z

It's intended because we don't know what encoding the string is using.

"Would it be a good idea to try and decode any byte strings to regular strings and if it fails then leave it as-is, assuming it is actually binary data?"

I think that would actually be more confusing, since you couldn't tell in advance whether you would get bytes or a string. There are also cases where bytes encoded with encoding A can be decoded with encoding B without raising an error.
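
Both points can be illustrated with a short sketch. `try_decode` is a hypothetical helper written only for this example (it is not part of this library); it implements the proposed "try UTF-8, otherwise keep the bytes" behaviour:

```python
def try_decode(value: bytes, encoding: str = "utf-8"):
    """Hypothetical helper: decode if possible, otherwise return the bytes unchanged."""
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return value


# The return type depends on the data, so callers cannot rely on it.
text = try_decode(b"a string")      # comes back as str
binary = try_decode(b"\xff\xfe")    # comes back as bytes: 0xff is never valid UTF-8

# Bytes written with one encoding can decode "successfully" with another.
raw = "é".encode("utf-8")           # b'\xc3\xa9'
wrong = raw.decode("latin-1")       # no error, but the text is 'Ã©', not 'é'
```

In the last case no exception is raised, so a fallback based on decode errors would not notice that the wrong encoding was used.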

Answer 2 · 2018-03-23T15:06:01.000Z

I thought about it some more and you're right; there's no guarantee that even if it was supposed to be a string that it would be utf-8 encoded.

So I suppose the correct way to do it is decode the byte strings into strings only when you know that they should be strings. For example, when bdecoding a torrent file the dict keys should all be utf-8 strings (right?), but some of the dict values might be encoded with something else.
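
A minimal sketch of that approach, assuming the dict keys really are UTF-8: decode only the keys after decoding and leave every value as bytes. `decode_keys` is a hypothetical helper, and the sample dict is made-up data standing in for what bdecode might return for a torrent file:

```python
def decode_keys(obj, encoding="utf-8"):
    """Hypothetical helper: recursively decode dict keys, leaving values untouched."""
    if isinstance(obj, dict):
        return {
            (key.decode(encoding) if isinstance(key, bytes) else key):
                decode_keys(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [decode_keys(item, encoding) for item in obj]
    return obj  # bytes and ints are returned as-is


# Made-up stand-in for a decoded torrent file.
decoded = {
    b"announce": b"http://tracker.example/announce",
    b"info": {b"name": b"example.iso", b"pieces": b"\xff\x00"},
}

result = decode_keys(decoded)
# Keys are now str; values such as result["info"]["pieces"] are still bytes.
```

Values that are known to be text can then be decoded one at a time by the caller, with whatever encoding applies to that field.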

Answer 3 · 2018-03-23T15:47:13.000Z

IMO, in most cases it's hard to know which encoding a torrent file used before it was bencoded, and, as you mentioned, a single torrent file might mix encodings. So if we can't guarantee that every byte string in the decoded dict can be decoded into the correct string, the user would still have to check manually and do the encoding/decoding themselves.