itkach/slob

Slob File format error in WIKI

tosbaha opened this issue · 5 comments

Hi,
I checked the WIKI for Slob File format and it says

| Element | Type | Description |
|---|---|---|
| content types | char-sized sequence of content types | MIME content types. Content items refer to content types by id. Content type id is 0-based position of content type in this sequence. |

However, when I checked a sample file, I saw that the size of each content type is not char-sized but short-sized.

Example: freedict-eng-tur-0.3.slob

00000d8c  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000d9c  00 00 00 00 00 00 00 00  00 00 03 00 08 74 65 78  |.............tex|
00000dac  74 2f 63 73 73 00 16 61  70 70 6c 69 63 61 74 69  |t/css..applicati|
00000dbc  6f 6e 2f 6a 61 76 61 73  63 72 69 70 74 00 17 74  |on/javascript..t|
00000dcc  65 78 74 2f 68 74 6d 6c  3b 63 68 61 72 73 65 74  |ext/html;charset|
00000ddc  3d 75 74 66 2d 38 00 00  8e f0 00 00 00 00 00 0d  |=utf-8..........|

There are 3 content types. However, the size of the first content type is not 08 but 00 08, and the same holds for the others. Is there a typo in the WIKI? I checked slob.py, and it also says

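def read_text(self):
    # Text is length-prefixed with an unsigned short (2 bytes).
    return self._read_text(U_SHORT)

def read_content_types():
    content_types = []
    # One byte: the number of content types that follow.
    count = f.read_byte()
    for _ in range(count):
        # Each entry is a text with its own 2-byte length prefix.
        content_type = f.read_text()
        content_types.append(content_type)
    return tuple(content_types)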

00 08 is the length (short-sized, 2 bytes) of the character sequence that follows, text/css. The 03 in 03 00 08 is the length of the list of content types. I don't see an issue here.
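To make the byte layout concrete, here is a minimal sketch that produces the same bytes as the hexdump from a list of content types. It is not code from slob.py: the function name `write_content_types` is hypothetical, and UTF-8 text with big-endian lengths is assumed from the sample file.

```python
import struct


def write_content_types(content_types):
    """Serialize content types as: 1-byte count, then (2-byte length + text) per entry.

    Hypothetical helper for illustration; assumes UTF-8 and big-endian lengths.
    """
    out = bytearray(struct.pack(">B", len(content_types)))  # char-sized count
    for content_type in content_types:
        data = content_type.encode("utf-8")
        out += struct.pack(">H", len(data))  # short-sized length of this entry
        out += data
    return bytes(out)


encoded = write_content_types(
    ["text/css", "application/javascript", "text/html;charset=utf-8"]
)
# The first three bytes are the list length (03) and the first text length (00 08).
print(encoded[:3].hex(" "))
```

The count and the per-entry length are two separate fields, which is why 03 is one byte while 00 08 is two.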

Your documentation says char-sized for content_types. Char means a single byte, not 2 bytes. For example, tags are documented as char-sized, and in the file they are indeed a single byte.

There is the length of the list of content types (one byte), and then there is the length of each individual content type entry. The number of different content types a slob file can contain is char-sized, as in "you can have up to 255 of them"; that is the single byte 03. A content type itself is a piece of text up to 2^16-1 bytes long, stored as two bytes of length followed by the bytes of the content type text. This repeats for each entry in the content type list.
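As a rough illustration of that layout, the sketch below parses the bytes from the hexdump earlier in this thread, starting at the 03 count byte. It is not the slob.py implementation: `parse_content_types` and `SAMPLE` are hypothetical names, and UTF-8 text with big-endian lengths is assumed from the sample file.

```python
import struct

# Bytes copied from the hexdump, beginning at the count byte 03.
SAMPLE = (
    b"\x03"
    b"\x00\x08" b"text/css"
    b"\x00\x16" b"application/javascript"
    b"\x00\x17" b"text/html;charset=utf-8"
)


def parse_content_types(buf, offset=0):
    """Return (content_types, next_offset) for a buffer positioned at the count byte."""
    (count,) = struct.unpack_from(">B", buf, offset)  # char-sized list length
    offset += 1
    content_types = []
    for _ in range(count):
        (length,) = struct.unpack_from(">H", buf, offset)  # short-sized text length
        offset += 2
        content_types.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return tuple(content_types), offset


content_types, end = parse_content_types(SAMPLE)
print(content_types)
```

Reading the length as a single byte instead would yield 0 for every entry here, since the high byte of each length is 00.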

My confusion came from the documentation: for content type it just says "text", which reads like a placeholder. Other sections of the format have tables that clearly show text is a type. Maybe you could word it differently to make clear that content types are a sequence of text entries rather than tiny text.