Data store for Aard 2

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

Slob

Slob (sorted list of blobs) is a read-only, compressed data store with a dictionary-like interface for looking up content by text keys. Keys are sorted according to the Unicode Collation Algorithm, which makes it possible to perform punctuation-, case- and diacritics-insensitive lookups. slob.py is a reference implementation of a slob format reader and writer in Python 3.
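
To get a feel for what a punctuation-, case- and diacritics-insensitive lookup means, here is a rough sketch that uses only the Python standard library. It is an approximation for illustration, not the Unicode Collation Algorithm and not how slob.py does it; the helper name loose_key is hypothetical.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Approximate a comparison key that ignores case, diacritics and punctuation.

    Assumption: this only imitates the effect of a low collation strength;
    a real implementation would use a UCA collator instead.
    """
    # Decompose so that base letters and combining marks become separate code points.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M"):  # combining marks, i.e. diacritics
            continue
        if category.startswith("P"):  # punctuation
            continue
        kept.append(ch)
    return "".join(kept).casefold()


words = ["ABC", "abcoulomb", "ABC's", "ABCs", "résumé"]

# Keys that differ only in case, accents or punctuation compare equal.
matches = [w for w in words if loose_key(w) == loose_key("abc")]
same = loose_key("ABC's") == loose_key("ABCs")
```

With such a key, ABC's and ABCs are treated as the same entry, and résumé can be found by typing resume.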

Installation

slob.py depends on the following components:

In addition, the following components are needed to set up slob environment:

Consult your operating system documentation and these components’ websites for installation instructions.

For example, on Ubuntu 20.04, the following command installs required packages:

sudo apt update
sudo apt install python3 python3-icu python3.8-venv git

Create new Python virtual environment:

python3 -m venv env-slob --system-site-packages

Activate it:

source env-slob/bin/activate

Install from source code repository:

pip install git+https://github.com/itkach/slob.git

or, download source code manually:

wget https://github.com/itkach/slob/archive/master.zip
pip install master.zip

Run tests:

python -m unittest slob

Command line interface

slob.py provides basic command line interface to inspect and modify slob content.

usage: slob [-h] {find,get,info,tag} ...

positional arguments:
  {find,get,info,tag}  sub-command
    find               Find keys
    get                Retrieve blob content
    info               Inspect slob and print basic information about it
    tag                List tags, view or edit tag value
    convert            Create new slob with the same content but different
                       encoding and compression parameters
                       or split into multiple slobs

optional arguments:
  -h, --help           show this help message and exit

To see basic slob info such as text encoding, compression and tags:

slob info my.slob

To see value of a tag, for example label:

slob tag -n label my.slob

To set tag value:

slob tag -n label -v "A Fine Dictionary" my.slob

To look up a key, for example abc:

slob find wordnet-3.0.slob abc

The output should look something like this:

465 text/html; charset=utf-8 ABC
466 text/html; charset=utf-8 abcoulomb
472 text/html; charset=utf-8 ABC's
468 text/html; charset=utf-8 ABCs

The first column in the output is the blob id. It can be used to retrieve blob content (content bytes are written to stdout):

slob get wordnet-3.0.slob 465

To re-encode or re-compress slob content with different parameters:

slob convert -c lzma2 -b 256 simplewiki-20140209.zlib.384k.slob simplewiki-20140209.lzma2.256k.slob

To split into multiple slobs:

slob convert --split 4096 enwiki-20150406.slob enwiki-20150406-vol.slob

Output name enwiki-20150406-vol.slob is the name of the directory where resulting .slob files will be created.

This is useful for restricted systems that can’t use normal filesystems and have file size limits, such as SD cards on vanilla Android. Note that this command doesn’t duplicate any content, so clients must search all of these slobs when looking for shared resources such as stylesheets, fonts, JavaScript or images.

Examples

Basic Usage

Create a slob:

import slob
with slob.create('test.slob') as w:
    w.add(b'Hello A', 'a')
    w.add(b'Hello B', 'b')

Read content:

import slob
with slob.open('test.slob') as r:
    d = r.as_dict()
    for key in ('a', 'b'):
        result = next(d[key])
        print(result.content)

will print

b'Hello A'
b'Hello B'

The slob we created in this example certainly works, but it is not ideal: we neglected to specify a content type for the content we are adding. Let’s consider a slightly more involved example:

import slob
PLAIN_TEXT = 'text/plain; charset=utf-8'
with slob.create('test1.slob') as w:
    w.add('Hello, Earth!'.encode('utf-8'),
          'earth', 'terra', content_type=PLAIN_TEXT)
    w.add_alias('земля', 'earth')
    w.add('Hello, Mars!'.encode('utf-8'), 'mars',
          content_type=PLAIN_TEXT)

Here we specify MIME type of the content we are adding so that consumers of this content can display or process it properly. Note that the same content may be associated with multiple keys, either when it is added or later with add_alias.

This

with slob.open('test1.slob') as r:

    def p(blob):
        print(blob.id, blob.content_type, blob.content)

    for key in ('earth', 'земля', 'terra'):
        blob = next(r.as_dict()[key])
        p(blob)

    p(next(r.as_dict()['mars']))

will print

0 text/plain; charset=utf-8 b'Hello, Earth!'
0 text/plain; charset=utf-8 b'Hello, Earth!'
0 text/plain; charset=utf-8 b'Hello, Earth!'
1 text/plain; charset=utf-8 b'Hello, Mars!'

Note that the blob id for the first three keys is the same: they all point to the same content item.

Take a look at tests in slob.py for more examples.

Software and Dictionaries

Slob File Format

Slob

| Element | Type | Description |
|---|---|---|
| magic | fixed size sequence of 8 bytes | Bytes 21 2d 31 53 4c 4f 42 1f: string !-1SLOB followed by ascii unit separator (ascii hex code 1f) identifying slob format |
| uuid | fixed size sequence of 16 bytes | Unique slob identifier (RFC 4122 UUID) |
| encoding | tiny text (utf8) | Name of text encoding used for all other text elements: tag names and values, content types, keys, fragments |
| compression | tiny text | Name of compression algorithm used to compress storage bins. slob.py understands following names: bz2, zlib which correspond to Python module names, and lzma2 which refers to raw lzma2 compression with LZMA2 filter (this is default). Empty value means bins are not compressed. |
| tags | char-sized sequence of tags | Tags are text key-value pairs that may provide additional information about slob or its data. |
| content types | char-sized sequence of content types | MIME content types. Content items refer to content types by id. Content type id is 0-based position of content type in this sequence. |
| blob count | int | Number of content items stored in the slob |
| store offset | long | File position at which store data begins |
| size | long | Total file byte size (or sum of all files if slob is split into multiple files) |
| refs | list of long-positioned refs | References to content |
| store | list of long-positioned store items | Store item contains number of items stored, content type id for each item and storage bin with each item’s content |
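
As an illustration of the first few header elements, the following sketch checks the magic bytes and reads the uuid, encoding and compression names from an in-memory sample. The helper names (read_tiny_text, read_header_start) and the sample data are made up for this example, and it assumes the encoding name itself is stored as UTF-8, as the "tiny text (utf8)" type suggests. It is not the slob.py reader.

```python
import io
import struct
import uuid

# 21 2d 31 53 4c 4f 42 1f: "!-1SLOB" followed by the ASCII unit separator.
MAGIC = b"!-1SLOB\x1f"


def read_tiny_text(f, encoding):
    """Read a char-sized (1 byte length prefix) sequence of encoded text bytes."""
    (length,) = struct.unpack(">B", f.read(1))
    return f.read(length).decode(encoding)


def read_header_start(f):
    """Read magic, uuid, encoding and compression (hypothetical helper)."""
    magic = f.read(len(MAGIC))
    if magic != MAGIC:
        raise ValueError("not a slob file: unexpected magic bytes %r" % magic)
    slob_id = uuid.UUID(bytes=f.read(16))
    # Assumption: the encoding name is always UTF-8 text.
    encoding = read_tiny_text(f, "utf-8")
    compression = read_tiny_text(f, encoding)
    return slob_id, encoding, compression


# Hand-built sample standing in for the beginning of a real file.
sample_id = uuid.UUID("12345678-1234-5678-1234-567812345678")
sample = MAGIC + sample_id.bytes + b"\x05utf-8" + b"\x05lzma2"

slob_id, encoding, compression = read_header_start(io.BytesIO(sample))
print(slob_id, encoding, compression)
```

A reader that sees any other magic bytes should stop right away rather than guess at the layout of the remaining elements.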

tiny text

char-sized sequence of encoded text bytes

text

short-sized sequence of encoded text bytes

large byte string

int-sized sequence of bytes
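
The three byte-sequence types above differ only in the width of their length prefix. A minimal sketch of packing and unpacking them, with hypothetical helper names pack_sized and unpack_sized:

```python
import io
import struct

# Length prefix for each type: char, short and int, all unsigned and big endian.
PREFIX = {
    "tiny text": ">B",
    "text": ">H",
    "large byte string": ">I",
}


def pack_sized(kind: str, payload: bytes) -> bytes:
    """Prefix payload with its length, using the prefix width for this kind."""
    fmt = PREFIX[kind]
    max_len = 2 ** (8 * struct.calcsize(fmt)) - 1
    if len(payload) > max_len:
        raise ValueError("%d bytes do not fit into %s" % (len(payload), kind))
    return struct.pack(fmt, len(payload)) + payload


def unpack_sized(kind: str, f) -> bytes:
    """Read one length-prefixed byte sequence from a binary file object."""
    fmt = PREFIX[kind]
    (length,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return f.read(length)


stream = io.BytesIO(pack_sized("tiny text", b"utf-8") + pack_sized("text", b"earth"))
first = unpack_sized("tiny text", stream)
second = unpack_sized("text", stream)
```

The prefix width puts an upper bound on the payload: 255 bytes for tiny text, 65535 bytes for text, and 4294967295 bytes for a large byte string.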

size type-sized sequence of items

| Element | Type |
|---|---|
| count | size type |
| items | sequence of count items |

tag

| Element | Type |
|---|---|
| name | tiny text |
| value | tiny text padded to maximum length with null bytes |

Tag values are tiny text of length 255, starting with encoded text bytes followed by null bytes. This allows modifying tag values without having to recompile the whole slob. Null bytes must be stripped before decoding value text.
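
The padding scheme can be sketched as follows. The function names are hypothetical, and the sketch assumes UTF-8 as the slob's text encoding; it only illustrates why a tag value can be overwritten in place.

```python
MAX_TINY_TEXT_LEN = 255  # largest length a 1-byte length prefix can express


def pack_tag_value(value: str, encoding: str = "utf-8") -> bytes:
    """Encode a tag value as tiny text padded with null bytes to 255 bytes."""
    data = value.encode(encoding)
    if len(data) > MAX_TINY_TEXT_LEN:
        raise ValueError("tag value is longer than 255 encoded bytes")
    padded = data.ljust(MAX_TINY_TEXT_LEN, b"\x00")
    return bytes([MAX_TINY_TEXT_LEN]) + padded


def unpack_tag_value(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a padded tag value, stripping the null padding first."""
    length = raw[0]
    return raw[1:1 + length].rstrip(b"\x00").decode(encoding)


old = pack_tag_value("Untitled")
new = pack_tag_value("A Fine Dictionary")

# Both records occupy exactly the same number of bytes,
# so one can replace the other without moving anything else in the file.
print(len(old), len(new), unpack_tag_value(new))
```

Because every value occupies the same 256 bytes on disk (one length byte plus 255 payload bytes), changing a tag never shifts the data that follows it.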

content type

text

ref

| Element | Type | Description |
|---|---|---|
| key | text | Text key associated with content |
| bin index | int | Index of compressed bin containing content |
| item index | short | Index of content item inside uncompressed bin |
| fragment | tiny text | Text identifier of a specific location inside content |
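
A ref as laid out above could be serialized and parsed roughly like this. The Ref tuple and the write_ref and read_ref helpers are names invented for this sketch, and UTF-8 is assumed as the slob's text encoding.

```python
import io
import struct
from typing import NamedTuple


class Ref(NamedTuple):
    key: str
    bin_index: int
    item_index: int
    fragment: str


def write_ref(ref: Ref, encoding: str = "utf-8") -> bytes:
    key = ref.key.encode(encoding)
    fragment = ref.fragment.encode(encoding)
    return (
        struct.pack(">H", len(key)) + key                      # key: text
        + struct.pack(">IH", ref.bin_index, ref.item_index)    # int + short
        + struct.pack(">B", len(fragment)) + fragment          # fragment: tiny text
    )


def read_ref(f, encoding: str = "utf-8") -> Ref:
    (key_len,) = struct.unpack(">H", f.read(2))
    key = f.read(key_len).decode(encoding)
    bin_index, item_index = struct.unpack(">IH", f.read(6))
    (fragment_len,) = struct.unpack(">B", f.read(1))
    fragment = f.read(fragment_len).decode(encoding)
    return Ref(key, bin_index, item_index, fragment)


original = Ref("земля", bin_index=0, item_index=3, fragment="")
parsed = read_ref(io.BytesIO(write_ref(original)))
print(parsed)
```

Several refs may carry the same bin index and item index, which is how multiple keys end up pointing at one content item.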

store item

| Element | Type | Description |
|---|---|---|
| content type ids | int-sized sequence of bytes | Each byte is a char representing content type id. |
| storage bin | list of int-positioned large byte strings without count | Content |

Storage bin doesn’t include leading int that would represent item count - item count equals the length of content type ids. Items in the storage bin are large byte strings - actual content bytes.
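
The sketch below builds an uncompressed storage bin, runs it through raw LZMA2 compression with Python's lzma module, and then picks one item back out. It is an illustration under assumptions: build_bin and read_bin_item are hypothetical helpers, item offsets are taken to be relative to the end of the position data, and how the compressed bin is framed inside a store item is not covered here.

```python
import lzma
import struct

# Raw LZMA2: no container format, just the LZMA2 filter.
LZMA2_FILTERS = [{"id": lzma.FILTER_LZMA2}]


def build_bin(items):
    """Lay out items as int offsets followed by large byte strings (no count)."""
    offsets, body = [], b""
    for content in items:
        offsets.append(len(body))
        body += struct.pack(">I", len(content)) + content
    positions = b"".join(struct.pack(">I", offset) for offset in offsets)
    return positions + body


def read_bin_item(bin_bytes, count, item_index):
    """Return the content of one item; count comes from the content type ids."""
    if not 0 <= item_index < count:
        raise IndexError(item_index)
    (offset,) = struct.unpack_from(">I", bin_bytes, 4 * item_index)
    start = 4 * count + offset  # offsets are relative to the end of position data
    (length,) = struct.unpack_from(">I", bin_bytes, start)
    return bin_bytes[start + 4:start + 4 + length]


content_type_ids = bytes([0, 0, 1])  # one byte per item
items = [b"Hello, Earth!", b"Hello, Mars!", b"<p>Hello</p>"]
count = len(content_type_ids)

raw_bin = build_bin(items)
compressed = lzma.compress(raw_bin, format=lzma.FORMAT_RAW, filters=LZMA2_FILTERS)
restored = lzma.decompress(compressed, format=lzma.FORMAT_RAW, filters=LZMA2_FILTERS)

print(read_bin_item(restored, count, 1))
```

Grouping many items into one bin before compressing is what lets the format reach a high compression ratio, at the cost of decompressing a whole bin to read a single item.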

list of position type-positioned items

| Element | Type | Description |
|---|---|---|
| positions | int-sized sequence of item offsets of type position type | Item offset specifies position in file where item data starts, relative to the end of position data |
| items | sequence of items | |
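
A positioned list makes it possible to jump straight to item N without reading the items before it. The following sketch shows one way to read it, assuming long (8 byte) positions as used for refs and store items; write_positioned_list and seek_item are hypothetical names.

```python
import io
import struct


def write_positioned_list(items, pos_fmt=">Q"):
    """Serialize items as: int count, offsets of type pos_fmt, then item data."""
    offsets, body = [], b""
    for item in items:
        offsets.append(len(body))
        body += item
    header = struct.pack(">I", len(items))
    header += b"".join(struct.pack(pos_fmt, offset) for offset in offsets)
    return header + body


def seek_item(f, list_start, index, pos_fmt=">Q"):
    """Move f to the first byte of item number index and return that position."""
    pos_size = struct.calcsize(pos_fmt)
    f.seek(list_start)
    (count,) = struct.unpack(">I", f.read(4))
    if not 0 <= index < count:
        raise IndexError(index)
    f.seek(list_start + 4 + pos_size * index)
    (offset,) = struct.unpack(pos_fmt, f.read(pos_size))
    data_start = list_start + 4 + pos_size * count  # end of position data
    f.seek(data_start + offset)
    return f.tell()


# Three tiny text items, preceded by 7 unrelated bytes to mimic a file header.
data = write_positioned_list([b"\x01a", b"\x02bb", b"\x03ccc"])
stream = io.BytesIO(b"\x00" * 7 + data)

seek_item(stream, list_start=7, index=2)
length = stream.read(1)[0]
print(stream.read(length))
```

Since refs are sorted by key and each one can be reached in constant time this way, a reader can run a binary search over keys without loading the whole list.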

char

unsigned char (1 byte)

short

big endian unsigned short (2 bytes)

int

big endian unsigned int (4 bytes)

long

big endian unsigned long long (8 bytes)
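
In Python these four primitive types map directly onto struct format strings. The dictionary name below is just for illustration:

```python
import struct

# ">" selects big endian byte order with standard sizes and no padding.
FORMATS = {
    "char": ">B",   # unsigned char, 1 byte
    "short": ">H",  # unsigned short, 2 bytes
    "int": ">I",    # unsigned int, 4 bytes
    "long": ">Q",   # unsigned long long, 8 bytes
}

sizes = {name: struct.calcsize(fmt) for name, fmt in FORMATS.items()}
print(sizes)

# Example: an int with value 465 is stored as 00 00 01 d1.
packed = struct.pack(FORMATS["int"], 465)
print(packed.hex())
```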

Design Considerations

Slob format design is influenced by old Aard Dictionary’s aard and ZIM file formats. Similar to Aard Dictionary, it supports non-exact lookups based on UCA’s notion of collation strength. Similar to ZIM, it groups and compresses multiple content items to achieve high compression ratio and can combine several physical files into one logical container. Both aard and ZIM contain vestigial elements of predecessor formats as well as elements specific to a particular use case (such as implementing offline Wikipedia content access). Slob aims to provide a minimal framework to allow building such applications while remaining a simple, generic, read-only data store.

No Format Version

Slob header doesn’t contain explicit file format version number. Any incompatible changes to the format should be introduced in a new file format which will get its own identifying magic bytes.

No Content Checksum

Unlike aard and ZIM file formats, slob doesn’t contain content checksum. File integrity can be easily verified by employing standard tools to calculate content hash. Inclusion of pre-calculated hash into the file itself prevents using most standard tools and puts burden of implementing hash calculation on every slob reader implementation.
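
For instance, a hash of the whole file can be computed with Python's hashlib, reading in chunks so that large slobs do not have to fit in memory. The function name and the choice of SHA-256 are assumptions made for this sketch; any standard hashing tool serves the same purpose.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting digest can be published next to the .slob file and compared after download.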