A simple Python interface for LMDB databases.
- Getting Started
- Reading from a Database
- How Does It Work
- Specific Databases
- Creating a PyTorch Dataset
- Writing Databases
- Going Further
Install the following packages in your environment:
pip install lmdb Pillow
The Database class mimics Python's dict structure, except that it is read-only.
from database import Database

database = Database(path="/path/to/database")

# To iterate through the keys of the database.
for key in database:
    value = database[key]

# To retrieve one value.
key = database.keys[69]
value = database[key]

# To retrieve several values at once (similar to numpy.array mechanics).
keys = database.keys[69:420]
values = database[keys]
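The indexing behaviour used above (one key returns one value, a list of keys returns a list of values) can be sketched without LMDB at all. The class below is a hypothetical stand-in backed by a plain dict, not the actual Database implementation; the name ReadOnlyStore and its internals are assumptions made for illustration.

```python
class ReadOnlyStore:
    """Hypothetical stand-in for the Database class, backed by a plain dict."""

    def __init__(self, data):
        self._data = dict(data)
        # An ordered list of keys, so that it can be indexed and sliced.
        self.keys = list(self._data)

    def __iter__(self):
        return iter(self.keys)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, key):
        # A list of keys returns a list of values; anything else is one key.
        if isinstance(key, list):
            return [self._data[k] for k in key]
        return self._data[key]

    def __setitem__(self, key, value):
        # Read-only: item assignment is rejected.
        raise TypeError("this store is read-only")


store = ReadOnlyStore({f"key-{i}": i * i for i in range(10)})
single = store[store.keys[3]]       # one value
several = store[store.keys[2:5]]    # a list of values
```

Dispatching on the type of the key inside `__getitem__` is what gives the numpy-like feel; the real class may well handle more cases (tuples, arrays) than this sketch does.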
LMDB operates with binary data for both keys and values to maintain its extremely high performance and memory efficiency.
Look at the method _fetch():
- An lmdb.Transaction handle is instantiated.
- An lmdb.Cursor object is created.
- The key is encoded with key = fencode(key).
- The binary value is retrieved with value = cursor.get(key).
- The value is decoded with value = fdecode(value).
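These steps can be sketched as a standalone function. This is an illustration under assumptions, not the actual _fetch() method: `env` is assumed to be an already opened lmdb.Environment, pickle is assumed for both codecs, and raising KeyError for a missing key is a choice made here.

```python
import pickle


def fetch(env, key, fencode=pickle.dumps, fdecode=pickle.loads):
    """Read one value from `env`, assumed to be an open lmdb.Environment."""
    with env.begin() as txn:              # read-only lmdb.Transaction
        cursor = txn.cursor()             # lmdb.Cursor bound to that transaction
        raw = cursor.get(fencode(key))    # bytes, or None if the key is absent
        if raw is None:
            # Assumption of this sketch: a missing key behaves as in a dict.
            raise KeyError(key)
        return fdecode(raw)
```

With py-lmdb installed, an environment for reading could be obtained with something like `lmdb.open("/path/to/database", readonly=True, lock=False)` and passed in as `env`.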
By default, Database uses pickle to encode keys and decode values.
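One practical consequence of pickled keys, shown here as a small sketch: LMDB matches keys byte for byte, so a key has to be pickled the same way when it is written and when it is looked up. The key name and the pinned protocol number below are assumptions chosen for illustration.

```python
import pickle

key = "image_0001"  # hypothetical key

# The same key pickled under two protocols yields two different byte strings.
written_with = pickle.dumps(key, protocol=2)
looked_up_with = pickle.dumps(key, protocol=4)

same_bytes = written_with == looked_up_with  # False: a byte-wise lookup would miss
same_object = pickle.loads(written_with) == pickle.loads(looked_up_with)  # True

# Pinning the protocol on both the writing and the reading side avoids the mismatch.
PROTOCOL = 4


def fencode(obj):
    """Encode a key with a fixed pickle protocol."""
    return pickle.dumps(obj, protocol=PROTOCOL)
```

If a database was written under a different Python version, checking which pickle protocol the writer used is a reasonable first step when lookups unexpectedly fail.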
Several sub-classes already exist: ImageDatabase, LabelDatabase, ArrayDatabase, and TensorDatabase.
- ImageDatabase._fdecode() converts a value directly to a PIL.Image.
- ArrayDatabase._fdecode() converts a value directly to a np.ndarray.
- TensorDatabase._fdecode() converts a value directly to a torch.Tensor.
Example:
from database import ImageDatabase

database = ImageDatabase(path="/path/to/image/database")

# To retrieve one image.
key = database.keys[69]
value = database[key]  # <- This is a PIL.Image.

# To retrieve several values at once (similar to numpy.array mechanics).
keys = database.keys[69:420]
values = database[keys]  # <- This is a list of PIL.Image.
If you have specific needs in terms of I/O, you only have to override _fdecode() in a sub-class. It should behave like reading a regular file, except that its input is binary.
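As a sketch of such a sub-class, the example below decodes values stored as UTF-8 JSON. To stay self-contained it defines a minimal stand-in for Database that reads from a dict rather than from LMDB; that stand-in, the _fencode() name, and JsonDatabase are assumptions for illustration, and only the overridden _fdecode() is the point.

```python
import json
import pickle


class Database:
    """Minimal stand-in for the real Database class (illustration only)."""

    def __init__(self, raw):
        # Dict of encoded key -> binary value, used here instead of LMDB.
        self._raw = raw

    def _fencode(self, key):
        return pickle.dumps(key)

    def _fdecode(self, value):
        return pickle.loads(value)

    def __getitem__(self, key):
        return self._fdecode(self._raw[self._fencode(key)])


class JsonDatabase(Database):
    """Hypothetical sub-class whose values are UTF-8 encoded JSON."""

    def _fdecode(self, value):
        # `value` arrives as bytes, much like the content of a file opened in "rb" mode.
        return json.loads(value.decode("utf-8"))


raw = {pickle.dumps("sample"): b'{"width": 640, "height": 480}'}
database = JsonDatabase(raw)
record = database["sample"]
```

With the real class, the sub-class would presumably be imported from the database module and given a path, with everything except _fdecode() inherited unchanged.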
Integrating a Database with PyTorch looks like this:
from pathlib import Path
from typing import Union

from torch.utils.data import Dataset

from database import ImageDatabase, LabelDatabase


class MyDataset(Dataset):
    def __init__(self, path: Union[str, Path], transform=None):
        if not isinstance(path, Path):
            path = Path(path)
        images = path / "Images.lmdb"
        labels = path / "Labels.lmdb"
        self.images = ImageDatabase(path=images)
        self.labels = LabelDatabase(path=labels)
        self.keys = self._keys()
        self.transform = transform

    def _keys(self):
        # We assume that the keys are the same for the images and the labels.
        # Feel free to do something else if you fancy it.
        keys = sorted(set(self.images.keys).intersection(self.labels.keys))
        return keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, item):
        key = self.keys[item]
        data = {
            "image": self.images[key],
            "label": self.labels[key],
        }
        if self.transform:
            data = self.transform(data)
        return data
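Note that the transform in MyDataset receives the whole sample dict, not just the image. A minimal sketch of transforms written for that convention follows; ApplyToKey and compose are hypothetical helpers, not part of this package or of PyTorch, and the string operation merely stands in for a real image operation.

```python
class ApplyToKey:
    """Hypothetical transform: apply `fn` to one entry of the sample dict."""

    def __init__(self, key, fn):
        self.key = key
        self.fn = fn

    def __call__(self, data):
        data = dict(data)  # shallow copy, so the caller's dict is left untouched
        data[self.key] = self.fn(data[self.key])
        return data


def compose(*transforms):
    """Chain several dict transforms, applied from left to right."""

    def run(data):
        for transform in transforms:
            data = transform(data)
        return data

    return run


transform = compose(
    ApplyToKey("image", lambda image: image.upper()),  # stand-in for an image op
    ApplyToKey("label", int),
)
sample = transform({"image": "pixels", "label": "7"})
```

The resulting callable could then be passed as `MyDataset(path, transform=transform)`.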
- To write a database for images located in a directory tree, execute write_image_database.py:
python write_image_database.py --src_images SRC_IMAGES \
    --extension EXTENSION \
    --dst_database DST_DATABASE
- To write a database of labels stored in a JSON file, execute write_label_database.py:
python write_label_database.py --src_labels SRC_LABELS \
    --dst_database DST_DATABASE
- Feel free to tailor the scripts for writing databases to suit your use-case/needs.
- You are strongly encouraged to read the LMDB docs; they're straightforward and simple.
- If you have questions, make sure to read the manual first.
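For anyone tailoring a writer script, the core write loop might look like the sketch below. It is not taken from write_image_database.py or write_label_database.py: `env` is assumed to be an lmdb.Environment opened for writing, the function name is hypothetical, and pickling both keys and values is an assumption chosen to match the default decoding described earlier.

```python
import pickle


def write_items(env, items, fencode=pickle.dumps):
    """Write (key, value) pairs into `env`, assumed to be an lmdb.Environment."""
    count = 0
    # One write transaction; it commits when the block exits without an error.
    with env.begin(write=True) as txn:
        for key, value in items:
            txn.put(fencode(key), fencode(value))
            count += 1
    return count
```

With py-lmdb, the environment would typically come from `lmdb.open(path, map_size=...)`, where map_size has to be large enough to hold the whole database.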