Coded with love and coffee ☕ by Adrian Cosma. But I need more coffee!
AcumenIndexer
is designed to help with organizing various big datasets with many small instances into a common, highly efficient format, enabling random accessing, using either RAM or HDD for storing binary data chunks.
Currently, the way storing and accessing data is performed is inefficient, especially for begginer data scientists, each practitioner having its own way of doing things. It is not always possible to store the whole dataset in RAM memory, so a usual approach is resorting to splitting each training instance in a separate file. Datasets comprised of many images or small files are very difficult to handle in practice (i.e., transferring the dataset through ssh, zipping takes a long time). Many files in a single folder can lead to performance issues on certain filesystems and lead to crashes.
A simple way to overcome the issue of big dataset with many small instances is to store in RAM only the metadata and the index, and use a random access mechanism for big binary chunks of data on disk.
We make use of the native Python I/O operations of f.seek()
, f.read()
to read and write from large binary chunk files. We build a custom index based on byte offsets to access any training instance in O(1). Chunks can be mmap()
-ed into RAM if memory is available to speed up I/O operations.
Install the pypi package via pip:
pip install -U acumenindexer
Alternatively, install directly via git:
pip install -U git+https://github.com/cosmaadrian/acumen-indexer
def data_read_fn(path):
# read image from file
image = cv2.imread(path) # or something like this
# must return (data:numpy.ndarray, metadata:dict)
return image, x
file_names = [x for x in os.listdir('images/')]
ai.split_into_chunks(
data_list = file_names,
read_fn = data_read_fn,
output_path = 'my_data',
chunk_size_bytes = 5 * 1024 * 1024, #5MB
use_gzip = False,
dtype = np.float16,
n_jobs = 1,
)
import numpy as np
import acumenindexer as ai
the_index = ai.load_index('index.csv') # just a pd.DataFrame
# in_memory = False reads directly from chunk in O(1) using f.seek()
# in_memory = True uses mmap to map the data into RAM
read_fn = ai.read_from_index(the_index, dtype = np.float16, in_memory = True, use_gzip = False)
for i in range(10):
data = read_fn(i)
print(data) # contains both metadata and actual binary data
from torch.utils.data import Dataset
import acumenindexer as ai
class CustomDataset(Dataset):
def __init__(self, index_path):
self.index = ai.load_index(the_index, dtype = np.float16, in_memory = True, use_gzip = False)
self.read_fn = ai.read_from_index(self.index)
def __len__(self):
return len(self.index)
def __getitem__(self, idx):
data = self.read_fn(idx)
return data
This repository uses MIT License.