nmslib/hnswlib

stream load. Is it possible?

I want to load an index from a network resource, but it fails:

import requests
import io
import pickle
import hnswlib

def get_stream(url):
    response = requests.get(url)
    stream_data = response.content
    return io.BytesIO(stream_data)

model = pickle.load(get_stream('http://example.com/model')) # it works

index = hnswlib.Index(space='cosine', dim=128)
index.load_index(get_stream('http://example.com/index.hnsw')) # doesn't work

I got this error:

TypeError: load_index(): incompatible function arguments. The following argument types are supported:
    1. (self: hnswlib.Index, path_to_index: str, max_elements: int = 0, allow_replace_deleted: bool = False) -> None

Invoked with: <hnswlib.Index(space='cosine', dim=128)>, <_io.BytesIO object at 0x7fd364e557c0>
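
The signature in the error above shows that load_index accepts only a filesystem path (path_to_index: str). One possible workaround, given here as a minimal sketch, is to spool the stream into a temporary file and pass that path instead. The helper name load_from_stream and its load_fn parameter are hypothetical and not part of hnswlib. The sketch also assumes that load_index has finished reading the file by the time it returns, so the temporary file can be deleted afterwards.

```python
import os
import shutil
import tempfile


def load_from_stream(stream, load_fn, suffix=".hnsw"):
    """Copy a binary stream to a temporary file and pass its path to load_fn.

    load_fn is any callable that takes a path string, for example the
    bound method index.load_index (hypothetical usage, see the note below).
    """
    # delete=False lets the file be reopened by path after it is closed,
    # which is needed on platforms that lock open temporary files.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)  # chunked copy, works for any file-like object
        tmp_path = tmp.name
    try:
        return load_fn(tmp_path)
    finally:
        # Assumption: the loader no longer needs the file once it has returned.
        os.remove(tmp_path)
```

With the code from the question, this would be called as load_from_stream(get_stream('http://example.com/index.hnsw'), index.load_index). The cost is one extra copy of the index on disk while it loads.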

Is this a reasonable idea?

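I am not sure, but could we pass io.BytesIO as a std::ifstream? The current function takes a path:

void loadIndex(const std::string &location, SpaceInterface<dist_t> *s) {

It could take a stream instead:

    // the stream must be non-const: reading from it changes its state
    void loadIndex(std::ifstream &input, SpaceInterface<dist_t> *s) {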
        std::streampos position;

        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");
                                                             
        input.read(data_, maxelements_ * size_per_element_);
    
        input.close();
    }

As a first step, the function could be split in two:

    void loadStream(std::ifstream &input, SpaceInterface<dist_t> *s) {
        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");

        input.read(data_, maxelements_ * size_per_element_);
    }
    
    void loadIndex(const std::string &location, SpaceInterface<dist_t> *s) {
        std::ifstream input(location, std::ios::binary);
        std::streampos position;
        loadStream(input, s);
        input.close();
    }

The same would apply to this function:

void loadIndex(const std::string &location, SpaceInterface<dist_t> *s, size_t max_elements_i = 0) {

Unfortunately, we can't simply do the same there, because that function uses .seekg() and .tellg() (although we could simplify the loading code and remove them).

Also, std::ifstream may not be compatible with io.BytesIO, so we might need std::istringstream instead.

What do you think?

Take a look at #556