Jiacheng-WU/lipp

Out-of-memory exception

Closed this issue · 3 comments

When running your LIPP source code on a 1.6 GB dataset of mixed SOSD data, we found a potential bug in LIPP that makes it exhaust our memory (756 GB).

terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
zsh: abort (core dumped)  ./example_bulk_load

The code below reproduces the case. The data can be fetched from https://drive.google.com/file/d/1d3JomhUHXBCMGnRx5fTtwPrmQk-adxJS/view?usp=sharing

Would you mind taking a look and fixing it? Also, I could not find a range query interface in your code; could you tell me how to perform range queries with it, or provide a new range query interface?

Thank you very much in advance.

Best regards,
Zhicong

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], size_t length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    // Read `length` records of type T; the caller must size `data` accordingly.
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    if (!load_binary_data(keys, data_size, "../test.data")) {
        std::cerr << "failed to load ../test.data" << std::endl;
        return 1;
    }

    // Sort and deduplicate only the first bulk_load_size keys; the remaining
    // keys are left untouched and are inserted one by one below.
    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + bulk_load_size);
    auto last = std::unique(keys, keys + bulk_load_size);
    bulk_load_size = last - keys;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), bulk_load_size);

    // insert the remaining keys (these were never deduplicated against the
    // bulk-loaded ones)
    for (size_t i = 0; i < data_size - bulk_load_size; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }
    
    delete[] keys;
    return 0;
}

This might not be a bug in our code, since LIPP assumes that inserted keys are entirely distinct from the keys already in the index.

However, the dataset you provided contains two elements with the same key: one is added to the index during the bulk-load phase, and the other during the insertion phase, which is what triggers the issue. After we slightly modified the code to deduplicate the data, it runs without the OOM.
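For reference, here is a minimal sketch of such a deduplicating driver. It assumes the same "../test.data" layout and the bulk_load/insert interface shown in the reproduction code above; it is illustrative, not necessarily the exact change we made:

#include <lipp.h>
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    LIPP<uint64_t, uint64_t> lipp;
    const size_t data_size = 200000000;

    // Load the raw keys (same file as in the reproduction code).
    std::vector<uint64_t> keys(data_size);
    std::ifstream is("../test.data", std::ios::binary);
    if (!is.is_open()) return 1;
    is.read(reinterpret_cast<char *>(keys.data()),
            std::streamsize(data_size * sizeof(uint64_t)));

    // Deduplicate across the WHOLE dataset, not just the bulk-load prefix,
    // so that a later insert can never collide with a bulk-loaded key.
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    // Bulk load the first half of the unique keys...
    const size_t bulk_load_size = keys.size() / 2;
    std::vector<std::pair<uint64_t, uint64_t>> data;
    data.reserve(bulk_load_size);
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }
    lipp.bulk_load(data.data(), bulk_load_size);

    // ...and insert the remainder; every key here is guaranteed unseen.
    for (size_t i = bulk_load_size; i < keys.size(); i++) {
        lipp.insert(keys[i], keys[i]);
    }
    return 0;
}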

As for the range query, it seems we forgot to upload that code. We will need some time to organize it and upload it to a new branch later.
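In the meantime, if you need range queries for experiments, one possible user-side workaround (outside LIPP's current interface) is to keep a sorted copy of the keys next to the index and enumerate a range from it. The sketch below assumes a point-lookup method lipp.at(key), which is hypothetical here; substitute whatever lookup lipp.h actually exposes:

#include <lipp.h>
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical user-side range query over [lo, hi]: `sorted_keys` is a sorted
// copy of all keys currently in the index, and `lipp.at(key)` is an assumed
// point lookup returning the stored value.
std::vector<std::pair<uint64_t, uint64_t>>
range_query(LIPP<uint64_t, uint64_t> &lipp,
            const std::vector<uint64_t> &sorted_keys,
            uint64_t lo, uint64_t hi) {
    std::vector<std::pair<uint64_t, uint64_t>> out;
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), lo);
    for (; it != sorted_keys.end() && *it <= hi; ++it) {
        out.push_back({*it, lipp.at(*it)});
    }
    return out;
}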

Sincerely,
Jiacheng

Dear Jiacheng,

Thanks for your great work! Have you uploaded the range query code to GitHub? I could not find the corresponding branch.

Regards,
Zhicong

We are sorry; we have not uploaded it yet.
We are not currently focusing on this work and are busy with other projects,
so you may need to wait a while longer. Please be patient.

Sincerely,
Jiacheng