gvinciguerra/PGM-index

Out-of-range predictions and undefined behavior

RyanMarcus opened this issue · 1 comments

Hello! Awesome work on the PGM index, this is really cool stuff! I appreciate you taking the time to release the code as well.

I tried using the PGM-index on a dataset of Facebook user IDs, and got a wrong result and found some undefined behavior with valgrind. It is possible that I'm doing something silly.

For reference, you can download the dataset here: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/Y54SI9&version=3.0

The dataset is formatted as densely packed unsigned 64-bit ints, with the first value being the number of items in the dataset (200M).

Here's the driver program I used to read the data and try out the PGM index:

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <iterator>
#include "pgm_index.hpp"

static std::vector<uint64_t> load_data(const std::string& filename) {
  std::vector<uint64_t> data;

  std::ifstream in(filename, std::ios::binary);
  if (!in.is_open()) {
      std::cerr << "unable to open " << filename << std::endl;
      exit(EXIT_FAILURE);
  }
  
  // Read size.
  uint64_t size;
  in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
  data.resize(size);
  
  // Read values.
  in.read(reinterpret_cast<char*>(data.data()), size*sizeof(uint64_t));
  in.close();

  return data;
}


int main(int argc, char **argv) {
    // Generate some random data
    std::vector<uint64_t> dataset = load_data("fb_200M_uint64");
    std::cout << "Dataset is loaded." << std::endl;

    // Construct the PGM-index
    const int error = 128;
    PGMIndex<uint64_t, error> index(dataset);

    std::cout << "PGM index constructed." << std::endl;
    
    // Query the PGM-index
    uint64_t q = 1040094788556900096;
    auto approx_range = index.find_approximate_position(q);
    std::cout << approx_range.lo << " to " << approx_range.hi << "\n";
        
    auto lo = dataset.begin() + approx_range.lo;
    auto hi = dataset.begin() + approx_range.hi;

    auto pos = std::distance(dataset.begin(), std::lower_bound(lo, hi, q));
    std::cout << "search returned:  " << pos << "\n";

    auto corr_pos = std::distance(dataset.begin(), std::lower_bound(dataset.begin(), dataset.end(), q));
    std::cout << "correct position: " << corr_pos << "\n";

    return 0;
}

The program outputs:

50766967 to 50767223
search returned:  50766967
correct position: 22806066

... indicating that the PGM index predicted the location of the search key far more than 128 positions from the correct position.

I added -g to the build flags and ran Valgrind. The PGM index is constructed without any errors, but then the lookup procedure seems to do several invalid reads.

[ryan@ryan-arch-tower pgm_test]$ valgrind ./pgm 
==728290== Memcheck, a memory error detector
==728290== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==728290== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==728290== Command: ./pgm
==728290== 
==728290== Warning: set address range perms: large range [0x59c99040, 0xb927a040) (undefined)
==728290== Warning: set address range perms: large range [0x59c9b037, 0xb927a040) (defined)
Dataset is loaded.
PGM index constructed.
==728290== Invalid read of size 8
==728290==    at 0x10B080: find_segment_for_key (pgm_index.hpp:260)
==728290==    by 0x10B080: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10B080: main (main.cpp:43)
==728290==  Address 0x5502390 is 80 bytes inside an unallocated block of size 608 in arena "client"
==728290== 
==728290== Invalid read of size 8
==728290==    at 0x10AD41: operator()<long unsigned int> (pgm_index.hpp:58)
==728290==    by 0x10AD41: find_segment_for_key (pgm_index.hpp:265)
==728290==    by 0x10AD41: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10AD41: main (main.cpp:43)
==728290==  Address 0x545d4a0 is 224 bytes inside an unallocated block of size 480 in arena "client"
==728290== 
==728290== Invalid read of size 8
==728290==    at 0x10AD47: operator()<long unsigned int> (pgm_index.hpp:58)
==728290==    by 0x10AD47: find_segment_for_key (pgm_index.hpp:265)
==728290==    by 0x10AD47: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10AD47: main (main.cpp:43)
==728290==  Address 0x545d4a8 is 232 bytes inside an unallocated block of size 480 in arena "client"
==728290== 
50766967 to 50767223
search returned:  50766967
correct position: 22806066
==728290== Warning: set address range perms: large range [0x59c99028, 0xb927a058) (noaccess)
==728290== 
==728290== HEAP SUMMARY:
==728290==     in use at exit: 0 bytes in 0 blocks
==728290==   total heap usage: 399,438,306 allocs, 399,438,306 frees, 14,414,997,616 bytes allocated
==728290== 
==728290== All heap blocks were freed -- no leaks are possible
==728290== 
==728290== For lists of detected and suppressed errors, rerun with: -s
==728290== ERROR SUMMARY: 6 errors from 3 contexts (suppressed: 0 from 0)

The error at line 260 seems to be the linear scan code. I'm not sure how line 58 is incurring an invalid read, I'm guessing some inlining happened and the actual error is happening at the array lookup on line 265.

@RyanMarcus great catch! Thank you for reporting this.