microsoft/ALEX

Could I have the four datasets?

kaiwang19 opened this issue · 5 comments

Hi, I am wondering if I could get the four datasets in the paper?

longitudes
longlat
lognormal
YCSB
The given sample dataset is only 200M, which is part of the longitudes dataset.
I have also tried to extract a dataset from OpenStreetMap myself, but I assume there is a specific strategy you used to select the longitudes and latitudes. Could you say a little about this strategy? Thanks.

You can now find links to the datasets in the README.

The longitude and latitude values should be GPS coordinates from randomly-selected locations in OSM. But I did not generate the longitude and latitude values myself, so I don't know the exact selection procedure.

Thanks a lot!

Dear Jialing,

I found that the lognormal and YCSB datasets cannot be bulk loaded properly. Could you double-check whether these two datasets are the original ones used in the paper?

For the lognormal dataset, there are 190M keys.

  • If you bulk load the lognormal dataset with fewer than 629,145 keys, everything is fine.
  • But if I bulk load the lognormal dataset with more than 629,145 keys, ALEX suddenly goes out of control.
    To be specific, I have tested the following key counts for bulk loading:
  • Bulk loading the lognormal dataset with 600,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
  • Bulk loading the lognormal dataset with 620,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
  • Bulk loading the lognormal dataset with 630,000 keys: ALEX has 855 model nodes and 856 data nodes, and the maximum depth is 855.
  • Bulk loading the lognormal dataset with 700,000 keys: ALEX cannot run and fails with "Segmentation fault (core dumped)".

For the YCSB dataset, there are 200M keys.

  • If you bulk load the YCSB dataset with fewer than 629,145 keys, everything is fine.
  • But if I bulk load the YCSB dataset with more than 629,145 keys, ALEX suddenly goes out of control.
    To be specific, I have tested the following key counts for bulk loading (a minimal sketch of how I bulk load is shown after this list):
  • Bulk loading the YCSB dataset with 600,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
  • Bulk loading the YCSB dataset with 620,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
  • Bulk loading the YCSB dataset with 630,000 keys: ALEX has 855 model nodes and 856 data nodes, and the maximum depth is 855.
  • Bulk loading the YCSB dataset with 700,000 keys: ALEX cannot run and fails with "Segmentation fault (core dumped)".
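For reference, here is roughly how I bulk load the keys. This is only a minimal sketch, not my exact code: the key type alias, the file name, the dummy payloads, and the assumption that the dataset file is a raw array of 64-bit keys are all placeholders.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <utility>
#include <vector>

#include "alex.h"

using KeyType = std::int64_t;  // the key type I currently compile with (placeholder)

int main() {
  const int num_keys = 700000;

  // Read the first num_keys keys from the binary dataset file.
  // (Assumption: the file is a raw array of 64-bit keys; the path is a placeholder.)
  std::vector<KeyType> keys(num_keys);
  std::ifstream in("ycsb-200M.bin", std::ios::binary);
  in.read(reinterpret_cast<char*>(keys.data()), num_keys * sizeof(KeyType));

  // bulk_load expects a key-sorted array of (key, payload) pairs.
  std::vector<std::pair<KeyType, KeyType>> values(num_keys);
  for (int i = 0; i < num_keys; i++) {
    values[i] = {keys[i], static_cast<KeyType>(i)};  // dummy payloads
  }
  std::sort(values.begin(), values.end());

  alex::Alex<KeyType, KeyType> index;
  index.bulk_load(values.data(), num_keys);
  return 0;
}
```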

I then debugged the code to see what happens. The problem is at line 731 of alex.h: the if condition there checks whether num_keys <= derived_params_.max_data_node_slots * data_node_type::kMinDensity_.
derived_params_.max_data_node_slots is 1,048,576 and data_node_type::kMinDensity_ is 0.6, so bulk loading fewer than 1,048,576 * 0.6 = 629,145.6 keys is fine, but if lognormal or YCSB has more keys than that, ALEX cannot handle them.
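A quick check of that threshold (the two constants are the values I observed in the debugger; they may differ with other template parameters):

```cpp
#include <iomanip>
#include <iostream>

int main() {
  // Values observed while debugging alex.h.
  const long long max_data_node_slots = 1048576;  // derived_params_.max_data_node_slots
  const double k_min_density = 0.6;               // data_node_type::kMinDensity_

  // At or below this key count, bulk_load takes the single-data-node path
  // (matching the "0 model nodes, 1 data node" results above); above it,
  // ALEX builds model nodes, which is where lognormal and YCSB break for me.
  std::cout << std::fixed << std::setprecision(1)
            << max_data_node_slots * k_min_density << std::endl;  // 629145.6
  return 0;
}
```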

The weird thing is that when I test the same numbers of keys on longitudes and longlat, everything is fine:

  • Bulk loading the longitudes dataset with 630,000 keys: ALEX has 8 model nodes and 8,823 data nodes, and the maximum depth is 2.
  • Bulk loading the longlat dataset with 630,000 keys: ALEX has 791 model nodes and 22,404 data nodes, and the maximum depth is 3.

I therefore wonder whether the lognormal and YCSB datasets are correct, or whether I should set some parameters specifically for these two datasets. Thanks.

I can't reproduce these errors. Can you try running the benchmark executable, as described in the README? For example, to bulk load 700K keys from YCSB, change line 16 of src/benchmark/main.cpp to #define KEY_TYPE uint64_t, then run this command:

./build/benchmark \
--keys_file=[path to location of YCSB dataset, might need to be an absolute path] \
--keys_file_type=binary \
--init_num_keys=700000 \
--total_num_keys=1000000 \
--batch_size=100000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats

Thank you so much, the problem is solved now. I used int64_t before, so I could not succeed. After changing int64_t to uint64_t, everything works fine. Thank you.
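For anyone who hits the same problem: my guess (an assumption on my part, not verified against the dataset generators) is that these keys span the full unsigned 64-bit range, so reinterpreting them as int64_t makes the upper half of the keys negative and changes their relative order, which the index then mishandles. A minimal illustration of the reinterpretation:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Two keys that are correctly ordered as unsigned 64-bit integers.
  std::uint64_t a = 0x7FFFFFFFFFFFFFFFULL;  // largest value that fits in int64_t
  std::uint64_t b = 0x8000000000000000ULL;  // one larger; negative when reinterpreted

  std::cout << (a < b) << std::endl;  // 1: unsigned comparison, correct order
  std::cout << (static_cast<std::int64_t>(a) < static_cast<std::int64_t>(b))
            << std::endl;             // 0: signed comparison, order is inverted
  return 0;
}
```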