Could I have the four datasets?
kaiwang19 opened this issue · 5 comments
Hi, I am wondering if I could get the four datasets in the paper?
longitudes
longlat
lognormal
YCSB
The given sample dataset is only 200M, which is part of the longitudes dataset.
I have also tried to extract the dataset from OpenStreetMap, but I assume there must be a strategy you used to select the longitudes or latitudes. Could you say a little about this strategy? Thanks.
You can now find links to the datasets in the README.
The longitude and latitude values should be GPS coordinates from randomly-selected locations in OSM. But I did not generate the longitude and latitude values myself, so I don't know the exact selection procedure.
Thanks a lot!
Dear Jialing,
I found that the lognormal and YCSB datasets cannot be bulk loaded properly. Could you double-check that these two datasets are the original ones from the paper?
For the lognormal dataset, there are 190M keys.
- If I bulk load the lognormal dataset with fewer than 629,145 keys, everything is fine.
- But if I bulk load it with more than 629,145 keys, ALEX suddenly goes out of control.
To be specific, I have tested the following numbers of keys to bulk load:
- bulk load the lognormal dataset with 600,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
- bulk load the lognormal dataset with 620,000 keys, ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
- bulk load the lognormal dataset with 630,000 keys, ALEX has 855 model nodes and 856 data nodes, and the maximum depth is 855.
- bulk load the lognormal dataset with 700,000 keys, ALEX cannot run with an error message: Segmentation fault (core dumped).
For the YCSB dataset, there are 200M keys.
- If I bulk load the YCSB dataset with fewer than 629,145 keys, everything is fine.
- But if I bulk load it with more than 629,145 keys, ALEX suddenly goes out of control.
To be specific, I have tested the following numbers of keys to bulk load:
- bulk load the YCSB dataset with 600,000 keys: ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
- bulk load the YCSB dataset with 620,000 keys, ALEX has 0 model nodes and 1 data node, and the maximum depth is 0.
- bulk load the YCSB dataset with 630,000 keys, ALEX has 855 model nodes and 856 data nodes, and the maximum depth is 855.
- bulk load the YCSB dataset with 700,000 keys, ALEX cannot run with an error message: Segmentation fault (core dumped).
I thus debugged the code to see what happened. The problem is at line 731 of alex.h: the if condition checks whether num_keys <= derived_params_.max_data_node_slots * data_node_type::kMinDensity_.
Here derived_params_.max_data_node_slots is 1,048,576 and data_node_type::kMinDensity_ is 0.6, so bulk loading up to 1,048,576 * 0.6 = 629,145.6 keys is fine, but with more keys from lognormal or YCSB, ALEX cannot handle them.
The weird thing is that when I test the same numbers of keys on longitudes and longlat, everything is fine:
- bulk load the longitudes dataset with 630,000 keys: ALEX has 8 model nodes and 8823 data nodes, and the maximum depth is 2.
- bulk load the longlat dataset with 630,000 keys: ALEX has 791 model nodes and 22404 data nodes, and the maximum depth is 3.
I thus wonder whether the lognormal and YCSB datasets are correct, or whether I should set some parameters specifically for these two datasets. Thanks.
I can't reproduce these errors. Can you try running the benchmark executable, as described in the README? For example, to bulk load 700K keys from YCSB, change line 16 of src/benchmark/main.cpp to #define KEY_TYPE uint64_t, then run this command:
./build/benchmark \
--keys_file=[path to location of YCSB dataset, might need to be an absolute path] \
--keys_file_type=binary \
--init_num_keys=700000 \
--total_num_keys=1000000 \
--batch_size=100000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats
Thank you so much, the problem is solved now. I used int64_t before, which is why it failed. After changing int64_t to uint64_t, everything works fine. Thank you.