It does not work at all.
alexey-milovidov opened this issue · 3 comments
I'm trying LocustDB on a clean Ubuntu 22.04 VM on AWS:
#!/bin/bash
# https://rustup.rs/
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
sudo apt-get update
sudo apt-get install -y git
git clone https://github.com/cswinter/LocustDB.git
cd LocustDB
sudo apt-get install -y g++ capnproto libclang-14-dev
cargo build --features "enable_rocksdb" --features "enable_lz4" --release
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.csv.gz'
gzip -d hits.csv.gz
target/release/repl --load hits.csv --db-path db
# Loaded data in 920s.
# Table `default` (99997496 rows, 15.0GiB)
# SELECT * FROM default LIMIT 1
# And it immediately panicked and hung:
#locustdb> SELECT * FROM default LIMIT 1
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
#thread '<unnamed>' panicked at 'index out of bounds: the len is 65536 but the index is 65536', src/stringpack.rs:91:15
It's unclear how to preserve data upon restart.
It's unclear how to define table structure.
Nevertheless, a simple query after loading panics.
Also, it gives strange messages:
# Table `default` (99997496 rows, 15.0GiB) #
2013-07-15: 0.92KiB
-1216690514: 0.14MiB
0: 1.6GiB
4: 61KiB
17: 0.77KiB
9110818468285196899: 0.92MiB
-2461439046089301801: 0.20MiB
�O: 57KiB
2013-07-14 20:38:47: 0.69MiB
�
: 53KiB
�: 28KiB
-1001831330: 0.12MiB
-296158784638538920: 0.42MiB
5: 33KiB
-8417682003818480435: 0.56MiB
: 13GiB
3793327: 0.10MiB
NH: 0.60KiB
2013-07-15 10:47:34: 0.69MiB
839: 81KiB
-1: 67MiB
1971-01-01 14:16:06: 0.70MiB
1: 0.52KiB
# Table `_meta_tables` (2 rows, 42.0B) #
timestamp: 1.0B
name: 41B
That makes me suspect it is not memory-safe.
Ah yes, I think you've found a bug that is triggered when input strings contain null bytes. Looks like it should be relatively straightforward to fix and improve performance as well.
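Until a fix is available, it may help to confirm that the input really contains null bytes and, if losing them is acceptable, to load a copy with them removed. The following is a minimal sketch using only standard `tr` and `wc`; the function names and the output file name are hypothetical and not part of LocustDB, and stripping bytes does change the data.

```shell
#!/bin/bash
# Hypothetical helpers (not part of LocustDB) for inspecting NUL bytes in an input file.

# Print the number of NUL (0x00) bytes in the file given as $1.
count_nul_bytes() {
    # tr -cd deletes every byte except NUL, so wc -c counts only NUL bytes.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Write a copy of file $1 to $2 with every NUL byte removed.
# Note: this alters the data, so it is a workaround rather than a fix.
strip_nul_bytes() {
    tr -d '\000' < "$1" > "$2"
}

# Example usage (hits.csv is the file from the report above;
# hits_nonul.csv is an assumed output name):
#   count_nul_bytes hits.csv
#   strip_nul_bytes hits.csv hits_nonul.csv
```

A non-zero count would be consistent with the null-byte explanation, though it does not by itself prove that this is the only trigger for the panic.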
It's unclear how to preserve data upon restart.
Just running target/release/repl --db-path db should show all the data previously loaded into db.
It's unclear how to define table structure.
One of the nice things about LocustDB is that you don't actually need to explicitly specify a schema; everything just happens automatically. There is some support for forcing columns to be interpreted as a certain type when loading data; see the --schema option.
Another thing I just noticed: some of the strange output is because LocustDB assumes that the first row in the CSV is a header with the column names. To get actual column names, you can add a header to the CSV or use the --schema option.
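Adding a header to a headerless CSV can be sketched as follows. This is only an illustration: the function names, the output file name, and the placeholder column names (col0, col1, ...) are assumptions, and the column count has to be supplied by the caller. The format expected by the --schema option is not shown here because it is not described in this thread.

```shell
#!/bin/bash
# Hypothetical helpers (not part of LocustDB) for adding a header row to a CSV.

# Print a header row "col0,col1,...,col{N-1}" for N columns ($1 = N).
make_header() {
    local n="$1"
    # seq emits 0..N-1, sed prefixes each with "col", paste joins them with commas.
    seq 0 $((n - 1)) | sed 's/^/col/' | paste -sd, -
}

# Write CSV $1 to $3 with a generated header of $2 columns prepended.
prepend_header() {
    { make_header "$2"; cat "$1"; } > "$3"
}

# Example usage (105 matches the col0..col104 range seen in this thread;
# hits_with_header.csv is an assumed output name):
#   prepend_header hits.csv 105 hits_with_header.csv
```

Replacing the placeholder names with the dataset's real column names would make query results easier to read.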
Things seem to be working with the fix in #153:
locustdb> SELECT COUNT(1), col89 FROM default;
Scanned 100.0 million rows in 17.1ms (5.8 billion rows/s)!
col89 | COUNT(1)
------+----------
"0U�" | 26
"5eL" | 1
"NH�" | 99995421
"R.�" | 81
"ZBT" | 57
"cHx" | 43
"iPP" | 1
"vUP" | 1770
"�J8" | 15
"�ht" | 42
"�o" | 40
locustdb> SELECT * FROM default LIMIT 1;
Scanned 65.5 thousand rows in 117ms (0.56 million rows/s)!
col39 | col21 | col10 | col74 | col3 | col71 | col75 | col79 | col81 | col15 | col45 | col48 | col78 | col85 | col89 | col103 | col61 | col8 | col1 | col18 | col7 | col28 | col47 | col27 | col56 | col22 | col34 | col23 | col69 | col32 | col29 | col95 | col96 | col46 | col63 | col86 | col88 | col16 | col83 | col11 | col72 | col80 | col14 | col19 | col49 | col0 | col50 | col59 | col24 | col62 | col26 | col35 | col37 | col64 | col30 | col6 | col76 | col94 | col93 | col67 | col98 | col31 | col42 | col43 | col9 | col91 | col44 | col97 | col65 | col38 | col99 | col55 | col36 | col33 | col77 | col92 | col101 | col5 | col87 | col54 | col2 | col51 | col58 | col4 | col60 | col17 | col82 | col68 | col100 | col102 | col73 | col25 | col70 | col52 | col66 | col53 | col84 | col104 | col40 | col12 | col41 | col57 | col90 | col20 | col13
------+-------+-------+-------+------+-------+-------+-------+-------+-------+-----------------------+-------+-------+-------+-------+----------------------+-------+------+------+-------+------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+----------------------------------------------+-------+-------+---------------------+------------------------+-------+-------+-------+-------+-------+-------+-----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+---------------------+-------+-------+-------+-------+-------+-------+---------------------+---------+-------+-------+-------+--------+--------------+-------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-------+-----------------------+-------+-------+-------+-------+--------+----------------------+-------+-------+------------+-------+-------+-------+-------+--------+-------+-------+-------+-----------+-------+-------+-------------------------------------------------
"" | 554 | 0 | "S0" | 1 | 10208 | "h1" | 0 | 6 | 0 | "2013-07-10 00:27:42" | 16561 | 0 | 0 | "NH�" | -3299945852400637761 | 0 | 36 | 1 | 9911 | 2088547703 | 31 | 1 | 0 | "" | 37 | "" | 15 | 0 | 0 | "D�" | "" | "" | 4 | "g" | "" | null | 16000 | 111 | 44 | -1 | 2 | "http://smeshariki.ru/page=98&rstr=тержинсы" | 216 | 0 | 7746300919266382380 | "windows-1251;charset" | 0 | 7 | 0 | 0 | null | -1 | "2013-07-10 00:06:55" | 1 | 46429 | null | "" | "" | 2 | "" | 1 | 1750 | 653 | 7841794089446734162 | "" | 135 | "" | 22 | 0 | "" | 8744056147474783115 | 2528191 | 0 | null | "" | 0 | "2013-07-10" | 0 | 0 | "Тонус 5, объявлений и фотоград - Яндекс.Афиша@Mail.Ru - Мастей в Ростей в Россия) - AUTO.ria.ua Базар автосалоне | новых кинотеатронно блин в хорошем качестве - Пульс цене, стр. 5 мини из 31 - Яндекс.net беседов Сибирск по алфавить" | 1601 | 0 | "2013-07-10 09:05:15" | 0 | 158 | 198 | 1758 | "" | -1655607031864382640 | 13 | 700 | 1737435482 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 125358366 | 0 | 1368 | "http://smeshariki.ru/users/446132.html%3Fhtml"