datafuselabs/databend

Release proposal: Nightly v1.0

BohuTANG opened this issue · 9 comments

Summary

Release name: v1.0-nightly, get on the train now ✋
Let's make Databend more of a Lakehouse!

v1.0 (preparing for release on March 5th)

Task | Status | Comments
(Query) Support Decimal data type #2931 | DONE | high-priority (release in v1.0)
(Query) Query external stage file (Parquet) #9847 | DONE | high-priority (release in v1.0)
(Query) Array functions #7931 | DONE | high-priority (release in v1.0)
(Query) Query Result Cache #10010 | DONE | high-priority (release in v1.0)
(Planner) CBO #9597 | DONE | high-priority (release in v1.0)
(Processor) Aggregation spilling #10273 | DONE | high-priority (release in v1.0)
(Storage) Alter table #9441 | DONE | high-priority (release in v1.0)
(Storage) Block data cache #9772 | DONE | high-priority (release in v1.0)

Archive releases

Reference

What are Databend release channels?
Nightly v1.0 is part of our Roadmap 2023
Community website: https://databend.rs

Is there an expected time to release v1.0?

The preliminary plan is to release in March, mainly focusing on alter table, update, and group by spill.

Hope you can simplify the way to insert data; it will help get more users.

It's already the easiest to insert into of all the similar products I've tried. How would you like to insert?
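
For reference, a rough sketch of the two insert paths available today (the table and stage names below are placeholders, not from this thread; check the docs for the exact syntax in your version):

-- Plain SQL insert into a small table:
CREATE TABLE t (id INT, name VARCHAR);
INSERT INTO t VALUES (1, 'a'), (2, 'b');

-- Bulk load of Parquet files from a named stage (my_stage is a placeholder):
COPY INTO t FROM @my_stage FILE_FORMAT = (TYPE = PARQUET);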

@BohuTANG Are there any plans for higher-performance client reads, like streaming Arrow/Parquet or some other high-performance format? I'm not familiar with other read protocols such as ClickHouse's; I've just been using the mysql connector. But it would be neat to be able to have databend in the middle while paying little overhead vs reading the raw parquet files from S3.

@haydenflinner

But it would be neat to be able to have databend in the middle while paying little overhead vs reading the raw parquet files from S3.

Databend supports an ignore_result suffix that tells the server not to send the result back to the client over the MySQL wire protocol.

For example:

mysql> select * from hits limit 20000;

20000 rows in set (0.53 sec)
Read 146370 rows, 101.91 MiB in 0.507 sec., 288.51 thousand rows/sec., 200.88 MiB/sec.

With ignore_result (the result is not sent to the client):

mysql> select * from hits limit 20000 ignore_result;
Empty set (0.26 sec)
Read 146370 rows, 101.91 MiB in 0.236 sec., 619.37 thousand rows/sec., 431.24 MiB/sec.

@BohuTANG That is neat and confirms my suspicion that the MySQL protocol is a bottleneck in some use cases. Parquet read speeds are in the GB/s, but even when telling the MySQL client not to handle the result, we only get a few hundred MB/s. This confirms the results in the paper I linked; see "Postgres++" vs "Postgres" in the final table of results.

If one wanted to use Databend as a simple intermediary between dataframes and S3 (more lakehouse style), Databend still provides a lot of value in interactive query handling, file size and metadata management, a far simpler interface, etc. But it presents a bottleneck when it comes to raw read speed. If I wanted to do, for example, df = pd.read_sql("select * from hits limit 1000000"), that would be, I think, 10x slower than df = pd.read_parquet("local-download-of-hits.parquet"), but I suspect primarily due to MySQL protocol overhead; the rest of Databend is so fast I wouldn't expect it to get in the way much. I can file a ticket for this, don't let me derail the 1.0 thread, sorry 😄

I believe the modern open source protocol most similar to what that paper describes is "Apache Arrow Flight"

Yes, we plan to do this in #9832.

If the query result is small, the MySQL client works fine, since OLTP results are commonly small.

Otherwise, we should use other formats or protocols to handle large output (the MySQL client is really bad in this case).

You can use:

  1. Unload command to unload the data in Parquet/CSV formats into storage (see the sketch after this list): https://databend.rs/doc/unload-data/
  2. HTTP/ClickHouse handler to export the data, for example:
curl 'http://default@localhost:8124/' --data-binary "select 1,2,3 from numbers(3) format TSV"
  3. Wait for the Flight SQL feature; that's the native client!
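
For option 1, the unload might look roughly like this (the stage name is a placeholder; see the unload-data docs linked above for the exact syntax):

-- Unload a query result to a named stage as Parquet files:
COPY INTO @my_stage FROM (SELECT * FROM hits LIMIT 1000000) FILE_FORMAT = (TYPE = PARQUET);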

That paper did not cover clickhouse-client, but AFAIK clickhouse-client is the best client/protocol I have ever seen.