ClickHouse/ch-go

Streams leading to incomplete parts

a-dot opened this issue · 3 comments

a-dot commented

I'm experimenting with streaming blocks of data into the server, and I can't seem to get the full stream in before the server flushes the data to a part.

For example, I'm sending blocks of 100 rows and returning io.EOF after 10_000 blocks. I would expect the server to create a part after reaching 1_000_000 rows, but instead it created these parts:

part with 98400 rows
part with 98600 rows
part with 98500 rows
part with 98400 rows
part with 98500 rows
part with 98400 rows
part with 98500 rows
part with 98500 rows
part with 98400 rows
part with 98500 rows
part with 15300 rows

That adds up to a total of 1_000_000 rows as expected, but why is the server flushing to a part before the client returns io.EOF (and, I assume, sends an empty block to end the stream)?

Is there a server-side setting that can be adjusted to control how often the data is flushed to a part, ideally so it flushes less often and creates bigger parts?
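For reference, the streaming insert is set up roughly like this (a minimal sketch following ch-go's OnInput streaming pattern; the address, table name, and single id column are placeholders, not my actual schema):

```go
package main

import (
	"context"
	"io"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	c, err := ch.Dial(ctx, ch.Options{Address: "127.0.0.1:9000"})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	var (
		id     proto.ColUInt64 // placeholder column; the real table has more
		blocks int
	)
	if err := c.Do(ctx, ch.Query{
		Body: "INSERT INTO test_stream (id) VALUES",
		Input: proto.Input{
			{Name: "id", Data: &id},
		},
		// OnInput is called to fill the next data block before it is
		// encoded and sent; returning io.EOF ends the input stream.
		OnInput: func(ctx context.Context) error {
			if blocks >= 10_000 { // 10_000 blocks * 100 rows = 1_000_000 rows
				return io.EOF
			}
			blocks++
			id.Reset() // columns are not reset automatically
			for i := 0; i < 100; i++ {
				id.Append(uint64(i)) // placeholder values
			}
			return nil
		},
	}); err != nil {
		panic(err)
	}
}
```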

Thanks!

Hello, I'm not sure there is any guarantee that parts will correspond to data blocks. Where did you get that assumption?

Is there any data missing, or is the problem only the part sizes?

a-dot commented

No data is actually missing; all 1_000_000 rows made it. I just assumed the server would write the data to a part only after receiving an empty block and the end of the stream. Clearly my assumption was wrong, but there seems to be a setting for everything in ClickHouse, so I was wondering if you know of a server-side setting that controls this, or whether anything can be done on the client side to delay flushing to a part until the actual end of the stream?

I'm not aware of such a setting.

Please ask in https://clickhouse.com/slack or https://telegram.me/clickhouse_en, but IMO it is not possible and ClickHouse does not provide such guarantees; moreover, it can merge those parts in the background.

Also, AFAIK ClickHouse can persist data even if the query was cancelled, i.e. you can stream 10GB of data, cancel the query, and most of your data will still be persisted; the insert is not atomic.

I will close this issue because we can't do much about it in ch-go.