ClickHouse/clickhouse-js

more efficient stream parsing

movy opened this issue · 9 comments

movy commented

Hello,
I continuously stream massive amounts of market data from ClickHouse, and after profiling my app I discovered that most profiler ticks are spent in stream parsing, i.e.:

    const rows = await this.#client.query({
        query: `
            SELECT
                *
            FROM...
        `,
    })

    const stream = rows.stream()

    stream.on('data', async (rows) => {
        for (const row of rows) {
            const [
                event,
                time,
                ...
            ] = await row.json()
        }
    })

I briefly checked the module's source code, and it looks like the quite expensive JSON.parse() is used internally even when we call text().

My question would be: **is there a faster way to parse the incoming stream, maybe by using some raw output format?** Or am I reading streams wrong?

You could try CSV/TSV formats for selects and do your own row parsing after calling row.text.
row.text does not call JSON.parse, it is for row.json() only: https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-node/src/result_set.ts#L77-L81
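A minimal sketch of that approach, assuming the query was issued with `format: 'TabSeparated'` and each raw line comes from `row.text` (the column names and types below are hypothetical examples, not from the original snippet):

```typescript
// Sketch: manual per-row parsing of TabSeparated output.
// The columns (event, time, price) are hypothetical.
interface Tick {
  event: string
  time: number
  price: number
}

function parseTsvRow(line: string): Tick {
  const cols = line.split('\t')
  return {
    event: cols[0],
    time: Number(cols[1]), // convert only the fields you actually need
    price: Number(cols[2]),
  }
}

const tick = parseTsvRow('trade\t1700000000\t42.5')
console.log(tick.time, tick.price) // 1700000000 42.5
```

The conversion cost then stays proportional to the fields you actually use, instead of JSON-parsing every value of every row.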

I am also considering adding RowBinary to supported formats, at least for selects.

movy commented

Thanks for your advice. I wrote a simple datatype-aware parser that splits TSV and converts "numbers" to numbers, and speed improved by at least 10x, amazing.
Just a thought: as the structure of JSON is known based on columns types, maybe something like simdjson can be used?

@movy, I think a RowBinary implementation could yield even more significant gains. See https://github.com/luizperes/simdjson_nodejs#benchmarks - simdjson is 1.5-2x faster than JSON.parse. It is probably worth adding as an optional switch; I will eventually look at what's possible here.

I am considering, however, adding RowBinary along with a third "get data" method on the Row object: something like array(), maybe? That would return all values of a particular row in the correct order, utilizing the column metadata that we will also receive. And the overhead should be pretty minimal.

movy commented

Yes, array() would be great as long as the type conversion is not done in JS.
In my case, I receive an array of ~20 values via TabSeparated, and as most of them are expected to be numbers, I need to parse them. Parsing the whole array is even slower than JSON.parse(), so I selectively parse only the 3-5 values needed for a particular event, and the speed gains are very noticeable. It would be nice to avoid parsing altogether, both for code simplicity and for performance.
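One way to sketch that selective parsing (the row contents and column indices below are hypothetical): scan to the n-th tab instead of splitting all ~20 columns, and call Number() only on the fields the event actually uses.

```typescript
// Sketch: pull a single column out of a TSV line without splitting
// the whole row. Column indices are hypothetical.
function nthField(line: string, n: number): string {
  let start = 0
  for (let i = 0; i < n; i++) {
    const next = line.indexOf('\t', start)
    if (next === -1) return '' // fewer columns than expected
    start = next + 1
  }
  const end = line.indexOf('\t', start)
  return end === -1 ? line.slice(start) : line.slice(start, end)
}

const row = 'trade\tBTCUSD\t1700000000\t42.5\t100'
const time = Number(nthField(row, 2)) // parse only what this event needs
const price = Number(nthField(row, 3))
console.log(time, price) // 1700000000 42.5
```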

@movy, I am curious to run some internal tests on this (JSON vs. Text + "manual" numbers parsing vs. RowBinary + row as an array). Can you please contact me in the community Slack and share the code snippets and, maybe, a sample dataset (or its definitions) for your scenarios?

@slvrtrn @movy I am running into a similar scenario where JSON.parse isn't efficient enough for the result set I'm working with. I played around with the options described in this thread and measured some improvement (particularly using simdjson instead of JSON.parse), but was curious to see how your testing is going and whether there's an optimized strategy for these cases.

I wanted to post an update here, as this issue looks stale while it is not.

I've been studying RowBinary format and writing an experimental implementation of it with select streaming support out of the box for quite some time already; current progress is here: 1.0.0...row_binary

At the moment of writing this, that branch supports decoding a minimal set of useful data types:

  • Boolean
  • (U)Int8/16/32 as a number
  • Float32/64 as a number
  • (U)Int64/128/256 as a native JS BigInt
  • String
  • Date/Date32 as a JS Date
  • Nullable
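To illustrate why decoding these types is cheap, here is a sketch of the RowBinary wire encoding for two of the types above (this is not the branch's actual decoder): fixed-width numbers are little-endian, and String is a LEB128 varint length followed by UTF-8 bytes.

```typescript
// Sketch of RowBinary decoding for UInt32 and String.
// Each decoder returns [value, nextOffset].
function decodeUInt32LE(buf: Buffer, offset: number): [number, number] {
  return [buf.readUInt32LE(offset), offset + 4]
}

function decodeString(buf: Buffer, offset: number): [string, number] {
  // LEB128 varint: 7 data bits per byte, high bit = "more bytes follow"
  let len = 0
  let shift = 0
  while (true) {
    const byte = buf[offset++]
    len |= (byte & 0x7f) << shift
    if ((byte & 0x80) === 0) break
    shift += 7
  }
  return [buf.toString('utf8', offset, offset + len), offset + len]
}

// A hypothetical (UInt32, String) row:
const buf = Buffer.concat([
  Buffer.from([0x2a, 0x00, 0x00, 0x00]), // 42 as UInt32 LE
  Buffer.from([0x02]),                   // string length 2
  Buffer.from('hi', 'utf8'),
])
const [id, next] = decodeUInt32LE(buf, 0)
const [name] = decodeString(buf, next)
console.log(id, name) // 42 hi
```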

Currently in progress, as I consider that the following types should be supported straight away (there are draft decoders/placeholders for some already): Decimal, Enum, UUID, FixedString, DateTime(64), Array, Map, Tuple.

NB: under the hood, Date is just a UInt16 (number of days since the Unix Epoch), and Date32 is just an Int32 (also days, but it can represent dates before the Epoch as well). JS Date is chosen only as the default option for these types; there will be a mapping option to convert the "days" numeric value to any other object.

For example, consider the Date type. The mapper is just:

<T>(daysSinceEpoch: number) => T

And, by default, if no custom mapper is provided, it is something like this:

const defaultDateMapper = (daysSinceEpoch: number) => new Date(daysSinceEpoch * DayMillis)

Similarly, Date32, DateTime, DateTime64 (especially DateTime(9) as it requires proper nanoseconds support), and Decimal will have optional mappers support from their decoded "raw" representation.
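For illustration, a custom mapper under that contract might skip the default JS Date entirely (isoDateMapper is a hypothetical name, not part of the branch):

```typescript
// Sketch of a hypothetical custom Date mapper: map the raw day count
// straight to a UTC 'YYYY-MM-DD' string instead of a JS Date object.
const DayMillis = 24 * 60 * 60 * 1000

const isoDateMapper = (daysSinceEpoch: number): string =>
  new Date(daysSinceEpoch * DayMillis).toISOString().slice(0, 10)

console.log(isoDateMapper(0))     // 1970-01-01
console.log(isoDateMapper(19723)) // 2024-01-01
```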


@movy @chrisjh (and @cjk, I saw your reaction, too!), if you are still interested, I'd be curious to learn a few things:

  1. What are the table definitions of your datasets where selecting data in JSON format is too slow? I'd also appreciate it if you could share some of these with me (don't hesitate to ping me in the community Slack!)
  2. Directly related to 1 - what is a minimal set of supported data types you would like to see implemented for RowBinary in the client so that it could already be helpful in your scenario?
  3. Are you selecting or inserting the data into these tables?

Let's continue tracking this issue here: #216; please don't hesitate to comment. Cheers.

@chrisjh, for your request, I created this issue to track: #246