parallelising digestion
gouthamve opened this issue · 3 comments
Currently I am writing around 1.2K records a second, and digestion sometimes takes 5-10 seconds. I am writing the records in batches of 256, i.e. each ingest call has 256 records. I am running with -skip-compact and running digestion in a background routine every 2 seconds. I want to increase the ingestion rate to 10K records a second and am worried that digestion might not be able to keep up.
Any suggestions on how I can improve ingest rate?
What does a typical record schema look like? The ingestion/digestion rate will be a function of how much data is in those 1.2K records.
I will need to look into this question (parallelizing digestion) some more. I believe the digestion process uses locks to prevent multiple digestions from happening at once and corrupting the DB files.
Ingestion is usually just dropping a row-form file into the ingest/ directory, so the bottleneck should be the digestion phase. If you want to dig into where the time is going during digestion, you can build sybil with profiling info (make profile) and then examine the perf logs using the Go perf tools.
A digestion usually works like this:
- place a write lock on table info
- find open partial block (partial block can hold up to 65K records), place a lock on that block
- append records to that block
- sort and re-write the whole block to disk, one column at a time
- release locks
If you are digesting a partial block multiple times (say you digest at 5K, 10K, 15K, etc.), you are going to do redundant work, because each digest sorts and rewrites the whole block. It would be better to digest at 20K, 40K and 60K (3 times instead of 13 times).
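A rough way to see the difference is to count how many records get rewritten in total while one block fills up. The sketch below assumes a block cap of 65000 records and that every digest rewrites everything already in the block; rewriteCost is a hypothetical helper, not part of sybil.

```go
package main

import "fmt"

// rewriteCost returns how many digests happen while a block fills up to
// blockSize when digesting every step records, and how many records get
// rewritten in total (each digest rewrites everything in the block so far).
func rewriteCost(blockSize, step int) (digests, rewritten int) {
	for filled := step; filled <= blockSize; filled += step {
		digests++
		rewritten += filled
	}
	return digests, rewritten
}

func main() {
	for _, step := range []int{5000, 20000} {
		d, r := rewriteCost(65000, step)
		fmt.Printf("step=%d digests=%d records rewritten=%d\n", step, d, r)
	}
}
```

Under these assumptions, digesting every 5K records rewrites 455,000 records over 13 digests, while digesting every 20K rewrites 120,000 over 3 digests, so the less frequent schedule does roughly a quarter of the write work.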
One lever that can be adjusted is how big a block is: the smaller the block, the less time it takes to compact. But I'm surprised a digest is taking 5-10 seconds; it likely means there is a lot of data in the block. If you can give me the output of ls -l on a block and the -debug output from running sybil digest -debug -table <foo>, that will also help.
Based on the comment on the other issue, I think moving to set columns will help with digestion speed, but it will really depend on the shape of your data. I recommend trying set columns out and seeing how fast the digestion process is compared to your current scheme.
I saw redbull - cool repo! (and nice usage of sybil :)