logstash-plugins/logstash-output-google_bigquery

Enable multiple streaming workers to increase throughput

Closed this issue · 3 comments

  • Version: logstash 6.4.0, plugin 4.1.0
  • Operating System: Docker
  • Config File (if you have sensitive info, please remove it):
input {
  beats {
    port => 5044
  }
}

filter {
# some filters
}

output {
  google_bigquery {
    project_id => "XXXXXXX"
    dataset => "DATASET_NAME"
    table_prefix => "TABLE_PREFIX"
    ignore_unknown_values => true
    skip_invalid_rows => true # I am using the new patch from https://github.com/logstash-plugins/logstash-output-google_bigquery/pull/40
    json_key_file => "..."
    error_directory => "/tmp/bigquery-errors/erros"
    date_pattern => ""
    batch_size => 512
    flush_interval_secs => 5
  }
}

I have a single filebeat reading from a logfile and sending to a single logstash instance. The logfile receives more than 1000 lines per second, and it seems that the plugin cannot keep up with the load.
I worked around the issue by running multiple logstash instances and load balancing across them from filebeat (sketched below). It would be nice to have multiple streamers inside a single logstash instance, like workers.
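
For reference, the filebeat side of that workaround looks roughly like this (a minimal sketch; the hostnames are placeholders, not my real setup):

output.logstash:
  # assumption: each host runs a logstash instance with the config above, listening on 5044
  hosts: ["logstash-1:5044", "logstash-2:5044"]
  # spread batches across all listed hosts instead of sending everything to one
  loadbalance: true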

Hi @alepuccetti,

If your messages are long, it might be that one of the other conditions is being met to trigger the upload. An upload is triggered when the total number of messages is > batch_size, the time since the last upload is > flush_interval_secs, or the total request size is > batch_size_bytes (1MB by default).

Maybe try bumping batch_size up to 5000 and batch_size_bytes to 5MB to see if that reduces the throughput issues you're seeing. If that doesn't solve the problem, I can look at adding additional clients. Feel free to email me too at jlewisiii at google dot com and we can maybe get some faster back and forth.
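
Something like this, for example (just a sketch of the two settings I mentioned; keep the rest of your config as-is, and note that batch_size_bytes is specified in bytes):

output {
  google_bigquery {
    # ... your existing settings ...
    batch_size          => 5000     # flush after 5000 messages instead of 512
    batch_size_bytes    => 5000000  # ~5 MB request size limit (default is 1000000, ~1 MB)
    flush_interval_secs => 5
  }
}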

Cheers!
- Joseph

Hi @josephlewis42,

My messages are pretty short, and I can see the logging message saying Publishing 900 messages to TABLE_NAME getting printed out every second or so. Thus, it seems that messages get flushed as soon as possible.

I did tests with different batch sizes and the frequency of Publishing <BATCH_SIZE> messages to TABLE_NAME did not change, but of course, the number of rows inserted per second was higher for bigger batch sizes.
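
In other words, as rough arithmetic: if the plugin flushes about once per second, batch_size 512 caps throughput at roughly 500 rows per second, while batch_size 5000 at the same flush rate allows roughly 5000 rows per second.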

Maybe try bumping up batch_size to 5000, and batch_size_bytes to 5MB to see if that reduces the throughput issues you're seeing.

I gave it a try and it works.

I thought that the max value for the batch was 1000. But maybe it was just a suggestion.
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-google_bigquery.html#plugins-outputs-google_bigquery-batch_size

After #46, I do not think that multiple clients are necessary; they would only introduce more complexity and errors. A better solution is just to run multiple logstash instances behind a load balancer.