linux-system-roles/metrics

[PCP] Add support for sending metrics to Viaq elasticsearch

sradco opened this issue · 6 comments

We need to be able to send metrics to Viaq Elasticsearch. This requires support for certificate-based authentication, Elasticsearch index parameters, buffer handling, and a back-off mechanism.

In the Rsyslog role, the request also sets the following parameters:
type="omelasticsearch"
name="{{ res.name | default('viaq-elasticsearch') }}"
server="{{ res.server_host | default('logging-es') }}"
serverport="{{ res.server_port | default(9200) | int }}"
template="viaq_template"
searchIndex="index_template"
dynSearchIndex="on"
searchType="com.redhat.viaq.common"
bulkmode="on"
writeoperation="create"
bulkid="id_template"
dynbulkid="on"
retryfailures="on"
retryruleset="try_es"
usehttps="on"

In Fluentd we set the following parameters:

@type elasticsearch
host {{ fluentd_elasticsearch_host }}
port {{ fluentd_elasticsearch_port }}
scheme https
client_cert {{ fluentd_elasticsearch_client_cert_path }}
client_key {{ fluentd_elasticsearch_client_key_path }}
ca_file {{ fluentd_elasticsearch_ca_cert_path }}
ssl_verify {{ fluentd_elasticsearch_ssl_verify|lower }}
target_index_key {{ fluentd_elasticsearch_target_index_key }}
remove_keys {{ fluentd_elasticsearch_remove_keys }}
type_name {{ fluentd_elasticsearch_type_name_metrics }}
request_timeout {{ fluentd_elasticsearch_request_timeout_metrics }}

Buffer configurations:
flush_interval {{ fluentd_flush_interval_metrics }}
buffer_chunk_limit {{ fluentd_buffer_chunk_limit_metrics }}
buffer_queue_limit {{ fluentd_buffer_queue_limit_metrics }}
buffer_queue_full_action {{ fluentd_buffer_queue_full_action_metrics }}
retry_wait {{ fluentd_retry_wait_metrics }}
retry_limit {{ fluentd_retry_limit_metrics }}
disable_retry_limit {{ fluentd_disable_retry_limit_metrics }}
max_retry_wait {{ fluentd_max_retry_wait_metrics }}
flush_at_shutdown {{ fluentd_flush_at_shutdown_metrics }}
num_threads {{ fluentd_num_threads_metrics }}
slow_flush_log_threshold {{ fluentd_slow_flush_log_threshold_metrics }}

Can you please update us on the status in PCP?
What is missing, and will it be possible to implement?

This is a blocker for oVirt.

@lberk @pcahyna @tabowling

@sradco can you expand on this a bit more '[...] Viaq we need support for cert auth authentication'?

For the PCP integration, I think this would mean submitting https requests from pcp2elasticsearch when it sends REST API requests to elasticsearch ... or have I misunderstood? Thanks.

To have pcp2elasticsearch submit HTTPS requests, I believe that involves changing the ES_SERVER entry in the pcp2elasticsearch.conf file to use "https://..." in the request URL.
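For illustration, a minimal pcp2elasticsearch.conf sketch along those lines. The `[options]` section and `es_server` key name here are assumptions based on the usual pcp2* tool configuration layout; check pcp2elasticsearch(1) on the target system for the exact spelling:

```ini
# Sketch only: section/key names are assumed, not verified against pcp2elasticsearch(1)
[options]
es_server = https://logging-es:9200
```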

From Lukas:

One thing we might consider doing, is running the pcp2elasticsearch
script on the host running elasticsearch (instead of the ovirt, rhel, or
openstack host). pcp2elasticsearch can be configured to collect from
remote pcp instances (so long as the firewall is open on the correct
port, 44321/tcp). And then push to a local elasticsearch instance.

We could configure the pcp2elasticsearch service to depend on the
elasticsearch instance.
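A dependency like that could be expressed in the unit file. A sketch, where the unit name, service names, and binary path are all illustrative and not taken from the actual ansible role:

```ini
# Hypothetical systemd unit sketch; names and paths are illustrative
[Unit]
Description=Export PCP metrics to a local Elasticsearch instance
# Start only after the local elasticsearch service is up
Requires=elasticsearch.service
After=elasticsearch.service

[Service]
ExecStart=/usr/bin/pcp2elasticsearch
Restart=on-failure

[Install]
WantedBy=multi-user.target
```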

Would that work?

This will not work. We don't want to add PCP to OpenShift Logging.

Can you please explain what will happen if Elasticsearch is unavailable?
Are error logs being written somewhere?
Are we at risk of flooding the host?

Roy Golan helped me, and it seems that, except for the type field, we can pass all required parameters to the pcp2elasticsearch plugin.
He has submitted a PR. Please review:
performancecopilot/pcp#581

@lberk

| Can you please explain what will happen if Elasticsearch is unavailable?

pcp2elasticsearch(1) has been configured to send live performance data to elasticsearch - the model is pcp2elasticsearch samples the data (via pmcd) and sends output to elasticsearch. If either elasticsearch or pmcd is down, the data is not deliverable for that (presumably small) time period.

| Are error logs being written somewhere?

@lberk? I'm presuming this is specified in the ansible configuration someplace? IIRC, by default pcp2elasticsearch just prints errors to stderr.

| Are we at risk of flooding the host?

Probably not, since the sampling interval is relatively infrequent. It could possibly be further reduced by remembering if the last sample produced an error and not repeating the same warnings - @lberk? - until pmcd/elasticsearch is available once more.
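The suppression idea above can be sketched as a small piece of state (illustrative Python, not actual PCP code): remember the message from the last failed send, stay quiet while the same failure repeats, and reset once a send succeeds so the next outage is reported again.

```python
# Sketch of the suggested warning-suppression behaviour (hypothetical, not PCP code):
# log a delivery error once per outage instead of once per sample.
class DedupingReporter:
    def __init__(self):
        self.last_error = None  # message from the previous failed attempt, if any

    def report(self, error):
        """Return True if this error should be logged, False if suppressed.

        Pass None for a successful send; this resets suppression so the
        next failure is reported again.
        """
        if error is None:
            self.last_error = None
            return False
        if error == self.last_error:
            return False  # same failure as the last sample: stay quiet
        self.last_error = error
        return True  # new (or first) failure: log it once
```

The same reset-on-success rule is what lets warnings resume after pmcd or elasticsearch comes back and then goes away again.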

Yes. This will be great!

lberk commented

Hi,
@natoscott committed 2a540a778be4a10bee85178fefc16b5ddd34a2a7 yesterday, which should address the logging-repetition issue.

As for the logging, it'll depend on how we're running pcp2elasticsearch. What I proposed in the ansible patch was running it from a systemd service file. If that's the case, the stderr output should already be caught by systemd and sent to journald.