GoogleCloudPlatform/fluent-plugin-detect-exceptions

Provide a way to bundle error message/info into to the combined log entry

qingling128 opened this issue · 10 comments

Originally from the conversation in #11.

It's a common case that developers use logging frameworks (e.g. logback for Java) to help structure error logs. Some additional error message or other info is frequently added in front of the actual stack trace. For example the logback framework for Java would prepend a line with "ERROR " in front of the stacktrace like this:

ERROR [2017-03-21 07:16:42,627] u.s.d.Jobs <http://u.s.d.jobs/> {}
uw.smartmetering.domain.Jobs <http://uw.smartmetering.domain.jobs/>:
Job uw.smartmetering.tma.PollStatusFeedJob@7091121 failed
javax.ws.rs.ProcessingException:
org.apache.http.conn.ConnectTimeoutException: Connect to
ssp-api-stg01.tmadata.co.uk:443
[ssp-api-stg01.tmadata.co.uk/80.195.118.185,
ssp-api-stg01.tmadata.co.uk/217.138.85.11] failed: connect
   at io.dropwizard.client.DropwizardApacheConnector.apply(DropwizardApacheConnector.java:109)
   at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:252)
   at org.glassfish.jersey.client.JerseyInvocation$1.call(JerseyInvocation.java:684)
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to ssp-api-stg01.tmadata.co.uk:443
[ssp-api-stg01.tmadata.co.uk/80.195.118.185,
ssp-api-stg01.tmadata.co.uk/217.138.85.11] failed: connect timed out
   at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
   at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
   ... 26 common frames omitted
Caused by: java.net.SocketTimeoutException: connect timed out
   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:207)
   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

   ... 36 common frames omitted

It would be nice if we could provide an option for developers to bundle these error messages/info into the combined log entry. Note that:

  1. This solution should be language independent.
  2. The format or this piece of info (error message, timestamp or other info) may vary from language to language and from framework to framework. So we need to make the format configurable, and leave the regex to developers' decisions.

Hi,

Can you please share any progress on this?

Many thanks,
Javi

Hi @jcotado, sorry we've not yet got a chance to evaluate / prioritize it. It's on the agenda for our next weekly meeting. Are you somehow blocked by this?

CC @rbuskens, @igorpeshansky, @dhrupadb for priority. I wonder if it's something we would like to squeeze into this quarter. It's a nice and helpful feature to have. Not a blocking one though I guess?
(@jcotado: correct me if i'm wrong).

This seems too esoteric to be useful in the default config. @jcotado, any reason this isn't doable with the existing in_tail format multiline functionality?

@igorpeshansky Javi and I chat a bit about using in_tail in the previous thread: #11. They are ingesting logs from containers in a kubernetes cluster by mounting /var/log. Kubernetes has fluentd configs built into the add-on image fluentd-es-image. There is some on-going effort to make it customizable by mounting configs to a certain path, but it's not available yet I believe.

@jcotado: One solution here, if we stick to using in_tail, would be to edit the td-agent.conf file and build a customized version of fluentd-es-image to use. So we can "customize" the fluentd config indirectly.

@qingling128 We are already using our own version of fluentd image, our config can be seen here: https://github.com/utilitywarehouse/fluentd/blob/master/fluent.conf

Could you please elaborate on the changes to the in_tail to allow formatting, bearing in mind this source is shared by all applications (does that matter?)

Thanks!

Hi @george-angel,

Sorry, I missed this one in my mailbox. Good to know you are already using your own version of the image. The instruction of multiline format of in_tail can be here. Look for multiline.

Quoting the section for Java exception here:

"You can parse Java like stacktrace logs with multiline. Here is a configuration example.

format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2}/
format1 /^(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}) \[(?<thread>.*)\] (?<level>[^\s]+)(?<message>.*)/

If you have a following log:

2013-3-03 14:27:33 [main] INFO  Main - Start
2013-3-03 14:27:33 [main] ERROR Main - Exception
javax.management.RuntimeErrorException: null
    at Main.main(Main.java:16) ~[bin/:na]
2013-3-03 14:27:33 [main] INFO  Main - End

It will be parsed as:

2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"INFO","message":"  Main - Start"}
2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"ERROR","message":" Main - Exception\njavax.management.RuntimeErrorException: null\n    at Main.main(Main.java:16) ~[bin/:na]"}
2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"INFO","message":"  Main - End"}

@qingling128 First of all, I appreciate this isn't an issue with GoogleCloudPlatform, and I appreciate you are trying to help us.

Our input is not a Java stacktrace, as we are ingesting docker logs, and its actually json:

{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"10.2.22.19 - - [21/Mar/2017:07:16:41 +0000] \"POST /trigger-jobs HTTP/1.1\" 200 20 \"-\" \"curl/7.52.1\" 2\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"ERROR [2017-03-21 07:16:42,627] u.s.d.Jobs {} uw.smartmetering.domain.Jobs: Job uw.smartmetering.tma.PollStatusFeedJob@7091121b failed\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"javax.ws.rs.ProcessingException: org.apache.http.conn.ConnectTimeoutException: Connect to ssp-api-stg01.tmadata.co.uk:443 [ssp-api-stg01.tmadata.co.uk/80.195.118.185, ssp-api-stg01.tmadata.co.uk/217.138.85.11] failed: connect timed out\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at io.dropwizard.client.DropwizardApacheConnector.apply(DropwizardApacheConnector.java:109)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:252)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation$1.call(JerseyInvocation.java:684)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation$1.call(JerseyInvocation.java:681)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.internal.Errors.process(Errors.java:315)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.internal.Errors.process(Errors.java:297)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.internal.Errors.process(Errors.java:228)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:444)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:681)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:411)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient$_makeRequestNoValidation_closure7.doCall(TmaClient.groovy:144)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.circuitbreaker.Circuit.withBreaker(Circuit.groovy:33)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient.makeRequestNoValidation(TmaClient.groovy:134)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient$_makeRequest_closure6.doCall(TmaClient.groovy:112)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.circuitbreaker.Circuit.withBreaker(Circuit.groovy:33)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient.makeRequest(TmaClient.groovy:110)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient.doGet(TmaClient.groovy:107)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaClient.get(TmaClient.groovy:86)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.tmaclient.TmaStatusClient.getSince(TmaStatusClient.groovy:31)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.tma.StatusFeedConsumer.doForEachNewStatus(StatusFeedConsumer.groovy:34)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.tma.StatusFeedConsumer.forEachNewStatus(StatusFeedConsumer.groovy:24)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.tma.PollStatusFeedJob.run(PollStatusFeedJob.groovy:44)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.domain.Jobs$_run_closure1$_closure2.doCall(Jobs.groovy:49)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.domain.Jobs$_run_closure1.doCall(Jobs.groovy:47)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.db.DbLock.ifOwned(DbLock.groovy:40)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.domain.Jobs.run(Jobs.groovy:43)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at uw.smartmetering.http.TriggerJobsResource.triggerJobs(TriggerJobsResource.groovy:27)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to ssp-api-stg01.tmadata.co.uk:443 [ssp-api-stg01.tmadata.co.uk/80.195.118.185, ssp-api-stg01.tmadata.co.uk/217.138.85.11] failed: connect timed out\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at io.dropwizard.client.DropwizardApacheConnector.apply(DropwizardApacheConnector.java:87)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  ... 26 common frames omitted\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"Caused by: java.net.SocketTimeoutException: connect timed out\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:207)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at java.net.Socket.connect(Socket.java:589)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  ... 36 common frames omitted\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"ERROR [2017-04-25 04:13:26,118] u.s.i.Jobs {} uw.smartmetering.infrastructure.Jobs: Job uw.smartmetering.common.adapter.tma.PollScheduledReadingsJob@c77718c failed\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"javax.ws.rs.ProcessingException: java.net.UnknownHostException: tma-mock\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at io.dropwizard.client.DropwizardApacheConnector.apply(DropwizardApacheConnector.java:109)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:252)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation$1.call(JerseyInvocation.java:684)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.client.JerseyInvocation$1.call(JerseyInvocation.java:681)\n"}
{"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"  at org.glassfish.jersey.internal.Errors.process(Errors.java:315)\n"}

this works in detect_exceptions plugin by specifying message log so it selects only the string value from the json key log. And then applies formatting to that.

Also from my limited understanding, in_tail is a source, we already have a source in_tail and format on that has to remain as json, as its collecting all container logs. Can it act as a "filter"?

If it does, would we have to regex for the entire part of the json string: {"time":"2017-01-08T11:22:33.444Z", "stream": "stderr", "log":"ERROR [2017-03-21 07:16:42,627]

Does that make sense?

Yeah, detect_exception plugin supports extracting the message from one entry of the json by specifying a message field. Unfortunately we won't be able to make use of that when using in_tail.

You can have multiple in_tail sources with different paths that point to different log files. To support the case here, you would need to have a separate in_tail setting for the JVM log files and regex for the entire part with multiline enabled.

@qingling128 But all logs are actually under a single path of /var/log/docker. I had a bit of a play and I think ultimately what we are trying to match it is too generic to do it with fluent. Or it would require a lot of logic.

We are going to investigate either modifying the application logs to make it easier or search for a combination of plugins to do this.

Happy to close the Issue, thanks for your help!

Thanks for letting me know! I'll close this issue for now then. :)