SumoLogic/sumologic-collector-docker

Issues with Docker Collection - logs not being collected consistently

6eDesign opened this issue · 19 comments

Hello -

We are attempting to use the 'Docker Collection' variant of your container to send our Docker Cloud logs over to sumo. This worked well in our testing environment, which uses docker-compose. It was also working pretty well when we initially deployed to our production Docker Cloud environment last night. However, this morning I am seeing that most of our Docker-Logs data has been missing from sumo for the past 7 or 8 hours. Using the live tail tool, I see no Docker-Log data coming in at all.

On the other hand, I see a steady stream of Docker-Stats data. I do not see anything suspicious in any of the sumo container logs, but here is the output in case it helps (this is taken from Docker Cloud's GUI):

[sumo-1]2017-01-06T17:11:54.683854491Z Running SumoLogic Collector...
[sumo-3]2017-01-06T17:12:18.871092437Z Running SumoLogic Collector...
[sumo-4]2017-01-06T17:12:30.944566498Z Running SumoLogic Collector...
[sumo-2]2017-01-06T17:12:06.943207222Z Running SumoLogic Collector...

I'm curious if there are known issues with this deployment environment or if there are some steps I can take to troubleshoot the lack of log data.

Here's a little more info from our stackfile pertaining to our sumologic setup:

sumo:
  deployment_strategy: every_node
  environment:
    - SUMO_ACCESS_ID=*****
    - SUMO_ACCESS_KEY=*****
    - SUMO_COLLECTOR_NAME=DockerCollector
    - SUMO_COLLECTOR_NAME_PREFIX=myApp-
  image: 'sumologic/collector:latest'
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock

This project has 4 AWS instances deployed and the deployment strategy is correctly launching one sumo container per EC2 instance.

FWIW, I've also tried the latest-syslog image with the same result: logs are sent to sumo for a bit (2-3 hours) and then log data is sent intermittently if at all. The sumo container continues to run and provides no additional output.
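For anyone else debugging this, one basic check is to confirm that the Docker API is still responsive over the mounted socket, since that is what the DockerLog source reads from. From the host, with curl 7.40 or newer:

$ curl --silent --unix-socket /var/run/docker.sock http://localhost/containers/json | head -c 200

If that call hangs or errors while the collector is in its "not collecting" state, the problem is below the collector; if it returns the running containers promptly, the stall is inside the collector itself.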

I am seeing this same issue.

I'm collecting logs from a docker container using the '/var/run/docker.sock:/var/run/docker.sock' volume method. After a period of time, the collector just stops collecting but doesn't report anything being wrong. Redeploying the collector recovers some logs, but not all.

Here is the sumo-sources.json file that I'm using:

{
  "api.version": "v1",
  "sources": [
        {
            "name": "sumo-docker-collector",
            "category": "docker",
            "allContainers": false,
            "collectEvents": false,
            "uri": "unix:///var/run/docker.sock",
            "specifiedContainers": ["test-1"],
            "multilineProcessingEnabled": false,
            "sourceType": "DockerLog"
        }
   ]
}

Docker base image: sumologic/collector:latest
(Maybe) relevant ENV vars:

SUMO_JAVA_MEMORY_INIT=64
SUMO_JAVA_MEMORY_MAX=128 

Is this a known bug? Is there a workaround?
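For anyone trying to reproduce this, a launch roughly along these lines should exercise the same configuration. The mount of the sources file and the SUMO_SOURCES_JSON variable follow this repo's README, so treat them as assumptions if your image version differs; credentials are placeholders:

$ docker run -d --name sumo-docker-collector \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd)/sumo-sources.json:/etc/sumo-sources.json \
    -e SUMO_ACCESS_ID=***** \
    -e SUMO_ACCESS_KEY=***** \
    -e SUMO_SOURCES_JSON=/etc/sumo-sources.json \
    -e SUMO_JAVA_MEMORY_INIT=64 \
    -e SUMO_JAVA_MEMORY_MAX=128 \
    sumologic/collector:latest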

Hi @TRuppert, have you tried upgrading to the latest collector version? Thanks.

I'm seeing a similar issue. I am running the latest collector and still have unexplained dropouts in log collection. I have a config file similar to @TRuppert's.

Hi @rghunter, please zip the collector logs directory (/opt/SumoCollector/logs in the collector container) and upload it if you can so we can inspect it. Thanks!
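Something like this will copy the log directory out of the running container and zip it ("sumo" here is a placeholder for your collector container name):

$ docker cp sumo:/opt/SumoCollector/logs ./sumo-collector-logs
$ zip -r logs.zip ./sumo-collector-logs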

Hey @maimaisie, I am a colleague of @rghunter. The issue is still occurring.

I deleted all the sumo collector's log files, restarted the container, and observed the drop-out within about an hour from the restart (the last message flowing into sumo is at 2017-06-04 08:39:40,882 UTC).

Attached are the resulting log files. Thank you!
logs.zip

Hi @avivgil, thank you for providing the log files. Can you also provide your account name and deployment (which service endpoint you use to log in)? Thanks.

Thanks for the prompt response, @maimaisie. Our account name is Censio, and we are using US2.

I have escalated to engineering for investigation. I will keep you updated.

For future issues, it is best to file a Zendesk ticket to our support team :)

Hi @avivgil, I have taken a look at the log file you provided. It looks like multiple sources were started within a short time, so this may not be the same setup as the sumo-sources.json attached by @TRuppert. Are they for the same issue?

$ cat collector.* | grep "Starting blade"
2017-06-04 07:19:19,966 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blades: '9'
2017-06-04 07:19:19,970 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'consul', ID: '206557263'
2017-06-04 07:19:19,971 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557263: 'consul'
2017-06-04 07:19:20,361 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'registrator', ID: '206557262'
2017-06-04 07:19:20,361 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557262: 'registrator'
2017-06-04 07:19:20,482 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'dd-agent', ID: '206557261'
2017-06-04 07:19:20,483 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557261: 'dd-agent'
2017-06-04 07:19:20,740 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'sumo-collector', ID: '206557260'
2017-06-04 07:19:20,741 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557260: 'sumo-collector'
2017-06-04 07:19:20,868 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'ecs-agent', ID: '206557259'
2017-06-04 07:19:20,869 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557259: 'ecs-agent'
2017-06-04 07:19:21,139 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'ecs-internal-git2consul-Git2ConsulTaskDef-3KL404RAQ472-1-Git2Consul-d4aaefbdc9c9b9f42300', ID: '209541098'
2017-06-04 07:19:21,139 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 209541098: 'ecs-internal-git2consul-Git2ConsulTaskDef-3KL404RAQ472-1-Git2Consul-d4aaefbdc9c9b9f42300'
2017-06-04 07:19:21,406 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'ecs-internal-git2consul-Git2ConsulTaskDef-3KL404RAQ472-1-env-config-ecdc9ae3a9b4b5984200', ID: '209541093'
2017-06-04 07:19:21,414 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 209541093: 'ecs-internal-git2consul-Git2ConsulTaskDef-3KL404RAQ472-1-env-config-ecdc9ae3a9b4b5984200'
2017-06-04 07:19:21,494 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'ecs-internal-agg-etl-AggETLTaskDef-N3IRDGLTIJG2-1-AggETL-fe82a9add28aafdb7200', ID: '208976317'
2017-06-04 07:19:21,494 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 208976317: 'ecs-internal-agg-etl-AggETLTaskDef-N3IRDGLTIJG2-1-AggETL-fe82a9add28aafdb7200'
2017-06-04 07:19:21,669 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.LocalBladeManager - Starting blade: 'ecs-internal-ubi3-etl-UBI3ETLTaskDef-PAJDWKUS7T9M-1-UBI3ETL-bea5d8a0ecf3c2d9b301', ID: '206557797'
2017-06-04 07:19:21,669 +0000 [WrapperSimpleAppMain] INFO  com.sumologic.scala.collector.blade.docker.DockerLogBlade - Starting blade 206557797: 'ecs-internal-ubi3-etl-UBI3ETLTaskDef-PAJDWKUS7T9M-1-UBI3ETL-bea5d8a0ecf3c2d9b301'
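One thing that may help narrow this down is to diff the containers currently running against the blades the collector actually started. A rough sketch on the affected host (GNU grep; run from wherever the collector log files were copied):

$ docker ps --format '{{.Names}}' | sort > running.txt
$ grep -ohP "Starting blade \d+: '\K[^']+" collector.* | sort -u > blades.txt
$ comm -23 running.txt blades.txt    # running containers with no blade started

Anything listed by the last command is a container whose logs the collector never picked up.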

We're seeing the same thing on our side with the collector image: a brief period of it working, followed by it silently not submitting logs. When we restart the container, it pushes a surge of many (but not all) of the missing logs into SumoLogic.

[screenshot]

We saw very similar behavior to this when using a logspout container. Perhaps it's related?
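As a blunt stop-gap until the root cause is fixed, a scheduled restart of the collector container at least flushes the backlog periodically; for example, a cron entry on each host (assuming the container is named sumo):

0 */6 * * * docker restart sumo

Per the reports above this only recovers some of the missing logs, so it is a workaround rather than a fix.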

We will publish a new version of the collector as early as this week. It will include reliability fixes for Docker sources, so please try it and see whether it fixes your particular issue. Thanks!

Any update on the availability of the new version?

@maimaisie I am still seeing this behavior on v19.209-5. From Oct 15th to Oct 22nd, Sumo did not collect any logs for (as one specific example) a script that runs every morning for us. Papertrail, which uses a logspout container, consistently collected those logs every morning.

Sumo:
screen shot 2017-10-22 at 8 28 50 am

Papertrail:
screen shot 2017-10-22 at 8 30 41 am


As shown below, restarting the Sumo collectors retroactively sent logs into SumoLogic. Our logging rate briefly went from ~20,000/hour to over 7.5 million.
screen shot 2017-10-22 at 8 35 44 am

This is a serious issue for us as it disrupts any reports and monitors that we're writing against these logs.

Hi @taiidani, can you provide your organization account ID and the name/ID of the collector that is having this issue? Please also provide the collector log file located at /opt/SumoCollector/logs/collector.log in the Sumo collector container, if possible.

@bin3377 can you take care of the issue above? Thanks.

@taiidani We may need your organization information to identify the problem. Please file a support ticket following the instructions at https://support.sumologic.com/hc/en-us and contact our customer service department.
That is better than sharing your information on a public forum. Thanks!