openzipkin/zipkin-gcp

nothing sent to stackdriver - how to debug?

aviau opened this issue · 31 comments

aviau commented

Sorry if this is more of a question than a bug report.

I have two very similar staging and production deployments running in GCE and I could only get zipkin to successfully forward traces to stackdriver in one of them.

In production, the stackdriver forwarding only worked for a few requests, then it stopped working after a reboot for unknown, seemingly magical reasons.

I am using the docker image (openzipkin/zipkin-gcp:0.13.2) and only set the following env variable, letting zipkin discover and use the service account:

        env:
        - name: STORAGE_TYPE
          value: stackdriver

Both in the working (staging) and non-working (production) setup, there is nothing in the zipkin logs after the startup messages.

I have tried to enable more logging by setting JAVA_OPTS:

        - name: JAVA_OPTS
          value: "-Dlogging.level.zipkin=DEBUG -Dlogging.level.zipkin2=DEBUG -Dlogging.level.zipkin2.translation.stackdriver=DEBUG"

This resulted in debug messages that print translated stackdriver spans, but no errors about them failing to be forwarded.

GET /health shows:

{"status":"UP","zipkin":{"status":"UP","details":{"StackdriverSender{redacted-project-id}":{"status":"UP"}}}}

GET /metrics shows:

{"counter.zipkin_collector.messages.http":277.0,"counter.zipkin_collector.spans_dropped.http":0.0,"gauge.zipkin_collector.message_bytes.http":2303.0,"counter.zipkin_collector.bytes.http":681050.0,"gauge.zipkin_collector.message_spans.http":3.0,"counter.zipkin_collector.spans.http":886.0,"counter.zipkin_collector.messages_dropped.http":0.0}

What would be the next step? How can I debug this further?

Thank you in advance!

You are doing the right thing. Metrics should show a drop when spans fail; possibly there's a timestamp-related or other issue. The only thing you aren't seeing is the HTTP data. Possibly GCP adds some sort of request ID that could be used for further diagnosis?

@anuraaga hmm, should we add HTTP request logging like we have in elasticsearch?

One other thing to check is whether or not the stats keep increasing later. If the counters stop increasing during the "magical drop" phase, there could be a smoking gun there. See https://gitter.im/openzipkin/zipkin for more general tips.
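The counter comparison suggested above can be sketched in a few lines. This is an illustrative stdlib-only Python helper, not part of zipkin; the base URL and polling approach are assumptions, while the counter key names match the /metrics output earlier in the thread.

```python
# Sketch only: compare two /metrics snapshots to see whether the
# collector is dropping spans. Counter names match the /metrics JSON
# shown above; the default base URL is an assumption.
import json
from urllib.request import urlopen

def fetch_metrics(base_url="http://localhost:9411"):
    """Fetch Zipkin's collector metrics as a dict."""
    with urlopen(base_url + "/metrics") as resp:
        return json.load(resp)

def drops_between(before, after, transport="http"):
    """Spans dropped between two snapshots (0 means no drops seen)."""
    key = "counter.zipkin_collector.spans_dropped." + transport
    return after.get(key, 0) - before.get(key, 0)
```

Polling this during the "magical drop" phase would show whether the collector itself is rejecting spans, or whether they vanish further downstream.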

Another thing you can do meanwhile (even though it is a bit of a firehose) is to dump the outbound HTTP traffic with --logging.level.com.linecorp.armeria.logging.traffic.client=trace

aviau commented

Thank you for helping!

one other thing to suggest is whether or not the stats are increasing later

They do increase.

dump the outbound http traffic using this logging.level.com.linecorp.armeria.logging.traffic.client=trace

How can I pass this using docker-zipkin-gcp? I have tried setting it using JAVA_OPTS and I am getting the following error:

Unrecognized option: --logging.level.com.linecorp.armeria.logging.traffic.client=trace
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

possibly GCP adds some sort of a request ID that could be used for further diagnosis?

Zipkin submits traces to the stackdriver API right? Wouldn't stackdriver respond with 4/5xx response codes if it failed? And would zipkin log this anywhere?

aviau commented

In case it might help, here is a translated trace (with stuff redacted):

2019-08-14 20:16:22.304 DEBUG 1 --- [king-tasks-1-42] z.t.s.SpanTranslator                     : >> translating zipkin span: {"traceId":"83d33b49e5a8dcb9","id":"83d33b49e5a8dcb9","kind":"CLIENT","name":"redacted.cool.website","timestamp":1565813781148177,"duration":773132,"localEndpoint":{"serviceName":"ambassador-default","ipv4":"0.0.0.0"},"tags":{":authority":"redacted.cool.website",":path":"/cool/endpoint","component":"proxy","downstream_cluster":"-","guid:x-request-id":"043d44e7-ca45-9c54-a2f0-fb0b6637db47","http.method":"GET","http.protocol":"HTTP/1.1","http.status_code":"200","http.url":"http://redacted.cool.website/cool/endpoint","node_id":"test-id","request_size":"0","response_flags":"-","response_size":"32","upstream_cluster":"cluster_redacted","user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}}
2019-08-14 20:16:22.305 DEBUG 1 --- [king-tasks-1-42] z.t.s.SpanTranslator                     : << translated to stackdriver span: span_id: "83d33b49e5a8dcb9"
display_name {
  value: "redacted.cool.website"
}
start_time {
  seconds: 1565813781
  nanos: 148177000
}
end_time {
  seconds: 1565813781
  nanos: 921309000
}
attributes {
  attribute_map {
    key: "/kind"
    value {
      string_value {
        value: "client"
      }
    }
  }
  attribute_map {
    key: ":authority"
    value {
      string_value {
        value: "redacted.cool.website"
      }
    }
  }
  attribute_map {
    key: ":path"
    value {
      string_value {
        value: "/api/cool/endpoint"
      }
    }
  }
  attribute_map {
    key: "component"
    value {
      string_value {
        value: "proxy"
      }
    }
  }
  attribute_map {
    key: "downstream_cluster"
    value {
      string_value {
        value: "-"
      }
    }
  }
  attribute_map {
    key: "guid:x-request-id"
    value {
      string_value {
        value: "043d44e7-ca45-9c54-a2f0-fb0b6637db47"
      }
    }
  }
  attribute_map {
    key: "/http/method"
    value {
      string_value {
        value: "GET"
      }
    }
  }
  attribute_map {
    key: "http.protocol"
    value {
      string_value {
        value: "HTTP/1.1"
      }
    }
  }
  attribute_map {
    key: "/http/status_code"
    value {
      string_value {
        value: "200"
      }
    }
  }
  attribute_map {
    key: "/http/url"
    value {
      string_value {
        value: "http://redacted.cool.website/api/cool/endpoint"
      }
    }
  }
  attribute_map {
    key: "node_id"
    value {
      string_value {
        value: "test-id"
      }
    }
  }
  attribute_map {
    key: "request_size"
    value {
      string_value {
        value: "0"
      }
    }
  }
  attribute_map {
    key: "response_flags"
    value {
      string_value {
        value: "-"
      }
    }
  }
  attribute_map {
    key: "response_size"
    value {
      string_value {
        value: "32"
      }
    }
  }
  attribute_map {
    key: "upstream_cluster"
    value {
      string_value {
        value: "cluster_redacted"
      }
    }
  }
  attribute_map {
    key: "user_agent"
    value {
      string_value {
        value: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
      }
    }
  }
  attribute_map {
    key: "/component"
    value {
      string_value {
        value: "ambassador-default"
      }
    }
  }
  attribute_map {
    key: "/agent"
    value {
      string_value {
        value: "zipkin-java"
      }
    }
  }
}

@aviau I am not stackdriver support, so I have no idea; I don't even use this thing. I'm just saying that if you were to ask stackdriver support why something is being dropped and they couldn't figure it out otherwise, looking at the HTTP dump could probably help. cc @saturnism

To answer your question, though: you can set anything like that as a system property by changing the double-dash to -D.

Ex

JAVA_OPTS=-Dlogging.level.com.linecorp.armeria.logging.traffic.client=trace

aviau commented

This got me the following error:

Error: Could not find or load main class JAVA_OPTS=-Dlogging.level.com.linecorp.armeria.logging.traffic.client=trace
Caused by: java.lang.ClassNotFoundException: JAVA_OPTS=-Dlogging.level.com.linecorp.armeria.logging.traffic.client=trace

Edit: whoops, copy-paste mistake on my part. I can now see the logs, but it looks like it would be hard to make any sense of them.
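For anyone hitting the same copy-paste trap: in a Kubernetes-style manifest (an assumption, based on the env block quoted at the top of this issue), JAVA_OPTS is the variable name and only the -D flags go in the value:

```yaml
        env:
        - name: STORAGE_TYPE
          value: stackdriver
        - name: JAVA_OPTS
          value: "-Dlogging.level.com.linecorp.armeria.logging.traffic.client=trace"
```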

aviau commented

I would like to ask again.

Would Zipkin log 4XX and 5XX error messages from stackdriver anywhere? Or would it just ignore them?

Sorry, I gave you environment variable syntax in shell style. You'd need to assign that same value however you start your container; I don't know how you do that.

I looked at the underlying code and it should fail on non-success like this, which would trigger drop counts:

    if (!HttpStatus.OK.equals(msg.status())) {
        throw new ArmeriaStatusException(
                StatusCodes.INTERNAL,
                "Non-successful HTTP response code: " + msg.status());
    }

The reason I think you should use wire logging is that it will remove any doubt about what's going on. In other words, setting a variable should be easier than other approaches.

PS: I tried with and without quotes running in docker and couldn't get it to fail. Anyway, if it helps, here's what I used:

docker run -d -p 9411:9411 \
  -e STORAGE_TYPE=stackdriver \
  -e GOOGLE_APPLICATION_CREDENTIALS=/zipkin/.gcp/credentials.json \
  -e STACKDRIVER_PROJECT_ID=zipkin-dev  \
  -e JAVA_OPTS=-Dlogging.level.com.linecorp.armeria.logging.traffic.client=trace \
  -v $HOME/.gcp:/zipkin/.gcp:ro \
  openzipkin/zipkin-gcp

Looks like you can get failures, but still a 200. cc @anuraaga

2019-08-16 09:57:00.112  INFO 44821 --- [-worker-nio-2-4] c.l.a.c.l.LoggingClient                  : Response: {startTime=2019-08-16T01:57:00.108Z(1565920620108628), length=0B, duration=3320µs(3320709ns), headers=[:status=200, content-type=application/grpc, date=Fri, 16 Aug 2019 01:56:42 GMT, alt-svc=quic=":443"; ma=2592000; v="46,43,39"], trailers=[EOS, grpc-status=5, grpc-message=Requested entity was not found., google.rpc.resourceinfo-bin=EhNwcm9qZWN0cy96aXBraW4tZGV2, grpc-status-details-bin=CAUSH1JlcXVlc3RlZCBlbnRpdHkgd2FzIG5vdCBmb3VuZC4aRAordHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUucnBjLlJlc291cmNlSW5mbxIVEhNwcm9qZWN0cy96aXBraW4tZGV2]}

The error was due to me fat-fingering my application name, but it does show that we don't register errors as we should.

The reason is that the error check is on the headers, whereas the gRPC status is actually in the trailers.
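The header/trailer distinction is easy to get wrong: in gRPC over HTTP/2, the HTTP :status can be 200 even when the call failed, because the authoritative result is the grpc-status trailer ("0" means OK). A minimal sketch of the correct check (a hypothetical Python helper for illustration, not the actual Armeria code):

```python
# Sketch only (not the actual Armeria code): a gRPC call's outcome is
# decided by the grpc-status trailer, not the HTTP :status header.
# grpc-status "0" means OK; anything else is a failure, even when the
# HTTP layer reports 200.
def grpc_call_failed(headers, trailers):
    """True if the call failed at either the HTTP or the gRPC layer."""
    if headers.get(":status") != "200":
        return True  # transport-level failure
    return trailers.get("grpc-status", "0") != "0"
```

Applied to the log line above, the 200 headers pass but the grpc-status=5 trailer marks the call as failed, which is exactly what the original check missed.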

I raised #134 to make turning on logging easier and not require a java property.

There's a separate matter which is that the stats aren't showing certain types of errors. I raised a bug upstream so that the stats will be correct later. line/armeria#1998

aviau commented

sorry I gave you environment variable syntax in shell style

Oh yeah don't worry. Maybe I wasn't clear but I got it working. (see the edit in my comment: #133 (comment)) I just didn't know how to make sense of the logs. I'd be willing to send you a copy of them if that is useful but I didn't want to post them here in case they contained something sensitive.

I don't know, which is why I asked you to use the other setting. Remember I'm a volunteer on this code, and I do not work for Google. I'm trying to help you troubleshoot, get to the bottom of the issue, and escalate the concern to Google.

I am sorry if I didn't come across as thankful for your help. I am more than willing to escalate things to Google, but my gut tells me that the issue is somewhere in my setup or zipkin-gcp.

I raised #134 to make turning on logging easier and not require a java property.

I have created a new docker-zipkin-gcp image with that feature @ github/aviau/docker-zipkin-gcp & docker-hub/reazem/docker-zipkin-gcp.

Then I set STACKDRIVER_HTTP_LOGGING=headers

... and here we go:

2019-08-16 19:23:46.404  INFO 1 --- [orker-epoll-2-1] c.l.a.c.l.LoggingClient                  : Response: {startTime=2019-08-16T19:23:46.403Z(1565983426403981), length=0B, duration=423µs(423159ns), headers=[:status=200, content-type=application/grpc, www-authenticate=Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/trace.append https://www.googleapis.com/auth/cloud-platform", date=Fri, 16 Aug 2019 19:23:46 GMT], trailers=[EOS, grpc-status=7, grpc-message=Request had insufficient authentication scopes., www-authenticate=Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/trace.append https://www.googleapis.com/auth/cloud-platform"]}

So it turns out that my service account had insufficient permissions. Zipkin wasn't logging this anywhere and the health check was passing for the StackdriverSender.
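The missing scopes can be read straight out of the www-authenticate challenge in the log line above. A small parsing sketch (an illustrative Python helper, assuming the quoted key="value" format shown in the log):

```python
# Sketch only: extract the OAuth scopes a Bearer challenge asks for,
# e.g. from the www-authenticate value in the trace log above.
import re

def required_scopes(www_authenticate):
    """Scopes listed in a Bearer challenge's scope="..." field."""
    m = re.search(r'scope="([^"]*)"', www_authenticate)
    return m.group(1).split() if m else []
```

For the challenge above this yields the trace.append and cloud-platform scopes, which is what the service account was missing.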

My understanding of line/armeria#1998 is that there is now a clear pathway for fixing this in zipkin-gcp?

Thank you very much for your work! Hopefully this has been productive :)

aviau commented

Ah, and the reason the issue was sporadic on my side is that every time I reboot zipkin, it can be scheduled to another kubernetes node with a different service account. After that, it was all down to luck.

Once line/armeria#1998 is in, the health check can be implemented. Because the gRPC call never failed, there was no way to detect actual failure before. I spent a fair amount of time trying to figure out why I could not make it fail, so I was very happy to see your fix for propagating the trailing failure.

@elefeint this is too important to wait for upstream, as I can't afford the time liability anymore. Can we decide on a check approach? I can raise a patch here (regarding trailers) and just revert it later.

Is an empty batch OK? Remember we need a health check both here and in the sender; the sender has nothing to do with armeria, so no need to couple them.

Going back to #131 to discuss there.

@aviau To circle back to the original issue, when traces are missing, these would be the troubleshooting steps I would go through in GCP console:

  1. Go to APIs & Services > Dashboard
  2. Click on Stackdriver Trace API, and select the Metrics section on the left side menu.
  3. Is there any traffic at all? If not, then the requests are likely not reaching Stackdriver at all. Check your applications/proxies to make sure requests go to the right place.
  4. Are there errors? If so, narrow down the source of errors by selecting specific Credentials and Methods values from the filter drop-downs on top and seeing the effect in "Errors by API method" chart or "Methods" table.
    a) If all writes are failing, check that your service account has the cloudtrace.agent role or the cloudtrace.traces.patch permission.
    b) If reads are failing, check for the cloudtrace.user role or the specific permission (full permissions list).
    NOTE: read and write permissions are separate; only the admin role has both.

In your particular case, narrowing statistics by Credentials filter would likely pinpoint the problem, as one of the service accounts would show 100% failure rate.

Summarizing all findings/work/TODOs that came out of this issue and out of spring-attic/spring-cloud-gcp#1752, in approximate chronological order:

  • log at warn level the first time reporter fails to send spans in zipkin-reporter-java (openzipkin/zipkin-reporter-java#155)
  • configurable logging of Stackdriver errors in zipkin-gcp (#134)
  • propagate stackdriver gRPC errors into StackdriverSender in armeria (line/armeria#1998, #135)
  • invoke health checks when constructing objects in spring-cloud-sleuth (spring-cloud/spring-cloud-sleuth#1413)
  • implement StackdriverSender healthcheck in zipkin-gcp (#136)
  • invoke health checks when constructing objects in spring-cloud-gcp (would not be useful until decision to upgrade to not-yet-GA Stackdriver v2, but we are emotionally getting there).
  • upgrade armeria and back out #135 in zipkin-gcp (blocked on armeria release -- @adriancole, 0.90.3 seems to have been released yesterday and includes your fix; would that work instead of waiting for 0.91?).

we are emotionally getting there

:)

@elefeint so I think a more important bullet than backing out the patch on 0.90.3 is actually implementing the same health check you did client-side on the server side, like here: https://github.com/openzipkin/zipkin-gcp/blob/master/storage-stackdriver/src/main/java/zipkin2/storage/stackdriver/StackdriverStorage.java#L151

We should have a resolved zipkin server in the next day or two and we can address patch/backout independently of implementing the server-side health check.

#138 backs out recent workarounds

Also, I think we have one more TODO, which is to add troubleshooting steps to the README (e.g. start with the /health endpoint now that it works, then what's already written in #133 (comment)).

Ported the code to the server side so we can cut a release. We can do the README later and close this out. Thanks folks! 5634701

Health works now. Thanks for all the help! The last thing is the README, which can happen post-release.

$ curl -s localhost:9411/health|jq .
{
  "status": "UP",
  "zipkin": {
    "status": "UP",
    "details": {
      "StackdriverStorage{zipkin-demo}": {
        "status": "UP"
      }
    }
  }
}
$ curl -s localhost:9411/health|jq .
{
  "status": "DOWN",
  "zipkin": {
    "status": "DOWN",
    "details": {
      "StackdriverStorage{ipkin-demo}": {
        "status": "DOWN",
        "details": {
          "error": "com.linecorp.armeria.common.grpc.protocol.ArmeriaStatusException: Requested entity was not found."
        }
      }
    }
  }
}
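With the sender failure now surfaced in /health, a readiness check can walk the response for broken components. A sketch (an illustrative stdlib-only Python helper, matching the JSON shape of the two responses above):

```python
# Sketch only: recursively walk a Zipkin /health response and collect
# every DOWN component together with its error message.
def down_components(node, name="health"):
    """Return (component, error) pairs for DOWN components."""
    found = []
    details = node.get("details", {})
    if node.get("status") == "DOWN":
        error = details.get("error")
        if isinstance(error, str):
            found.append((name, error))
    # Recurse into nested components: sibling dict values, plus
    # dict-valued entries under "details".
    children = {k: v for k, v in node.items()
                if isinstance(v, dict) and k != "details"}
    children.update({k: v for k, v in details.items()
                     if isinstance(v, dict)})
    for child_name, child in children.items():
        found.extend(down_components(child, child_name))
    return found
```

Run against the second response above, this would surface the StackdriverStorage component along with its ArmeriaStatusException, which is exactly the signal that was invisible before the fix.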

v0.14.0 on the way with the health-checked sender and storage