opendistro-for-elasticsearch/data-prepper

Trace Analytics not set up - otel-v1-apm-service-map index is blank

nnanoob opened this issue · 9 comments

Hi,

I tried running the following combination but could not get Trace Analytics to show anything.
The otel-v1-apm-service-map index is always empty. Any idea what I could have misconfigured?

  1. AWS Managed Elasticsearch 7.9.3
  2. amazon/opendistro-for-elasticsearch-data-prepper:1.0.0
  3. otel/opentelemetry-collector-contrib:0.28.0

GET _cat/indices?v&s=i

health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   otel-v1-apm-service-map           kkdvD7wEQWyyEEuV_Pl_9A   5   1          0            0        2kb            1kb
green  open   otel-v1-apm-span-000001           _QLY94WSTbS5ezSEazqg4Q   5   1         71            0      1.9mb        987.4kb

otel-collect.yml

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: null

exporters:
  otlp/2:
    endpoint: "<ip>:<port>"
    insecure: true
  logging:
  
extensions:
  health_check: {}

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers:
        - otlp
      exporters:
        - logging
    traces:
      receivers:
        - otlp
      processors:
        - batch
      exporters:
        - otlp/2
        - logging

pipelines.yml

entry-pipeline:
  delay: "100"
  source:
    otel_trace_source:
      ssl: false
  sink:
    - pipeline:
        name: "raw-pipeline"
    - pipeline:
        name: "service-map-pipeline"
raw-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - otel_trace_raw_prepper:
  sink:
    - elasticsearch:
        hosts: ["<host>:<port>"]
        username: "<username>"
        password: "<password>"
        trace_analytics_raw: true
service-map-pipeline:
  delay: "100"
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - service_map_stateful:
  sink:
    - elasticsearch:
        hosts: ["<host>:<port>"]
        username: "<username>"
        password: "<password>"
        trace_analytics_service_map: true

data-prepper container logs

2021-06-14T07:12:55,395 [main] INFO  com.amazon.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900
2021-06-14T07:12:55,484 [entry-pipeline-prepper-worker-1-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  entry-pipeline Worker: No records received from buffer
2021-06-14T07:12:55,486 [service-map-pipeline-prepper-worker-3-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  service-map-pipeline Worker: No records received from buffer
2021-06-14T07:12:58,392 [raw-pipeline-prepper-worker-5-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  raw-pipeline Worker: No records received from buffer
2021-06-14T07:13:07,565 [entry-pipeline-prepper-worker-1-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  entry-pipeline Worker: Processing 1 records from buffer
2021-06-14T07:13:07,679 [service-map-pipeline-prepper-worker-3-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  service-map-pipeline Worker: Processing 1 records from buffer
2021-06-14T07:13:10,577 [raw-pipeline-prepper-worker-5-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  raw-pipeline Worker: Processing 1 records from buffer
2021-06-14T07:25:02,209 [entry-pipeline-prepper-worker-1-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  entry-pipeline Worker: Processing 1 records from buffer
2021-06-14T07:25:02,313 [service-map-pipeline-prepper-worker-3-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  service-map-pipeline Worker: Processing 1 records from buffer
2021-06-14T07:25:05,210 [raw-pipeline-prepper-worker-5-thread-1] INFO  com.amazon.dataprepper.pipeline.ProcessWorker -  raw-pipeline Worker: Processing 1 records from buffer

Regards,
Boon

Hi, thanks for filing an issue. This seems to be on the data-prepper side. @wrijeff, could you take a look here please?

@nnanoob Thanks for reporting the issue. I did not find any problem with the service-map pipeline in pipelines.yml, so it could be a data processing bug in the service-map prepper of data-prepper. For further investigation, could you share a snapshot of all raw spans belonging to a sample traceId? In particular, could you include at least the following fields?

  1. traceId
  2. spanId
  3. parentSpanId
  4. name
  5. serviceName
  6. traceGroup

@chenqi0805,

Query

GET otel-v1-apm-span-000001/_search?
{
  "sort": [
    {
      "startTime": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "traceId": {
        "value": "d66fa6449679a8ed796e0aaddd676ad3"
      }
    }
  }
}

Results

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "otel-v1-apm-span-000001",
        "_type" : "_doc",
        "_id" : "79f3ca6ca90d6152",
        "_score" : null,
        "_source" : {
          "traceId" : "d66fa6449679a8ed796e0aaddd676ad3",
          "spanId" : "79f3ca6ca90d6152",
          "traceState" : "",
          "parentSpanId" : "01852c7b15e32fdd",
          "name" : "HTTP GET",
          "kind" : "SPAN_KIND_CLIENT",
          "startTime" : "2021-06-14T23:55:00.055073908Z",
          "endTime" : "2021-06-14T23:55:00.155158663Z",
          "durationInNanos" : 100084755,
          "serviceName" : "infra.monitoring",
          "events" : [ ],
          "links" : [ ],
          "droppedAttributesCount" : 2,
          "droppedEventsCount" : 0,
          "droppedLinksCount" : 0,
          "traceGroup" : "MonitoringService.monitors",
          "traceGroupFields.endTime" : "2021-06-14T23:55:00.156641Z",
          "traceGroupFields.statusCode" : 0,
          "traceGroupFields.durationInNanos" : 155638120,
          "span.attributes.http@url" : "<some url - masked on purpose>",
          "instrumentationLibrary.version" : "1.2.0",
          "span.attributes.thread@id" : 35,
          "resource.attributes.process@pid" : 1,
          "resource.attributes.host@arch" : "amd64",
          "span.attributes.net@peer@name" : "<some hostname - masked on purpose>",
          "resource.attributes.telemetry@sdk@version" : "1.2.0",
          "resource.attributes.service@name" : "infra.monitoring",
          "status.code" : 0,
          "instrumentationLibrary.name" : "io.opentelemetry.javaagent.http-url-connection",
          "resource.attributes.service@version" : "1.0.0",
          "resource.attributes.process@runtime@name" : "OpenJDK Runtime Environment",
          "resource.attributes.os@type" : "linux",
          "span.attributes.http@flavor" : "1.1",
          "resource.attributes.deployment@environment" : "poc",
          "span.attributes.http@status_code" : 200,
          "span.attributes.thread@name" : "scheduling-1",
          "resource.attributes.telemetry@sdk@language" : "java",
          "resource.attributes.host@name" : "a7ebcbdd1baa",
          "resource.attributes.process@runtime@description" : "Oracle Corporation OpenJDK 64-Bit Server VM 25.282-b08",
          "resource.attributes.process@executable@path" : "/var/lib/jre:bin:java",
          "span.attributes.net@transport" : "ip_tcp",
          "resource.attributes.process@command_line" : "/var/lib/jre:bin:java -javaagent:/apps/lib/ext/opentelemetry-javaagent-all.jar -Dotel.resource.attributes=service.name=infra.monitoring,service.version=1.0.0,deployment.environment=poc -Dotel.exporter.otlp.endpoint=<some url - masked on purpose> -Dspring.profiles.active=prod",
          "span.attributes.http@method" : "GET",
          "resource.attributes.process@runtime@version" : "1.8.0_282-b08",
          "resource.attributes.telemetry@sdk@name" : "opentelemetry",
          "resource.attributes.telemetry@auto@version" : "1.2.0",
          "resource.attributes.os@description" : "Linux 3.10.0-1160.25.1.el7.x86_64"
        },
        "sort" : [
          1623714900055073908
        ]
      },
      {
        "_index" : "otel-v1-apm-span-000001",
        "_type" : "_doc",
        "_id" : "fcacb962a394991a",
        "_score" : null,
        "_source" : {
          "traceId" : "d66fa6449679a8ed796e0aaddd676ad3",
          "spanId" : "fcacb962a394991a",
          "traceState" : "",
          "parentSpanId" : "01852c7b15e32fdd",
          "name" : "HTTP GET",
          "kind" : "SPAN_KIND_CLIENT",
          "startTime" : "2021-06-14T23:55:00.001449565Z",
          "endTime" : "2021-06-14T23:55:00.048979017Z",
          "durationInNanos" : 47529452,
          "serviceName" : "infra.monitoring",
          "events" : [ ],
          "links" : [ ],
          "droppedAttributesCount" : 2,
          "droppedEventsCount" : 0,
          "droppedLinksCount" : 0,
          "traceGroup" : "MonitoringService.monitors",
          "traceGroupFields.endTime" : "2021-06-14T23:55:00.156641Z",
          "traceGroupFields.statusCode" : 0,
          "traceGroupFields.durationInNanos" : 155638120,
          "span.attributes.http@url" : "<some url - masked on purpose>",
          "instrumentationLibrary.version" : "1.2.0",
          "span.attributes.thread@id" : 35,
          "resource.attributes.process@pid" : 1,
          "resource.attributes.host@arch" : "amd64",
          "span.attributes.net@peer@name" : "<some hostname - masked on purpose>",
          "resource.attributes.telemetry@sdk@version" : "1.2.0",
          "resource.attributes.service@name" : "infra.monitoring",
          "status.code" : 0,
          "instrumentationLibrary.name" : "io.opentelemetry.javaagent.http-url-connection",
          "resource.attributes.service@version" : "1.0.0",
          "resource.attributes.process@runtime@name" : "OpenJDK Runtime Environment",
          "resource.attributes.os@type" : "linux",
          "span.attributes.http@flavor" : "1.1",
          "resource.attributes.deployment@environment" : "poc",
          "span.attributes.http@status_code" : 200,
          "span.attributes.thread@name" : "scheduling-1",
          "resource.attributes.telemetry@sdk@language" : "java",
          "resource.attributes.host@name" : "a7ebcbdd1baa",
          "resource.attributes.process@runtime@description" : "Oracle Corporation OpenJDK 64-Bit Server VM 25.282-b08",
          "resource.attributes.process@executable@path" : "/var/lib/jre:bin:java",
          "span.attributes.net@transport" : "ip_tcp",
          "resource.attributes.process@command_line" : "/var/lib/jre:bin:java -javaagent:/apps/lib/ext/opentelemetry-javaagent-all.jar -Dotel.resource.attributes=service.name=infra.monitoring,service.version=1.0.0,deployment.environment=poc -Dotel.exporter.otlp.endpoint=<some url - masked on purpose> -Dspring.profiles.active=prod",
          "span.attributes.http@method" : "GET",
          "resource.attributes.process@runtime@version" : "1.8.0_282-b08",
          "resource.attributes.telemetry@sdk@name" : "opentelemetry",
          "resource.attributes.telemetry@auto@version" : "1.2.0",
          "resource.attributes.os@description" : "Linux 3.10.0-1160.25.1.el7.x86_64"
        },
        "sort" : [
          1623714900001449565
        ]
      },
      {
        "_index" : "otel-v1-apm-span-000001",
        "_type" : "_doc",
        "_id" : "01852c7b15e32fdd",
        "_score" : null,
        "_source" : {
          "traceId" : "d66fa6449679a8ed796e0aaddd676ad3",
          "spanId" : "01852c7b15e32fdd",
          "traceState" : "",
          "parentSpanId" : "",
          "name" : "MonitoringService.monitors",
          "kind" : "SPAN_KIND_INTERNAL",
          "startTime" : "2021-06-14T23:55:00.001002880Z",
          "endTime" : "2021-06-14T23:55:00.156641Z",
          "durationInNanos" : 155638120,
          "serviceName" : "infra.monitoring",
          "events" : [ ],
          "links" : [ ],
          "droppedAttributesCount" : 0,
          "droppedEventsCount" : 0,
          "droppedLinksCount" : 0,
          "traceGroup" : "MonitoringService.monitors",
          "traceGroupFields.endTime" : "2021-06-14T23:55:00.156641Z",
          "traceGroupFields.statusCode" : 0,
          "traceGroupFields.durationInNanos" : 155638120,
          "span.attributes.thread@name" : "scheduling-1",
          "instrumentationLibrary.version" : "1.2.0",
          "resource.attributes.telemetry@sdk@language" : "java",
          "span.attributes.thread@id" : 35,
          "resource.attributes.host@name" : "a7ebcbdd1baa",
          "resource.attributes.process@pid" : 1,
          "resource.attributes.host@arch" : "amd64",
          "resource.attributes.process@runtime@description" : "Oracle Corporation OpenJDK 64-Bit Server VM 25.282-b08",
          "resource.attributes.process@executable@path" : "/var/lib/jre:bin:java",
          "resource.attributes.telemetry@sdk@version" : "1.2.0",
          "resource.attributes.service@name" : "infra.monitoring",
          "resource.attributes.process@command_line" : "/var/lib/jre:bin:java -javaagent:/apps/lib/ext/opentelemetry-javaagent-all.jar -Dotel.resource.attributes=service.name=infra.monitoring,service.version=1.0.0,deployment.environment=poc -Dotel.exporter.otlp.endpoint=<some url - masked on purpose> -Dspring.profiles.active=prod",
          "status.code" : 0,
          "instrumentationLibrary.name" : "io.opentelemetry.javaagent.spring-scheduling-3.1",
          "resource.attributes.process@runtime@version" : "1.8.0_282-b08",
          "resource.attributes.service@version" : "1.0.0",
          "resource.attributes.telemetry@sdk@name" : "opentelemetry",
          "resource.attributes.process@runtime@name" : "OpenJDK Runtime Environment",
          "resource.attributes.os@type" : "linux",
          "resource.attributes.telemetry@auto@version" : "1.2.0",
          "resource.attributes.deployment@environment" : "poc",
          "resource.attributes.os@description" : "Linux 3.10.0-1160.25.1.el7.x86_64"
        },
        "sort" : [
          1623714900001002880
        ]
      }
    ]
  }
}

@nnanoob Thanks for sharing. Examining your sample trace group, I see that all the spans share the same serviceName:

"serviceName" : "infra.monitoring"

Note that the service-map prepper does not produce data for internal API calls, i.e. calls where the parent and child spans have the same serviceName. This is explicit in our end-to-end integration test:

if (parentData != null && !parentData.serviceName.equals(currData.serviceName)) {

Therefore, could you check whether the spans belonging to all other traceGroups/traceIds also share the same serviceName?

  • If yes, does your client application architecture include multiple microservices? If so, we could dive deeper into why both the client spans and the internal spans share the same serviceName and see whether an enhancement is possible on the data-prepper side.
  • If no, I need to look for another root cause of the issue.
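One way to run the check asked for above across the whole span index is a terms aggregation on serviceName. The sketch below only builds and prints the request body. It assumes serviceName is mapped as a keyword field in otel-v1-apm-span-000001 (a terms aggregation will not work on an analyzed text field); the aggregation name `services` and the helper name are arbitrary choices for illustration.

```python
import json


def service_name_aggregation(max_services: int = 10) -> dict:
    """Build a search body that counts spans per distinct serviceName."""
    return {
        "size": 0,  # return only the aggregation, no individual span hits
        "aggs": {
            "services": {  # arbitrary aggregation name
                # Assumption: serviceName is a keyword field in the span index.
                "terms": {"field": "serviceName", "size": max_services}
            }
        },
    }


# Paste the printed JSON under: GET otel-v1-apm-span-000001/_search
print(json.dumps(service_name_aggregation(), indent=2))
```

If the response contains a single bucket, every span in the index belongs to one service.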

@chenqi0805, I'm actually doing a POC with a single standalone application to see what Trace Analytics can offer in the OSS Kibana. The standalone application only makes a periodic API health check call to 2 sample website URLs, and I can see their URLs captured under span.attributes.http@url.

All the other traceGroups/traceIds share the same serviceName because I only deployed a single application, with agent attributes that look like this: -Dotel.resource.attributes=service.name=infra.monitoring,service.version=1.0.0,deployment.environment=poc

I was initially expecting to see something on the Trace Analytics screen. Does it not work if the otel-v1-apm-service-map index is empty?


@nnanoob No. If the index is empty then nothing shows up. AFAICT, trace analytics UI aggregation is dependent on both raw span data and service map edges.

@chenqi0805, in that case, will a single application/service.name sending data to data-prepper generate any service map? If yes, I'm just wondering why my example isn't inserting any docs into otel-v1-apm-service-map.
Or does it require, at a bare minimum, at least 2 applications, with application A invoking application B (and both having telemetry configured)?

@nnanoob

in that case, will a single application/service.name sending data to data-prepper generate any service map?

Unfortunately no, as we use serviceName as the identifier of the different services (the service-map nodes).

does it require, at a bare minimum, at least 2 applications, with application A invoking application B (and both having telemetry configured)?

Yes. You could refer to our sample-app services as an example:

https://github.com/opendistro-for-elasticsearch/data-prepper/blob/931dbc13b700c8b997e915329fbc24b0a58797ac/examples/trace-analytics-sample-app/sample-app/paymentService.py

https://github.com/opendistro-for-elasticsearch/data-prepper/blob/931dbc13b700c8b997e915329fbc24b0a58797ac/examples/trace-analytics-sample-app/sample-app/databaseService.py
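The two-service requirement can be illustrated with a small stand-alone sketch. It is not Data Prepper's implementation; it only mirrors the parent/child serviceName condition quoted earlier in this thread. The `Span` class and `service_map_edges` function are hypothetical names, and the two-service trace is invented for illustration; the single-service trace uses the span IDs posted in this issue.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: str  # empty string for a root span
    service_name: str


def service_map_edges(spans: List[Span]) -> Set[Tuple[str, str]]:
    """Return (parent service, child service) pairs that cross a service boundary."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    edges: Set[Tuple[str, str]] = set()
    for span in spans:
        parent = by_id.get(span.parent_span_id)
        # Mirrors the quoted condition: skip spans whose parent is missing
        # or belongs to the same service.
        if parent is not None and parent.service_name != span.service_name:
            edges.add((parent.service_name, span.service_name))
    return edges


# The three spans from the trace posted in this issue: one service only.
single_service = [
    Span("01852c7b15e32fdd", "", "infra.monitoring"),
    Span("fcacb962a394991a", "01852c7b15e32fdd", "infra.monitoring"),
    Span("79f3ca6ca90d6152", "01852c7b15e32fdd", "infra.monitoring"),
]

# Hypothetical trace in which service "A" calls an instrumented service "B".
two_services = [
    Span("a1", "", "A"),
    Span("b1", "a1", "B"),
]

print(service_map_edges(single_service))  # set()
print(service_map_edges(two_services))    # {('A', 'B')}
```

Under this simplified model, the outbound HTTP GET client spans from infra.monitoring produce no edge because the called websites emit no spans of their own under a different serviceName.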

@chenqi0805, thanks. I guess the issue I'm facing is because I only have 1 application, while it needs a minimum of 2 to work.