jaegertracing/jaeger

ES 8.x / v2 integration test often fails

yurishkuro opened this issue · 7 comments

https://github.com/jaegertracing/jaeger/actions/runs/11341694226/job/31567998272?pr=6091

The test is running this script: bash scripts/es-integration-test.sh elasticsearch 8.x v2

Some errors that might explain it:

This one looks like there was a retry save that resulted in duplicate spans in the storage:

=== NAME  TestElasticsearchStorage/GetTrace
    integration.go:279: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:279
        	Error:      	Should be true
        	Test:       	TestElasticsearchStorage/GetTrace
    trace_compare.go:66: Removed spans with duplicate span IDs; before=10, after=5
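
For context, the dedup step in trace_compare.go presumably boils down to keeping the first occurrence of each span ID and dropping the rest. A minimal sketch of that behavior (using a stand-in Span type, not the actual Jaeger model) could look like this:

  // Illustrative only: drop spans with repeated span IDs, the step that
  // produced the "Removed spans with duplicate span IDs" line above.
  // The Span type here is a stand-in, not the actual Jaeger model type.
  package main

  import "fmt"

  type Span struct {
      SpanID string
  }

  // dedupeSpans keeps the first occurrence of each span ID and drops the rest.
  func dedupeSpans(spans []Span) []Span {
      seen := make(map[string]struct{}, len(spans))
      out := make([]Span, 0, len(spans))
      for _, s := range spans {
          if _, ok := seen[s.SpanID]; ok {
              continue // duplicate produced by a retried save
          }
          seen[s.SpanID] = struct{}{}
          out = append(out, s)
      }
      return out
  }

  func main() {
      spans := []Span{{"a"}, {"b"}, {"a"}, {"c"}, {"b"}}
      deduped := dedupeSpans(spans)
      fmt.Printf("Removed spans with duplicate span IDs; before=%d, after=%d\n", len(spans), len(deduped))
  }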

This one looks like the test might have started running before the index mappings were created:

  2024-10-15T16:23:38.766Z	error	config/config.go:151	Elasticsearch part of bulk request failed	{"kind": "extension", "name": "jaeger_storage", "map-key": "index", "response": {"_index":"jaeger-main-jaeger-service-2024-10-15","_id":"f0b7d270dd871f4","status":404,"error":{"type":"index_not_found_exception","reason":"no such index [jaeger-main-jaeger-service-2024-10-15]","resource.type":"index_or_alias","resource.id":"jaeger-main-jaeger-service-2024-10-15","index":"jaeger-main-jaeger-service-2024-10-15"}}}
  github.com/jaegertracing/jaeger/pkg/es/config.NewClient.func2
  	/home/runner/work/jaeger/jaeger/pkg/es/config/config.go:151
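
One possible mitigation on the write path would be for the test harness to wait until the index actually exists before issuing requests. A rough sketch using the olivere/elastic client that pkg/es/config is built on (the index name, URL, and timeouts are assumptions, not the actual test wiring):

  // Illustrative only: poll until an index exists before proceeding.
  // Index name, URL, and timing values are assumptions, not the real test setup.
  package main

  import (
      "context"
      "fmt"
      "log"
      "time"

      "github.com/olivere/elastic/v7"
  )

  func waitForIndex(ctx context.Context, client *elastic.Client, index string, timeout time.Duration) error {
      deadline := time.Now().Add(timeout)
      for {
          exists, err := client.IndexExists(index).Do(ctx)
          if err == nil && exists {
              return nil
          }
          if time.Now().After(deadline) {
              return fmt.Errorf("index %q not ready after %v: %v", index, timeout, err)
          }
          time.Sleep(100 * time.Millisecond)
      }
  }

  func main() {
      client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"), elastic.SetSniff(false))
      if err != nil {
          log.Fatal(err)
      }
      if err := waitForIndex(context.Background(), client, "jaeger-main-jaeger-service-2024-10-15", 30*time.Second); err != nil {
          log.Fatal(err)
      }
      log.Println("index exists; safe to write and query")
  }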

The logs from ES for creating the index and failing to query it are quite close in the timestamps:

create index @ 2024-10-15T16:23:38.491Z:

  elasticsearch-1  | {"@timestamp":"2024-10-15T16:23:38.491Z", "log.level": "INFO", "message":"[jaeger-main-jaeger-span-2024-10-15] creating index, cause [auto(bulk api)], templates [jaeger-main-jaeger-span], shards [5]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][masterService#updateTask][T#2]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster"}

query failure @ 2024-10-15T16:23:38.669Z (only 170ms later):

  elasticsearch-1  | {"@timestamp":"2024-10-15T16:23:38.669Z", "log.level": "WARN", "message":"path: /jaeger-main-jaeger-service-2024-10-15%2Cjaeger-main-jaeger-service-2024-10-14%2Cjaeger-main-jaeger-service-2024-10-13%2Cjaeger-main-jaeger-service-2024-10-12/_search, params: {rest_total_hits_as_int=true, ignore_unavailable=true, index=jaeger-main-jaeger-service-2024-10-15,jaeger-main-jaeger-service-2024-10-14,jaeger-main-jaeger-service-2024-10-13,jaeger-main-jaeger-service-2024-10-12}, status: 503", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][transport_worker][T#1]","log.logger":"rest.suppressed","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.action.search.SearchPhaseExecutionException","error.message":"all shards failed","error.stack_trace":"Failed to execute phase [query], all shards failed; shardFailures {[_na_][jaeger-main-jaeger-service-2024-10-15][0]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][1]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][2]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][3]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][4]: org.elasticsearch.action.NoShardAvailableActionException\n}\n\tat 

This could indicate a race condition where writes happen while the index templates are still being created. I have not seen this error in v1 integration tests.
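
On the read path, a hedged option is to block on the cluster health API for the freshly created indices until their shards reach at least yellow status, which is what the NoShardAvailableActionException above is complaining about. A minimal sketch against the plain ES REST API (URL and index pattern are assumptions):

  // Illustrative only: wait for an index's shards to reach at least "yellow"
  // health before querying. URL and index pattern are assumptions for the
  // local ES started by the integration test script.
  package main

  import (
      "fmt"
      "log"
      "net/http"
  )

  func waitForYellow(baseURL, index string) error {
      // Elasticsearch blocks this call until the index reaches the requested
      // status or the timeout elapses (non-200 on timeout).
      url := fmt.Sprintf("%s/_cluster/health/%s?wait_for_status=yellow&timeout=30s", baseURL, index)
      resp, err := http.Get(url)
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return fmt.Errorf("cluster health returned status %d", resp.StatusCode)
      }
      return nil
  }

  func main() {
      if err := waitForYellow("http://localhost:9200", "jaeger-main-jaeger-service-*"); err != nil {
          log.Fatal(err)
      }
      log.Println("index shards are ready; safe to query")
  }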

When running in the local environment, the same issue does not occur. 😅

Depends on how many times you run it. It's intermittent.

The test attempts to create an index and immediately tries to query it, leading to race conditions:

create index @ 2024-10-15T16:23:38.491Z
query failure @ 2024-10-15T16:23:38.669Z (only 170ms later)


Hi @yurishkuro

I'm new to the Jaeger project and would like to work on fixing this ES integration test race condition. I've already reproduced the issue locally; here are my observations:

  1. The race condition occurs when queries are attempted ~170ms after index creation
  2. I can reproduce it consistently using the integration test

There are multiple errors indicating that either Elasticsearch is failing to store spans correctly or the test is querying indices that have not yet been created.

Would you like to assign issue #6094 to me?

Got the duplicate span error also.


How are you planning to fix it?

Hi @yurishkuro,

Sorry for the delay in addressing this issue. I've implemented a fix for the ES integration test race conditions that addresses both problems:

Index Creation Race Condition:

  • Added proper index readiness verification with wait/retry logic
  • Added index caching to prevent duplicate creation attempts

Duplicate Span Detection Issue:

  • Added thread-safe span tracking using sync.Map

The fix has been submitted in PR #6157.
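
A simplified sketch of the sync.Map idea (the actual change in the PR may differ):

  // Illustrative only: track already-written span IDs with sync.Map so that
  // retried saves do not insert duplicates. A sketch of the idea, not the
  // literal code from the PR.
  package main

  import (
      "fmt"
      "sync"
  )

  type spanTracker struct {
      seen sync.Map // span ID -> struct{}
  }

  // shouldWrite reports whether the span ID has not been written yet, and
  // marks it as written. Safe for concurrent use by multiple goroutines.
  func (t *spanTracker) shouldWrite(spanID string) bool {
      _, loaded := t.seen.LoadOrStore(spanID, struct{}{})
      return !loaded
  }

  func main() {
      var tracker spanTracker
      ids := []string{"f0b7d270dd871f4", "f0b7d270dd871f4", "abc123"}
      for _, id := range ids {
          if tracker.shouldWrite(id) {
              fmt.Println("writing span", id)
          } else {
              fmt.Println("skipping duplicate span", id)
          }
      }
  }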

Would appreciate your review and feedback.