ES 8.x / v2 integration test often fails
yurishkuro opened this issue · 7 comments
https://github.com/jaegertracing/jaeger/actions/runs/11341694226/job/31567998272?pr=6091
The test is running this script: bash scripts/es-integration-test.sh elasticsearch 8.x v2
Some errors that might explain the failure:
This one looks like a retried save resulted in duplicate spans in the storage:
=== NAME TestElasticsearchStorage/GetTrace
integration.go:279:
Error Trace: /home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:279
Error: Should be true
Test: TestElasticsearchStorage/GetTrace
trace_compare.go:66: Removed spans with duplicate span IDs; before=10, after=5
This one looks like the test might have started running before the index mappings were created:
2024-10-15T16:23:38.766Z error config/config.go:151 Elasticsearch part of bulk request failed {"kind": "extension", "name": "jaeger_storage", "map-key": "index", "response": {"_index":"jaeger-main-jaeger-service-2024-10-15","_id":"f0b7d270dd871f4","status":404,"error":{"type":"index_not_found_exception","reason":"no such index [jaeger-main-jaeger-service-2024-10-15]","resource.type":"index_or_alias","resource.id":"jaeger-main-jaeger-service-2024-10-15","index":"jaeger-main-jaeger-service-2024-10-15"}}}
github.com/jaegertracing/jaeger/pkg/es/config.NewClient.func2
/home/runner/work/jaeger/jaeger/pkg/es/config/config.go:151
The ES logs for creating the index and for the failing query have very close timestamps:
create index @ 2024-10-15T16:23:38.491Z:
elasticsearch-1 | {"@timestamp":"2024-10-15T16:23:38.491Z", "log.level": "INFO", "message":"[jaeger-main-jaeger-span-2024-10-15] creating index, cause [auto(bulk api)], templates [jaeger-main-jaeger-span], shards [5]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][masterService#updateTask][T#2]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster"}
query failure @ 2024-10-15T16:23:38.669Z (only 170ms later):
elasticsearch-1 | {"@timestamp":"2024-10-15T16:23:38.669Z", "log.level": "WARN", "message":"path: /jaeger-main-jaeger-service-2024-10-15%2Cjaeger-main-jaeger-service-2024-10-14%2Cjaeger-main-jaeger-service-2024-10-13%2Cjaeger-main-jaeger-service-2024-10-12/_search, params: {rest_total_hits_as_int=true, ignore_unavailable=true, index=jaeger-main-jaeger-service-2024-10-15,jaeger-main-jaeger-service-2024-10-14,jaeger-main-jaeger-service-2024-10-13,jaeger-main-jaeger-service-2024-10-12}, status: 503", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][transport_worker][T#1]","log.logger":"rest.suppressed","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.action.search.SearchPhaseExecutionException","error.message":"all shards failed","error.stack_trace":"Failed to execute phase [query], all shards failed; shardFailures {[_na_][jaeger-main-jaeger-service-2024-10-15][0]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][1]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][2]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][3]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][4]: org.elasticsearch.action.NoShardAvailableActionException\n}\n\tat
This could indicate a race condition where writes happen too quickly, while the index templates are still being created. I have not seen this error in the v1 integration tests.
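For reference, a minimal sketch of what an explicit readiness wait could look like in the test setup, polling the standard `_cluster/health` API for the index before issuing the first query. The function name and parameters are illustrative, not the actual integration-test code:

```go
package integration

import (
	"fmt"
	"net/http"
	"time"
)

// waitForIndexReady polls Elasticsearch's cluster health API for a specific
// index until its primary shards are allocated (status >= yellow), so that
// searches issued right after the first bulk write do not hit
// NoShardAvailableActionException as seen in the log above.
// esURL would be something like "http://localhost:9200" in the test.
func waitForIndexReady(esURL, index string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	healthURL := fmt.Sprintf("%s/_cluster/health/%s?wait_for_status=yellow&timeout=5s", esURL, index)
	for time.Now().Before(deadline) {
		resp, err := http.Get(healthURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // index exists and primaries are assigned
			}
		}
		// Index may not exist yet, or shards are still initializing; retry shortly.
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("index %q not ready after %v", index, timeout)
}
```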
When running in the local environment, the same issue does not occur.
Depends on how many times you run it. It's intermittent.
The test attempts to create an index and immediately tries to query it, leading to a race condition:
create index @ 2024-10-15T16:23:38.491Z
query failure @ 2024-10-15T16:23:38.669Z (only 170ms later)
Hi @yurishkuro
I'm new to the Jaeger project and would like to work on fixing this ES integration test race condition. I've already reproduced the issue locally and here are my observations:
- The race condition occurs when queries are attempted ~170ms after index creation
- I can reproduce it consistently using the integration test
There are multiple errors indicating that Elasticsearch is either failing to store spans correctly or that the test is querying indices that have not yet been created.
Would you like to assign issue #6094 to me?
Got the duplicate span error as well.
How are you planning to fix it?
Hi @yurishkuro,
Sorry for the delay in addressing this issue. I've implemented a fix for the ES integration test race conditions that addresses both problems:
Index Creation Race Condition:
- Added proper index readiness verification with wait/retry logic
- Added index caching to prevent duplicate creation attempts
Duplicate Span Detection Issue:
- Added thread-safe span tracking using sync.Map
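For context, a rough sketch of the sync.Map idea (illustrative only, not the code in the PR; spanTracker and filterNew are hypothetical names, and it assumes the test works with Jaeger's model.Span type). Keying on span ID matches the dedup that trace_compare.go reports ("Removed spans with duplicate span IDs"):

```go
package integration

import (
	"sync"

	"github.com/jaegertracing/jaeger/model"
)

// spanTracker remembers which span IDs have already been written, so that a
// retried bulk save does not show up as duplicate spans when traces are
// compared later. sync.Map keeps this safe if writes and retries happen on
// different goroutines.
type spanTracker struct {
	seen sync.Map // model.SpanID -> struct{}
}

// filterNew returns only the spans whose IDs have not been recorded yet,
// recording them as a side effect.
func (t *spanTracker) filterNew(spans []*model.Span) []*model.Span {
	out := make([]*model.Span, 0, len(spans))
	for _, s := range spans {
		if _, loaded := t.seen.LoadOrStore(s.SpanID, struct{}{}); !loaded {
			out = append(out, s)
		}
	}
	return out
}
```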
The fix has been submitted in PR #6157
Would appreciate your review and feedback.