elastic/e2e-testing

Fix failing upgrade_agent test suite

Closed this issue · 4 comments

Upgrades from versions prior to 8.6 to an 8.6 snapshot version are broken because of elastic/elastic-agent#1791. This does not apply to upgrades to non-snapshot versions, because snapshot upgrades use a different API for fetching artifacts.

For example, upgrading from 8.5.3-SNAPSHOT or 8.5.3 to 8.6.0-SNAPSHOT will fail, but upgrading from 8.5.3-SNAPSHOT or 8.5.3 to 8.6.0 should succeed.
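The rule above can be written down as a small predicate, which might help when deciding which upgrade paths to skip in the test matrix. This is only a sketch: the function names are made up for illustration and are not part of e2e-testing, and treating every source version older than 8.6 and every target snapshot at 8.6 or later as affected is an assumption based on the description above, not on the agent code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from a version
// string such as "8.5.3" or "8.6.0-SNAPSHOT". Hypothetical helper.
func parseMajorMinor(version string) (major, minor int, err error) {
	base := strings.TrimSuffix(version, "-SNAPSHOT")
	parts := strings.Split(base, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid version %q", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// affectedBySnapshotBug reports whether an upgrade path matches the failure
// pattern described in this issue: the source is older than 8.6 and the
// target is a snapshot of 8.6 or later (assumption, see lead-in).
func affectedBySnapshotBug(from, to string) (bool, error) {
	// Upgrades to released (non-snapshot) versions are not affected.
	if !strings.HasSuffix(to, "-SNAPSHOT") {
		return false, nil
	}
	fromMajor, fromMinor, err := parseMajorMinor(from)
	if err != nil {
		return false, err
	}
	toMajor, toMinor, err := parseMajorMinor(to)
	if err != nil {
		return false, err
	}
	sourceBefore86 := fromMajor < 8 || (fromMajor == 8 && fromMinor < 6)
	targetAtLeast86 := toMajor > 8 || (toMajor == 8 && toMinor >= 6)
	return sourceBefore86 && targetAtLeast86, nil
}

func main() {
	paths := [][2]string{
		{"8.5.3-SNAPSHOT", "8.6.0-SNAPSHOT"},
		{"8.5.3", "8.6.0-SNAPSHOT"},
		{"8.5.3", "8.6.0"},
	}
	for _, p := range paths {
		affected, err := affectedBySnapshotBug(p[0], p[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s => %s affected: %t\n", p[0], p[1], affected)
	}
}
```

A test suite could call something like this to skip an affected path instead of deleting it outright.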

We need to fix the agent upgrade feature tests that are affected by this:

Ideally those tests would not use snapshot versions as the final version in the tests, but if snapshot versions are the only solution we can just remove the upgrade tests for affected versions before 8.6.

The e2e upgrade tests use releases as the initial version, not SNAPSHOTs; the upgrades are from 8.5.3 to 8.7.0-SNAPSHOT or from 8.5.3 to 8.6.0-SNAPSHOT.

This is the state of the agent upgrade test according to the CI results (latest is 8.7.0-SNAPSHOT):
✔️ latest => latest (obviously)
✔️ 8.6.0 => latest
✔️ 8.5.3 => latest
❌ 8.4.3 => latest
✔️ 8.3.3 => latest
❌ 8.2.3 => latest
❌ 8.1.3 => latest
❌ 7.17.8 => latest

From the Elastic Agent logs we collected in the CI run for version 8.4.3, we can see that the agent is trying to fetch the wrong package while upgrading:

{
    "log.level": "error",
    "@timestamp": "2023-01-31T10:30:30.331Z",
    "log.origin": {
        "file.name": "log/reporter.go",
        "file.line": 36
    },
    "message": "2023-01-31T10:30:30Z - message: Application: [0c1ab94d-e368-48a7-9e71-f25581ec2257]: State changed to FAILED: initiating fetcher: failed to detect remote snapshot repo, proceeding with configured: not an agent uri: 'https://snapshots.elastic.co/8.7.0-4687ed94/downloads/elastic-agent-shipper/elastic-agent-shipper-8.7.0-SNAPSHOT-windows-x86.zip' - type: 'ERROR' - sub_type: 'FAILED'",
    "ecs.version": "1.6.0"
}

I will check if all the other failing 8.x agent runs have the same problem.
If we are not releasing any more versions in the range 8.0.0 <= version <= 8.4.3, I think we can just remove the test cases for those versions, since even backporting the fix would be pointless without future releases.

Version 7.17.8 fails to enroll with the fleet server according to the CI logs:

[2023-01-31T11:00:27.018Z] time="2023-01-31T11:00:26Z" level=error msg="Error executing command" args="[install --e --force --insecure --enrollment-token=Ni0xN0I0WUJ5X1NJeTQ0X2dLeWs6cS04ZTVCVFRUbjZ3ZW9xU1B6YmhCUQ== --url http://3.145.73.13:8220/]" baseDir=. command=/root/.op/elastic-agent/elastic-agent/elastic-agent env="map[]" error="exit status 1" stderr="2023-01-31T11:00:23.461Z\tWARN\t[tls]\ttlscommon/tls_config.go:101\tSSL/TLS verifications disabled.\n2023-01-31T11:00:24.230Z\tINFO\tcmd/enroll_cmd.go:454\tStarting enrollment to URL: http://3.145.73.13:8220/\nError: fail to enroll: fail to execute request to fleet-server: status code: 400, fleet-server returned an error: BadRequest, message: FindEnrollmentAPIKey: elastic fail 401: security_exception: token expired\nFor help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/7.17/fleet-troubleshooting.html\nError: enroll command failed with exit code: 1\nFor help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/7.17/fleet-troubleshooting.html\n"

It seems that fleet server is returning an HTTP 400 status code.
I will try to test manually whether it's possible to enroll Elastic Agent 7.x with fleet 8.7.0-snapshot, and also with a pre-V2 version (8.5.x maybe), to check whether 8.6 broke compatibility with older 7.x agents.

failed to detect remote snapshot repo, proceeding with configured: not an agent uri: 'https://snapshots.elastic.co/8.7.0-4687ed94/downloads/elastic-agent-shipper/elastic-agent-shipper-8.7.0-SNAPSHOT-windows-x86.zip' - type: 'ERROR' - sub_type: 'FAILED'",

@pchila that is exactly the problem fixed by elastic/elastic-agent#1791 in 8.6.0. The elastic-agent-shipper binary was added in 8.6.0, and older versions of the agent didn't expect to find anything but a binary with the name elastic-agent at that URL.

The upgrade e2e test is now working correctly from 7.17 to 8.x.

The agent is still crashing after the upgrade, but that will be fixed in a separate issue.