pact-foundation/pact_broker

Timeout occurs during contract verification - querying matrix until timeout

MarcinOsiadacz opened this issue · 13 comments

Description

I’ve an issue that for some of Consumer’s branches, timeout occurs during contract verification. In Pact Broker logs I can see that it retries PactBroker::Matrix::Service -- Querying matrix – action for an hour until timeout. After that, canIDeploy ends with:

* What went wrong:
Can you deploy? Computer says no ¯\_(ツ)_/¯

This issue occurs only for some Consumer's branches and 1 out of 2 providers. I'm newbie in PACT, so any advice / lead that could help me investigate and fix this will be appreciated. I'll do my best to provide any additional information that could help.

Pre issue-raising checklist

I have already (please mark the applicable with an x):

  • Upgraded to the latest Pact Broker OR
  • Checked the CHANGELOG to see if the issue I am about to raise has been fixed
  • Created an executable example that demonstrates the issue using either a:
    • Dockerfile
    • Git repository with a Travis or Appveyor (or similar) build

I've also tried executing this clean-up script on the database, but it did not help: https://github.com/pact-foundation/pact_broker/blob/master/script/prod/clean-up.sql

Software versions

  • pact-broker docker version: pactfoundation/pact-broker:2.98.0.1
  • gradle plugin: au.com.dius.pact 4.1.31

Expected behaviour

If PACT is verified successfully, PactBroker::Matrix::Service -- Querying matrix -- is responded, and results are displayed in the matrix as below (and it works fine for majority of Consumer's branches):
image

Actual behaviour

Timeout occurs during contract verification. In Pact Broker logs I can see that it retries PactBroker::Matrix::Service -- Querying matrix – action for an hour until timeout. After that, canIDeploy ends with:

* What went wrong:
Can you deploy? Computer says no ¯\_(ツ)_/¯

After that, results displayed in the matrix look like below (Pact verification result is missing for Provider B):
image

However, when I open the published pact by clicking on the link marked with an arrow, there is a verified badge in there on top-right:
image

Can you upgrade to the latest version please. There have been some performance improvements to the matrix queries over the last few months.

Can you also turn on the cleaning feature as documented here https://docs.pact.io/pact_broker/administration/maintenance and see if that fixes the issue.

Can you check the value of PACT_BROKER_DATABASE_STATEMENT_TIMEOUT. It should be timing out after 15 seconds unless it has been changed.

If those things don't fix the issue, then if you're able to do a dump of the database for me, then that would be the ideal way for me fix the problem.

export PGUSER=
export PGPASSWORD=
export PGHOST=
export PGDATABASE= 
pg_dump --no-acl --no-owner --file pact_broker.dump

You can redact the data if you need to by using this script (you'll need to import it into another database, then run the script, then export it again)

https://github.com/pact-foundation/pact_broker/blob/master/script/prod/redact-data.sql

I'll also need you to turn on the HTTP debug logging for the verification task and share those logs, so I can see the HTTP request that is made.

If you can't give me a database dump, then we'll have to do what we can with the logs. Set
PACT_BROKER_LOG_LEVEL=debug, PACT_BROKER_SQL_LOG_LEVEL=debug and PACT_BROKER_SQL_ENABLE_CALLER_LOGGING=true and reproduce the issue, and then share the logs.

Hello @bethesque! Thank you for the response.

  1. The latest version I can upgrade to is 2.101.0.0, as the next one drops Docker 19 support. Unfortunately, I'm limited by the docker version at the moment. Upgrade did not help.
  2. I've run clean-up script manually yesterday as I mentioned in pre issue-raising checklist and it did not help. I've also scheduled automatic clean-up and it had removed some data (I set PACT_BROKER_DATABASE_CLEAN_OVERWRITTEN_DATA_MAX_AGE to '0' to get rid of garbage instantly), but it did not help either. Btw. I've got a quick question regarding keep selectors - are such expressions with * handled {"pacticipant": "ConsumerName", "tag": "production*"} so I could keep all version with tag starting with production word?
  3. Can you please tell me where can I check value of the PACT_BROKER_DATABASE_STATEMENT_TIMEOUT? There is no such env variable defined in docker-compose file. I've also verified env variables directly on pact-broker docker container and there is no such var:
    image

I'll do my best to provide you more information as soon as possible. For now I share short summary with conclusions:

  1. In canIDeploy logs I can see, that it tries to verify the contract with the latest version of the provider f368238 and it fails as the verification result is not found (timeout after 1 hour)
  2. In Matrix UI there is no verification result displayed for affected consumer version and one provider's latest versions. However, the contract is published as I can click on it in UI and there is also a "verified" badge in there.
  3. In the logs of pact broker I can see contract_published event is produced, and then contract_content_unchanged. This is followed by a message that No enabled webhooks found for... both of mentioned event types. After that PactBroker::Pacts::Service -- Updating existing pact publication action is performed for both contracts (for both provides, one of them works as expected) and then it starts querying matrix.

It looks like pact broker expects that there is already verified result for this contract SHA with latest provider version, however it does not find it in given time. No verification process is actually triggered on provider's side in this case. Maybe it would be a good idea to check if this contract exists and if it has been verified for this consumer version and latest provider version directly in database?

Please find a link to logs on OneDrive with env variables set as requested. Some data in logs have been redacted.

There have been some significant performance enhancements since the version you're on, and I would expect that the issue you're having has been fixed, as we're not having that particular problem on Pactflow.

If you're stuck on Docker 19, I suggest that you build and push your own image with the latest pact broker code on it. You can copy from this repo, and just revert the recent changes to the Dockerfile that updated the base image.

@bethesque I've upgraded docker engine to 20.10.21, and Pact-Broker to the latest version afterwards. Unfortunately, it didn't help.

Bummer! Can you get me the logs with the new version please. The stack traces will at least match up with the code now.

Oh, and if you haven't set a statement time out, it should be defaulting to 15 seconds.

Bummer! Can you get me the logs with the new version please. The stack traces will at least match up with the code now.

@bethesque sure, here you go: logs on OneDrive. These are from new version.

Oh, and if you haven't set a statement time out, it should be defaulting to 15 seconds.

Yeah, I did not set this anywhere. No such env var defined in docker-compose.yml. I can define it and set higher value if you think it could help.

Looking at the logs, none of the queries are taking particularly long, so I don't think setting the timeout will help.

I'm trying to work out where the HTTP request has come in, and which exact request it is. It looks like the can-i-deploy request is being sent over and over. Do you have the --retry-while-unknown and --retry-interval values set?

Can you go over exactly what happens, because I'm still not sure I understood the problem correctly. Please go step by step, and include the exact commands that are used in the build. When you say "Timeout occurs during contract verification", I thought you meant there was an HTTP timeout that occurred when the verification results were published, but I can see from the logs that's not the case.

Ok, so first of all we are using gradle plugin: id "au.com.dius.pact" version "4.1.31"

Following values are set in gradle:
retryCountWhileUnknown = 60
retryWhileUnknownInterval = 60

What happens step by step:

  1. Tests passed, PACT jsons are generated.
  2. pactPublish gradle task is executed. This one ends with success.
    Publishing '<PROVIDER_B_25>.json' with tags <CONSUMERS_AFFECTED_BRANCH> ... OK
    Publishing '<PROVIDER_A_26>.json' with tags <CONSUMERS_AFFECTED_BRANCH> ... OK
  3. hasValidPactWith... gradle task is executed. This one ends with failure (as you can see, for one out of two providers only). 'Pact Verified' result is not displayed for this one failed provider in the Matrix.
    canIDeploy
    Computer says no ¯\_(ツ)_/¯
    There is no verified pact between version <version-a22ca9c> of <CONSUMER_24> and the latest version of <PROVIDER_B_25> with tag <MAIN_TAG> (<version-f368238>)

So it looks like the timeout occurs on canIDeploy after around 60 minutes (which is probably related with values you've asked for). I've tried to run canIDeploy locally (most recent version, newly created container from docker image) and the result is the same.

How is the provider verification being triggered? Can you share the configuration. Are you passing through the pact URL as described in https://docs.pact.io/pact_nirvana/step_6#add-a-new-provider-verification-job

Hi,

Provider verification is triggered with POST Request. We're using Jenkins. Pact URL is passed as a query parameter with a mentioned POST Request. It was set up using https://docs.pact.io/pact_broker/webhooks/template_library.
Below you can find redacted webhooks. Let me know if you need to verify some other configurations.

{
    "uuid": "<webhook_uuid>",
    "description": "Trigger PROVIDER_A_26 verification on Jenkins when contract requires verification",
    "consumer": {
        "name": "CONSUMER_24"
    },
    "provider": {
        "name": "PROVIDER_A_26"
    },
    "enabled": true,
    "request": {
        "method": "POST",
        "url": "<VERIFICATION_JOB_URL>?service=PROVIDER_A_26&pactUrl=${pactbroker.pactUrl}",
        "headers": {
            "Accept": "application/json"
        },
        "username": "<user>",
        "password": "<password>"
    },
    "events": [
        {
            "name": "contract_requiring_verification_published"
        }
    ],
    "createdAt": "2022-03-23T16:53:12+00:00",
    "_links": "..."
}
{
    "uuid": "<webhook_uuid>",
    "description": "Trigger PROVIDER_B_25 verification on Jenkins when contract requires verification",
    "consumer": {
        "name": "CONSUMER_24"
    },
    "provider": {
        "name": "PROVIDER_B_25"
    },
    "enabled": true,
    "request": {
        "method": "POST",
        "url": "<VERIFICATION_JOB_URL>?service=PROVIDER_B_25&pactUrl=${pactbroker.pactUrl}",
        "headers": {
            "Accept": "application/json"
        },
        "username": "<user>",
        "password": "<password>"
    },
    "events": [
        {
            "name": "contract_requiring_verification_published"
        }
    ],
    "createdAt": "2022-03-23T16:52:27+00:00",
    "_links": "..."
}

However, it is triggered only for event contract_requiring_verification_published which is not our case as during bug replication another events occurs - contract_published, and then contract_content_unchanged.

When event contract_requiring_verification_published occurs, verification is triggered successfully. Also, in most cases everything works fine for contract_content_unchanged. The issue occurs only on a few branches. It like like pact broker thinks that there is a verification result for this contract, but there isn't or there is an issue with getting it.

Sorry I haven't responded to this for so long. I've been on leave over Christmas. I hope to be able to get back to this soon, but I have a large backlog to catch up with.