advancedtelematic/ota-tuf

targets.json "lost" in my default namespace

Closed this issue · 25 comments

I've got an OTA+ Community Edition deployment that seems to have lost the ability to serve a targets.json over the weekend. I haven't changed any containers or configuration in a week or two. I've done a couple of key rotations using these directions:

https://docs.atsgarage.com/prod/rotating-signing-keys.html

And I have two other namespaces (we call them premerge and postmerge) that continue to work just fine with the same types of administration being done to them.

The issue I'm seeing seems to be this:

curl https://api.foundries.io/lmp/repo/release/api/v1/user_repo/targets.json  | json_pp 
{
   "cause" : null,
   "description" : "KeyserverHttpClient|Unexpected response from remote server at http://tuf-keyserver:80/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets|POST|500|{\"error_id\":\"51bf58a6-1555-4394-a694-998986abab5f\",\"description\":\"Unexpected response from vault: HttpResponse(403 Forbidden,List(Cache-Control: no-store, Date: Tue, 29 May 2018 21:39:59 GMT),HttpEntity.Strict(application/json,{\\\"errors\\\":[\\\"permission denied\\\"]}\\n),HttpProtocol(HTTP/1.1))\"}",
   "errorId" : "1976bdb9-e340-40ba-ab0c-7b77d3c49006",
   "code" : "remote_service_error"
}

I have no idea how to debug this, and looking into the DB confused me more, as I see no link between the repo IDs in the reposerver DB and the repo IDs in the keyserver DB:

MariaDB [(none)]> select * from tuf_reposerver.repo_namespaces;
+-----------+--------------------------------------+-------------------------+-------------------------+
| namespace | repo_id                              | created_at              | updated_at              |
+-----------+--------------------------------------+-------------------------+-------------------------+
| default   | d332b2a2-25a3-4262-a79f-f373d6891723 | 2018-04-27 16:06:14.731 | 2018-04-27 16:06:14.731 |
| postmerge | 751c8866-68de-4b8b-8430-8c9c4acf9702 | 2018-05-02 20:21:26.858 | 2018-05-02 20:21:26.858 |
| premerge  | 33781d4e-15c9-4a85-82a4-281ba40ed8ae | 2018-05-02 20:22:57.274 | 2018-05-02 20:22:57.274 |
+-----------+--------------------------------------+-------------------------+-------------------------+
3 rows in set (0.014 sec)

MariaDB [(none)]> select key_id, repo_id  from tuf_keyserver.keys where role_type = 'TARGETS';
+------------------------------------------------------------------+--------------------------------------+
| key_id                                                           | repo_id                              |
+------------------------------------------------------------------+--------------------------------------+
| e206bfd10411449fdeeb4eeffe17caa7ed6b82a55ec412ae94a53b8974e8f671 | 5da9ea01-e719-4134-bbbd-f5f0f1aedac8 |
| 53f98c81dde991d3f057d4e5fb729deb91e506acbcffd0f98b7be810377101ae | c3bf6068-558f-41b9-9d18-fd96702dd8fc |
| fc7cc0e84b19469c327115091cc3457bdfeeb9d7e3624e28dc328982c2726af5 | d48c5702-b213-477b-a6c5-1d22fc57434f |
+------------------------------------------------------------------+--------------------------------------+
3 rows in set (0.002 sec)

The key ids for each repo/namespace as I believe them to be are:

default keyid = 6318f6398dec996f554e1f55440475ee520c829eac92bc9db5c49011a6ee7ddf
postmerge keyid = b5b3b9291dbb7b15fe6d5a1186282a63d7573921c0f69c85e921739c699ad374
premerge keyid = 9a32b5e73f1a679c9df09de5c65c93786b3d9322007301b7b00d4e2bc3fcda7

The logs I get from the reposerver seem to be this:

I|21:34:30.177|akka.actor.ActorSystemImpl|method=GET path=/api/v1/user_repo/targets.json query='' service_name=reposerver stime=21 status=502
E|21:34:30.374|akka.actor.ActorSystemImpl|An error occurred. ErrorId: Some(bbd42a6d-8cbe-4b34-b73d-59b3059ccf8c) {"code":"remote_service_error","description":"KeyserverHttpClient|Unexpected response from remote server at http://tuf-keyserver:80/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets|POST|500|{\"error_id\":\"c247a417-aa37-42a5-aaee-824c96904348\",\"description\":\"Unexpected response from vault: HttpResponse(403 Forbidden,List(Cache-Control: no-store, Date: Tue, 29 May 2018 21:34:30 GMT),HttpEntity.Strict(application/json,{\\\"errors\\\":[\\\"permission denied\\\"]}\\n),HttpProtocol(HTTP/1.1))\"}","cause":null,"errorId":"bbd42a6d-8cbe-4b34-b73d-59b3059ccf8c"} 

The keyserver logs seem to mostly cycle between messages like this:

I|19:53:58.319|akka.actor.ActorSystemImpl|method=POST path=/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets query='' service_name=keyserver stime=4 status=500
I|19:53:58.703|akka.actor.ActorSystemImpl|method=GET path=/api/v1/root/c3bf6068-558f-41b9-9d18-fd96702dd8fc query='' service_name=keyserver stime=2 status=200
I|19:53:58.787|akka.actor.ActorSystemImpl|method=GET path=/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723 query='' service_name=keyserver stime=2 status=200
E|19:53:59.027|akka.actor.ActorSystemImpl|Request error 3c84681e-7b82-4534-88d9-0084cc61ccf3 (http://tuf-keyserver/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets)
com.advancedtelematic.tuf.keyserver.vault.VaultHttpClient$VaultError: Unexpected response from vault: HttpResponse(403 Forbidden,List(Cache-Control: no-store, Date: Tue, 29 May 2018 19:53:59 GMT),HttpEntity.Strict(application/json,{"errors":["permission denied"]}
),HttpProtocol(HTTP/1.1))
I|19:53:59.027|akka.actor.ActorSystemImpl|method=POST path=/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets query='' service_name=keyserver stime=4 status=500
E|19:53:59.404|akka.actor.ActorSystemImpl|Request error 48ea6dca-760a-4416-baa7-8a3533080d68 (http://tuf-keyserver/api/v1/root/d332b2a2-25a3-4262-a79f-f373d6891723/targets)
com.advancedtelematic.tuf.keyserver.vault.VaultHttpClient$VaultError: Unexpected response from vault: HttpResponse(403 Forbidden,List(Cache-Control: no-store, Date: Tue, 29 May 2018 19:53:59 GMT),HttpEntity.Strict(application/json,{"errors":["permission denied"]}
),HttpProtocol(HTTP/1.1))
simao commented

Hi,

It seems either vault sealed itself and you'll need to unseal it, or the token is no longer valid, because it was not renewed for example. The tokens are renewed by the keyserver daemon.

Which versions are you using for tuf-reposerver and tuf-keyserver?

When you rotate keys offline, the old keys are deleted from the database and you no longer have a link between keys and repos in ota-tuf.
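A minimal sketch of how the two possibilities above (sealed vault vs. invalid token) could be told apart from a shell that has the vault client installed. It assumes the Vault 0.6.x CLI, where the subcommands are `vault status` and `vault token-lookup` (newer releases spell the latter `vault token lookup`). The function name `check_vault` and both arguments are hypothetical placeholders.

```shell
# Hypothetical helper: report the seal state of Vault and whether a token is still known to it.
# $1 = a token allowed to look up other tokens (e.g. the root token)
# $2 = the token to inspect (e.g. the one the keyserver uses)
check_vault() {
  admin_token=$1
  service_token=$2

  # Prints the seal state; a sealed Vault must be unsealed before anything else works.
  vault status || echo "vault status failed: Vault may be sealed or unreachable"

  # Prints the TTL and policies of the token, or an error if Vault does not know it.
  VAULT_TOKEN="$admin_token" vault token-lookup "$service_token"
}

# Usage (placeholders, not real values):
#   export VAULT_ADDR=http://127.0.0.1:8200
#   check_vault '<root token>' '<token used by keyserver>'
```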

simao commented

The other namespaces continue to work because they did not expire and so they do not require resigning, so no private keys are required. Private keys are kept in vault for your version.

Another question while on the topic of vaults. It looks like you guys have removed the need for a vault:

#166

Is that the case? Is this something that can be enabled today?

simao commented

The recovery procedure would be to manually renew the token using a vault root token

VAULT_TOKEN='root-token' vault token  renew <service token>   

And then it should work again.

Using reposerver and keyserver at different versions is currently not supported. These services should run with the same version.

Vault support was removed in 0.4.0. So when you deploy keyserver 0.4.0 with DB_MIGRATE=true, your keys will be migrated to the db using column encryption.

You'll need to set up keyserver with appropriate db encryption parameters, which can be done like this: https://github.com/advancedtelematic/libats/tree/master/libats-slick#configuration

Unfortunately, due to a bug that is present in keyserver < 0.4.0, snapshot and timestamp keys will no longer be in the database, so if you rotated offline before 0.4.0 and timestamp.json/snapshot.json/targets.json expires you'll get an error on the client. I recommend you start with a new namespace, or you'll need to manually fetch the private key from vault and insert into the database with the proper encryption.

The error you got means you need to first pull the latest targets before you can push a new one.

On 05/30/2018 09:42 AM, Simão Mata wrote:

The recovery procedure would be to manually renew the token using a vault root token

VAULT_TOKEN='root-token' vault token renew <service token>

I'm sure I'm just doing something dumb now, but I'm having trouble running that command. Here's what I have:

vault --version
Vault v0.6.5 ('5d8d702f33b5fd965cbe8d6d0728295de813a196')

There is no "vault token renew", but I tried:

# VAULT_TOKEN='root-token' vault token-renew targets
Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew
Code: 403. Errors:

* permission denied

and

# VAULT_TOKEN='root-token' vault token-renew TARGETS
Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew
Code: 403. Errors:

* permission denied
simao commented

It looks like your root token is not valid?

Also, instead of TARGETS you'll need to use the token keyserver is using:

VAULT_TOKEN='some token' vault token-renew 'some-token'

I'm struggling to understand what these parameters should be, but I think I'm a little closer. I got a shell into my tuf-keyserver and ran:

 $ env | grep TOKEN
 TUF_VAULT_TOKEN=<some-value-ive-redacted>

I've tried to use that value, e.g.:

$ VAULT_TOKEN='<some-value-ive-redacted>' vault token-renew <some-value-ive-redacted>
Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew
Code: 403. Errors:

* permission denied

I also tried doing: VAULT_TOKEN='<some-value-ive-redacted>' vault token-renew d332b2a2-25a3-4262-a79f-f373d6891723. Where 'd332b2a2-25a3-4262-a79f-f373d6891723' is the repoid that seems to be failing when you look at the logs above.

I'm not sure how I gather the proper inputs from my environment.

simao commented

The value for the root token is some value you got when bootstrapping vault. How did you unseal and initialize vault? It should have returned some root token that you can use here.

VAULT_TOKEN='' vault token-renew

ota-community-edition unseals with logic that looks like:

 kubectl --namespace mgmt get secrets tuf-vault-init -o json | jq -r .data.keys | base64 --decode | awk 'BEGIN {print "export VAULT_ADDR=http://127.0.0.1:8200"} {print "vault unseal "$0} END   {print "vault status"}'
export VAULT_ADDR=http://127.0.0.1:8200
vault unseal <redacted value1>
vault unseal <redacted value2>
vault unseal <redacted value3>
vault unseal <redacted value4>
vault unseal <redacted value5>
vault status

I tried commands like this, but they all fail the same:

VAULT_TOKEN='<redacted value1>' vault token-renew 
Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew-self
Code: 403. Errors:

* permission denied

I also tried with VAULT_TOKEN='<redacted value1>' vault token-renew <redacted value1> but that fails as well.

simao commented

@taheris @alexhumphreys is there a way to get the vault root token out of ota-community edition?

After bootstrapping the root token will be placed under key root in the secret tuf-vault-init.
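As a sketch, the root token could be read out of that secret much like the unseal keys were read earlier in this thread. This version uses kubectl's jsonpath output instead of jq; the `mgmt` namespace default is taken from the earlier unseal command, and the helper name `vault_root_token` is hypothetical.

```shell
# Hypothetical helper: print the Vault root token stored under key "root"
# in the tuf-vault-init secret. $1 = Kubernetes namespace (assumed default: mgmt).
vault_root_token() {
  # Secret values are base64-encoded in the API response, so decode after extracting.
  kubectl --namespace "${1:-mgmt}" get secrets tuf-vault-init -o jsonpath='{.data.root}' \
    | base64 --decode
}

# Usage:
#   ROOT_TOKEN=$(vault_root_token)
```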

Okay - I'm getting a new error, so I think it could be progress. I now get:

VAULT_TOKEN='<VAULT_FROM tuf-vault-init.root>' vault token-renew
Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew-self
Code: 400. Errors:

* lease not found or lease is not renewable
simao commented

You need to do:

VAULT_TOKEN='<VAULT_FROM tuf-vault-init.root>' vault token-renew <token used in keyserver env>

I've tried a couple of things to determine that token value. The main one was the value 'd332b2a2-25a3-4262-a79f-f373d6891723' from my logs. I also tried the keyid for my targets.json. Should I be trying one of the keys from my k8s tuf-vault-init.keys value? There are 5 of them, so I wasn't sure which one would be appropriate.

simao commented

You should be using the value of the TUF_VAULT_TOKEN environment variable in the tuf-keyserver-container.

You could get this value using, for example:

kubectl exec ota-tuf-keyserver-<your pod id> -i -t  env | grep TUF_VAULT_TOKEN
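Putting the pieces together, a hedged sketch of the renewal, assuming the Vault 0.6.x `token-renew` subcommand used elsewhere in this thread. Both function names and their arguments are hypothetical, and the pod name placeholder is left unfilled.

```shell
# Hypothetical helper: renew the keyserver's service token, authenticating as root.
# $1 = root token (key "root" of the secret tuf-vault-init)
# $2 = value of TUF_VAULT_TOKEN from the tuf-keyserver container
renew_keyserver_token() {
  VAULT_TOKEN="$1" vault token-renew "$2"
}

# Hypothetical helper: read `env` output on stdin and print only the
# value of TUF_VAULT_TOKEN, without the "TUF_VAULT_TOKEN=" prefix.
extract_tuf_vault_token() {
  sed -n 's/^TUF_VAULT_TOKEN=//p'
}

# Usage (pod name is a placeholder):
#   SERVICE_TOKEN=$(kubectl exec ota-tuf-keyserver-<your pod id> -- env | extract_tuf_vault_token)
#   renew_keyserver_token "$ROOT_TOKEN" "$SERVICE_TOKEN"
```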

I'm finally understanding things a bit now. However, vault claims the TUF_VAULT_TOKEN doesn't exist either. To be clear, I ran:
VAULT_TOKEN='<VAULT_FROM tuf-vault-init.root>' vault token-renew <TUF_VAULT_TOKEN_FROM_KEY_SERVER_ENV>

This seems nuts, because if that's not working, wouldn't all of my namespaces also be broken right now?

For educational purposes I was wondering if it would be possible to:

  1. create a new namespace, say "foo"
  2. rotate the keys on foo so that they are the same ones from this namespace that's not working.
  3. do a garage-push to get all the stuff i want into the new namespace so that it looks like the old one
  4. do some nginx trickery on my gateway and route $bad-namespace->$new-namespace
  5. go into mysql device_registry and update the namespace for affected devices
simao commented

Yeah that is weird, that token should exist.

In principle those steps might work, but I cannot be sure because there are other services and I am not sure how much they depend on proper initialization of namespaces and devices. From steps 1-4 everything should work, but I am almost sure moving a device from one namespace to another would require more than a simple update on device registry.

simao commented

Where are you pointing the vault client? Are you using the same vault as the tuf-keyserver container?

Yep. I tracked down the routing from tuf-keyserver and it goes through the tuf-vault service, which then routes to the tuf-vault pod that I have a shell on.

simao commented

That is weird. What is the exact vault message? Token does not exist? What is the env of vault?

You could try creating the token again:

VAULT_TOKEN='' vault token-create -id= -policy ota-tuf

It was just a not-found:

Error renewing token: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/token/renew
Code: 400. Errors:

* token not found

So creating things helps ... for a moment. The token is only valid for 30 seconds. During that time things seem to look okay (I can pull down targets.json from tuf-reposerver). I'm trying the token-renew command now, but it doesn't seem to increase the lifetime of the token. I've tried with -increment=600 but still to no effect.

edit: I added "-period=72h" to my token-create command and things are running. I'm going to hope the tuf-keyserver-daemon will renew it for me now.
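For reference, a sketch of what that recreation might look like as one command. The `-id`, `-policy` and `-period` flags are the ones mentioned in this thread for the Vault 0.6.x `token-create` subcommand. Passing the keyserver's existing TUF_VAULT_TOKEN value as the id is an assumption (so the running container keeps working without a config change), and the function name and argument order are hypothetical.

```shell
# Hypothetical helper: recreate the keyserver token as a periodic token.
# $1 = root token
# $2 = token id to create (assumed: the TUF_VAULT_TOKEN value the keyserver already uses)
# $3 = renewal period (defaults to the 72h used in this thread)
create_keyserver_token() {
  VAULT_TOKEN="$1" vault token-create -id="$2" -policy=ota-tuf -period="${3:-72h}"
}

# Usage (placeholders, not real values):
#   create_keyserver_token '<root token>' '<TUF_VAULT_TOKEN from keyserver env>'
```

With a period set, each successful renewal should reset the token's TTL to that period, so the token stays alive only as long as something (here, presumably the keyserver daemon) keeps renewing it.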

simao commented

I still find it weird that the token just vanished; I don't know how that would happen.

keyserver-daemon should renew the token, but remember if you are running >= 0.4.0 vault is no longer supported and keys will be migrated to the database, with the caveats noted above. The token will not be renewed.

Thanks