nats-io/nsc

trying to push remove or push prune stack traces with no error message

Closed this issue · 32 comments

What version were you using?

nsc version 2.8.5

i dont have "nats-server" installed

What environment was the server running in?

Mac OSX 14.2.1

Is this defect reproducible?

I see an account after an nsc pull -A

310148761-f52036fc-8bae-4896-ade2-1496349c8c20

its there when i list them:

310148841-4fc5a4f2-12f5-4de9-ae91-3d05f7d8fbd1 (1)

however, trying to do an nsc delete followed by a push -P, OR a push -R both stacktrace:

310149281-becc8ea9-0e4a-487f-ae66-b50d7afc3dcc

310149376-fe04c97f-7c38-4bd2-8999-8bebe684f533

Given the capability you are leveraging, describe your expectation?

im trying to delete a NATs account

Given the expectation, what is the defect you are observing?

stacktraces with no error messages

@tommyjcarpenter so it looks like the operator doesn't have a resolver set - can you nsc describe operator
And also attach your resolver configuration in the server.

@aricart what are you looking for in the describe operator? im trying to not post company stuff in this public thread as much as possible

here are the fields:

Screenshot 2024-03-05 at 11 19 19

here is the redacted server config:

Data
====
nats.conf:
----
# NATS Clients Port
port: 4222

# PID file shared with configuration reloader.
pid_file: "/var/run/nats/nats.pid"

###############
#             #
# Monitoring  #
#             #
###############
http: 8222
server_name: $SERVER_NAME
#####################
#                   #
# TLS Configuration #
#                   #
#####################
tls {
    cert_file: /etc/nats-certs/clients/nats-client-tls/tls.crt
    key_file:  /etc/nats-certs/clients/nats-client-tls/tls.key
    timeout: 1
}
###################################
#                                 #
# NATS JetStream                  #
#                                 #
###################################
jetstream {
  max_mem: 1Gi
  store_dir: /data

  max_file:25Gi
}
###################################
#                                 #
# NATS Full Mesh Clustering Setup #
#                                 #
###################################
cluster {
  port: 6222
  name: [REDACTED]
  tls {
      cert_file: /etc/nats-certs/cluster/infra-nats-node/tls.crt
      key_file:  /etc/nats-certs/cluster/infra-nats-node/tls.key
      ca_file: /etc/nats-certs/cluster/infra-nats-node/ca.crt
  name: [REDACTED]
  }
  reject_unknown_cluster: true
  tls {
      cert_file: /etc/nats-certs/gateways/infra-nats-gateway/tls.crt
      key_file:  /etc/nats-certs/gateways/infra-nats-gateway/tls.key
      ca_file: /etc/nats-certs/gateways/infra-nats-gateway/ca.crt
  }

  # Gateways array here
  gateways: [

                [REDACTED]

  ]
}
logtime: true
max_payload: 4MB
lame_duck_grace_period: 10s
lame_duck_duration: 30s
##################
#                #
# Authorization  #
#                #
##################
        operator: REDACTED
        system_account: REDACTED

      resolver: {
        type: full
        dir: "/accounts/jwt"

        allow_delete: true

        interval: "2m"
      }
  resolver_preload: REDACTED

BinaryData
====

E

I wanted your operator to:

  • have a system account
  • have an operator service URL
  • have an account service URL

Like this:

+----------------------------------------------------------------------------------------+
|                                    Operator Details                                    |
+-----------------------+----------------------------------------------------------------+
| Name                  | XXX                                                            |
| Operator ID           | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G       |
| Issuer ID             | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G       |
| Issued                | 2024-03-05 19:05:26 UTC                                        |
| Expires               |                                                                |
| Account JWT Server    | nats://localhost:4222                                          |
| Operator Service URLs | nats://localhost:4222                                          |
| System Account        | ADDGU3EFAQPX3YHAXXJEHHU36FX3VCDTWRYODF5Q52GIT45BZFZW7ZF5 / SYS |
| Require Signing Keys  | false                                                          |

Your server config looks OK

Just for a bit of sanity:

# Add the operator
/t/s [1]$ nsc add operator O
[ OK ] generated and stored operator key "OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5"
[ OK ] added operator "O"
[ OK ] When running your own nats-server, make sure they run at least version 2.2.0

# create a sys account
/t/s $ nsc add account SYS
[ OK ] generated and stored account key "AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO"
[ OK ] added account "SYS"

# set the sys account on the operator
/t/s $ nsc edit operator --service-url "nats://localhost:4222" --account-jwt-server-url "nats://localhost:4222" --system-account SYS
[ OK ] set account jwt server url to "nats://localhost:4222"
[ OK ] set system account "AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO"
[ OK ] added service url "nats://localhost:4222"
[ OK ] edited operator "O"

# generate the resolver config
/t/s $ nsc generate config --nats-resolver --config-file server.conf
[ OK ] wrote server configuration to `/tmp/s/server.conf`
Success!! - generated `/tmp/s/server.conf`

# start the NATS server
/t/s $ nats-server -c server.conf &
[26391] 2024/03/05 15:17:40.669196 [INF] Starting nats-server
[26391] 2024/03/05 15:17:40.669284 [INF]   Version:  2.11.0-dev
[26391] 2024/03/05 15:17:40.669286 [INF]   Git:      [not set]
[26391] 2024/03/05 15:17:40.669288 [INF]   Name:     NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH
[26391] 2024/03/05 15:17:40.669291 [INF]   ID:       NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH
[26391] 2024/03/05 15:17:40.669298 [INF] Using configuration file: server.conf
[26391] 2024/03/05 15:17:40.669300 [INF] Trusted Operators
[26391] 2024/03/05 15:17:40.669302 [INF]   System  : ""
[26391] 2024/03/05 15:17:40.669304 [INF]   Operator: "O"
[26391] 2024/03/05 15:17:40.669306 [INF]   Issued  : 2024-03-05 15:17:16 -0400 AST
[26391] 2024/03/05 15:17:40.669322 [INF]   Expires : Never
[26391] 2024/03/05 15:17:40.669698 [INF] Managing all jwt in exclusive directory /tmp/s/jwt
[26391] 2024/03/05 15:17:40.669909 [INF] Listening for client connections on 0.0.0.0:4222
[26391] 2024/03/05 15:17:40.670176 [INF] Server is ready

# Add an account
/t/s $ nsc add account A
[ OK ] generated and stored account key "ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I"
[ OK ] added account "A"

# Push the account
/t/s $ nsc push 
[ OK ] push to nats-server "nats://localhost:4222" using system account "SYS":
       [ OK ] push A to nats-server with nats account resolver:
              [ OK ] pushed "A" to nats-server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH: jwt updated
              [ OK ] pushed to a total of 1 nats-server
              
# Delete the account
/t/s $ nsc delete account A
[ OK ] expired account "A"
[ OK ] deleted account
[ OK ] deleted account directory

# Push with --prune
/t/s $ nsc push --prune
[26391] 2024/03/05 15:17:59.618908 [ERR] delete accounts request by OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 failed - delete must be enabled in server config
[ERR ] prune nats-server with nats account resolver:
       [ OK ] list 2 accounts from nats-server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH:
              [ OK ] account AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO named SYS exists
              [ OK ] account ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I only exists in server
       [ OK ] listed accounts from a total of 1 nats-server
       [ERR ] server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH responded with error: delete accounts request by OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 failed - delete must be enabled in server config
       [ERR ] Fewer server responded to 'prune' (0) than to 'list' (1). Accounts may not be completely pruned.
Error: all jobs failed

# Fix the server config to have `allow_delete: true` and restart the server
/t/s [1]$ vim server.conf 
/t/s $ killall nats-server
[26391] 2024/03/05 15:18:30.336226 [INF] Initiating Shutdown...
[26391] 2024/03/05 15:18:30.336315 [INF] Server Exiting..

fish: Job 1, 'nats-server -c server.conf &' has ended
/t/s $ nats-server -c server.conf &
[26422] 2024/03/05 15:18:36.367349 [INF] Starting nats-server
[26422] 2024/03/05 15:18:36.367454 [INF]   Version:  2.11.0-dev
[26422] 2024/03/05 15:18:36.367457 [INF]   Git:      [not set]
[26422] 2024/03/05 15:18:36.367458 [INF]   Name:     NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH
[26422] 2024/03/05 15:18:36.367462 [INF]   ID:       NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH
[26422] 2024/03/05 15:18:36.367470 [INF] Using configuration file: server.conf
[26422] 2024/03/05 15:18:36.367472 [INF] Trusted Operators
[26422] 2024/03/05 15:18:36.367474 [INF]   System  : ""
[26422] 2024/03/05 15:18:36.367477 [INF]   Operator: "O"
[26422] 2024/03/05 15:18:36.367479 [INF]   Issued  : 2024-03-05 15:17:16 -0400 AST
[26422] 2024/03/05 15:18:36.367496 [INF]   Expires : Never
[26422] 2024/03/05 15:18:36.367885 [INF] Managing all jwt in exclusive directory /tmp/s/jwt
[26422] 2024/03/05 15:18:36.368091 [INF] Listening for client connections on 0.0.0.0:4222
[26422] 2024/03/05 15:18:36.368279 [INF] Server is ready

# Try the prune again
/t/s $ nsc push --prune
[ OK ] prune nats-server with nats account resolver:
       [ OK ] list 2 accounts from nats-server NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH:
              [ OK ] account AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO named SYS exists
              [ OK ] account ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I only exists in server
       [ OK ] listed accounts from a total of 1 nats-server
       [ OK ] pruned nats-server NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH: deleted 1 accounts
/t/s $ 

Not sure what your commands were doing, (the --force flag is suspect)

So the stack traces you list refers to:

	okp, opPk, err := obtainRequestKey(ctx, report)
	if err != nil {
		report.AddError("Could not obtain Operator key to sign the delete request (err:%v)", err)
		return
	}
	defer okp.Wipe()

The operator key was found or you would have gotten an error. So not quite sure what that is about.

@aricart i showed the two commands that generated stacktraces

I tried (independently):
1 nsc delete account followed by nsc push -P -> failed
2 nsc push -R xxxxxx

all i want to do is delete an account from the nats system - but nsc delete account only does things locally I've gathered, because doing a pull just puts the account back.

I think nsc should obv give an error message to the user for whatever the underlying issue is here once we figure it out :)

The -- force was from this nsc output:

Error: managed accounts may require the account JWT or nkeys to cancel a service, specify the --force to override

As you see in my script it does do the delete.... Can you try what I did locally.

I wanted your operator to:

  • have a system account
  • have an operator service URL
  • have an account service URL

Like this:

+----------------------------------------------------------------------------------------+
|                                    Operator Details                                    |
+-----------------------+----------------------------------------------------------------+
| Name                  | XXX                                                            |
| Operator ID           | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G       |
| Issuer ID             | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G       |
| Issued                | 2024-03-05 19:05:26 UTC                                        |
| Expires               |                                                                |
| Account JWT Server    | nats://localhost:4222                                          |
| Operator Service URLs | nats://localhost:4222                                          |
| System Account        | ADDGU3EFAQPX3YHAXXJEHHU36FX3VCDTWRYODF5Q52GIT45BZFZW7ZF5 / SYS |
| Require Signing Keys  | false                                                          |

We do have the System account (as I showed) but seem to be missing the other two fields:

  • have an operator service URL
  • have an account service URL

How would I know what to set these to?

@aricart the plot thickens

so my boss, using the exact same operator, and without the account service urls, was able to remove an account both ways; with push -R and delete account ..; push --prune

We are on the same binary version 2.8.5

However, he is on an Apple M1 and I am on an Apple M3... I hope it's not that, but I have had issues with architecture versions like this before, so just wanted to put that out there.

A stacktrace on a different architecture vs another user in our same company working..

Perhaps you tried this already - but I want to make sure we try a new environment.
Before anything, please backup your operator and keys:
nsc env to show where things are stored and backup Current Store Dir and $NKEYS_PATH.

Next repeat my script above, but this time, I would like you to prefix all nsc invocations as:
nsc --all-dirs /tmp/test

This will create the entire tree for keys and jwts in the specified directory (outside of where you keep your actual configs and keys) ie nsc --all-dirs /tmp/test add operator O ...

After you verify the repeatability, you can set your old operator to be current by doing:
nsc env -o <your real operator name>

@aricart can you please paste the server.conf?

im following this in depth guide, which talks about copying a server.conf, but I dont see the file in the docs? https://docs.nats.io/running-a-nats-service/nats_admin/security/jwt#concepts

(I'm trying to run stuff locally now; the above thread was from our prod kubernetes deployment)

Ah - for your local test all you need to do after you create your operator/accounts etc - is nsc generate config --nats-resolver --config-file server.conf and then edit that to enable the full resolver. Set the allow_delete to true, the default template sets it to false and that will prevent deletes from happening - I believe that is my above script that I put above yesterday.

you can then start the server with nats-server -c server.conf

@aricart I was able to run all of your commands using the local nats-server, so it's something to do with the remote.

@aricart alright - two more pieces of info!

I was able to do all of these steps in STAGE. We have a staging nats supercluster. The delete worked there.

However, I can always stacktrace on PROD, here is a fresh test:

Screenshot 2024-03-06 at 16 33 33

looks like its dying here:

nsc/cmd/push.go

Line 449 in f85538f

func sendDeleteRequest(ctx ActionCtx, nc *nats.Conn, timeout int, report *store.Report, deleteList []string, expectedResponses int) {

that's a pretty dense function with pointers and such, hard to tell exactly what's happening, I'm new to this codebase

OK I see some doggy code there - there's an error that is added to the report, but the process doesn't stop.
Anyway I can get you to build nsc from a branch so that we can try to narrow this?

Two more questions:

  • have you looked to see if the nats-server is logging any kind of error?
  • does the account you are trying to delete have clients connected? (verifying if I see some issue with this)

Also - were you able to run two local tests with the different M3 binary that was failing - I am assuming that you did it on the stage and that worked, but on prod it failed - Also make sure that the operator keys in the production environment - the request to prune requires them.

You can verify that the keys are available by running nsc list keys :

 nsc list keys
+------------------------------------------------------------------------------------------+
|                                           Keys                                           |
+--------+----------------------------------------------------------+-------------+--------+
| Entity | Key                                                      | Signing Key | Stored |
+--------+----------------------------------------------------------+-------------+--------+
| O      | OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 |             | *      |
|  A     | AAYIW3IRJQCEZEKHAW3SU6CZGOJGUJELDJYIJEDLQYCMU67PJ377NO7K |             | *      |
+--------+----------------------------------------------------------+-------------+--------+
[ T ] auth callout encryption target

The stored column should be checked.

@aricart yes if you provide a branch I can build and do like a go run nsc/ ?

I was able to use the exact same M3 binary for your local tests, and against our env -O stage operator. It's clearly something to do with the env setup for prod on my machine (my boss was able to run it), some uncaught error message.

I am betting one dollar that the operator keys are not all there, and it is failing to find one.

I found the issue - let me do the fix

@tommyjcarpenter If you want to try the branch for the fix, that would be awesome. - Note that you'll have to specify the -K with an operator key - in your case I believe the key was not found (possibly it was ignoring other keys) - didn't see a confirmation on the list keys

I released a new nsc, you should be able to pull it with nsc update

@tommyjcarpenter thank you for reporting this - if the release doesn't fix it for you, please reopen this issue

Thanks I’ll try it out!

Is it available through homebrew? That’s how I install/update nsc

yeap - the update command will also do the right thing even if you installed with homebrew...

@aricart it actually works with the latest... I didn't get an error message about missing an operator key..! It worked.

What was the change (and should it have worked?). Thanks much for your attention

The code had a couple of edge cases. But the one that affected you was that your operator had more than one signing key but the first listed were not there. Go has a feature where if something returns an interface the value can be nil, but the interface isn't. So checks for nil fail since the interface is not nil and then usage of that will panic because the value is.