vmware/cloud-provider-for-cloud-director

Duplicate Authorization headers cause requests to fail when vCD is behind L7 load balancers

bingman-ux opened this issue · 24 comments

Describe the bug

Hi I have a AVI load balancer in front of the VCD cells and I am getting a strange issue with it

I0104 17:46:07.745301 1 auth.go:49] Using VCD OpenAPI version [36.0]
I0104 17:46:08.168096 1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element


closed by (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Content-Length:[173] Content-Type:[text/html]] Body:0xc0005e59c0 ContentLength:173 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc0007cec00 TLS:0xc000863130})]]]]

The LB is instantly saying the requests are malformed and returning 400 error code. I can see its attempting to access the /api/versions endpoint. If I get a bash shell in the container I can curl this endpoint from within the container without issue.

If I bypass the AVI load balancer and let the url go directly to a VCD cell it works fine. Wondering if testing with this has only been directly to cells and the webserver on the cell is being a bit more lax with standards.

VCD 10.3.1
AVI LB 20.1.4
cloud-provider-for-cloud-director tested with 1.0.0 1.0.1 1.0.2

Reproduction steps

1. Install a cluster on VCD 10.3.1 with a AVI load balancer load balancing the VCD portal 
2. 
3.
...

Expected behavior

The provider to be able to access the API

Additional context

No response

@bingman-ux apologies for the delay.

In this case the issue seems to be access to the VCD api endpoint itself. We have not tested with VCD behind an AVI load balancer. Are you able to access the UI using that endpoint?

No problem

Access to the UI using the load balanced endpoint is fine and I have also been able to call the API on the same endpoint using pyvcloud and the go-vcloud-director sdks with my own code but the load balancer rejects the calls from the provider saying they are malformed

an0nz commented

I am receiving the same error with vcd 10.3.2 cells behind Cloudflare’s L7 Load Balancer, all other services can access the API and work fine.

CSE 3.1.1 with TKGm 1.4.1 is a recent addition being tested and I am currently stuck with this error.

I0116 20:35:00.309829 1 main.go:41] Initializing CCM [vmware-cloud-director]: [0.1.0]
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0116 20:35:00.662380 1 serving.go:319] Generated self-signed cert in-memory
W0116 20:35:01.069395 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0116 20:35:01.071462 1 controllermanager.go:117] Version: v0.0.0-master+$Format:%h$
I0116 20:35:01.071684 1 auth.go:49] Using VCD OpenAPI version [36.0]
I0116 20:35:02.425809 1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element <hr> closed by (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Cf-Ray:[-] Content-Length:[155] Content-Type:[text/html] Date:[Sun, 16 Jan 2022 20:35:02 GMT] Server:[cloudflare]] Body:0xc000079e00 ContentLength:155 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc0006e4900 TLS:0xc0005f8580})]]]]

As far as I can tell by tracing the code and matching it to the logs it seems to be coming from the vcdversion check (below) which is called in the path of auth.go > GetBearerToken > SetToken > vcdloginurl > validateAPIVersion() > vcdFetchSupportedVersions() > ...... > NewRequest

_, err := cli.ExecuteRequest(apiEndpoint.String(), http.MethodGet,

Final request looks to be executed by the below with https://vcd_fqdn/api/versions as the requrl, method as the value of http.MethodGet, body nil and apiversion 36.0

I have no idea why it is happening though as when I look through the vcd and load balancer logs the only 400 errors I see are for the first attempt at a provider login prior to the successful tenant login using the refresh token which is correct, there are no GET requests at all to /api/versions listed by Cloudflare which is what I would be expecting, the original poster can see this in AVI though so its likely Cloudflare is just not providing me a log.

Example logs from vcd cells (Only these calls are ever made), same logs on the load balancer.
x.x.x.x - - [16/Jan/2022:20:34:59 +0000] "POST https://vcd.domain.com/oauth/provider/token HTTP/1.1" 400 171 "-" "Go-http-client/1.1" 132
x.x.x.x - - [16/Jan/2022:20:34:59 +0000] "POST https://vcd.domain.com/oauth/tenant/TestOrg/token HTTP/1.1" 200 564 "-" "Go-http-client/1.1" 1105
x.x.x.x - - [16/Jan/2022:20:35:10 +0000] "POST https://vcd.domain.com/oauth/provider/token HTTP/1.1" 400 171 "-" "Go-http-client/1.1" 115
x.x.x.x - - [16/Jan/2022:20:35:11 +0000] "POST https://vcd.domain.com/oauth/tenant/TestOrg/token HTTP/1.1" 200 568 "-" "Go-http-client/1.1" 363
x.x.x.x - - [16/Jan/2022:20:35:21 +0000] "POST https://vcd.domain.com/oauth/provider/token HTTP/1.1" 400 171 "-" "Go-http-client/1.1" 122
x.x.x.x - - [16/Jan/2022:20:35:21 +0000] "POST https://vcd.domain.com/oauth/tenant/TestOrg/token HTTP/1.1" 200 566 "-" "Go-http-client/1.1" 421

I hope this additional information helps identify the root cause and a fix.

an0nz commented

I added the GOVCD_SHOW_REQ and GOVCD_SHOW_RESP environment variables to get some more detail to help identify the point of failure.

Extended logs

Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0117 07:03:07.817805       1 serving.go:319] Generated self-signed cert in-memory
W0117 07:03:08.330725       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0117 07:03:08.333182       1 controllermanager.go:117] Version: v0.0.0-master+$Format:%h$
I0117 07:03:08.333479       1 auth.go:49] Using VCD OpenAPI version [36.0]
** REQUEST **
time:    2022-01-17T07:03:09.037Z
method:  GET
host:    vcd.domain.com
length:  0
URL:     https://vcd.domain.com/api/versions
header:  [
	Authorization => [********]
	Accept => [application/*+xml;version=36.0]
	X-Vmware-Vcloud-Token-Type => [Bearer]
	User-Agent => [go-vcloud-director]
]

payload: 

## RESPONSE ##
time:   2022-01-17T07:03:09.048Z
status: 400 - 400 Bad Request 
length: 155
header: map[Cf-Ray:[-] Content-Length:[155] Content-Type:[text/html] Date:[Mon, 17 Jan 2022 07:03:09 GMT] Server:[cloudflare]]
body:   <html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<hr><center>cloudflare</center>
</body>
</html>


I0117 07:03:09.048936       1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element \<hr> closed by </body> (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Cf-Ray:[-] Content-Length:[155] Content-Type:[text/html] Date:[Mon, 17 Jan 2022 07:03:09 GMT] Server:[cloudflare]] Body:0xc000291860 ContentLength:155 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc000032500 TLS:0xc0006208f0})]]]]

Using this additional data for the get request I ran the same request using curl in the container without the Authorization header and it worked (Below curl works no problem in the container, I am not sure what the code is doing wrong, maybe it is related to the payload or the host header

curl --location --request GET 'https://vcd.domain.com/api/versions' \
--header 'X-Vmware-Vcloud-Token-Type: Bearer' \
--header 'Accept: application/*+xml;version=36.0' \
--header 'User-Agent: go-vcloud-director' \
--header 'Host: vcd.domain.com' \
--data-raw ''

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions" xmlns:ns2="http://www.vmware.com/vcloud/v1.5" xmlns:vmext="http://www.vmware.com/vcloud/extension/v1.5" xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1" xmlns:vssd="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_VirtualSystemSettingData" xmlns:common="http://schemas.dmtf.org/wbem/wscim/1/common" xmlns:rasd="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_ResourceAllocationSettingData" xmlns:vmw="http://www.vmware.com/schema/ovf" xmlns:ovfenv="http://schemas.dmtf.org/ovf/environment/1">
    <VersionInfo deprecated="true">
        <Version>31.0</Version>
        <LoginUrl>https://vcd.domain.com/api/sessions</LoginUrl>
    </VersionInfo>
    <VersionInfo deprecated="true">
        <Version>32.0</Version>
        <LoginUrl>https://vcd.domain.com/api/sessions</LoginUrl>
    </VersionInfo>
......

I edited the deployment and set a hostalias for the load balancers FQDN to point directly to a vcd cell and everything stands up as expected, definitely something strange in how the requests are sent causing this issue.

Hi,

I have a nsx-t lb and I am getting the same issue.
I Turned on Debug mode in LB and I can see the following issue to '/api/versions' request:

2022/01/19 15:00:16 [info] 1247#0: *32731028 client sent duplicate header line: "Authorization: bearer Bearer eyJhbGciO...

an0nz commented

Hey @rogerpadilhasouza I am not sure exactly where this issue lies as the same issue exists for both cloud-provider-for-cloud-director and cloud-director-named-disk-csi-driver which use go-vcloud-director for executing requests.

There is an issue here tracking it vmware/go-vcloud-director#425 which looks to be caused by duplicate Authorization bearer headers which could be an issue with go-vcd or with this project and csi.

Follow the same message that i send to vmware/go-vcloud-director#425 and i belive that the problem is a duplicate authentication header.

In the NSX/AVI logs I get the error:

022/01/19 15:00:16 [info] 1247#0: *32731028 client sent duplicate header line: "Authorization: bearer Bearer (...)"

What we infer here in the lab is that the request is arriving with a duplicate header. That is, two authentication headers. This must be checked and dealt with, we were able to simulate the problem by sending a curl with the same duplicated field.
What it seems to me is that when we send it directly to the VCD this duplication is not verified, because it is not necessary to have a token to search for the versions. Now when you have an NSX/AVI in between, a more complete parsing is performed on the message and this problem is identified. Among the possible solutions is the correction of the sent fields.

When we change the load balancer from L7 to L4 the process is fine. The problem is L7, that check headers.

In my opnion is necessary check that the headers are sent correctly. The state machine is not correct, is not necessary send authentication to get versions for example. And the code send duplicate.

@Anirudh9794 could you take a look at this and the linked issue in go-vcloud-director

Hello,
thank you @an0nz @mrmassis @rogerpadilhasouza and @bingman-ux for bringing this to our attention and sorry for the delayed response.

I looked over the code in CPI, CSI and GoVcd. The issue here is that Authorization header is being set twice by the code - one with Bearer xyz and another with bearer Bearer xyz. For some reason in our testbeds we are not seeing this issue. I am assuming because VCD is silently ignoring the second value (bearer Bearer xyz).
The go vcd version used in 1.0.0 and 1.0.x branches didn't support refresh tokens and hence in our implementation, we had to set the header manually in the client after getting access token from the refresh token.

Since govcd now has support for refresh tokens, I tried to change the code to use govcd to log in using govcd and it seems to be working. But for some of the Api calls, the library is wrongly setting both X-Vmware-Vcloud-Access-Token header and Authorization header. So we will need to evaluate if this causes any issues when VCD is behind AVI or other load balancers.

I will keep the issue updated.

Hello @an0nz @mrmassis @rogerpadilhasouza @bingman-ux
The following PR removes the duplicate header issue in CPI: #47
Since we don't have the infrastructure (VCD behind Avi) to thoroughly test this change, Could you please help us test this ?
I have verified that the above change doesn't set duplicate headers but we want to make sure we are not missing any other edge cases.

Hello @an0nz @mrmassis @rogerpadilhasouza @bingman-ux
Im just following up again if you can help us test the fix for duplicate header values.

Thanks,
Aniruddha

an0nz commented

Hello @Anirudh9794,

What is the easiest way to build/pull an image to test this?
I have just had to stand up a test cluster for some other work so can test this out fairly easily.

I second this. I can spin up a cluster easily to test it.
I tried to build an image in docker but it looks like the make script pulls on image repos that are not public.

Hello @an0nz @bingman-ux ,
Thanks for helping us test the fix for this issue.
I have pushed out the image docker.io/aniruddheu52919/cloud-provider-for-cloud-director:1.1.0.latest which should be accessible to you.
Please let me know if you have any issues accessing the image or if the fix doesn't work.

an0nz commented

Hello @Anirudh9794 and @bingman-ux,

I managed to get a build going with a few minor tweaks to the Dockerfile and a directory name that was capitalized causing the go build to fail on a Linux server (not sure if you want to raise a bug for that).

Updated Dockerfile uses photon from DockerHub FROM photon:4.0

Capitalization for the OpenAPIv2 folder causes a build failure

go build -ldflags "-X github.com/vmware/cloud-provider-for-cloud-director/version.Version=main-branch" -o /build/vcloud/cloud-provider-for-cloud-director cmd/ccm/main.go
vendor/k8s.io/client-go/discovery/discovery_client.go:31:2: cannot find package "." in:
        /go/src/github.com/vmware/cloud-provider-for-cloud-director/vendor/github.com/googleapis/gnostic/openapiv2
make: *** [Makefile:12: build-within-docker] Error 1

I renamed the OpenAPIv2 folder to openapiv2 and the build went through.

My build is on a public dockerhub registry at dualitynz/cloud-provider-for-cloud-director:main-branch.latest

Your comment just came through @Anirudh9794, I haven't tried with your build yet.
My build was able to successfully connect to cloud director if the k8s cluster was deployed using an Org Admin account.

If the cluster was deployed using a System Admin the Refresh Token was unable to authenticate and receives the following error

I0209 21:51:18.332027       1 auth.go:45] Using VCD OpenAPI version [36.0]
I0209 21:51:18.515726       1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from secrets: [failed to set authorization header: [minorErrorCode: invalid_grant -  message: Invalid refresh token -  error: invalid_grant -  error_description: Invalid refresh token -  : 400 Bad Request]]]]
I0209 21:51:28.516346       1 cloudconfig.go:137] Using non-empty refresh token.

Hello @an0nz,
Yes, I think there is an issue with system admin refresh token. I am taking a look at it.

Thanks for testing this out!

an0nz commented

Hello @Anirudh9794,

I have tested your build now too and it has the same issue with authentication when the cluster is deployed using a Sys Admin account as the token auth fails.

Hello @an0nz ,
For the time being if sysadmin account is needed, could you please specify username in the secrets (vcloud-basic-auth) as well?
Base64 encoding of username in the format org/user (eg: system/administrator) should be used as the value for username.
I was able to replicate the issue with sysadmin and specifying the username fixed this. Please let me know if you run into any issues after specifying the username.

Will follow up with the issue about system admin token.

an0nz commented

Hello @Anirudh9794,

The sysadmin issue isn't a big one for us as usually it is deployed by an org admin, more of an issue to be resolved prior to the next official release, if we need to for now we can use the user/pass combo. Key thing is the duplicate header issue looks to be solved.

Do you know if this same issue in cloud-director-named-disk-csi-driver has been resolved with your changes?

Hello @[Anirudh9794]

Can confirm the image you put up works fine when the cluster is deployed by an Org admin going through an AVI LB. Its just the cloud-director-named-disk-csi-driver thats having issues now

Hi,
im my setup, with org admin differ from admin, the modification was sucessfull. But, was necesary insert an entry in clusterrole to enable handle "lease" crd.

Hey @an0nz @bingman-ux
The same set of changes was made for CSI as well. So the issue for refresh tokens should be fixed for tenant users in CPI and CSI. There has not been a release yet but the main branch has the fixes.

Hello @mrmassis ,
I believe you are facing a different issue?
Please check #48