bmc-toolbox/bmclib

Redfish provider broken

joelrebel opened this issue · 4 comments

The recent timeout changes in OpenConnectionFromInterfaces break the Redfish provider.

This is because the context passed through bmclib's Open() method is handed to the Redfish client, which holds it in its APIClient struct. That context is canceled when OpenConnectionFromInterfaces returns, so any further calls on the client return a context canceled error.

The option here seems to be:

  • OpenConnectionFromInterfaces returns the context cancel funcs for successful providers, which are kept in the bmclib client and canceled when the caller invokes Close(); a rough sketch of this follows.
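
A rough sketch of that option, using hypothetical type and field names (connectionProvider.cancel, Client.providers) that are not part of the current bmclib API:

package bmclibsketch

import (
	"context"
	"errors"
)

// connectionProvider is a hypothetical shape: the cancel func for the
// connection context is kept alongside the provider's closer.
type connectionProvider struct {
	name   string
	closer interface{ Close(context.Context) error }
	cancel context.CancelFunc // returned by OpenConnectionFromInterfaces per successful provider
}

// Client keeps the per-provider cancel funcs so the connection contexts
// outlive Open() and are only released when the caller is done.
type Client struct {
	providers []connectionProvider
}

// Close logs out of each provider first, then cancels its connection context,
// so the Redfish session logout is not cut short by an early cancel.
func (c *Client) Close(ctx context.Context) error {
	var errs []error
	for _, p := range c.providers {
		if p.closer != nil {
			if err := p.closer.Close(ctx); err != nil {
				errs = append(errs, err)
			}
		}
		if p.cancel != nil {
			p.cancel()
		}
	}
	return errors.Join(errs...)
}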

@jacobweinstock let me know if you have any other/better ideas.

To reproduce the issue, test with this modified inventory example

package main

import (
	"context"
	"crypto/x509"
	"encoding/json"
	"flag"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"

	bmclib "github.com/bmc-toolbox/bmclib/v2"
	"github.com/bombsimon/logrusr/v2"
	"github.com/sirupsen/logrus"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	user := flag.String("user", "", "Username to login with")
	pass := flag.String("password", "", "Password to login with")
	host := flag.String("host", "", "BMC hostname to connect to")
	port := flag.Int("port", 443, "BMC port to connect to")
	withSecureTLS := flag.Bool("secure-tls", false, "Enable secure TLS")
	certPoolFile := flag.String("cert-pool", "", "Path to a file containing x509 CAs. An empty string uses the system CAs. Only takes effect when --secure-tls=true")
	flag.Parse()

	l := logrus.New()
	l.Level = logrus.DebugLevel
	logger := logrusr.New(l)

	clientOpts := []bmclib.Option{bmclib.WithLogger(logger)}

	if *withSecureTLS {
		var pool *x509.CertPool
		if *certPoolFile != "" {
			pool = x509.NewCertPool()
			data, err := os.ReadFile(*certPoolFile)
			if err != nil {
				l.Fatal(err)
			}
			pool.AppendCertsFromPEM(data)
		}
		// a nil pool uses the system certs
		clientOpts = append(clientOpts, bmclib.WithSecureTLS(pool))
	}

	cl := bmclib.NewClient(*host, strconv.Itoa(*port), *user, *pass, clientOpts...)

	cl.Registry.Drivers = cl.Registry.Using("redfish")

	err := cl.Open(ctx)
	if err != nil {
		log.Fatal(err, "bmc login failed")
	}

	defer cl.Close(ctx)

	inv, err := cl.Inventory(ctx)
	if err != nil {
		l.Error(err)
	}

	b, err := json.MarshalIndent(inv, "", "  ")
	if err != nil {
		l.Error(err)
	}

	fmt.Println(string(b))
}

Hey @joelrebel, thanks for bringing this up. I believe the context handling in the gofish library is not abstracted in an appropriate way. All calls have to use the same context, which can only be set once with gofish.ConnectContext(). Subsequent calls are bound to that context with no ability to modify it (the context field in the APIClient is not exported and has no method to replace it). gofish/Redfish uses HTTP sessions, which are a server-side construct, but the client-side timeout and context.Context are coupled too rigidly. Long term, I think we should pursue some kind of upstream gofish resolution.
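
A minimal sketch of that failure mode, assuming a hypothetical BMC endpoint and credentials, just to illustrate how the stored context bites later calls:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/stmcginnis/gofish"
)

func main() {
	// Context used only for the login step (the Open()/connection phase).
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

	config := gofish.ClientConfig{
		Endpoint: "https://bmc.example.com", // hypothetical BMC
		Username: "admin",
		Password: "password",
		Insecure: true,
	}

	// ConnectContext stores ctx inside the returned APIClient; there is no
	// exported field or method to swap it out later.
	client, err := gofish.ConnectContext(ctx, config)
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	defer client.Logout()

	// Simulates OpenConnectionFromInterfaces returning and canceling its context.
	cancel()

	// Every subsequent call reuses the stored, now-canceled context and fails
	// with "context canceled".
	if _, err := client.Service.Systems(); err != nil {
		fmt.Println("inventory call failed:", err)
	}
}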

There are two distinct timeouts that need to be handled when making HTTP calls in Go. The first is how long a single call should last; the second is how long a function should last. These are not the same. bmclib is modeled more closely on database/query interactions in Go: with databases you typically open a connection, do some queries, then close the connection. For clarity and context, in my opinion the HTTP session handling should be explicitly aware of, and explicitly handle, any client-side timeouts.
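
For illustration only, a small net/http sketch of the two timeouts side by side, using a hypothetical Redfish endpoint:

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Timeout 1: how long any single HTTP call may last, set on the client.
	httpClient := &http.Client{Timeout: 10 * time.Second}

	// Timeout 2: how long the whole open/query/close sequence may last,
	// carried in the caller's context.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// A single request is bounded by whichever of the two expires first.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"https://bmc.example.com/redfish/v1/", nil) // hypothetical endpoint
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}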

This brings up some context considerations in our closeConnection that I think we need to address too. We don't want a canceled context to block the close from being called. This matters especially for Redfish: a skipped close leaves sessions open, and if enough calls are made without Close being called, the sessions get orphaned on the server, the max sessions limit is hit, and no more calls or logins (even from the web UI) are allowed.

With all this in mind, what do you think of something like this below? We wouldn't need to pass around cancel contexts.

diff --git a/internal/redfishwrapper/client.go b/internal/redfishwrapper/client.go
index f5e8fb2..508086e 100644
--- a/internal/redfishwrapper/client.go
+++ b/internal/redfishwrapper/client.go
@@ -7,6 +7,7 @@ import (
        "net/http"
        "os"
        "strings"
+       "time"
 
        bmclibErrs "github.com/bmc-toolbox/bmclib/v2/errors"
        "github.com/bmc-toolbox/bmclib/v2/internal/httpclient"
@@ -102,12 +103,24 @@ func (c *Client) Open(ctx context.Context) error {
                config.DumpWriter = os.Stdout
        }
 
+       if tm := getTimeout(ctx); tm != 0 {
+               config.HTTPClient.Timeout = tm
+       }
        var err error
-       c.client, err = gofish.ConnectContext(ctx, config)
+       c.client, err = gofish.Connect(config)
 
        return err
 }
 
+func getTimeout(ctx context.Context) time.Duration {
+       deadline, ok := ctx.Deadline()
+       if !ok {
+               return 0
+       }
+
+       return time.Until(deadline)
+}
+
 // Close closes the redfish session.
 func (c *Client) Close(ctx context.Context) error {
        if c == nil || c.client.Service == nil {

The change below would address the context handling in our close calls. This one feels more like a short-term solution, so maybe it's just a start:

diff --git a/bmc/connection.go b/bmc/connection.go
index 47c0eb5..a90391f 100644
--- a/bmc/connection.go
+++ b/bmc/connection.go
@@ -76,21 +76,18 @@ func closeConnection(ctx context.Context, c []connectionProviders) (metadata Met
                if elem.closer == nil {
                        continue
                }
-               select {
-               case <-ctx.Done():
-                       err = multierror.Append(err, ctx.Err())
 
-                       return metadata, err
-               default:
-                       metadataLocal.ProvidersAttempted = append(metadataLocal.ProvidersAttempted, elem.name)
-                       closeErr := elem.closer.Close(ctx)
-                       if closeErr != nil {
-                               err = multierror.Append(err, errors.WithMessagef(closeErr, "provider: %v", elem.name))
-                               continue
-                       }
-                       connClosed = true
-                       metadataLocal.SuccessfulCloseConns = append(metadataLocal.SuccessfulCloseConns, elem.name)
+               if err = ctx.Err(); err != nil {
+                       ctx, _ = context.WithTimeout(context.Background(), 30*time.Second)
+               }
+               metadataLocal.ProvidersAttempted = append(metadataLocal.ProvidersAttempted, elem.name)
+               closeErr := elem.closer.Close(ctx)
+               if closeErr != nil {
+                       err = multierror.Append(err, errors.WithMessagef(closeErr, "provider: %v", elem.name))
+                       continue
                }
+               connClosed = true
+               metadataLocal.SuccessfulCloseConns = append(metadataLocal.SuccessfulCloseConns, elem.name)
        }
        if connClosed {
                return metadataLocal, nil

Hey @jacobweinstock, thanks for looking into the issue and for the notes above.

This brings up some context considerations in our closeConnection that I think we need to address too. We don't want a canceled context to block the close from being called. This matters especially for Redfish: a skipped close leaves sessions open, and if enough calls are made without Close being called, the sessions get orphaned on the server, the max sessions limit is hit, and no more calls or logins (even from the web UI) are allowed.

This is what I realised when testing the latest changes: the Redfish sessions were left open since Close() never got a chance to run completely. Thanks for addressing that problem here.

With all this in mind, what do you think of something like this below? We wouldn't need to pass around cancel contexts.

The patch makes sense, and setting the timeout in the HTTP client seems like the better fix. Would you mind turning it into a PR?

Tested the proposed fix here on some Dells with Redfish and it works fine 👍

Thanks @joelrebel, opened #316 to address.