vertica/vertica-sql-go

Possible issue with wrong rows returned from current stmt results

grzm opened this issue · 13 comments

grzm commented

I'm seeing an issue that looks like the driver mismatching prepared statements and result sets when running concurrently. The Prometheus free/sql_exporter runs the queries for each "scrape" concurrently (in goroutines), and the rows returned by stmt.QueryContext(ctx) are not always the rows associated with the prepared statement.

This is all with the patch in #21 applied.

Here's the function that executes each query (from the un-mutilated release version of sql_exporter), with some debug logging and comments added by me:

// run executes the query on the provided database, in the provided context.
func (q *Query) run(ctx context.Context, conn *sql.DB) (*sql.Rows, errors.WithContext) {
	// `q` is a struct which contains the query to be executed at q.config.Query
	r, _ := regexp.Compile("\\s+")
	queryString := r.ReplaceAllString(q.config.Query, " ")
	if q.conn != nil && q.conn != conn {
		panic(fmt.Sprintf("[%s] Expecting to always run on the same database handle", q.logContext))
	}

	if q.stmt == nil {
		stmt, err := conn.PrepareContext(ctx, q.config.Query)
		if err != nil {
			return nil, errors.Wrapf(q.logContext, err, "prepare query failed")
		}
		q.conn = conn
		q.stmt = stmt
	}
	// check the query associated with the prepared statement q.stmt (via reflection)
	stmtValue := reflect.ValueOf(q.stmt)
	stmtQueryString := r.ReplaceAllString(reflect.Indirect(stmtValue).FieldByName("query").String(), " ")
	rows, err := q.stmt.QueryContext(ctx)
	var columns []string
	if rows != nil { // rows is nil when QueryContext returns an error
		columns, _ = rows.Columns()
	}
	log.Infof("run query [%s] stmt.query same? %t columns %q", queryString, (stmtQueryString == queryString), columns)
	return rows, errors.Wrap(q.logContext, err)
}

Here's an annotated sample of the associated logging output for an exporter that issues 5 queries for each scrape.

///////
// first scrape after launching sql_exporter
///////

// XXX columns don't match query
I0821 17:06:34.977006   22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]

// XXX columns missing
I0821 17:06:35.071003   22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns []

// OK
I0821 17:06:35.244324   22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]

// OK
I0821 17:06:35.247949   22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]

// OK
I0821 17:06:35.261840   22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]

///////
// Second scrape
///////

// XXX columns missing
I0821 17:06:41.483174   22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns []

// OK
I0821 17:06:41.589738   22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]

// OK
I0821 17:06:41.595501   22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["current_epoch" "ahm_epoch" "last_good_epoch" "ahm_epoch_lag" "designed_fault_tolerance" "current_fault_tolerance" "wos_used_bytes" "ros_used_bytes"]

// OK
I0821 17:06:41.672825   22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]

// OK
I0821 17:06:41.791733   22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]

///////
// third scrape
///////

// OK
I0821 17:06:53.111116   22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]

// OK
I0821 17:06:53.113658   22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]

// OK
I0821 17:06:53.210031   22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]

// XXX mismatched columns
I0821 17:06:53.222417   22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]

// OK
I0821 17:06:53.305371   22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]

Note that the errors aren't consistent between scrapes.

I haven't been able to produce a test case isolated to just the Vertica driver, though I don't know whether that's due to the bug actually being in the sql_exporter code (which I think is unlikely, as they would probably have come across it before now, and the code that runs the queries looks pretty straightforward), my limited Go experience, or the inherent difficulty of reproducing concurrency bugs. I'd rather not open an issue without an isolated test case, but at this point I'm not sure how to build one.

Any insight or suggestions, including pointers on making a test case, would be appreciated.
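
For reference, here's the rough shape of the isolated test I have in mind (untested sketch; the DSN is a placeholder and the queries are trimmed-down versions of the ones above): prepare a few statements with different column sets, run them concurrently from goroutines, and check the columns of each result set against the statement that produced it.

package main

import (
	"context"
	"database/sql"
	"log"
	"reflect"
	"sync"

	_ "github.com/vertica/vertica-sql-go"
)

func main() {
	// Placeholder DSN, same form as in the sql_exporter config.
	db, err := sql.Open("vertica", "vertica://dbadmin:PASSWORD@HOST:5433/DBNAME")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(3) // mirror max_connections: 3 from the exporter config

	ctx := context.Background()

	// Statements with distinct column sets, trimmed down from the exporter queries.
	cases := map[string][]string{
		"SELECT node_name FROM v_catalog.nodes":                 {"node_name"},
		"SELECT lock_mode, lock_scope FROM v_monitor.locks":     {"lock_mode", "lock_scope"},
		"SELECT current_epoch, ahm_epoch FROM v_monitor.system": {"current_epoch", "ahm_epoch"},
	}

	stmts := make(map[string]*sql.Stmt)
	for q := range cases {
		stmt, err := db.PrepareContext(ctx, q)
		if err != nil {
			log.Fatal(err)
		}
		stmts[q] = stmt
	}

	// Mimic repeated scrapes: each round runs every prepared statement concurrently.
	for i := 0; i < 100; i++ {
		var wg sync.WaitGroup
		for q, want := range cases {
			wg.Add(1)
			go func(q string, want []string) {
				defer wg.Done()
				rows, err := stmts[q].QueryContext(ctx)
				if err != nil {
					log.Printf("query error: %v", err)
					return
				}
				defer rows.Close()
				got, _ := rows.Columns()
				if !reflect.DeepEqual(got, want) {
					log.Printf("MISMATCH: %q returned columns %v", q, got)
				}
			}(q, want)
		}
		wg.Wait()
	}
}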

grzm commented

For completeness, here's the sql_exporter.yml config file I'm using:

global:
  scrape_timeout_offset: 5000ms
  min_interval: 0s
  max_connections: 3
  max_idle_connections: 3

target:
  data_source_name: "vertica://dbadmin:PASSWORD@HOST:5433/DBNAME"
  collectors:
    - base

collectors:
  - collector_name: base
    queries:
      - query_name: vertica_epoch_status
        query: |
          SELECT current_epoch,
                 ahm_epoch,
                 last_good_epoch,
                 last_good_epoch - ahm_epoch as ahm_epoch_lag,
                 designed_fault_tolerance,
                 current_fault_tolerance,
                 wos_used_bytes,
                 ros_used_bytes
            FROM v_monitor.system

    metrics:
      - metric_name: vertica_local_time_seconds
        type: gauge
        help: Local time in seconds since Unix epoch
        values: [epoch]
        query: |
          SELECT EXTRACT(epoch FROM current_timestamp) AS epoch

      - metric_name: vertica_node_state
        type: gauge
        key_labels: [node_name, node_state]
        help: Node status of each node
        values: [count]
        query: |
          SELECT node_name, lower(node_state) as node_state, count(*)
          FROM v_catalog.nodes
          GROUP BY node_name, node_state

      - metric_name: vertica_delete_vectors
        type: gauge
        help: Vertica delete vectors
        values: [count]
        query: |
          SELECT count(*) FROM v_monitor.delete_vectors

      - metric_name: vertica_locks
        type: gauge
        help: Vertica locks
        key_labels: [mode, scope, object_name]
        values: [count]
        query: |
          SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*)
          FROM v_monitor.locks
          GROUP BY lower(lock_mode), lower(lock_scope), object_name;

      - metric_name: vertica_current_epoch
        type: counter
        help: Current Epoch
        query_ref: vertica_epoch_status
        values: [current_epoch]

      - metric_name: vertica_ahm_epoch
        type: counter
        help: Ancient History Marker Epoch
        query_ref: vertica_epoch_status
        values: [ahm_epoch]

      - metric_name: vertica_last_good_epoch
        type: counter
        help: Last Good Epoch
        query_ref: vertica_epoch_status
        values: [last_good_epoch]

      - metric_name: vertica_ahm_epoch_lag
        type: gauge
        help: Lag of AHM epoch behind Last Good Epoch
        query_ref: vertica_epoch_status
        values: [ahm_epoch_lag]

      - metric_name: vertica_designed_fault_tolerance
        type: gauge
        help: Vertica Designed Fault Tolerance
        query_ref: vertica_epoch_status
        values: [designed_fault_tolerance]

      - metric_name: vertica_current_fault_tolerance
        type: gauge
        help: Vertica Current Fault Tolerance
        query_ref: vertica_epoch_status
        values: [current_fault_tolerance]

      - metric_name: vertica_wos_used_bytes
        type: gauge
        help: WOS bytes used
        query_ref: vertica_epoch_status
        values: [wos_used_bytes]

      - metric_name: vertica_ros_used_bytes
        type: gauge
        help: ROS bytes used
        query_ref: vertica_epoch_status
        values: [ros_used_bytes]

huebnerr commented

Great research. I have an idea about what the problem might be: I suspect a lack of protection against two different threads running queries at the same time, combined with a race condition between the sibling message loops.

Having multiple cursors/portals would make this easier, but for the short term a mutex around these might solve the problem, with the side effect of "serializing" the goroutines' queries.
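
Roughly, the application-side version of that workaround would be to hold one mutex for the entire round trip, rows included, so goroutines never interleave their traffic on the connection. A minimal sketch with a hypothetical runSerialized helper (the real mutex would live inside the driver's message handling):

package workaround

import (
	"context"
	"database/sql"
	"sync"
)

// queryMu serializes every query round trip against the shared handle.
var queryMu sync.Mutex

// runSerialized is a hypothetical helper: it holds the lock until scan has
// fully consumed the rows, so no two goroutines interleave their results.
func runSerialized(ctx context.Context, stmt *sql.Stmt, scan func(*sql.Rows) error) error {
	queryMu.Lock()
	defer queryMu.Unlock()

	rows, err := stmt.QueryContext(ctx)
	if err != nil {
		return err
	}
	defer rows.Close()
	return scan(rows)
}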

If it's okay, I'll do some repro testing here and we can try the workaround first to verify the cause.

Thanks again. We'll make it to 1.0.0 yet!

grzm commented

Cheers! I appreciate any insight you can bring. As is probably pretty clear, I'm really interested in getting Prometheus monitoring for Vertica. I'm going to try looking at how lib/pq (for a comparison in Go) and the Vertica Python driver (for a comparison in Vertica) handle this. I'd still like to figure out how to write a test that reproduces the issue independently.

huebnerr commented

@grzm, I've put together a quick test in this branch. I'm having trouble repro'ing your case. Can you please try the following branch and see if it solves/helps/exacerbates the issue you've described?

https://github.com/huebnerr/vertica-sql-go/tree/mutex-experimental

grzm commented

@huebnerr Thanks! I've given it a shot locally, and the problem persists. One thing I learned is that sql_exporter has the option of using multiple connections; I've tested with both 1 and 3 connections and observed the aberrant behavior in both scenarios.

Since I haven't been able to create a test in the vertica-sql-go test suite, I've put together a self-contained demo of how I've been testing it here: https://github.com/grzm/vertica-sql-go-sql-exporter-bug-demo.

huebnerr commented

@grzm Thank you. I was hopeful this would at least be an improvement. Thanks as well for putting together the self-contained demo; I'll debug into it and give it the ol' college try!

Update: Your self-contained demo worked flawlessly. I saw the errors the first time I ran it. Nicely done. Now the hard part. Thanks again!

grzm commented

@huebnerr Glad to hear it worked out of the box!

huebnerr commented

Weird... I moved it to another box (due to the Pixelbook's unwillingness to let me debug), and for the first 15 runs or so I invariably got:

# HELP vertica_ahm_epoch Ancient History Marker Epoch
# TYPE vertica_ahm_epoch counter
vertica_ahm_epoch 16
# HELP vertica_ahm_epoch_lag Lag of AHM epoch behind Last Good Epoch
# TYPE vertica_ahm_epoch_lag gauge
vertica_ahm_epoch_lag 0
# HELP vertica_current_epoch Current Epoch
# TYPE vertica_current_epoch counter
vertica_current_epoch 17
# HELP vertica_current_fault_tolerance Vertica Current Fault Tolerance
# TYPE vertica_current_fault_tolerance gauge
vertica_current_fault_tolerance 0
# HELP vertica_designed_fault_tolerance Vertica Designed Fault Tolerance
# TYPE vertica_designed_fault_tolerance gauge
vertica_designed_fault_tolerance 0
# HELP vertica_last_good_epoch Last Good Epoch
# TYPE vertica_last_good_epoch counter
vertica_last_good_epoch 16
# HELP vertica_ros_used_bytes ROS bytes used
# TYPE vertica_ros_used_bytes gauge
vertica_ros_used_bytes 0
# HELP vertica_wos_used_bytes WOS bytes used
# TYPE vertica_wos_used_bytes gauge
vertica_wos_used_bytes 0

I then killed and restarted sql_exporter and immediately got:

No metrics gathered, 5 error(s) occurred:
* [from Gatherer #1] [, collector="base", query="vertica_local_time_seconds"] column(s) ["epoch"] missing from query result SELECT EXTRACT(epoch FROM current_timest
* [from Gatherer #1] [, collector="base", query="vertica_node_state"] column(s) ["count" "node_name" "node_state"] missing from query result SELECT node_name, lower(node_state) as n
* [from Gatherer #1] [, collector="base", query="vertica_epoch_status"] column(s) ["ahm_epoch_lag" "designed_fault_tolerance" "current_fault_tolerance" "wos_used_bytes" "ros_used_bytes" "current_epoch" "ahm_epoch" "last_good_epoch"] missing from query result SELECT current_epoch, ahm_epoch, last_go
* [from Gatherer #1] [, collector="base", query="vertica_delete_vectors"] scanning of query result failed: sql: Scan error on column index 3, name "count": converting driver.Value type  ("") to a float64: invalid syntax
* [from Gatherer #1] [, collector="base", query="vertica_locks"] column(s) ["mode" "scope" "object_name" "count"] missing from query result SELECT lower(locks.lock_mode) as mode, l

UPDATE: @grzm forcing serialization of the queries (as below) "solves" the problem. There is definitely a state issue, I'm guessing on the driver side, exacerbated by running the same queries from multiple threads at once. Investigation continues.

// Collect implements Collector.
func (c *collector) Collect(ctx context.Context, conn *sql.DB, ch chan<- Metric) {
	for _, q := range c.queries {
		q.Collect(ctx, conn, ch)
	}
	//var wg sync.WaitGroup
	//wg.Add(len(c.queries))
	//for _, q := range c.queries {
	//	go func(q *Query) {
	//		defer wg.Done()
	//		q.Collect(ctx, conn, ch)
	//	}(q)
	//}
	//// Only return once all queries have been processed
	//wg.Wait()
}

huebnerr commented

@grzm I believe I have found and fixed the issue: in some instances messages were getting dropped, which caused other messages to get out of sync. Your environment now returns reasonable values, and sql_exporter doesn't complain about mismatches anymore.

These changes have been pushed into the experimental branch you already have:
https://github.com/huebnerr/vertica-sql-go/tree/mutex-experimental

I've added a few other changes related to logging, notably file logging of the protocol (see the README.md file for changes, if interested). If everything works for you, I'll do a PR to the main repo. Thanks!

grzm commented

@huebnerr I can confirm that in the testing I've done I'm no longer seeing the issue. I am seeing some warnings (such as connection: unhandled message: Cmd Description: tag=SELECT, hasRewrite=false, rewrite='' and connection: unhandled message: BindComplete) that appear to be associated with the logging changes you mention (it looks like they come from here: https://github.com/huebnerr/vertica-sql-go/commit/33f052b032f73931bc5261c6fd18e9122bdfbdc1).

Thank you!

huebnerr commented

@grzm I saw that and will address those before the PR. I mainly wanted to make sure the issue was fixed on your side. Will resolve those other messages tonight (mostly harmless). Thanks again for all of your help tracking down the issue!

Added PR #28 to address this issue. Will close on merge.