Possible issue with wrong rows returned from current stmt results
grzm opened this issue · 13 comments
I'm seeing an issue which looks like the driver mismatching prepared statements and result sets when running concurrently. The Prometheus free/sql_exporter runs the queries for each "scrape" concurrently (in goroutines). The rows returned by stmt.QueryContext(ctx) are not always the rows associated with the prepared statement.
This is all with the patch in #21 applied.
Here's the function that executes each query, with some debug logging and comments added by me (the un-mutilated release version is in the sql_exporter source):
// run executes the query on the provided database, in the provided context.
func (q *Query) run(ctx context.Context, conn *sql.DB) (*sql.Rows, errors.WithContext) {
	// `q` is a struct which contains the query to be executed at q.config.Query
	r, _ := regexp.Compile("\\s+")
	queryString := r.ReplaceAllString(q.config.Query, " ")
	if q.conn != nil && q.conn != conn {
		panic(fmt.Sprintf("[%s] Expecting to always run on the same database handle", q.logContext))
	}
	if q.stmt == nil {
		stmt, err := conn.PrepareContext(ctx, q.config.Query)
		if err != nil {
			return nil, errors.Wrapf(q.logContext, err, "prepare query failed")
		}
		q.conn = conn
		q.stmt = stmt
	}
	// checking query associated with the prepared statement `stmt`
	stmtValue := reflect.ValueOf(q.stmt)
	stmtQueryString := r.ReplaceAllString(reflect.Indirect(stmtValue).FieldByName("query").String(), " ")
	rows, err := q.stmt.QueryContext(ctx)
	columns, _ := rows.Columns()
	log.Infof("run query [%s] stmt.query same? %t columns %q", queryString, (stmtQueryString == queryString), columns)
	return rows, errors.Wrap(q.logContext, err)
}
Here's an annotated sample of the associated logging output for an exporter that issues 5 queries for each scrape.
///////
// first scrape after launching sql_exporter
///////
// XXX columns don't match query
I0821 17:06:34.977006 22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]
// XXX columns missing
I0821 17:06:35.071003 22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns []
// OK
I0821 17:06:35.244324 22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]
// OK
I0821 17:06:35.247949 22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]
// OK
I0821 17:06:35.261840 22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]
///////
// Second scrape
///////
// XXX columns missing
I0821 17:06:41.483174 22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns []
// OK
I0821 17:06:41.589738 22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]
// OK
I0821 17:06:41.595501 22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["current_epoch" "ahm_epoch" "last_good_epoch" "ahm_epoch_lag" "designed_fault_tolerance" "current_fault_tolerance" "wos_used_bytes" "ros_used_bytes"]
// OK
I0821 17:06:41.672825 22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]
// OK
I0821 17:06:41.791733 22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]
///////
// third scrape
///////
// OK
I0821 17:06:53.111116 22312 query.go:131] run query [SELECT EXTRACT(epoch FROM current_timestamp) AS epoch ] stmt.query same? true columns ["epoch"]
// OK
I0821 17:06:53.113658 22312 query.go:131] run query [SELECT count(*) FROM v_monitor.delete_vectors ] stmt.query same? true columns ["count"]
// OK
I0821 17:06:53.210031 22312 query.go:131] run query [SELECT node_name, lower(node_state) as node_state, count(*) FROM v_catalog.nodes GROUP BY node_name, node_state ] stmt.query same? true columns ["node_name" "node_state" "count"]
// XXX mismatched columns
I0821 17:06:53.222417 22312 query.go:131] run query [SELECT current_epoch, ahm_epoch, last_good_epoch, last_good_epoch - ahm_epoch as ahm_epoch_lag, designed_fault_tolerance, current_fault_tolerance, wos_used_bytes, ros_used_bytes FROM v_monitor.system ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]
// OK
I0821 17:06:53.305371 22312 query.go:131] run query [SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*) FROM v_monitor.locks GROUP BY lower(lock_mode), lower(lock_scope), object_name; ] stmt.query same? true columns ["mode" "scope" "object_name" "count"]
Note that the errors aren't consistent between scrapes.
I haven't been able to produce a test case isolated to just the Vertica driver, though I don't know whether that's due to the bug actually being in the sql_exporter code (unlikely, I think, as they would probably have come across it before now, and the code that runs the queries looks pretty straightforward), my limited Go experience, or the inherent difficulty of reproducing concurrency bugs. I'd rather not open an issue without an isolated test case, but at this point I'm not sure how to write one.
Any insight or suggestions, including pointers on making a test case, are appreciated.
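For what it's worth, an isolated harness might look something like this sketch. The `probe` type, function names, and structure are illustrative (not sql_exporter's code); it assumes a `*sql.DB` opened through the vertica-sql-go driver, and per-probe expected column lists taken from the queries in question:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"sync"
)

// probe pairs a query with the columns its result set should contain.
type probe struct {
	sql  string
	cols []string
}

// columnsMatch reports whether got is exactly the expected column list.
func columnsMatch(want, got []string) bool {
	if len(want) != len(got) {
		return false
	}
	for i := range want {
		if want[i] != got[i] {
			return false
		}
	}
	return true
}

// reproConcurrentColumns prepares one statement per probe, then fires all of
// them concurrently for several rounds, printing any result set whose columns
// don't belong to the statement that produced it. Races are intermittent, so
// many rounds may be needed.
func reproConcurrentColumns(ctx context.Context, db *sql.DB, probes []probe, rounds int) error {
	stmts := make([]*sql.Stmt, len(probes))
	for i, p := range probes {
		s, err := db.PrepareContext(ctx, p.sql)
		if err != nil {
			return err
		}
		defer s.Close()
		stmts[i] = s
	}
	for r := 0; r < rounds; r++ {
		var wg sync.WaitGroup
		for i := range probes {
			wg.Add(1)
			go func(p probe, stmt *sql.Stmt) {
				defer wg.Done()
				rows, err := stmt.QueryContext(ctx)
				if err != nil {
					fmt.Println("query error:", err)
					return
				}
				defer rows.Close()
				got, _ := rows.Columns()
				if !columnsMatch(p.cols, got) {
					fmt.Printf("MISMATCH: query %q want %q got %q\n", p.sql, p.cols, got)
				}
			}(probes[i], stmts[i])
		}
		wg.Wait()
	}
	return nil
}
```

Running this against the five queries above for, say, 100 rounds should surface the mismatch outside of sql_exporter if the driver is at fault.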
For completeness, here's the sql_exporter.yml config file I'm using:
global:
  scrape_timeout_offset: 5000ms
  min_interval: 0s
  max_connections: 3
  max_idle_connections: 3
target:
  data_source_name: "vertica://dbadmin:PASSWORD@HOST:5433/DBNAME"
  collectors:
    - base
collectors:
  - collector_name: base
    queries:
      - query_name: vertica_epoch_status
        query: |
          SELECT current_epoch,
                 ahm_epoch,
                 last_good_epoch,
                 last_good_epoch - ahm_epoch as ahm_epoch_lag,
                 designed_fault_tolerance,
                 current_fault_tolerance,
                 wos_used_bytes,
                 ros_used_bytes
          FROM v_monitor.system
    metrics:
      - metric_name: vertica_local_time_seconds
        type: gauge
        help: Local time in seconds since Unix epoch
        values: [epoch]
        query: |
          SELECT EXTRACT(epoch FROM current_timestamp) AS epoch
      - metric_name: vertica_node_state
        type: gauge
        key_labels: [node_name, node_state]
        help: Node status of each node
        values: [count]
        query: |
          SELECT node_name, lower(node_state) as node_state, count(*)
          FROM v_catalog.nodes
          GROUP BY node_name, node_state
      - metric_name: vertica_delete_vectors
        type: gauge
        help: Vertica delete vectors
        values: [count]
        query: |
          SELECT count(*) FROM v_monitor.delete_vectors
      - metric_name: vertica_locks
        type: gauge
        help: Vertica locks
        key_labels: [mode, scope, object_name]
        values: [count]
        query: |
          SELECT lower(locks.lock_mode) as mode, lower(locks.lock_scope) as scope, object_name, count(*)
          FROM v_monitor.locks
          GROUP BY lower(lock_mode), lower(lock_scope), object_name;
      - metric_name: vertica_current_epoch
        type: counter
        help: Current Epoch
        query_ref: vertica_epoch_status
        values: [current_epoch]
      - metric_name: vertica_ahm_epoch
        type: counter
        help: Ancient History Marker Epoch
        query_ref: vertica_epoch_status
        values: [ahm_epoch]
      - metric_name: vertica_last_good_epoch
        type: counter
        help: Last Good Epoch
        query_ref: vertica_epoch_status
        values: [last_good_epoch]
      - metric_name: vertica_ahm_epoch_lag
        type: gauge
        help: Lag of AHM epoch behind Last Good Epoch
        query_ref: vertica_epoch_status
        values: [ahm_epoch_lag]
      - metric_name: vertica_designed_fault_tolerance
        type: gauge
        help: Vertica Designed Fault Tolerance
        query_ref: vertica_epoch_status
        values: [designed_fault_tolerance]
      - metric_name: vertica_current_fault_tolerance
        type: gauge
        help: Vertica Current Fault Tolerance
        query_ref: vertica_epoch_status
        values: [current_fault_tolerance]
      - metric_name: vertica_wos_used_bytes
        type: gauge
        help: WOS bytes used
        query_ref: vertica_epoch_status
        values: [wos_used_bytes]
      - metric_name: vertica_ros_used_bytes
        type: gauge
        help: ROS bytes used
        query_ref: vertica_epoch_status
        values: [ros_used_bytes]
Great research. I have an idea about what the problem might be. I suspect this is due to a lack of protection against two different threads running queries at the same time, plus a race condition between the sibling message loops.
Having multiple cursors/portals would make this easier, but for the short term a mutex around these might solve the problem, with the side effect of "serializing" the goroutines' queries.
If it's okay, I'll do some repro testing here and we can try the workaround first to verify the cause.
Thanks again. We'll make it to 1.0.0 yet!
Cheers! I appreciate any insight you can bring. As is probably pretty clear, I'm really interested in getting Prometheus monitoring for Vertica. I'm probably going to look at how lib/pq (for a comparison in Go) and the Vertica Python driver (for a comparison in Vertica) handle this issue. I'd still like to figure out how to write a test that reproduces the issue independently.
@grzm, I've put together a quick test in this branch. I'm having trouble repro'ing your case. Can you please try the following branch and see if it solves/helps/exacerbates the issue you've described?
https://github.com/huebnerr/vertica-sql-go/tree/mutex-experimental
@huebnerr Thanks! I've given it a shot locally, and the problem persists. Something I learned is that the sql_exporter has the option of using multiple connections. I've tested it with both 1 and 3 connections, and observed the aberrant behavior in both scenarios.
Given that I haven't been able to create a test in the vertica-sql-go test suite, I've put together a self-contained demo of how I've been testing it here: https://github.com/grzm/vertica-sql-go-sql-exporter-bug-demo.
@grzm Thank you. I was hopeful that this would at least give an improvement. Thank you for putting together the self-contained demo. I'll debug into it and give it the ol' college try!
Update: Your self-contained demo worked flawlessly. I saw the errors the first time I ran it. Nicely done. Now the hard part. Thanks again!
Weird... I moved it to another box (due to the Pixelbook's unwillingness to let me debug) and for the first 15 runs or so, I invariably got:
# HELP vertica_ahm_epoch Ancient History Marker Epoch
# TYPE vertica_ahm_epoch counter
vertica_ahm_epoch 16
# HELP vertica_ahm_epoch_lag Lag of AHM epoch behind Last Good Epoch
# TYPE vertica_ahm_epoch_lag gauge
vertica_ahm_epoch_lag 0
# HELP vertica_current_epoch Current Epoch
# TYPE vertica_current_epoch counter
vertica_current_epoch 17
# HELP vertica_current_fault_tolerance Vertica Current Fault Tolerance
# TYPE vertica_current_fault_tolerance gauge
vertica_current_fault_tolerance 0
# HELP vertica_designed_fault_tolerance Vertica Designed Fault Tolerance
# TYPE vertica_designed_fault_tolerance gauge
vertica_designed_fault_tolerance 0
# HELP vertica_last_good_epoch Last Good Epoch
# TYPE vertica_last_good_epoch counter
vertica_last_good_epoch 16
# HELP vertica_ros_used_bytes ROS bytes used
# TYPE vertica_ros_used_bytes gauge
vertica_ros_used_bytes 0
# HELP vertica_wos_used_bytes WOS bytes used
# TYPE vertica_wos_used_bytes gauge
vertica_wos_used_bytes 0
I then killed and restarted sql_exporter and immediately got:
No metrics gathered, 5 error(s) occurred:
* [from Gatherer #1] [, collector="base", query="vertica_local_time_seconds"] column(s) ["epoch"] missing from query result SELECT EXTRACT(epoch FROM current_timest
* [from Gatherer #1] [, collector="base", query="vertica_node_state"] column(s) ["count" "node_name" "node_state"] missing from query result SELECT node_name, lower(node_state) as n
* [from Gatherer #1] [, collector="base", query="vertica_epoch_status"] column(s) ["ahm_epoch_lag" "designed_fault_tolerance" "current_fault_tolerance" "wos_used_bytes" "ros_used_bytes" "current_epoch" "ahm_epoch" "last_good_epoch"] missing from query result SELECT current_epoch, ahm_epoch, last_go
* [from Gatherer #1] [, collector="base", query="vertica_delete_vectors"] scanning of query result failed: sql: Scan error on column index 3, name "count": converting driver.Value type ("") to a float64: invalid syntax
* [from Gatherer #1] [, collector="base", query="vertica_locks"] column(s) ["mode" "scope" "object_name" "count"] missing from query result SELECT lower(locks.lock_mode) as mode, l
UPDATE: @grzm forcing serialization of the queries (as below) "solves" the problem. There is definitely a state issue, I'm guessing on the driver side, exacerbated by having queries issued from multiple threads at once. Investigation continues.
// Collect implements Collector.
func (c *collector) Collect(ctx context.Context, conn *sql.DB, ch chan<- Metric) {
	for _, q := range c.queries {
		q.Collect(ctx, conn, ch)
	}
	//var wg sync.WaitGroup
	//wg.Add(len(c.queries))
	//for _, q := range c.queries {
	//	go func(q *Query) {
	//		defer wg.Done()
	//		q.Collect(ctx, conn, ch)
	//	}(q)
	//}
	//// Only return once all queries have been processed
	//wg.Wait()
}
@grzm I believe I have found and fixed the issue: some messages were getting dropped in some instances, which caused subsequent messages to get out of sync. Your environment now returns reasonable values and sql_exporter doesn't complain about mismatches anymore.
These changes have been pushed into the experimental branch you already have:
https://github.com/huebnerr/vertica-sql-go/tree/mutex-experimental
I've added a few other changes related to logging, notably file logging of the protocol (see the README.md file for changes, if interested). If everything works for you, I'll do a PR to the main repo. Thanks!
@huebnerr I can confirm that in the testing I've done I'm no longer seeing the issue. I am seeing some warnings associated with the logging changes you mention (it looks like they were introduced in https://github.com/huebnerr/vertica-sql-go/commit/33f052b032f73931bc5261c6fd18e9122bdfbdc1), such as:
connection: unhandled message: Cmd Description: tag=SELECT, hasRewrite=false, rewrite=''
connection: unhandled message: BindComplete
Thank you!