postgrespro/pg_wait_sampling

state of the process is ignored (idle processes are reported)

DmitryNFomin opened this issue · 5 comments

Wait events of idle processes are reported in the same way as wait events of active processes (state taken from pg_stat_activity).

select wait_event, state from pg_stat_activity where state = 'idle';
wait_event | state
------------+-------
ClientRead | idle
Extension  | idle
ClientRead | idle
Extension  | idle
Extension  | idle
ClientRead | idle
ClientRead | idle
ClientRead | idle
ClientRead | idle
ClientRead | idle
ClientRead | idle
ClientRead | idle
(12 rows)

These ClientRead wait events are reported in the _current and _history views, and it is not possible to distinguish them from ClientRead events of active processes. This can be misleading, because a system that does nothing looks like a system that has issues with the network or a "slow" client.

Would it be more correct to report only processes in the active state?

First, the ClientRead wait_event can't happen while the backend process is in the active state, only in idle. Waiting on ClientRead means that the process is waiting for a new portion of client data (for example, a slow client that hasn't sent a query).

We also can't rely 100% on state, because wait_event and state are not changed atomically. (There were attempts to do something about it, but with no success, e.g. here.)

Also, we shouldn't report wait_events of active processes only, because sometimes it is useful to see how much time our processes spend in ClientRead.
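As a rough illustration of that use case, the share of samples spent in ClientRead could be read from the cumulative profile. This is only a sketch: it assumes the pg_wait_sampling_profile view with its event_type, event and count columns, and the percentage is relative to the sampled wait events only, not to total backend time.

```sql
-- Sketch: distribution of sampled wait events, ClientRead included.
-- Assumes pg_wait_sampling_profile(pid, event_type, event, queryid, count).
SELECT event_type,
       event,
       sum(count) AS samples,
       -- share of this event among all collected samples
       round(100.0 * sum(count) / sum(sum(count)) OVER (), 2) AS pct
FROM pg_wait_sampling_profile
GROUP BY event_type, event
ORDER BY samples DESC;
```

A high ClientRead share here mostly reflects sessions waiting for the client to send the next query, so it should be read as client think time or idle time rather than as a server-side bottleneck.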

Please see the output below from one of our databases:
select wait_event_type, wait_event, state from pg_stat_activity where state = 'active';

 wait_event_type |  wait_event   | state
-----------------+---------------+--------
 Activity        | WalSenderMain | active
 Activity        | WalSenderMain | active
 IO              | WALRead       | active
                 |               | active
 Timeout         | VacuumDelay   | active
 IO              | WALSync       | active
                 |               | active
 IO              | DataFileRead  | active
 Client          | ClientRead    | active
                 |               | active
 Client          | ClientRead    | active
 Client          | ClientRead    | active
 IO              | DataFileRead  | active
 IPC             | SyncRep       | active
 IO              | DataFileRead  | active
 LWLock          | BufferIO      | active
 LWLock          | WALWrite      | active
                 |               | active
 Client          | ClientRead    | active
 Client          | ClientRead    | active
 IPC             | SyncRep       | active
 Client          | ClientRead    | active
 Client          | ClientRead    | active
 IPC             | SyncRep       | active
 IPC             | SyncRep       | active
 IPC             | SyncRep       | active
 IPC             | SyncRep       | active
 IO              | DataFileRead  | active
 IO              | DataFileRead  | active
 IPC             | SyncRep       | active
 Activity        | WalSenderMain | active
                 |               | active
 Client          | ClientRead    | active
                 |               | active
 IO              | DataFileRead  | active
                 |               | active
 IO              | DataFileRead  | active
 IO              | DataFileRead  | active
 Client          | ClientRead    | active
 Client          | ClientRead    | active
                 |               | active
 LWLock          | WALWrite      | active
 Client          | ClientRead    | active
                 |               | active
 Client          | ClientRead    | active
 IPC             | SyncRep       | active
 LWLock          | BufferIO      | active
 IO              | DataFileRead  | active

So ClientRead can occur in the active state, where it could point to a slow network or client, while ClientRead in the idle state does not indicate any slowness between the client and the DB.

Please read the email linked in my previous message (the only link in that message). It explains that ClientRead in the active state is an error. There is also an in-depth explanation of this bug/anomaly in the following article (a little after the half-way point).

Also, if you look into the source code, the only place where the ClientRead wait_event is registered is the function secure_read, which has the comment /* In blocking mode, wait until the socket is ready */, so this can't happen in the active state.

When you see a ClientRead/active combination, it is actually an error: for pg_stat_activity we first retrieve info from BackendStatusArray, which gives us the state of the backend (active/idle), and then we look into ProcArray, where we get the ClientRead wait_event. This is NOT an atomic operation (it looks up two different arrays in different places in memory), so anomalies such as the ones you found can occur.
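With that caveat in mind, anyone who still wants an approximate "active only" view on the client side could join the sampled events with pg_stat_activity. The query below is a sketch that assumes pg_wait_sampling_current exposes pid, event_type and event. Since the state and the wait event are read at different moments, it inherits the same non-atomicity and can still occasionally show ClientRead for a backend reported as active.

```sql
-- Sketch: current wait events restricted to backends that
-- pg_stat_activity reports as active at query time.
-- The two sources are not read atomically, so rare mismatches
-- (e.g. ClientRead next to state = 'active') remain possible.
SELECT a.pid,
       a.state,
       c.event_type,
       c.event
FROM pg_wait_sampling_current AS c
JOIN pg_stat_activity AS a ON a.pid = c.pid
WHERE a.state = 'active';
```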

Thanks a lot! That's a really great explanation.
Just one last question: why are NULL wait_events filtered out? We lose part of the database activity in the report, especially on a very loaded/active database.
If this should be a separate issue, I will create it.

Actually, we already have such an issue - #10.

We don't have a solution for this issue for now, but thanks for bringing it up.