PGSQL Phriday #008: pg_stat_statements

The topic for May is pg_stat_statements. As luck would have it, it's been invaluable to me over the past few weeks as I've been solving some performance problems of the "tens of millions of rows, row-level security, inverted indices, tens of thousands of rows returned, oops I never did get around to double-checking work_mem in production did I?" variety. The big lesson this time around: pay attention to the standard deviation of timings!

The most often called (by far) and longest running (by a much closer margin) statements in this scenario were coming from an account synchronization daemon. Every fifteen seconds the daemon pulls user account information from Keycloak and overwrites the materialized local data, a pattern that sounds suspiciously like an inferior implementation of something RDS is not going to ship any time soon. postgres_fdw is there, of course, but then we'd be depending on Keycloak's schema rather than its API, and that's a much chancier proposition.

The initial user sync implementation wrote to three relevant tables in a single statement using CTEs, because why not? It's easy, convenient, and seemed to work just fine in non-production environments.

In production, though (execution times in milliseconds):

calls      min_exec_time  mean_exec_time  max_exec_time  stddev_exec_time
8,148,657  0.535          9.720           36,272.717     115.918
4,489,798  0.560          15.650          81,526.365     77.713

These two rows are the same statement: with dataset as ([upsert dataset] returning *), person as (insert into person [with dataset membership] returning *) insert into account [reference to person]. For us, accounts are special cases of people, and people have a tag array column linking them to datasets; we have our reasons for avoiding a junction table, but they don't make a difference here.
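
In the abstract, the single-statement version looked something like this. The table and column names here are illustrative, not our actual schema:

with dataset as (
    -- upsert the dataset record
    insert into dataset (external_id, name)
    values ($1, $2)
    on conflict (external_id) do update set name = excluded.name
    returning id
), person as (
    -- insert the person, tagging them with the dataset via the array column
    insert into person (external_id, full_name, dataset_tags)
    values ($3, $4, array[(select id from dataset)])
    returning id
)
-- and finally the account referencing the new person
insert into account (person_id, username)
select id, $5 from person;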

The daemon got a dedicated Postgres role 8 million executions after I enabled pg_stat_statements, and has used that role for 4.5 million more; since pg_stat_statements keeps a separate entry per role, the same statement shows up twice. At its fastest, it completes in about half a millisecond -- great! At worst, though, it takes over half a minute, and in one case nearly a minute and a half. The means are decently low, but they're means, and it's hard to tell just how many longer-running outliers are pulling them up.
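
If you'd rather see combined figures for a statement that appears under multiple roles, a rough aggregation by queryid does the job -- although standard deviations can't be recombined this simply, so I leave them out:

select
    queryid,
    sum(calls) as calls,
    min(min_exec_time) as min_exec_time,
    -- weighted mean across the per-role entries
    sum(total_exec_time) / sum(calls) as mean_exec_time,
    max(max_exec_time) as max_exec_time
from pg_stat_statements
group by queryid
order by sum(calls) desc
limit 20;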

All is revealed by the standard deviation, which is quite low in both cases. Most syncs happen within about a tenth of a second of the mean, which is itself closer to a hundredth of a second. Assuming a normal distribution, 99.7% of timings should fall within three standard deviations, and an execution time over a second represents somewhere between eight and thirteen standard deviations from the mean. If I'm statisticking correctly, that means that out of the 12.6 million samples, the only timings over a second are almost certainly just the two known maxes. It's still not exactly wonderful that the statement can run for a minute and a half when the stars align, but if you have a high max alongside a low min, mean, and deviation, the statement you're looking at isn't itself the problem: something else is getting in its way.
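
Here's the back-of-the-envelope version of that arithmetic, run directly against the view; the one-second threshold is just the example above, nothing magic:

select
    calls,
    mean_exec_time,
    stddev_exec_time,
    -- how many standard deviations above the mean a one-second execution would be
    (1000 - mean_exec_time) / nullif(stddev_exec_time, 0) as deviations_at_one_second,
    -- and how far out the recorded worst case actually is
    (max_exec_time - mean_exec_time) / nullif(stddev_exec_time, 0) as deviations_at_max
from pg_stat_statements
where calls > 100
order by deviations_at_max desc nulls last
limit 20;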

I don't know for sure what it is that got in the way. My chief suspect is a database function that adds dataset memberships to multiple records at a time, or its counterpart that removes them, both further down my top-20 list. Clients were initially configured to call these with batches of 25,000 records, which quickly exceeded the default 4MB work_mem and would churn for the better part of a minute at minimum. The modified records would all have had foreign keys to accounts -- forcing the sync daemon's changes to wait. Dataset membership management fits the "stars aligning" usage profile as well, since mass changes like that aren't yet happening every day. With work_mem adjusted to 16MB, those functions have sped up dramatically, and I haven't noticed any other suspicious timings since.
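
For the record, work_mem can be raised cluster-wide or scoped more narrowly; either of these would do it (the role name is invented for the example):

-- cluster-wide; work_mem changes take effect on a configuration reload
alter system set work_mem = '16MB';
select pg_reload_conf();

-- or scope the change to the role that calls the membership functions
alter role batch_client set work_mem = '16MB';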

I did split up the statement, since both accounts and datasets are quite high-traffic tables, the former being targeted by foreign keys all over the place and the latter governing row-level security on several other tables. Millions of syncs after the change, only the accounts insert has ever gone long -- probably only once, and for significantly less time than the earlier outliers. As might be expected, it's also faster and more consistent than the triple insert, with a standard deviation of 11ms around a mean of less than a tenth of a millisecond.

calls      min_exec_time  mean_exec_time  max_exec_time  stddev_exec_time
4,887,878  0.026          0.091           22,436.409     10.942

Here's my "leaderboard" query:

select
    userid::regrole::text,
    calls,
    min_exec_time,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    query
from pg_stat_statements
-- skip one-offs and anything that's never run longer than ten seconds
where calls > 100 and max_exec_time > 10000
-- round calls and means into rough buckets so that, within a bucket,
-- lower-deviation (more consistent) statements sort to the top
order by round(calls, -2) desc, round(mean_exec_time::numeric, -2) desc, stddev_exec_time asc
limit 20;