Fixing Slow Row-Level Security Policies
At my day job, we use row-level security extensively. Several different roles interact with Postgres through the same GraphQL API; each role has its own grants and policies on tables; whether a role can see record X in table Y can depend on its access to record A in table B, so these policies aren't merely a function of the contents of the candidate row itself. There's more complexity than that, even, but no need to get into it.
Two tables, then.
set jit = off; -- just-in-time compilation mostly serves to muddy the waters here
create table tag (
id int generated always as identity primary key,
name text
);
insert into tag (name)
select * from unnest(array[
'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'iota', 'kappa', 'lambda', 'mu',
'nu', 'xi', 'omicron', 'pi', 'rho', 'sigma', 'tau', 'upsilon', 'phi', 'chi', 'psi', 'omega'
]);
create table item (
id int generated always as identity primary key,
value text,
tags int[]
);
insert into item (value, tags)
select
md5(random()::text),
array_sample((select array_agg(id) from tag), trunc(random() * 4)::int + 1)
from generate_series(1, 1000000);
create index on item using gin (tags);
alter table tag enable row level security;
alter table item enable row level security;
We'll set up two roles to compare performance. item_admin will have a simple policy allowing it to view all items, while item_reader's access will be governed by session settings that the user must configure before attempting to query these tables.
create role item_admin;
grant select on item to item_admin;
grant select on tag to item_admin;
create policy item_admin_tag_policy on tag
for select to item_admin
using (true);
create policy item_admin_item_policy on item
for select to item_admin
using (true);
create role item_reader;
grant select on item to item_reader;
grant select on tag to item_reader;
-- `set item_reader.allowed_tags = '{alpha,beta}'` and see items tagged
-- alpha or beta
create policy item_reader_tag_policy on tag
for select to item_reader
using (
current_setting('item_reader.allowed_tags') is not null and
current_setting('item_reader.allowed_tags')::text[] @> array[name]
);
create policy item_reader_item_policy on item
for select to item_reader
using (
exists (
select 1 from tag
where item.tags @> array[tag.id]
)
);
Before we proceed: this post includes a lot of explain plans. These can be a bit intimidating to read at first (although reading them is a skill very much worth developing if you have any stake in making Postgres databases fast!). There are many explain helpers that you can copy-paste the plans into for a more intuitively structured view; some popular offerings include depesz's, pgExplain, and Dalibo.
Baseline Performance
Okay! Let's look at an example query, first as item_admin. This retrieves items tagged alpha and having a value beginning with the letter A.
set role item_admin;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on public.item (cost=748.75..14704.25 rows=109556 width=68) (actual time=7.041..50.822 rows=6791 loops=1)
Output: id, value, tags
Recheck Cond: (item.tags && '{1}'::integer[])
Filter: (item.value ~~* 'a%'::text)
Rows Removed by Filter: 101875
Heap Blocks: exact=12311
-> Bitmap Index Scan on item_tags_idx (cost=0.00..721.36 rows=109567 width=0) (actual time=5.762..5.762 rows=108666 loops=1)
Index Cond: (item.tags && '{1}'::integer[])
Query Identifier: 1548793227074419886
Planning Time: 0.123 ms
Execution Time: 51.001 ms
So far, so good: we're using the GIN index, the bitmap is small enough to stay lossless (Heap Blocks: exact=12311) so nothing needs to be rechecked, the filter is filtering, 51ms.
Now, as item_reader:
set role item_reader;
set item_reader.allowed_tags = '{alpha,beta}';
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Seq Scan on public.item (cost=0.00..41777312.00 rows=54778 width=68) (actual time=0.476..8100.051 rows=6791 loops=1)
Output: item.id, item.value, item.tags
Filter: (EXISTS(SubPlan 1) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 993209
SubPlan 1
-> Seq Scan on public.tag (cost=0.00..41.75 rows=1 width=0) (actual time=0.008..0.008 rows=0 loops=1000000)
Filter: ((current_setting('item_reader.allowed_tags'::text) IS NOT NULL) AND ((current_setting('item_reader.allowed_tags'::text))::text[] @> ARRAY[tag.name]) AND (item.tags @> ARRAY[tag.id]))
Rows Removed by Filter: 18
Query Identifier: 1548793227074419886
Planning Time: 0.135 ms
Execution Time: 8100.319 ms
That's roughly 150 times slower. What's the difference?
The item_reader plan is very different, in fact. Instead of using the GIN index on item.tags, we're sequentially scanning tag in the subplan. Worse, we're sequentially scanning tag in the subplan a million times, once per item record (loops=1000000); at roughly 0.008ms per loop, that alone accounts for essentially the whole eight seconds. That's probably the issue.
How can we fix it?
Use More Efficient Operations
Our policy on item tests array containment, but that's kind of overkill; the any operation is better optimized for what we need here.
set role postgres;
drop policy if exists item_reader_item_policy on item;
create policy item_reader_item_policy on item
for select to item_reader
using (
exists (
select 1 from tag
where tag.id = any(item.tags)
)
);
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Seq Scan on public.item (cost=0.00..15035202.44 rows=53295 width=68) (actual time=0.531..2441.101 rows=6854 loops=1)
Output: item.id, item.value, item.tags
Filter: (EXISTS(SubPlan 1) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 993146
SubPlan 1
-> Bitmap Heap Scan on public.tag (cost=4.23..15.01 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1000000)
Recheck Cond: (tag.id = ANY (item.tags))
Filter: ((current_setting('item_reader.allowed_tags'::text) IS NOT NULL) AND ((current_setting('item_reader.allowed_tags'::text))::text[] @> ARRAY[tag.name]))
Rows Removed by Filter: 2
Heap Blocks: exact=1000000
-> Bitmap Index Scan on tag_pkey (cost=0.00..4.23 rows=10 width=0) (actual time=0.001..0.001 rows=3 loops=1000000)
Index Cond: (tag.id = ANY (item.tags))
Query Identifier: -1492990194093681799
Planning Time: 0.102 ms
Execution Time: 2441.345 ms
Quite a bit better, although still very slow compared with the admin query. We're still performing a million loops, but the any test is able to take advantage of tag's primary key index each time instead of sequentially scanning the whole table a million times.
Functions can be Execution Boundaries
The tags an item_reader is allowed to see won't change in the middle of the query; even if there's an access-control table being updated concurrently, we're in our own transaction here and won't see or care about changes. There's no need to get the list anew for each candidate item record. Let's try caching that using a function.
set role postgres;
create or replace function item_reader_allowed_tags()
returns int[]
language sql
begin atomic;
select array_agg(id)
from tag
where name = any(current_setting('item_reader.allowed_tags')::text[]);
end;
drop policy if exists item_reader_item_policy on item;
create policy item_reader_item_policy on item
for select to item_reader
using (
item_reader_allowed_tags() && item.tags
);
Et voilà:
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Seq Scan on public.item (cost=0.00..279812.00 rows=1096 width=68) (actual time=0.911..13425.743 rows=6791 loops=1)
Output: id, value, tags
Filter: ((item_reader_allowed_tags() && item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 993209
Query Identifier: 1548793227074419886
Planning Time: 0.102 ms
Execution Time: 13426.034 ms
Hold up a minute. That's worse. Worse even than the baseline, and not by a little.
Functions Need Configuring
Functions are extremely powerful, which means Postgres has to assume that any function could get up to absolutely anything in the database, and the query engine has to treat it like it's a screwdriver-slip away from criticality. However, we can pinky promise that our function doesn't do anything untoward.
set role postgres;
alter function item_reader_allowed_tags stable;
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on public.item (cost=1353.87..68540.82 rows=22474 width=68) (actual time=12.503..2880.043 rows=6791 loops=1)
Output: id, value, tags
Filter: ((item_reader_allowed_tags() && item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 200730
Heap Blocks: exact=12312
-> Bitmap Index Scan on item_tags_idx (cost=0.00..1348.25 rows=205140 width=0) (actual time=11.115..11.115 rows=207521 loops=1)
Index Cond: (item.tags && item_reader_allowed_tags())
Query Identifier: 1548793227074419886
Planning Time: 0.280 ms
Execution Time: 2880.319 ms
Finally, item_reader gets to use the GIN index! With the default volatility setting of volatile, Postgres assumes both that the function could do anything and that "anything" could change at any time, including between successive invocations in the same statement. If we say instead that it will always return the same output for the same input within a statement, the planner takes us at our word and lets us perform an index scan.
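If you lose track of which of these promises are in effect, a quick look at pg_proc shows what the planner currently believes about the function:
select provolatile, proparallel, proleakproof, procost -- 'v' = volatile, 's' = stable, 'i' = immutable
from pg_proc
where proname = 'item_reader_allowed_tags';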
This is much better than our eight-second baseline, but still actually a bit worse than the simple inline any policy. We've definitely got a ways to go before we approach the 50ms times those lucky admins get. What if we parallelize?
set role postgres;
alter function item_reader_allowed_tags parallel safe;
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=2353.87..39777.84 rows=22474 width=68) (actual time=15.013..1008.731 rows=6791 loops=1)
Output: id, value, tags
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on public.item (cost=1353.87..36530.44 rows=9364 width=68) (actual time=9.112..997.387 rows=2264 loops=3)
Output: id, value, tags
Filter: ((item_reader_allowed_tags() && item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 66910
Heap Blocks: exact=4128
Worker 0: actual time=5.971..994.315 rows=2253 loops=1
Worker 1: actual time=6.901..994.412 rows=2256 loops=1
-> Bitmap Index Scan on item_tags_idx (cost=0.00..1348.25 rows=205140 width=0) (actual time=13.099..13.099 rows=207521 loops=1)
Index Cond: (item.tags && item_reader_allowed_tags())
Query Identifier: 1548793227074419886
Planning Time: 0.336 ms
Execution Time: 1008.934 ms
Well, we parallelize. That sounds about right. It's somewhat better than halved from the stable plan; the leader scanning alongside the two workers (hence loops=3) and a warm cache can take the credit for that.
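The two workers, for what it's worth, come from max_parallel_workers_per_gather, which defaults to 2; if the hardware has cores to spare, there's headroom to experiment with:
show max_parallel_workers_per_gather; -- 2 by default, hence "Workers Planned: 2"
-- set max_parallel_workers_per_gather = 4; -- per-session; worth benchmarking if you have the cores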
There are two more function settings that sound like they could be useful.
set role postgres;
alter function item_reader_allowed_tags cost 100000;
You can find some recommendations online to set the function's estimated execution cost to a very high value to "encourage" the planner to execute it as few times as possible. In our example here, upcosting shaves off a few milliseconds, but we're in rounding-error territory. I think there are likely some scenarios in which this does make a significant difference, but I'm not sure what those might be; this is not one. And in any case, it's a kludge.
set role postgres;
alter function item_reader_allowed_tags cost 100;
alter function item_reader_allowed_tags leakproof;
Leakproofness asserts that the function cannot reveal information about its arguments other than through its return value (no side channels, such as error messages). Quals that might leak can't be pushed down past a security barrier, so the barrier's own conditions have to be evaluated first. However, alter function item_reader_allowed_tags leakproof doesn't seem to change the plan noticeably. Stick a pin in that.
Is This as Good as it Gets?
It sure seems like we're stuck. There are still some differences from the admin query plan: as admin, we don't care about item_reader_allowed_tags, and our index scan performs the test for tag id 1. As reader, our index scan checks for allowed tags, that apparently being more cost-optimal during planning, but then in a post-scan filter we run item_reader_allowed_tags a second time along with the ilike and tag-id tests from the where clause.
This is the point where I started source-diving. My initial assumption was that, because RLS was active, the policy predicates needed to be checked separately from and before those in the where clause. However, item_reader_allowed_tags is a simple select statement at heart, and, especially if it's leakproof, it might be inlineable. If so, then all the predicates ("quals" in internalsese) get dumped into one big bucket for the planner to pick through as it likes, which would make the post-scan filtering rather odd. I just needed to find out whether that was happening, and this information isn't available in the explain plan.
Internal Representations
The explain plans I've been posting throughout are more-or-less human-readable representations of what the query planner intends to do or has done. They are rather lossy summaries, it turns out. If you're brave, foolish, desperate, or an internals developer, though, you can turn on a couple of client settings that dump the planner's internal representation out.
set debug_print_plan = on;
set client_min_messages = log;
It's a bad idea to use tab-completion after setting these, for the record. Disable by setting them back to off and notice, respectively.
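That is, when you're done:
set debug_print_plan = off;
set client_min_messages = notice;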
I'm not going to copy the internal representation of the simple test query we've been using here because it's 2500 lines long and the first thing I want to point out is an absence anyway.
After I proved out my performance fix on one role-policy combination, the next set I'm tackling produces an internal representation of more than 2 million lines, well past my terminal's scrollback. psql doesn't send these debug messages to stdout along with the query output, so I wound up having to run psql -c "set debug_print_plan=on; set client_min_messages=log; explain analyze ...." 2>&1 | tee out.plan to capture it.
To test my assumption, I was looking for the :hasRowSecurity flag. If it appeared, that would mean item_reader_allowed_tags is not inlined, and constitutes a security barrier forcing quals to be evaluated in multiple stages. The where quals would only be evaluated for a candidate record after the quals in the function; there'd be an obvious explanation for the post-index-scan filtering step; we wouldn't be happy, but we'd understand it.
I didn't see :hasRowSecurity anywhere.
What I did see was not one, but seven plans produced for my test query. Try it!
How Does One Query Turn into Seven Plans?
Or, functions can be execution boundaries, oh no! In order, the plans are:
1. Aggregation of tag.id based on an array-containment (OPEXPR :opno 2751 :opfuncid 2748) between current_setting (FUNCEXPR :funcid 2077) and an array instantiation. This is clearly item_reader_allowed_tags, although there's some extra funny business like a null test.
2. The same plan again.
3. Gathering a bitmap heap scan on item, with three quals involved: testing item_reader_allowed_tags overlap with the tags column; a text ilike comparison; and a second array overlap with a constant. Only the first qual makes it down into the plan tree. This is the GIN index scan.
4. Guess what, it's plan 1 again!
5. And again!
6. And yet again!
7. And one for the road!
That's right: six identical plans for RLS, one for the actual query. The null test is from the policy on tag that ensures the item_reader.allowed_tags setting is non-null, so at least it's kind of inlining. The next question is how we get to six.
In the course of doing this analysis for real, I'd been searching up and reading anything I could. What put me on the right track was a discussion in PostgREST's docs issue tracker about the performance implications of current_setting in RLS. current_setting, like many other functions, is not leakproof. We may have declared item_reader_allowed_tags leakproof so it could qualify for inlining, but current_setting remains a barrier -- it just doesn't have any quals, so :hasRowSecurity doesn't need to be set.
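The catalog bears this out; leakproofness is rare among built-ins, and current_setting doesn't have it:
select proname, proleakproof
from pg_proc
where proname in ('current_setting', 'item_reader_allowed_tags');
-- current_setting: f; item_reader_allowed_tags: t, but only because we declared it so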
Between the policy on tag and item_reader_allowed_tags we have three invocations of current_setting. item_reader_allowed_tags is parallel safe, and we spin up two workers. That's six. It's easy enough to verify: mark the function parallel unsafe, and we're down to four redundant plans. One inside item_reader_allowed_tags, two in the tag policy; I'm not sure where the fourth is coming from.
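If you want to reproduce that count, toggling the flag is all it takes; just remember to put it back:
set role postgres;
alter function item_reader_allowed_tags parallel unsafe;
-- re-run the explain as item_reader with debug_print_plan on, count the plans, then:
alter function item_reader_allowed_tags parallel safe;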
The PostgREST discussion contributors did work out how best to optimize the policies. If a plan has no dependencies, it can be pulled up into an InitPlan, which is guaranteed to run only once and is then referenced by other plans. All it takes is keeping the current_setting invocations in their own selects:
set role postgres;
drop policy if exists item_reader_tag_policy on tag;
create policy item_reader_tag_policy on tag
for select to item_reader
using (
(select current_setting('item_reader.allowed_tags')::text[]) is not null and
(select current_setting('item_reader.allowed_tags')::text[]) @> array[name]
);
create or replace function item_reader_allowed_tags(allowed_tags text[])
returns int[]
language sql
stable
leakproof
parallel safe
begin atomic;
select array_agg(id)
from tag
where name = any(allowed_tags);
end;
drop policy if exists item_reader_item_policy on item;
create policy item_reader_item_policy on item
for select to item_reader
using (
item_reader_allowed_tags(
(select current_setting('item_reader.allowed_tags')::text[])
) && item.tags
);
With that, we've knocked another third or so off our query time:
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=1125.07..14581.22 rows=1096 width=68) (actual time=13.479..627.433 rows=6791 loops=1)
Output: item.id, item.value, item.tags
Workers Planned: 2
Workers Launched: 2
InitPlan 1
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Output: (current_setting('item_reader.allowed_tags'::text))::text[]
-> Parallel Bitmap Heap Scan on public.item (cost=125.05..13471.60 rows=457 width=68) (actual time=7.711..617.697 rows=2264 loops=3)
Output: item.id, item.value, item.tags
Filter: ((item_reader_allowed_tags((InitPlan 1).col1) && item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 66910
Heap Blocks: exact=4130
Worker 0: actual time=4.878..614.809 rows=2285 loops=1
Worker 1: actual time=5.181..614.866 rows=2278 loops=1
-> Bitmap Index Scan on item_tags_idx (cost=0.00..124.78 rows=10000 width=0) (actual time=11.706..11.706 rows=207521 loops=1)
Index Cond: (item.tags && item_reader_allowed_tags((InitPlan 1).col1))
Query Identifier: 1548793227074419886
Planning Time: 0.261 ms
Execution Time: 627.645 ms
The internal representation still shows a surprising number of plans: 5, with 4 still being redundant item_reader_allowed_tags invocations, this time using a Param node in place of the FuncExpr. And while this is better than anything else we've gotten by a not-inconsiderable margin, it still takes more than half a second, compared to the admin's 50ms.
This, and especially the presence of multiple item_reader_allowed_tags plans, suggests that we simply aren't InitPlanning hard enough. We know it's not just current_setting that's stable within a single query; item_reader_allowed_tags itself won't change either, but something's preventing the current_setting invocations it triggers in the tag policy from using the InitPlan we generate in the item policy. So let's stuff that whole invocation in an InitPlan! We'll keep the select-wrapped current_setting invocation in its arguments, since we might as well eke out every bit of reusability we can, but ensuring we only call item_reader_allowed_tags once will probably have a far bigger effect than that.
set role postgres;
drop policy if exists item_reader_item_policy on item;
create policy item_reader_item_policy on item
for select to item_reader
using (
(select item_reader_allowed_tags(
(select current_setting('item_reader.allowed_tags')::text[])
)) && item.tags
);
set role item_reader;
explain (analyze, verbose, costs, timing) select * from item where value ilike 'a%' and tags && array[1];
QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on public.item (cost=125.08..12531.39 rows=1084 width=68) (actual time=12.057..93.546 rows=6726 loops=1)
Output: item.id, item.value, item.tags
Recheck Cond: ((InitPlan 2).col1 && item.tags)
Filter: ((item.value ~~* 'a%'::text) AND (item.tags && '{1}'::integer[]))
Rows Removed by Filter: 200646
Heap Blocks: exact=12311
InitPlan 2
-> Result (cost=0.02..0.28 rows=1 width=32) (actual time=0.076..0.076 rows=1 loops=1)
Output: item_reader_allowed_tags((InitPlan 1).col1)
InitPlan 1
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=1)
Output: (current_setting('item_reader.allowed_tags'::text))::text[]
-> Bitmap Index Scan on item_tags_idx (cost=0.00..124.53 rows=10000 width=0) (actual time=10.942..10.942 rows=207372 loops=1)
Index Cond: (item.tags && (InitPlan 2).col1)
Query Identifier: 2531873819528607867
Planning Time: 0.088 ms
Execution Time: 93.713 ms
It did have a far bigger effect! We're not beating the admin timing, but considering we have to perform double the tests on item.tags, not-quite-double the duration is probably about as good as we can hope for, until Postgres' planner gets smart enough to combine the two tests in the index scan.
The internal plan representation is also finally down to two separate plans: the bitmap heap scan, and a single instance of item_reader_allowed_tags.