<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title><![CDATA[Dian M Fay]]></title>
        <description><![CDATA[All matter becomes porous to certain scents; they pass / Through everything; it seems they even go through glass]]></description>
        <link>https://di.nmfay.com/</link>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 27 Jan 2026 03:12:21 GMT</lastBuildDate>
        <atom:link href="https://di.nmfay.com/atom.xml" rel="self" type="application/rss+xml"/>
        <copyright><![CDATA[Dian M Fay]]></copyright>
        <item>
            <title><![CDATA[Stateful Systems and the Fitness Floodplain: A Lament]]></title>
            <description><![CDATA[<p>A "fitness landscape" is a topographical metaphor for evolutionary success. A latent space described by genes or physical features, if you will, where fitness, or the suitability of a genotype or morphotype to its own physical environment, corresponds to elevation. Peaks therefore represent combinations of genes or features with high suitability, valleys those which struggle or even fail entirely. The crab, that success story of genetic and convergent evolution alike, occupies peaks; the panda, a valley.</p>
<p>Genotypes and morphotypes are near to or far from each other on the fitness landscape by similarity across dimensions of interest. Crabs and pandas are fairly far apart in most comparisons. One's a decapodal crustacean scavenger, the other an ursid that traded in carnivory for a diet it can barely digest and reduced itself almost to sessility. Spiders generally wind up pretty close to crabs: close common ancestor, chitinous exoskeletons, minus a couple of limbs, plus a few eyes, book lungs instead of (usually) gills. Koalas similarly for pandas, with the fearsomely well-adapted cats staring down at both of them from the heights of the mammalian region.</p>
<p>Natural selection guides the species on the fitness landscape upward, becoming more fit -- ideally. Sometimes a dominant mutation takes more than it gives or overspecializes, and sends the species toward a fitness valley instead. Conversely, if a species not too well fitted for its surroundings is able to migrate physically into a new, more congenial area, it's also moved peakward in terms of fitness.</p>
<p>And like its denizens, the fitness landscape itself isn't static. It normally changes at a much slower rate than they do, which is what makes natural selection effective; sometimes, though, it does change much more rapidly. We call these times extinction events. All of a sudden, the environment oxygenates or deoxygenates or heats up or cools or acidifies or is overtaken by a species itself no longer subject to natural selection, which crowds out habitat and plows forage under. The fitness landscape quakes. Valleys are exalted, mountains and hills made low. Species highly fit for the old environment have likely specialized in ways that limit their tolerance for their new surroundings, and perish. Less specialized species fill the new gaps and, in the absence of fitter competitors or predators, have the opportunity to specialize themselves. The synapsid survivors of the Cretaceous-Paleogene extinction had previously been unable to challenge the dominant fitness of the dinosaurs; post-Chicxulub they thrived and ramified, finding varying levels of long-term success. We're here, cats are here, pandas have found a more suitable habitat in zoos run by a species that thinks they're cute, and koalas are still clinging to their last vestiges of independent suitability, but how many megatheria have you seen lately?</p>
<p>Anyway, software. Evolution itself is a favorite metaphor. Software systems evolve, gaining and losing features and functionality across versions or releases, maybe slowly, maybe quickly. However, the evolution of software has a very different guiding principle from that of species. Where mutation sends individuals in random directions on the fitness landscape and natural selection eventually winnows out those who head downward, software projects are instead built by people intent that the system's overall fitness<sup><a href="#footnote-1">1</a></sup> should always increase. Developers undertake cautious and considered traversals of the fitness landscape, targeting higher and higher peaks while minimizing downward travel.</p>
<p>It'd be nice if each rise led directly to the next in an unbroken line, wouldn't it?</p>
<p>Unfortunately, just as real mountains are separated by valleys, so too are peaks on the fitness landscape interspersed among lower terrain. In order to reach the next peak, to satisfy the next user need, to improve the fitness and therefore the odds of success of your software, you must, more often than not, travel through a valley: refactoring code for reusability, reorganizing data structures to accommodate future extension, unwinding assumptions that no longer hold and deprecating the affordances that depended on them, all the yak-shaving that inevitably accretes as a software system matures, its youthful flexibility ossifies, and the technological environment and market continue to change around it.</p>
<p>Iterative development processes work to keep valley crossings as short and as shallow as possible by introducing feedback loops, both at the build-and-test level and on longer cycles with regular or continuous delivery and frequent user input. The "fail fast" doctrine encourages performing rapid searches in many directions to rule out routes that pull downward, in effect bringing a kind of "natural" selection back into the picture to cull the less fit mutations. Prototyping and spiking even build valley-traversal into software development explicitly, on grounds that seeing the view from the next peak quickly is worth having to make a second trip up from base camp -- and that second climb might even be to a different, higher peak that only became visible from the first.</p>
<p>In stateless systems, this is all manageable, or at least as manageable as the codebase and its interface stability requirements. The opportunity cost of moving to the next peak isn't <em>nothing</em>, but whatever holds us back can often be abandoned as long as we can continue to satisfy user needs without it and nobody else depends on an interface we publish. It's also relatively easy to cut bait and backtrack if we suspect we're crossing an unacceptably deep valley or have climbed a local maximum that could endanger our long-term success. We can afford a fairly naive search of the landscape, using loose-coupling techniques to dodge the riskiest valleys and planning one ascent at a time, because if we find ourselves heading in an unpromising direction, we're out some time and have to throw out some code, but no more than that. The experience may even have taught us new things that we can immediately put to good use on the next ascent.</p>
<p>Stateful systems, meanwhile, must reckon not only with the ordinary inertia of design and code, and almost always with some set of commitments to interface stability in the form of an API or a data dictionary, but also with masses of stored information whose resistance to change cannot be circumvented in the process of migrating to an improved data model and which may even be incompatible with its constraints and expectations. Throwing out code and going back to the drawing board on the design of some subsystem may not be fun. Throwing out specifications and revising interface contracts is politically fraught at best, but can be done. Throwing out real, useful data just because it happens to omit newly-important properties or irreversibly collapses a distinction revealed to be crucial going forward is out of the question.</p>
<p>Iteration remains one of our most helpful methodological tools for its emphasis on controlled traversal and regular fitness checkpoints, but when even seemingly minor decisions can be impossible to reverse, a practice focused on iterating over more-but-smaller decisions exists in tension with the nature of the work. It's not merely that, in order to prepare for the next ascent, the system's maintainers have to haul the whole thing down into a valley instead of leaving the dead weight of outdated models and specifications behind. The valley is also flooded, a tide of precious and obstinate information washing across the landscape, pulling our careful descent off-course, blocking our access to certain peaks, and obscuring the terrain below.</p>
<p>Where are our boats?</p>
<p class="footnote"><a id="footnote-1">1</a>: How is software fitness determined? Satisfaction of explicit user needs; ease of use; reliability; consistency; conceptual simplicity for user and for developer. Really, it's all user needs, one way or another: for a software system to satisfy the explicit needs, it implicitly must also be convenient, reliable, consistent, and not more complicated than the problem. In the absence of an agreed "done" state, it must also admit further modification. The success of "worse is better" only shows that simplicity for developers beats consistency (in the sense of fidelity to a complex problem space) in a fair fight, or, if you're on the side of "better", that you don't get a fair fight.</p>]]></description>
            <link>https://di.nmfay.com/fitness-floodplain</link>
            <guid isPermaLink="true">https://di.nmfay.com/fitness-floodplain</guid>
            <pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Fixing Slow Row-Level Security Policies]]></title>
            <description><![CDATA[<p>At my day job, we use <a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html">row-level security</a> extensively. Several different roles interact with Postgres through the same GraphQL API; each role has its own grants and policies on tables; whether a role can see record X in table Y can depend on its access to record A in table B, so these policies aren't merely a function of the contents of the candidate row itself. There's more complexity than that, even, but no need to get into it.</p>
<p>Two tables, then.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> jit = <span class="hljs-keyword">off</span>; <span class="hljs-comment">-- just-in-time compilation mostly serves to muddy the waters here</span>

<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> tag (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">int</span> <span class="hljs-keyword">generated</span> <span class="hljs-keyword">always</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">identity</span> primary <span class="hljs-keyword">key</span>,
  <span class="hljs-keyword">name</span> <span class="hljs-built_in">text</span>
);

<span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> tag (<span class="hljs-keyword">name</span>)
<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> unnest(<span class="hljs-built_in">array</span>[
    <span class="hljs-string">'alpha'</span>, <span class="hljs-string">'beta'</span>, <span class="hljs-string">'gamma'</span>, <span class="hljs-string">'delta'</span>, <span class="hljs-string">'epsilon'</span>, <span class="hljs-string">'zeta'</span>, <span class="hljs-string">'eta'</span>, <span class="hljs-string">'iota'</span>, <span class="hljs-string">'kappa'</span>, <span class="hljs-string">'lambda'</span>, <span class="hljs-string">'mu'</span>,
    <span class="hljs-string">'nu'</span>, <span class="hljs-string">'xi'</span>, <span class="hljs-string">'omicron'</span>, <span class="hljs-string">'pi'</span>, <span class="hljs-string">'rho'</span>, <span class="hljs-string">'sigma'</span>, <span class="hljs-string">'tau'</span>, <span class="hljs-string">'upsilon'</span>, <span class="hljs-string">'phi'</span>, <span class="hljs-string">'chi'</span>, <span class="hljs-string">'psi'</span>, <span class="hljs-string">'omega'</span>
]);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> item (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">int</span> <span class="hljs-keyword">generated</span> <span class="hljs-keyword">always</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">identity</span> primary <span class="hljs-keyword">key</span>,
  <span class="hljs-keyword">value</span> <span class="hljs-built_in">text</span>,
  tags <span class="hljs-built_in">int</span>[]
);

<span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> item (<span class="hljs-keyword">value</span>, tags)
<span class="hljs-keyword">select</span>
  <span class="hljs-keyword">md5</span>(random()::<span class="hljs-built_in">text</span>),
  array_sample((<span class="hljs-keyword">select</span> array_agg(<span class="hljs-keyword">id</span>) <span class="hljs-keyword">from</span> tag), trunc(random() * <span class="hljs-number">4</span>)::<span class="hljs-built_in">int</span> + <span class="hljs-number">1</span>)
<span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000000</span>);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">on</span> item <span class="hljs-keyword">using</span> gin (tags);

<span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> tag <span class="hljs-keyword">enable</span> <span class="hljs-keyword">row</span> <span class="hljs-keyword">level</span> <span class="hljs-keyword">security</span>;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> item <span class="hljs-keyword">enable</span> <span class="hljs-keyword">row</span> <span class="hljs-keyword">level</span> <span class="hljs-keyword">security</span>;</code></pre>
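<p>A quick sanity check on the generated data can be reassuring before moving on. This is only a sketch, meant to be run as the table owner or a superuser, since row-level security is now enabled and no grants or policies exist yet for anyone else. Note also that <code>array_sample</code> was added in PostgreSQL 16, so the setup above assumes at least that version.</p>
<pre><code class="hljs language-sql">-- how many items carry 1, 2, 3, or 4 tags?
select cardinality(tags) as tag_count, count(*) as items
from item
group by 1
order by 1;</code></pre>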
<p>We'll set up two roles to compare performance. <code>item_admin</code> will have a simple policy allowing it to view all items, while <code>item_reader</code>'s access will be governed by session settings that the user must configure before attempting to query these tables.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">role</span> item_admin;
<span class="hljs-keyword">grant</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">on</span> item <span class="hljs-keyword">to</span> item_admin;
<span class="hljs-keyword">grant</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">on</span> tag <span class="hljs-keyword">to</span> item_admin;

<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_admin_tag_policy <span class="hljs-keyword">on</span> tag
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_admin
<span class="hljs-keyword">using</span> (<span class="hljs-literal">true</span>);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_admin_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_admin
<span class="hljs-keyword">using</span> (<span class="hljs-literal">true</span>);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">grant</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">on</span> item <span class="hljs-keyword">to</span> item_reader;
<span class="hljs-keyword">grant</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">on</span> tag <span class="hljs-keyword">to</span> item_reader;

<span class="hljs-comment">-- `set item_reader.allowed_tags = '{alpha,beta}'` and see items tagged</span>
<span class="hljs-comment">-- alpha or beta</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_tag_policy <span class="hljs-keyword">on</span> tag
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">and</span>
    current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[] @> <span class="hljs-built_in">array</span>[<span class="hljs-keyword">name</span>]
);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    <span class="hljs-keyword">exists</span> (
        <span class="hljs-keyword">select</span> <span class="hljs-number">1</span> <span class="hljs-keyword">from</span> tag
        <span class="hljs-keyword">where</span> item.tags @> <span class="hljs-built_in">array</span>[tag.id]
    )
);</code></pre>
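<p>The <code>item_reader</code> policies depend on a custom setting, so it may help to see one way a client could supply it. The sketch below assumes the application wraps each request in a transaction (an assumption for illustration, not something these policies require); passing <code>true</code> as the third argument of <code>set_config</code> scopes the value to that transaction, much like <code>set local</code>.</p>
<pre><code class="hljs language-sql">begin;
set local role item_reader;
-- third argument true: the value reverts when the transaction ends
select set_config('item_reader.allowed_tags', '{alpha,beta}', true);
select name from tag order by name; -- only alpha and beta pass the tag policy
commit;</code></pre>
<p>One caveat: the one-argument form of <code>current_setting</code> raises an error when the setting has never been defined in the session, so the <code>is not null</code> test in the tag policy is never reached in that case. <code>current_setting('item_reader.allowed_tags', true)</code> returns null instead of raising, if that behavior is preferred.</p>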
<p>Before we proceed, this post includes a lot of <code>explain</code> plans. These can be a bit intimidating to read at first (although reading them is a skill very much worth developing if you have any stake in making Postgres databases fast!). There are many <code>explain</code> helpers that you can copy-paste the plans into for a more intuitively structured view; some popular offerings include <a href="http://explain.depesz.com">depesz's</a>, <a href="https://www.pgexplain.dev">pgExplain</a>, and <a href="https://explain.dalibo.com">Dalibo</a>.</p>
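<p>Some of these tools also accept structured output. As a small sketch, applied to the example query this post examines, <code>explain</code> can emit JSON, and the <code>buffers</code> option adds counts of blocks hit and read. Whether a given helper accepts JSON is worth checking in its own documentation.</p>
<pre><code class="hljs language-sql">explain (analyze, buffers, format json)
select * from item where value ilike 'a%' and tags &#x26;&#x26; array[1];</code></pre>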
<h2 id="baseline-performance">Baseline Performance</h2>
<p>Okay! Let's look at an example query, first as <code>item_admin</code>. This retrieves <code>item</code>s tagged <code>alpha</code> and having a <code>value</code> beginning with the letter A.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_admin;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                            QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on public.item  (cost=748.75..14704.25 rows=109556 width=68) (actual time=7.041..50.822 rows=6791 loops=1)
   Output: id, value, tags
   Recheck Cond: (item.tags &#x26;&#x26; '{1}'::integer[])
   Filter: (item.value ~~* 'a%'::text)
   Rows Removed by Filter: 101875
   Heap Blocks: exact=12311
   ->  Bitmap Index Scan on item_tags_idx  (cost=0.00..721.36 rows=109567 width=0) (actual time=5.762..5.762 rows=108666 loops=1)
         Index Cond: (item.tags &#x26;&#x26; '{1}'::integer[])
 Query Identifier: 1548793227074419886
 Planning Time: 0.123 ms
 Execution Time: 51.001 ms</code></pre>
<p>So far, so good: we're using the GIN index, the bitmap is small enough to stay lossless (<code>Heap Blocks: exact=12311</code>) so nothing needs to be rechecked, the filter is filtering, 51ms.</p>
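<p>To see what a lossy bitmap looks like for comparison, one option is to shrink <code>work_mem</code> for the session and rerun the query. This is a side experiment, not something the measurements here depend on; 64kB is the smallest value Postgres accepts, and whether the bitmap actually degrades will depend on your data.</p>
<pre><code class="hljs language-sql">set work_mem = '64kB';
explain (analyze, costs off)
select * from item where value ilike 'a%' and tags &#x26;&#x26; array[1];
-- look for a "lossy=" count on the Heap Blocks line and for
-- "Rows Removed by Index Recheck"
reset work_mem;</code></pre>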
<p>Now, as <code>item_reader</code>:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">set</span> item_reader.allowed_tags = <span class="hljs-string">'{alpha,beta}'</span>;

<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                                                                QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on public.item  (cost=0.00..41777312.00 rows=54778 width=68) (actual time=0.476..8100.051 rows=6791 loops=1)
   Output: item.id, item.value, item.tags
   Filter: (EXISTS(SubPlan 1) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
   Rows Removed by Filter: 993209
   SubPlan 1
     ->  Seq Scan on public.tag  (cost=0.00..41.75 rows=1 width=0) (actual time=0.008..0.008 rows=0 loops=1000000)
           Filter: ((current_setting('item_reader.allowed_tags'::text) IS NOT NULL) AND ((current_setting('item_reader.allowed_tags'::text))::text[] @> ARRAY[tag.name]) AND (item.tags @> ARRAY[tag.id]))
           Rows Removed by Filter: 18
 Query Identifier: 1548793227074419886
 Planning Time: 0.135 ms
 Execution Time: 8100.319 ms</code></pre>
<p>That's roughly 160 times slower. What's the difference?</p>
<p>The <code>item_reader</code> plan is <em>very</em> different, in fact. Instead of using the GIN index on <code>item.tags</code>, we're sequentially scanning <code>tag</code> in the subplan.</p>
<p>We're sequentially scanning <code>tag</code> in the subplan a million times, once per <code>item</code> record (<code>loops=1000000</code>).</p>
<p>That's probably the issue.</p>
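<p>The numbers bear this out. <code>explain analyze</code> reports <code>actual time</code> as a per-loop average, so a node's total is approximately its per-loop time multiplied by <code>loops</code>. A back-of-the-envelope check, using the rounded figures from the plan above:</p>
<pre><code class="hljs language-sql">-- 0.008 ms per loop (rounded) times 1,000,000 loops
select 0.008 * 1000000 as approx_subplan_ms; -- about 8000 ms of the 8100 ms total</code></pre>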
<p>How can we fix it?</p>
<h2 id="use-more-efficient-operations">Use More Efficient Operations</h2>
<p>Our policy on <code>item</code> tests array containment, but that's kind of overkill; the <code>any</code> operation is better optimized for what we need here.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;

<span class="hljs-keyword">drop</span> <span class="hljs-keyword">policy</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> item_reader_item_policy <span class="hljs-keyword">on</span> item;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    <span class="hljs-keyword">exists</span> (
        <span class="hljs-keyword">select</span> <span class="hljs-number">1</span> <span class="hljs-keyword">from</span> tag
        <span class="hljs-keyword">where</span> tag.id = <span class="hljs-keyword">any</span>(item.tags)
    )
);

<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                                                QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on public.item  (cost=0.00..15035202.44 rows=53295 width=68) (actual time=0.531..2441.101 rows=6854 loops=1)
   Output: item.id, item.value, item.tags
   Filter: (EXISTS(SubPlan 1) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
   Rows Removed by Filter: 993146
   SubPlan 1
     ->  Bitmap Heap Scan on public.tag  (cost=4.23..15.01 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1000000)
           Recheck Cond: (tag.id = ANY (item.tags))
           Filter: ((current_setting('item_reader.allowed_tags'::text) IS NOT NULL) AND ((current_setting('item_reader.allowed_tags'::text))::text[] @> ARRAY[tag.name]))
           Rows Removed by Filter: 2
           Heap Blocks: exact=1000000
           ->  Bitmap Index Scan on tag_pkey  (cost=0.00..4.23 rows=10 width=0) (actual time=0.001..0.001 rows=3 loops=1000000)
                 Index Cond: (tag.id = ANY (item.tags))
 Query Identifier: -1492990194093681799
 Planning Time: 0.102 ms
 Execution Time: 2441.345 ms</code></pre>
<p>Quite a bit better, although still very slow by comparison with the admin query. We're still performing a million loops, but the <code>any</code> is able to take advantage of <code>tag</code>'s primary key index on each of them instead of sequentially scanning the whole table every time.</p>
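<p>For a single id, the two predicates ask the same question, which is why the swap doesn't change the result. A tiny standalone check, with literal arrays standing in for <code>item.tags</code>:</p>
<pre><code class="hljs language-sql">select
  array[1, 2, 3] @> array[2] as via_containment, -- true
  2 = any(array[1, 2, 3])    as via_any;         -- true</code></pre>
<p>The difference lies in what the planner can do with each form: <code>tag.id = any(...)</code> compares the indexed column directly and so can use <code>tag_pkey</code>, whereas the containment form wraps <code>tag.id</code> in an array first.</p>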
<h2 id="functions-can-be-execution-boundaries">Functions can be Execution Boundaries</h2>
<p>The tags an <code>item_reader</code> is allowed to see won't change in the middle of the query; even if there's an access-control table being updated concurrently, we're in our own transaction here and won't see or care about changes. There's no need to get the list anew for each candidate <code>item</code> record. Let's try caching that using a function.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">or</span> <span class="hljs-keyword">replace</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags()
    <span class="hljs-keyword">returns</span> <span class="hljs-built_in">int</span>[]
    <span class="hljs-keyword">language</span> <span class="hljs-keyword">sql</span>
<span class="hljs-keyword">begin</span> atomic;
    <span class="hljs-keyword">select</span> array_agg(<span class="hljs-keyword">id</span>)
    <span class="hljs-keyword">from</span> tag
    <span class="hljs-keyword">where</span> <span class="hljs-keyword">name</span> = <span class="hljs-keyword">any</span>(current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[]);
<span class="hljs-keyword">end</span>;

<span class="hljs-keyword">drop</span> <span class="hljs-keyword">policy</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> item_reader_item_policy <span class="hljs-keyword">on</span> item;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    item_reader_allowed_tags() &#x26;&#x26; item.tags
);</code></pre>
<p>Et voilà:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                        QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on public.item  (cost=0.00..279812.00 rows=1096 width=68) (actual time=0.911..13425.743 rows=6791 loops=1)
   Output: id, value, tags
   Filter: ((item_reader_allowed_tags() &#x26;&#x26; item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
   Rows Removed by Filter: 993209
 Query Identifier: 1548793227074419886
 Planning Time: 0.102 ms
 Execution Time: 13426.034 ms</code></pre>
<p>Hold up a minute. That's <em>worse</em>. Worse even than the baseline, and not by a little.</p>
<h2 id="functions-need-configuring">Functions Need Configuring</h2>
<p>Functions are extremely powerful. This means that Postgres considers that any function could get up to absolutely anything in the database, and <em>that</em> means that the query engine has to treat it like it's a screwdriver-slip away from criticality. However, we can pinky promise that our function doesn't do anything untoward.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags stable;

<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                             QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on public.item  (cost=1353.87..68540.82 rows=22474 width=68) (actual time=12.503..2880.043 rows=6791 loops=1)
   Output: id, value, tags
   Filter: ((item_reader_allowed_tags() &#x26;&#x26; item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
   Rows Removed by Filter: 200730
   Heap Blocks: exact=12312
   ->  Bitmap Index Scan on item_tags_idx  (cost=0.00..1348.25 rows=205140 width=0) (actual time=11.115..11.115 rows=207521 loops=1)
         Index Cond: (item.tags &#x26;&#x26; item_reader_allowed_tags())
 Query Identifier: 1548793227074419886
 Planning Time: 0.280 ms
 Execution Time: 2880.319 ms</code></pre>
<p>Finally, <code>item_reader</code> gets to use the GIN index! With the <a href="https://www.postgresql.org/docs/current/xfunc-volatility.html">default volatility setting of <code>volatile</code></a>, Postgres assumes both that the function could do anything and that "anything" could change at any time, including between successive invocations in the same statement. If we say instead that it will always return the same output for the same input within a statement, the planner takes us at our word and lets us perform an index scan.</p>
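<p>To confirm which labels a function currently carries, the <code>pg_proc</code> catalog can be queried directly. In this sketch, <code>provolatile</code> is <code>v</code>, <code>s</code>, or <code>i</code> for volatile, stable, or immutable, and <code>proparallel</code> is <code>u</code>, <code>r</code>, or <code>s</code> for unsafe, restricted, or safe.</p>
<pre><code class="hljs language-sql">select proname, provolatile, proparallel
from pg_proc
where proname = 'item_reader_allowed_tags';</code></pre>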
<p>This is much better than our eight-second baseline, but still actually a bit worse than the simple inline <code>any</code> policy. We've definitely got a ways to go before we approach the 50ms times those lucky admins get. What if we parallelize?</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags <span class="hljs-keyword">parallel</span> <span class="hljs-keyword">safe</span>;

<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                                 QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=2353.87..39777.84 rows=22474 width=68) (actual time=15.013..1008.731 rows=6791 loops=1)
   Output: id, value, tags
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on public.item  (cost=1353.87..36530.44 rows=9364 width=68) (actual time=9.112..997.387 rows=2264 loops=3)
         Output: id, value, tags
         Filter: ((item_reader_allowed_tags() &#x26;&#x26; item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
         Rows Removed by Filter: 66910
         Heap Blocks: exact=4128
         Worker 0:  actual time=5.971..994.315 rows=2253 loops=1
         Worker 1:  actual time=6.901..994.412 rows=2256 loops=1
         ->  Bitmap Index Scan on item_tags_idx  (cost=0.00..1348.25 rows=205140 width=0) (actual time=13.099..13.099 rows=207521 loops=1)
               Index Cond: (item.tags &#x26;&#x26; item_reader_allowed_tags())
 Query Identifier: 1548793227074419886
 Planning Time: 0.336 ms
 Execution Time: 1008.934 ms</code></pre>
<p>Well, we parallelize. That sounds about right. The execution time is somewhat better than half that of the <code>stable</code> plan, but I'll chalk that up to a warm cache.</p>
<p>There are two more settings that sound like they could be useful.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags <span class="hljs-keyword">cost</span> <span class="hljs-number">100000</span>;</code></pre>
<p>You can find some recommendations online to set the function's estimated execution cost to a very high value to "encourage" the planner to execute it as few times as possible. In our example here, upcosting shaves off a few milliseconds, but we're in rounding-error territory. I think there are likely some scenarios in which this does make a significant difference, but I'm not sure what those might be; this is not one. And in any case, it's a kludge.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags <span class="hljs-keyword">cost</span> <span class="hljs-number">100</span>;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags leakproof;</code></pre>
<p>Leakproofness asserts that the function cannot reveal information about its arguments through a side channel (e.g. error messages), only through its return value. Potentially-leaky functions are treated as security barriers that have to be passed before other conditions can be evaluated. However, <code>alter function item_reader_allowed_tags leakproof</code> doesn't seem to change the plan noticeably. Stick a pin in that.</p>
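<p>To confirm which of these attributes are actually in effect after all the <code>alter function</code> calls, one option is to read them back from the catalog. This is a minimal sketch, assuming there's only one function named <code>item_reader_allowed_tags</code>; the columns are standard <code>pg_proc</code> attributes.</p>
<pre><code class="hljs language-sql"><span class="hljs-comment">-- provolatile: i = immutable, s = stable, v = volatile</span>
<span class="hljs-comment">-- proparallel: s = safe, r = restricted, u = unsafe</span>
<span class="hljs-keyword">select</span> proname, provolatile, proparallel, proleakproof, procost
<span class="hljs-keyword">from</span> pg_proc
<span class="hljs-keyword">where</span> proname = <span class="hljs-string">'item_reader_allowed_tags'</span>;</code></pre>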
<h2 id="is-this-as-good-as-it-gets">Is This as Good as it Gets?</h2>
<p>It sure seems like we're stuck. There are still some differences from the admin query plan: as admin, we don't care about <code>item_reader_allowed_tags</code>, and our index scan performs the test for tag id 1. As reader, our index scan checks for allowed tags, that apparently being more cost-optimal during planning, but then in a post-scan filter we run <code>item_reader_allowed_tags</code> a second time along with the tag-1 test. Both of those filter predicates <em>should</em> qualify for the index scan; why aren't they being pushed down into that?</p>
<p>This is the point where I started source-diving. My initial assumption was that, because RLS was active, the policy predicates needed to be checked separately from and before those in the <code>where</code> clause. However, <code>item_reader_allowed_tags</code> is a simple <code>select</code> statement at heart, and, especially if it's <code>leakproof</code>, it might be inlineable. If so, then all the predicates ("quals" in internalsese) get dumped into one big bucket for the planner to pick through as it likes, which would make the post-scan filtering rather odd. I just needed to find out whether that was happening, and this information isn't available in the <code>explain</code> plan.</p>
<h2 id="internal-representations">Internal Representations</h2>
<p>The <code>explain</code> plans I've been posting throughout are more-or-less human-readable representations of what the query planner intends to do or has done. They are rather lossy summaries, it turns out. If you're brave, foolish, desperate, or an internals developer, though, you can turn on a couple of client settings that dump the planner's internal representation out.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> debug_print_plan = <span class="hljs-keyword">on</span>;
<span class="hljs-keyword">set</span> client_min_messages = <span class="hljs-keyword">log</span>;</code></pre>
<p>It's a bad idea to use tab-completion after setting these, for the record. Disable by setting them to <code>off</code> and <code>notice</code>, respectively.</p>
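<p>For completeness, switching the dump back off looks like this. <code>reset</code> would also work, assuming you'd rather return to the session defaults than to these specific values.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> debug_print_plan = <span class="hljs-keyword">off</span>;
<span class="hljs-keyword">set</span> client_min_messages = notice;</code></pre>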
<p>I'm not going to copy the internal representation of the simple test query we've been using here because it's 2500 lines long and the first thing I want to point out is an absence anyway.</p>
<blockquote>
<p>After I proved out my performance fix on one role-policy combination, the next set I'm tackling produces an internal representation of more than 2 million lines, well past my terminal's scrollback. <code>psql</code> doesn't pipe debug messages into stdout <em>or</em> stderr, so I wound up having to <code>psql -c "set debug_print_plan=on; set client_min_messages=log; explain analyze ...." 2>&#x26;1 | tee out.plan</code> to capture it.</p>
</blockquote>
<p>To test my assumption, I was looking for the <code>:hasRowSecurity</code> flag. If it appeared, that would mean <code>item_reader_allowed_tags</code> is not inlined, and constitutes a security barrier forcing quals to be evaluated in multiple stages. The <code>where</code> quals would only be evaluated for a candidate record after the quals in the function. That would be an obvious explanation for the post-index-scan filtering step: we wouldn't be happy, but we'd understand it.</p>
<p>I didn't see <code>:hasRowSecurity</code> anywhere.</p>
<p>What I did see was not one, but seven plans produced for my test query. Try it!</p>
<h2 id="how-does-one-query-turn-into-seven-plans">How Does One Query Turn into Seven Plans?</h2>
<p>Or, functions can be execution boundaries, oh no! In order, the plans are:</p>
<ol>
<li>Aggregation of <code>tag.id</code> based on an array-containment (<code>OPEXPR :opno 2751 :opfuncid 2748</code>) between <code>current_setting</code> (<code>FUNCEXPR :funcid 2077</code>) and an array instantiation. This is clearly <code>item_reader_allowed_tags</code>, although there's some extra funny business like a null test.</li>
<li>Is the same plan again.</li>
<li>Gathering a bitmap heap scan on <code>item</code>, with three quals involved: a test for overlap between <code>item_reader_allowed_tags</code> and the <code>tags</code> column; a text <code>ilike</code> comparison; and a second array overlap with a constant. Only the first qual makes it down into the plan tree. This is the GIN index scan.</li>
<li>Guess what, it's plan 1 again!</li>
<li>And again!</li>
<li>And yet again!</li>
<li>And one for the road!</li>
</ol>
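<p>The numeric ids in the dump can be resolved by casting them to the <code>reg*</code> types, should you want to check that reading of plan 1. A sketch, using only the OIDs quoted in the list:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
  <span class="hljs-number">2751</span>::oid::regoperator <span class="hljs-keyword">as</span> opno,
  <span class="hljs-number">2748</span>::oid::regprocedure <span class="hljs-keyword">as</span> opfuncid,
  <span class="hljs-number">2077</span>::oid::regprocedure <span class="hljs-keyword">as</span> funcid;</code></pre>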
<p>That's right: six identical plans for RLS, one for the actual query. The null test is from the policy on <code>tag</code> that ensures the <code>item_reader.allowed_tags</code> setting is non-null, so at least it's kind of inlining. The next question is how we get to six.</p>
<p>In the course of doing this analysis for real, I'd been searching up and reading anything I could. What put me on the right track was a <a href="https://github.com/PostgREST/postgrest-docs/issues/609">discussion in PostgREST's docs issue tracker about performance implications of <code>current_setting</code> in RLS</a>. <code>current_setting</code>, like many other built-in functions, is not <code>leakproof</code>. We may have declared <code>item_reader_allowed_tags</code> <code>leakproof</code> so it could qualify for inlining, but <code>current_setting</code> remains a barrier -- it just doesn't have any quals, so <code>:hasRowSecurity</code> doesn't need to be invoked.</p>
<p>Between the policy on <code>tag</code> and <code>item_reader_allowed_tags</code> we have three invocations of <code>current_setting</code>. <code>item_reader_allowed_tags</code> is <code>parallel safe</code>, and we spin up two workers. That's six. It's easy enough to verify: mark the function <code>parallel unsafe</code>, and we're down to four redundant plans. One inside <code>item_reader_allowed_tags</code>, two in the <code>tag</code> policy; I'm not sure where the fourth is coming from.</p>
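<p>That check might look like the following sketch, run with <code>debug_print_plan</code> still on so the plans can be counted in the log output. It assumes the zero-argument <code>item_reader_allowed_tags</code> as defined so far.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags <span class="hljs-keyword">parallel</span> unsafe;

<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];

<span class="hljs-comment">-- put it back before moving on</span>
<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags <span class="hljs-keyword">parallel</span> <span class="hljs-keyword">safe</span>;</code></pre>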
<p>The PostgREST discussion contributors did work out how best to optimize the policies. If a plan has no dependencies, it can be pulled up into an InitPlan, which is guaranteed to run only once and is then referenced by other plans. All it takes is keeping the <code>current_setting</code> invocations in their own <code>select</code>s:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;

<span class="hljs-keyword">drop</span> <span class="hljs-keyword">policy</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> item_reader_tag_policy <span class="hljs-keyword">on</span> tag;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_tag_policy <span class="hljs-keyword">on</span> tag
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    (<span class="hljs-keyword">select</span> current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[]) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">and</span>
    (<span class="hljs-keyword">select</span> current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[]) @> <span class="hljs-built_in">array</span>[<span class="hljs-keyword">name</span>]
);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">or</span> <span class="hljs-keyword">replace</span> <span class="hljs-keyword">function</span> item_reader_allowed_tags(allowed_tags <span class="hljs-built_in">text</span>[])
    <span class="hljs-keyword">returns</span> <span class="hljs-built_in">int</span>[]
    <span class="hljs-keyword">language</span> <span class="hljs-keyword">sql</span>
    stable
    leakproof
    <span class="hljs-keyword">parallel</span> <span class="hljs-keyword">safe</span>
<span class="hljs-keyword">begin</span> atomic;
    <span class="hljs-keyword">select</span> array_agg(<span class="hljs-keyword">id</span>)
    <span class="hljs-keyword">from</span> tag
    <span class="hljs-keyword">where</span> <span class="hljs-keyword">name</span> = <span class="hljs-keyword">any</span>(allowed_tags);
<span class="hljs-keyword">end</span>;

<span class="hljs-keyword">drop</span> <span class="hljs-keyword">policy</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> item_reader_item_policy <span class="hljs-keyword">on</span> item;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    item_reader_allowed_tags(
        (<span class="hljs-keyword">select</span> current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[])
    ) &#x26;&#x26; item.tags
);</code></pre>
<p>With that, we've almost cut our query time in half again:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                                    QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=1125.07..14581.22 rows=1096 width=68) (actual time=13.479..627.433 rows=6791 loops=1)
   Output: item.id, item.value, item.tags
   Workers Planned: 2
   Workers Launched: 2
   InitPlan 1
     ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
           Output: (current_setting('item_reader.allowed_tags'::text))::text[]
   ->  Parallel Bitmap Heap Scan on public.item  (cost=125.05..13471.60 rows=457 width=68) (actual time=7.711..617.697 rows=2264 loops=3)
         Output: item.id, item.value, item.tags
         Filter: ((item_reader_allowed_tags((InitPlan 1).col1) &#x26;&#x26; item.tags) AND (item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
         Rows Removed by Filter: 66910
         Heap Blocks: exact=4130
         Worker 0:  actual time=4.878..614.809 rows=2285 loops=1
         Worker 1:  actual time=5.181..614.866 rows=2278 loops=1
         ->  Bitmap Index Scan on item_tags_idx  (cost=0.00..124.78 rows=10000 width=0) (actual time=11.706..11.706 rows=207521 loops=1)
               Index Cond: (item.tags &#x26;&#x26; item_reader_allowed_tags((InitPlan 1).col1))
 Query Identifier: 1548793227074419886
 Planning Time: 0.261 ms
 Execution Time: 627.645 ms</code></pre>
<p>The internal representation still shows a surprising number of plans -- 5, with 4 still being redundant <code>item_reader_allowed_tags</code> invocations, this time using a Param node in place of the FuncExpr. And while better than anything else we've gotten by a not-inconsiderable margin, it still takes more than half a second, compared to the admin's 50ms.</p>
<p>This, and especially the presence of multiple <code>item_reader_allowed_tags</code> plans, suggests that we simply aren't InitPlanning hard <em>enough</em>. We know it's not just <code>current_setting</code> that's stable within a single query; <code>item_reader_allowed_tags</code> itself won't change either, but something's preventing the <code>current_setting</code> invocations it triggers in the <code>tag</code> policy from using the InitPlan we generate in the <code>item</code> policy. So let's stuff that whole invocation in an InitPlan! We'll keep the <code>select</code>-wrapped <code>current_setting</code> invocation in its arguments since we might as well eke out every bit of reusability we can, but ensuring we only call <code>item_reader_allowed_tags</code> once will probably have a far bigger effect than that.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> postgres;

<span class="hljs-keyword">drop</span> <span class="hljs-keyword">policy</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> item_reader_item_policy <span class="hljs-keyword">on</span> item;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">policy</span> item_reader_item_policy <span class="hljs-keyword">on</span> item
<span class="hljs-keyword">for</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">to</span> item_reader
<span class="hljs-keyword">using</span> (
    (<span class="hljs-keyword">select</span> item_reader_allowed_tags(
        (<span class="hljs-keyword">select</span> current_setting(<span class="hljs-string">'item_reader.allowed_tags'</span>)::<span class="hljs-built_in">text</span>[])
    )) &#x26;&#x26; item.tags
);

<span class="hljs-keyword">set</span> <span class="hljs-keyword">role</span> item_reader;
<span class="hljs-keyword">explain</span> (<span class="hljs-keyword">analyze</span>, verbose, costs, timing) <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> item <span class="hljs-keyword">where</span> <span class="hljs-keyword">value</span> <span class="hljs-keyword">ilike</span> <span class="hljs-string">'a%'</span> <span class="hljs-keyword">and</span> tags &#x26;&#x26; <span class="hljs-built_in">array</span>[<span class="hljs-number">1</span>];
                                                            QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on public.item  (cost=125.08..12531.39 rows=1084 width=68) (actual time=12.057..93.546 rows=6726 loops=1)
   Output: item.id, item.value, item.tags
   Recheck Cond: ((InitPlan 2).col1 &#x26;&#x26; item.tags)
   Filter: ((item.value ~~* 'a%'::text) AND (item.tags &#x26;&#x26; '{1}'::integer[]))
   Rows Removed by Filter: 200646
   Heap Blocks: exact=12311
   InitPlan 2
     ->  Result  (cost=0.02..0.28 rows=1 width=32) (actual time=0.076..0.076 rows=1 loops=1)
           Output: item_reader_allowed_tags((InitPlan 1).col1)
           InitPlan 1
             ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=1)
                   Output: (current_setting('item_reader.allowed_tags'::text))::text[]
   ->  Bitmap Index Scan on item_tags_idx  (cost=0.00..124.53 rows=10000 width=0) (actual time=10.942..10.942 rows=207372 loops=1)
         Index Cond: (item.tags &#x26;&#x26; (InitPlan 2).col1)
 Query Identifier: 2531873819528607867
 Planning Time: 0.088 ms
 Execution Time: 93.713 ms</code></pre>
<p>It did have a far bigger effect! We're not beating the admin timing, but considering we have to perform double the tests on <code>item.tags</code>, not-quite-double the duration is probably about as good as we can hope for, until Postgres' planner gets smart enough to combine the two tests in the index scan.</p>
<p>The internal plan representation is also finally down to two separate plans: the bitmap heap scan, and a single instance of <code>item_reader_allowed_tags</code>.</p>]]></description>
            <link>https://di.nmfay.com/rls-performance</link>
            <guid isPermaLink="true">https://di.nmfay.com/rls-performance</guid>
            <pubDate>Sun, 13 Jul 2025 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[pdot 1.0.0: Exploring Databases Visually, Part III]]></title>
            <description><![CDATA[<p>In what I can't say <em>isn't</em> a tradition at this point, we're in an odd-numbered year so there's news on the pdot front! <a href="https://gitlab.com/dmfay/pdot/-/releases/v1.0.0">Get it here</a>!</p>
<p>The biggest change (and the reason for the big 1-0-0) is simplifying usage: rather than requiring a shell function to plug the graph body into a template for interactive use, pdot now outputs the entire <code>digraph</code> or <code>flowchart</code> markup. The old behavior is still available with the <code>--body</code> flag, but the new default means it's a lot easier to get started -- <code>pdot postgres_air fks | dot -Tpng | wezterm imgcat</code> and go. You only need scripting to do the pipelining for you, or to customize the graph's appearance.</p>
<p>Other notable updates along the way:</p>
<ul>
<li><code>PGHOST</code>, <code>PGDATABASE</code>, <code>PGUSER</code>, and <code>PGPASSWORD</code> environment variables are honored</li>
<li>new <code>policies</code> graph, and many improvements to others, especially <code>triggers</code> and function <code>refs</code></li>
<li>usable as a Rust library!</li>
</ul>
<p>Late last year I also <a href="https://www.youtube.com/watch?v=9I8AIwWVI_k">presented at PGConf.EU in Athens</a>, should you be interested.</p>]]></description>
            <link>https://di.nmfay.com/pdot-1-0-0</link>
            <guid isPermaLink="true">https://di.nmfay.com/pdot-1-0-0</guid>
            <pubDate>Wed, 21 May 2025 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[ST_MapAlgebra and Tiled Rasters]]></title>
            <description><![CDATA[<p>Let's say you have two rasters, and you want to combine them with some extra value math -- perhaps you want to grade grassland around 50°N 100°E by latitude and elevation, so blue to green to red the more northerly or higher up the point. <a href="https://postgis.net/docs/RT_ST_MapAlgebra_expr.html"><code>ST_MapAlgebra</code></a> to the rescue!</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
  c.rid,
  st_mapalgebra(
    e.rast,
    c.rast,
    <span class="hljs-string">'
      case when [rast2.val] in (13, 14, 16, 17, 18)
        then -[rast1.y] + [rast1.val]
        else 0
      end
    '</span>
  ) <span class="hljs-keyword">as</span> rast
<span class="hljs-keyword">from</span> diva.coverage <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> diva.elevation <span class="hljs-keyword">as</span> e <span class="hljs-keyword">on</span> e.rid = c.rid;</code></pre>
<p><img src="https://di.nmfay.com/images/postgis-tiled-mapalgebra/banded.jpg" alt="terrain with big obvious horizontal bands in which a red-green-blue gradient repeats over and over, south to north"></p>
<p>Oops. (effect exaggerated for visibility)</p>
<blockquote>
<p>If they are tiled it's a bit more complicated...</p>
<p>— <a href="https://gis.stackexchange.com/a/385269">Pierre Racine</a></p>
</blockquote>
<p>Problem: instead of two big aligned rasters, you have a bunch of tiled aligned rasters, so instead of a smooth gradient south to north you have <em>lots of little smooth south-north gradients</em>. The rest of it works, but only within each tile (or here, each band, since we're ignoring x).</p>
<p>How to fix this? The expression argument only has a handful of available values, and all of them are internal to a specific raster value, that is, the tile. There's a function-callback version of <code>ST_MapAlgebra</code> that could add conditional logic based on external factors like tile latitude; however, <em>I</em> don't want to write and maintain a whole function for this really rather straightforward calculation.</p>
<p>But! <code>ST_MapAlgebra</code> executes per raster, and that means the expression argument is passed into the invocation each time. This means we can use <code>format</code> to pass in external variables -- here, converting pixel y-value to latitude within the SRID. The latitude of any given pixel is given by multiplying its y-value within the tile by the height a pixel represents in the SRID (<code>ST_PixelHeight</code>) and subtracting the result from the latitude of the tile's top edge (<code>ST_UpperLeftY</code>). Mileage may vary in the southern hemisphere, but this too is tractable.</p>
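<p>A quick way to sanity-check those two inputs before templating them into the expression: the query below is only a sketch, assuming north-up tiles in the same <code>diva.elevation</code> table, with column aliases made up for illustration. For vertically adjacent tiles, one tile's <code>bottom_edge</code> should line up with the <code>top_edge</code> of the tile below it.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
  rid,
  st_upperlefty(rast) <span class="hljs-keyword">as</span> top_edge,
  st_pixelheight(rast) <span class="hljs-keyword">as</span> pixel_height,
  st_height(rast) <span class="hljs-keyword">as</span> pixel_rows,
  <span class="hljs-comment">-- top edge minus (row count * pixel height): the tile's bottom latitude</span>
  st_upperlefty(rast) - (st_height(rast) * st_pixelheight(rast)) <span class="hljs-keyword">as</span> bottom_edge
<span class="hljs-keyword">from</span> diva.elevation
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> rid
<span class="hljs-keyword">limit</span> <span class="hljs-number">5</span>;</code></pre>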
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
  c.rid,
  st_mapalgebra(
    e.rast,
    c.rast,
    <span class="hljs-keyword">format</span>(
      <span class="hljs-string">'
        case when [rast2.val] in (13, 14, 16, 17, 18)
          then %1$s - ([rast1.y] * %2$s) + [rast1.val]
          else 0
        end
      '</span>,
      st_upperlefty(c.rast),
      st_pixelheight(c.rast)
    )
  ) <span class="hljs-keyword">as</span> rast
<span class="hljs-keyword">from</span> diva.coverage <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> diva.elevation <span class="hljs-keyword">as</span> e <span class="hljs-keyword">on</span> e.rid = c.rid;</code></pre>
<p><img src="https://di.nmfay.com/images/postgis-tiled-mapalgebra/fixed.jpg" alt="the same terrain, with a single smooth south-north gradient"></p>]]></description>
            <link>https://di.nmfay.com/postgis-tiled-mapalgebra</link>
            <guid isPermaLink="true">https://di.nmfay.com/postgis-tiled-mapalgebra</guid>
            <pubDate>Thu, 27 Jun 2024 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Terminal Tools for PostGIS]]></title>
            <description><![CDATA[<p>Of late, I've been falling down a bunch of geospatial rabbit holes. One thing has remained true in each of them: it's really hard to debug what you can't see.</p>
<p>There are ways to visualize geospatial data. Some more-integrated SQL development environments like pgAdmin recognize and plot columns of geometry type. There's also the option of standing up a webserver to render out raster and/or vector tiles with something like Leaflet. Unfortunately, I don't love either solution. I like psql, vim, and the shell, and I don't want to do some query testing here and copy others into and out of pgAdmin over and over; I'm actually using Leaflet and vector tiles already, but restarting the whole server just to <em>start</em> debugging a modified query is a bit much in feedback loop time.</p>
<p>So: new tools. You need zsh, psql, and per usual, ideally a terminal emulator that can render images. I use wezterm but the only thing you'd need to change is the sole <code>wezterm imgcat</code> call in each. Both can also pipe out to files.</p>
<h2 id="pgisd"><a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/pgisd.zsh">pgisd</a></h2>
<p>The first one, and <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/pgisd.zsh">the tool I used</a> to create the images in the <a href="./random-geography-fluviation">fluviation</a> post. <code>pgisd</code> runs the given SQL script and renders geometry or geography columns in the output. (It actually has to run the query twice, in order to detect and build rendering code for each geom column.)</p>
<p>I have some small polygons dumped from rasters, filtered, intersected, sliced, diced, et cetera. My script looks like this:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
  geom,
  st_asewkt(st_centroid(geom)) <span class="hljs-keyword">as</span> ewkt_centroid,
  <span class="hljs-keyword">format</span>(
    <span class="hljs-string">'%1$s %2$s, radius %3$s'</span>,
    <span class="hljs-keyword">round</span>(st_x((st_maximuminscribedcircle(geom)).center)::<span class="hljs-built_in">numeric</span>, <span class="hljs-number">2</span>),
    <span class="hljs-keyword">round</span>(st_y((st_maximuminscribedcircle(geom)).center)::<span class="hljs-built_in">numeric</span>, <span class="hljs-number">2</span>),
    <span class="hljs-keyword">round</span>((st_maximuminscribedcircle(geom)).radius::<span class="hljs-built_in">numeric</span>, <span class="hljs-number">2</span>)
  ) <span class="hljs-keyword">as</span> text_largest_circle
<span class="hljs-keyword">from</span> lots_of_ctes</code></pre>
<p>Without specifying a bounding box, you can <em>barely</em> pick out a couple of dots near where Mongolia would be on a WGS84 projection, given that the whole thing has been squeezed into some 800ish pixels wide:</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgisd-1.png" alt="a blank &#x22;world map&#x22; rendered in shell, equator and prime meridian but no image, except for two tiny dots in the upper-right quadrant "></p>
<p>Enhance:</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgisd-2.png" alt="a collection of blobs around a crosshair rendered from coordinates 100, 47 - 106, 52"></p>
<p>Tweak the <code>where</code> clause to skip that one outlier and focus on the rest (the crosshair gets a bit flaky at around a single degree of width/height):</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgisd-3.png" alt="more blobs, bigger now"></p>
<p>pgisd can also render multiple geom-prefixed (and ewkt-, and text-) columns in sequence. When piped to a file, only the first geometry is rendered and saved.</p>
<h2 id="pgrast"><a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/pgrast.zsh">pgrast</a></h2>
<p>And <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/pgrast.zsh">then I started needing rasters</a> for things like elevation and land cover (with profuse thanks to the International Potato Center's <a href="https://www.diva-gis.org">DivaGIS</a> project for compiling a ton of these for free!). This one's a bit simpler -- a raster is a raster, you locate the column and define a bounding box for the area you're interested in. Here's the location we were just looking at geometry intersections over:</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgrast-1.png" alt="an elevation map rendered to shell in pseudocolor"></p>
<p>And looking a little further east, here's the northeastern part of the Mongolian plateau in full; that's Lake Baikal at center-left.</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgrast-2.png" alt="a larger elevation map rendered to shell in pseudocolor"></p>
<p>But what if we want to simplify it? This came up a lot with the land cover, where each pixel value is one of 22 options (1 is broadleaf evergreen forest, 13 is grassland, 22 is urban) and I only wanted to see a few at a time, but pgrast's <code>reclass</code> option also works to flatten the pseudocolor output. Here's the same raster, where elevation &#x3C; 1000m is blue, 1000-2000m is green, and anything above 2000m is red:</p>
<p><img src="https://di.nmfay.com/images/postgis-terminal-tools/pgrast-3.png" alt="the previous elevation map, with finer gradations condensed into one of three colors"></p>]]></description>
            <link>https://di.nmfay.com/postgis-terminal-tools</link>
            <guid isPermaLink="true">https://di.nmfay.com/postgis-terminal-tools</guid>
            <pubDate>Sun, 02 Jun 2024 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[(Plausible) Random Geography Generation with PostGIS: Fluviation]]></title>
            <description><![CDATA[<p>Welcome to Squaria.</p>
<p><img src="https://di.nmfay.com/images/fluviation/final.png" alt="wireframe diagram of a square continent with jagged borders and several river systems"></p>
<p>Squaria is a continent of highly unstable geography defined by a single SQL query (with, as we'll see, many, many CTEs). Its only consistent properties at the moment are its boxy shape and the two unnervingly straight mountain ranges that cross its breadth and meet on its lower eastern edge. Those mountains are impossible, but today's topic is <em>fluviation</em>, that is, rivers and riverine lakes; we'll see about plausible plate tectonics some other time, maybe.</p>
<p>The ever-shifting border of Squaria is defined by a Voronoi diagram within a 100-unit envelope, similar to <a href="https://www.crunchydata.com/blog/random-geometry-generation-with-postgis#random-polygons-with-voronoi-polygons">Paul Ramsey's random polygon generation</a>. Other shapes are of course easily achievable, and I'm probably going to steal his circular envelope outright in the future, but squares are easy to demo.</p>
<pre><code class="hljs language-sql">with recursive envelope as (
  <span class="hljs-keyword">select</span> st_makeenvelope(<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">100</span>, <span class="hljs-number">100</span>) <span class="hljs-keyword">as</span> geom
), voronoi_unclipped <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span> (st_dump(st_voronoipolygons(
    g1 => st_generatepoints(
      envelope.geom,
      <span class="hljs-number">500</span> <span class="hljs-comment">-- increase this for a finer polygon mesh</span>
    ),
    tolerance => <span class="hljs-number">0.0</span>,
    extend_to => envelope.geom
  ))).geom <span class="hljs-keyword">as</span> poly
  <span class="hljs-keyword">from</span> envelope
), voronoi <span class="hljs-keyword">as</span> (
  <span class="hljs-comment">-- clip the Voronoi diagram to only those polys fully inside the envelope</span>
  <span class="hljs-keyword">select</span> voronoi_unclipped.poly
  <span class="hljs-keyword">from</span> envelope
  <span class="hljs-keyword">join</span> voronoi_unclipped <span class="hljs-keyword">on</span> st_contains(envelope.geom, voronoi_unclipped.poly)
), border <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span> st_boundary(st_concavehull(st_union(poly), <span class="hljs-number">0</span>)) <span class="hljs-keyword">as</span> linestr
  <span class="hljs-keyword">from</span> voronoi
)
<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> border;</code></pre>
<p><img src="https://di.nmfay.com/images/fluviation/border.png" alt="border defined by clipping a voronoi diagram to a square"></p>
<p>But it's not just the border that we're concerned about here. If you want to simulate fluviation, you have to go at least a little distance toward simulating fluid <em>mechanics</em>. Water famously flows downhill; a downhill implies an uphill implies height. Let's add those mountain ranges and generate a heightmap while trying not to think too hard about the fact that PostGIS supports a third dimension, making full-scale volumetric simulation theoretically achievable.</p>
<p>All these CTEs build on each other, so if you're following along for fun, you'll need to combine the statements (minus the final selects in each, which just output the current step). I'll post the whole thing at the end too.</p>
<pre><code class="hljs language-sql">with mountain_range as (
  with nonrandom_line as (
    <span class="hljs-keyword">select</span> st_makeline(st_point(<span class="hljs-number">0</span>, v.y1), st_point(<span class="hljs-number">100</span>, v.y2)) <span class="hljs-keyword">as</span> linestr
    <span class="hljs-keyword">from</span> (<span class="hljs-keyword">values</span> (<span class="hljs-number">70</span>, <span class="hljs-number">30</span>), (<span class="hljs-number">20</span>, <span class="hljs-number">35</span>)) <span class="hljs-keyword">as</span> v (y1, y2)
  )
  <span class="hljs-keyword">select</span>
    st_collect(voronoi.poly) <span class="hljs-keyword">as</span> geom,
    st_collect(nonrandom_line.linestr) <span class="hljs-keyword">as</span> linestr
  <span class="hljs-keyword">from</span> voronoi
  <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> nonrandom_line
  <span class="hljs-keyword">where</span> st_intersects(voronoi.poly, nonrandom_line.linestr)
), heightmap <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span>
    voronoi.poly,
    <span class="hljs-comment">-- height is a function of distance from the mountains, also factoring in</span>
    <span class="hljs-comment">-- x-position (Squaria's east is lower than its west) and a little random</span>
    <span class="hljs-comment">-- variance to make things interesting</span>
    <span class="hljs-number">100</span>
      - (<span class="hljs-keyword">min</span>(st_distance(voronoi.poly, mountain_range.geom)) * <span class="hljs-number">1.5</span>)
      - (st_x(st_centroid(voronoi.poly)) * <span class="hljs-number">1.5</span> / <span class="hljs-number">10</span>)
      + (random() * <span class="hljs-number">6</span> - <span class="hljs-number">3</span>)
      <span class="hljs-keyword">as</span> height
  <span class="hljs-keyword">from</span> voronoi
  <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> mountain_range
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> voronoi.poly
)
<span class="hljs-keyword">select</span>
  st_collect(
    st_translate(
      st_scale(st_letters(<span class="hljs-keyword">round</span>(h.height)::<span class="hljs-built_in">text</span>), <span class="hljs-number">.03</span>, <span class="hljs-number">.03</span>),
      st_x(st_centroid(h.poly)),
      st_y(st_centroid(h.poly))
    )
  )
<span class="hljs-keyword">from</span> heightmap <span class="hljs-keyword">as</span> h;</code></pre>
<p><img src="https://di.nmfay.com/images/fluviation/heightmap.png" alt="semi-randomly assigned cell heights"></p>
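<p>To get a feel for the numbers, here's the deterministic part of that height formula worked through for a single hypothetical cell. The distance of 10 and the centroid x of 40 are made-up values for illustration, not taken from any actual Squaria:</p>
<pre><code class="hljs language-sql">-- hypothetical cell: 10 units from the nearest mountain poly, centroid at x = 40
select
  100
    - (10 * 1.5)       -- distance penalty: 15
    - (40 * 1.5 / 10)  -- eastward slope: 6
    as height_before_noise; -- 79, before the random() term moves it by up to 3 either way</code></pre>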
<p>We've separated the high from the low! Now, just add water:</p>
<pre><code class="hljs language-sql">with headwater as (
  <span class="hljs-keyword">select</span> poly, height
  <span class="hljs-keyword">from</span> heightmap
  <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-keyword">where</span> height &#x3C; <span class="hljs-number">90</span>
    <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> st_touches(poly, border.linestr)
  <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> random()
  <span class="hljs-keyword">limit</span> <span class="hljs-number">30</span>
)
<span class="hljs-keyword">select</span> st_asewkt(st_centroid(poly)), height
<span class="hljs-keyword">from</span> headwater;</code></pre>
<p>Alternatively, we could favor a more uniform distribution (central Squaria looks a bit desolate, and there's a river in the south flowing between four springs in a row); this spaces headwaters out much more effectively, but placement is the easy part and the current incarnation of Squaria illustrates an important point later on.</p>
<pre><code class="hljs language-sql">with headwater as (
  <span class="hljs-keyword">select</span> poly, height
  <span class="hljs-keyword">from</span> heightmap
  <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-keyword">join</span> (
    <span class="hljs-comment">-- draw a grid of horizontal and vertical lines 10 units apart</span>
    <span class="hljs-keyword">with</span> x <span class="hljs-keyword">as</span> (
      <span class="hljs-keyword">select</span> st_makeline(st_point(<span class="hljs-number">0</span>, generate_series), st_point(<span class="hljs-number">100</span>, generate_series)) <span class="hljs-keyword">as</span> geom
      <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">10</span>, <span class="hljs-number">90</span>, <span class="hljs-number">10</span>)
    ), y <span class="hljs-keyword">as</span> (
      <span class="hljs-keyword">select</span> st_makeline(st_point(generate_series, <span class="hljs-number">0</span>), st_point(generate_series, <span class="hljs-number">100</span>)) <span class="hljs-keyword">as</span> geom
      <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">10</span>, <span class="hljs-number">90</span>, <span class="hljs-number">10</span>)
    )
    <span class="hljs-comment">-- collect the points at which the horizontal and vertical lines cross</span>
    <span class="hljs-keyword">select</span> st_collect(st_intersection(x.geom, y.geom)) <span class="hljs-keyword">as</span> geom
    <span class="hljs-keyword">from</span> x
    <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> y
  ) <span class="hljs-keyword">as</span> grid <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-keyword">where</span> height &#x3C; <span class="hljs-number">90</span>
    <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> st_touches(poly, border.linestr)
    <span class="hljs-keyword">and</span> st_intersects(poly, grid.geom) <span class="hljs-comment">-- pick polys at those intersection points</span>
  <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> random()
  <span class="hljs-keyword">limit</span> <span class="hljs-number">30</span>
)
<span class="hljs-keyword">select</span> st_asewkt(st_centroid(poly)), height
<span class="hljs-keyword">from</span> headwater;</code></pre>
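<p>As a quick sanity check on the grid, using nothing beyond plain <code>generate_series</code>: nine horizontal lines crossed with nine vertical lines make 81 candidate points, so the <code>limit 30</code> should still have plenty to choose from after the height and border filters.</p>
<pre><code class="hljs language-sql">-- count the grid intersections without building any geometry
select count(*) as grid_points
from generate_series(10, 90, 10) as x
cross join generate_series(10, 90, 10) as y; -- 9 * 9 = 81</code></pre>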
<p>And let it flow:</p>
<pre><code class="hljs language-sql">with river_poly as (
  <span class="hljs-keyword">select</span>
    row_number() <span class="hljs-keyword">over</span> () <span class="hljs-keyword">as</span> <span class="hljs-keyword">id</span>,
    <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> iter,
    <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">length</span>,
    headwater.poly,
    headwater.height,
    <span class="hljs-built_in">array</span>[headwater.poly]::geometry[] <span class="hljs-keyword">as</span> polys,
    <span class="hljs-built_in">array</span>[st_centroid(headwater.poly)]::geometry[] <span class="hljs-keyword">as</span> centroids,
    <span class="hljs-number">0</span> <span class="hljs-keyword">as</span> lake_poly_depth
  <span class="hljs-keyword">from</span> headwater
  <span class="hljs-keyword">union</span>
  <span class="hljs-keyword">select</span>
    <span class="hljs-comment">-- when neighbor.poly is null, we could not find a lower polygon to move into: sit here and lake up</span>
    previous.id,
    previous.iter + <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> iter,
    <span class="hljs-keyword">case</span> <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.length <span class="hljs-keyword">else</span> previous.length + <span class="hljs-number">1</span> <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">length</span>,
    <span class="hljs-keyword">coalesce</span>(neighbor.poly, previous.poly) <span class="hljs-keyword">as</span> poly,
    <span class="hljs-keyword">coalesce</span>(neighbor.height, previous.height + <span class="hljs-number">2</span>) <span class="hljs-keyword">as</span> height, <span class="hljs-comment">-- fill in lakebed</span>
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.polys
      <span class="hljs-keyword">else</span> array_cat(previous.polys, <span class="hljs-built_in">array</span>[neighbor.poly])
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> polys,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.centroids
      <span class="hljs-keyword">else</span> array_cat(previous.centroids, <span class="hljs-built_in">array</span>[st_centroid(neighbor.poly)])
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> centroids,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.lake_poly_depth + <span class="hljs-number">1</span>
      <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> lake_poly_depth
  <span class="hljs-keyword">from</span> river_poly <span class="hljs-keyword">as</span> previous
  <span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> lateral (
    <span class="hljs-keyword">select</span> *
    <span class="hljs-keyword">from</span> heightmap
      <span class="hljs-keyword">where</span> st_touches(heightmap.poly, previous.poly)
        <span class="hljs-keyword">and</span> heightmap.height &#x3C; previous.height
        <span class="hljs-comment">-- can't return to a poly with the same bounding box as a previously visited one</span>
        <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span>(heightmap.poly ~= <span class="hljs-keyword">any</span>(previous.polys))
      <span class="hljs-comment">-- pick the closest centroid</span>
      <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> st_centroid(heightmap.poly) &#x3C;-> st_centroid(previous.poly)
      <span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>
  ) <span class="hljs-keyword">as</span> neighbor <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-comment">-- border is a single-row relation so we can do this weird shortcut antijoin</span>
  <span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-keyword">not</span> st_touches(previous.poly, border.linestr)
  <span class="hljs-keyword">where</span> previous.lake_poly_depth &#x3C; <span class="hljs-number">5</span>
)
<span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, st_asewkt(st_makeline(centroids))
<span class="hljs-keyword">from</span> river_poly
<span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> (
  <span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, <span class="hljs-keyword">max</span>(iter) <span class="hljs-keyword">as</span> iter
  <span class="hljs-keyword">from</span> river_poly
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">id</span>
) <span class="hljs-keyword">as</span> maxiter
  <span class="hljs-keyword">on</span> maxiter.id = river_poly.id
  <span class="hljs-keyword">and</span> maxiter.iter = river_poly.iter;</code></pre>
<p>Alright, that one's a lot to deal with all at once. This recursive CTE is the very beating heart of our river generator. Like all recursive CTEs, it has a <strong>base</strong> term -- projecting a bunch of fields and initial values from <code>headwater</code> -- and a <strong>recursive</strong> term which works... not quite how you might expect recursion to operate.</p>
<p>In Postgres, "recursion" is actually iteration. The base term is evaluated first, and its results placed in a "working table". Then, the recursive term is evaluated, with the self-reference <code>river_poly</code> indicating the working table. The results of the recursive evaluation become the new working table; if at this point there's anything in that working table, the recursive term is evaluated again.</p>
<p>The output of the CTE includes everything that has ever been in the working table. This is why the demo select has a self-join: to include only the final results for each river, rather than every step of each one's progress from <code>headwater</code> to cell to cell.</p>
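<p>A stripped-down sketch of the same mechanics, with a made-up <code>countdown</code> CTE standing in for <code>river_poly</code> (it uses <code>union all</code> where <code>river_poly</code> uses <code>union</code>, which additionally discards duplicate rows):</p>
<pre><code class="hljs language-sql">with recursive countdown as (
  select 1 as id, 1 as iter, 3 as n -- base term: seeds the working table
  union all
  select id, iter + 1, n - 1        -- recursive term: sees only the previous working table
  from countdown
  where n > 1                       -- once this filters everything out, iteration stops
)
select * from countdown order by iter;
-- three rows come back (n = 3, 2, 1), one per pass through the working table</code></pre>
<p>Every intermediate row survives into the output, which is why picking out the row with the highest <code>iter</code> per <code>id</code> is necessary afterwards.</p>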
<p>How does that progress happen, though?</p>
<p>The first thing you might notice is the projection logic that depends on whether <code>neighbor.poly</code> is null. Let's talk about <code>neighbor</code> itself first, though: it comes from the lateral join further down the query. In a lateral join, the subquery is evaluated once for each record on the left-hand side, here each row of the working table (<code>previous</code>). So for each headwater in the first recursive execution, or for the last chunk of each river added in each successive iteration, we inspect neighboring polygons in the heightmap. We're looking for a lower polygon (not the <em>lowest</em>, to avoid racing too quickly to local minima) which we haven't seen before for this river. Picking the closest lower centroid keeps things reasonably random and avoids some occasional funny-looking leaps across the landscape.</p>
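<p>Here's the "closest lower, not lowest" selection in miniature, as a sketch without any PostGIS: the <code>cell</code> rows are hypothetical, a single <code>x</code> coordinate stands in for centroid distance, and there's no adjacency or already-visited check.</p>
<pre><code class="hljs language-sql">with cell (id, x, height) as (
  values (1, 0.0, 50.0), (2, 1.0, 48.0), (3, 3.0, 40.0), (4, 2.5, 55.0)
)
select previous.id, neighbor.id as next_id
from cell as previous
left outer join lateral (
  -- evaluated once per row of previous
  select *
  from cell
  where cell.height &#x3C; previous.height
  order by abs(cell.x - previous.x) -- closest lower cell, not the lowest
  limit 1
) as neighbor on true
order by previous.id;
-- 1 -> 2 (the nearer cell at 48 beats the lower but farther cell at 40)
-- 2 -> 3
-- 3 -> null (nothing lower: this is where a lake would start)
-- 4 -> 3</code></pre>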
<p>But wait. If we always move into lower neighboring cells, why do we need an additional check against retracing our steps?</p>
<p>The answer is local minima again. Our best efforts notwithstanding, it's easy for a river to flow into a cell surrounded by higher neighbors on all sides. <a href="https://en.wikipedia.org/wiki/Endorheic_basin">Endorheic basins</a>, which have no outlets, do exist, but they're not <em>that</em> common (and lakes in them tend to be saline). So if a river enters such a basin, we want to give it a chance or several to exit again. We do this by simulating accumulation in a lake which raises the effective height of the current cell. On the next iteration, neighbors that were previously higher could be lower -- including the one from which the river entered the lake cell, which is also by definition the closest lower centroid.</p>
<p>The <code>neighbor.poly</code> null checks in the select clause drive lake formation. When there's no valid neighbor, the river-in-progress stays put, raising its effective height by 2 and incrementing <code>lake_poly_depth</code>; otherwise, it proceeds into the neighboring cell, accumulating the neighbor's polygon and precalculating its centroid for later. A river gets five consecutive chances, one per iteration, to leave a cell before it's excluded from the working table and terminates in an endorheic lake. The first chance comes before any filling, so at +2 per increment the last attempt is made at +8: lakes can overcome a difference of up to 8 elevation points.</p>
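<p>The lake arithmetic, sketched for a hypothetical cell that the river enters at height 50:</p>
<pre><code class="hljs language-sql">select
  depth as lake_poly_depth,
  50 + 2 * depth as effective_height -- a neighbor must be strictly lower than this to let the river out
from generate_series(0, 4) as depth
order by depth;
-- effective_height runs 50, 52, 54, 56, 58;
-- a row at depth 5 fails the lake_poly_depth &#x3C; 5 check and leaves the working table</code></pre>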
<p>The last thing <code>river_poly</code> does is detect whether the river has reached the edge of the continent. The "weird shortcut antijoin" keeps a river in the working table only as long as the last cell it moved into wasn't on the border.</p>
<p><img src="https://di.nmfay.com/images/fluviation/flow.gif" alt="fluviation progress, iteration by iteration"></p>
<p>The "finished" flow hasn't quite reached the border because the rivers are drawn centroid to centroid, but making that connection is an <code>st_closestpoint</code> away.</p>
<p>That worked nicely! There's got to be a catch.</p>
<p>There are a few catches.</p>
<p>First, a philosophical question: after a river joins another, how many rivers do you have? You might be able to make a case for two at the <a href="https://en.wikipedia.org/wiki/Meeting_of_Waters">confluence of the Rio Negro and Solimões</a>, but that's an exception. If we mean this map to be useful, we can't be having two rivers occupy the same space all the way to the sea. One has to end and the other has to keep going.</p>
<p>Second, the output of the recursion includes as many records per river as the river has cells, because the working table is pushed into the result set for each cycle. We need to remove all non-final records.</p>
<p>Third: using lake formation to increase effective height and allow rivers to proceed enables a very specific paradox. A river <code>alpha</code> can flow into <code>beta</code>, but <code>beta</code> could, there or further downstream, enter and exit a lake, thereby increasing its effective height, and <em>flow back into <code>alpha</code></em>. Simplified:</p>
<pre><code class="hljs">alpha flows east <span class="hljs-number">51</span>>--<span class="hljs-number">-50</span>>--------<span class="hljs-number">-49</span>>------<span class="hljs-number">-48</span>
                        \                      \
                         <span class="hljs-number">51</span>-----&#x3C;[<span class="hljs-number">46</span>+<span class="hljs-number">6</span>=<span class="hljs-number">52</span>]-----&#x3C;<span class="hljs-number">47</span>-----&#x3C;<span class="hljs-number">50</span> beta flows west</code></pre>
<p>You can see this happening in the lower right of the gif, where two rivers start near each other just below the lower mountain range and meet in a lake -- actually, the triangular lake is filled first by the westward-flowing river, but is naturally downhill from the eastward-flowing river. On the next cycle, neither can exit, so the lake fills to depth 4 for westward and eastward forms its own depth-2 lake. The cycle after that, eastward's closest neighbor is westward's headwater; westward manages to flow out as well, but is trapped at the following cycle and has to fill an adjacent lake cell before continuing southwest.</p>
<p>We already have the maximum <code>iter</code> for each river id, and checking for cross-confluence is a matter of looking for common centroids in differing orders:</p>
<pre><code class="hljs language-sql">with river_pruned as (
  <span class="hljs-comment">-- remove intermediary rows generated as river_poly recurses</span>
  <span class="hljs-keyword">select</span> river_poly.*
  <span class="hljs-keyword">from</span> river_poly
  <span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> (
    <span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, <span class="hljs-keyword">max</span>(iter) <span class="hljs-keyword">as</span> iter
    <span class="hljs-keyword">from</span> river_poly
    <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">id</span>
  ) <span class="hljs-keyword">as</span> maxiter
    <span class="hljs-keyword">on</span> maxiter.id = river_poly.id
    <span class="hljs-keyword">and</span> maxiter.iter = river_poly.iter
  <span class="hljs-keyword">where</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> (
    <span class="hljs-comment">-- if two rivers cross each other in opposite directions, pick the one with the lower</span>
    <span class="hljs-comment">-- id and eliminate the other</span>
    <span class="hljs-keyword">select</span> <span class="hljs-number">1</span>
    <span class="hljs-keyword">from</span> river_poly <span class="hljs-keyword">as</span> rp2
    <span class="hljs-keyword">where</span> rp2.id &#x3C; river_poly.id
      <span class="hljs-keyword">and</span> <span class="hljs-built_in">array</span>(<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> unnest(rp2.centroids) <span class="hljs-keyword">where</span> unnest = <span class="hljs-keyword">any</span>(river_poly.centroids)) &#x3C;>
        <span class="hljs-built_in">array</span>(<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> unnest(river_poly.centroids) <span class="hljs-keyword">where</span> unnest = <span class="hljs-keyword">any</span>(rp2.centroids))
  )
)
<span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, st_asewkt(st_collect(centroids)) <span class="hljs-keyword">from</span> river_pruned;</code></pre>
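<p>The array comparison inside the <code>not exists</code> is easier to see with integers standing in for centroids. In this sketch the two made-up "rivers" <code>a</code> and <code>b</code> share cells 2 and 3 but visit them in opposite orders, so the filtered arrays differ and the higher-id river would be eliminated. It leans on <code>unnest</code> returning elements in array order, as the query above does.</p>
<pre><code class="hljs language-sql">select
  array(select * from unnest(a) where unnest = any(b)) as shared_in_a_order, -- {2,3}
  array(select * from unnest(b) where unnest = any(a)) as shared_in_b_order  -- {3,2}
from (values (array[1, 2, 3, 4], array[9, 3, 2, 8])) as v (a, b);</code></pre>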
<p>Pruning takes care of 2 and 3, but if you run this you'll see the same centroids appear several times in each river. We still need to ensure that a tributary is just a tributary and not the rest of the other river downstream from its confluence.</p>
<pre><code class="hljs language-sql">with cutoff as (
  <span class="hljs-comment">-- find the first point at which a river "loses" a confluence and becomes</span>
  <span class="hljs-comment">-- subsumed in another river's flux</span>
  <span class="hljs-keyword">select</span>
    p.id,
    <span class="hljs-keyword">min</span>(array_position(p.centroids, confluence.centroid)) <span class="hljs-keyword">as</span> <span class="hljs-keyword">position</span>
  <span class="hljs-keyword">from</span> river_pruned <span class="hljs-keyword">as</span> p
  <span class="hljs-keyword">join</span> (
    <span class="hljs-comment">-- centroids of all cells entered by more than one river; furthest upstream</span>
    <span class="hljs-comment">-- wins, with lower ids breaking ties</span>
    <span class="hljs-keyword">select</span>
      unnest <span class="hljs-keyword">as</span> centroid,
      (array_agg(<span class="hljs-keyword">id</span>::<span class="hljs-built_in">text</span> <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">ordinality</span>, <span class="hljs-keyword">id</span>))[<span class="hljs-number">1</span>] <span class="hljs-keyword">as</span> winner
    <span class="hljs-keyword">from</span> river_pruned
    <span class="hljs-keyword">join</span> lateral unnest(centroids) <span class="hljs-keyword">with</span> <span class="hljs-keyword">ordinality</span> <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
    <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> unnest
    <span class="hljs-keyword">having</span> <span class="hljs-keyword">count</span>(*) > <span class="hljs-number">1</span>
  ) <span class="hljs-keyword">as</span> confluence <span class="hljs-keyword">on</span> confluence.winner &#x3C;> p.id::<span class="hljs-built_in">text</span>
    <span class="hljs-keyword">and</span> array_position(p.centroids, confluence.centroid) > <span class="hljs-number">0</span>
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> p.id
), river_line <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span>
    river_pruned.id,
    river_pruned.length,
    river_pruned.poly,
    river_pruned.height,
    cutoff.position <span class="hljs-keyword">as</span> cutoff,
    river_pruned.polys[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)] <span class="hljs-keyword">as</span> polys,
    st_makeline(
      <span class="hljs-keyword">case</span>
        <span class="hljs-comment">-- only rivers which are not cut off and which come adjacent to the border</span>
        <span class="hljs-comment">-- require the additional segment connecting to the shore!</span>
        <span class="hljs-keyword">when</span> cutoff.position <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
          <span class="hljs-keyword">and</span> st_touches(
              river_pruned.polys[<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)],
              border.linestr
          )
        <span class="hljs-keyword">then</span> array_cat(
          river_pruned.centroids[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)],
          <span class="hljs-built_in">array</span>[st_closestpoint(
            border.linestr,
            st_centroid(river_pruned.polys[<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)])
          )]
        )
        <span class="hljs-keyword">else</span> river_pruned.centroids[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)]
      <span class="hljs-keyword">end</span>
    ) <span class="hljs-keyword">as</span> geom
  <span class="hljs-keyword">from</span> river_pruned
  <span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> cutoff <span class="hljs-keyword">on</span> cutoff.id = river_pruned.id
  <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
)
<span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, <span class="hljs-keyword">length</span>, cutoff, st_asewkt(geom)
<span class="hljs-keyword">from</span> river_line;</code></pre>
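<p>The cutoff itself is just <code>array_position</code> feeding an array slice. A minimal sketch, again with integers as hypothetical stand-ins for centroids and 30 playing the confluence this river lost:</p>
<pre><code class="hljs language-sql">select
  array_position(centroids, 30) as cutoff,                     -- 3
  centroids[1:array_position(centroids, 30)] as kept_centroids -- {10,20,30}
from (values (array[10, 20, 30, 40, 50])) as v (centroids);</code></pre>
<p>The slice is inclusive, so the tributary keeps the confluence cell and nothing downstream of it.</p>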
<p>An aside: it took me a while to come up with the nearest-lower-neighbor approach, for no good reason. Before I got there, I still wanted to avoid a race to local minima, but did it with a truly random choice of lower neighbor for each river. Rivers crossed each other willy-nilly towards the sea, which I thought I'd address in a postprocessing step. This got absolutely cursed, involving <em>another</em> recursive CTE running window functions over unnested centroid arrays to eliminate confluence losers from all future contests. It's much, much better this way.</p>
<p>Anyway, at this point we're basically done! All that's left is to accumulate all the geometries for storage or display. Here's the final script:</p>
<details>
<summary>fluviate.sql</summary>
<pre><code class="hljs language-sql">with recursive envelope as (
  <span class="hljs-keyword">select</span> st_makeenvelope(<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">100</span>, <span class="hljs-number">100</span>) <span class="hljs-keyword">as</span> geom
), voronoi_unclipped <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span> (st_dump(st_voronoipolygons(
    g1 => st_generatepoints(
      envelope.geom,
      <span class="hljs-number">500</span> <span class="hljs-comment">-- increase this for a finer polygon mesh</span>
    ),
    tolerance => <span class="hljs-number">0.0</span>,
    extend_to => envelope.geom
  ))).geom <span class="hljs-keyword">as</span> poly
  <span class="hljs-keyword">from</span> envelope
), voronoi <span class="hljs-keyword">as</span> (
  <span class="hljs-comment">-- clip the Voronoi diagram to only those polys fully inside the envelope</span>
  <span class="hljs-keyword">select</span> voronoi_unclipped.poly
  <span class="hljs-keyword">from</span> envelope
  <span class="hljs-keyword">join</span> voronoi_unclipped <span class="hljs-keyword">on</span> st_contains(envelope.geom, voronoi_unclipped.poly)
), border <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span> st_boundary(st_concavehull(st_union(poly), <span class="hljs-number">0</span>)) <span class="hljs-keyword">as</span> linestr
  <span class="hljs-keyword">from</span> voronoi
), mountain_range <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">with</span> nonrandom_line <span class="hljs-keyword">as</span> (
    <span class="hljs-keyword">select</span> st_makeline(st_point(<span class="hljs-number">0</span>, v.y1), st_point(<span class="hljs-number">100</span>, v.y2)) <span class="hljs-keyword">as</span> linestr
    <span class="hljs-keyword">from</span> (<span class="hljs-keyword">values</span> (<span class="hljs-number">70</span>, <span class="hljs-number">30</span>), (<span class="hljs-number">20</span>, <span class="hljs-number">35</span>)) <span class="hljs-keyword">as</span> v (y1, y2)
  )
  <span class="hljs-keyword">select</span>
    st_collect(voronoi.poly) <span class="hljs-keyword">as</span> geom,
    st_collect(nonrandom_line.linestr) <span class="hljs-keyword">as</span> linestr
  <span class="hljs-keyword">from</span> voronoi
  <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> nonrandom_line
  <span class="hljs-keyword">where</span> st_intersects(voronoi.poly, nonrandom_line.linestr)
), heightmap <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span>
    voronoi.poly,
    <span class="hljs-comment">-- height is a function of distance from the mountains, also factoring in</span>
    <span class="hljs-comment">-- x-position (Squaria's east is lower than its west) and a little random</span>
    <span class="hljs-comment">-- variance to make things interesting</span>
    <span class="hljs-number">100</span>
      - (<span class="hljs-keyword">min</span>(st_distance(voronoi.poly, mountain_range.geom)) * <span class="hljs-number">1.5</span>)
      - (st_x(st_centroid(voronoi.poly)) * <span class="hljs-number">1.5</span> / <span class="hljs-number">10</span>)
      + (random() * <span class="hljs-number">6</span> - <span class="hljs-number">3</span>)
      <span class="hljs-keyword">as</span> height
  <span class="hljs-keyword">from</span> voronoi
  <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> mountain_range
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> voronoi.poly
), headwater <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span> poly, height
  <span class="hljs-keyword">from</span> heightmap
  <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-keyword">join</span> (
    <span class="hljs-comment">-- draw a grid of horizontal and vertical lines 10 units apart</span>
    <span class="hljs-keyword">with</span> x <span class="hljs-keyword">as</span> (
      <span class="hljs-keyword">select</span> st_makeline(st_point(<span class="hljs-number">0</span>, generate_series), st_point(<span class="hljs-number">100</span>, generate_series)) <span class="hljs-keyword">as</span> geom
      <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">10</span>, <span class="hljs-number">90</span>, <span class="hljs-number">10</span>)
    ), y <span class="hljs-keyword">as</span> (
      <span class="hljs-keyword">select</span> st_makeline(st_point(generate_series, <span class="hljs-number">0</span>), st_point(generate_series, <span class="hljs-number">100</span>)) <span class="hljs-keyword">as</span> geom
      <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">10</span>, <span class="hljs-number">90</span>, <span class="hljs-number">10</span>)
    )
    <span class="hljs-comment">-- collect the points at which the horizontal and vertical lines cross</span>
    <span class="hljs-keyword">select</span> st_collect(st_intersection(x.geom, y.geom)) <span class="hljs-keyword">as</span> geom
    <span class="hljs-keyword">from</span> x
    <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> y
  ) <span class="hljs-keyword">as</span> grid <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-keyword">where</span> height &#x3C; <span class="hljs-number">90</span>
    <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> st_touches(poly, border.linestr)
    <span class="hljs-keyword">and</span> st_intersects(poly, grid.geom) <span class="hljs-comment">-- pick polys at those intersection points</span>
  <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> random()
  <span class="hljs-keyword">limit</span> <span class="hljs-number">30</span>
), river_poly <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span>
    row_number() <span class="hljs-keyword">over</span> () <span class="hljs-keyword">as</span> <span class="hljs-keyword">id</span>,
    <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> iter,
    <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">length</span>,
    headwater.poly,
    headwater.height,
    <span class="hljs-built_in">array</span>[headwater.poly]::geometry[] <span class="hljs-keyword">as</span> polys,
    <span class="hljs-built_in">array</span>[st_centroid(headwater.poly)]::geometry[] <span class="hljs-keyword">as</span> centroids,
    <span class="hljs-number">0</span> <span class="hljs-keyword">as</span> lake_poly_depth
  <span class="hljs-keyword">from</span> headwater
  <span class="hljs-keyword">union</span>
  <span class="hljs-keyword">select</span>
    <span class="hljs-comment">-- when neighbor.poly is null, we could not find a lower polygon to move into: sit here and lake up</span>
    previous.id,
    previous.iter + <span class="hljs-number">1</span> <span class="hljs-keyword">as</span> iter,
    <span class="hljs-keyword">case</span> <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.length <span class="hljs-keyword">else</span> previous.length + <span class="hljs-number">1</span> <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">length</span>,
    <span class="hljs-keyword">coalesce</span>(neighbor.poly, previous.poly) <span class="hljs-keyword">as</span> poly,
    <span class="hljs-keyword">coalesce</span>(neighbor.height, previous.height + <span class="hljs-number">2</span>) <span class="hljs-keyword">as</span> height, <span class="hljs-comment">-- fill in lakebed</span>
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.polys
      <span class="hljs-keyword">else</span> array_cat(previous.polys, <span class="hljs-built_in">array</span>[neighbor.poly])
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> polys,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.centroids
      <span class="hljs-keyword">else</span> array_cat(previous.centroids, <span class="hljs-built_in">array</span>[st_centroid(neighbor.poly)])
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> centroids,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> neighbor.poly <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> previous.lake_poly_depth + <span class="hljs-number">1</span>
      <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> lake_poly_depth
  <span class="hljs-keyword">from</span> river_poly <span class="hljs-keyword">as</span> previous
  <span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> lateral (
    <span class="hljs-keyword">select</span> *
    <span class="hljs-keyword">from</span> heightmap
      <span class="hljs-keyword">where</span> st_touches(heightmap.poly, previous.poly)
        <span class="hljs-keyword">and</span> heightmap.height &#x3C; previous.height
        <span class="hljs-comment">-- can't return to a poly with the same bounding box as a previously visited one</span>
        <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span>(heightmap.poly ~= <span class="hljs-keyword">any</span>(previous.polys))
      <span class="hljs-comment">-- pick the closest lower centroid</span>
      <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> st_centroid(heightmap.poly) &#x3C;-> st_centroid(previous.poly)
      <span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>
  ) <span class="hljs-keyword">as</span> neighbor <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
  <span class="hljs-comment">-- border is a single-row relation so we can do this weird shortcut antijoin</span>
  <span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-keyword">not</span> st_touches(previous.poly, border.linestr)
  <span class="hljs-keyword">where</span> previous.lake_poly_depth &#x3C; <span class="hljs-number">5</span>
), river_pruned <span class="hljs-keyword">as</span> (
  <span class="hljs-comment">-- remove intermediary rows generated as river_poly recurses</span>
  <span class="hljs-keyword">select</span> river_poly.*
  <span class="hljs-keyword">from</span> river_poly
  <span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> (
    <span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, <span class="hljs-keyword">max</span>(iter) <span class="hljs-keyword">as</span> iter
    <span class="hljs-keyword">from</span> river_poly
    <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">id</span>
  ) <span class="hljs-keyword">as</span> maxiter
    <span class="hljs-keyword">on</span> maxiter.id = river_poly.id
    <span class="hljs-keyword">and</span> maxiter.iter = river_poly.iter
  <span class="hljs-keyword">where</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> (
    <span class="hljs-comment">-- if two rivers cross each other in opposite directions, pick the one with the lower</span>
    <span class="hljs-comment">-- id and eliminate the other</span>
    <span class="hljs-keyword">select</span> <span class="hljs-number">1</span>
    <span class="hljs-keyword">from</span> river_poly <span class="hljs-keyword">as</span> rp2
    <span class="hljs-keyword">where</span> rp2.id &#x3C; river_poly.id
      <span class="hljs-keyword">and</span> <span class="hljs-built_in">array</span>(<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> unnest(rp2.centroids) <span class="hljs-keyword">where</span> unnest = <span class="hljs-keyword">any</span>(river_poly.centroids)) &#x3C;>
        <span class="hljs-built_in">array</span>(<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> unnest(river_poly.centroids) <span class="hljs-keyword">where</span> unnest = <span class="hljs-keyword">any</span>(rp2.centroids))
  )
), cutoff <span class="hljs-keyword">as</span> (
  <span class="hljs-comment">-- find the first point at which a river "loses" a confluence and becomes</span>
  <span class="hljs-comment">-- subsumed in another river's flux</span>
  <span class="hljs-keyword">select</span> p.id, <span class="hljs-keyword">min</span>(array_position(p.centroids, confluence.centroid)) <span class="hljs-keyword">as</span> <span class="hljs-keyword">position</span>
  <span class="hljs-keyword">from</span> river_pruned <span class="hljs-keyword">as</span> p
  <span class="hljs-keyword">join</span> (
    <span class="hljs-comment">-- centroids of all cells entered by more than one river; furthest upstream</span>
    <span class="hljs-comment">-- wins, with lower ids breaking ties</span>
    <span class="hljs-keyword">select</span> unnest <span class="hljs-keyword">as</span> centroid, (array_agg(<span class="hljs-keyword">id</span>::<span class="hljs-built_in">text</span> <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">ordinality</span>, <span class="hljs-keyword">id</span>))[<span class="hljs-number">1</span>] <span class="hljs-keyword">as</span> winner
    <span class="hljs-keyword">from</span> river_pruned
    <span class="hljs-keyword">join</span> lateral unnest(centroids) <span class="hljs-keyword">with</span> <span class="hljs-keyword">ordinality</span> <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
    <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> unnest
    <span class="hljs-keyword">having</span> <span class="hljs-keyword">count</span>(*) > <span class="hljs-number">1</span>
  ) <span class="hljs-keyword">as</span> confluence
  <span class="hljs-keyword">on</span> confluence.winner &#x3C;> p.id::<span class="hljs-built_in">text</span>
    <span class="hljs-keyword">and</span> array_position(p.centroids, confluence.centroid) > <span class="hljs-number">0</span>
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> p.id
), river_line <span class="hljs-keyword">as</span> (
  <span class="hljs-keyword">select</span>
    river_pruned.id,
    river_pruned.length,
    river_pruned.poly,
    river_pruned.height,
    river_pruned.polys[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)] <span class="hljs-keyword">as</span> polys,
    st_makeline(
      <span class="hljs-keyword">case</span>
        <span class="hljs-comment">-- only rivers which are not cut off and which come adjacent to the border</span>
        <span class="hljs-comment">-- require the additional segment connecting to the shore!</span>
        <span class="hljs-keyword">when</span> cutoff.position <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
          <span class="hljs-keyword">and</span> st_touches(
            river_pruned.polys[<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)],
            border.linestr
          )
        <span class="hljs-keyword">then</span> array_cat(
          river_pruned.centroids[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)],
          <span class="hljs-built_in">array</span>[st_closestpoint(
            border.linestr,
            st_centroid(river_pruned.polys[<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)])
          )]
        )
        <span class="hljs-keyword">else</span> river_pruned.centroids[<span class="hljs-number">1</span>:<span class="hljs-keyword">coalesce</span>(cutoff.position, river_pruned.length)]
      <span class="hljs-keyword">end</span>
    ) <span class="hljs-keyword">as</span> geom
  <span class="hljs-keyword">from</span> river_pruned
  <span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> cutoff <span class="hljs-keyword">on</span> cutoff.id = river_pruned.id
  <span class="hljs-keyword">join</span> border <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
)
<span class="hljs-keyword">select</span>
  <span class="hljs-keyword">id</span>,
  st_collect(geom, points) <span class="hljs-keyword">as</span> geom_with_points,
  st_collect(geom, points) <span class="hljs-keyword">as</span> geom_heightmap
<span class="hljs-keyword">from</span> (
  <span class="hljs-keyword">select</span>
    <span class="hljs-keyword">id</span>,
    st_collect(st_buffer(st_lineinterpolatepoint(geom, <span class="hljs-number">0.0</span>), <span class="hljs-number">1</span>)) <span class="hljs-keyword">as</span> points,
    st_collect(geom) <span class="hljs-keyword">as</span> geom,
    st_collect(poly) <span class="hljs-keyword">as</span> polys
  <span class="hljs-keyword">from</span> river_line
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">id</span>
) <span class="hljs-keyword">as</span> river
<span class="hljs-keyword">union</span>
<span class="hljs-keyword">select</span> <span class="hljs-literal">null</span>, st_collect(poly, st_centroid(poly)), st_collect(poly, st_centroid(poly))
<span class="hljs-keyword">from</span> (
  <span class="hljs-keyword">select</span> <span class="hljs-keyword">distinct</span> poly
  <span class="hljs-keyword">from</span> river_poly
  <span class="hljs-keyword">where</span> lake_poly_depth > <span class="hljs-number">0</span>
    <span class="hljs-keyword">and</span> poly <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> unnest <span class="hljs-keyword">from</span> river_pruned <span class="hljs-keyword">join</span> lateral unnest(polys) <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>)
) <span class="hljs-keyword">as</span> lake
<span class="hljs-keyword">union</span>
<span class="hljs-keyword">select</span> <span class="hljs-literal">null</span>, linestr, linestr
<span class="hljs-keyword">from</span> mountain_range
<span class="hljs-keyword">union</span>
<span class="hljs-keyword">select</span>
  <span class="hljs-literal">null</span>,
  st_collect(b.linestr),
  st_collect(
    st_translate(
      st_scale(st_letters(<span class="hljs-keyword">round</span>(h.height)::<span class="hljs-built_in">text</span>), <span class="hljs-number">.03</span>, <span class="hljs-number">.03</span>),
      st_x(st_centroid(h.poly)),
      st_y(st_centroid(h.poly))
    )
  )
<span class="hljs-keyword">from</span> heightmap <span class="hljs-keyword">as</span> h
<span class="hljs-keyword">join</span> border <span class="hljs-keyword">as</span> b <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>;</code></pre>
</details>
<p><img src="https://di.nmfay.com/images/fluviation/examples.gif" alt="fact of the day: TLC used the phrase &#x22;rivers and lakes&#x22; as a metaphor for a safe and familiar environment; the same expression in Chinese, &#x22;jianghu&#x22;, denotes a lawless setting of danger and adventure"></p>]]></description>
            <link>https://di.nmfay.com/random-geography-fluviation</link>
            <guid isPermaLink="true">https://di.nmfay.com/random-geography-fluviation</guid>
            <pubDate>Mon, 12 Feb 2024 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[pdot: Exploring Databases Visually, Part II]]></title>
            <description><![CDATA[<p>A couple years ago, I wrote about <a href="https://di.nmfay.com/exploring-databases-visually">exploring a running database</a> by plotting relevant subsets of the foreign key relationship graph in <a href="https://graphviz.org">dot</a> and piping the resulting images directly to the terminal. <a href="https://gitlab.com/dmfay/pdot">Things have progressed since then</a>:</p>
<ul>
<li>I supplemented the original <code>fks</code> shell script with others plotting view dependencies, role hierarchies and grants, and finally started to map the effects of triggers and functions;</li>
<li>I hit my personal ceiling of What It Is Reasonable To Do In Shell Scripts, and decided to pull all this stuff together into a single cross-platform program with a consistent interface;</li>
<li>I did <em>that</em>, and added <a href="https://mermaid.js.org">mermaid</a> support for good measure;</li>
<li>&#x26; then, I forgot to write anything about having released it for a couple of months, as you do</li>
</ul>
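<p>For a rough sense of the raw material involved, here is a minimal sketch (an illustration only, not pdot's actual query) that reads foreign keys out of the <code>pg_constraint</code> catalog and prints one dot edge per constraint. It assumes ordinary lowercase table names; anything <code>regclass</code> has to quote would need escaping before dot could read it.</p>
<pre><code class="hljs language-sql">-- one "child" -> "parent" edge per foreign key constraint
select format('  "%s" -> "%s";', conrelid::regclass, confrelid::regclass) as edge
from pg_constraint
where contype = 'f'
order by 1;</code></pre>
<p>Wrapping the output in <code>digraph { ... }</code> gives dot something it can lay out.</p>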
<p>The <code>fks</code> side of things hasn't changed much from the earlier post (aside from some niceties around table inheritance), so here's the big new thing:</p>
<p><img src="https://di.nmfay.com/images/pdot-pgair-triggers-flight.png" alt="plot showing affected tables and further trigger cascades resulting from an &#x60;on_flight_delayed&#x60; trigger"></p>
<p><a href="https://gitlab.com/dmfay/pdot/-/releases">pdot</a> is out for Linux (including the <a href="https://aur.archlinux.org/packages/pdot-git">Arch AUR</a>), Windows, and macOS universal. I'll be talking more about it and exploration as a documentation strategy at the <a href="https://www.meetup.com/chicago-postgresql-user-group/">Chicago PUG</a> in November, and possibly elsewhere!</p>]]></description>
            <link>https://di.nmfay.com/pdot</link>
            <guid isPermaLink="true">https://di.nmfay.com/pdot</guid>
            <pubDate>Sun, 13 Aug 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #009 Roundup]]></title>
            <description><![CDATA[<p>Another Phriday in the books, and it's time to see what all happened:</p>
<ul>
<li><a href="https://opensource-db.com">Hari Kiran</a> offered an <a href="https://opensource-db.com/database-change-management-pgsql-phriday-009/">introduction to the concepts and processes used in schema evolution</a> with examples showing how to use Flyway, a popular Java-based tool.</li>
<li><a href="https://www.scarydba.com">Grant Fritchey</a> discussed <a href="https://www.scarydba.com/2023/06/02/pgsql-phriday-009-on-rollback/">what it means to roll a change back</a>, why catastrophic failures are the <em>good</em> kind of failure, and how to work with deployment processes to adapt to failures and roll the database forward to a good state instead of trying to turn back time.</li>
<li><a href="https://www.pgmustard.com/blog">Michael Christofides</a> wrote about <a href="https://www.pgmustard.com/blog/database-change-management-vs-performance">table stakes for database automation</a> and his hopes for better integration of performance testing in automated change management.</li>
<li><a href="https://andyatkinson.com/blog">Andy Atkinson</a> ran through <a href="https://andyatkinson.com/blog/2023/06/02/pgsql-phriday-009-schema-change-management">the entire prompt item by item</a>! Read it for a detailed look at Rails migrations <em>in situ</em> -- who writes them, who reviews them, what kinds of problems happen, and how to validate successful changes.</li>
<li><a href="https://www.softwareandbooz.com">Ryan Booz</a> gave me ERWin flashbacks and proclaimed a <a href="https://www.softwareandbooz.com/10-requirements-for-managing-database-changes/">decalogue for the aspiring database automator</a>, from the foundational on up. No comment on whether I've ever achieved #7 without painstakingly restoring manual production dumps.</li>
<li>finally, I covered the <a href="https://di.nmfay.com/pgsql-phriday-three-big-ideas">ideas powering a few less-usual schema evolution tools</a>.</li>
</ul>
<p>Thanks everyone for participating, and look forward to <a href="https://www.pgsqlphriday.com/calendar">Alicja's invitation</a> coming around the end of the month!</p>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-009-roundup</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-009-roundup</guid>
            <pubDate>Wed, 07 Jun 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #009: Three Big Ideas in Schema Evolution]]></title>
            <description><![CDATA[<p>I've used several migration frameworks in my time. Most have been variations on a common theme dating back lo these past fifteen-twenty years: an ordered directory of SQL scripts with an in-database registry table recording those which have been executed. The good ones checksum each script and validate them every run to make sure nobody's trying to change the old files out from under you. But I've run into three so far, and used two in production, that do something different. Each revolves around a central idea that sets it apart and makes developing and deploying changes easier, faster, or better-organized than its competition -- provided you're able to work within the assumptions and constraints that idea implies.</p>
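<p>As a minimal sketch of that common theme, assuming a hypothetical <code>schema_migrations</code> registry and a hypothetical temp table <code>on_disk</code> that the runner fills with the filename and checksum of each script currently in the directory (neither name comes from any particular framework):</p>
<pre><code class="hljs language-sql">-- hypothetical registry: one row per script that has been executed
create table schema_migrations (
  filename text primary key,
  checksum text not null,
  executed_at timestamptz not null default now()
);

-- validation on every run: any row returned means an already-executed
-- script has been edited since it ran. on_disk (filename, checksum) is
-- assumed to have been loaded by the runner beforehand.
select r.filename
from schema_migrations as r
join on_disk as d using (filename)
where d.checksum &#x3C;> r.checksum;</code></pre>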
<h2 id="sqitch-orchestration">sqitch: Orchestration</h2>
<p>The first time I used <a href="http://sqitch.org">sqitch</a>, I screwed up by treating it like any other manager of an ordered directory of SQL scripts with fancier verification and release management capabilities. It does have those, but they weren't why I used sqitch the second and subsequent times.</p>
<p>sqitch wants you to treat your schema and the statements that define it as a supergraph of all your inter-database-object dependencies. Tables depend on types, on other tables via foreign keys, on functions with triggers or constraints; views depend on tables and functions; functions depend on tables, on other functions, on extensions. Each one, roughly, is a single named migration -- more on that in a bit.</p>
<p>So <code>shipments</code> depend on <code>warehouses</code>, since you have to have a source and a destination for the thing you're shipping, and <code>warehouses</code> depend on <code>regions</code>, because they exist at a physical address subject to various laws and business requirements. <code>shipments</code> also have no meaning independently from the thing being shipped, so in the case I'm filing the serial numbers from, that table also maintains a dependency on <code>weather-stations</code>. Both <code>shipments</code> and <code>warehouses</code> depend on the existence of the <code>set_last_updated</code> audit trigger function. The plan file looks like this:</p>
<pre><code class="hljs language-makefile">trigger-set-updated-at 2020-03-19T17:20:30Z dian &#x3C;> <span class="hljs-comment"># trigger function for updated_at audit column</span>
regions 2020-03-19T18:30:27Z dian &#x3C;> <span class="hljs-comment"># region/country lookup</span>
warehouses [regions trigger-set-updated-at] 2020-03-20T16:34:56Z dian &#x3C;> <span class="hljs-comment"># storage for stuff</span>
weather-stations [trigger-set-updated-at warehouses] 2020-03-20T17:46:36Z dian &#x3C;> <span class="hljs-comment"># stations!</span>
shipping [warehouses weather-stations] 2020-03-20T18:56:49Z dian &#x3C;> <span class="hljs-comment"># move stuff around</span></code></pre>
<p>Or, for the more visually inclined:</p>
<p><img src="https://di.nmfay.com/images/three-big-ideas-shipments.png" alt="sqitch migration dependencies: shipments, warehouses, weather stations"></p>
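<p>Each name in the plan maps to a deploy script (plus matching revert and verify scripts). A sketch of what <code>deploy/warehouses.sql</code> might contain follows; the project name <code>weather</code> and every column are made up for illustration, and the <code>requires</code> lines are only header comments, since the plan file is what sqitch actually reads for dependencies.</p>
<pre><code class="hljs language-sql">-- Deploy weather:warehouses to pg
-- requires: regions
-- requires: trigger-set-updated-at

begin;

-- hypothetical columns; regions (region_id) is assumed to exist already
create table warehouses (
  warehouse_id int generated always as identity primary key,
  region_id int not null references regions (region_id),
  name text not null,
  updated_at timestamptz not null default now()
);

-- the audit trigger function comes from the trigger-set-updated-at change
create trigger warehouses_set_last_updated
before update on warehouses
for each row execute function set_last_updated();

commit;</code></pre>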
<p>I have often kept tables and tightly coupled database objects such as types, junction tables, or (some) trigger functions in one file. Here, <code>weather-stations</code> defines health and status types, a serial number sequence, and more, while <code>warehouses</code> includes a cluster of related tables representing inventory quantities.</p>
<p>There are two reasons behind this. First, I've mostly used sqitch on very small teams. If I'm the only person, or nearly the only person (I wrote 97% of the migrations in the weather-stations project) working on the database, the effort of factoring becomes pure overhead well before each database object has its own individual set of files.</p>
<p>Second, orchestration cuts both ways. Reworking and tracking the history of individual database objects is great as long as the changes stay local, but changes to a type or domain, for example, often involve a drop and replacement. The drop can't happen as long as there are columns of that type or domain anywhere else, so <em>those</em> have to be managed simultaneously. It's ugly no matter what, but in a linear "directory of scripts" framework, it's only as ugly as any other major change. Your script can create the new type, migrate dependent columns to it, drop the old type, and finally rename the new.</p>
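<p>In a single linear script, that sequence might look like the following sketch. The <code>station_status</code> enum, its labels, and the <code>weather_stations.status</code> column are hypothetical stand-ins, and the cast assumes every existing value is also a label in the new type.</p>
<pre><code class="hljs language-sql">begin;

-- 1. create the replacement under a temporary name
create type station_status_new as enum ('online', 'degraded', 'offline');

-- 2. migrate every dependent column; a column default, if there is one,
--    has to be dropped first and restored afterward
alter table weather_stations
  alter column status type station_status_new
  using status::text::station_status_new;

-- 3. drop the old type now that nothing depends on it
drop type station_status;

-- 4. give the replacement the original name
alter type station_status_new rename to station_status;

commit;</code></pre>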
<p>If you're using sqitch rigorously, the change is smeared across multiple sites and across time: rework the type to add the replacement, rework each dependent table to migrate its columns, tag, rework the type again to drop the old and rename the new, tag. Or you could hijack the <code>typename.sql</code> rework and do everything all at once in it -- undermining the sensible, well-delineated organization of schema objects that distinguishes sqitch in the first place. It's even worse when view dependencies change.</p>
<p>Using closely-related subgraphs instead of individual database objects as the "unit" of sqitch changes keeps many (not all) messy migrations contained, but there's no complete answer.</p>
<h2 id="graphile-migrate-idempotence">graphile-migrate: Idempotence</h2>
<p><a href="https://github.com/graphile/migrate">graphile-migrate</a> is developed alongside but does not require <a href="https://www.graphile.org/postgraphile/">Postgraphile</a>, and hews a lot closer to the traditional directory-of-scripts style. Change scripts are numbered, checksummed, and validated per usual, but the development experience of graphile-migrate is unique.</p>
<p>Every other schema evolution framework I've used has expected me to run the next changeset once and only once on top of the previous and only the previous, even during active development. Any tweaks, fixes, or additions can't be applied until the database has been reset, whether by a revert or "down" migration, manually issuing DDL and deleting the run record from the change registry, or often as not dropping and recreating the dev database from scratch.</p>
<p>graphile-migrate expects you to run the migration you're actively working on over and over again. It even defaults to a file-watch mode which runs in the background and executes the "current" migration every time you save. I don't use that, because I save early and often, draft valid-but-destructive DDL with some frequency, and want to run tests, hence <code>graphile-migrate watch --once &#x26;&#x26; pg_prove</code>; but the fact that executing the current migration just the one time is a special case kind of says it all.</p>
<p>It shouldn't matter whether you run the current migration once or a hundred times: the end database state must be identical. This can take some doing. On the easy side, it's always <code>create or replace</code>, never just <code>create</code>; but sometimes idempotent replacement isn't an option. Types and domains, constraints and options, row-level security policies, and more (views, if existing columns are changing) have to be handled with more care. And <code>if not exists</code> is a trap for the unwary.</p>
<p><code>create table if not exists warehouses (....)</code> runs! The table is there, with the columns we've specified; next time we run the current migration, it skips <code>warehouses</code> seamlessly. It's great -- until we realize there's a column missing and add it in the <code>create table</code> definition, whereupon the next time we run the current migration, it skips <code>warehouses</code> seamlessly. The change needs to be this instead:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">drop</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">exists</span> warehouses;
<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> warehouses(....);</code></pre>
<p>In the case where <code>warehouses</code> was created in an already-committed migration and we need to add the column without dropping existing data, it's time to break out <code>do</code> blocks:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">do</span> $maybe_add_active$<span class="hljs-keyword">begin</span>
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> warehouses <span class="hljs-keyword">add</span> <span class="hljs-keyword">column</span> is_active <span class="hljs-built_in">boolean</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-literal">true</span>;
exception when duplicate_column then null;
<span class="hljs-keyword">end</span>$maybe_add_active$;</code></pre>
<p>The <a href="https://github.com/graphile/migrate/blob/main/docs/idempotent-examples.md">official examples</a> suggest scanning the system catalogs to determine whether to run a statement, but I've often found it quicker and easier to damn the torpedoes and trap <a href="https://www.postgresql.org/docs/current/errcodes-appendix.html">specific exceptions</a> afterward.</p>
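<p>For comparison, the catalog-scanning style looks something like this sketch (my own illustration rather than a copy of the official examples; the constraint and the <code>name</code> column are hypothetical):</p>
<pre><code class="hljs language-sql">do $maybe_add_check$begin
  -- only add the constraint if pg_constraint doesn't already list it
  if not exists (
    select 1
    from pg_constraint
    where conrelid = 'warehouses'::regclass
      and conname = 'warehouses_name_not_blank'
  ) then
    alter table warehouses
      add constraint warehouses_name_not_blank check (name &#x3C;> '');
  end if;
end$maybe_add_check$;</code></pre>
<p>It's more to write and read than trapping <code>duplicate_object</code>, but it never raises in the first place.</p>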
<h2 id="migra-comparison">migra: Comparison</h2>
<p><a href="https://github.com/djrobstep/migra">migra</a> makes me nervous. This is the one I've never deployed, which is largely down to one specific but complicated reason: you don't write migrations (yay!), because it magically infers the necessary changes between old and new schemata (cool!), which means it maintains an internal model or map of Postgres features (stands to reason), which cannot be complete as long as Postgres is actively developed.</p>
<p>Is that necessary incompleteness really a dealbreaker? I actually think it shouldn't be! migra's goal is to save you all the time you formerly spent writing migration scripts at the hopefully-much-reduced cost of reviewing them and revising the tricky or unsupported bits. Its automated playbook doesn't have to be complete to make database development significantly faster.</p>
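<p>To make the comparison idea (and its incompleteness) concrete, here is a deliberately tiny sketch. It is not how migra works internally: migra compares whole databases and covers far more ground. This assumes the old and new definitions live side by side in one database as schemas named <code>prev</code> and <code>next</code>, both hypothetical names.</p>
<pre><code class="hljs language-sql">-- columns that exist in "next" but not in "prev", or whose type changed:
-- each row is a candidate for an alter table statement
select table_name, column_name, data_type
from information_schema.columns
where table_schema = 'next'
except
select table_name, column_name, data_type
from information_schema.columns
where table_schema = 'prev'
order by table_name, column_name;</code></pre>
<p>This query knows nothing about defaults, constraints, indexes, policies, or anything else outside its three columns, which is the shape of the problem in miniature: the diff is only ever as complete as the model behind it.</p>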
<p>A legitimate dealbreaker in some situations is that migra does not maintain a registry or even a history of valid schema states. There's only previous and next, with the latest revised diff between the two tracked in source control. It's theoretically feasible to pull all versions of the diff between t0 and t<em>n</em> and apply them one by one to reproduce the schema of a customer on a database dating back to that t<em>n</em>, but at that point you're setting all your other time savings on fire.</p>
<p>I haven't been in such a situation for some time, having had only one production database instance per project. Even so, when it's been up to me I've reached for a known quantity instead of investigating migra any further. Why? Because it isn't just a question of how completely migra supports Postgres features.</p>
<p>Complexity varies from database to database and from change to change. Something like migra could save tons of time on one database and not another, or even from one schema evolution to the next. It's hard to know whether you're in a migra-friendly scenario or not until you've already committed yourself, and the risk of falling <em>out</em> of that state and into writing complex migrations from scratch with next-to-no tool support doesn't go away.</p>
<p>It's a fantastic idea -- I <em>should</em> be able to reshape a database interactively, then generate at least an outline of a migration by comparing it to an unmodified baseline! Better still if I could test my change against that baseline and evaluate progress by the items remaining in the diff. The risks and the lack of history and verification keep me from using migra, but I hope we'll see its DNA in some of the next generation of schema evolution tools.</p>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-three-big-ideas</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-three-big-ideas</guid>
            <pubDate>Fri, 02 Jun 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #009 Invitation: Making Changes]]></title>
            <description><![CDATA[<p>It's almost Phriday again! This is a monthly blogging event for the PostgreSQL community. <a href="https://www.pgsqlphriday.com/rules/">The rules</a>:</p>
<ul>
<li>publish something on-theme on or near <strong>Friday, June 2nd</strong></li>
<li>include "PGSQL Phriday #009" in your title or first paragraph, and link to this invitation post</li>
<li>share it! The best way to reach the greater Postgresphere is to <a href="https://planet.postgresql.org/add.html">get syndicated on Planet Postgres</a>, but you can also share on <a href="https://postgresteam.slack.com/archives/C044VV1T25S">#pgsqlphriday in the community Slack</a> or post to social media with the #PGSQLPhriday hashtag</li>
</ul>
<p>This month's topic is <strong>database change management</strong>, aka schema evolution. I've been doing this in one form or another, using one framework or another (and on one less-memorable-than-you'd-think occasion writing my own in a thousand lines of <a href="https://ant.apache.org">Ant</a> XML) for almost as long as I've worked in software. If you interact with databases in more than a read-only capacity, you've probably done your share of it as well. It's common, it's necessary, it's not very glamorous.</p>
<p>Every now and then, someone will extol the benefits of version-controlling your schema -- <a href="https://www.scarydba.com">Grant Fritchey</a> discussed this at PGDay Chicago just last month -- or write a how-to for a specific framework. There's a <a href="http://se-pubs.dbs.uni-leipzig.de">slow current of academic interest in the topic</a> which seems to have limited feedback into industry, publications tending toward the descriptive or the heavily specialized with only the occasional experiment like <a href="http://yellowstone.cs.ucla.edu/schema-evolution/index.php/Prism">PRISM</a> seeing daylight. But the people deploying changes day to day don't tend to talk much about the nitty-gritty details or the experience of modifying a running database, because change management is plumbing.</p>
<!-- Of the migrations we write, many more tend to be tedious and/or grueling than are elegant puzzles expressing an alchemic transformation of schema lead into schema gold. Practically all of them are incredibly proprietary, too. So we all know how to deploy changes our way, even if that way is yoloing DDL into production off-hours, fingers crossed in hope no-one notices; and we know everyone else knows how to do it _their_ way, or at least can read the same docs we have. Change management is plumbing. -->
<p>Plumbing is <em>really important</em>, and there are a lot of fascinating technical, procedural, social, even philosophical aspects to it. Let's haul a few of them into the spotlight!</p>
<p>Some starting points:</p>
<ul>
<li>how does a change make it into production? Do you have a dev-QA-staging or similar series of environments it must pass through first? Who reviews changes and what are they looking for?</li>
<li>what's different about modifying huge tables with many millions or billions of rows? How do you tackle those changes? Do you use the same strategy for smaller tables?</li>
<li>how does Postgres make certain kinds of change easier or more difficult compared to other databases?</li>
<li>do you believe that "rolling back" a schema change is a useful and/or meaningful concept? When and why, or why not?</li>
<li>how do you validate a successful schema change? Do you have any useful processes, automated or manual, that have helped you track down problems with rollout, replication, data quality or corruption, and the like?</li>
<li>what schema evolution or migration tools have you used? What did you like about them, what do you wish they did better or (not) at all?</li>
<li>tales of terror in the <a href="https://books.google.me/books?id=n2YA3FAv8DoC&#x26;lpg=PR4&#x26;pg=PR4#v=onepage&#x26;q&#x26;f=false">Kletzian mode</a> are also of course very welcome!</li>
</ul>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-009-invitation</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-009-invitation</guid>
            <pubDate>Fri, 26 May 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #008: pg_stat_statements]]></title>
            <description><![CDATA[<p><a href="https://www.pgmustard.com/blog/pgsql-phriday-pg-stat-statements"><code>pg_stat_statements</code> for May</a>. As luck would have it, it's been invaluable to me over the past few weeks as I've been solving some performance problems of the "tens of millions of rows, row-level security, inverted indices, tens of thousands of rows returned, oops I never did get around to double-checking <code>work_mem</code> in production did I?" variety. The big lesson this time around: <strong>pay attention to the standard deviation of timings!</strong></p>
<p>The most often called (by far) and longest running (by a much closer margin) statements in this scenario were coming from an account synchronization daemon. Every fifteen seconds the daemon pulls user account information from Keycloak and overwrites the materialized local data, a pattern that sounds suspiciously like an inferior implementation of <a href="https://di.nmfay.com/postgres-user-cache">something RDS is not going to ship any time soon</a>. <code>postgres_fdw</code> is there, of course, but then we'd be depending on Keycloak's schema rather than its API, and that's a much chancier proposition.</p>
<p>The initial user sync implementation wrote to three relevant tables in a single statement using CTEs, because why not? It's easy, convenient, and seemed to work just fine in non-production environments.</p>
<p>In production, though:</p>
<table>
<thead>
<tr>
<th>calls</th>
<th>min_exec_time</th>
<th>mean_exec_time</th>
<th>max_exec_time</th>
<th>stddev_exec_time</th>
</tr>
</thead>
<tbody>
<tr>
<td>8,148,657</td>
<td>0.535</td>
<td>9.720</td>
<td>36,272.717</td>
<td>115.918</td>
</tr>
<tr>
<td>4,489,798</td>
<td>0.560</td>
<td>15.650</td>
<td>81,526.365</td>
<td>77.713</td>
</tr>
</tbody>
</table>
<p>These are the same statement: <code>with dataset as ([upsert dataset] returning *), person as (insert into person [with dataset membership] returning *) insert into account [reference to person]</code>. For us, <code>accounts</code> are special cases of <code>people</code> and <code>people</code> have a tag array column linking them to <code>datasets</code>; we have reasons to avoid a junction table that don't make a difference here.</p>
<p>The daemon got a dedicated Postgres role 8 million executions after I enabled <code>pg_stat_statements</code>, and used that for 4.5 million more. At its fastest, it completes in about half a millisecond -- great! At worst, though, it takes over half a minute, or even closer to a minute and a half. The means are decently low, but they're means and it's hard to tell just how many longer-running outliers are contributing to their drift.</p>
<p>All is revealed by the standard deviation, which is quite low in both cases. Most syncs happen within about a tenth of a second of the mean, which is itself closer to a hundredth of a second. <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">99.7% of timings should fall within three standard deviations</a>, assuming a normal distribution, and an execution time over a second represents between nine and thirteen standard deviations from the mean. If I'm statisticking correctly, this means that out of the 12.5 million samples, the only timings over a second are almost certainly just the two known maxes. It's still not exactly wonderful that the statement <em>can</em> run a minute and a half when the stars align, but if you have a high max with a low min, mean, and deviation, the statement you're looking at isn't the problem.</p>
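<p>The arithmetic behind that estimate can be run against the view itself. This is only a sketch: the one-second threshold (1,000 ms) is the figure from the paragraph above, the <code>stddevs_to_one_second</code> alias is made up for illustration, and the same normal-distribution assumption applies.</p>
<pre><code class="hljs language-sql">-- how many standard deviations above the mean is a one-second execution?
select
    calls,
    mean_exec_time,
    stddev_exec_time,
    round(((1000 - mean_exec_time) / stddev_exec_time)::numeric, 1) as stddevs_to_one_second
from pg_stat_statements
where stddev_exec_time &gt; 0 -- skip perfectly consistent statements to avoid dividing by zero
order by calls desc
limit 20;</code></pre>
<p>For the two rows in the table above, that works out to about 8.5 and 12.7 respectively, which is where "nine to thirteen" comes from.</p>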
<p>I don't know for sure what it is that got in the way. My chief suspect is a database function that adds dataset memberships to multiple records at a time, or its counterpart that removes them, both further down my top-20 list. Clients were initially configured to call these with batches of 25,000 records, which quickly exceeded the default 4MB <code>work_mem</code> and would churn for the better part of a minute at minimum. Modified records would all have had foreign keys to <code>accounts</code> -- forcing the sync daemon's changes to wait. Dataset membership management fits the "stars aligning" usage profile as well since mass changes like that aren't yet happening every day. With <code>work_mem</code> adjusted to 16MB, those functions have sped up dramatically, and I haven't noticed any other suspicious timings since.</p>
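<p>How <code>work_mem</code> gets changed depends on the deployment, and nothing above says which route was taken here. As a hedged sketch, with <code>batch_client</code> standing in as a hypothetical role name for the clients calling those functions:</p>
<pre><code class="hljs language-sql">show work_mem; -- 4MB unless it's been configured otherwise

-- raise it for the current session only, to try out a new value
set work_mem = '16MB';

-- or persist it for one role (hypothetical name); applies to that role's new sessions
alter role batch_client set work_mem = '16MB';</code></pre>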
<p>I did split up the statement, since both <code>accounts</code> and <code>datasets</code> are quite high-traffic tables, the former being targeted by foreign keys all over the place and the latter governing row-level security on several other tables. Millions of syncs after the change, only the <code>accounts</code> insert has ever gone long, for significantly less time than the earlier outliers, and also probably only once. It's also faster and more consistent than the triple insert as might be expected, with a standard deviation of 11ms over practically nothing.</p>
<table>
<thead>
<tr>
<th>calls</th>
<th>min_exec_time</th>
<th>mean_exec_time</th>
<th>max_exec_time</th>
<th>stddev_exec_time</th>
</tr>
</thead>
<tbody>
<tr>
<td>4,887,878</td>
<td>0.026</td>
<td>0.091</td>
<td>22,436.409</td>
<td>10.942</td>
</tr>
</tbody>
</table>
<p>Here's my "leaderboard" query:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span>
    userid::regrole::<span class="hljs-built_in">text</span>,
    calls,
    min_exec_time,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    <span class="hljs-keyword">query</span>
<span class="hljs-keyword">from</span> pg_stat_statements
<span class="hljs-keyword">where</span> calls > <span class="hljs-number">100</span> <span class="hljs-keyword">and</span> max_exec_time > <span class="hljs-number">10000</span>
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">round</span>(calls, <span class="hljs-number">-2</span>) <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">round</span>(mean_exec_time::<span class="hljs-built_in">numeric</span>, <span class="hljs-number">-2</span>) <span class="hljs-keyword">desc</span>, stddev_exec_time <span class="hljs-keyword">asc</span>
<span class="hljs-keyword">limit</span> <span class="hljs-number">20</span>;</code></pre>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-pg-stat-statements</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-pg-stat-statements</guid>
            <pubDate>Fri, 05 May 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Some Notes on ZSH Arrays]]></title>
            <description><![CDATA[<blockquote>
<p> <a href="https://zsh.sourceforge.io/Doc/Release/Expansion.html#Rules">Here is a summary of the rules for substitution</a>; this assumes that braces are present around the substitution, i.e. <code>${...}</code>. Some particular examples are given below. Note that the Zsh Development Group accepts <em>no responsibility</em> for any brain damage which may occur during the reading of the following rules.</p>
</blockquote>
<p>I'm doing <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/triggers.zsh">inadvisably complicated things</a> with zsh again; you'll need <a href="https://gitlab.com/dmfay/sql-tsquery">this</a> to use it as well, if you dare. More on that in due course. What I'm here to write about now is the zsh part, and the parts of that part (die sich das Licht gebar) that were a struggle to get right, even with a <a href="https://gist.github.com/ClementNerma/1dd94cb0f1884b9c20d1ba0037bdcde2">quite useful cheatsheet</a>.</p>
<p>This is a six-element zsh array, extracted from its natural habitat in a function (note the <code>local</code>):</p>
<pre><code class="hljs language-zsh"><span class="hljs-built_in">local</span> MYARRAY=(<span class="hljs-string">"alpha beta"</span> gamma delta gamma <span class="hljs-string">"epsilon"</span> <span class="hljs-string">"alpha beta"</span>)</code></pre>
<h2 id="1-deduplication">1. Deduplication</h2>
<p>This turned out to be easy.</p>
<pre><code class="hljs language-zsh"><span class="hljs-built_in">typeset</span> -U MYARRAY</code></pre>
<p>Done and dusted. It's also possible with a parameter expansion flag, though. Sometimes.</p>
<pre><code class="hljs language-zsh"><span class="hljs-built_in">echo</span> <span class="hljs-variable">${(u)MYARRAY}</span> <span class="hljs-comment"># alpha beta gamma delta epsilon</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${(u)MYARRAY}</span>"</span> <span class="hljs-comment"># alpha beta gamma delta gamma epsilon alpha beta</span></code></pre>
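<p>See, outside a string <code>${}</code> does parameter expansion on the array as an array. <em>Inside</em> double quotes it's still parameter expansion, but the array is joined into a single word before the <code>u</code> flag has anything to deduplicate, so your flags mean nothing. Adding the <code>@</code> flag, as in <code>"${(@u)MYARRAY}"</code>, keeps the elements separate.</p>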
<h2 id="2-passing-arrays-to-functions">2. Passing Arrays to Functions</h2>
<pre><code class="hljs language-zsh"><span class="hljs-keyword">function</span> <span class="hljs-function"><span class="hljs-title">otherfunction</span></span>() {
  <span class="hljs-comment"># local ARR=???</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${#ARR}</span> elements in <span class="hljs-variable">$ARR</span>[@]"</span> <span class="hljs-comment"># print count and contents</span>
}

<span class="hljs-keyword">function</span> <span class="hljs-function"><span class="hljs-title">main</span></span>() {
  ....
  otherfunction MYARRAY
}

main</code></pre>
<p>Okay, remember that the parentheses in the function signature are a total red herring: arguments are numbered. Let's try filling in that blank the simplest possible way:</p>
<pre><code class="hljs language-zsh">  <span class="hljs-built_in">local</span> ARR=<span class="hljs-variable">$1</span> <span class="hljs-comment"># 7 elements in MYARRAY</span></code></pre>
<p>Nope, that passed the variable name in as a string. We've got to use parameter expansion, specifically the <code>P</code> flag to interpret the value as a parameter name and the <code>A</code> flag to indicate it's an array. Take two:</p>
<pre><code class="hljs language-zsh">  <span class="hljs-built_in">local</span> ARR=<span class="hljs-variable">${(PA)1}</span> <span class="hljs-comment"># 30 elements in alpha beta gamma delta epsilon</span></code></pre>
<p>Well, we have the expected contents, but it's also obviously a string: 30 elements! The secret is to reconstitute the array <em>into</em> an array:</p>
<pre><code class="hljs language-zsh">  <span class="hljs-built_in">local</span> ARR=(<span class="hljs-variable">${(P)1}</span>) <span class="hljs-comment"># 4 elements in alpha beta gamma delta epsilon</span></code></pre>
<p>Success! The <code>A</code> flag can be included or not -- it makes no difference whatsoever.</p>
<h2 id="3-also-watch-your-scopes">3. Also, Watch Your Scopes</h2>
<pre><code class="hljs language-zsh"><span class="hljs-keyword">for</span> TARGET <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${MYARRAY[@]}</span>"</span>; <span class="hljs-keyword">do</span>
  <span class="hljs-keyword">if</span> [ -n <span class="hljs-string">"<span class="hljs-variable">$TARGET</span>"</span> ]; <span class="hljs-keyword">then</span> <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$TARGET</span> is real!"</span>; <span class="hljs-keyword">fi</span>
<span class="hljs-keyword">done</span></code></pre>
<p>If <code>TARGET</code> already contains a value you get a free spin through the loop that you probably don't want!</p>]]></description>
            <link>https://di.nmfay.com/zsh-gotchas</link>
            <guid isPermaLink="true">https://di.nmfay.com/zsh-gotchas</guid>
            <pubDate>Tue, 02 May 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #007: The Art of the Trigger]]></title>
            <description><![CDATA[<p><a href="https://mydbanotebook.org/post/triggers/">It's triggers this time</a>! I've said it before and I'll say it again: <a href="https://www.prisma.io/dataguide/datamodeling/functional-units">if you need to compute, do it as close to your data as you can get away with</a>. But programmed databases, and especially programmed databases that use triggers to encode automatic behaviors and responses, are infamously hard to understand, and the more programmed the more difficult. Why is this, and what can we do about it?</p>
<p>Trigger utility is limited first by the limits of database procedural languages. The other PLs like Python or JavaScript can't touch anything PL/pgSQL can't (it bears mentioning here: there's more than <code>OLD</code> and <code>NEW</code>! <a href="https://www.postgresql.org/docs/current/plpgsql-trigger.html#PLPGSQL-DML-TRIGGER"><code>TG_OP</code>, <code>TG_TABLE_NAME</code>, and <code>TG_ARGV</code> in particular</a>) and are useful because they can express complex and specific manipulations in algorithmic instead of relational-calculus terms. Higher-level abstractions are not available to database functions in general unless built to that purpose, <em>in</em> database procedural languages, which is when I start feeling compelled to apologize to code reviewers in advance.</p>
<p>The real limits, though, aren't purely technological. All things are possible with a Turing complete language and sufficient patience. But let's say we're adequately funded with all the time in the world, have a trusted and capable DBA at the helm, and they've judged that encoding the processes under consideration into the database will save our organization money and simplify our infrastructure. Someone in the room is going to be nervous, and it's not infrequently the DBA: why?</p>
<p>Any successful automation, mechanical or virtual, changes the structure and politics (but I repeat myself) of an organization, absorbing money, risks, responsibilities, jobs, entire professions, and reorganizing them into new, more efficient or more specialized forms; these projects only fail insofar as they do <em>not</em> take over operational territory. That's reason enough for nerves right there. Database automation in particular, though, is notably arcane and access to it is strictly controlled for very good reasons.</p>
<p>Other virtual automations are invisible compared to the mechanical sort, but they at least tend to have names: the such-and-such datafeed ETL, the new-member flow, the delivery queue. In a healthy organization, those names are backed up by teams or at least by relatively well-defined responsibilities. They have a recognizable surface area which can be examined or interacted with. People know when a given ETL job has crashed, they can often see exactly why (whether or not they can use that information), and they usually know whom to call.</p>
<p>The names of database-internal programs, by contrast, are invisible to the uninitiated. Experts can locate and analyze them, but from outside they inhabit The Database, an undifferentiated and undifferentiable space bordering every other territory on the organization's operational map. Responsibility for database programs is often more diffuse but is also harder to identify in the first place. Effects are visible, their causes are not. After The Database takes over a new operational area, both those previously responsible and others across the organization can no longer see what's going on. If any other department worked this way it'd be a sign of major dysfunction, but again: very good reasons.</p>
<p>And triggers are the acme of database programming. When the new-member flow becomes an <code>after insert</code> trigger and a series of database functions, this is in a very real sense the database encroaching on other operational demesnes. For the good of all, naturally: if much of the initial processing of new members can be made to happen in the database, with perhaps the necessary external data sources connected through foreign data wrappers, everyone's happier! Signups are much faster for members. The team currently responsible for setting the latest introductory rate every so often can devolve that to the database team, or even help design a self-service rate lever for the business people, and move on permanently. Ops can even take a node or two off the infrastructure-that-needs-watching graph.</p>
<p>But it also makes the signup process more opaque to everyone else. Downstream dependents are less able to reason about what is happening or has happened, and while the subsumption of the process into the database hopefully gives those dependents less cause to wonder than they used to have, it can't eliminate that need completely. "What happens during signup" is less knowable, less memorable, and less perceivable to the rest of the organization. That's also cause for concern: is encoding our institutional knowledge into this self-governing black box worth what we gain from computing close to the data? Will we be going all the way back to the drawing board if an acquisition or regulation or sheer signup volume forces us to store and process new members differently? Will we become uncertain about the results and ramifications of the encoded processes as they're performed internally? Will we be able to implement changes or respond to problems with appropriate efficiency?</p>
<p>Only experience can tell us whether our programmed-database strategy will be worth the sacrifices we make for speed and simplicity. Each automation project is unique, but there are common workflow adjustments and technical solutions which help improve the odds of success. Our goals on this tactical level are to speed up development and test feedback loops, keep implementors' options open in the face of unforeseen obstacles, and demystify database automation for everyone else who works with it.</p>
<h3 id="priorities">priorities</h3>
<p>Databases change more slowly than do their client programs. New or external processes moving into the database should be as completely defined as possible to avoid flurries of updates as requirements continue to evolve or edge cases and bugs are squashed. It's usually better to give young processes time to stabilize before incorporating them, just like it's less work in aggregate to refine queries embedded in client code before turning them into views.</p>
<h3 id="fast-iteration">fast iteration</h3>
<p>Databases change more slowly than client programs, but during active development the latter change on the scale of seconds. Development databases need to be as close behind that as possible. It should be fast to stand up a clean schema from scratch, faster to reapply changes as implementation progresses.</p>
<p>When I'm writing triggers and functions, I'll often revise them directly in psql, making heavy use of conveniences like <code>\ef</code>. Once I'm happy with the result I'll "canonize" the final code in the schema migration I'm working on. This works best with very focused changes; if the work spreads out to more than one table-trigger-function it's too easy to lose track of individual elements.</p>
<p>Migration frameworks that encourage idempotence, like <a href="https://github.com/graphile/migrate">graphile-migrate</a>, also save a step compared to frameworks with an apply/revert model. In my day job we do a lot with <code>create or replace</code> this, <code>if not exists</code> that, and attempted changes in <code>do</code> blocks ignoring known exceptions:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">do</span> $maybe_create$<span class="hljs-keyword">begin</span>
  <span class="hljs-keyword">create</span> <span class="hljs-keyword">domain</span> checked_text <span class="hljs-keyword">as</span> <span class="hljs-built_in">text</span> ....
  <span class="hljs-comment">-- there's no `create domain if not exists`, so trap the exception if the domain already exists</span>
  <span class="hljs-keyword">exception</span> <span class="hljs-keyword">when</span> duplicate_object <span class="hljs-keyword">then</span> <span class="hljs-literal">null</span>;
<span class="hljs-keyword">end</span>$maybe_create$;</code></pre>
<h3 id="debug">debug</h3>
<p>I have never used <a href="https://github.com/EnterpriseDB/pldebugger">pldebugger</a> and in fact didn't know it existed until this week. I'm not going to be able to install it on every server I need to debug, although I'm certainly going to try it where I can. Where I can't, <a href="https://www.postgresql.org/docs/current/plpgsql-errors-and-messages.html"><code>raise warning</code></a> will always have my back (<code>notice</code> is too polite: the default <a href="https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-CLIENT-MIN-MESSAGES"><code>client_min_messages</code></a> prints it, but the default <a href="https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-MIN-MESSAGES"><code>log_min_messages</code></a> is stricter). Want to see variable values? <code>raise warning</code>. Not sure which execution path it's heading down? <code>raise warning</code>. Is my complicated <code>when</code> predicate even satisfied? <code>raise warning</code> first thing into the function and find out.</p>
<p>Sometimes if there's more data in play than I want to dig through in psql or logs I'll create a temporary (sensu lato) table and have my trigger function write interesting things to it, whereupon I can sort, filter, and the rest. This does only work as long as there are no fatal errors that would roll back the transaction.</p>
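<p>A minimal sketch of that scratch-table approach follows. Every name in it (<code>trigger_debug</code>, <code>new_member_flow</code>) is hypothetical, and it assumes the function is attached as a row-level <code>insert</code> trigger so that <code>NEW</code> is populated.</p>
<pre><code class="hljs language-sql">-- scratch table for debug output; drop it when you're done
create table if not exists trigger_debug (
  logged_at timestamptz not null default clock_timestamp(), -- wall-clock time, so rows order correctly within one transaction
  op text,
  detail jsonb
);

create or replace function new_member_flow() returns trigger as $new_member_flow$
begin
  -- record whatever looks interesting at this point in the function
  insert into trigger_debug (op, detail)
  values (tg_op, jsonb_build_object('table', tg_table_name, 'new', to_jsonb(new)));

  -- the logic actually under development would go here

  return new;
end
$new_member_flow$ language plpgsql;

-- afterwards, sort and filter like any other table
select * from trigger_debug order by logged_at desc;</code></pre>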
<p>And speaking of, transactions are great for testing triggers faster, fully operational or not. Fire off your DML statement, inspect the outcome, and roll back ready to do the same exact thing all over again without having to worry about unique constraint collisions or other consequences of the new database state. I often try to get into loops like this in a dedicated testing psql session, modifying the function separately:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">rollback</span>; <span class="hljs-keyword">begin</span>;⏎
↑↑⏎</code></pre>
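<p>Spelled out, one pass through that loop might look like the sketch below. The <code>members</code> and <code>welcome_emails</code> tables are hypothetical stand-ins for whatever the trigger under test reads and writes.</p>
<pre><code class="hljs language-sql">rollback; begin; -- discard the previous attempt and start clean

-- fire the trigger
insert into members (email) values ('test@example.com');

-- inspect what it did, then go around again
select * from welcome_emails where recipient = 'test@example.com';</code></pre>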
<h3 id="test">test</h3>
<p><code>↑↑⏎</code> in a REPL is almost an automated test already -- all it's missing is a way to assert and report things about the outcome without human intervention. Trigger development is easier with the ability to evaluate assertions about everything in the database at your fingertips, but more importantly, true automated tests are legible to others as well. Anyone can look at a sufficiently descriptive test output with "success" or "failure" printed next to it and understand instantly what it means without having to know SQL.</p>
<p>For this reason alone, <a href="https://pgtap.org">pgTAP</a> may be the best thing since <a href="https://www.postgresql.org/docs/current/storage-toast.html">TOAST</a>.</p>
<p>It's important to do two things with pgTAP tests: first, make sure they describe themselves adequately in their real context. Many checks are completely self-explanatory already, especially the "schema things" like <code>has_table</code> and <code>policy_roles_are</code>. Others, such as <code>lives_ok</code> and <code>results_eq</code>, usually want a note detailing exactly what just happened or why the comparison matters.</p>
<p>Second, they need to be organized. The default TAP output is a list of files with status or error count, with the errors themselves included. The latter will be useless to external viewers, but it should be clear which major functional groups are being exercised and how they're doing. Splitting up test files also helps with state management. It's all too easy for tests to become implicitly dependent on writes made by previous tests, and innocently introducing a new one in between or reorganizing them can wreak havoc.</p>
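<p>To make the first point concrete, here's a small sketch of a pgTAP test file with descriptions attached where the check doesn't explain itself. The <code>members</code> and <code>welcome_emails</code> tables and the behavior being asserted are hypothetical; the pgTAP functions are the ones named above plus <code>plan</code> and <code>finish</code>.</p>
<pre><code class="hljs language-sql">begin;
select plan(3);

-- self-explanatory, so no description needed
select has_table('members');

select lives_ok(
  $$insert into members (email) values ('test@example.com')$$,
  'inserting a member fires the signup trigger without raising'
);

-- count(*) returns bigint, and results_eq compares column types as well as values
select results_eq(
  $$select count(*) from welcome_emails where recipient = 'test@example.com'$$,
  $$values (1::bigint)$$,
  'the signup trigger queues exactly one welcome email'
);

select * from finish();
rollback; -- leave no state behind for the next test file</code></pre>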
<p>pgTAP does represent an extra logistical commitment! Integration tests (in that loose quasi-Bechdelian sense of "at least two programs talking to each other, and writing state to disk") or even well-honed manual test loops usually come first, depending on the purpose the database serves. Testing the whole system can tell you enough about the functioning of the database to get by initially. As the database becomes more extensively programmed, the debugging needs of external statements start to be outweighed by those of procedures and triggers, and there are enough of the latter as well that internal dependencies start to form and changes here can cause failures there. Any sufficiently internally complex subsystem benefits from testing in isolation, and the database is no exception.</p>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-triggers</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-triggers</guid>
            <pubDate>Fri, 07 Apr 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[After Massive]]></title>
            <description><![CDATA[<p><a href="https://gitlab.com/monstrous/monstrous">MassiveJS version 7 went places.</a></p>
<pre><code class="hljs language-javascript"><span class="hljs-keyword">await</span> db.select(
  db.libraries
    .join(db.holdings) <span class="hljs-comment">// implicit join on foreign key holdings.library_id</span>
    .join(db.books)    <span class="hljs-comment">// implicit join on foreign key holdings.book_id</span>
    .join(db.authors, db.$join.left, {[db.authors.$id]: db.books.$author_id})
    .filter({
      [db.libraries.$postcode]: <span class="hljs-string">'12345'</span>,
      [<span class="hljs-string">`<span class="hljs-subst">${db.authors.$name}</span> ilike`</span>]: <span class="hljs-string">'Lauren%Ipsum'</span>
    })
    .project({
      <span class="hljs-attr">$key</span>: db.libraries.$id,
      <span class="hljs-attr">$columns</span>: [...db.libraries],
      <span class="hljs-attr">authors</span>: [{
        <span class="hljs-attr">$key</span>: db.authors.$id,
        <span class="hljs-attr">$columns</span>: [
          db.authors.$name,
          db.expr(
            <span class="hljs-string">`extract(year from age(coalesce(<span class="hljs-subst">${db.authors.$death}</span>, now()), <span class="hljs-subst">${db.authors.$birth}</span>))`</span>
          ).as(<span class="hljs-string">'age'</span>)
        ],
        <span class="hljs-comment">// notice `books` is a collection on authors, even though we join authors to books!</span>
        books: [{
          <span class="hljs-attr">$key</span>: db.books.$id,
          <span class="hljs-attr">$columns</span>: [...db.books]
        }]
      }]
    })
);</code></pre>
<p>It'd be stretching an ecological metaphor to say that the middle tier is being eaten, but GraphQL and the "app logic on the client" tendency in web development make a powerful combination. Together, they constitute a -- big, important, immediately useful -- local maximum on the software fitness landscape.</p>
<p>Of course, fitness one way comes at costs in others, and like any species of software system GraphQL backends are histories of decisions about what to make possible or impossible, simple or detailed, how to balance the correlated complexities of model and interface, fast good or cheap and all that. More important decisions may or may not be intentional but have in common that they exclude or foreclose ways of interacting with, here, your database and its contents. In a very roughly chronological order:</p>
<p>Classic object/relational mappers, including <a href="https://hibernate.org">Hibernate</a> and its kin but also and especially the <a href="https://api.rubyonrails.org/classes/ActiveRecord/Base.html">ActiveRecord</a> pattern, represent a choice to treat the database as a perfect, crystalline extrusion into time of the object graph and decisions on how best to patch over the resulting impedance mismatch. They also often hide or try to replace SQL and tend to target "lowest common denominator" database vendor compatibility.</p>
<p>Other data mappers and query builders, from <a href="https://mybatis.org/mybatis-3">MyBatis</a> to <a href="https://knexjs.org">Knex</a>, identified a better corresponding structure to programmatic objects in the SQL statement, transforming those objects into parameters and from results, and made decisions about whether to generate, store, or construct statements and how.</p>
<p>There's an identifiable "query runner" tendency, projects like <a href="https://github.com/vitaly-t/pg-promise">pg-promise</a>, <a href="https://github.com/gajus/slonik">slonik</a>, <a href="https://github.com/krisajenkins/yesql">yesql</a>, and <a href="https://nackjicholson.github.io/aiosql">aiosql</a>, which offer more affordances than the plain database driver but ultimately decide the important thing is helping you write exactly the SQL you need. Everything before and after getting that hand-written SQL to the driver is best left up to you, even if it means you write your own boilerplate -- at least it's <em>yours</em>.</p>
<p>Finally-so-far, GraphQL backends like <a href="https://www.graphile.org/postgraphile/">Postgraphile</a> go all in on being an HTTP API for independent clients interacting statelessly, and minus a few caveats basically nail atomic create-retrieve-update-delete from that distance. Between database functions and custom resolvers, they can cover even quite complex data models and server-side logic as well, within the bounds of request and response.</p>
<p>The first category isn't dead by any means but its innate internal contradictions are well recognized; many examples of the second are a reaction to them, Massive included. What still unites the two tendencies is their competition on the territory of the web service, which must wane as that of the independent client application has waxed. Between GraphQL serving that use case so effectively, and query runners sufficing for cases that don't involve extensive manipulation of complex object graphs, the space for mappers of any stripe at least has not been getting much bigger, relatively speaking, in the past decade. A data access library of the older school therefore will have to do a lot more than CRUD to compete, or even to differentiate itself, on its traditional terrain. If it can be useful elsewhere too, so much the better.</p>
<p>Massive isn't, and can't be, that library.</p>
<p>"Make working with your data and your database as easy and intuitive as possible, then get out of your way" was and is a great mission statement, but the fact is Massive was largely built for simple CRUD. There's more to it, of course: full-text search, array and JSON field support, runtime document table generation, keyset pagination, sequence and matview management, but these are extras on a design rooted in intentionally chosen simplifications. Finding all fields by a criteria object goes a really long way!</p>
<p>Many of these extra ideas and tools Massive adds on top of that foundation, original and inherited alike, still point a useful way forward: abandoning compatibility to support Postgres in detail, using introspection to facilitate reasoning about and manipulating database objects directly, record schemata inferred from joins or declared as needed without the maintenance and synchronization burden of model classes, collapsing the distinctions between script files and database functions, and more. But it also includes a lot of decisions made for and in the very different context that entailed a decade ago, and for very different approaches to writing JavaScript as well (it antedates the Promise API!). Some of those decisions can't be grown past in a way that remains recognizably Massive.</p>
<p>For example:</p>
<ul>
<li>An API surface of do-it-all functions like <code>readable.find</code> winds up with a fairly low complexity ceiling that can cover many to most common scenarios, but ultimately can't keep up with plenty of still fairly routine data access tasks that could benefit from dynamic construction in JavaScript.</li>
<li>Because a single function call has to convey everything from sort order to streaming to decomposition and beyond, all manner of functional and organizational purposes get crammed into options objects with little rhyme or reason. Some options are mutually exclusive; others contain arbitrarily complex nested objects and arrays.</li>
<li><a href="https://gitlab.com/dmfay/massive-js/-/issues/738">Transaction clones are extremely heavyweight</a> since they copy and substitute the dedicated connection across the entire database object tree.</li>
<li>CommonJS has become a dead end. I don't feel particularly strongly either way about the relative merits of CJS vs ESM, but I think it's better to pick one, and Node's use of CJS is the odd one out.</li>
</ul>
<p>I started monstrous a few months ago, while working on my fourth or fifth really substantial project with Massive. I'd been finding its limitations harder and harder to ignore, and the many other options available didn't serve my goals either.</p>
<p>I do web stuff but I've no intention of trying to keep up with the Modern Frontend Stack. I support a Postgraphile API at my day job, and have only good things to say about it, but my day job is data architecture and Postgres wrangling on behalf of people who aren't me or even on the same team. GraphQL's a sensible choice there given the coordination and communication requirements in play, but my other projects don't have those pressures and constraints.</p>
<p>And I'm never going to write another model class again if I can help it, so that rules out almost everything in the first two categories. It's true <a href="https://knexjs.org">Knex</a> has always been around and doesn't force you to recapitulate your schema in classes, but if Knex organized my data model to the extent and in the direction I wanted, I'd already have been using it.</p>
<p>That leaves query runners, and if I'm going to use a query runner and maintain my own boilerplate -- well, that's kind of what this is, no?</p>
<p>I'd seen <a href="https://github.com/retro/penkala">Penkala</a> some time ago, and that in turn pointed to <a href="https://www.try-alf.org/blog/2013-10-21-relations-as-first-class-citizen">alf</a>/<a href="https://github.com/enspirit/bmg">bmg</a>. If you're looking for something in Clojure or Ruby respectively you should check them out! The latter two implement a full relational algebra and translate it to the relational calculus of SQL, while Penkala extracts the core principle of composability from that approach -- something SQL has never done well. Other tools try to supply that missing piece, most commonly by supporting technically-separable subqueries, but few go as far as these two. However, I'm already locked in to writing JavaScript for my charmingly retro coupled frontends, so I default to writing it on the server as well.</p>
<p>monstrous takes after those two in emphasizing composability. Everything done to a relation is a contained transformation step: <code>join</code> specifies relation, type, and condition; <code>filter</code>, criteria; <code>project</code>, an output record shape. Each transformation yields a new joined or filtered or projected relation. You can <code>attach</code> any such derived relation to the database just as if it were an original table or view, and reference it in other joins or filters as a subquery.</p>
<p>Moreover, you can use the same relations in reads and writes. Possibly monstrous' most fundamental departure from Massive is the inversion of subject and verb, separating statement construction from execution. With Massive, you could pass a criteria object from a <code>find</code> into an <code>update</code>, although there aren't many reasons to. With monstrous, you can much more usefully <code>select</code> an attached relation here and <code>update</code> it there.</p>
<p>In short: still no models, but if a certain complex product is a common motif in your project, you can define it once and reuse it without repeating the same transformations every time it appears. Attached relations are akin to writable views that respect the object graphs you're working with in client code.</p>
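<p>To make the shape of that idea concrete, here's a deliberately tiny sketch. It is <em>not</em> monstrous' actual API: <code>Relation</code>, <code>where</code>, <code>select</code>, <code>update</code>, and <code>execute</code> are all names invented for illustration, the <code>open</code> column is made up, and the only thing assumed about a connection is that it has a <code>query(text, params)</code> method. Transformations return new relations, verbs turn a relation into statement text plus parameters, and execution is a separate step:</p>
<pre><code class="hljs language-javascript">// Hypothetical sketch: none of these names come from monstrous itself.
class Relation {
  constructor(name, criteria = {}, columns = ['*']) {
    this.name = name;
    this.criteria = criteria;
    this.columns = columns;
    Object.freeze(this); // transformations must build new relations
  }

  // merge new criteria into a copy; the receiver is left untouched
  filter(criteria) {
    return new Relation(this.name, {...this.criteria, ...criteria}, this.columns);
  }

  // swap the output columns on a copy
  project(columns) {
    return new Relation(this.name, this.criteria, columns);
  }
}

// Build a where clause with placeholders numbered after `offset`.
// Names are interpolated unquoted here; real code has to quote identifiers.
function where(relation, offset = 0) {
  const keys = Object.keys(relation.criteria);

  if (keys.length === 0) return {text: '', params: []};

  return {
    text: ' where ' + keys.map((k, i) => `${k} = $${offset + i + 1}`).join(' and '),
    params: keys.map(k => relation.criteria[k])
  };
}

// Construction: each verb turns a relation into statement text and parameters.
function select(relation) {
  const {text, params} = where(relation);

  return {text: `select ${relation.columns.join(', ')} from ${relation.name}${text}`, params};
}

function update(relation, changes) {
  const keys = Object.keys(changes);
  const sets = keys.map((k, i) => `${k} = $${i + 1}`).join(', ');
  const {text, params} = where(relation, keys.length);

  return {
    text: `update ${relation.name} set ${sets}${text}`,
    params: [...keys.map(k => changes[k]), ...params]
  };
}

// Execution: hand a finished statement to anything exposing query(text, params),
// whether that's a pool or a transaction's dedicated connection.
function execute(connection, statement) {
  return connection.query(statement.text, statement.params);
}

// define the relation once...
const local = new Relation('libraries').filter({postcode: '12345'});

// ...then read from it here and write to it there
const read = select(local.project(['id', 'name']));
const write = update(local, {open: false});</code></pre>
<p>Here <code>read.text</code> comes out as <code>select id, name from libraries where postcode = $1</code> and <code>write.text</code> as <code>update libraries set open = $1 where postcode = $2</code>, and neither statement has touched a connection yet. Because the relation carries no connection of its own, nothing about it needs copying when the statement is eventually handed to <code>execute</code>.</p>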
<p>The construction-execution split also means that tasks and transactions, which in Massive deep clone the entire database structure to swap a dedicated connection into each attached relation, instead use a cheap, lightweight class comprising a dozen or so functions and practically no extra state.</p>
<p>For more, check out the <a href="https://gitlab.com/monstrous/monstrous">readme</a> and the <a href="https://gitlab.com/monstrous/monstrous/-/tree/main/test">tests</a>!</p>
<p>As for Massive: it still exists, is still moderately popular going by weekly downloads, and even sees the odd issue or merge request. I'll continue to keep an eye on it into the near future, but I think it's developed about as much as it's going to; certainly <em>I've</em> developed it about as much as I'm going to. If there's interest from any extant contributors or users (email address is up top!) I'll see about spinning it out into its own group/organization and adding maintainers.</p>]]></description>
            <link>https://di.nmfay.com/massive-monstrous</link>
            <guid isPermaLink="true">https://di.nmfay.com/massive-monstrous</guid>
            <pubDate>Sun, 19 Feb 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[PGSQL Phriday #004: Scripting in the Industrial Age]]></title>
            <description><![CDATA[<blockquote>
<p>While we are concentrating on our task, both our tools and our materials merge into one entity of perception which gives us feedback about state and progress of our work.... Software tools <em>generalize</em> our ways of handling aspects of the world around us; they organize our actions and condense them into gestures.</p>
<p>— Reinhard Budde &#x26; Heinz Züllighoven, <cite>Software Tools in a Programming Workshop</cite></p>
</blockquote>
<p>An internal combustion engine isn't a tool but a car can be, although it remains in the immediate material sense something else, a system demanding full bodily integration: you climb in, close the door, buckle your seatbelt, insert and turn the key or press the brake and ignition, move your hand from gearshift (tool for adjusting torque) to wheel (tool for sensing and adjusting orientation). You can do other things while driving, talk or listen or think, and choose how much of your attention and motion to divert to other tasks, some involving yet other tools. But driving itself represents an attentional and physical restriction of some variable but never-zero degree. The car is your pair of seven-league boots, a tool for working with time and distance, at the same time as it's a physical machine you're strapped into and which you successfully use only by altering your own thinking, perception, even your sense of your own mass, shape and size, velocity, and inertia.</p>
<p>Budde and Züllighoven on machines in this sort of activity-theoretical sense:</p>
<blockquote>
<p>A <em>machine</em> is <em>repeatable motion</em> which is abstracted from its specific context and cast into construction.... [It] incorporates and reproduces the <em>mechanical reproduction</em> of human activities. It thus <em>decontextualizes</em> human activities.</p>
</blockquote>
<p>You bring your tools to the work; you take your work to the machine. Machines may afford their operators the power of many tools and the speed of automation and sometimes parallelization, but also embody a fixity of purpose, an integrated way of conceiving and acting on the work materials that resists or forbids working in other ways. Tools and machines can even have intersecting domains: a pneumatic wrench is a machine component, fixed in place by its air hose, but does the same job as an ordinary manual wrench. When you have a sufficient number of things to loosen or tighten, other goals to accomplish related to the loosening or tightening, and/or reasons to guarantee a minimum torque, it's worth the extra effort of moving the work as well as the worker.</p>
<p>The classic example, which Budde and Züllighoven mention in passing on to a broader point about how automation makes humans and machines interchangeable, is the fixed station of an assembly-line worker, who may pick up and put down physical tools in the course of producing outputs from the inputs they're provided (this is, naturally, at the scale of individual human beings: the assembly line itself is a machine producing cars from the outputs of other materials processing machines, the auto company a machine generating capital from labor and extracting surplus value, the stock market a machine redirecting flows of capital, each hosting masses of people whose activity is governed and directed not by their own desires but by the social and physical construction of the line, the office, the trading floor). Integrated development environments are, like cars, a more ambiguous case. They're tools at the scale or in the context of software systems writ large, while in that of computer use they're machines you enter into that require nearly full mental or attentional integration to repeat the motions of type-build-test-package-run-debug operating its constituent smaller machines and tools.</p>
<blockquote>
<p>While using software in our work, we wish to <strong>handle</strong> it like a tool; but while constructing it, we wish to <strong>design</strong> its parts like a machine.</p>
</blockquote>
<p>In database work, Microsoft's SQL Server Management Studio illustrates this tension well. SSMS is an IDE for database administrators, architects, and analysts alike, and hews to about the same general outline as any other: here's your left-side tree with context menu upon context menu of tools for managing your tables and views and functions and roles, here to edit the definition, there to print out DDL; here's an SQL interpreting machine, we'll put results or errors at the bottom. From the distance at which the DBA is forty person-hours and the database is a cylinder on an architecture diagram, SSMS is itself a tool for designing and detailing what that cylinder represents, but the activity of using it is machinic: you go to SSMS to perform database work. Before its release with SQL Server 2005, you would go to Enterprise Manager to define the schema and administer the server, or to Query Analyzer to write and run SQL. SSMS integrated both predecessor machines into a more consistent whole.</p>
<p>And SSMS was great! Even with one foot in application development I usually had both its predecessors open alongside Visual Studio anyway. The more interesting part of this historical digression is what <em>resisted</em> integration -- most notably ETL/ingestion suite SQL Server Integration Services (SSIS, née Data Transformation Services) and Profiler, which captures statement text and parameters on execution through a dizzying array of configurable filters.</p>
<p>Postgres doesn't have an SSIS. That's a good thing: even if the community wanted to support an official ETL machine, it's a bad direction to go for an open source project, with an unbounded and infinitely edge-cased panoply of input specifications. Controlling more of the backend can add value for commercial DBMSs, but for Postgres there's nothing to be gained. It's an interesting contrast with schema migration, where <em>nobody</em> has an official change management system, but that's getting beside the point.</p>
<p>Profiler, though, I miss almost every day. And although the distinction between tool and machine can be an especially slippery one for programs, it's more tool-like in use: it assists with whatever <em>other</em> thing you're doing that you need to peek at database activity rather than organizing tasks into a workflow, and you pick it up when you need it and put it away when you're done, or in other words, it's "ready to hand" rather than being a system you step into and operate. As an application developer it gave me instant insight into what I'd actually communicated to the database. Its filters were more customizable and more powerful than any reasonable <code>grep</code> invocation. Best of all, I could start and stop tracing without touching or knowing anything about the server's log settings or having SSH access.</p>
<p><code>pg_stat_statements</code> of course exists, <a href="https://github.com/lesovsky/pgcenter">pgcenter</a> has a <code>top</code>-style view that's of some use in tracking down frequent long-running statements, and EDB have an SQL profiler module that installs server-side, but there's nothing even approaching a 1:1 equivalent <em>client</em> program as far as I know.</p>
<blockquote>
<p>A <em>programming environment viewed as a workshop</em> offers a set of tools, but does not implement an overall strategy of software development. However, it may be used to automate a selected set of familiar and routine activities (such as <em>change management</em> or <em>compilation</em>).</p>
<p>Users define working processes by drawing on their knowledge of tasks, materials and tools. The programming workshop "surrounds" the user with sets of tools and automata [machines with hidden internal processes that "appear as machines when in use"], each with its own specific application and suitability for a particular type of material.</p>
</blockquote>
<p>Application developers use a lot of tools, but even when those tools don't come pre-integrated into an application development machine like an IDE, they build these machines for themselves anyway; the workshop is an environment which facilitates their design and construction ad hoc. Such machines might be distributed across multiple programs -- editor, shell, compiler, linker, debugger, version control -- each individually a tool or a smaller machine bringing tools together for a single purpose, connected and mechanized into an inhabitable whole in order to speed up and standardize the motions of software development: that is, the industrial production of software machines from other software machines. Developers use their meta-machines to combine machines for data access, machines for rendering text or graphics, machines for telling time or hashing strings or an infinite variety of other purposes into new machines that meet their own or their organization's goals.</p>
<p>And here, a thousand and some words in, I make it to <a href="http://www.pgsqlphriday.com/2022/12/pgsql-phriday-004/">the prompt</a>. Database workers have plenty of machines at which we do our database-work, vast and comprehensive like SSMS or small and simple like psql, which repeats the motion read-evaluate-print and yields to external editing machines, source and destination machines connected through pipes, and tools like <code>less</code> or <a href="https://github.com/okbob/pspg">pspg</a> when the user performs a different or a specialized task that isn't its core competency. pgTAP is another machine that exercises a database according to its input, a player piano that detects its own off notes. It's one of the few we have that connects to developers' meta-machines. Efforts to bridge the chasm from the other side have so far mostly resulted in pared-down implementations of the SSMS-type being bolted into their IDEs.</p>
<p>And database workers' tools?</p>
<p>Well, what <em>are</em> our tools? Profiler, there's one, SQL Server's virtual microscope. When it comes to Postgres, of course, we have to attach the <code>pg_stat_statements</code> machine to it or make do with SSH and <code>grep</code>, not database tools specifically. There are a smattering of mostly operations-focused tools like <a href="https://pgbackrest.org">pgbackrest</a> or <a href="https://postgresqlco.nf">postgresqlco.nf</a>. Otherwise, we have SQL scripts: a script to calculate bloat, a script to check index statistics, a script to report outputs or patch up recurrent data quality issues or populate static tables.</p>
<p>We have so many of these tools it's difficult to keep track of them all.</p>
<p>We don't have a standard way to organize or remember or even name most of them.</p>
<p>We don't have a dedicated infrastructure to share and update and standardize them, except for one specific class of tool/machine: extensions have <a href="https://pgxn.org">pgxn</a>. And new extensions face an uphill climb to widespread adoption as more database workloads shift to cloud providers which allow them only on a case-by-case basis.</p>
<p>Most of all we lack simple, well-defined ways for anyone else to <em>use</em> our tools without requiring them first to step out of their machines and into ours.</p>
<p>It reminds me a lot of (what I saw of) the state of *nix admin before most distros standardized on systemd. Linux had and has init daemons aplenty: SysVinit, upstart, runit, and more. Most of them orchestrate assemblages of more or less glorified shell scripts. The computer boots and starts process id 1, which in turn rummages around in /etc and runs anything that looks like it needs running, prioritized however the daemon prioritizes. Want to kick off some long-running service on boot? Write a shell script, season to your init daemon's taste, and drop it somewhere in /etc/init or /etc/rc.d. Every software vendor and every sysadmin seemed to have a slightly different approach to the infinite possibilities of upstart's <code>script</code> block or SysV full stop. Every boot rebuilt the runtime configuration -- the operating system machine -- from scratch by the automatic application of heterogeneous tool after heterogeneous tool. More than once I found myself in the shoes of the broomstick-multiplying sorcerer's apprentice of the poem as my adjustments went horribly awry.</p>
<p>These init systems, not unlike psql, are small, simple machines which defer to external tools wherever possible. systemd, meanwhile, integrated several other machines and components like login, networking, logging, and cron into a relatively maximalist operating system orchestrator. It restricted the infinite customizability of init scripts and more or less unified those several disparate ways of working, sacrificing "do one thing well" for "do many common things ~consistently". This, naturally, cuts in several directions, but from my perspective as an occasional or dilettante sysadmin it's been a huge improvement even only on grounds that my knowledge of service management and troubleshooting on Arch carries over to Ubuntu or Fedora out of the box. Instead of learning how to hand-assemble this particular Rolls, I can drive off the lot right away in a more basic car and get to my own goals immediately.</p>
<p>It's those goals that determine the contents of my SQL toolbox, same as with everyone else. The tools in it are not all created equal; some are inevitably too tightly bound to the specific context they originate from to justify adding them to the standard kit. But in other situations, having one decent answer is better than having five great answers, and some tools can usefully be mechanized, standardized, centralized. The trick is identifying them: which ones help database workers avoid reinventing wheels and integrate easily and usefully into other workers' machines?</p>
<p>Some of the tools I've written:</p>
<ul>
<li>having thrown out too many brand new and already outdated entity-relationship diagrams ever to want to draw another one, I <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/fks.zsh">used Graphviz and an image-capable terminal</a> to <a href="https://di.nmfay.com/exploring-databases-visually">explore the foreign key graph</a>. There's a similar script that analyzes <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/deps.zsh">view dependencies</a>, but fks.zsh is the star of that show. As shell functions, they're available anywhere (there's that readiness-to-hand again), show the view from whatever vantage point you select, and get out of the way.</li>
<li>also in zsh, an autocompleting <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/sql.zsh">SQL file runner</a> over a directory of scripts organized by database. This came in especially handy when I dealt with a lot of database dumps with other environments' security and FDW settings: bake the <code>alter</code> statements into a script one time, then <code>sql dbname post-restore.sql</code> forever after. I use <a href="https://syncthing.net">Syncthing</a> to keep scripts consistent across computers.</li>
<li>while dealing with the pain of multiple codebases interacting over multiple evolving database schemas, I developed an automated build module that <a href="https://ectomigo.com">indexes data access code and checks the blast radius of migration scripts across the entire organization</a>. ectomigo is more a machine -- it centralizes the repeated motions of syntax analysis and comparison for individual connected repositories -- but itself connects to machines developers already use via review comments.</li>
<li>and I've built a few pgTAP checks for work recently (validating things like <a href="https://www.graphile.org/postgraphile/smart-comments/">object comments</a> and row-level security status) I should probably look at upstreaming in the new year.</li>
</ul>
<p>I'd love for tools like pspg, <a href="https://github.com/sjstoelting/pgsql-tweaks">pgsql-tweaks</a>, or the <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/psqlrc#L28">scripts we've all copied</a> out of the <a href="https://wiki.postgresql.org/wiki/Main_Page">Akashic records</a>, and machines like pgcenter to become more integrated into the psql or Postgres machines. Not necessarily in the strict software sense of sharing repo space or aligning to Postgres' own release cycle (e.g. pgsql-tweaks belongs in core, but compatibility beyond Postgres is a semi-explicit goal of pspg) -- smaller projects are much more nimble. But I think there could be a role for the Postgres <em>social</em> machine to play even for the really independent projects in its orbit. There's a lot of redundant work that has to happen, such as packaging for different distros and operating systems, that right now happens as each project's maintainers have time, awareness of the need, and the resources necessary to fulfill it. A centralizing strategy could eliminate or at least contain a lot of that redundancy and make useful tools and affordances much more widely available to database workers and downstream developers alike.</p>]]></description>
            <link>https://di.nmfay.com/pgsql-phriday-scripting-in-the-industrial-age</link>
            <guid isPermaLink="true">https://di.nmfay.com/pgsql-phriday-scripting-in-the-industrial-age</guid>
            <pubDate>Fri, 06 Jan 2023 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Hanukkah of Data 2022/5783]]></title>
            <description><![CDATA[<p>Eight days, <a href="https://hanukkah.bluebird.sh/5783/">eight data analysis puzzles</a>, eight solutions. After working out the password I imported the SQLite database into Postgres the <a href="https://stackoverflow.com/a/25924065">simplest possible way</a> (with a couple of tweaks at the end of the giant <code>sed</code> replacer; <code>items.array</code> seems to have been replaced by the <code>orders_items</code> junction table):</p>
<pre><code class="hljs language-sh">createdb hanukkah
sqlite3 noahs.sqlite .dump | sed -e <span class="hljs-string">'s/INTEGER PRIMARY KEY AUTOINCREMENT/SERIAL PRIMARY KEY/g;s/PRAGMA foreign_keys=OFF;//;s/unsigned big int/BIGINT/g;s/UNSIGNED BIG INT/BIGINT/g;s/BIG INT/BIGINT/g;s/UNSIGNED INT(10)/BIGINT/g;s/BOOLEAN/SMALLINT/g;s/boolean/SMALLINT/g;s/UNSIGNED BIG INT/INTEGER/g;s/INT(3)/INT2/g;s/DATETIME/TIMESTAMP/g;s/desc text/description text/g;s/items array/items text/g'</span> | psql hanukkah</code></pre>
<p>The key fields in <code>orders</code> got turned into text somewhere along the line but that's easily fixed with <code>alter table orders alter column x type int using x::int</code>.</p>
<p>I also imposed the following completely arbitrary constraints on myself:</p>
<ul>
<li>read only, no changing information or writing intermediary data.</li>
<li>produce exactly the target information, no extra rows or columns.</li>
<li>do it in a single DML statement (common table expressions and subqueries okay).</li>
</ul>
<h2 id="day-one-beehive">day one: beehive</h2>
<p>This is a fun one! We represent the number:letter correspondence with a common table expression, unnest the customer's last name (all customers have only a first and last name, no variations) into another table-like object, then join our keypad-simulating CTE to find the one customer whose last name converted into a phone number <em>is</em> their phone number.</p>
<pre><code class="hljs language-sql">with keys (num, vals) as (
  values
    (2, string_to_array('abc',  null)), <span class="hljs-comment">-- null delimiter splits each character</span>
    (3, string_to_array('def',  null)),
    (4, string_to_array('ghi',  null)),
    (5, string_to_array('jkl',  null)),
    (6, string_to_array('mno',  null)),
    (7, string_to_array('pqrs', null)),
    (8, string_to_array('tuv',  null)),
    (9, string_to_array('wxyz', null))
)
<span class="hljs-keyword">select</span> customers.phone
<span class="hljs-keyword">from</span> customers
<span class="hljs-keyword">join</span> lateral unnest(
  string_to_array(
    <span class="hljs-comment">-- get just the last name; Postgres uses 1-based indexing for arrays</span>
    (regexp_split_to_array(<span class="hljs-keyword">lower</span>(customers.name), <span class="hljs-string">'\s'</span>))[<span class="hljs-number">2</span>],
    <span class="hljs-literal">null</span>
  )
<span class="hljs-comment">-- `with ordinality` is exactly what it sounds like: tack a numeric index on,</span>
<span class="hljs-comment">-- which string_agg() can use to keep the individual letters sorted; order is</span>
<span class="hljs-comment">-- not otherwise guaranteed!</span>
) <span class="hljs-keyword">with</span> <span class="hljs-keyword">ordinality</span> <span class="hljs-keyword">as</span> namearr (v, i) <span class="hljs-keyword">on</span> <span class="hljs-literal">true</span>
<span class="hljs-keyword">join</span> <span class="hljs-keyword">keys</span> <span class="hljs-keyword">on</span> vals @> <span class="hljs-built_in">array</span>[namearr.v]
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> customers.phone
<span class="hljs-keyword">having</span> regexp_replace(customers.phone, <span class="hljs-string">'-'</span>, <span class="hljs-string">''</span>, <span class="hljs-string">'g'</span>) =
  string_agg(keys.num::<span class="hljs-built_in">text</span>, <span class="hljs-string">''</span> <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> namearr.i);</code></pre>
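<p>For comparison outside the database, the same keypad check can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the customer in the usage note below is invented rather than taken from the puzzle data.</p>
<pre><code class="hljs language-python"># Hypothetical sketch of the keypad lookup; nothing here comes from the puzzle data.
KEYPAD = {
    2: "abc", 3: "def", 4: "ghi", 5: "jkl",
    6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz",
}

# Invert the keypad so each letter points at its digit.
LETTER_TO_DIGIT = {
    letter: str(num) for num, letters in KEYPAD.items() for letter in letters
}


def name_to_digits(last_name):
    """Spell a name on the keypad, skipping anything that is not a letter."""
    return "".join(
        LETTER_TO_DIGIT[ch] for ch in last_name.lower() if ch in LETTER_TO_DIGIT
    )


def phone_spells_last_name(name, phone):
    """True when the second word of the name keys in as the phone number."""
    last_name = name.split()[1]
    return name_to_digits(last_name) == phone.replace("-", "")</code></pre>
<p>With a made-up customer, <code>phone_spells_last_name("Example Tannenbaum", "826-636-2286")</code> holds, since "tannenbaum" keys in as 8266362286. Iterating over the string keeps the letters in order, which is the job <code>with ordinality</code> does in the SQL version.</p>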
<h2 id="day-two-snail">day two: snail</h2>
<p>Noah's is <em>not</em> selling enough coffee to be worth the effort involved, and this makes those who do order it easily findable with just a couple other dimensions to search on.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> c.phone <span class="hljs-keyword">from</span> customers <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o <span class="hljs-keyword">using</span> (customerid)
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">join</span> products <span class="hljs-keyword">as</span> p <span class="hljs-keyword">using</span> (sku)
<span class="hljs-keyword">where</span> c.name <span class="hljs-keyword">like</span> <span class="hljs-string">'J% D%'</span>
  <span class="hljs-keyword">and</span> <span class="hljs-keyword">extract</span> (<span class="hljs-keyword">year</span> <span class="hljs-keyword">from</span> ordered) = <span class="hljs-number">2017</span>
  <span class="hljs-keyword">and</span> p.description <span class="hljs-keyword">ilike</span> <span class="hljs-string">'coffee,%'</span>;</code></pre>
<h2 id="day-three-spider">day three: spider</h2>
<p>Another "three clues, three predicates, one result" puzzle; no need even to check for orders having occurred more recently.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> phone
<span class="hljs-keyword">from</span> customers
<span class="hljs-comment">-- subtracting two from the year lines us up with the zodiacal dog; other animal</span>
<span class="hljs-comment">-- years won't divide evenly by 12</span>
<span class="hljs-keyword">where</span> ((<span class="hljs-keyword">extract</span>(<span class="hljs-keyword">year</span> <span class="hljs-keyword">from</span> birthdate::<span class="hljs-built_in">date</span>) - <span class="hljs-number">2</span>) / <span class="hljs-number">12</span>)::<span class="hljs-built_in">int</span> =
       (<span class="hljs-keyword">extract</span>(<span class="hljs-keyword">year</span> <span class="hljs-keyword">from</span> birthdate::<span class="hljs-built_in">date</span>) - <span class="hljs-number">2</span>) / <span class="hljs-number">12</span>
  <span class="hljs-keyword">and</span> to_char(birthdate::timestamptz, <span class="hljs-string">'MMDD'</span>)::<span class="hljs-built_in">int</span> <span class="hljs-keyword">between</span> <span class="hljs-number">0320</span> <span class="hljs-keyword">and</span> <span class="hljs-number">0420</span>
  <span class="hljs-keyword">and</span> citystatezip = <span class="hljs-string">'South Ozone Park, NY 11420'</span>;</code></pre>
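<p>The divisibility test is the long way of writing a modulo check. A minimal Python sketch of the two date predicates follows, with hypothetical function names. The plain year arithmetic ignores the lunar new year, which falls in late January or February; that is harmless here only because the birthday window sits in March and April.</p>
<pre><code class="hljs language-python">from datetime import date


def is_dog_year(year):
    """2018 was a year of the dog, so dog years sit two past a multiple of 12."""
    return (year - 2) % 12 == 0


def in_birthday_window(day):
    """Mirror the SQL predicate: MMDD as an integer between 0320 and 0420."""
    mmdd = day.month * 100 + day.day
    return mmdd in range(320, 421)


def matches(birthdate):
    """Both predicates together; birthdate is a datetime.date."""
    return is_dog_year(birthdate.year) and in_birthday_window(birthdate)</code></pre>
<p>For example, <code>matches(date(1994, 4, 1))</code> is true, while the same day in 1995 fails the year check.</p>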
<h2 id="day-four-owl">day four: owl</h2>
<p>Some refining of predicates involved in this one but it's still pretty straightforward to solve after a quick peek at the products table to find out how sku prefixes work: only two people have ever bought bakery items between 4 and 5 am, and only one of them makes a habit of it.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> c.phone
<span class="hljs-keyword">from</span> customers <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o <span class="hljs-keyword">using</span> (customerid)
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">where</span> oi.sku <span class="hljs-keyword">ilike</span> <span class="hljs-string">'bky%'</span>
  <span class="hljs-keyword">and</span> numrange(<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-string">'[)'</span>) @> <span class="hljs-keyword">extract</span>(<span class="hljs-keyword">hour</span> <span class="hljs-keyword">from</span> o.ordered)
  <span class="hljs-keyword">and</span> numrange(<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-string">'[)'</span>) @> <span class="hljs-keyword">extract</span>(<span class="hljs-keyword">hour</span> <span class="hljs-keyword">from</span> o.shipped)
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c.phone
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">desc</span>
<span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>;</code></pre>
<h2 id="day-five-koala">day five: koala</h2>
<p>Only one person has ever bought cat food more than one time, so we could use <code>having</code> and omit the <code>limit</code> entirely (we could also have done this yesterday), but someone might make a repeat purchase tomorrow so order-limit is a more reliable solution.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> phone
<span class="hljs-keyword">from</span> customers <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o <span class="hljs-keyword">using</span> (customerid)
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">join</span> products <span class="hljs-keyword">as</span> p <span class="hljs-keyword">using</span> (sku)
<span class="hljs-keyword">where</span> c.citystatezip <span class="hljs-keyword">ilike</span> <span class="hljs-string">'queens village%'</span>
  <span class="hljs-keyword">and</span> oi.sku <span class="hljs-keyword">ilike</span> <span class="hljs-string">'pet%'</span>
  <span class="hljs-keyword">and</span> p.description <span class="hljs-keyword">ilike</span> <span class="hljs-string">'%cat%'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> phone
<span class="hljs-comment">-- count number of orders, not number of items bought</span>
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">count</span>(<span class="hljs-keyword">distinct</span> o.orderid) <span class="hljs-keyword">desc</span>
<span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>;</code></pre>
<h2 id="day-six-squirrel">day six: squirrel</h2>
<p>This one was far and away my worst score (20 attempts over four hours from opening the puzzle, although I probably only spent somewhere between one and two of those hours actually trying to solve it) because I got complacent and didn't think through computing savings. I initially tested order price vs wholesale price, i.e. <em>margin</em>, and went up a blind alley involving window functions trying to detect changes in order behavior. When I subtracted paid price from the maximum ever paid for each product I got an unambiguous result: one person has lifetime savings greater than their spending.</p>
<pre><code class="hljs language-sql">with max_prices as (
  <span class="hljs-keyword">select</span> p.sku, <span class="hljs-keyword">max</span>(oi.unit_price) <span class="hljs-keyword">as</span> price
  <span class="hljs-keyword">from</span> products <span class="hljs-keyword">as</span> p
  <span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (sku)
  <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> p.sku
)
<span class="hljs-keyword">select</span> c.phone
<span class="hljs-keyword">from</span> customers <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o <span class="hljs-keyword">using</span> (customerid)
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">join</span> max_prices <span class="hljs-keyword">as</span> p <span class="hljs-keyword">using</span> (sku)
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c.customerid, c.name, c.phone
<span class="hljs-comment">-- the standard maximum price * quantity is what _would_ have been paid without</span>
<span class="hljs-comment">-- any discounts or coupons</span>
<span class="hljs-keyword">having</span> <span class="hljs-keyword">sum</span>(p.price * oi.qty - oi.unit_price * oi.qty) > <span class="hljs-keyword">sum</span>(oi.unit_price * oi.qty);</code></pre>
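<p>The savings arithmetic is easier to see on a handful of rows. The sketch below uses invented order lines and hypothetical names; it treats the highest price ever paid for a sku as the undiscounted price, the same assumption the query makes.</p>
<pre><code class="hljs language-python">from collections import defaultdict

# Invented order lines: (customer, sku, qty, unit_price).
ORDER_ITEMS = [
    ("alice", "A1", 1, 10.0),
    ("alice", "B2", 2, 4.0),
    ("bob", "A1", 3, 4.0),
    ("bob", "B2", 1, 1.0),
]


def big_savers(rows):
    """Customers whose lifetime savings exceed their lifetime spending."""
    # Highest price ever paid per sku stands in for the list price.
    max_price = defaultdict(float)
    for _, sku, _, price in rows:
        max_price[sku] = max(max_price[sku], price)

    spent = defaultdict(float)
    saved = defaultdict(float)
    for customer, sku, qty, price in rows:
        spent[customer] += price * qty
        saved[customer] += (max_price[sku] - price) * qty

    return sorted(c for c in spent if saved[c] > spent[c])</code></pre>
<p>Here alice always pays the maximum, so she saves nothing against 18 spent; bob spends 13 and saves 21, so <code>big_savers(ORDER_ITEMS)</code> returns only bob.</p>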
<h2 id="day-seven-toucan">day seven: toucan</h2>
<p>Self-joining orders within a reasonable time window and filtering for different skus with similar descriptions (colors are always parenthesized) yields one match to an order from the customer in the previous puzzle.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> c.phone
<span class="hljs-keyword">from</span> orders <span class="hljs-keyword">as</span> o1
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi1 <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">join</span> products <span class="hljs-keyword">as</span> p1 <span class="hljs-keyword">using</span> (sku)
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o2
  <span class="hljs-keyword">on</span> date_trunc(<span class="hljs-string">'day'</span>, o2.ordered) = date_trunc(<span class="hljs-string">'day'</span>, o1.ordered)
  <span class="hljs-keyword">and</span> o2.ordered <span class="hljs-keyword">between</span> o1.ordered - <span class="hljs-built_in">interval</span> <span class="hljs-string">'1 hour'</span> <span class="hljs-keyword">and</span> o1.ordered + <span class="hljs-built_in">interval</span> <span class="hljs-string">'1 hour'</span>
  <span class="hljs-keyword">and</span> o2.customerid &#x3C;> o1.customerid
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi2 <span class="hljs-keyword">on</span> oi2.orderid = o2.orderid
<span class="hljs-keyword">join</span> products <span class="hljs-keyword">as</span> p2 <span class="hljs-keyword">on</span> p2.sku = oi2.sku
<span class="hljs-keyword">join</span> customers <span class="hljs-keyword">as</span> c <span class="hljs-keyword">on</span> c.customerid = o2.customerid
<span class="hljs-keyword">where</span> o1.customerid = <span class="hljs-number">8342</span>
  <span class="hljs-keyword">and</span> p1.sku &#x3C;> p2.sku
  <span class="hljs-keyword">and</span> regexp_replace(p1.description, <span class="hljs-string">'\([^)]+\)'</span>, <span class="hljs-string">''</span>) =
      regexp_replace(p2.description, <span class="hljs-string">'\([^)]+\)'</span>, <span class="hljs-string">''</span>);</code></pre>
<h2 id="day-eight-snake">day eight: snake</h2>
<p>Another simple slicing problem to wrap it up: join everything in, filter for product descriptions, count, grab the highest.</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">select</span> c.phone
<span class="hljs-keyword">from</span> customers <span class="hljs-keyword">as</span> c
<span class="hljs-keyword">join</span> orders <span class="hljs-keyword">as</span> o <span class="hljs-keyword">using</span> (customerid)
<span class="hljs-keyword">join</span> orders_items <span class="hljs-keyword">as</span> oi <span class="hljs-keyword">using</span> (orderid)
<span class="hljs-keyword">join</span> products <span class="hljs-keyword">as</span> p <span class="hljs-keyword">using</span> (sku)
<span class="hljs-keyword">where</span> p.description <span class="hljs-keyword">ilike</span> <span class="hljs-string">'noah%'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c.name, c.phone
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">desc</span>
<span class="hljs-keyword">limit</span> <span class="hljs-number">1</span>;</code></pre>
<h2 id="retrospectively">retrospectively</h2>
<p>I had fun! Most of the puzzles wound up being much more straightforward than I'd hoped, but then it's tough to come up with a reasonable challenge at the novice to intermediate level that isn't rendered trivial by expertise with a tool purpose-built for exactly this kind of information work. Other people are tackling this with VisiData, Excel, jq, or whatever else (I both want to see and absolutely do not want to solve day one in jq). That first puzzle set the bar <em>super</em> high, though, and variations on your basic join-where-sort-limit query had a really tough time following it. Honorable mention to days six and seven; it feels like on puzzle definition alone the smoothest difficulty curve in SQL would've been something like 2-8-3-4-5-7-6-1.</p>]]></description>
            <link>https://di.nmfay.com/hanukkah</link>
            <guid isPermaLink="true">https://di.nmfay.com/hanukkah</guid>
            <pubDate>Mon, 26 Dec 2022 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[ectomigo: Safer Schema Migrations]]></title>
            <description><![CDATA[<p>The team I work with at my day job maintains many applications and processes interacting across a smaller number of databases. This is hardly exceptional. We are also constantly adding, subtracting, and refining not only the client programs but also the database schemas themselves. This too is hardly exceptional: business requirements change, external systems expose new information and deprecate old interfaces, von Moltke's Law ("no plan of operations remains certain once the armies have met") comes calling. Every now and again we just make a modeling or implementation mistake that manages to sneak through review and up to production. Sic semper startups.</p>
<p>So our database schemas are continually evolving. Each of those many applications and processes has to evolve along with them, or we get paged when the renamed column or dropped table breaks something we hadn't accounted for, and instant breakage is the <em>best</em> case. We've had schema incompatibilities lie in wait for over a month to catch us completely flatfooted. The complexities of even a single moderately-sized codebase are beyond the grasp of human memory. What hope do we have of recalling which relevant subset of database interactions appear where across two or ten or more?</p>
<p>What we need is a distinctly <em>inhuman</em> memory, one for which summoning up each and every reference to a changing table or view takes a moment's effort, and which cannot forget. A memory which operates at the level of the organization, rather than that of the project or of the individual developer/reviewer who can only focus on a single target at a time. A memory we can consult when, or better yet before, code is ready to deploy -- "<a href="https://en.wikipedia.org/wiki/Shift-left_testing">shifting left</a>", as they say.</p>
<p>We need a database.</p>
<p><a href="https://ectomigo.com">I built one</a>.</p>
<p><img src="https://di.nmfay.com/images/ectomigo.png" alt="a schema migration alters a table, renaming a column; ectomigo leaves a GitHub review comment pointing out references to that table in two repositories. Each reference includes the columns ectomigo has been able to identify. One reference uses the column&#x27;s new name, indicating it&#x27;s been updated, but another in the second repository still uses the old name and must be fixed."></p>
<p>ectomigo is a continuous integration module (initially a <a href="https://github.com/ectomigo/ectomigo">GitHub action</a>) which parses your source files using <a href="https://tree-sitter.github.io/tree-sitter/">tree-sitter</a> to find data access code: SQL scripts and inline SQL in Java, JavaScript, and Python; <a href="https://massivejs.org">MassiveJS</a> calls; and <a href="https://www.sqlalchemy.org">SQLAlchemy</a> definitions. More languages, data access patterns, analysis features, and platform support are on the way after launch. Everything it finds it indexes, storing database object names and the file row-column positions of each reference.</p>
<p>When you submit schema changes for review, it parses <em>that</em> code as well, and matches the targets you're altering or dropping against every codebase your organization has enabled. If it does find any matches -- in other words, you still have live references to an affected database object, in this or another repository -- it leaves review comments alerting you to each potential problem.</p>
<p>ectomigo is <a href="https://github.com/ectomigo/ectomigo">launching on GitHub</a> free for public and up to two private projects, with <a href="https://ectomigo.com/pricing">pricing available</a> beyond that. The action code and the <a href="https://gitlab.com/ectomigo/core">core</a> code analysis library it integrates are open under the AGPL should you be interested in that.</p>
<p>We've been using early ectomigo builds at my workplace for a couple of months now, and it's already saved our bacon a few times with reports on database object usage in places we'd forgotten. If you're faced with migration risk yourself, I hope it can help you.</p>]]></description>
            <link>https://di.nmfay.com/ectomigo</link>
            <guid isPermaLink="true">https://di.nmfay.com/ectomigo</guid>
            <pubDate>Tue, 29 Mar 2022 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Exploring Databases Visually]]></title>
            <description><![CDATA[<p>In <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/fks.zsh">"things you can do with a terminal emulator that renders images"</a>:</p>
<p>One way to look at a database's structure is as a graph of foreign key relationships among tables. Two styles of visual representation predominate: models or entity-relationship diagrams (ERDs) created as part of requirements negotiation and design, and descriptive diagrams of an extant database. The former are drawn by hand on a whiteboard or in diagramming software; the latter are often generated by database management tools with some manual cleanup and organization. Both styles usually take the complete database as their object, and whether descriptive or prescriptive, their role in the software development process is as reference material, or documentation.</p>
<p>Documentation isn't disposable. Even though these diagrams are out of date practically as soon as they're saved off, they take effort to make, or at least to make legible -- automated tools are only so good at layout, especially as table and relationship counts grow. That effort isn't lightly discarded, and anyway a diagram that's still <em>mostly</em> accurate remains a useful reference.</p>
<p>Documentation isn't disposable. But documentation isn't the only tool we have for orienting ourselves in a system: we can also explore, view the system in parts and from different angles, follow individual paths through the model from concept to concept. Exploration depends on adopting a partial, mobile perspective from the inside of the model, with rapid feedback and enough context to navigate but not so much as to be overwhelmed. The view from a single point is more or less important depending on the point itself, but in order to facilitate exploration that view has to be generated and discarded on demand. Look, move, look, move.</p>
<p>This is a partial perspective of the <a href="https://github.com/devrimgunduz/pagila">pagila</a> sample database, from the table <code>film</code>:</p>
<p><img src="https://di.nmfay.com/images/fks-film.png" alt="the &#x22;film&#x22; table in a graph showing its dependence via foreign key on the &#x22;language&#x22; table, and other tables&#x27; dependencies on &#x22;film&#x22;. A film has corresponding records in &#x22;film_actor&#x22; and &#x22;film_category&#x22; (junction tables, to &#x22;actor&#x22; and &#x22;category&#x22; tables not shown in this partial perspective); copies of a film are in &#x22;inventory&#x22;; inventory items in turn are referenced in &#x22;rental&#x22;; and rentals turn up in a set of &#x22;payment&#x22; tables partitioned by month."></p>
<p>It's generated by <a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/fks.zsh">this <code>fks</code> zsh function</a> which queries Postgres' catalog of foreign keys using a <a href="https://www.citusdata.com/blog/2018/05/15/fun-with-sql-recursive-ctes/">recursive common table expression</a> to identify and visualize everything connected in a straight line to the target. The query output is passed to the <a href="https://graphviz.org">Graphviz suite's <code>dot</code></a> with a template, rendered to png, and the png displayed with <a href="https://wezfurlong.org/wezterm/"><code>wezterm imgcat</code></a>. No files are created or harmed at any point in the process.</p>
<p>Why only a straight line, though? The graph above has obvious gaps: <code>film_actor</code> implies an <code>actor</code>, and <code>film_category</code> its own table on the other side of the junction. <code>inventory</code> probably wants a <code>store</code>, and <code>rental</code> and the payment tables aren't much use without a <code>customer</code>. The view from <code>rental</code> is markedly different, with half a dozen tables that weren't visible at all from <code>film</code>:</p>
<p><img src="https://di.nmfay.com/images/fks-rental.png" alt="a perspective on the pagila sample database from the &#x22;rental&#x22; table. The same &#x22;payment&#x22; tables depend on it, but upstream &#x22;inventory&#x22; is joined by &#x22;customer&#x22; and &#x22;staff&#x22;, and further up &#x22;store&#x22;, &#x22;address&#x22; (relating to customers, staff, and stores), &#x22;city&#x22;, and &#x22;country&#x22; tables. &#x22;Film&#x22; and &#x22;language&#x22; are also present upstream from &#x22;inventory&#x22;."></p>
<p>This graph is familiar in part: there's <code>rental</code> itself, the payment tables, <code>inventory</code>, <code>film</code> -- the last shorn of the junctions to the still-missing <code>actor</code> and <code>category</code> tables. Those have passed around a metaphorical corner, since in order to get from <code>rental</code> to <code>film_actor</code> you must travel first <em>up</em> foreign keys into <code>film</code> (via <code>rental.inventory_id</code> and <code>inventory.film_id</code>), then <em>down</em> by way of <code>film_actor.film_id</code>. <code>language</code>, meanwhile, is "upwards" of <code>film</code> and therefore remains visible from <code>rental</code>.</p>
<p>The reason <code>fks</code> restricts its search to straight lines from the target table is to keep context narrow. You can get a fuller picture of the table structure by navigating and viewing the graph from multiple perspectives; what <code>fks</code> shows is the set of tables which can affect the target, or which will be affected by changes in the target. If you delete a <code>store</code> or a <code>film</code>, rentals from that store or of that film are invalidated (and, unless the intermediary foreign keys are set to cascade, the delete fails). But deleting a <code>film_actor</code> has nothing to do with <code>rental</code>, and vice versa.</p>
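<p>The straight-line rule can be sketched without a database. The Python below is an illustration only, not the <code>fks</code> implementation (which is a recursive CTE over the Postgres catalogs): the foreign key list is a partial, hand-written approximation of the tables described in this post, <code>payment</code> stands in for the monthly partitions, and the function names are hypothetical.</p>
<pre><code class="hljs language-python"># table -> tables it references; partial and hand-written, not a full pagila schema
FOREIGN_KEYS = {
    "film": ["language"],
    "film_actor": ["film", "actor"],
    "film_category": ["film", "category"],
    "inventory": ["film", "store"],
    "rental": ["inventory", "customer", "staff"],
    "payment": ["rental"],
    "customer": ["address"],
    "staff": ["address"],
    "store": ["address"],
    "address": ["city"],
    "city": ["country"],
}


def reachable(start, edges):
    """Every table reachable from start by following edges in one direction."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def perspective(target):
    """The target plus everything strictly upstream or strictly downstream."""
    referenced_by = {}
    for child, parents in FOREIGN_KEYS.items():
        for parent in parents:
            referenced_by.setdefault(parent, []).append(child)
    upstream = reachable(target, FOREIGN_KEYS)
    downstream = reachable(target, referenced_by)
    return {target} | upstream | downstream</code></pre>
<p>Because each search only ever travels in one direction, <code>perspective("rental")</code> includes <code>language</code> but never turns the corner at <code>film</code> to pick up <code>film_actor</code>, and <code>perspective("film")</code> leaves out <code>actor</code>.</p>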
<p>There's an actual, serious problem with unrestricted traversal, too. If you recurse through <em>all</em> relationships, you wind up mapping entire subgraphs, or clusters of related tables. And clusters grow quickly. Stuart Kauffman has a great illustration of the principle in his book <em>At Home in the Universe: The Search for the Laws of Self-Organization and Complexity</em>.</p>
<blockquote>
<p>Imagine 10,000 buttons scattered on a hardwood floor. Randomly choose two buttons and connect them with a thread. Now put this pair down and randomly choose two more buttons, pick them up, and connect them with a thread. As you continue to do this, at first you will almost certainly pick up buttons that you have not picked up before. After a while, however, you are more likely to pick at random a pair of buttons and find that you have already chosen one of the pair. So when you tie a thread between the two newly chosen buttons, you will find three buttons tied together. In short, as you continue to choose random pairs of buttons to connect with a thread, after a while the buttons start becoming interconnected into larger clusters.</p>
</blockquote>
<p>When the ratio of threads to buttons, or relationships to tables, passes 0.5, there's a phase transition. Enough clusters exist that the next thread or relationship will likely connect one cluster to another, and the next, and the next. A supercluster emerges, nearly the size of the entire relationship graph. We can see what the relationship:table ratio looks like in a database by querying the system catalogs:</p>
<pre><code class="hljs language-sql">WITH tbls AS (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">num</span> <span class="hljs-keyword">FROM</span> information_schema.tables
  <span class="hljs-keyword">WHERE</span> table_schema <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pg_catalog'</span>, <span class="hljs-string">'information_schema'</span>)
), fks <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">num</span> <span class="hljs-keyword">FROM</span> pg_constraint <span class="hljs-keyword">WHERE</span> contype = <span class="hljs-string">'f'</span>
)
<span class="hljs-keyword">SELECT</span> fks.num <span class="hljs-keyword">AS</span> f, tbls.num <span class="hljs-keyword">AS</span> t, fks.num::<span class="hljs-built_in">decimal</span> / tbls.num <span class="hljs-keyword">AS</span> r
<span class="hljs-keyword">FROM</span> tbls <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> fks;</code></pre>
<p>The lowest ratio I have in a real working database is 0.56, and it's a small one, with f=14 and t=25. Others range from 0.62 (f=78, t=126) all the way up to 1.96 (f=2171, t=1107 thanks to a heavily partitioned table with multiple foreign keys); pagila itself is in the middle at 1.08 (f=27, t=25). I don't have enough data to back this up, but I think it's reasonable to expect that the number of relationships tends to increase faster than the number of tables. Without restrictions on traversal, you might as well draw a regular ERD: superclusters are inevitable.</p>
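<p>Kauffman's button experiment is also easy to simulate. The following Python is a rough sketch using a union-find structure; the function name, the button count, and the seed are arbitrary choices for illustration, and random threads are of course a much cruder model than real foreign keys.</p>
<pre><code class="hljs language-python">import random


def largest_cluster_fraction(buttons, ratio, seed=0):
    """Tie buttons * ratio random threads, then measure the biggest cluster."""
    rng = random.Random(seed)
    parent = list(range(buttons))

    def find(x):
        # Walk up to the cluster's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(buttons * ratio)):
        a, b = rng.sample(range(buttons), 2)  # two distinct buttons
        parent[find(a)] = find(b)             # tie their clusters together

    sizes = {}
    for button in range(buttons):
        root = find(button)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / buttons


for ratio in (0.25, 0.5, 0.75, 1.0):
    print(ratio, round(largest_cluster_fraction(2000, ratio), 3))</code></pre>
<p>Below a ratio of 0.5 the largest cluster should stay a tiny fraction of the buttons; by 1.0, roughly where pagila sits, it should take in most of them.</p>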
<p><code>fks</code> will draw a regular ERD if passed only the database name, but like I said earlier, automated tools are only so good at layout (and in a terminal of limited width, even a smallish database is liable to produce an illegibly zoomed-out model). With no way to add universal render hints, Graphviz does a lot better with the smaller, more restricted graphs from local perspectives inside the database -- and so do humans. Reading a full-scale data model is hard! Tens or hundreds of nodes have to be sorted by relevance to the problem at hand; nodes and relationships which matter have to be mapped, the irrelevant actively ignored, others tagged with a mental question mark. Often a given problem involves more relevant entities than the human mind can track unaided. <code>fks</code> doesn't resolve the issue completely, but making a database spatial and navigating that space visually goes some way to meet our limitations and those of our tools.</p>]]></description>
            <link>https://di.nmfay.com/exploring-databases-visually</link>
            <guid isPermaLink="true">https://di.nmfay.com/exploring-databases-visually</guid>
            <pubDate>Sun, 04 Apr 2021 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Extra-fuzzy History Searching with Mnem]]></title>
            <description><![CDATA[<p>Update: <a href="https://github.com/cantino/mcfly">mcfly</a> already existed and takes a slightly different approach (a neural network instead of structural analysis). Its only real downside was a lack of fuzzy searching, so I added that there. Use mcfly instead!</p>
<p>I use a lot of Rust command-line tools: <a href="https://github.com/BurntSushi/ripgrep">ripgrep</a>, <a href="https://github.com/sharkdp/fd">fd</a>, <a href="https://github.com/bootandy/dust">dust</a>, and more. So when I had my own idea for a better command-line mousetrap, it seemed like the way to go.</p>
<p>Shells log the commands you enter to a history file. Bash has <code>.bash_history</code>; zsh writes to whatever <code>HISTFILE</code> names, often <code>.histfile</code> or <code>.zsh_history</code>. The <code>EXTENDED_HISTORY</code> option in the latter adds timestamps, but that's about as fancy as it gets. Both shells (and presumably others) also have "reverse search" functionality which lets you look backwards and forwards through the history, one line at a time.</p>
<p><img src="https://di.nmfay.com/images/mnem-ctrl-r.gif" alt="reverse searching for rustc calls"></p>
<p>Functional! But not especially friendly. Only seeing one result at a time makes it difficult to evaluate multiple similar matches; matching is strictly linear, as you can see by my typos; and chronological order is only sometimes the most useful.</p>
<p>I do a lot with the AWS CLI, SaltStack, and other complicated command-line interfaces. I want to compare invocations to see how I've combined verbs and flags in the past, and for tasks I repeat just often enough to forget how to do them, sorting by overall frequency is more useful than sorting by time.</p>
<p>Enter <a href="https://gitlab.com/dmfay/mnem">Mnem</a> (regrettably, I missed getting <code>clio</code>, the Muse of history, by a matter of weeks):</p>
<p><img src="https://di.nmfay.com/images/mnem.gif" alt="mnem in use"></p>
<p>The idea is pretty simple: load the history file, and reduce every command to its syntactic structure. <code>git commit -m "some message here"</code> becomes <code>git commit -m &#x3C;val></code>; <code>mv "hither" "thither"</code> turns into <code>mv &#x3C;arg1> &#x3C;arg2></code>. Many entries will have the same structure, especially if switches are sorted consistently, so counting up occurrences yields each structure's overall popularity.</p>
<p>Picking one such aggregate yields a second selector over the original commands that share its structure, and selecting one of those prints it to stdout. From there it can be referenced, copied and pasted, or even <code>eval</code>ed in the shell.</p>
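<p>For the curious, the reduce-and-count step can be sketched in a few lines of JavaScript. To be clear, this is not Mnem's actual parser (Mnem is written in Rust), and the rules are simplifying assumptions: quoted strings are the only things replaced, one directly after a flag becomes <code>&#x3C;val></code>, and any other becomes a numbered <code>&#x3C;argN></code>. The function names <code>tokenize</code>, <code>structure</code>, and <code>aggregate</code> are invented for the example.</p>
<pre><code class="hljs language-js">// Hypothetical sketch, not Mnem's real implementation.
// Splits a command line into quoted strings and bare words.
function tokenize(command) {
  return command.match(/"[^"]*"|'[^']*'|\S+/g) || [];
}

// Reduces a command to its structure. Assumption: a quoted string right
// after a flag is that flag's value; any other quoted string is a
// positional argument. Bare words are kept as they are.
function structure(command) {
  let argCount = 0;
  let previousWasFlag = false;
  const parts = tokenize(command).map((token) => {
    const quoted = /^["']/.test(token);
    const part = !quoted ? token : previousWasFlag ? '&#x3C;val>' : `&#x3C;arg${++argCount}>`;
    previousWasFlag = token.startsWith('-');
    return part;
  });
  return parts.join(' ');
}

// Groups history entries by structure, most frequent structure first.
function aggregate(history) {
  const groups = new Map();
  for (const command of history) {
    const key = structure(command);
    groups.set(key, [...(groups.get(key) || []), command]);
  }
  return [...groups.entries()].sort((a, b) => b[1].length - a[1].length);
}</code></pre>
<p>Each entry pairs a structure with the original commands that reduced to it, so the length of that list doubles as the popularity count, and the commands themselves are what a second selector would offer.</p>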
<p>So far I've released Mnem to the <a href="https://aur.archlinux.org/packages/mnem/">Arch AUR</a> and a Homebrew tap:</p>
<pre><code class="hljs">brew tap dmfay/mnem https://gitlab.com/dmfay/homebrew-mnem.git
brew install dmfay/mnem/mnem</code></pre>]]></description>
            <link>https://di.nmfay.com/mnem</link>
            <guid isPermaLink="true">https://di.nmfay.com/mnem</guid>
            <pubDate>Thu, 17 Sep 2020 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Plex: A Life]]></title>
            <description><![CDATA[<p>A little while back I got my hands on a copy of <em>Software Development and Reality Construction</em>, the output of a conference held in Berlin in 1988. Among a variety of other more or less philosophical treatments of the theory and practice of software development, Don Knuth analyzes errors he made in his work on TeX; Kristen "SIMULA" Nygaard reviews his collaboration with labor unions to ensure that software meant to coordinate and control work does not wind up controlling the workers as well, a rather grim read in the era of Uber and Amazon; Heinz Klein and Kalle Lyytinen embark on a discussion of data modeling as production rather than as interpretation or hermeneutics. In all, it's some of the most insightful writing about programming and software engineering I've encountered.</p>
<p>This isn't about those contributions.</p>
<p>There's an entry fairly early on from one Douglas T. Ross, called "From Scientific Practice to Epistemological Discovery". Ross, who died in 2007, was a computer scientist and engineer most remembered today for the influential APT machine tools programming language and for coining the term "computer-aided design".</p>
<p>This isn't about the things Doug Ross is remembered for.</p>
<p>Doug Ross had a <em>system</em>. The system began its public life as an early software engineering methodology in the Cambrian explosion of such methodologies enabled by the spread of high-level programming languages in the 60s and 70s. The system went by a few names. Ross's company, SofTech Inc., called it the Structured Analysis and Design Technique or SADT. The US Air Force, never wont to use merely one acronym where two will do, called it IDEF0: ICAM (Integrated Computer Aided Manufacturing) DEFinition for function modeling.</p>
<p>To Doug Ross, the system was Plex. And Plex was everything. When the Department of Defense cut the Structured Analysis data modeling approach from IDEF0 in favor of a simpler methodology to be developed by SofTech subcontractors and named IDEF1, Ross decried the decision as destroying the "mathematical elegance and symmetric completeness of SADT [...] IDEF0 became merely the best of a competing zoo of other software development CASE tools, none of which were scientifically founded". He saw his career, and, indeed, his life, as drawing him inevitably toward the discovery and promulgation of his "philosophy of problem-solving", and furthering Plex's development became more and more important to him as time went on. In the mid-80s, he stopped drawing a salary at SofTech and went back to MIT, lecturing part-time on electrical engineering in order to focus more of his efforts on Plex.</p>
<p>But even MIT was, in Ross's own words, "not yet ready for [Structured Analysis] much less Plex". A graduate seminar on Plex itself was briefly offered in 1984, but was canceled due to lack of student interest. In "From Scientific Practice" Ross bemoans his inability to gain traction for Plex, writing of feeling "an intolerable burden of responsibility to still be the only person in the world (to my knowledge) pursuing it". His only recourse was to turn inward and "generate book after book on Plex in my office at home, in order that Plex will be ready when the world is ready for it!"</p>
<p>At this point, Doug Ross might be sounding a little bit like a crank. Let me be clear: Douglas T. Ross, computer science pioneer, was <em>absolutely</em> a crank of the first water. This is just as absolutely to his credit; any fool can make it from the sublime to the ridiculous, but it takes real talent to go in the other direction. And Plex <em>is</em> sublime, if in its own dry, academic way. Ross is not the celestial paranoiac Francis E. Dec, ranting and raving about the <a href="http://www.bentoandstarchky.com/dec/intro.htm">Worldwide Deadly Gangster Communist Computer God</a> and lunar brain storage depots; nor is Plex the gonzo experience of <a href="https://timecube.2enp.com/">Nature's Harmonic Simultaneous 4-Day Time Cube</a>. That said, Ross never devolves into the racist vituperations Dec and Time Cube's Gene Ray were sometimes given to, either. So it goes.</p>
<div class="align-center"><p><strong>⁂</strong></p></div>
<p>Plex itself is a sprawling, incoherent metaphysics built, according to Ross, on the foundation of a single pun (or, more properly, double entendre): "nothing can be left out". Thus inspired, Ross embarks upon the classic Cartesian thought experiment. But where Descartes discards every proposition except the cogito ("I think, therefore I am"), Ross's buck stops at "nothing doesn't exist".</p>
<p>Or, in Ross's own framing:</p>
<blockquote>
<p><strong>Nothing doesn't exist</strong>. That is <em>the</em> <strong>First Definition</strong> of Plex -- a scientific philosophy whose aim is <em>understanding our understanding of the nature of nature</em>. Plex does not attempt to understand nature <em>itself</em>, but only our <em>understanding</em> of it. We are <em>included</em> in nature as we do "our understanding", both scientific and informal, so we must understand <em>ourselves</em> as well -- not just what we <em>think</em> we are, but as we <em>really</em> are, as <em>integral, natural <strong>beings</strong> of nature</em>. <em>How</em> one "understand"s and even who "we" <em>are</em> as we <em>do</em> "our understanding" necessarily is left completely open, for all that must arise <em>naturally</em> from the very <em>nature</em> of nature.</p>
</blockquote>
<p>All emphasis -- all of it, I assure you -- original. Ross's dedication to bold and italic text wavers from work to work and page to page, but on balance "From Scientific Practice to Epistemological Discovery" is in fine form. Early entries he refers to in his "thousands of C-pages" (that is, "chronological working pages", all of which may or may not have been lost) and <a href="https://groups.csail.mit.edu/mac/projects/studentaut/lecture4/plex_lectures_book_ok.htm">lecture notes he prepared in 1985</a> sometimes switch between up to eight colors every few words. The lecture notes are of particular interest compared to the other extant materials, comprising a "study of an SADT Data Model which expresses all aspects of any object which obeys laws of physical cause and effect" delivered as a dialogue between Ross and a genie reminiscent of <em>Gödel, Escher, Bach</em>.</p>
<p>Having arrived at the First Definition, Ross next attempts to deduce everything else from it, claiming that Plex need make no assumptions. "Nothing doesn't exist" leads, expanded this way and that, to "Only that which is <em>known by definition</em> is <strong>known</strong> -- by definition", as, "<em>without</em> a definition for something, we only can know it as Nothing". Within the space of a few paragraphs, he's slammed what appears to be his own misinterpretation of Stephen Hawking and (unknowingly?) reinvented Spinoza's pantheism, on the grounds that "Nothing <strong>isn't</strong>; Plex is what Nothing <strong>isn't</strong>". And for what it's worth, this is all still in the first two pages of "From Scientific Practice".</p>
<div class="align-center"><p><strong>⁂</strong></p></div>
<p>In another instance, Plex guides Ross to enlightenment regarding questions of information theory. It turns out that a single bit actually requires 3/2 binary digits for encoding, "because the value of the <em>half-bit</em> is 3/4 !!!".</p>
<blockquote>
<p>-- which ultimately results from the fact that in <em>actuality</em>, when you don't have something, it is <em>not</em> the case that you <em>have</em> it <em>but</em> it is Nothing -- it is that you <strong>don't have</strong> it; whereas when you <em>do</em> have something, that is because you <strong>don't have</strong> what it <em>isn't</em>!</p>
</blockquote>
<p>At a closer reading, this isn't necessarily the gibberish it might seem at first blush. Plex's foundation in "Nothing" makes <code>zero</code> the default state. But <code>one</code> is only understandable when there's an understood meaning for <code>one</code>. The elaboration about nothings and somethings makes it seem like Ross is counting this other <code>one</code> -- that is, half a bit -- towards the cost of encoding any other bit. In semiotic terms, this is the <em>interpretant</em> or subjective value Charles Sanders Peirce sees implicit in signification. But if Ross ever investigated the ways logicians and linguists had already been exploring this territory, there's no indication that he attached any significance (as it were) to their work. And while including the interpretant for half the possible values may yield the same final figure, it does not account for the 3/4 half-bit; so in the face of storage hardware design as practiced, Ross's insistence on 3/2 seems more mystical than scientific.</p>
<p>I have no idea how <em>au courant</em> Ross was with the humanities in general, but it seems likely that the answer is "not very". He was, of course, quite well-versed in math and engineering. Even deep in the mire of Plex, one can find him struggling to accommodate the realization that he was, in essence, defining formal systems backwards (he settles this with the ingenious maneuver of declaring the distinction akin to chirality), but the only philosopher he mentions is Plato. His efforts at deductive logic too seem thoroughly warped, as evinced by his "proof that every point is the whole world". For reference, an object's "identity" is tautologically defined as above: the set of "that" which "this" isn't.</p>
<pre><code class="hljs language-markdown">  I  n = 1: A world of one point is the whole world.
 II  Assume the theorem is true for (n - 1) points. (n > 1),
<span class="hljs-code">     i.e., for any collection of (n - 1) points, every point is the whole world.</span>
<span class="hljs-code">     [ed: remember, Plex needs no assumptions, let alone "assume the theorem is true"]</span>
III  To prove the theorem for n points given its truth for (n - 1) points
<span class="hljs-code">     (n > 1)</span>
<span class="hljs-code">     (a) The identity of any one point, p, in the collection is a collection of (n -</span>
<span class="hljs-code">         1) points, each of which is the whole world, by II.</span>
<span class="hljs-code">     (b) The identity of any other point, q, i.e., a point of the identity of p, is</span>
<span class="hljs-code">         a collection of (n - 1) points, each of which is the whole world, by II.</span>
<span class="hljs-code">     (c) The identity of p and the identity of q are identical except that where</span>
<span class="hljs-code">         the identity of p has q the identity of q has p. In any case p is the</span>
<span class="hljs-code">         whole world by (b) and q is the whole world by (a).</span>
<span class="hljs-code">     (d) Hence both p and q are the whole world, as are all the other points (if</span>
<span class="hljs-code">         any) in their respective identities (and shared between them).</span>
<span class="hljs-code">     (e) Hence all n points are the whole world.</span>
 IV  For n = 2, I is used (via II) in IIIa and IIIb, q.e.d.
  V  Q.E.D. by natural induction.</code></pre>
<p>As mentioned, Ross generated a wealth of C-pages, lecture notes, and other writings on Plex, but except for a small fraction apparently hosted on <a href="https://groups.csail.mit.edu/mac/projects/studentaut/index.htm">his last MIT faculty/program page</a>, I have no idea where most of this collection ended up. If you're interested in reading further in Ross's own words, the best places to start are probably "From Scientific Practice to Epistemological Discovery" in <a href="https://www.researchgate.net/publication/242530010_Software_Development_and_Reality_Construction"><em>Software Development and Reality Construction</em></a> or <a href="https://groups.csail.mit.edu/mac/projects/studentaut/The%20Plex%20Tract.htm">The Plex Tract</a>.</p>
<h2 id="coda">Coda</h2>
<p>Doug Ross himself remains a rather cryptic figure. There's some biographical information out there, but after his birth to missionary parents in what's now Guangdong and childhood homecoming to the Finger Lakes region of New York it mostly concerns where, when, with whom, and on what he was working. In his writings he comes off somewhat full of himself, as tends to be the case with esoteric philosophers and visionaries for whom the world is not yet and will never be ready. But when Ross talks about the necessary perfection, or perfect necessity, of his marriage to his wife Pat, herself a human computer at MIT's Lincoln Laboratory, it's still a little bit charming. And when he writes, with complete seriousness, that "being a pioneer came naturally" to him, I can't exactly say otherwise.</p>
<p>I wonder what it was like in that conference hall in 1988. I don't know whether the attendees or the organizers knew what they were in for when Ross got up to talk about this beautiful, all-consuming nonsense that was driving him to desperation. But sense isn't everything; and as a project of <em>reality construction</em> Plex is a monumental accomplishment. And the reality we ourselves have collectively constructed, in which points are points, a bit corresponds to a single binary digit, and genies obstinately refuse to appear no matter how we manipulate bottles, is the richer for its existence.</p>]]></description>
            <link>https://di.nmfay.com/plex</link>
            <guid isPermaLink="true">https://di.nmfay.com/plex</guid>
            <pubDate>Fri, 06 Sep 2019 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[JOIN Semiotics and MassiveJS v6]]></title>
            <description><![CDATA[<p><a href="https://massivejs.org">MassiveJS</a> version 6 is imminent. This next release closes the widest remaining gap between Massive-generated APIs and everyday SQL, not to mention other higher-level data access libraries: <code>JOIN</code>s.</p>
<p>This is something of a reversal for Massive, which until now has had very limited functionality for working with multiple database entities at once. I've even <a href="https://di.nmfay.com/views">written about this as a constraint not without benefits</a> (and, for the record, I still think that -- ad-hoc joins are a tool to be used judiciously in application code!).</p>
<p>But the main reason for this lack was always that I'd never come up with any solution that didn't fit awkwardly into an already-awkward options object. <a href="https://massivejs.org/docs/persistence#deep-insert">Deep insert</a> and <a href="https://massivejs.org/docs/resultset-decomposition">resultset decomposition</a> were quite enough to keep track of. I am naturally loath to concede any inherent advantages to constructing models, but this really seemed like one for the longest time.</p>
<p>There are, however, ways. Here's what Massive joins look like, if we invade the imaginary privacy of an imaginary library system's imaginary patrons:</p>
<pre><code class="hljs language-js"><span class="hljs-keyword">const</span> whoCheckedOutCalvino = <span class="hljs-keyword">await</span> db.libraries.join({
  <span class="hljs-attr">books</span>: {
    <span class="hljs-attr">on</span>: {<span class="hljs-attr">library_id</span>: <span class="hljs-string">'id'</span>},
    <span class="hljs-attr">patron_books</span>: {
      <span class="hljs-attr">type</span>: <span class="hljs-string">'LEFT OUTER'</span>,
      <span class="hljs-attr">pk</span>: [<span class="hljs-string">'patron_id'</span>, <span class="hljs-string">'book_id'</span>],
      <span class="hljs-attr">on</span>: {<span class="hljs-attr">book_id</span>: <span class="hljs-string">'books.id'</span>},
      <span class="hljs-attr">omit</span>: <span class="hljs-literal">true</span>
    },
    <span class="hljs-attr">who_checked_out</span>: {
      <span class="hljs-attr">type</span>: <span class="hljs-string">'LEFT OUTER'</span>,
      <span class="hljs-attr">relation</span>: <span class="hljs-string">'patrons'</span>,
      <span class="hljs-attr">on</span>: {<span class="hljs-attr">id</span>: <span class="hljs-string">'patron_books.patron_id'</span>}
    }
  }
}).find({
  <span class="hljs-attr">state</span>: <span class="hljs-string">'EV'</span>,
  <span class="hljs-string">'books.author ILIKE'</span>: <span class="hljs-string">'calvino, %'</span>
});</code></pre>
<p>(<code>relation</code> in this sense indicates a table or view.)</p>
<p>And the output:</p>
<pre><code class="hljs language-js">[{
  <span class="hljs-string">"id"</span>: <span class="hljs-number">2</span>,
  <span class="hljs-string">"name"</span>: <span class="hljs-string">"East Virginia State U"</span>,
  <span class="hljs-string">"state"</span>: <span class="hljs-string">"EV"</span>,
  <span class="hljs-string">"books"</span>: [{
    <span class="hljs-string">"author"</span>: <span class="hljs-string">"Calvino, Italo"</span>,
    <span class="hljs-string">"id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"library_id"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"title"</span>: <span class="hljs-string">"Cosmicomics"</span>,
    <span class="hljs-string">"who_checked_out"</span>: [{
      <span class="hljs-string">"id"</span>: <span class="hljs-number">1</span>,
      <span class="hljs-string">"name"</span>: <span class="hljs-string">"Lauren Ipsum"</span>
    }]
  }]
}, {
  <span class="hljs-string">"id"</span>: <span class="hljs-number">3</span>,
  <span class="hljs-string">"name"</span>: <span class="hljs-string">"Neitherfolk Public Library"</span>,
  <span class="hljs-string">"state"</span>: <span class="hljs-string">"EV"</span>,
  <span class="hljs-string">"books"</span>: [{
    <span class="hljs-string">"author"</span>: <span class="hljs-string">"Calvino, Italo"</span>,
    <span class="hljs-string">"id"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"library_id"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"title"</span>: <span class="hljs-string">"Cosmicomics"</span>,
    <span class="hljs-string">"who_checked_out"</span>: [{
      <span class="hljs-string">"id"</span>: <span class="hljs-number">2</span>,
      <span class="hljs-string">"name"</span>: <span class="hljs-string">"Daler S. Ahmet"</span>
    }]
  }, {
    <span class="hljs-string">"author"</span>: <span class="hljs-string">"Calvino, Italo"</span>,
    <span class="hljs-string">"id"</span>: <span class="hljs-number">4</span>,
    <span class="hljs-string">"library_id"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"title"</span>: <span class="hljs-string">"Invisible Cities"</span>,
    <span class="hljs-string">"who_checked_out"</span>: []
  }]
}]</code></pre>
<p>Or in other words, exactly what you'd hope it would look like -- and what, if you use Massive, you may previously have needed a view and a decomposition schema to achieve. This is a moderately complex example; between defaults (e.g. <code>type</code> defaulting to <code>INNER</code>) and introspection, declaring a join can be as simple as naming the target: <code>db.libraries.join('books')</code>.</p>
<p>The join schema is something of an evolution on the decomposition schema, sharing the same structure but inferring column lists, table primary keys, and even some <code>on</code> conditions where unambiguous foreign key relationships exist. It's more concise, less fragile, and still only defined exactly when and where it's needed. Even better, compound entities created from tables can use persistence methods, meaning that <code>join()</code> can replace many if not most existing usages of deep insert and resultset decomposition.</p>
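<p>If resultset decomposition is unfamiliar, the core move is folding the flat rows a <code>JOIN</code> returns back into a tree, keyed on primary keys. A rough sketch in plain JavaScript follows; it is an illustration only, not Massive's implementation, and the <code>nest</code> function and the row column names are made up for the example.</p>
<pre><code class="hljs language-js">// Hypothetical illustration of decomposition; not Massive's code.
// Folds flat library-and-book rows into one object per library.
function nest(rows) {
  const libraries = new Map();

  for (const row of rows) {
    // the library's primary key decides whether this row starts a new
    // library or adds to one already seen
    if (!libraries.has(row.library_id)) {
      libraries.set(row.library_id, {
        id: row.library_id,
        name: row.library_name,
        books: []
      });
    }

    // an outer join leaves the book columns null for a library
    // without books, so there is nothing to add in that case
    if (row.book_id !== null) {
      libraries.get(row.library_id).books.push({
        id: row.book_id,
        title: row.book_title
      });
    }
  }

  return [...libraries.values()];
}</code></pre>
<p>Keying on the library's primary key is what keeps a library with several books from appearing once per book, which is why a schema has to state, or be able to infer, each relation's key.</p>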
<p>It might seem a little unconventional to just invent ersatz database entities out of whole cloth. There's some precedent -- Massive already treats scripts like database functions -- but the compound entities created by <code>Readable.join()</code> are a good bit more complex than that. There's a method to this madness though, and its origins date back to before Ted Codd came up with the idea of the relational database itself.</p>
<h2 id="semiotics-from-30000-feet">Semiotics from 30,000 Feet</h2>
<p>Semiotics is, briefly, the study of meaning-making, with 19th-century roots in both linguistics and formal logic. It's also a sprawling intellectual tradition in dialogue with multifarious other sprawling intellectual traditions, so I am not remotely going to do it justice here. The foundational idea is credited on the linguistics side to Ferdinand de Saussure: meaning is produced in the relation of a <em>signifier</em> to a <em>signified</em>, or taken together a <em>sign</em>. Smoke to fire, letter to sound, and so forth. Everything else proceeds from that relationship. There is, of course, a lot more of that everything else, and like so many other foundational ideas the original Saussurean dyad is something of a museum piece.</p>
<p>But the idea of theorizing meaning itself in almost algebraic terms would outlive de Saussure. The logician Charles Sanders Peirce had already come to similar conclusions, and had realized to boot that the interpreted value of the signifier's relationship to its signified is as important as the other two. Peirce, following this line of reasoning, understood this "interpretant" itself to be a sign comprising its own signifier and signified which in turn yield their own interpretant, in infinite chains of signification. Louis Hjelmslev, meanwhile, reimagined de Saussure's dyad as a relation of <em>expression</em> to <em>content</em>, and added a second dimension of <em>form</em> and <em>substance</em>. To Hjelmslev, a sign is a function, in the mathematical sense, mapping the "form of expression" to the "form of content", naming as the "substance of expression" and "substance of content" the raw materials formed into the sign.</p>
<p>The use of the term "substance" sounds kind of like some sort of philosophically-détourned jargon, but there are no tricks here: it's just <em>stuff</em>. There's no more specific designation than the likes of "substance" for "that which has been made into a sign"; the category includes everything from physical materials to light, gesture, positioning, electricity, more, in endless combinations. A sign is created by these matters being selected and formed into content and expression: fuel, oxygen, and heat organized into fire and smoke, or sounds uttered in an order corresponding to a known linguistic quantity. It should be said also that consciousness need not enter into it: anything can make a sign, and even a plant can interpret one.</p>
<p>This all is to say: there's stuff out there, and what it has in common is that it is made to mean things. Most stuff, in fact, is constantly meaning many things at the same time, as long as there's an interpreting process -- and there's always <em>something</em>. The philosopher-psychoanalyst tag team of Gilles Deleuze and Félix Guattari envisioned the primordial soup of matters-awaiting-further-formation as a spatial dimension: the <em>plane of consistency</em> or <em>plane of immanence</em>. Signification, as they proposed in <em>A Thousand Plateaus</em>, happens on and above the plane of consistency, as matters are selected and drawn up from it to become substance and sign. The recursive nature of signification means that these signs are then selected into the substance of yet other signs, becoming layers or strata on the plane in a fashion they compare to the formation of sedimentary rock.</p>
<h2 id="signs-and-databases">Signs and Databases</h2>
<p>A database management system, like any other program, is an immensely complex system of signs. However, what sets DBMSs (and some other categories of software, like ledgers and version control systems) apart is that they're designed to manage <em>other</em> systems of signs. Thanks to this recursive aspect, a database can be imagined as a plane of consistency, a space from which any combination of unformed bytes might be drawn up into column-signs and row-signs which in turn are gathered into table-signs and view-signs and query-signs.</p>
<p>And if tables and views and queries are all still signs at base, where exactly do the differences come in? Tables store persistent data and are therefore mutable, while views and queries do not and are not, and must be constituted from tables themselves and (in the case of views) from each other. Tables constitute a lower stratum of signs, with views forming table- and view-substance into signs on higher strata, and queries higher still, at a sufficient remove from the plane of consistency that they're no longer stored in the database itself.</p>
<p>This is, of course, arriving at inheritance the long way around. In Massive terms, database entities are first instances of a base <code>Entity</code> class, after which they inherit a second prototype: one of <code>Sequence</code>, <code>Executable</code>, or <code>Readable</code>. Some of the latter may be further articulated as <code>Writable</code>s, as well; there are no <code>Writable</code>s which are not also <code>Readable</code>s.</p>
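<p>Stripped of everything that makes it useful, that arrangement looks something like the following. The class names come from Massive, but the bodies are hypothetical stand-ins (the <code>canRead</code> and <code>canWrite</code> methods and the <code>popular_books</code> view are invented), and plain <code>extends</code> is assumed in place of however Massive actually assigns the second prototype. Read it as a sketch of the hierarchy rather than Massive's source.</p>
<pre><code class="hljs language-js">// Hypothetical stand-ins: only the class names are taken from Massive.
class Entity {
  constructor(name) {
    this.name = name;
  }
}

// Anything that can be queried: tables and views alike.
class Readable extends Entity {
  canRead() {
    return true;
  }

  canWrite() {
    return false;
  }
}

// In this sketch, tables. Every Writable is also a Readable.
class Writable extends Readable {
  canWrite() {
    return true;
  }
}

const libraries = new Writable('libraries');
const popularBooks = new Readable('popular_books');

console.log(libraries instanceof Readable); // true
console.log(popularBooks instanceof Writable); // false</code></pre>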
<p>But there's more than one thing happening here, and the ordering of tables, views, and database functions into class-strata is the second step -- matters must be chosen before they can be formed into signs. It's in this first step of stratification that Massive adds script files to the API system of signs, treating them (almost) identically to functions and procedures.</p>
<p><code>Readable.join()</code> takes the same idea further to expand on the database's relations: before, a <code>Readable</code> mapped one-to-one with a single table or view. But as long as SQL can be generated to suit, there's no reason one <code>Readable</code> couldn't map to multiple relations. <code>Writable</code>s too, for that matter:</p>
<pre><code class="hljs language-js"><span class="hljs-keyword">const</span> librariesWithBooks = db.libraries.join(<span class="hljs-string">'books'</span>);
<span class="hljs-keyword">const</span> libraryMembers = db.patrons.join(<span class="hljs-string">'libraries'</span>);

<span class="hljs-comment">// inserts work exactly like deep insert, persisting an</span>
<span class="hljs-comment">// entire object tree</span>
<span class="hljs-keyword">const</span> newLibrary = <span class="hljs-keyword">await</span> librariesWithBooks.insert({
  <span class="hljs-attr">name</span>: <span class="hljs-string">'Lichfield Public Library'</span>,
  <span class="hljs-attr">state</span>: <span class="hljs-string">'EV'</span>,
  <span class="hljs-attr">books</span>: [{
    <span class="hljs-attr">library_id</span>: <span class="hljs-literal">undefined</span>,
    <span class="hljs-attr">title</span>: <span class="hljs-string">'Jurgen: A Comedy of Justice'</span>,
    <span class="hljs-attr">author</span>: <span class="hljs-string">'Cabell, James Branch'</span>
  }, {
    <span class="hljs-attr">library_id</span>: <span class="hljs-literal">undefined</span>,
    <span class="hljs-attr">title</span>: <span class="hljs-string">'If On a Winter\'s Night a Traveller'</span>,
    <span class="hljs-attr">author</span>: <span class="hljs-string">'Calvino, Italo'</span>
  }]
});

<span class="hljs-comment">// updates make changes in the origin table, based on</span>
<span class="hljs-comment">// criteria which can reference the joined tables</span>
<span class="hljs-keyword">const</span> withCabell = <span class="hljs-keyword">await</span> librariesWithBooks.update({
  <span class="hljs-string">'books.author ilike'</span>: <span class="hljs-string">'cabell, %'</span>
}, {
  <span class="hljs-attr">has_cabell</span>: <span class="hljs-literal">true</span>
});

<span class="hljs-comment">// deletes, like updates, affect the origin table only</span>
<span class="hljs-keyword">const</span> iplPatrons = <span class="hljs-keyword">await</span> libraryMembers.destroy({
  <span class="hljs-string">'libraries.name ilike'</span>: <span class="hljs-string">'Imaginary Public Library'</span>
});</code></pre>
<h2 id="try-it-out">Try it Out!</h2>
<p>The first v6 prerelease is available now: <code>npm i massive@next</code>. There's now a <a href="https://massivejs.org/docs/prerelease">prerelease section of the docs</a> going over what's new and different in detail. But to sum up the other changes: </p>
<ul>
<li>Node &#x3C; 7.6 is no longer supported.</li>
<li>Implicit ordering has been dropped.</li>
<li>Resultset decomposition now yields arrays instead of objects by default. The <code>array</code> schema field is no longer recognized, and you'll need to remove it from your existing decomposition schemas. To yield objects, set <code>decomposeTo: 'object'</code> instead.</li>
<li>JSON and JSONB properties are now sorted as their original type instead of being processed as text.</li>
<li>The <code>type</code> property of the <code>order</code> option has been deprecated in favor of Postgres-style <code>field::type</code> casting as used elsewhere. It will continue to work through the 6.x lifecycle but may be removed in a subsequent major release.</li>
</ul>
<p>This is a feature I've been wishing I could make happen somehow ever since I first published the original resultset decomposition Gist more than two years ago. It's involved extensive changes to table loading, criteria parsing, and statement generation. I've endeavored <em>not</em> to break these areas, and have informally experimented by dropping pre-prerelease versions into an existing codebase. Results have been good, but should you find an issue with this or any other Massive functionality, please <a href="https://gitlab.com/dmfay/massive-js/issues">let me know</a>!</p>
<p>I'm really excited to see just how far joins expand Massive's capabilities, but in truth there's just one thing I think I and most other Massive users will get the most mileage out of: plain old query predicate generation with criteria objects, without having to define and manage a plethora of views to cover basic <code>JOIN</code>s. Stratification is a useful way to think about the production of meaning -- but strata themselves can also be dead weight.</p>]]></description>
            <link>https://di.nmfay.com/join-semiotics</link>
            <guid isPermaLink="true">https://di.nmfay.com/join-semiotics</guid>
            <pubDate>Tue, 13 Aug 2019 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[A Self-Sourcing Cassandra Cluster with SaltStack and EC2]]></title>
            <description><![CDATA[<p>Anybody doing something interesting to a production Cassandra cluster is generally advised, for a host of excellent reasons, to try it out in a test environment first. Here's how to make those environments effectively disposable.</p>
<p>The something interesting we're trying to do to our Cassandra cluster is actually two somethings: upgrading from v2 to v3, while also factoring Cassandra itself out from the group of EC2 servers that currently run Cassandra-and-also-some-other-important-stuff. We have a "pets" situation and want a "cattle" situation, per Bill Baker: pets have names and you care deeply about each one's welfare, while cattle are, not to put too fine a point on it, fungible. If we can bring new dedicated nodes into the cluster, start removing the original nodes as replication takes its course, and finally upgrade this Database of Theseus, that'll be some significant progress -- and without downtime, even! But it's going to take a lot of testing, to say nothing of managing the new nodes for real.</p>
<p>We already use SaltStack to monitor and manage other areas of our infrastructure besides the data pipeline, and SaltStack includes a "salt-cloud" module which can work with EC2. I'd rather have a single infra-as-code solution, so that part's all good. What isn't: the <a href="https://github.com/salt-formulas/salt-formula-cassandra">official Cassandra formula</a> is geared more towards single-node instances or some-assembly-required clusters, and provisioning is a separate concern. I expect to be creating and destroying clusters with abandon, so I need this to be as automatic as possible.</p>
<!-- vim-markdown-toc GitLab -->
<ul>
<li>
<p><a href="#salt-cloud-configuration">Salt-Cloud Configuration</a></p>
<ul>
<li><a href="#etccloudprofilesdec2conf">etc/cloud.profiles.d/ec2.conf</a></li>
<li><a href="#cassandra-testmap">cassandra-test.map</a></li>
</ul>
</li>
<li>
<p><a href="#pillar-and-mine">Pillar and Mine</a></p>
<ul>
<li><a href="#srvsaltpillartopsls">srv/salt/pillar/top.sls</a></li>
<li><a href="#srvsaltpillarsystem-user-ubuntusls">srv/salt/pillar/system-user-ubuntu.sls</a></li>
<li><a href="#srvsaltpillarmine-network-infosls">srv/salt/pillar/mine-network-info.sls</a></li>
<li><a href="#srvsaltpillarjavasls">srv/salt/pillar/java.sls</a></li>
<li><a href="#srvsaltpillarcassandrasls">srv/salt/pillar/cassandra.sls</a></li>
</ul>
</li>
<li>
<p><a href="#the-cassandra-state">The Cassandra State</a></p>
<ul>
<li><a href="#srvsaltcassandradefaultsyaml">srv/salt/cassandra/defaults.yaml</a></li>
<li><a href="#srvsaltcassandramapjinja">srv/salt/cassandra/map.jinja</a></li>
<li><a href="#srvsaltcassandrainitsls">srv/salt/cassandra/init.sls</a></li>
<li><a href="#srvsaltcassandrafilesinstallsh">srv/salt/cassandra/files/install.sh</a></li>
<li><a href="#srvsaltcassandrafilescassandraservice">srv/salt/cassandra/files/cassandra.service</a></li>
<li><a href="#srvsaltcassandrafiles2212cassandrayaml">srv/salt/cassandra/files/2.2.12/cassandra.yaml</a></li>
<li><a href="#srvsaltcassandrafiles2212timewindowcompactionstrategy-225jar">srv/salt/cassandra/files/2.2.12/TimeWindowCompactionStrategy-2.2.5.jar</a></li>
</ul>
</li>
<li>
<p><a href="#highstate">Highstate</a></p>
<ul>
<li><a href="#srvsalttopsls-changes">srv/salt/top.sls changes</a></li>
</ul>
</li>
<li><a href="#startup">Startup!</a></li>
</ul>
<!-- vim-markdown-toc -->
<h2 id="salt-cloud-configuration">Salt-Cloud Configuration</h2>
<p>The first step in connecting salt-cloud is to set up a provider and a profile. On the Salt master, these live by default in /etc/salt/cloud.providers.d and /etc/salt/cloud.profiles.d. We keep everything in source control and symlink these directories.</p>
<p>Our cloud stuff is hosted on AWS, so we're using the <a href="https://docs.saltstack.com/en/latest/topics/cloud/aws.html">EC2 provider</a>. That part is basically stock, but in profiles we do need to define a template for the Cassandra nodes themselves.</p>
<h3 id="etccloudprofilesdec2conf">etc/cloud.profiles.d/ec2.conf</h3>
<pre><code class="hljs language-yaml">cassandra_node:
  provider: [your provider <span class="hljs-keyword">from</span> etc/cloud.providers.d/ec2.conf]
  image: ami-abc123
  ssh_interface: private_ips
  size: m5.large
  securitygroup:
    - default
    - others</code></pre>
<h3 id="cassandra-testmap">cassandra-test.map</h3>
<p>With the <code>cassandra_node</code> template defined in the profile configuration, we can establish the cluster layout in a <em>map file</em>. The filename doesn't matter; mine is cassandra-test.map. One important thing to note is that we're establishing a naming convention for our nodes: <code>cassandra-*</code>. Each node is also defined as <code>t2.small</code> size, overriding the default <code>m5.large</code> -- we don't need all that horsepower while we're just testing! <code>t2.micro</code> instances, however, did prove to be too underpowered to run Cassandra.</p>
<pre><code class="hljs language-yaml">cassandra_node:
  - cassandra<span class="hljs-number">-1</span>:
      size: t2.small
      cassandra-seed: true
  - cassandra<span class="hljs-number">-2</span>:
      size: t2.small
      cassandra-seed: true
  - cassandra<span class="hljs-number">-3</span>:
      size: t2.small</code></pre>
<p><code>cassandra-seed</code> (and <code>size</code>, for that matter) is a <em>grain</em>, a fact each Salt-managed "minion" knows about itself. When Cassandra comes up in a multi-node configuration, each node looks for help joining the cluster from a list of "seed" nodes. Without seeds, nothing can join the cluster; however, only non-seeds bootstrap data from the seeds on joining, so it's not a good idea to make everything a seed. The seed layout also needs to toposort: if A has B and C for seeds, B has A and C, and C has A and B, it's the same situation as having no seeds. If two instances know that they're special somehow, we can use grain matching to target them specifically.</p>
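<p>As a rough sketch of what grain matching looks like from the master, assuming the <code>cassandra-seed</code> grain has been set as in the map file above: <code>-G</code> is Salt's grain-targeting flag, while the function names <code>seed_nodes</code> and <code>seed_grain</code> are made up for this example.</p>
<pre><code class="hljs language-bash"># ping only the minions whose cassandra-seed grain is true
seed_nodes() {
  sudo salt -G 'cassandra-seed:true' test.ping
}

# ask every node in the cluster what it thinks its seed grain is
seed_grain() {
  sudo salt 'cassandra-*' grains.get cassandra-seed
}</code></pre>
<p>With the test map above, <code>seed_nodes</code> should only hear back from <code>cassandra-1</code> and <code>cassandra-2</code>.</p>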
<h2 id="pillar-and-mine">Pillar and Mine</h2>
<p>The Salt "pillar" is a centralized configuration database stored on the master. Minions make local copies on initialization, and their caches can be updated with <code>salt minion-name saltutil.refresh_pillar</code>. Pillars can target nodes based on name, grains, or other criteria, and are commonly used to store configuration. We have a lot of configuration, and most of it will be the same for all nodes, so using pillars is a natural fit.</p>
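<p>A minimal sketch of that refresh-and-check cycle, wrapped in two helper functions whose names are invented here. <code>saltutil.refresh_pillar</code> and <code>pillar.get</code> are standard Salt execution functions; the minion name and pillar key are whatever you pass in.</p>
<pre><code class="hljs language-bash"># push the master's current pillar data out to one minion (or a glob of them)
refresh_node_pillar() {
  sudo salt "$1" saltutil.refresh_pillar
}

# read a single pillar key back from that minion; nested keys are colon-delimited
node_pillar() {
  sudo salt "$1" pillar.get "$2"
}</code></pre>
<p>For example, <code>refresh_node_pillar cassandra-1</code> followed by <code>node_pillar cassandra-1 cassandra:version</code> would confirm which version the node has been told to run, once the Cassandra pillar defined below is in place.</p>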
<h3 id="srvsaltpillartopsls">srv/salt/pillar/top.sls</h3>
<p>Like the <code>top.sls</code> for Salt itself, the Pillar <code>top.sls</code> defines a <em>highstate</em> or default state for new minions. First, we declare that the pillars we're adding appertain to minions whose names match the pattern <code>cassandra-*</code>.</p>
<pre><code class="hljs language-yaml">base:
  <span class="hljs-string">'cassandra-*'</span>:
    - system-user-ubuntu
    - mine-network-info
    - java
    - cassandra</code></pre>
<h3 id="srvsaltpillarsystem-user-ubuntusls">srv/salt/pillar/system-user-ubuntu.sls</h3>
<p>Nothing special here, just a user so we can ssh in and poke things. The private key for the user is defined in the cloud provider configuration.</p>
<pre><code class="hljs language-yaml">system:
  user: ubuntu
  home: /home/ubuntu</code></pre>
<h3 id="srvsaltpillarmine-network-infosls">srv/salt/pillar/mine-network-info.sls</h3>
<p>The Salt "mine" is another centralized database, this one storing the output of functions each minion runs on itself, so minions can retrieve facts about other minions from the master instead of dealing with peer-to-peer communication. Minions use a <code>mine_functions</code> pillar (or salt-minion configuration, but we're sticking with the pillar) to determine whether and what to store. For Cassandra nodes, we want internal network configuration and the public DNS name, the latter of which each node has to get by asking AWS where it is with <code>curl</code>.</p>
<pre><code class="hljs language-yaml">mine_functions:
  network.interfaces: [eth0]
  network.ip_addrs: [eth0]
  <span class="hljs-comment"># ask amazon's network config what we're public as</span>
  public_dns:
    - mine_function: cmd.run
    - <span class="hljs-string">'curl -s http://169.254.169.254/latest/meta-data/public-hostname'</span></code></pre>
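<p>Mine data is only as fresh as its last update. Here is a rough sketch of forcing an update and reading the result back, assuming the positional grain-matching form of <code>mine.get</code> shown in Salt's documentation. The function names <code>refresh_mine</code> and <code>seed_addresses</code> are made up for this example.</p>
<pre><code class="hljs language-bash"># re-run every mine function on the Cassandra nodes and store the results on the master
refresh_mine() {
  sudo salt 'cassandra-*' mine.update
}

# from one node, look up the internal addresses of the minions carrying the seed grain
seed_addresses() {
  sudo salt "$1" mine.get 'cassandra-seed:true' network.ip_addrs grain
}</code></pre>
<p>Calling <code>refresh_mine</code> and then <code>seed_addresses cassandra-3</code> asks the same question the <code>cassandra.yaml</code> template asks later on, which makes it a handy sanity check.</p>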
<h3 id="srvsaltpillarjavasls">srv/salt/pillar/java.sls</h3>
<p>Cassandra requires Java 8 to be installed (<a href="https://issues.apache.org/jira/browse/CASSANDRA-9608">prospective Java 9 support became prospective Java 11 support</a> and is due with Cassandra 4). This pillar sets up the <a href="https://github.com/saltstack-formulas/sun-java-formula">official Java formula</a> accordingly -- or rather, it did until Oracle archived the Java 8 binaries in April 2019. We're now pulling it from Artifactory, which is a whole other thing.</p>
<pre><code class="hljs language-yaml">java:
  <span class="hljs-comment"># vitals</span>
  release: <span class="hljs-string">'8'</span>
  major: <span class="hljs-string">'0'</span>
  minor: <span class="hljs-string">'202'</span>
  development: false
  
  <span class="hljs-comment"># tarball</span>
  prefix: /usr/share/java <span class="hljs-comment"># unpack here</span>
  version_name: jdk1<span class="hljs-number">.8</span><span class="hljs-number">.0</span>_202 <span class="hljs-comment"># root directory name</span>
  source_url: https://download.oracle.com/otn-pub/java/jdk/<span class="hljs-number">8</span>u202-b08/<span class="hljs-number">1961070e4</span>c9b4e26a04e7f5a083f551e/server-jre<span class="hljs-number">-8</span>u202-linux-x64.tar.gz
  source_hash: sha256=<span class="hljs-number">61292e9</span>d9ef84d9702f0e30f57b208e8fbd9a272d87cd530aece4f5213c98e4e
  dl_opts: -b oraclelicense=accept-securebackup-cookie -L</code></pre>
<h3 id="srvsaltpillarcassandrasls">srv/salt/pillar/cassandra.sls</h3>
<p>Finally, the Cassandra pillar defines properties common to all nodes in the cluster. My upgrade plan is to bring everything up on 2.2.12, switch the central pillar definition over, and then supply the new version number to each minion by refreshing its pillar as part of the upgrade process.</p>
<pre><code class="hljs language-yaml">cassandra:
  version: <span class="hljs-string">'2.2.12'</span>
  cluster_name: <span class="hljs-string">'Test Cluster'</span>
  authenticator: <span class="hljs-string">'AllowAllAuthenticator'</span>
  endpoint_snitch: <span class="hljs-string">'Ec2Snitch'</span>
  twcs_jar:
    <span class="hljs-string">'2.2.12'</span>: <span class="hljs-string">'TimeWindowCompactionStrategy-2.2.5.jar'</span>
    <span class="hljs-string">'3.0.8'</span>: <span class="hljs-string">'TimeWindowCompactionStrategy-3.0.0.jar'</span></code></pre>
<p>The <code>twcs_jar</code> dictionary gets into one of the reasons I'm not using the official formula: we're using the <a href="http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html">TimeWindowCompactionStrategy</a>. TWCS was integrated into Cassandra starting in 3.0.8 or 3.8, but it has to be compiled and installed separately for earlier versions. Pre-integration versions of TWCS also have a different package name (<code>com.jeffjirsa</code> instead of <code>org.apache</code>). 3.0.8 is the common point, having the <code>org.apache</code> TWCS built in but also being a valid compilation target for the <code>com.jeffjirsa</code> TWCS. After upgrading to 3.0.8 I'll be able to <code>ALTER TABLE</code> to apply the <code>org.apache</code> version before proceeding.</p>
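<p>For that last step, here is a sketch of what the <code>ALTER TABLE</code> might look like when driven through <code>cqlsh</code>. The table name, the one-day window, and both function names are placeholders; the window options should match whatever the table already uses under the <code>com.jeffjirsa</code> strategy.</p>
<pre><code class="hljs language-bash"># build the ALTER TABLE statement for one table, given as keyspace.table
twcs_alter_statement() {
  local table="$1"
  # window unit and size are assumed values, not taken from the real schema
  printf "ALTER TABLE %s WITH compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1'};\n" "$table"
}

# run the statement against one node: apply_builtin_twcs HOST KEYSPACE.TABLE
apply_builtin_twcs() {
  cqlsh -e "$(twcs_alter_statement "$2")" "$1"
}</code></pre>
<p>Keeping the statement builder separate makes it easy to print and review the CQL before pointing it at a live node.</p>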
<p>With the provider, profile, map file, and pillar set up, we can actually spin up a barebones cluster of Ubuntu VMs now and retrieve the centrally-stored network information from the Salt mine:</p>
<pre><code class="hljs language-bash">sudo salt-cloud -m cassandra-test.map

sudo salt <span class="hljs-string">'cassandra-1'</span> <span class="hljs-string">'mine.get'</span> <span class="hljs-string">'*'</span> <span class="hljs-string">'public_dns'</span></code></pre>
<p>We can't do much else, since we don't have anything installed on the nodes yet, but it's progress!</p>
<h2 id="the-cassandra-state">The Cassandra State</h2>
<p>The state definition includes everything a Cassandra node <em>has</em> to have in order to be part of the cluster: the installed binaries, a <code>cassandra</code> group and user, a config file, a data directory, and a running SystemD unit. The definition itself is sort of an ouroboros of YAML and Jinja:</p>
<h3 id="srvsaltcassandradefaultsyaml">srv/salt/cassandra/defaults.yaml</h3>
<p>First, there's a perfectly ordinary YAML file with some defaults. These could easily be in the pillar we set up above (or the pillar config could all be in this file); the principal distinction seems to be in whether you want to propagate changes via <code>saltutil.refresh_pillar</code>, or by (re)applying the Cassandra state either directly or via highstate. This is definitely more complicated than it needs to be right now, but given that this is my first major SaltStack project, I don't yet know enough to land on one side or the other, or if combining a defaults file with the pillar configuration will eventually be necessary.</p>
<pre><code class="hljs language-yaml">cassandra:
  dc: dc1
  rack: rack1</code></pre>
<h3 id="srvsaltcassandramapjinja">srv/salt/cassandra/map.jinja</h3>
<p>The map template loads the defaults file and merges it with the pillar, creating a <code>server</code> dictionary with all the Cassandra parameters we're setting.</p>
<pre><code class="hljs language-jinja"><span class="xml"></span><span class="hljs-template-tag">{% <span class="hljs-name">import_yaml</span> "cassandra/defaults.yaml" <span class="hljs-keyword">as</span> default_settings %}</span><span class="xml">

</span><span class="hljs-template-tag">{% <span class="hljs-name">set</span> server = salt['pillar.get']('cassandra', default=default_settings.cassandra, merge=True) %}</span><span class="xml"></span></code></pre>
<h3 id="srvsaltcassandrainitsls">srv/salt/cassandra/init.sls</h3>
<p>Finally, the Cassandra state entrypoint init.sls is another Jinja template that happens to look a lot like a YAML file and renders a YAML file, which for SaltStack is good enough. Jinja is required here since values from the <code>server</code> dictionary, like the server version or the TWCS JAR filename, need to be interpolated at the time the state is applied.</p>
<p>When the Cassandra state is applied to a fresh minion:</p>
<ol>
<li><code>wget</code> will be installed</li>
<li>A <code>CASSANDRA_VERSION</code> environment variable will be set to the value defined in the pillar</li>
<li>A user and group named <code>cassandra</code> will be created</li>
<li>A script named <code>install.sh</code> will download and extract Cassandra itself, once the above three conditions are met</li>
<li>A node configuration file named <code>cassandra.yaml</code> will be generated from a Jinja template and installed to <code>/etc/cassandra</code></li>
<li>If necessary, the TWCS jar will be added to the Cassandra lib directory</li>
<li>The directory <code>/var/lib/cassandra</code> will be created and chowned to the <code>cassandra</code> user</li>
<li>A SystemD unit for Cassandra will be installed and started once all its prerequisites are in order</li>
</ol>
<pre><code class="hljs language-yaml">{% <span class="hljs-keyword">from</span> <span class="hljs-string">"cassandra/map.jinja"</span> <span class="hljs-keyword">import</span> server <span class="hljs-keyword">with</span> context %}

wget:
  pkg.installed

cassandra:
  environ.setenv:
    - name: CASSANDRA_VERSION
    - value: {{ server.version }}

  cmd.script:
    - require:
      - pkg: wget
      - user: cassandra
      - environ: CASSANDRA_VERSION
    - source: salt://cassandra/files/install.sh
    - user: root
    - cwd: ~

  group.present: []

  user.present:
    - require:
      - group: cassandra
    - gid_from_name: <span class="hljs-keyword">True</span>
    - createhome: <span class="hljs-keyword">False</span>

  service.running:
    - enable: <span class="hljs-keyword">True</span>
    - require:
      - file: /etc/cassandra/cassandra.yaml
      - file: /etc/systemd/system/cassandra.service
{%- <span class="hljs-keyword">if</span> server.twcs_jar[server.version] %}
      - file: /opt/cassandra/lib/{{ server.twcs_jar[server.version] }}
{%- endif %}

<span class="hljs-comment"># Main configuration</span>
/etc/cassandra/cassandra.yaml:
  file.managed:
    - source: salt://cassandra/files/{{ server.version }}/cassandra.yaml
    - template: jinja
    - makedirs: <span class="hljs-keyword">True</span>
    - user: cassandra
    - group: cassandra
    - mode: <span class="hljs-number">644</span>

<span class="hljs-comment"># Load TWCS jar if necessary</span>
{%- <span class="hljs-keyword">if</span> server.twcs_jar[server.version] %}
/opt/cassandra/lib/{{ server.twcs_jar[server.version] }}:
  file.managed:
    - require:
      - user: cassandra
      - group: cassandra
    - source: salt://cassandra/files/{{ server.version }}/{{ server.twcs_jar[server.version] }}
    - user: cassandra
    - group: cassandra
    - mode: <span class="hljs-number">644</span>
{%- endif %}

<span class="hljs-comment"># Data directory</span>
/var/lib/cassandra:
  file.directory:
    - user: cassandra
    - group: cassandra
    - mode: <span class="hljs-number">755</span>

<span class="hljs-comment"># SystemD unit</span>
/etc/systemd/system/cassandra.service:
  file.managed:
    - source: salt://cassandra/files/cassandra.service
    - user: root
    - group: root
    - mode: <span class="hljs-number">644</span></code></pre>
<h3 id="srvsaltcassandrafilesinstallsh">srv/salt/cassandra/files/install.sh</h3>
<p>This script downloads and extracts the target version of Cassandra and points the symlink <code>/opt/cassandra</code> to it. If the target version already exists, it just updates the symlink since everything else is already set up.</p>
<pre><code class="hljs language-bash"><span class="hljs-meta">#!/bin/bash
</span>
<span class="hljs-function"><span class="hljs-title">update_symlink</span></span>() {
  rm /opt/cassandra
  ln -s <span class="hljs-string">"/opt/apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>"</span> /opt/cassandra

  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Updated symlink"</span>
}

<span class="hljs-comment"># already installed?</span>
<span class="hljs-keyword">if</span> [ -d <span class="hljs-string">"/opt/apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>"</span> ]; <span class="hljs-keyword">then</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Cassandra <span class="hljs-variable">$CASSANDRA_VERSION</span> is already installed!"</span>

  update_symlink

  <span class="hljs-built_in">exit</span> 0
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># download and extract</span>
wget <span class="hljs-string">"https://archive.apache.org/dist/cassandra/<span class="hljs-variable">$CASSANDRA_VERSION</span>/apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>-bin.tar.gz"</span>
tar xf <span class="hljs-string">"apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>-bin.tar.gz"</span>
rm <span class="hljs-string">"apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>-bin.tar.gz"</span>

<span class="hljs-comment"># install to /opt and link /opt/cassandra</span>
mv <span class="hljs-string">"apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>"</span> /opt
update_symlink

<span class="hljs-comment"># create log directory</span>
mkdir -p /opt/cassandra/logs

<span class="hljs-comment"># set ownership</span>
chown -R cassandra:cassandra <span class="hljs-string">"/opt/apache-cassandra-<span class="hljs-variable">$CASSANDRA_VERSION</span>"</span>
chown cassandra:cassandra /opt/cassandra</code></pre>
<p>It's probably possible to do most of this, at least the symlink juggling and directory management, with "pure" Salt (and the environment variable could be eliminated by rendering <code>install.sh</code> as a Jinja template with the <code>server</code> dictionary), but the script does what I want it to and it's already idempotent and centrally managed.</p>
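<p>One small variation, sketched under the assumption of a GNU or BSD <code>ln</code>: the <code>rm</code>-then-<code>ln</code> pair in <code>update_symlink</code> can be collapsed into a single call that also copes with the link not existing yet. The function name <code>point_link</code> is invented for this example.</p>
<pre><code class="hljs language-bash"># point a stable link at a versioned install directory
# usage: point_link TARGET_DIR LINK_PATH
point_link() {
  local target="$1" link="$2"
  # -f replaces an existing link; -n treats that link as a file
  # instead of following it into the old version's directory
  ln -sfn "$target" "$link"
}</code></pre>
<p>In the install script this would be called as <code>point_link "/opt/apache-cassandra-$CASSANDRA_VERSION" /opt/cassandra</code>, and calling it repeatedly leaves the same result each time.</p>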
<h3 id="srvsaltcassandrafilescassandraservice">srv/salt/cassandra/files/cassandra.service</h3>
<p>This is a basic SystemD unit, with some system limits customized to give Cassandra enough room to run. It starts whatever Cassandra executable it finds at /opt/cassandra, so all that's necessary to resume operations after the symlink changes during the upgrade is to restart the service.</p>
<pre><code class="hljs language-ini"><span class="hljs-section">[Unit]</span>
<span class="hljs-attr">Description</span>=Apache Cassandra database server
<span class="hljs-attr">Documentation</span>=http://cassandra.apache.org
<span class="hljs-attr">Requires</span>=network.target remote-fs.target
<span class="hljs-attr">After</span>=network.target remote-fs.target
<span class="hljs-section">
[Service]</span>
<span class="hljs-attr">Type</span>=forking
<span class="hljs-attr">User</span>=cassandra
<span class="hljs-attr">Group</span>=cassandra
<span class="hljs-attr">ExecStart</span>=/opt/cassandra/bin/cassandra -Dcassandra.config=file:///etc/cassandra/cassandra.yaml
<span class="hljs-attr">LimitNOFILE</span>=<span class="hljs-number">100000</span>
<span class="hljs-attr">LimitNPROC</span>=<span class="hljs-number">32768</span>
<span class="hljs-attr">LimitMEMLOCK</span>=infinity
<span class="hljs-attr">LimitAS</span>=infinity
<span class="hljs-section">
[Install]</span>
<span class="hljs-attr">WantedBy</span>=multi-user.target</code></pre>
<h3 id="srvsaltcassandrafiles2212cassandrayaml">srv/salt/cassandra/files/2.2.12/cassandra.yaml</h3>
<p>The full <code>cassandra.yaml</code> is enormous, so I won't reproduce it here. The interesting parts are where values are automatically interpolated by Salt. Like the Cassandra state, this is actually a Jinja template which <em>renders</em> a YAML file.</p>
<p>First, we get a list of internal IP addresses corresponding to <code>cassandra-seed</code> minions from the Salt mine and build a list of <code>known_seeds</code>.</p>
<pre><code class="hljs language-jinja"><span class="xml"></span><span class="hljs-template-tag">{%- <span class="hljs-name">from</span> 'cassandra/map.jinja' import server with context -%}</span><span class="xml">
</span><span class="hljs-template-tag">{% <span class="hljs-name">set</span> known_seeds = [] %}</span><span class="xml">
</span><span class="hljs-template-tag">{% <span class="hljs-name"><span class="hljs-name">for</span></span> minion, ip_array <span class="hljs-keyword">in</span> salt['mine.get']('cassandra-seed:true', 'network.ip_addrs', 'grain').items() if ip_array is not sameas false and known_seeds|<span class="hljs-name">length</span> &#x3C; 2 %}</span><span class="xml">
</span><span class="hljs-template-tag">{%   <span class="hljs-name"><span class="hljs-name">for</span></span> ip <span class="hljs-keyword">in</span> ip_array %}</span><span class="xml">
</span><span class="hljs-template-tag">{%     <span class="hljs-name">do</span> known_seeds.append(ip) %}</span><span class="xml">
</span><span class="hljs-template-tag">{%   <span class="hljs-name"><span class="hljs-name">endfor</span></span> %}</span><span class="xml">
</span><span class="hljs-template-tag">{% <span class="hljs-name"><span class="hljs-name">endfor</span></span> %}</span><span class="xml"></span></code></pre>
<p>This becomes the list of seeds the node looks for when trying to join the cluster.</p>
<pre><code class="hljs language-yaml">seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: <span class="hljs-string">"{{ known_seeds|unique|join(',') }}"</span></code></pre>
<p>Listen and broadcast addresses are configured per node. The broadcast addresses are a little special due to our network configuration needs: each node has to get its public DNS name from the Salt mine. This is perhaps a bit overcomplicated compared to a custom grain or capturing the output from running the Salt modules at render time, but it's there and it works and at this point messing with it isn't a great use of time.</p>
<pre><code class="hljs language-yaml">listen_address: {{ grains[<span class="hljs-string">'fqdn'</span>] }}
broadcast_address: {{ salt[<span class="hljs-string">'mine.get'</span>](grains[<span class="hljs-string">'id'</span>], <span class="hljs-string">'public_dns'</span>).items()[<span class="hljs-number">0</span>][<span class="hljs-number">1</span>] }}
rpc_address: {{ grains[<span class="hljs-string">'fqdn'</span>] }}
broadcast_rpc_address: {{ salt[<span class="hljs-string">'mine.get'</span>](grains[<span class="hljs-string">'id'</span>], <span class="hljs-string">'public_dns'</span>).items()[<span class="hljs-number">0</span>][<span class="hljs-number">1</span>] }}</code></pre>
<p>The cluster name and other central settings are interpolated from the pillar+defaults <code>server</code> dictionary.</p>
<pre><code class="hljs language-yaml">cluster_name: <span class="hljs-string">"{{ server.cluster_name }}"</span>
...
authenticator: <span class="hljs-string">"{{ server.authenticator }}"</span>
...
endpoint_snitch: <span class="hljs-string">"{{ server.endpoint_snitch }}"</span></code></pre>
<p>The changes to the Cassandra 3.0.8 configuration are identical.</p>
<h3 id="srvsaltcassandrafiles2212timewindowcompactionstrategy-225jar">srv/salt/cassandra/files/2.2.12/TimeWindowCompactionStrategy-2.2.5.jar</h3>
<p>See <a href="http://thelastpickle.com/blog/2017/01/10/twcs-part2.html">this post on TheLastPickle</a> for directions on building the TWCS jar.</p>
<h2 id="highstate">Highstate</h2>
<p>Finally, the Salt highstate needs to ensure that our <code>cassandra-*</code> nodes have the Java and Cassandra states applied. Salt-Cloud minions come preconfigured, however, so we also have to exclude the default <code>salt.minion</code> state from our Cassandra nodes; otherwise a highstate will blow away the cloud-specific configuration.</p>
<h3 id="srvsalttopsls-changes">srv/salt/top.sls changes</h3>
<pre><code class="hljs language-yaml">base:
  <span class="hljs-string">'not cassandra-*'</span>:
    - match: compound
    - salt.minion
  <span class="hljs-string">'cassandra-*'</span>:
    - sun-java
    - sun-java.env
    - cassandra</code></pre>
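<p>Before letting a highstate loose, it can be worth checking what the top file actually resolves to. This is a sketch with illustrative function names, using Salt's <code>state.show_top</code> and the <code>test=True</code> dry-run mode of <code>state.apply</code>:</p>
<pre><code class="hljs language-bash"># list which states the top file assigns to each Cassandra node
show_cassandra_top() {
  sudo salt 'cassandra-*' state.show_top
}

# dry run: report what a highstate would change without changing anything
preview_highstate() {
  sudo salt 'cassandra-*' state.apply test=True
}

# apply the highstate for real
apply_highstate() {
  sudo salt 'cassandra-*' state.apply
}</code></pre>
<p>If <code>salt.minion</code> shows up in the output of <code>show_cassandra_top</code>, the compound match in the top file isn't excluding the Cassandra nodes as intended.</p>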
<h2 id="startup">Startup!</h2>
<p>Set the Salt config dir to <code>etc</code> with <code>-c</code> and pass in the map file with <code>-m</code>:</p>
<pre><code class="hljs">sudo salt-cloud -<span class="hljs-built_in">c</span> etc -m cassandra-test.<span class="hljs-built_in">map</span></code></pre>
<p>To clean up:</p>
<pre><code class="hljs">sudo salt-cloud -d cassandra<span class="hljs-number">-1</span> cassandra<span class="hljs-number">-2</span> cassandra<span class="hljs-number">-3</span></code></pre>]]></description>
            <link>https://di.nmfay.com/salt-cloud-cassandra</link>
            <guid isPermaLink="true">https://di.nmfay.com/salt-cloud-cassandra</guid>
            <pubDate>Wed, 27 Mar 2019 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Automatic Node Deploys to Elastic Beanstalk]]></title>
            <description><![CDATA[<p>One of my favorite good ideas to ignore is the maxim that you should have your deployment pipeline ready to go before you start writing code. There's always some wrinkle you couldn't have anticipated anyway, so while it sounds good on paper I just don't think it's the best possible use of time. But with anything sufficiently complicated, there's a point where you just have to buckle down and automate rather than waste time repeating the same steps yet again (or, worse, forgetting one). I hit that point recently: the application isn't in production yet, so I'd been "deploying" by means of pulling the repo on an EC2 server, installing dependencies and building in-place, then killing and restarting the node process with <code>nohup</code>. Good enough for demos, not sustainable long-term. Also, I might have in fact missed a step Friday before last and not realized things were mostly broken until the following Monday.</p>
<p>I'd been using CircleCI to build and test the application already, so I wanted to stick with it for deployment as well. However, this precluded using the same EC2 instance: the build container would need to connect to it to run commands over SSH, <em>but</em> this connection would be coming from any of a huge possible range of build container IP addresses. I didn't want to open the server up to the whole world to accommodate the build system. Eventually I settled on Elastic Beanstalk, which can be controlled through the AWS command-line interface with the proper credentials instead of the morass of VPCs and security groups. Just upload a zip file!</p>
<p>The cost of using Elastic Beanstalk, it turned out, was that while it made difficult things easy it also made easy things difficult. How do you deploy the same application to different environments? You don't. Everything has to be in that zip file, and if that includes any per-environment configuration then the right config files had better be where they're expected to be. This is less than ideal, but at least it can be scripted. Here's the whole thing (assuming <code>awscli</code> has already been installed):</p>
<pre><code class="hljs language-bash"><span class="hljs-comment"># what time is it?</span>
TIMESTAMP=$(date +%Y%m%d%H%M%S)

<span class="hljs-comment"># work around Elastic Beanstalk permissions for node-gyp (bcrypt)</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"unsafe-perm=true"</span> > .npmrc

<span class="hljs-comment"># generate artifacts</span>
npm run build

<span class="hljs-comment"># download config</span>
aws s3 cp s3://elasticbeanstalk-bucket-name/app/development.config.json .

<span class="hljs-comment"># zip everything up</span>
zip -r app-dev.zip . \
  --exclude <span class="hljs-string">"node_modules/*"</span> <span class="hljs-string">".git/*"</span> <span class="hljs-string">"coverage/*"</span> <span class="hljs-string">".nyc_output/*"</span> <span class="hljs-string">"test/*"</span> <span class="hljs-string">".circleci/*"</span>

<span class="hljs-comment"># upload to s3</span>
aws s3 mv ./app-dev.zip s3://elasticbeanstalk-bucket-name/app/app-dev-<span class="hljs-variable">$TIMESTAMP</span>.zip

<span class="hljs-comment"># create new version</span>
aws elasticbeanstalk create-application-version --region us-west-2 \
  --application-name app --version-label development-<span class="hljs-variable">$TIMESTAMP</span> \
  --<span class="hljs-built_in">source</span>-bundle S3Bucket=elasticbeanstalk-bucket-name,S3Key=app/app-dev-<span class="hljs-variable">$TIMESTAMP</span>.zip

<span class="hljs-comment"># deploy to dev environment</span>
<span class="hljs-comment"># --application-name app is not specified because apt installs</span>
<span class="hljs-comment"># an older version of awscli which doesn't accept that option</span>
aws elasticbeanstalk update-environment --region us-west-2 --environment-name app-dev \
  --version-label development-<span class="hljs-variable">$TIMESTAMP</span></code></pre>
<p>The <code>TIMESTAMP</code> ensures the build can be uniquely identified later. The <code>.npmrc</code> setting is for AWS reasons: as detailed in <a href="https://stackoverflow.com/questions/46001516/beanstalk-node-js-deployment-node-gyp-fails-due-to-permission-denied">this StackOverflow answer</a>, the unfortunately-acronymed <code>node-gyp</code> runs as the instance's ec2-user account and doesn't have permissions it needs to compile bcrypt. If you're not using bcrypt (or another project that involves a <code>node-gyp</code> step on install), you don't need that line.</p>
<p>The zip is assembled in three steps:</p>
<ol>
<li><code>npm run build</code> compiles stylesheets, dynamic Pug templates, frontend JavaScript, and so forth.</li>
<li>The appropriate environment config is downloaded from an S3 bucket.</li>
<li>Everything is rolled together in the zip file, minus the detritus of source control and test results.</li>
</ol>
<p>Finally, the Elastic Beanstalk deploy happens in two stages:</p>
<ol>
<li><code>aws elasticbeanstalk create-application-version</code> does what it sounds like: each timestamped zip file becomes a new "version". These don't map exactly to versions as more commonly understood thanks to the target environment configuration, so naming them for the target environment and giving the timestamp helps identify them.</li>
<li><code>aws elasticbeanstalk update-environment</code> actually deploys the newly-created "version" to the destination environment.</li>
</ol>
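<p>As a sketch of how the names in those two stages hang together, the version label and the S3 key can both be derived from the timestamp in one place. The function names are invented for this example; the bucket, application, region, and environment names are the same placeholders used in the script above.</p>
<pre><code class="hljs language-bash"># version label for an environment, e.g. development-20181008093000
version_label() {
  printf '%s-%s\n' "$1" "$2"
}

# S3 key of the uploaded bundle, e.g. app/app-dev-20181008093000.zip
bundle_key() {
  printf 'app/app-%s-%s.zip\n' "$1" "$2"
}

# the two deploy stages for the dev environment, given a timestamp
deploy_dev() {
  local ts="$1"
  # stage one: register the uploaded zip as a new "version"
  aws elasticbeanstalk create-application-version --region us-west-2 \
    --application-name app --version-label "$(version_label development "$ts")" \
    --source-bundle "S3Bucket=elasticbeanstalk-bucket-name,S3Key=$(bundle_key dev "$ts")" \
    || return 1
  # stage two: point the environment at that version
  aws elasticbeanstalk update-environment --region us-west-2 --environment-name app-dev \
    --version-label "$(version_label development "$ts")"
}</code></pre>
<p>The early <code>return</code> keeps the environment from being updated to a version that was never created.</p>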
<p>Obviously, when it comes time to roll the project out to production, I'll factor the environment out into a variable to download and upload the appropriate artifacts. But even in its current state, this one small script has almost made deployment continuous: every pushed commit gets deployed to Elastic Beanstalk with no manual intervention, unless there are database changes. That's next.</p>]]></description>
            <link>https://di.nmfay.com/node-elastic-beanstalk</link>
            <guid isPermaLink="true">https://di.nmfay.com/node-elastic-beanstalk</guid>
            <pubDate>Mon, 08 Oct 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Surrealist Remixes with Markov Chains]]></title>
            <description><![CDATA[<p>There's a new button at the bottom of this (and each) post. Try clicking it! (If you're reading this on <a href="https://dev.to">dev.to</a> or in an RSS reader, you'll need to visit <a href="https://di.nmfay.com/markov-remix">di.nmfay.com</a> to see it.)</p>
<p>By now everyone's run into Twitter bots and automated text generators that combine words in ways that <em>almost</em> compute. There's even a <a href="https://www.reddit.com/r/SubredditSimulator/">subreddit</a> that runs the user-generated content of other subreddits through individual accounts which make posts that seem vaguely representative of their sources, but either defy comprehension or break through into a sublime silliness.</p>
<p>People have engaged in wordplay (and word-work) for as long as we've communicated with words. Taking language apart and putting it back together in novel ways has been the domain of poets, philosophers, and magicians alike for eons, to say nothing of puns, dad jokes, glossolalia, and word salad.</p>
<p>In the early 20th century, artists associated with the Surrealist movement played a game, variously for entertainment and inspiration, called "exquisite corpse". Each player writes a word (in this version, everyone is assigned a part of speech ahead of time) or draws on an exposed section of paper, then folds the sheet over to obscure their work from the next player. Once everyone's had a turn, the full sentence or picture is revealed. The game takes its name from its first recorded result: <em>le cadavre exquis boira le vin nouveau</em>, or "the exquisite corpse shall drink the new wine".</p>
<p>The Surrealist seeds fell on fertile ground and their ideas spread throughout the artistic and literary world, just as they themselves had been informed by earlier avant-garde movements like Symbolism and Dada. In the mid-century, writers and occultists like Brion Gysin and William Burroughs used similar techniques to discover new meanings in old texts. The only real difference in our modern toys is that they run on their own -- it's a little bit horror movie ouija board, except you can see the workings for yourself.</p>
<p>There are a variety of ways to implement this kind of functionality. On the more primitive side, you have "mad libs" algorithms which select random values to insert into known placeholders, as many Twitter bots such as <a href="https://twitter.com/godtributes">@godtributes</a> or <a href="https://twitter.com/bottest_takes">@bottest_takes</a> do. This method runs up against obvious limitations fairly quickly: the set of substitutions is finite, and the structure they're substituted into likewise becomes predictable.</p>
<p>More advanced text generators are predictive, reorganizing words or phrases from a body of text or <em>corpus</em> in ways which reflect the composition of the corpus itself: words aren't simply jumbled up at random, but follow each other in identifiable sequences. Many generators like these run on Markov chains, probabilistic state machines where the likelihood of each possible next state depends only on the current state, not on anything that came before it.</p>
<h2 id="implementing-a-textual-markov-chain">Implementing a Textual Markov Chain</h2>
<p>The first order of business in using a Markov chain to generate text is to break up the original corpus. Regular expressions matching whitespace make that easy enough, turning it into an array of words. The next step is to establish the links between states, which is where things start getting a little complex.</p>
<p>Textual Markov chains have one important parameter: the prefix length, which defines how many previous states (words) comprise the current state and must be evaluated to find potential next states. Prefixes must comprise at least one word, but for the purposes of natural-seeming text generation the sweet spot tends to be between two and four words depending on corpus length. With too short a prefix length, the output tends to be simply garbled; too long a prefix or too short a corpus, and there may be too few potential next states for the chain to diverge from the original text.</p>
<p>Mapping prefixes to next states requires a sliding window on the array. This is more easily illustrated. Here's a passage from <em>Les Chants de Maldoror</em>, a 19th-century prose poem rediscovered and given new fame (or infamy) by the Surrealists, who identified in its obscene grandiosity a deconstruction of language and the still-developing format of the modern novel that prefigured their own artistic ideology:</p>
<blockquote>
<p>He is as fair as the retractility of the claws of birds of prey; or again, as the uncertainty of the muscular movements in wounds in the soft parts of the lower cervical region; or rather, as that perpetual rat-trap always reset by the trapped animal, which by itself can catch rodents indefinitely and work even when hidden under straw; and above all, as the chance meeting on a dissecting-table of a sewing-machine and an umbrella!</p>
</blockquote>
<p>Assuming a prefix length of 2, the mapping might start to take this shape:</p>
<pre><code class="hljs language-json"><span class="hljs-string">"He is"</span>: [<span class="hljs-string">"as"</span>],
<span class="hljs-string">"is as"</span>: [<span class="hljs-string">"fair"</span>],
<span class="hljs-string">"as fair"</span>: [<span class="hljs-string">"as"</span>],
<span class="hljs-string">"fair as"</span>: [<span class="hljs-string">"the"</span>]</code></pre>
<p>Starting from the first prefix ("He is"), there is only one next state possible since the words "He is" only appear once in the corpus. Upon reaching the next state, the active prefix is now "is as", which likewise has only one possible next state, and so forth. But when the current state reaches "as the", the next word to be added may be "retractility", "uncertainty", or "chance", and what happens after that depends on the route taken. Multiple next states introduce the potential for divergence; this is also why having too long a prefix length, or too short a corpus, results in uninteresting output!</p>
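<p>As a rough sketch of that sliding window (a hypothetical <code>buildChain</code> helper written for illustration, not the code in <code>remix.js</code>), each prefix can be stringified and used as a key into a map of next states. The sample text is an abridged stand-in for the passage above:</p>
<pre><code class="hljs language-javascript">// Hypothetical helper: map each stringified prefix to the words that follow it.
function buildChain(corpus, prefixLength) {
  // Split on runs of whitespace and drop any empty strings at the edges.
  const words = corpus.split(/\s+/).filter(Boolean);
  const chain = new Map();

  // Every word from index prefixLength onward is a "next state" for the
  // prefixLength words immediately before it.
  words.slice(prefixLength).forEach((next, i) => {
    const key = JSON.stringify(words.slice(i, i + prefixLength));
    if (!chain.has(key)) chain.set(key, []);
    chain.get(key).push(next);
  });

  return chain;
}

// Abridged sample, assumed for the example only.
const sample = 'as fair as the retractility of the claws; or again, as the ' +
  'uncertainty of the muscular movements; and above all, as the chance meeting';
const maldoror = buildChain(sample, 2);

console.log(maldoror.get(JSON.stringify(['as', 'the'])));
// [ 'retractility', 'uncertainty', 'chance' ]</code></pre>
<p>Repeated prefixes accumulate several next states, while prefixes that occur once get a single-element array and leave the chain no choice.</p>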
<p>Because the prefix is constantly losing its earliest word and appending the next, it's stored as a stringified array rather than as a concatenated string. The order of operations goes like this:</p>
<ol>
<li>Select one of the potential next states for the current stringified prefix array.</li>
<li><code>shift</code> the earliest word out of the prefix array and <code>push</code> the selected next word onto the end.</li>
<li>Stringify the new prefix array.</li>
<li>Repeat until bored, or until there's no possible next state.</li>
</ol>
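<p>Those steps might look something like the following. This is a minimal sketch rather than the actual implementation: <code>step</code>, <code>generate</code>, and the <code>choose</code> callback are names invented here, and the tiny chain is written out by hand so the example stands alone.</p>
<pre><code class="hljs language-javascript">// Hypothetical sketch: advance the chain by one word, mutating the prefix array.
function step(chain, prefix, choose) {
  const candidates = chain.get(JSON.stringify(prefix));
  if (!candidates) return null;   // no possible next state: time to stop

  const next = choose(candidates); // 1. select one of the potential next states
  prefix.shift();                  // 2. drop the earliest word...
  prefix.push(next);               //    ...and append the selected one
  return next;                     // 3. the next call stringifies the new prefix
}

// 4. Repeat until bored (maxWords) or until the chain runs dry.
function generate(chain, start, choose, maxWords) {
  const prefix = start.slice();    // copy so the caller's array is untouched
  const output = start.slice();

  while (maxWords > output.length) {
    const next = step(chain, prefix, choose);
    if (next === null) break;
    output.push(next);
  }

  return output.join(' ');
}

// A hand-built chain, assumed for the example only.
const smallChain = new Map();
smallChain.set(JSON.stringify(['fair', 'as']), ['the']);
smallChain.set(JSON.stringify(['as', 'the']), ['retractility', 'uncertainty', 'chance']);
smallChain.set(JSON.stringify(['the', 'chance']), ['meeting']);

// Deterministic chooser so the output is predictable.
const pickLast = (candidates) => candidates[candidates.length - 1];

console.log(generate(smallChain, ['fair', 'as'], pickLast, 10));
// fair as the chance meeting</code></pre>
<p>Passing the chooser in as a function keeps the loop the same whether the next word is picked at random, for instance with <code>(c) => c[Math.floor(Math.random() * c.length)]</code>, or by a person clicking a button.</p>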
<h2 id="remix">Remix!</h2>
<p>If you're interested in the actual code, it's <code>remix.js</code> in devtools, or you can find it in <a href="https://gitlab.com/dmfay/blog/blob/master/assets/remix.js">source control</a>.</p>
<p>Markov chain generators aren't usually interactive; that's where the "probabilistic" part of "probabilistic state machine" comes into play. This makes the implementation here incomplete by design. Where only one possible next state exists, the state machine advances on its own, but where there are multiple, it allows the user to choose how to proceed. This, along with starting from the beginning instead of selecting a random opening prefix, gives it a more exploratory direction than if it simply restructured the entire corpus at the push of a button. The jury's still out on whether any great insights lie waiting to be unearthed, as the more mystically-minded practitioners of aleatory editing hoped, but in the meantime, the results are at least good fun.</p>]]></description>
            <link>https://di.nmfay.com/markov-remix</link>
            <guid isPermaLink="true">https://di.nmfay.com/markov-remix</guid>
            <pubDate>Sun, 05 Aug 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Summer 2018: Massive, Twice Over]]></title>
            <description><![CDATA[<p><a href="https://vimeo.com/281409168">NDC talks are up</a>!</p>
<p>There's also the <a href="https://skillsmatter.com/skillscasts/12008-database-as-api-with-postgresql-and-massive-js">FullStack London</a> version, which is slightly condensed for a shorter timeslot, if you have a SkillsMatter account and want to get right to the fun parts.</p>
<p>If you've read (almost) anything I've written, text or code, odds are you've run into <a href="https://massivejs.org">Massive.js</a>. On the off chance you haven't, the elevator pitch is that PostgreSQL exclusivity lets you get a lot more mileage out of your database (as long as it's Postgres) and JavaScript being a dynamically typed, functional-ish language lets you get away with it really easily.</p>
<p>This talk goes over Massive in much more depth: first laying out a case for alternatives to the dominant object-relational mapping data access technique, in general and especially in JavaScript; and then diving into the architecture of Massive itself with plenty of examples. Also, there's some trivia about early 20th century Russian avant-garde art and another bit poking fun at French modernist architect Le Corbusier.</p>
<p>It's the second talk I've done, and overall I was pretty happy with how it went in Oslo and London both! I'm the furthest thing from a natural public speaker but I covered what I wanted to cover, finished at a reasonable time, and didn't screw anything up too badly -- so that's a success in my book. And after all, the only way to improve this particular skill is to keep doing it.</p>]]></description>
            <link>https://di.nmfay.com/summer-2018</link>
            <guid isPermaLink="true">https://di.nmfay.com/summer-2018</guid>
            <pubDate>Mon, 30 Jul 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Centralize Your Query Logic!]]></title>
            <description><![CDATA[<p>At a talk I gave earlier this month, an audience member asked if <a href="https://massivejs.org">Massive</a> supported joining information from multiple tables together. It's come up on the issue tracker before as well. Massive does not currently have this functionality, and while I'm open to suggestions it's not on my own radar.</p>
<p>The central reason for this is that join logic can be tricky to manage from the application architecture side. The ability to correlate and combine what you need when you need it is certainly powerful, but it also embeds assumptions about your database layout in client code. As the database and application evolve, these assumptions can easily fall out of date and out of sync with each other. In real terms, if your application's "model" (whether implicit or explicit) of a user loaded from the database includes only the user record itself sometimes, but other times looks for information in a separate profile table, adds current statistics, et cetera, and you have functionality that operates on A User, either you understand that users come in different shapes and handle them accordingly across the board or you are living on borrowed uptime.</p>
<p>Some application architectures approach this scenario by grouping the query logic together. In the enterprise world, <em>n-tier</em> applications frequently pull related queries into "services" or Data Access Objects (DAOs) so there's at least some kind of organizational schema. This reduces the maintenance overhead somewhat, but it's an imperfect solution, not least because there's nothing but fallible code reviews (if that) standing in the way of someone dropping data access code somewhere else.</p>
<p>Fortunately, there's already part of the application-database ecosystem dedicated to organizing things -- the database itself! And as an organizing principle, it already has its own way to manage complex queries. Sure, it'll involve writing a little SQL, but let's face it: you were going to wind up writing SQL eventually anyway.</p>
<p>If you've only scratched the surface of working with databases, you might not be familiar with views. The good news is they're pretty straightforward: a view is a stored SQL query with a name, given life with the statement <code>CREATE VIEW myview AS SELECT...</code>. You can <code>SELECT</code> from a view just like you can from a table, optionally with <code>JOIN</code>s and a <code>WHERE</code> clause and all the other trimmings, whereupon the database executes the stored query. Results are not stored, so the information you get out of a view is always current, unless you intentionally sacrifice realtime data for speed by creating a <em>materialized</em> view, which does persist results and has to be refreshed manually.</p>
<p>The reason views are underrated and underutilized in application development has mostly to do with the frameworks developers use to communicate with databases. When you have to provide a concrete implementation of a unary <code>User</code> model, odds are you only care about things you can both read <em>and</em> write to, so you back it up with tables instead of using views to shape data for your needs. There's little room for views in object/relational mapping, and when I've had to use O/RMs I've really only been able to take advantage of views to streamline the raw SQL queries you have to write anyway when you use O/RMs.</p>
<p>If you're not stuck with an object-relational mapper, though, you can really get your money's worth out of views! Retrieving user records from a view, or building more complex user-inclusive results by joining it into other views, ensures that you have a consistent definition of <em>what information comprises a user</em> built into your database. You can't always stop other developers from winging it, naturally, but having that central definition to point to eliminates at least one major potential ambiguity. Massive's omission of the join feature encourages developers using it to center their thinking on the database and the tools it offers for organizing information.</p>
<p>As with anything, there are tradeoffs. Here, it's flexibility. Views may be ephemeral stored queries, but they're still part of the database schema for all that, and the schema takes more planning and effort to change than does application code. But it's a good idea to be thinking carefully about this stuff in the first place.</p>]]></description>
            <link>https://di.nmfay.com/views</link>
            <guid isPermaLink="true">https://di.nmfay.com/views</guid>
            <pubDate>Wed, 25 Jul 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Shell Bonsai with tree]]></title>
            <description><![CDATA[<p>The shell has just about all the tooling I need for day-to-day operation of a computer: navigating and managing directories and files, text editing, and building, testing, and running projects I'm working on. What it <em>isn't</em> so great at is layouts, or really, displaying anything that isn't a text file (as fun as it is, I'm unwilling to switch out a proper image viewer for <a href="https://github.com/radare/tiv">tiv</a>).</p>
<p>Directory trees are one of the more commonly-encountered layouts that don't do too well with monospaced ASCII. There's the venerable <code>tree</code> -- and that just about covers the possibilities, because there aren't many more ways to display that kind of structure under those constraints. Fortunately, <code>tree</code> comes with amenities, <a href="https://www.systutorials.com/docs/linux/man/1-tree/">from pattern-matching to JSON output</a>.</p>
<p>I also do a lot of work on projects which contain certain files I don't care about. With git, I use a <a href="https://git-scm.com/docs/gitignore"><code>.gitignore</code></a> file in the project root to ensure I don't accidentally add and commit them. This file gets used by more than git, too: my search utility of choice, <a href="https://github.com/BurntSushi/ripgrep">ripgrep</a>, respects <code>.gitignore</code> rules, as do many other tools all the way up to graphical IDEs.</p>
<p><code>tree</code>, which predates git by something like a decade at absolute minimum, does not care about your <code>.gitignore</code>. When inspecting the layout of a repository with a moderately-sized ignore ruleset and/or something like <code>node_modules</code>, this makes it all but unusable.</p>
<p>One of <code>tree</code>'s features is the <code>-I</code> flag, which ignores files matching a wildcard pattern similar to that used in <code>.gitignore</code>. That means it should be possible to hack something together which respects <code>.gitignore</code> rules without mucking around in coreutils: other system tools output and manipulate files, <code>xargs</code> can manage other commands' arguments, and pipes hook the whole thing together.</p>
<p>Here's the full alias from my <code>.zshrc</code>, if you're just interested in that part (note it all needs to be on one line):</p>
<pre><code class="hljs language-bash"><span class="hljs-built_in">alias</span> trii=<span class="hljs-string">"(cat .gitignore &#x26; echo '.git') |
  sed 's/^\(.\+\)$/\1\|/' |
  tr -d '\n' |
  xargs printf \"-I '%s'\" |
  xargs tree -C"</span></code></pre>
<p>With the exception of <code>-I</code>, you can still pass <code>tree</code>'s arguments to <code>trii</code>, so the rest of its toolkit is still available. It's also safe if there's no ignore file in the current directory.</p>
<p>Now, in more depth:</p>
<pre><code class="hljs language-bash">(cat .gitignore &#x26; <span class="hljs-built_in">echo</span> <span class="hljs-string">'.git'</span>)</code></pre>
<p><code>cat</code> dumps the ignore file to standard output and <code>echo</code> prints the string ".git" so that the full ruleset excludes the repository directory itself (only a problem with the <code>-a</code> switch, which displays hidden files and directories). The single <code>&#x26;</code> starts <code>cat</code> in the background and lets <code>echo</code> run immediately, so the two outputs can arrive in either order, which doesn't matter for a list of ignore patterns. It also means <code>echo</code> runs whether or not <code>cat</code> succeeds, unlike the more common double <code>&#x26;&#x26;</code>, which only runs the second command if the first exits with zero; a semicolon would do the same thing strictly in sequence. The parentheses run the whole thing in a subshell, so the combined output of both commands is piped into the next segment.</p>
<pre><code class="hljs language-bash">sed <span class="hljs-string">'s/^\(.\+\)$/\1\|/'</span></code></pre>
<p>You can't specify multiple <code>-I</code> values: the last one always wins. Instead, <code>-I</code> can read multiple patterns which are joined together with pipe <code>|</code> characters. That's possible, but it's going to take a couple of steps.</p>
<p><code>sed</code> is a <strong>s</strong>tream <strong>ed</strong>itor which modifies each line coming from the previous segment. Here, it's simply appending the pipe character to every non-empty line; blank lines don't match the pattern and pass through unchanged. Because <code>sed</code> operates on each line as a discrete entity, it can't join them together; that's up to the next segment:</p>
<pre><code class="hljs language-bash">tr -d <span class="hljs-string">'\n'</span></code></pre>
<p>Unlike <code>sed</code>, <code>tr</code> (<strong>tr</strong>anslate) operates on standard input as it comes in, character by character, instead of line by line. The <code>-d</code> switch deletes characters, here the newline. This completes the ignore pattern, with a sample project's <code>.gitignore</code> transformed into this:</p>
<pre><code class="hljs">.git|<span class="hljs-string">src</span>|<span class="hljs-string">pkg</span>|<span class="hljs-string">**/*.tar.xz</span>|</code></pre>
<p>There's a terminating pipe, but it doesn't make a difference to <code>tree</code>. This line gets passed to yet another command:</p>
<pre><code class="hljs language-bash">xargs <span class="hljs-built_in">printf</span> <span class="hljs-string">"-I '%s'"</span></code></pre>
<p><code>xargs</code> reads standard input and hands it to another command as arguments. Here there's only one line, since <code>tr</code> removed all the newline characters, and it's being passed to <code>printf</code>. This is not to be confused with the C standard library function <code>printf</code>: it's a standalone program in the GNU coreutils, although it does much the same thing as its near relative. The net effect of this command is to print the <code>-I</code> switch <em>and</em> the concatenated ignore list together.</p>
<pre><code class="hljs language-bash">xargs tree -C</code></pre>
<p>Finally, it's time to invoke <code>tree</code>! The <code>-C</code> flag adds color to the output. <code>xargs</code> passes the combined <code>-I</code> and ignorelist into the command string, and the result is a <code>tree</code> that excludes everything from the <code>.gitignore</code>.</p>]]></description>
            <link>https://di.nmfay.com/bonsai</link>
            <guid isPermaLink="true">https://di.nmfay.com/bonsai</guid>
            <pubDate>Sun, 01 Jul 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Automating Maven Releases with CircleCI]]></title>
            <description><![CDATA[<p>Maven's probably the only all-in-one build tool I've ever really <em>appreciated</em>. I'll probably come to like <code>make</code> eventually and cement my status as old-before-her-time *nix crone, but I haven't had a reason to really dig into it yet so Maven it is. And I'm back at a mostly-Java shop, so let's have some fun!</p>
<p>This week's goal: automating releases from our CircleCI instance. Sounds simple enough, right? Bump the version, cut a tag, publish. How hard could it be?</p>
<p>Well, first off, we're using <a href="https://nvie.com/posts/a-successful-git-branching-model/">git-flow</a>, or at least we're preserving <code>master</code> for releases and working off a separate <code>verify</code> branch. Budget git-flow, if you will. That's one complication, since the release has to be tagged on <code>master</code> but <code>verify</code> also needs to be updated so the two don't diverge.</p>
<p>If you're familiar with Maven you may already have guessed the second complication. It's trickier. Maven doesn't work in nice, straightforward <a href="https://semver.org">semver</a>: Maven accepts several different versioning schemes and has a special <code>SNAPSHOT</code> qualifier for non-release builds. If you're working towards a 1.0 release, your version number is 1.0-SNAPSHOT. After you cut the release, you resume development with 1.1-SNAPSHOT (or 2.0-SNAPSHOT if it really needs a rework already). And so on. It's not <em>meant</em> to be automated, because releases are a <em>big deal</em> in the Maven world and you're expected to have a plan for what you're going to do next instead of reacting to whether you fixed bugs, introduced features, or broke compatibility. And honestly, there are some compelling arguments for doing it this way.</p>
<p>I'm not going to go into them because I'm one half of the software team by myself and they're less applicable working on proprietary stuff at this scale. So let's get to automating!</p>
<h2 id="workflow">Workflow</h2>
<p>We're using Circle v2 and its workflow feature to organize the build. Every branch gets built: <code>verify</code> and <code>master</code> get deployed to Artifactory, while <code>release</code> triggers its own job, and that job is the linchpin of the whole structure.</p>
<pre><code class="hljs language-yaml">workflows:
  version: <span class="hljs-number">2</span>
  build-<span class="hljs-keyword">and</span>-deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build
          filters:
            branches:
              only: /^(master|verify)$/
      - release:
          requires:
            - build
          filters:
            branches:
              only: /^release$/</code></pre>
<h2 id="just-build">Just Build</h2>
<p>I'll be honest, I copied &#x26; pasted most of this job definition right out of the docs:</p>
<pre><code class="hljs language-yaml">steps:
  - checkout
  - restore_cache:
      keys:
      - v1-dependencies-{{ checksum <span class="hljs-string">"pom.xml"</span> }}
      <span class="hljs-comment"># fallback to using the latest cache if no exact match is found</span>
      - v1-dependencies-
  - run: mvn clean install
  - save_cache:
      paths:
        - ~/.m2
      key: v1-dependencies-{{ checksum <span class="hljs-string">"pom.xml"</span> }}
  - persist_to_workspace:
      &#x3C;&#x3C;: *source</code></pre>
<p>We're caching our dependencies because that's how one does it; <code>mvn clean install</code> is likely overkill (we probably don't need to bother with installing the built artifact to the local Maven repository) but it builds and runs our tests and generates the artifact. The only really interesting part here is that we're persisting the important files to a workspace so we can recover them later -- <code>*source</code> refers to another YAML block with a <code>root</code> string and list of <code>paths</code>.</p>
<h2 id="and-deploy">And Deploy</h2>
<pre><code class="hljs language-yaml">steps:
  - attach_workspace:
      at: .
  - run:
      name: Deploy to Artifactory
      command: mvn deploy</code></pre>
<p>Here's where we use that workspace. Whenever this job runs, it'll reattach the file structure we saved from the build job. <code>mvn deploy</code> still runs all the preceding lifecycle phases because that's how Maven rolls, but we don't need to check out the code again.</p>
<p>We've got our POMs set up with the <a href="https://www.jfrog.com/confluence/display/RTF/Maven+Artifactory+Plugin">artifactory-maven-plugin</a> so all we have to do to publish is issue <code>mvn deploy</code>. That makes that easy, at least; there's the Artifactory CLI if you prefer, but Maven's whole deal is managing everything so as far as I'm concerned we should let it.</p>
<p>There's just one piece missing, though: how do we actually release a new <em>version</em> of the artifact and set up to begin on the next?</p>
<h2 id="the-release-trigger">The Release Trigger</h2>
<p>One of the ideas of git-flow is that when you're gearing up for a release, you cut a new branch that only contains work towards that release. This is great if you're working on multiple versions of the code simultaneously and releases can take a while, so you might cherry-pick a bugfix from current development into a legacy release branch to ensure it doesn't affect a subset of your users. Since we're not a product company, we don't really have to worry about that. We're always working on the next release, and it drops when it's ready to drop.</p>
<p>This is going to get complicated. Here are the <code>release</code> build steps in full:</p>
<pre><code class="hljs language-yaml">steps:
  - checkout
  - run:
      name: Cut new release
      command: |
        <span class="hljs-comment"># assemble current and new version numbers</span>
        OLD_VERSION=$(mvn -s .circleci/settings.xml -q \
          -Dexec.executable=<span class="hljs-string">"echo"</span> -Dexec.args=<span class="hljs-string">'${project.version}'</span> \
          --non-recursive org.codehaus.mojo:<span class="hljs-keyword">exec</span>-maven-plugin:<span class="hljs-number">1.3</span><span class="hljs-number">.1</span>:<span class="hljs-keyword">exec</span>)
        NEW_VERSION=<span class="hljs-string">"${OLD_VERSION/-SNAPSHOT/}"</span>
        echo <span class="hljs-string">"Releasing $OLD_VERSION as $NEW_VERSION"</span>

        <span class="hljs-comment"># ensure dependencies use release versions</span>
        mvn -s .circleci/settings.xml versions:use-releases

        <span class="hljs-comment"># write release version to POM</span>
        mvn -s .circleci/settings.xml versions:set -DnewVersion=<span class="hljs-string">"$NEW_VERSION"</span>

        <span class="hljs-comment"># setup git</span>
        git config user.name <span class="hljs-string">"Release Script"</span>
        git config user.email <span class="hljs-string">"builds@understoryweather.com"</span>

        <span class="hljs-comment"># commit and tag</span>
        git add pom.xml
        git commit -m <span class="hljs-string">"release: $NEW_VERSION"</span>
        git tag <span class="hljs-string">"$NEW_VERSION"</span>

        <span class="hljs-comment"># land on master and publish</span>
        git checkout master
        git merge --no-edit release
        git push origin master --tags

        <span class="hljs-comment"># increment minor version number</span>
        MAJ_VERSION=$(echo <span class="hljs-string">"$NEW_VERSION"</span> | cut -d <span class="hljs-string">'.'</span> -f <span class="hljs-number">1</span>)
        MIN_VERSION=$(echo <span class="hljs-string">"$NEW_VERSION"</span> | cut -d <span class="hljs-string">'.'</span> -f <span class="hljs-number">2</span>)
        NEW_MINOR=$(($MIN_VERSION + <span class="hljs-number">1</span>))
        DEV_VERSION=<span class="hljs-string">"$MAJ_VERSION.$NEW_MINOR-SNAPSHOT"</span>

        <span class="hljs-comment"># ready development branch</span>
        git checkout verify
        git merge --no-edit release
        mvn -s .circleci/settings.xml versions:set -DnewVersion=<span class="hljs-string">"$DEV_VERSION"</span>
        git add pom.xml
        git commit -m <span class="hljs-string">"ready for development: $DEV_VERSION"</span>
        git push origin verify

        <span class="hljs-comment"># clean up release branch</span>
        git push origin :release</code></pre>
<p>It's not <em>messy</em>, but that's... a lot of bash script. But just like any sufficiently complicated database task involves writing SQL, any sufficiently complicated ops task involves bash. Let's break it down:</p>
<h3 id="getting-version-numbers">Getting Version Numbers</h3>
<pre><code class="hljs language-bash"><span class="hljs-comment"># assemble current and new version numbers</span>
OLD_VERSION=$(mvn -s .circleci/settings.xml -q \
  -Dexec.executable=<span class="hljs-string">"echo"</span> -Dexec.args=<span class="hljs-string">'${project.version}'</span> \
  --non-recursive org.codehaus.mojo:<span class="hljs-built_in">exec</span>-maven-plugin:1.3.1:<span class="hljs-built_in">exec</span>)
NEW_VERSION=<span class="hljs-string">"<span class="hljs-variable">${OLD_VERSION/-SNAPSHOT/}</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Releasing <span class="hljs-variable">$OLD_VERSION</span> as <span class="hljs-variable">$NEW_VERSION</span>"</span></code></pre>
<p>Note the <code>-s .circleci/settings.xml</code>: since Circle's just spinning up a basic OpenJDK image, we have a <code>settings.xml</code> checked into source control. Credentials are interpolated through environment variables, but it's still not <em>great</em>; at some point, I'll want to come back and create a custom Docker image to centralize our configuration.</p>
<p>Maven stores version numbers in the POM. We could pull them out with XPath, but since this is Maven, there's a plugin for that. The <code>OLD_VERSION</code> is the current value; since we're always releasing from the <code>verify</code> branch, this is guaranteed to be a snapshot version, and we need to strip that qualifier off to get <code>NEW_VERSION</code> for the release.</p>
<h3 id="update-versions">Update Versions</h3>
<pre><code class="hljs language-bash"><span class="hljs-comment"># ensure dependencies use release versions</span>
mvn -s .circleci/settings.xml versions:use-releases

<span class="hljs-comment"># write release version to POM</span>
mvn -s .circleci/settings.xml versions:<span class="hljs-built_in">set</span> -DnewVersion=<span class="hljs-string">"<span class="hljs-variable">$NEW_VERSION</span>"</span></code></pre>
<p>We don't have a ton of Java libraries, but there are enough that release management is (obviously) a concern. The first statement here makes sure that when we release, we aren't depending on a snapshot version of another of our libraries. The second actually sets the version field in the POM to the release version we generated just now.</p>
<p>You may be asking: why didn't I just alias <code>mvn</code> to <code>mvn -s .circleci/settings.xml</code>? And the answer is: I did, and spent half a day trying to figure out why it didn't work. I don't know if it's this particular image or Circle in general or what, but aliases are just ignored.</p>
<h3 id="release">Release!</h3>
<pre><code class="hljs language-bash"><span class="hljs-comment"># setup git</span>
git config user.name <span class="hljs-string">"Release Script"</span>
git config user.email <span class="hljs-string">"builds@understoryweather.com"</span>

<span class="hljs-comment"># commit and tag</span>
git add pom.xml
git commit -m <span class="hljs-string">"release: <span class="hljs-variable">$NEW_VERSION</span>"</span>
git tag <span class="hljs-string">"<span class="hljs-variable">$NEW_VERSION</span>"</span>

<span class="hljs-comment"># land on master and publish</span>
git checkout master
git merge --no-edit release
git push origin master --tags</code></pre>
<p>Since we're going to be committing code, we need to do a little more git configuration to attribute the commits properly. This is another element I could streamline with a custom build image later on.</p>
<p>Next, we commit the updated POM and create a tag. When we merge (with <code>--no-edit</code> since the script can't change the commit message), the release commit and tag will land on the <code>master</code> branch. Then it's just a matter of pushing to the origin.</p>
<h3 id="next-up">Next Up...</h3>
<p>We've released, but we're not quite done. If we left it here, the next release from the <code>verify</code> branch would run into merge conflicts since <code>master</code> has an updated version in the POM. To prevent that, we have to merge <em>back into <code>verify</code></em>. Preferably with a snapshot version qualifier, because Maven.</p>
<pre><code class="hljs language-bash"><span class="hljs-comment"># increment minor version number</span>
MAJ_VERSION=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$NEW_VERSION</span>"</span> | cut -d <span class="hljs-string">'.'</span> -f 1)
MIN_VERSION=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$NEW_VERSION</span>"</span> | cut -d <span class="hljs-string">'.'</span> -f 2)
NEW_MINOR=$((<span class="hljs-variable">$MIN_VERSION</span> + 1))
DEV_VERSION=<span class="hljs-string">"<span class="hljs-variable">$MAJ_VERSION</span>.<span class="hljs-variable">$NEW_MINOR</span>-SNAPSHOT"</span></code></pre>
<p>I switched us over to two-part version numbers strictly out of convenience. Since Maven expects you to know what you're working towards, going from 1.0 to 1.1 is a lot more realistic than trying to suss out whether you're looking at 1.0.1 or 1.1.0 next. We can always update the version ourselves if we decide the next release should actually be 2.0, but I'm trying to minimize human involvement here.</p>
<pre><code class="hljs language-bash"><span class="hljs-comment"># ready development branch</span>
git checkout verify
git merge --no-edit release
mvn -s .circleci/settings.xml versions:<span class="hljs-built_in">set</span> -DnewVersion=<span class="hljs-string">"<span class="hljs-variable">$DEV_VERSION</span>"</span>
git add pom.xml
git commit -m <span class="hljs-string">"ready for development: <span class="hljs-variable">$DEV_VERSION</span>"</span>
git push origin verify</code></pre>
<p>Merging <code>release</code> into <code>verify</code> saves us from any potential merge conflicts down the line, since the same release commit now exists both on <code>master</code> and in <code>verify</code>. The script then adds a second commit to <code>verify</code> with the new snapshot version and sends it all up to the origin.</p>
<pre><code class="hljs language-bash"><span class="hljs-comment"># clean up release branch</span>
git push origin :release</code></pre>
<p>Finally: when a trigger goes off, it resets. We don't want the <code>release</code> branch to hang around long-term. If we did, we'd have to push the release commit up to the origin to avoid merge conflicts in future, and doing that would kick off an infinite loop since the <code>release</code> <em>job</em> is watching this branch. So instead we just delete it from the origin, since it's done everything it needed to do.</p>
<h2 id="setting-it-off">Setting it Off</h2>
<pre><code class="hljs language-bash">git checkout -b release
git push origin release</code></pre>
<p>That's the payoff. Whenever we're ready to drop a new version, all that has to happen is a new branch named <code>release</code>. You can even do it through the GitHub UI if you're so inclined, in two clicks and seven letters. Once <code>release</code> builds and deletes itself, the ordinary build and deploy jobs take over on both updated <code>master</code> and <code>verify</code> branches. Within a few minutes we've got a release and the first snapshot towards the next landing in Artifactory!</p>]]></description>
            <link>https://di.nmfay.com/circle-maven-versions</link>
            <guid isPermaLink="true">https://di.nmfay.com/circle-maven-versions</guid>
            <pubDate>Sat, 26 May 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[The Ultimate Postgres vs MySQL Blog Post]]></title>
            <description><![CDATA[<p>I should probably say up front that <a href="https://massivejs.org">I love working with Postgres</a> and could die happy without ever seeing a <code>mysql></code> prompt again. This is not an unbiased comparison -- but those are no fun anyway.</p>
<p>The scenario: two applications, using <a href="https://massivejs.org">Massive.js</a> to store and retrieve data. Massive is closely coupled to Postgres by design. Specializing lets it take advantage of features which only exist in some or no other relational databases to streamline data access in a lighter, more "JavaScripty" way than a more traditional object-relational mapper. It's great for getting things done, since the basics are easy, and for the complicated stuff where you'd be writing SQL anyway... you write SQL, you store it in one central place for reuse, and the API makes running it simple.</p>
<p>Where Massive is less useful is if you have to support another RDBMS. This is, ideally, something you know about up front. Anyway: things happen, and sometimes you find yourself having to answer the question "what's it going to look like if we need to run these applications with light but tightly coupled data layers on MySQL?"</p>
<p>Not good, was the obvious answer, but less immediately obvious was <em>how</em> not good. I knew there were some things Postgres did that MySQL didn't, but I also knew there were a ton of things I'd just never tried in the latter. So as I got to work on this, I started keeping notes. Here's everything I found.</p>
<h2 id="schema-layout">Schema Layout</h2>
<p>Now that we're all basically over the collective hallucination of a "schemaless" future, arguably the most important aspect of data storage is <em>how information is modeled</em> in a database. Postgres and MySQL are both relational databases, grouping records in strictly-defined tables. But there's a lot of room for variation within that theme.</p>
<h3 id="multiple-schemas">Multiple Schemas</h3>
<p>First things first: "schema" doesn't always mean the same thing. To MySQL, "schema" is synonymous with "database". For Postgres, a "schema" is a namespace <em>within</em> a database, which allows you to group tables, views, and functions together without having to break them apart into different databases.</p>
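<p>A minimal sketch of that kind of namespacing in Postgres. The <code>billing</code> and <code>shipping</code> schemas and their tables are hypothetical, made up for illustration:</p>
<pre><code class="hljs language-sql">-- hypothetical schemas and tables, all living in one database
CREATE SCHEMA billing;
CREATE SCHEMA shipping;

CREATE TABLE billing.invoices (
  id serial PRIMARY KEY,
  total numeric NOT NULL
);

CREATE TABLE shipping.parcels (
  id serial PRIMARY KEY,
  invoice_id integer REFERENCES billing.invoices (id)
);

-- schema-qualified names join like any other tables
SELECT i.id, i.total, p.id AS parcel_id
FROM billing.invoices AS i
JOIN shipping.parcels AS p ON p.invoice_id = i.id;</code></pre>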
<p>MySQL's simplicity in this respect is mitigated by its support for cross-database queries:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> db1.table1 t1
<span class="hljs-keyword">JOIN</span> db2.table2 t2 <span class="hljs-keyword">ON</span> t2.t1_id = t1.id;</code></pre>
<p>With Postgres, you can work across schemas, but if you need to query information in a different <em>database</em>, that's a job for...</p>
<h3 id="foreign-data-wrappers">Foreign Data Wrappers</h3>
<p>Foreign data wrappers let Postgres talk to practically anything that represents information as discrete records. You can create a "foreign table" in a Postgres database and <code>SELECT</code> or <code>JOIN</code> it like any other table -- only under the hood, it's actually reading a CSV, talking to another DBMS, or even querying a REST API. It's a powerful enough feature that NoSQL stalwart MongoDB <a href="https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes/">sneakily built their BI Connector on top of Postgres with foreign data wrappers</a>. You don't even need to know C to write a new FDW when <a href="http://multicorn.org/">Multicorn</a> lets you do it in Python!</p>
<p>Oracle and SQL Server both have some functionality for registering external data sources, but Postgres' offering is the most extensible I'm aware of. MySQL, besides the inter-database query support mentioned above, has nothing.</p>
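<p>As a rough sketch of what a foreign table looks like in practice, here's the CSV case using the <code>file_fdw</code> extension that ships with Postgres. The server name, file path, and column layout are assumptions for illustration, and creating the extension generally requires elevated privileges:</p>
<pre><code class="hljs language-sql">-- file_fdw is a contrib module bundled with Postgres
CREATE EXTENSION file_fdw;

-- "csv_files" is an arbitrary server name chosen for this example
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- hypothetical file and columns; the file must be readable by the server process
CREATE FOREIGN TABLE station_readings (
  station_id integer,
  recorded_at timestamptz,
  temperature numeric
) SERVER csv_files
OPTIONS (filename '/tmp/station_readings.csv', format 'csv', header 'true');

-- from here on it behaves like a read-only table
SELECT station_id, avg(temperature) AS avg_temperature
FROM station_readings
GROUP BY station_id;</code></pre>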
<h3 id="table-inheritance">Table Inheritance</h3>
<p>Inheritance is more commonly thought of as an attribute of object-oriented programming languages rather than databases, but Postgres is technically an <em>ORDBMS</em> or object-relational database management system. So you can have a table <code>cities</code> with columns <code>name</code> and <code>population</code>, and a table <code>capitals</code> which inherits the definition of <code>cities</code> but adds an <code>of_country</code> column only relevant, of course, for capital cities. If you <code>SELECT</code> from <code>cities</code>, you get rows from <code>capitals</code> -- they're cities too! You can of course <code>SELECT name FROM ONLY cities</code> to exclude the capitals. This is something of a niche feature, but <a href="https://di.nmfay.com/postgres-user-cache">when you have the right use case</a> it really shines.</p>
<p>MySQL, being a traditional RDBMS, doesn't do this.</p>
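<p>For reference, the Postgres side of the cities-and-capitals example might be sketched like so. The column types and the population figures are placeholders, not real data:</p>
<pre><code class="hljs language-sql">CREATE TABLE cities (
  name text,
  population integer
);

-- capitals gets name and population from cities, plus its own column
CREATE TABLE capitals (
  of_country text
) INHERITS (cities);

-- placeholder values for illustration only
INSERT INTO cities (name, population) VALUES ('Lyon', 500000);
INSERT INTO capitals (name, population, of_country) VALUES ('Paris', 2000000, 'France');

-- includes rows from capitals: Lyon and Paris
SELECT name FROM cities;

-- excludes inheriting tables: Lyon alone
SELECT name FROM ONLY cities;</code></pre>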
<h3 id="materialized-views">Materialized Views</h3>
<p>Materialized views are like regular views, except the results of the specifying query are physically stored ('materialized') and must be explicitly refreshed. This allows database developers to cache the results of slower queries when the results don't have to be realtime.</p>
<p>Oracle has materialized views, and SQL Server's indexed views are similar, but MySQL has no materialized view support.</p>
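<p>In Postgres, the lifecycle looks roughly like this. The <code>page_visitors</code> table here is a hypothetical stand-in for whatever slow query you'd want to cache:</p>
<pre><code class="hljs language-sql">-- hypothetical source table
CREATE TABLE page_visitors (
  page_id integer NOT NULL,
  visited_at timestamptz NOT NULL DEFAULT now()
);

-- the query runs once and its results are stored
CREATE MATERIALIZED VIEW page_visit_counts AS
SELECT page_id, count(*) AS visits
FROM page_visitors
GROUP BY page_id;

-- reads hit the stored results rather than re-aggregating
SELECT page_id, visits
FROM page_visit_counts
ORDER BY visits DESC
LIMIT 10;

-- the stored results stay frozen until explicitly refreshed
REFRESH MATERIALIZED VIEW page_visit_counts;</code></pre>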
<h3 id="check-constraints">Check Constraints</h3>
<p>Constraints in general ensure that invalid data is not stored. The most common constraint is <code>NOT NULL</code>, which prevents records without a value for the non-nullable column from being inserted or updated. Foreign key constraints do likewise when a reference to a record in another table is invalid. Check constraints are the most flexible, and allow validation of any predicate you could put in a <code>WHERE</code> clause -- for example, asserting that prices have to be positive numbers, or that US zip codes have to be five digits.</p>
<p>Per the MySQL docs: <a href="https://dev.mysql.com/doc/refman/5.7/en/create-table.html">the <code>CHECK</code> clause is parsed but ignored by all storage engines.</a></p>
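<p>Here's a small sketch of what that looks like where it <em>is</em> enforced, in Postgres. The <code>products</code> table and its columns are invented for the example:</p>
<pre><code class="hljs language-sql">-- hypothetical table: prices must be positive, zip codes five digits
CREATE TABLE products (
  id serial PRIMARY KEY,
  name text NOT NULL,
  price numeric NOT NULL CHECK (price > 0),
  origin_zip text CHECK (origin_zip ~ '^[0-9]{5}$')
);

-- accepted
INSERT INTO products (name, price, origin_zip)
VALUES ('widget', 9.99, '53703');

-- rejected: violates the price check
INSERT INTO products (name, price, origin_zip)
VALUES ('gadget', -1, '53703');

-- rejected: violates the zip code check
INSERT INTO products (name, price, origin_zip)
VALUES ('gizmo', 4.50, '5370');</code></pre>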
<h3 id="jsonb-and-indexing">JSONB and Indexing</h3>
<p>Postgres and MySQL both have a <code>JSON</code> column type (MySQL replacement MariaDB does too, but it's currently just an alias for <code>LONGTEXT</code>) and functions for building, processing, and querying JSON fields. Postgres actually goes a step further by offering a <code>JSONB</code> type which processes input data into a binary format. This means it's a little bit slower to write, but much faster to query.</p>
<p>It also means you can index the binary data. A GIN or <em>Generalized INverted</em> index allows queries checking for the existence of specific keys or key-value pairs to avoid scanning every single record for matches. This is huge if you run queries which dig into JSON fields in the <code>WHERE</code> clause.</p>
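<p>A minimal sketch, assuming a hypothetical <code>events</code> table with a <code>JSONB</code> column named <code>body</code>:</p>
<pre><code class="hljs language-sql">-- hypothetical table for illustration
CREATE TABLE events (
  id serial PRIMARY KEY,
  body jsonb NOT NULL
);

-- a GIN index over the whole document
CREATE INDEX events_body_idx ON events USING gin (body);

-- key existence: does the document have a top-level "user_id" key?
SELECT id FROM events WHERE body ? 'user_id';

-- containment: does the document include this key-value pair?
SELECT id FROM events WHERE body @> '{"type": "login"}';</code></pre>
<p>Both predicates are ones the default GIN operator class for <code>JSONB</code> can serve; whether the planner actually uses the index depends on the size and shape of your data.</p>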
<h3 id="default-values-defined-by-functions">Default Values Defined by Functions</h3>
<p><code>DEFAULT</code> is a useful specification for columns in a <code>CREATE TABLE</code> statement. At the simplest level, this could be used to baseline a boolean field to <code>true</code> or <code>false</code> if the <code>INSERT</code> statement doesn't give an explicit value. But you can do more than set a scalar value: a timestamp can default to <code>now()</code>, a UUID to any of a variety of UUID-generating functions, any other field to the value returned by whatever function you care to write -- the sky's the limit!</p>
<p>Unless you're using MySQL, in which case the only function you can reference in a <code>DEFAULT</code> clause is <code>now()</code>.</p>
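<p>On the Postgres side, a sketch with a hypothetical <code>sessions</code> table. I'm assuming <code>gen_random_uuid()</code> is available: it's built in from Postgres 13 onward and comes from the <code>pgcrypto</code> extension in earlier versions.</p>
<pre><code class="hljs language-sql">-- hypothetical table; every column has a default
CREATE TABLE sessions (
  -- assumes gen_random_uuid() exists (built in from 13, pgcrypto before that)
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  is_active boolean NOT NULL DEFAULT true,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- no explicit values at all: the defaults fill everything in
INSERT INTO sessions DEFAULT VALUES
RETURNING *;</code></pre>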
<h2 id="type-differences">Type Differences</h2>
<p>Layout's only part of the story, though. Equally important is the difference in type support. The benefit of a robust type system is in enabling database architects to represent information with the greatest accuracy possible. If a value is difficult or impossible to represent with built-in types, it's harder for developers to work with in turn, and if compromises have to be made to cut the data to fit then they can affect entire applications. Some types can even affect the overall database design, such as arrays and enumerations. In general, the more options you have the better.</p>
<h3 id="uuids">UUIDs</h3>
<p>Postgres has a UUID type. MySQL does not. If you want to store a UUID in MySQL, your options are CHAR, if you want values to be as human-readable as UUIDs ever are, or BINARY, if you want it to be faster but more difficult to work with manually. Postgres also generates more types of UUIDs.</p>
<h3 id="booleans">Booleans</h3>
<p>Boolean seems like a pretty basic type to have! However, MySQL's boolean is actually an alias for TINYINT(1). This is why query results show 0 or 1 instead of <code>true</code> or <code>false</code>. It's also why you can set the value of an ostensibly boolean field to 2. Try it!</p>
<p>Postgres: has proper booleans.</p>
<h3 id="varlena-and-lengths">Varlena and Lengths</h3>
<p>MySQL isn't alone in aliasing standard types in strange ways, however. CHAR, VARCHAR, and TEXT types in Postgres are all aliased representations of the same structure -- the only distinction is that length constraints will be enforced if specified. The documentation notes that enforcing a length is, if anything, slightly slower, and recommends that unbounded text simply be defined as the TEXT type instead of given an arbitrary maximum length.</p>
<p>What's happening here is that Postgres uses a data structure called a <em>varlena</em>, or <em>VAriable LENgth Array</em>, to store the information. A varlena's first four bytes store the length of the value, making it easy for the database to pick the whole thing out of storage. TEXT is only one of the types that uses this structure, but it's easily the most commonly encountered.</p>
<p>If a varlena is longer than would fit inline, the database uses a system called TOAST ("The Oversized Attribute Storage Technique") to offload it to extended storage transparently. Queries with predicates involving a TOASTable field might not be all that performant with large tables unless designed and indexed carefully, but when the database is returning records it's easy enough to follow the TOAST pointer that the overhead is barely noticeable for most cases.</p>
<p>The upshot of all this, as far as most people are concerned, is this: with Postgres, you only have to worry about establishing a length constraint on fields that have a <em>reason</em> for a length constraint. If there's no clear requirement to limit how much information can go into a field, you don't have to pick an arbitrary number and try to match it up with your page size.</p>
<h3 id="arrays">Arrays</h3>
<p>Non-scalar values in records! Madness! Dogs and cats living together! Anyone who's worked with JSON, XML, YAML, or even HTML understands that information isn't always flat. Relational architectures have traditionally mandated breaking out any vectors, let alone even more complex values, into new tables. Sometimes that's useful, but often enough it adds complexity to no real purpose. Inlining arrays makes many tasks -- such as tagging records -- much easier.</p>
<p>Postgres has arrays, as does Oracle; MySQL and SQL Server don't.</p>
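<p>To make the tagging case concrete, here's a sketch in Postgres with a hypothetical <code>posts</code> table:</p>
<pre><code class="hljs language-sql">-- hypothetical table: tags live inline instead of in a junction table
CREATE TABLE posts (
  id serial PRIMARY KEY,
  title text NOT NULL,
  tags text[] NOT NULL DEFAULT '{}'
);

INSERT INTO posts (title, tags)
VALUES
  ('Release automation', '{maven,ci}'),
  ('Feature gaps', '{postgres,mysql}');

-- posts carrying a single tag
SELECT title FROM posts WHERE 'postgres' = ANY (tags);

-- posts carrying every tag in a set
SELECT title FROM posts WHERE tags @> ARRAY['maven', 'ci'];</code></pre>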
<h3 id="customizing-types">Customizing Types</h3>
<p>If the built-in types aren't sufficient, you can always add your own. Custom types let you define a value to be exactly what you want. Domains are a related concept: types (custom or built-in) which enforce constraints on values. You might for example create a domain to represent a zip code as a TEXT value which uses regular expressions in a <code>CHECK</code> clause to ensure that values consist of five digits, optionally followed by a dash and four more digits.</p>
<p>If you're using Postgres, that is. Oracle and SQL Server both offer some custom type functionality, but MySQL has nothing. You can't even use table-level <code>CHECK</code> constraints because the engine simply ignores them.</p>
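<p>The zip code domain could be sketched in Postgres like this. The names <code>us_zip</code> and <code>mailing_addresses</code> are made up for the example:</p>
<pre><code class="hljs language-sql">-- five digits, optionally followed by a dash and four more
CREATE DOMAIN us_zip AS text
  CHECK (VALUE ~ '^[0-9]{5}(-[0-9]{4})?$');

-- hypothetical table using the domain like any other type
CREATE TABLE mailing_addresses (
  id serial PRIMARY KEY,
  street text NOT NULL,
  zip us_zip NOT NULL
);

-- accepted
INSERT INTO mailing_addresses (street, zip)
VALUES ('1 Main St', '53703-1234');

-- rejected by the domain's check
INSERT INTO mailing_addresses (street, zip)
VALUES ('2 Main St', '5370');</code></pre>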
<h3 id="enums">Enums</h3>
<p>Enumerations don't get enough love. If I had a dollar for every INT -- or worse, VARCHAR -- field I've seen representing one of a fixed set of potential values, I probably still couldn't retire but I could at least have a pretty nice evening out. There are drawbacks to using enums, to be sure: adding new values requires DDL, and you can't remove values at all. But appropriate use cases for them are still reasonably common.</p>
<p>MySQL and Postgres both offer enums. The critical distinction is that Postgres' enums are proper reusable types. MySQL's enums are more like the otherwise-ignored <code>CHECK</code> constraints and specify a valid value list for a single column in a single table. Possible improvement on allowing a boolean column to contain -100?</p>
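<p>A quick sketch of the reusable-type side of that in Postgres, with hypothetical order tables sharing one enum:</p>
<pre><code class="hljs language-sql">-- the enum is a type in its own right
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

-- hypothetical tables, both using the same type
CREATE TABLE orders (
  id serial PRIMARY KEY,
  status order_status NOT NULL DEFAULT 'pending'
);

CREATE TABLE order_events (
  id serial PRIMARY KEY,
  order_id integer NOT NULL REFERENCES orders (id),
  status order_status NOT NULL
);

-- adding a value is DDL; before Postgres 12 this can't run inside a transaction block
ALTER TYPE order_status ADD VALUE 'returned';</code></pre>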
<h2 id="querying-data">Querying Data</h2>
<p>So that's data modeling covered. There's an entire other half to go: actually working with the information being stored. SQL itself is divided in two parts, the "data definition language" which defines the structure of a database and the "data manipulation language". This latter comprises the <code>SELECT</code>, <code>INSERT</code>, and other statements most people think of when they hear the name "SQL". And just as with modeling, there are substantial differences between Postgres and MySQL in querying.</p>
<h3 id="returning">RETURNING</h3>
<p>Autogenerating primary keys takes a huge headache out of storing data. But there's one catch: when you insert a new record into a table, you don't know what its primary key value got set to. Most relational databases will tell you what the last autogenerated key was if you call a special function; some, like SQL Server, even let you filter down to the single table you're interested in.</p>
<p>Postgres goes above and beyond with the <code>RETURNING</code> clause. Any write statement -- <code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code> -- can end with a <code>RETURNING [column-list]</code>, which acts as a <code>SELECT</code> on the affected records. <code>RETURNING *</code> gives you the entire recordset from whatever you just did, or you can restrict what you're interested in to certain columns.</p>
<p>That means you can do this:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> foos (<span class="hljs-keyword">name</span>)
<span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'alpha'</span>), (<span class="hljs-string">'beta'</span>)
<span class="hljs-keyword">RETURNING</span> *;

 id │ name  
────┼───────
  1 │ alpha
  2 │ beta
(2 rows)</code></pre>
<p>With MySQL, you're stuck with calling <code>LAST_INSERT_ID()</code> after you add a new record. If you added multiple, <code>LAST_INSERT_ID</code> only gives you the earliest new id, leaving you to work out the rest yourself. And of course, this is only good for integer primary keys.</p>
<p>MySQL also has no counterpart to this functionality for <code>UPDATE</code>s and <code>DELETE</code>s. Competitor MariaDB supports <code>RETURNING</code> on <code>DELETE</code>, but not on any other kind of statement.</p>
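<p>For comparison, here's roughly what those look like in Postgres, assuming the <code>foos</code> table from the <code>INSERT</code> example above:</p>
<pre><code class="hljs language-sql">-- returns the row as it stands after the update
UPDATE foos
SET name = upper(name)
WHERE id = 1
RETURNING id, name;

-- returns the rows that were just removed
DELETE FROM foos
WHERE name = 'beta'
RETURNING *;</code></pre>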
<h3 id="common-table-expressions">Common Table Expressions</h3>
<p>Common Table Expressions or CTEs allow complex queries to be broken up and assembled from self-contained parts. You might write this:</p>
<pre><code class="hljs language-sql">WITH page_visits AS (
  <span class="hljs-keyword">SELECT</span> p.id, p.site_id, p.title, <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> visits
  <span class="hljs-keyword">FROM</span> pages <span class="hljs-keyword">AS</span> p
  <span class="hljs-keyword">JOIN</span> page_visitors <span class="hljs-keyword">AS</span> v <span class="hljs-keyword">ON</span> v.page_id = p.id
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> p.id, p.site_id, p.title
), max_visits <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span> <span class="hljs-keyword">ON</span> (site_id)
    site_id, title, visits
  <span class="hljs-keyword">FROM</span> page_visits
  <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> site_id, visits <span class="hljs-keyword">DESC</span>
)
<span class="hljs-keyword">SELECT</span> s.id, s.name,
  max_visits.title <span class="hljs-keyword">AS</span> most_popular_page,
  <span class="hljs-keyword">SUM</span>(page_visits.visits) <span class="hljs-keyword">AS</span> total_visits
<span class="hljs-keyword">FROM</span> sites <span class="hljs-keyword">AS</span> s
<span class="hljs-keyword">JOIN</span> page_visits <span class="hljs-keyword">ON</span> page_visits.site_id = s.id
<span class="hljs-keyword">JOIN</span> max_visits <span class="hljs-keyword">ON</span> max_visits.site_id = s.id
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> s.id, s.name, max_visits.title
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> total_visits <span class="hljs-keyword">DESC</span>;</code></pre>
<p>In the first query, we aggregate visit counts; in the second, we use <a href="https://www.postgresql.org/docs/10/static/sql-select.html#SQL-DISTINCT"><code>DISTINCT ON</code></a> on the results of the first to filter out all but the most popular pages; finally, we join both of our intermediary results to provide the output we're looking for. CTEs are a really readable way to factor query logic out, and they let you do some things in one statement that you can't otherwise.</p>
<p>MySQL does have CTEs! However: thanks to the <code>RETURNING</code> clause, Postgres can <em>write records in a CTE</em> and operate on the results. This is <em>huge</em> for application logic. This next query writes a record in a CTE, then adds a corresponding entry to a junction table -- all in the same transaction.</p>
<pre><code class="hljs language-sql">WITH wine AS (
  <span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> wines (<span class="hljs-keyword">name</span>, <span class="hljs-keyword">year</span>)
  <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Herrenreben'</span>, <span class="hljs-number">2015</span>)
  <span class="hljs-keyword">RETURNING</span> <span class="hljs-keyword">id</span>
), reviewer <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>
  <span class="hljs-keyword">FROM</span> reviewers
  <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">name</span> = <span class="hljs-string">'Wine Enthusiast'</span>
)
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> wine_ratings (wine_id, reviewer_id, score)
<span class="hljs-keyword">SELECT</span> wine.id, reviewer.id, <span class="hljs-number">92</span>
<span class="hljs-keyword">FROM</span> wine
<span class="hljs-keyword">JOIN</span> reviewer <span class="hljs-keyword">ON</span> <span class="hljs-literal">TRUE</span>;</code></pre>
<h3 id="casting">Casting</h3>
<p>Sometimes a query needs to treat a value as if it has a different type, whether to store it or to operate on it somehow. Postgres even lets you define additional conversions between types with <code>CREATE CAST</code>.</p>
<p>MySQL supports casting values to binary, char/nchar, date/datetime/time, decimal, JSON, and signed and unsigned integers. Absent from this list: tinyints, which, since booleans are actually tinyints, means you're stuck with conditionals when you need to coerce a value to true or false for storage in a "boolean" column.</p>
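<p>A few casts Postgres accepts out of the box, as a minimal sketch; the literals are arbitrary:</p>
<pre><code class="hljs language-sql">SELECT
  CAST('42' AS integer) AS from_text_to_int, -- standard CAST syntax
  '2018-04-11'::date AS from_text_to_date,   -- Postgres' :: shorthand
  1::boolean AS from_int_to_bool,            -- integers cast to proper booleans
  'yes'::boolean AS from_text_to_bool;       -- as do a handful of text spellings</code></pre>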
<h3 id="lateral-joins">Lateral Joins</h3>
<p>A lateral join is fundamentally similar to a correlated subquery, in that it executes for each row of the current result set. However, a correlated subquery is limited to returning a single value for a <code>SELECT</code> list or <code>WHERE</code> clause; subqueries in the <code>FROM</code> clause run in isolation. A lateral join can refer back to information in the rest of the result set:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> docs (<span class="hljs-keyword">id</span> <span class="hljs-built_in">serial</span>, <span class="hljs-keyword">body</span> jsonb);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> docs (<span class="hljs-keyword">body</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'{"a": "one", "b": "two"}'</span>), (<span class="hljs-string">'{"c": "three"}'</span>);

<span class="hljs-keyword">SELECT</span> docs.id, keys.*
<span class="hljs-keyword">FROM</span> docs
<span class="hljs-keyword">JOIN</span> LATERAL jsonb_each(docs.body) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">keys</span> <span class="hljs-keyword">ON</span> <span class="hljs-literal">TRUE</span>;

 id │ key │  value  
────┼─────┼─────────
  1 │ a   │ "one"
  1 │ b   │ "two"
  2 │ c   │ "three"
(3 rows)</code></pre>
<p>It can also invoke table functions like <code>unnest</code> which return multiple rows and columns:</p>
<pre><code class="hljs language-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> multiple_arrays(arr1 <span class="hljs-built_in">int</span>[], arr2 <span class="hljs-built_in">int</span>[]);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> multiple_arrays (arr1, arr2)
<span class="hljs-keyword">VALUES</span>
	(<span class="hljs-string">'{1,2,3}'</span>, <span class="hljs-string">'{4,5}'</span>),
	(<span class="hljs-string">'{6,7}'</span>, <span class="hljs-string">'{8,9,10}'</span>);

<span class="hljs-keyword">SELECT</span> raw.*
<span class="hljs-keyword">FROM</span> multiple_arrays
<span class="hljs-keyword">JOIN</span> LATERAL unnest(arr1, arr2) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">raw</span> <span class="hljs-keyword">ON</span> <span class="hljs-literal">TRUE</span>;

 unnest │ unnest 
────────┼────────
      1 │      4
      2 │      5
      3 │ (null)
      6 │      8
      7 │      9
 (null) │     10
(6 rows)</code></pre>
<p>Oracle and SQL Server offer similar functionality: the <code>LATERAL</code> keyword in the former, and <code>CROSS APPLY</code>/<code>OUTER APPLY</code> in the latter. MySQL does not.</p>
<h3 id="variadic-function-arguments">Variadic Function Arguments</h3>
<p>Functions! Procedures, if you believe in making that distinction! They're great! You can declare variadic arguments -- "varargs" or "rest parameters" in other languages -- to pull an arbitrary number of arguments into a single collection named for the final argument.</p>
<p>In Postgres.</p>
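<p>A minimal sketch of a variadic function; <code>smallest</code> is a made-up name, and the body is just one way to consume the collected array:</p>
<pre><code class="hljs language-sql">-- hypothetical function: every argument is collected into the vals array
CREATE FUNCTION smallest(VARIADIC vals numeric[]) RETURNS numeric AS $$
  SELECT min(v) FROM unnest(vals) AS v;
$$ LANGUAGE sql;

-- call it with as many arguments as you like
SELECT smallest(10, -1, 5, 4.4); -- -1</code></pre>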
<h3 id="predicate-operations">Predicate Operations</h3>
<p>A handful of useful operations which allow more expressive <code>WHERE</code> clauses with Postgres:</p>
<ul>
<li><code>IS DISTINCT FROM</code> and its counterpart <code>IS NOT DISTINCT FROM</code> offer a null-sensitive equality test. Null isn't ordinarily comparable since it represents the <em>absence</em> of a value, so the predicate <code>WHERE field &#x3C;> 1</code> will not return records where <code>field</code> is null. <code>WHERE field IS DISTINCT FROM 1</code> returns all records where <code>field</code> is other-than-1, including where it's null.</li>
<li><code>ILIKE</code> is a case-insensitive <code>LIKE</code> operation. MySQL does have the capability for case-insensitive pattern matching, but it depends on your collation and can't be toggled on a per-query basis (the default collation is case-insensitive, to be completely fair).</li>
<li><code>~</code>, <code>~*</code>, <code>!~</code>, and <code>!~*</code> form a set of POSIX regular expression tests: match, case-insensitive match, no match, and no case-insensitive match respectively. MySQL does have <code>REGEXP</code> and <code>NOT REGEXP</code>; however, Postgres' implementation has lookahead and lookbehind.</li>
</ul>
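<p>Those operators in action, sketched against a hypothetical <code>things</code> table with three rows:</p>
<pre><code class="hljs language-sql">-- hypothetical table and data for illustration
CREATE TABLE things (
  id serial PRIMARY KEY,
  field integer,
  label text
);

INSERT INTO things (field, label)
VALUES (1, 'Alpha'), (2, 'beta'), (NULL, 'ALPHABET');

-- plain inequality skips the null: id 2 only
SELECT id FROM things WHERE field &#x3C;> 1;

-- null-sensitive inequality: ids 2 and 3
SELECT id FROM things WHERE field IS DISTINCT FROM 1;

-- case-insensitive LIKE: ids 1 and 3
SELECT id FROM things WHERE label ILIKE 'alpha%';

-- case-insensitive regex with a lookahead: "alpha" only where "bet" follows
SELECT id FROM things WHERE label ~* '^alpha(?=bet)';</code></pre>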
<h2 id="general-database-work">General Database Work</h2>
<p>That's it for the architecture and query language feature gaps I discovered. I did run into a couple other things that bear mentioning, however:</p>
<h3 id="dependencies">Dependencies</h3>
<p>MySQL doesn't care about dependencies among database objects. You can tell it to drop a table that a view or proc depends on and it will go right ahead and drop it. You'll have no idea something's gone wrong until the next time you try to invoke the view or proc. Postgres saves you from yourself, unless you're really sure and drop your dependents too with <code>CASCADE</code>.</p>
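<p>The Postgres behavior can be sketched with a hypothetical table and a view that depends on it:</p>
<pre><code class="hljs language-sql">-- hypothetical table and dependent view
CREATE TABLE customers (
  id serial PRIMARY KEY,
  name text NOT NULL,
  active boolean NOT NULL DEFAULT true
);

CREATE VIEW active_customers AS
SELECT id, name
FROM customers
WHERE active;

-- fails: Postgres reports that the view depends on the table
DROP TABLE customers;

-- succeeds, and takes the dependent view with it
DROP TABLE customers CASCADE;</code></pre>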
<h3 id="triggers-and-table-writes">Triggers and Table Writes</h3>
<p>Just the mention of triggers is probably putting some people off their lunch. They're not <em>that</em> bad, honest (well, they can be, but it's not like it's their fault). Anyway, point is: sometimes you want to write a trigger that modifies other rows in the table it's being activated from.</p>
<p>Well, you can't in MySQL.</p>
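<p>In Postgres, a sketch of that kind of trigger might look like the following. The <code>contact_methods</code> table and the "only one primary per user" rule are assumptions invented for the example:</p>
<pre><code class="hljs language-sql">-- hypothetical table: each user should have at most one primary contact method
CREATE TABLE contact_methods (
  id serial PRIMARY KEY,
  user_id integer NOT NULL,
  is_primary boolean NOT NULL DEFAULT false
);

-- when a row becomes primary, demote the user's other rows in the same table
CREATE FUNCTION demote_other_contact_methods() RETURNS trigger AS $$
BEGIN
  IF NEW.is_primary THEN
    UPDATE contact_methods
    SET is_primary = false
    WHERE user_id = NEW.user_id
      AND id &#x3C;> NEW.id
      AND is_primary;
  END IF;
  RETURN NULL; -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

-- the demoting UPDATE fires this trigger again, but with is_primary false it does nothing
CREATE TRIGGER contact_methods_single_primary
AFTER INSERT OR UPDATE OF is_primary ON contact_methods
FOR EACH ROW EXECUTE PROCEDURE demote_other_contact_methods();</code></pre>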
<h2 id="the-end">The End?</h2>
<p>This may have exhausted <em>me</em>, but I'm pretty sure it's still not an exhaustive list of the feature gaps between Postgres and MySQL. I did cop to my preference up front, but having spent six weeks putting in the effort to convert, I find the comparison pretty damning. I think there could still be reasons to pick MySQL -- but I'm not sure they could be technical.</p>]]></description>
            <link>https://di.nmfay.com/postgres-vs-mysql</link>
            <guid isPermaLink="true">https://di.nmfay.com/postgres-vs-mysql</guid>
            <pubDate>Wed, 11 Apr 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[The Orchid, the Wasp, and the Test Fixture]]></title>
            <description><![CDATA[<p>I write a lot of integration tests that operate on data. The usual format for this is a setup function which gets the database into a particular state, a test or tests which validate the appropriate application functionality, and then a teardown function which cleans everything up so the next test suite can do its thing. There are different names and some little complexities (Mocha and AVA offer a <code>before</code> and a <code>beforeEach</code>, for example) but generally speaking this is How It's Done in every language/framework I've written tests in. This seems less a product of conscious architecture than it does a natural evolution of testing processes; nobody's<a href="#doctrine">*</a> really nailed down a formal model for test data management yet.</p>
<p>The end result is that these setup functions, or <em>fixtures</em>, tend to be developed ad-hoc and inconsistently. It's not difficult to wind up with two test suites taking completely different approaches to generate what's practically speaking the same data. It gets worse when something changes and a bunch of your tests become out of date with you none the wiser until a bug report lands in your lap. I've written a lot of fixtures like that, and I want to stop.</p>
<p>The only solution to inconsistency is centralization: there needs to be a single source of data. If there's one place to go for fixture data, that goes a long way toward ensuring tests stay current. However, just bringing all the fixtures under one roof isn't enough. If some tests exercise carryout orders and others exercise delivery orders, the database state could be 75% identical -- but one has a phone number and a pickup time attached, the other an address and a driver. One fixture alone won't do the job, and breaking it up is backsliding towards the original problem. Centralization is only part of the solution; fixtures have to be flexible as well.</p>
<h2 id="meanwhile-in-southwestern-australia">Meanwhile, in Southwestern Australia</h2>
<p>The hammer orchid has a very specific mechanism of reproduction. Each of the species in the <em>Drakaea</em> genus mimics the scent (not to mention color and shape) of the female of a symbiotic species of wasp. The scent attracts male wasps, which attempt to mate with the flower only to become covered in the orchid's pollen. Eventually they give up and fly off. Enough of them proceed to fall for the same trick again, rubbing the pollen off onto a new flower, to ensure the survival of the orchids; and, presumably, enough of them find actual mates to ensure the survival of their own species.</p>
<p>Of course, to say the orchid tricks the wasp is a blatant anthropomorphization. The orchid may be a marvel of evolutionary architecture, but it can't think and it can't plan. It is simply following a program which requires that it become, in a certain sense -- quite literally, smell -- a wasp. An orchid which fails to be a wasp does not reproduce. The wasp, too, is an orchid when it deposits pollen on the waiting stigma of another flower.</p>
<p>The poststructuralists Gilles Deleuze and Felix Guattari used the orchid and the wasp to exemplify what they called a <em>rhizome</em>. The rhizome is an organizational model, a way of thinking about structure and process and the structure <em>of</em> process, which counterpoints the more familiar hierarchical or arborescent model. A corporation is a hierarchy of power which flows top to bottom; meanwhile, a labor union may have officials and bureaucracy, but these local hierarchies don't define the entire organization. Power in a union flows in many directions. There's a lot to like about the rhizomatic model, but one of its principal attributes is just what we're looking for: flexibility.</p>
<p>Deleuze and Guattari identify six characteristics of a rhizome in <em>A Thousand Plateaus</em>. The first two and the last two are each closely related and considered together.</p>
<h2 id="connection-and-heterogeneity">Connection and Heterogeneity</h2>
<p>A rhizome is a crowd or cluster of different (heterogeneous) things which can be and are connected non-hierarchically. This describes a lot of technological stuff, especially distributed systems! If you're thinking of serverless applications, Cassandra, or Kubernetes clusters: that's where we're going with this.</p>
<p>Our data consists, at an atomic level, of records in different tables. If we consider an "initializer" function which generates one of these records as an element of a rhizome, we can compose multiple initializers to generate any data state we need to test.</p>
<p>An initializer looks something like this:</p>
<pre><code class="hljs"><span class="hljs-keyword">async</span> (db, data) => {
  <span class="hljs-keyword">return</span> db.drivers.insert({name: <span class="hljs-string">'Taylor'</span>, license: <span class="hljs-string">'abc123'</span>});
};</code></pre>
<p>Other initializers may cover the <code>franchises</code> table, the <code>destinations</code> table, and the <code>orders</code> table. Each is as simple as possible, generating records of one and only one type. An initializer which creates records of multiple types is a throwback to the complex fixtures we're trying to avoid.</p>
<p>There are always some tests that need to do something specific with the data. What happens when a driver doesn't have a license? If Taylor always has one, we can't exercise that code. We have a few options here:</p>
<ul>
<li>Update Taylor's record to remove her license at the beginning of the "drivers without licenses get ticketed" test</li>
<li>Create a second <code>driver-without-license</code> initializer which generates a record for Taylor's hapless compatriot Tyler, sans license</li>
<li>Generate records for both Taylor, with a license, and Tyler, without, in the single <code>driver</code> initializer</li>
</ul>
<p>There's no cut and dried answer here; the best solution depends on the situation. Here, if there's only one test that depends on having a driver without a license, I'd go with the first option and update Taylor's record in that test. If there are several, it might be time to consider the others.</p>
<h2 id="multiplicities">Multiplicities</h2>
<p>A rhizome must be thought of in terms of the discrete elements which make it up, and how those elements interact with the elements of other systems. The reproduction of the hammer orchid consists of flowers and wasps, and both flower and wasp interact with things outside. Deleuze and Guattari offer a more direct example: a puppet's strings, considered as a multiplicity, are connected not to the will of the puppeteer but to another multiplicity of nerves. The puppeteer's nervous system becomes a puppet in the same way that the hammer orchid becomes a wasp.</p>
<p>Thinking in multiplicities inverts the question of how fixture data is set up. It's no longer about the state for this or that test, but about the ability to describe and therefore build <em>any</em> data state. Each test suite selects the initializer functions it requires and builds a rhizome from them. The order of invocation does matter for local hierarchies; for example, we can't create a delivery order without a driver.</p>
<p>I have a <code>ContextFactory</code> to which I can pass the names of initializer functions. This factory returns a new function which, when executed, runs the initializers in sequence and collects the records each generates, passing the current state or context into each succeeding initializer so elements in local hierarchies can create their relationships correctly. Each test suite's <code>before</code> function creates a new <code>ContextFactory</code> in the global scope:</p>
<pre><code class="hljs"><span class="hljs-attribute">contextFactory</span> = await ContextFactory(<span class="hljs-string">'franchise'</span>, <span class="hljs-string">'driver'</span>, <span class="hljs-string">'destination'</span>, <span class="hljs-string">'delivery-order'</span>);</code></pre>
<p>This example contains two local hierarchies: franchise-driver-order and destination-order. The only constraint on ordering is that nothing can appear before its dependencies; for example, we could create the <code>destination</code> before anything else, but <code>delivery-order</code> has to be created last.</p>
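<p>To make the composition concrete, here's a minimal sketch of what a factory along these lines could look like. It is not the real implementation: the <code>makeContextFactory</code> name, the explicit <code>registry</code> argument, the in-memory <code>fakeDb</code>, and the record shapes are all hypothetical stand-ins, and it covers only the franchise-driver-order hierarchy.</p>
<pre><code class="hljs language-javascript">const assert = require('assert');

// Hypothetical registry: each initializer creates records of exactly one type
// and may read earlier records from the context (ctx) to set up relationships.
const initializers = {
  franchise: async (db, ctx) => db.insert('franchises', {name: 'Downtown'}),
  driver: async (db, ctx) =>
    db.insert('drivers', {name: 'Taylor', franchise_id: ctx.franchise.id}),
  'delivery-order': async (db, ctx) =>
    db.insert('orders', {driver_id: ctx.driver.id})
};

// Returns a function which runs the named initializers in sequence, handing
// the context accumulated so far to each one.
function makeContextFactory(registry, ...names) {
  return async db => {
    const ctx = {};

    for (const name of names) {
      if (!registry[name]) {
        throw new Error(`unknown initializer: ${name}`);
      }

      ctx[name] = await registry[name](db, ctx);
    }

    return ctx;
  };
}

// In-memory stand-in for a database, assigning sequential ids per table.
function fakeDb() {
  const tables = {};

  return {
    insert: async (table, record) => {
      tables[table] = tables[table] || [];
      const row = Object.assign({id: tables[table].length + 1}, record);
      tables[table].push(row);
      return row;
    }
  };
}

async function main() {
  const contextFactory = makeContextFactory(
    initializers, 'franchise', 'driver', 'delivery-order'
  );
  const ctx = await contextFactory(fakeDb());

  assert.equal(ctx.driver.name, 'Taylor');
  assert.equal(ctx['delivery-order'].driver_id, ctx.driver.id);

  return ctx;
}

main();</code></pre>
<p>In this sketch, listing <code>driver</code> ahead of <code>franchise</code> rejects with a <code>TypeError</code>, since <code>ctx.franchise</code> doesn't exist yet when the driver initializer goes looking for it. That's the ordering constraint showing up at runtime.</p>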
<h2 id="asignifying-rupture">Asignifying Rupture</h2>
<p>Have I mentioned that poststructuralism takes a lot of heat for impenetrable jargon? In fairness, it's difficult to establish a vocabulary to talk about things as abstract as it does, but its reputation is still deserved to a certain extent. Think of asignifying rupture as a "self-healing" capability which kicks in when one of the components of the rhizome breaks down. If a single wasp doesn't make it to a second flower, it makes little difference; there are other wasps and other flowers. Political rhizomes especially have a way of recurring even under harsh repression, as does quackgrass.</p>
<p>This is a useful property for distributed architectures and concurrent processing: if a Spark job has incomplete results because something took an executor offline, the cluster manager can schedule other executors to cover the missing data. But for our purposes, a breakdown means inconsistency, so this is a point of departure for us -- we're better off raising an exception and aborting.</p>
<h2 id="cartography-and-decalcomania">Cartography and Decalcomania</h2>
<p>A rhizome is "a map and not a tracing". Where the latter creates an immutable still-life representation, a map is open to interpretation, interrogation, and most importantly, modification. Maps change all the time, because what they represent is permanently in flux. Territories declare independence, are recognized or not, are annexed; borders shift, connections are made and broken, cultures and languages ebb and flow. Maps do more than merely show this information: they transfer it ("decalcomania" is a process of reproducing images, the origin of the more common and subtly different word "decal"). A border defines the understood limits of a territory; a route on an atlas becomes a route in the mind of a driver.</p>
<p>When the <code>ContextFactory</code> is invoked, it returns an object mapping initializers to the data each has created.</p>
<pre><code class="hljs">ctx = await contextFactory();

<span class="hljs-built_in">assert</span>.equal(ctx.<span class="hljs-built_in">driver</span>.<span class="hljs-built_in">name</span>, <span class="hljs-string">'Taylor'</span>);</code></pre>
<p>A monolithic fixture is a tracing: it freezes a snapshot of the data model as it appeared at one point in time. The initializers, by contrast, map out our application's data model bit by bit, each piece adding more definition. If the information which makes up a driver changes -- adding a last name or whether they're on shift -- that gets added to the initializer. Every test is automatically up to date. If one breaks, that's a good thing! It means the code being exercised can't handle the new information correctly, and needs to be fixed before we can ship.</p>
<h2 id="end">End</h2>
<p>The <a href="https://www.npmjs.com/package/rhizo">rhizomatic model</a> makes test fixtures endlessly flexible. Where monolithic fixtures multiply complexity and fall out of date with little warning, a unified, composable set of discrete fixtures keeps data generation centralized and ensures that tests that exercise related functionality use a consistent and current data set.</p>
<p><a name='doctrine'>*</a> The <a href="https://www.laraveldoctrine.org/docs/1.3/orm/testing">Doctrine</a> O/RM for PHP provides a framework for loading and executing discrete centralized test fixtures, making it the only example I've seen in the wild of what I'm about to cover, if you're the kind of person who skips down to read footnotes before continuing. Anyway, score one for PHP!</p>]]></description>
            <link>https://di.nmfay.com/orchid-wasp-fixture</link>
            <guid isPermaLink="true">https://di.nmfay.com/orchid-wasp-fixture</guid>
            <pubDate>Sun, 25 Feb 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Decomposing Object Trees From Relational Results]]></title>
            <description><![CDATA[<p>This is a <a href="https://massivejs.org/docs/resultset-decomposition">feature</a> I added to Massive recently. I had cases where I was querying views on hierarchies of multiple <code>JOIN</code>ed tables to reference data. For an example, here's a query that returns a list of wineries, some of their wines, and the grapes that go into each:</p>
<pre><code class="hljs"><span class="hljs-keyword">SELECT</span> ws.id, ws.name, ws.country, w.id <span class="hljs-keyword">AS</span> wine_id, w.name <span class="hljs-keyword">AS</span> wine_name, w.year,
  va.id <span class="hljs-keyword">AS</span> varietal_id, va.name <span class="hljs-keyword">AS</span> varietal_name
<span class="hljs-keyword">FROM</span> wineries ws
<span class="hljs-keyword">JOIN</span> wines w <span class="hljs-keyword">ON</span> w.winery_id = ws.id
<span class="hljs-keyword">JOIN</span> wine_varietals wv <span class="hljs-keyword">ON</span> wv.wine_id = w.id
<span class="hljs-keyword">JOIN</span> varietals va <span class="hljs-keyword">ON</span> va.id = wv.varietal_id
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> w.year;</code></pre>
<p>The result set looks like this:</p>
<pre><code class="hljs language-html"> id |         name         | country | wine_id |       wine_name       | year | varietal_id |   varietal_name    
----+----------------------+---------+---------+-----------------------+------+-------------+--------------------
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           6 | Cabernet Franc
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           1 | Cabernet Sauvignon
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           5 | Merlot
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           1 | Cabernet Sauvignon
  3 | Domäne Wachau        | AT      |       6 | Terrassen Federspiel  | 2011 |           7 | Grüner Veltliner
  1 | Cass Vineyards       | US      |       1 | Grenache              | 2013 |           2 | Grenache
  1 | Cass Vineyards       | US      |       2 | Mourvedre             | 2013 |           3 | Mourvedre
  2 | Bodega Catena Zapata | AR      |       3 | Catena Alta           | 2013 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       4 | Catena Alta           | 2013 |           1 | Cabernet Sauvignon</code></pre>
<p>This tells us a lot: we've got two single-varietal wines from Cass, two more (note the differing <code>wine_id</code>s) plus a blend from Catena, one grüner from Wachau, and one classic Bordeaux blend from Ducasse. But while I can pick out the information I'm interested in from this result set easily enough, it's not directly usable by application code which processes the records one at a time. If I needed to use these results to drive a site which offered winery profiles and allowed users to drill down into their offerings, I'd have a rough time of it. The structure such a site needs looks more like this:</p>
<pre><code class="hljs language-html">├── Bodega Catena Zapata
│   ├── Catena Alta
│   │   └── Cabernet Sauvignon
│   ├── Catena Alta
│   │   └── Malbec
│   └── Nicolás Catena Zapata
│       ├── Cabernet Sauvignon
│       └── Malbec
├── Cass Vineyards
│   ├── Grenache
│   │   └── Grenache
│   └── Mourvedre
│       └── Mourvedre
├── Chateau Ducasse
│   └── Graves
│       ├── Cabernet Franc
│       ├── Cabernet Sauvignon
│       └── Merlot
└── Domäne Wachau
    └── Terrassen Federspiel
        └── Grüner Veltliner</code></pre>
<p>Relational databases don't do trees well at all. This is one of the compelling points of document databases like MongoDB, which would be able to represent this structure quite easily. However, our data really is relational: we've also got "search by grape" functionality, and it's a lot easier to pick out wines which match "Mourvedre" by starting with the single record in <code>varietals</code> and performing a foreign key scan. It's even indexable. By comparison, to do this with a document database you'd need to look in every document to see if its <code>varietals</code> had a match, and that still leaves the issue of ensuring that each winery only appears once in the output. Worse, there's no guarantee someone didn't typo "Moruvedre" somewhere.</p>
<p>There's an easy way to generate the profile-wine-varietal tree: just iterate the result set, see if we have a new winery and add it if so, see if the wine is new to this winery and add it if so, see if the varietal is new for this wine and add it if so. It's not very efficient, but this isn't the kind of thing one does at the millions-of-records scale anyway. The bigger problem is it only works for these specific results. Next time I run into this scenario, I'll have to start from scratch. I'm lazy. I only want to have to write this thing <em>once</em>.</p>
<h2 id="location-location-location">Location, Location, Location</h2>
<p>The first problem is determining which columns belong where in the object tree. The query result doesn't say which table a given column came from, and even if it did, that's no guarantee that it really belongs there. Meaning is contextual: a developer might want to merge joined results from a 1:1 relationship into a single object, or do more complicated things I can't anticipate.</p>
<p>To place each column, Massive needs a schema. Defining any kind of data model was something I'd avoided in the project for as long as possible; coming as I do from a strongly-typed background, it's almost instinctive. Strong typing, its many good points aside, is one of the reasons the object-relational mapper pattern (O/RM) dominates data access in languages like Java and C#: the requirement to map out class definitions ahead of time lends itself all too easily to creating a parallel representation of your data model as an object graph. This is the "object-relational impedance mismatch", also known as the <a href="http://blogs.tedneward.com/post/the-vietnam-of-computer-science/">Vietnam of computer science</a>. You now have two data models, each subtly out of sync with the other, each trying to shoehorn data into formats that don't quite fit it. By contrast, JavaScript basically doesn't care what an object is. That lets Massive get away without any kind of modeling: it builds an API out of Tables and Queryables and Executables, but after that it's all arrays of anonymous result objects.</p>
<p>In an early version of this code, I automatically generated the schema based on column aliasing. The field <code>wines__id</code> would be allocated to an element of a collection named <code>wines</code> in the output. I wound up dropping this: naming conventions require significant up-front work, and if you're trying to do this to a view that already exists, it probably doesn't follow conventions I just came up with. This is poison for Massive, which is supposed to be a versatile toolkit with few expectations about your model. Providing a schema on invocation is still a non-negligible effort, but you only have to do it when you absolutely need it.</p>
<p>A schema looks like this:</p>
<pre><code class="hljs">{
  <span class="hljs-attr">"pk"</span>: <span class="hljs-string">"id"</span>,
  <span class="hljs-attr">"columns"</span>: [<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"country"</span>],
  <span class="hljs-attr">"wines"</span>: {
    <span class="hljs-attr">"pk"</span>: <span class="hljs-string">"wine_id"</span>,
    <span class="hljs-attr">"columns"</span>: {<span class="hljs-attr">"wine_id"</span>: <span class="hljs-string">"id"</span>, <span class="hljs-attr">"wine_name"</span>: <span class="hljs-string">"name"</span>, <span class="hljs-attr">"year"</span>: <span class="hljs-string">"year"</span>},
    <span class="hljs-attr">"array"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"varietals"</span>: {
      <span class="hljs-attr">"pk"</span>: <span class="hljs-string">"varietal_id"</span>,
      <span class="hljs-attr">"columns"</span>: {<span class="hljs-attr">"varietal_id"</span>: <span class="hljs-string">"id"</span>, <span class="hljs-attr">"varietal_name"</span>: <span class="hljs-string">"name"</span>},
      <span class="hljs-attr">"array"</span>: <span class="hljs-literal">true</span>
    }
  }
}</code></pre>
<p>Each nested element defines a <code>pk</code> field, which we'll use to distinguish records belonging to different objects at the appropriate level of the tree. <code>columns</code> may be an array or an object to allow renaming (every single one of our tables has a column called <code>name</code>, and prefixes only make sense for flat result sets). The <code>array</code> flag on inner schemas indicates whether objects created from the schema should be appended to a collection or added as a nested object on the parent. We don't have any instances of the latter, but it's something you'd use for a user with a rich profile object or another 1:1 relationship.</p>
<h2 id="making-a-hash-of-things">Making a Hash of Things</h2>
<p>Given a resultset and a schema to apply to it, our first order of business is consolidation. Chateau Ducasse only has one wine in our dataset, but since it's a cabernet sauvignon/merlot/cabernet franc blend, it shows up in three rows. And through some quirk of the sorting engine, those three rows aren't even adjacent. We'd be in trouble if we just accumulated data until the <code>id</code> changed -- we'd have records for a 2010 Chateau Ducasse cab franc and a 2010 Ducasse merlot/cab sauv, neither of which actually exists. If we did it <em>really</em> badly, we'd have two distinct Chateaux Ducasse with one imaginary wine each.</p>
<p>Fortunately, our schema defines a primary key field which will ensure that Chateau Ducasse is the only Chateau Ducasse; and we have hashtables. We can represent the query results as a recursively nested dictionary matching each object's primary key with its values for fields defined by the schema. Even for a relatively small data set like we have, this mapping gets big fast. This is what Chateau Ducasse's section looks like in full:</p>
<pre><code class="hljs">{ ...,
  <span class="hljs-string">"4"</span>: {
    <span class="hljs-string">"id"</span>: <span class="hljs-number">4</span>,
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"Chateau Ducasse"</span>,
    <span class="hljs-string">"country"</span>: <span class="hljs-string">"FR"</span>,
    <span class="hljs-string">"wines"</span>: {
      <span class="hljs-string">"7"</span>: {
        <span class="hljs-string">"id"</span>: <span class="hljs-number">7</span>,
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"Graves"</span>,
        <span class="hljs-string">"year"</span>: <span class="hljs-number">2010</span>,
        <span class="hljs-string">"varietals"</span>: {
          <span class="hljs-string">"1"</span>: {
            <span class="hljs-string">"id"</span>: <span class="hljs-number">1</span>,
            <span class="hljs-string">"name"</span>: <span class="hljs-string">"Cabernet Sauvignon"</span>
          },
          <span class="hljs-string">"5"</span>: {
            <span class="hljs-string">"id"</span>: <span class="hljs-number">5</span>,
            <span class="hljs-string">"name"</span>: <span class="hljs-string">"Merlot"</span>
          },
          <span class="hljs-string">"6"</span>: {
            <span class="hljs-string">"id"</span>: <span class="hljs-number">6</span>,
            <span class="hljs-string">"name"</span>: <span class="hljs-string">"Cabernet Franc"</span>
          }
        }
      }
    }
  }
}</code></pre>
<p>To generate this, we iterate over the resultset and pass each row through a function which recursively steps through the schema tree to apply the record data. For this schema, we're starting from <code>wineries</code> so the <code>id</code> 4 corresponds to Chateau Ducasse. Inside that object, the wine <code>id</code> 7 in the <code>wines</code> mapping corresponds to their 2010 Bordeaux, and so on.</p>
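<p>As a rough sketch of that first pass (not Massive's actual code), the following applies the schema from above to the three Chateau Ducasse rows. The <code>applyRow</code> and <code>RESERVED</code> names are made up for illustration, and the sketch assumes every row has a non-null key at each level, so it wouldn't cope with outer joins as written.</p>
<pre><code class="hljs language-javascript">const assert = require('assert');

const schema = {
  pk: 'id',
  columns: ['id', 'name', 'country'],
  wines: {
    pk: 'wine_id',
    columns: {wine_id: 'id', wine_name: 'name', year: 'year'},
    array: true,
    varietals: {
      pk: 'varietal_id',
      columns: {varietal_id: 'id', varietal_name: 'name'},
      array: true
    }
  }
};

// The three Chateau Ducasse rows, in the order the query returned them.
const rows = [
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7,
    wine_name: 'Graves', year: 2010, varietal_id: 6, varietal_name: 'Cabernet Franc'},
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7,
    wine_name: 'Graves', year: 2010, varietal_id: 5, varietal_name: 'Merlot'},
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7,
    wine_name: 'Graves', year: 2010, varietal_id: 1, varietal_name: 'Cabernet Sauvignon'}
];

// Schema keys which describe the current level rather than a nested schema.
const RESERVED = ['pk', 'columns', 'array'];

// Apply one row to the mapping at this level of the schema, then recurse
// into any nested schemas with the same row.
function applyRow(mapping, row, node) {
  const key = row[node.pk];

  if (!mapping[key]) {
    // An array keeps column names as-is; an object maps column to output name.
    const columns = Array.isArray(node.columns)
      ? node.columns.map(c => [c, c])
      : Object.entries(node.columns);
    const entry = {};

    for (const [from, to] of columns) {
      entry[to] = row[from];
    }

    mapping[key] = entry;
  }

  for (const child of Object.keys(node)) {
    if (RESERVED.includes(child)) continue;

    mapping[key][child] = mapping[key][child] || {};
    applyRow(mapping[key][child], row, node[child]);
  }

  return mapping;
}

const nested = rows.reduce((mapping, row) => applyRow(mapping, row, schema), {});

assert.equal(nested['4'].wines['7'].name, 'Graves');</code></pre>
<p>Because each level is keyed on its <code>pk</code>, it doesn't matter that the three rows weren't adjacent in the full result set: the second and third rows find the existing winery and wine entries and only add a varietal.</p>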
<h2 id="simplify">Simplify!</h2>
<p>However, the primary key mapping is obnoxious to work with. It's served its purpose of structuring our data in an arborescent rather than a tabular form; now it needs to go away, because it's an extra layer of complexity on top of our super-simple winery-wine-varietal tree. We need to break each winery value in the outer dictionary out into its own object, recurse into each of those to do the same for their wines, and finally recurse into the wines to handle the varietals.</p>
<p>If this sounds really similar to what we just did, that's because it is. It's technically possible to do this in one pass instead of two, but processing the raw results into a hashtable is much, much faster than the potential number of array scans we'd be doing.</p>
<p>To arrive at the final format, we reduce the mapping's key list; these are the primary keys of each winery in the example dataset. The corresponding values from the mapping go in the <code>reduce</code> accumulator. Since we're only dealing with arrays here, the accumulator will always be an array; if we had a subobject with a 1:1 relationship, we'd use an object accumulator instead by turning <code>array</code> off in the schema definition. This would result in the subobject being directly accessible as a property of its parent object.</p>
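<p>Here's a sketch of what that second pass might look like, run against the Chateau Ducasse mapping from the previous section. Again, this is an illustration rather than Massive's implementation: <code>flatten</code> and <code>RESERVED</code> are hypothetical names, and treating the outermost level as an array is an assumption of the sketch.</p>
<pre><code class="hljs language-javascript">const assert = require('assert');

const schema = {
  pk: 'id',
  columns: ['id', 'name', 'country'],
  wines: {
    pk: 'wine_id',
    columns: {wine_id: 'id', wine_name: 'name', year: 'year'},
    array: true,
    varietals: {
      pk: 'varietal_id',
      columns: {varietal_id: 'id', varietal_name: 'name'},
      array: true
    }
  }
};

// The primary key mapping for Chateau Ducasse.
const nested = {
  '4': {
    id: 4, name: 'Chateau Ducasse', country: 'FR',
    wines: {
      '7': {
        id: 7, name: 'Graves', year: 2010,
        varietals: {
          '1': {id: 1, name: 'Cabernet Sauvignon'},
          '5': {id: 5, name: 'Merlot'},
          '6': {id: 6, name: 'Cabernet Franc'}
        }
      }
    }
  }
};

// Schema keys which describe the current level rather than a nested schema.
const RESERVED = ['pk', 'columns', 'array'];

// Reduce the mapping's keys into an array accumulator, or into a single
// object when asArray is false, flattening nested mappings along the way.
function flatten(mapping, node, asArray) {
  return Object.keys(mapping).reduce((acc, key) => {
    const entry = Object.assign({}, mapping[key]);

    for (const child of Object.keys(node)) {
      if (RESERVED.includes(child)) continue;

      entry[child] = flatten(
        entry[child] || {}, node[child], node[child].array === true
      );
    }

    if (asArray) {
      acc.push(entry);
      return acc;
    }

    return Object.assign(acc, entry);
  }, asArray ? [] : {});
}

const wineries = flatten(nested, schema, true);

assert.equal(wineries[0].wines[0].varietals.length, 3);</code></pre>
<p>With <code>array</code> left off a nested schema, the same function merges the lone entry into an object accumulator, so the subobject hangs directly off its parent instead of sitting in a one-element array.</p>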
<p>Here's Catena:</p>
<pre><code class="hljs">[ ...,
  {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Bodega Catena Zapata"</span>,
    <span class="hljs-attr">"country"</span>: <span class="hljs-string">"AR"</span>,
    <span class="hljs-attr">"wines"</span>: [ {
      <span class="hljs-attr">"id"</span>: <span class="hljs-number">3</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Catena Alta"</span>,
      <span class="hljs-attr">"year"</span>: <span class="hljs-number">2013</span>,
      <span class="hljs-attr">"varietals"</span>: [ {
        <span class="hljs-attr">"id"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Malbec"</span>
      } ]
    }, {
      <span class="hljs-attr">"id"</span>: <span class="hljs-number">4</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Catena Alta"</span>,
      <span class="hljs-attr">"year"</span>: <span class="hljs-number">2013</span>,
      <span class="hljs-attr">"varietals"</span>: [ {
        <span class="hljs-attr">"id"</span>: <span class="hljs-number">1</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Cabernet Sauvignon"</span>
      } ]
    }, {
      <span class="hljs-attr">"id"</span>: <span class="hljs-number">5</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Nicolás Catena Zapata"</span>,
      <span class="hljs-attr">"year"</span>: <span class="hljs-number">2010</span>,
      <span class="hljs-attr">"varietals"</span>: [ {
        <span class="hljs-attr">"id"</span>: <span class="hljs-number">1</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Cabernet Sauvignon"</span>
      }, {
        <span class="hljs-attr">"id"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Malbec"</span>
      } ]
    } ]
  },
... ]</code></pre>
<p>Dead simple: we've got wineries, wineries have wines, wines have varietals. Everything lines up with the real primary key values from the original query result. We've turned a raw resultset with embedded relationships into a model <em>of</em> those relationships. This is much easier to manage outside the relational context in client code, and it's an accurate representation of the mental model we want our users to have. The schema does add a bit of overhead, but it's contained about as well as possible. Further automation only makes it less flexible from here on out.</p>]]></description>
            <link>https://di.nmfay.com/decomposition</link>
            <guid isPermaLink="true">https://di.nmfay.com/decomposition</guid>
            <pubDate>Fri, 26 Jan 2018 00:00:00 GMT</pubDate>
        </item>
        <item>
            <title><![CDATA[Behind the Curve: "New" vs "Compatible" in Node.js Package Development]]></title>
            <description><![CDATA[<p>The pace of Node.js development has created a complicated space for growing and maintaining reusable libraries. As new features are introduced, there's a certain pressure to keep up with the latest and greatest in order to simplify existing code and take advantage of new capabilities; but there's pressure in the opposite direction too, since projects which depend on the package aren't always themselves keeping up with Node.</p>
<p>My main open source project is <a href="https://massivejs.org">Massive.js</a>. It's a data access library for Node and the PostgreSQL relational database. I started participating in its development back before io.js merged back into Node and brought it up to ES6, and as of right now I'm still using it in one (not actively developed) product with an old-school callback-based API. I'm also relying on it in other projects with Node 8, the latest stable release line, so I've gotten to use a lot of the newer features which have collectively made Node development a lot more fun.</p>
<p>Given that libraries like mine are used with older projects and on older engines, the code has to run on as many of them as is practical. It's easy to assume with open source projects that if someone <em>really needs</em> to do whatever it is your package does in an engine from the stone age (better known as "yesterday" in Node) they can raise an issue or submit a pull request, or worst case fork your project and do whatever they have to to make it work. But in practice, the smaller the userbase for a package the less point there is to developing it in the first place, so there's a delicate balance to strike between currency and compatibility.</p>
<h2 id="important-numbers-in-nodejs-history">Important Numbers in Node.js History</h2>
<ul>
<li><strong>0.12</strong>: The last version before io.js merged back into Node and brought the newest version of Google's V8 engine and the beginnings of ES6 implementation with it. </li>
<li><strong>4</strong>: The major release series beginning with the reintegration of io.js in September 2015. Some ES6 language features such as promises and generators become natively available, freeing those Node developers able to upgrade from "callback hell". Node also moves to an "even major versions stable with long term support, odd major versions active development" release pattern.</li>
<li><strong>6</strong>: The 2016 long term support (LTS) release series rounds out the ES6 feature set with proxies, destructuring, and default function parameters. The first is a brand new way of working with objects, while the latter two are big quality-of-life improvements for developers.</li>
<li><strong>8</strong>: The 2017 LTS release series, current until Node 10 is released in April 2018. The big deal here is async functions: promises turned out to still be a bit unwieldy, leading to the rise of libraries like <a href="https://github.com/tj/co">co</a> exploiting generators to simplify asynchronous functionality. With <code>async</code>/<code>await</code>, these promise management libraries are no longer needed.</li>
</ul>
<h2 id="what-maximum-compatibility-means">What Maximum Compatibility Means</h2>
<p>For a utility library like Massive, the ideal scenario for end users is one where they don't have to care which engine they're using. Still on 0.12, or even before? Shouldn't matter, just drop it in and watch it go. Unfortunately, not only does this mean Massive can't take advantage of new language features, it affects what everyone else can do with the package themselves.</p>
<p>The most obvious impact is with promises, which only became standard in 4.0.0. Prior to that, there were multiple independent implementations like <a href="https://github.com/kriskowal/q">q</a> or <a href="https://github.com/petkaantonov/bluebird/">bluebird</a>, most conforming to the <a href="https://promisesaplus.com/">A+</a> standard. For Massive to use promises internally while running on older engines, it would have to bundle one of these. And that <em>still</em> wouldn't make a promise-based API useful unless the project itself integrated a promise library, since the only API metaphor guaranteed available on pre-4.0.0 engines is the callback.</p>
<p>Some of the most popular features which have been added to the language specification are ways to get away from callbacks. This is with good reason, although I won't go into detail here; suffice to say, callbacks are unwieldy in the best of cases. Older versions of Massive even shipped with an optional "deasync" wrapper which would turn callback-based API methods into synchronous -- blocking -- calls. This usage was wholly unsuitable for production, but easier to get off the ground with.</p>
<h2 id="a-breaking-point">A Breaking Point</h2>
<p>With the version 4 update, actively developed projects started moving toward promises at a good clip. We started seeing the occasional request for a promise-based API on the issue tracker. One older project of mine even got a small "promisify" API wrapper around Massive as we upgraded the engine and started writing routes and reusable functions with promises and generators thanks to <code>co</code>. Eventually things got to the point where there was no reason <em>not</em> to move Massive over to promises: anything that still needed callbacks was likely stable with the current API, if not legacy code outright.</p>
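<p>For illustration, a "promisify" wrapper of the kind described above can be sketched in a few lines. This is a minimal sketch, not the wrapper that project actually used: <code>findUser</code> is a hypothetical stand-in for a callback-based API method, and the rest/spread syntax assumes a version 6 series engine or newer.</p>

```javascript
// Hypothetical callback-style method standing in for a pre-3.0 API call.
// It follows the Node convention of an error-first callback.
function findUser(id, callback) {
  setImmediate(() => {
    if (typeof id !== 'number') {
      return callback(new Error('id must be a number'));
    }
    callback(null, { id: id, name: 'user' + id });
  });
}

// Wrap any error-first callback function so it returns a promise instead.
function promisify(fn) {
  return function (...args) {
    return new Promise((resolve, reject) => {
      // Append our own callback as the final argument and settle the
      // promise from whatever the wrapped function reports.
      fn.call(this, ...args, (err, result) => {
        if (err) { return reject(err); }
        resolve(result);
      });
    });
  };
}

const findUserAsync = promisify(findUser);
```

<p>Calling <code>findUserAsync(7)</code> returns a promise, so it can be chained with <code>.then</code> or yielded inside a <code>co</code> generator. Since version 8, Node ships an equivalent as <code>util.promisify</code>.</p>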
<p>This meant a clean break. The new release of Massive could use promises exclusively, while anything relying on callbacks would have to stay on the older version. By <a href="https://semver.org/">semantic versioning</a> standards, an incompatible API change requires a new major version. In addition to complying with semver, releasing the promise-based implementation as 3.0.0 would allow urgent patches to be made on the existing 2.x series concurrently with new and improved 3.x releases.</p>
<h2 id="multiple-concurrent-releases-with-tags">Multiple Concurrent Releases with Tags</h2>
<p>The npm registry identifies specific release series with a "dist-tag" system. When I <code>npm publish</code> Massive, it updates the release version on the <code>latest</code> tag; when a user runs <code>npm install massive</code>, whatever <code>latest</code> points to is downloaded to their system. Package authors can create and publish to other tags if they don't want to change the default (since without an alternative tag, <code>latest</code> will be updated). This is frequently used to let users opt in to prereleases, but it can just as easily let legacy users opt <em>out</em> of updates.</p>
<p>Publishing from a legacy branch in the code repository to a second tag means installing the most recent callback-based release is as easy as <code>npm i massive@legacy</code>. Or it could be even simpler: <code>npm i massive@2</code> resolves to the latest release with that major version. And of course, npm saves dependencies to package.json with a caret range by default, which disallows major version changes, so there are no worries about accidental upgrades.</p>
<p>You can list active dist-tags by issuing <code>npm dist-tag ls</code>, and manage them through other <code>npm dist-tag</code> commands.</p>
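<p>The protection against accidental major upgrades comes from the caret range (such as <code>^2.1.0</code>) that npm writes to package.json when it saves a dependency. The following is a simplified sketch of what that range accepts, assuming plain <code>MAJOR.MINOR.PATCH</code> versions with a major version of 1 or higher. The function names are hypothetical; npm itself relies on the <code>semver</code> package, which also handles 0.x versions and prereleases.</p>

```javascript
// Parse "MAJOR.MINOR.PATCH" into numbers. Prerelease and build
// suffixes are not handled in this sketch.
function parse(version) {
  const [major, minor, patch] = version.split('.').map(Number);
  return { major, minor, patch };
}

// Simplified caret check for major versions of 1 or higher:
// the major version must match, and the candidate must be at
// least as new as the base version.
function satisfiesCaret(version, base) {
  const v = parse(version);
  const b = parse(base);
  if (v.major !== b.major) { return false; }
  if (v.minor !== b.minor) { return v.minor > b.minor; }
  return v.patch >= b.patch;
}
```

<p>Under these assumptions, a project that depends on <code>^2.1.0</code> picks up later 2.x releases but never a 3.x release.</p>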
<h2 id="the-one-time-i-kind-of-screwed-up">The One Time I Kind of Screwed Up</h2>
<p>In July, a user reported an issue using Massive 3.x on a version 4 series engine. The version 6 stable release had been out for a while, and my active projects had already been upgraded to that for some time. The even newer version 8 series, with full <code>async</code> and <code>await</code> support, had just been released. The problem turned out to be that I'd unwittingly used default function parameters to simplify the codebase. This feature was only introduced in the version 6 release series, which meant Massive no longer functioned with version 4 engines.</p>
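<p>To make the incompatibility concrete, here is a sketch of the two styles side by side. The function names and default values are hypothetical, not taken from the Massive codebase. The first form is a syntax error on engines without default parameter support, so a module containing it fails to load at all rather than failing at call time.</p>

```javascript
// Default parameter syntax: requires an engine that supports it.
function pageWithDefaults(limit = 10, offset = 0) {
  return { limit: limit, offset: offset };
}

// Equivalent written without default parameters, which also parses
// on older engines. Like the syntax above, only `undefined`
// triggers the fallback value; `null` is passed through as-is.
function pageCompatible(limit, offset) {
  if (limit === undefined) { limit = 10; }
  if (offset === undefined) { offset = 0; }
  return { limit: limit, offset: offset };
}
```

<p>The second form is wordier, which is why the newer syntax is tempting when simplifying a codebase.</p>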
<p>Fixing the issue to allow Massive to run on the older engine would be a bit annoying, but possible. However, I had some ideas in the works that would require breaking compatibility with the version 4 series anyway: proxies cannot be polyfilled or transpiled for older engines, so anything using them can only run on version 6 series and newer engines. Rather than fix compatibility with an engine which was now superseded twice over only to break it again later, I ultimately decided to leave well enough alone and clarify the engine version requirement instead.</p>
<h2 id="move-slowly-and-deliberately-and-try-not-to-break-things">Move Slowly and Deliberately and Try Not to Break Things</h2>
<p>The main lesson of package development on Node is that you have to stay some distance behind current engine developments in order to reach the most users. How <em>far</em> behind is more subjective and depends on the project and the userbase. I think Massive is fine one full LTS version back, but a contrasting example can be found in the <a href="https://github.com/vitaly-t/pg-promise">pg-promise</a> driver it uses. Vitaly even goes as far as allowing non-native promise libraries to be dropped in, which hasn't strictly been necessary since 2015 -- unless you're stuck on an engine from before the io.js merge, which users of a more general-purpose query tool seem more likely to be.</p>
<p>Following semantic versioning practices not only ensures stability for users, but also makes legacy updates practical -- just check out the legacy branch, fix what needs fixing, and publish to the <code>legacy</code> tag instead of <code>latest</code>. One new feature and a couple of patches have actually landed on Massive v2 so far, but it's generally been quiet.</p>
<p>Having a clearly-defined standard for versioning has also helped manage the pace of continued development better: figuring out when and how to integrate breaking changes to minimize their impact is still tough, but it's vastly preferable to holding off on them indefinitely.</p>]]></description>
            <link>https://di.nmfay.com/behind-the-curve</link>
            <guid isPermaLink="true">https://di.nmfay.com/behind-the-curve</guid>
            <pubDate>Fri, 22 Dec 2017 00:00:00 GMT</pubDate>
        </item>
    </channel>
</rss>