Dian M Fay

ST_MapAlgebra and Tiled Rasters

Thu, 27 Jun 2024 00:00:00 GMT

Let's say you have two rasters, and you want to combine them with some extra value math -- perhaps you want to grade grassland around 50°N 100°E by latitude and elevation, so blue to green to red the more northerly or higher up the point. ST_MapAlgebra to the rescue!

select
  c.rid,
  st_mapalgebra(
    e.rast,
    c.rast,
    '
      case when [rast2.val] in (13, 14, 16, 17, 18)
        then -[rast1.y] + [rast1.val]
        else 0
      end
    '
  ) as rast
from diva.coverage as c
join diva.elevation as e on e.rid = c.rid;

Oops. (effect exaggerated for visibility)

If they are tiled it's a bit more complicated...

— Pierre Racine

Problem: instead of two big aligned rasters, you have a bunch of tiled aligned rasters, so instead of a smooth gradient south to north you have lots of little smooth south-north gradients. The rest of it works, but only within each tile, or here, within each horizontal strip of tiles, since we're ignoring x.

How to fix this? The expression argument only has a handful of available values, and all of them are internal to a specific raster value, that is, the tile. There's a function-callback version of ST_MapAlgebra that could add conditional logic based on external factors like tile latitude; however, I don't want to write and maintain a whole function for this really rather straightforward calculation.

But! ST_MapAlgebra executes per raster, and that means the expression argument is passed into the invocation each time. This means we can use format to pass in external variables -- here, converting pixel y-value to latitude within the SRID. To get the y-value of any given pixel relative to the SRID as a whole, take the latitude of the tile's top edge (ST_UpperLeftY), divide it by the height a pixel represents in the SRID (ST_PixelHeight), and subtract the pixel's y-value within the tile. Mileage may vary in the southern hemisphere, but this too is tractable.

select
  c.rid,
  st_mapalgebra(
    e.rast,
    c.rast,
    format(
      '
        case when [rast2.val] in (13, 14, 16, 17, 18)
          then %1$s - [rast1.y] + [rast1.val]
          else 0
        end
      ',
      st_upperlefty(c.rast) / st_pixelheight(c.rast)
    )
  ) as rast
from diva.coverage as c
join diva.elevation as e on e.rid = c.rid;
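To see why the offset stitches the tiles back together, here's a sketch with made-up numbers and no rasters at all: two vertically adjacent tiles, two pixel rows each, with pixels half a degree tall. The latitudes and pixel height are assumptions for illustration, standing in for what ST_UpperLeftY and ST_PixelHeight would return for real tiles.

```sql
-- hypothetical tiles: top edges at 51°N and 50°N, pixels 0.5° tall, two rows each
select
  tile.top_lat,
  px.y as y_in_tile,                                     -- what [rast1.y] sees
  tile.top_lat / tile.pixel_height - px.y as y_overall   -- what the formatted expression computes
from (
  values (51.0::float8, 0.5::float8), (50.0::float8, 0.5::float8)
) as tile (top_lat, pixel_height)
cross join generate_series(1, 2) as px (y)
order by tile.top_lat desc, px.y;
```

y_in_tile restarts at 1 for the second tile, but y_overall counts down 101, 100, 99, 98 straight across the tile boundary.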

Terminal Tools for PostGIS

Sun, 02 Jun 2024 00:00:00 GMT

Of late, I've been falling down a bunch of geospatial rabbit holes. One thing has remained true in each of them: it's really hard to debug what you can't see.

There are ways to visualize geometries and rasters. Some more-integrated SQL development environments like pgAdmin recognize and plot columns of geometry type. There's also the option of standing up a webserver to render out raster and/or vector tiles with something like Leaflet. Unfortunately, I don't love either solution. I like psql, vim, and the shell, and I don't want to do some query testing here and copy others into and out of pgAdmin over and over; I'm actually using Leaflet and vector tiles already, but restarting the whole server just to start debugging a modified query is a bit much in feedback loop time.

So: new tools. You need zsh, psql, and per usual, ideally a terminal emulator that can render images. I use wezterm but the only thing you'd need to change is the sole wezterm imgcat call in each. Both can also pipe out to files.

pgisd

The first one, and the tool I used to create the images in the fluviation post. pgisd runs the given SQL script and renders geometry or geography columns in the output. (It actually has to run the query twice, in order to detect and build rendering code for each geom column.)

I have some small polygons dumped from rasters, filtered, intersected, sliced, diced, et cetera. My script looks like this:

select
  geom,
  st_asewkt(st_centroid(geom)) as ewkt_centroid,
  format(
    '%1$s %2$s, radius %3$s',
    round(st_x((st_maximuminscribedcircle(geom)).center)::numeric, 2),
    round(st_y((st_maximuminscribedcircle(geom)).center)::numeric, 2),
    round((st_maximuminscribedcircle(geom)).radius::numeric, 2)
  ) as text_largest_circle
from lots_of_ctes

Without specifying a bounding box, you can barely pick out a couple of dots near where Mongolia would be on a WGS84 projection, given that the whole thing has been squeezed into some 800ish pixels wide:

Enhance:

Tweak the where clause to skip that one outlier and focus on the rest (the crosshair gets a bit flaky at around a single degree of width/height):

pgisd can also render multiple geom-prefixed (and ewkt-, and text-) columns in sequence. When piped to a file, only the first geometry is rendered and saved.

pgrast

And then I started needing rasters for things like elevation and land cover (with profuse thanks to the International Potato Center's DivaGIS project for compiling a ton of these for free!). This one's a bit simpler -- a raster is a raster, you locate the column and define a bounding box for the area you're interested in. Here's the location we were just looking at geometry intersections over:

And looking a little further east, here's the northeastern part of the Mongolian plateau in full; that's Lake Baikal at center-left.

But what if we want to simplify it? This came up a lot with the land cover, where each pixel value is one of 22 options (1 is broadleaf evergreen forest, 13 is grassland, 22 is urban) and I only wanted to see a few at a time, but pgrast's reclass option also works to flatten the pseudocolor output. Here's the same raster, where elevation < 1000m is blue, 1000-2000m is green, and anything above 2000m is red:

(Plausible) Random Geography Generation with PostGIS: Fluviation

Mon, 12 Feb 2024 00:00:00 GMT

Welcome to Squaria.

Squaria is a continent of highly unstable geography defined by a single SQL query (with, as we'll see, many, many CTEs). Its only consistent properties at the moment are its boxy shape and the two unnervingly straight mountain ranges that cross its breadth and meet on its lower eastern edge. Those mountains are impossible, but today's topic is fluviation, that is, rivers and riverine lakes; we'll see about plausible plate tectonics some other time, maybe.

The ever-shifting border of Squaria is defined by a Voronoi diagram within a 100-unit envelope, similar to Paul Ramsey's random polygon generation. Other shapes are of course easily achievable, and I'm probably going to steal his circular envelope outright in the future, but squares are easy to demo.

with recursive envelope as (
  select st_makeenvelope(0, 0, 100, 100) as geom
), voronoi_unclipped as (
  select (st_dump(st_voronoipolygons(
    g1 => st_generatepoints(
      envelope.geom,
      500 -- increase this for a finer polygon mesh
    ),
    tolerance => 0.0,
    extend_to => envelope.geom
  ))).geom as poly
  from envelope
), voronoi as (
  -- clip the Voronoi diagram to only those polys fully inside the envelope
  select voronoi_unclipped.poly
  from envelope
  join voronoi_unclipped on st_contains(envelope.geom, voronoi_unclipped.poly)
), border as (
  select st_boundary(st_concavehull(st_union(poly), 0)) as linestr
  from voronoi
)
select * from border;

But it's not just the border that we're concerned about here. If you want to simulate fluviation, you have to go at least a little distance toward simulating fluid mechanics. Water famously flows downhill; a downhill implies an uphill implies height. Let's add those mountain ranges and generate a heightmap while trying not to think too hard about the fact that PostGIS supports a third dimension, making full-scale volumetric simulation theoretically achievable.

All these CTEs build on each other, so if you're following along for fun, you'll need to combine the statements (minus the final selects in each, which just output the current step). I'll post the whole thing at the end too.

with mountain_range as (
  with nonrandom_line as (
    select st_makeline(st_point(0, v.y1), st_point(100, v.y2)) as linestr
    from (values (70, 30), (20, 35)) as v (y1, y2)
  )
  select
    st_collect(voronoi.poly) as geom,
    st_collect(nonrandom_line.linestr) as linestr
  from voronoi
  cross join nonrandom_line
  where st_intersects(voronoi.poly, nonrandom_line.linestr)
), heightmap as (
  select
    voronoi.poly,
    -- height is a function of distance from the mountains, also factoring in
    -- x-position (Squaria's east is lower than its west) and a little random
    -- variance to make things interesting
    100
      - (min(st_distance(voronoi.poly, mountain_range.geom)) * 1.5)
      - (st_x(st_centroid(voronoi.poly)) * 1.5 / 10)
      + (random() * 6 - 3)
      as height
  from voronoi
  cross join mountain_range
  group by voronoi.poly
)
select
  st_collect(
    st_translate(
      st_scale(st_letters(round(h.height)::text), .03, .03),
      st_x(st_centroid(h.poly)),
      st_y(st_centroid(h.poly))
    )
  )
from heightmap as h;

We've separated the high from the low! Now, just add water:

with headwater as (
  select poly, height
  from heightmap
  join border on true
  where height < 90
    and not st_touches(poly, border.linestr)
  order by random()
  limit 30
)
select st_asewkt(st_centroid(poly)), height
from headwater;

Alternatively, we could favor a more uniform distribution (central Squaria looks a bit desolate, and there's a river in the south flowing between four springs in a row); this spaces headwaters out much more effectively, but placement is the easy part and the current incarnation of Squaria illustrates an important point later on.

with headwater as (
  select poly, height
  from heightmap
  join border on true
  join (
    -- draw a grid of horizontal and vertical lines 10 units apart
    with x as (
      select st_makeline(st_point(0, generate_series), st_point(100, generate_series)) as geom
      from generate_series(10, 90, 10)
    ), y as (
      select st_makeline(st_point(generate_series, 0), st_point(generate_series, 100)) as geom
      from generate_series(10, 90, 10)
    )
    -- collect the points at which the horizontal and vertical lines cross
    select st_collect(st_intersection(x.geom, y.geom)) as geom
    from x
    cross join y
  ) as grid on true
  where height < 90
    and not st_touches(poly, border.linestr)
    and st_intersects(poly, grid.geom) -- pick polys at those intersection points
  order by random()
  limit 30
)
select st_asewkt(st_centroid(poly)), height
from headwater;

And let it flow:

with river_poly as (
  select
    row_number() over () as id,
    1 as iter,
    1 as length,
    headwater.poly,
    headwater.height,
    array[headwater.poly]::geometry[] as polys,
    array[st_centroid(headwater.poly)]::geometry[] as centroids,
    0 as lake_poly_depth
  from headwater
  union
  select
    -- when neighbor.poly is null, we could not find a lower polygon to move into: sit here and lake up
    previous.id,
    previous.iter + 1 as iter,
    case when neighbor.poly is null then previous.length else previous.length + 1 end as length,
    coalesce(neighbor.poly, previous.poly) as poly,
    coalesce(neighbor.height, previous.height + 2) as height, -- fill in lakebed
    case
      when neighbor.poly is null then previous.polys
      else array_cat(previous.polys, array[neighbor.poly])
    end as polys,
    case
      when neighbor.poly is null then previous.centroids
      else array_cat(previous.centroids, array[st_centroid(neighbor.poly)])
    end as centroids,
    case
      when neighbor.poly is null then previous.lake_poly_depth + 1
      else 0
    end as lake_poly_depth
  from river_poly as previous
  left outer join lateral (
    select *
    from heightmap
      where st_touches(heightmap.poly, previous.poly)
        and heightmap.height < previous.height
        -- can't return to a poly with the same bounding box as a previously visited one
        and not(heightmap.poly ~= any(previous.polys))
      -- pick the closest centroid
      order by st_centroid(heightmap.poly) <-> st_centroid(previous.poly)
      limit 1
  ) as neighbor on true
  -- border is a single-row relation so we can do this weird shortcut antijoin
  inner join border on not st_touches(previous.poly, border.linestr)
  where previous.lake_poly_depth < 5
)
select id, st_asewkt(st_makeline(centroids))
from river_poly
inner join (
  select id, max(iter) as iter
  from river_poly
  group by id
) as maxiter
  on maxiter.id = river_poly.id
  and maxiter.iter = river_poly.iter;

Alright, that one's a lot to deal with all at once. This recursive CTE is the very beating heart of our river generator. Like all recursive CTEs, it has a base term -- projecting a bunch of fields and initial values from headwater -- and a recursive term which works.... not quite how you might expect recursion to operate.

In Postgres, "recursion" is actually iteration. The base term is evaluated first, and its results placed in a "working table". Then, the recursive term is evaluated, with the self-reference river_poly indicating the working table. The results of the recursive evaluation become the new working table; if at this point there's anything in that working table, the recursive term is evaluated again.

The output of the CTE includes everything that has ever been in the working table. This is why the demo select has a self-join: to include only the final results for each river, rather than every step of each one's progress from headwater to cell to cell.
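Here's a stripped-down illustration of the working table with no geometry involved: a single hypothetical "river" that does nothing but lose a point of height per iteration. The column names echo river_poly, but the trickle relation and its values are invented for the example.

```sql
with recursive trickle as (
  -- base term: evaluated once, seeds the working table
  select 1 as id, 1 as iter, 3 as height
  union
  -- recursive term: only sees the rows produced by the previous pass
  select id, iter + 1, height - 1
  from trickle
  where height > 0
)
select trickle.*
from trickle
inner join (
  select id, max(iter) as iter
  from trickle
  group by id
) as maxiter
  on maxiter.id = trickle.id
  and maxiter.iter = trickle.iter;
```

A bare select * from trickle returns four rows, one per pass (heights 3, 2, 1, 0). The self-join keeps only the last of them, iter 4 at height 0.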

How does that progress happen, though?

The first thing you might notice is the projection logic that depends on whether neighbor.poly is null. Let's talk about neighbor first, though, in that lateral join below. Lateral join: the subquery is evaluated for each record, here from the working table (previous). So for each headwater in the first recursive execution, or for the last chunk of each river added in each successive iteration, we inspect neighboring polygons in the heightmap. We're looking for a lower polygon, not the lowest to avoid racing too quickly to local minima, and one which we haven't seen before for this river. Picking the closest lower centroid keeps things reasonably random and avoids some occasional funny-looking leaps across the landscape.
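The same lateral-join shape can be tried on a toy, one-dimensional heightmap. This is only a sketch: the cell relation and its values are hypothetical, and plain distance along x stands in for the centroid <-> comparison.

```sql
with cell (id, x, height) as (
  values (1, 0.0, 50), (2, 1.0, 48), (3, 3.0, 40), (4, 2.5, 55)
)
select previous.id, neighbor.id as flows_into
from cell as previous
left outer join lateral (
  -- evaluated once per row of previous
  select *
  from cell
  where cell.height < previous.height
  -- closest lower cell, not the lowest one
  order by abs(cell.x - previous.x)
  limit 1
) as neighbor on true
order by previous.id;
```

Cell 1 flows into cell 2 (closest lower) rather than cell 3 (lowest); cells 2 and 4 both flow into cell 3; and cell 3 has nowhere to go, so its flows_into is null, which is the case that triggers lake formation in river_poly.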

But wait. If we always move into lower neighboring cells, why do we need an additional check against retracing our steps?

The answer is local minima again. Our best efforts notwithstanding, it's easy for a river to flow into a cell surrounded by higher neighbors on all sides. Endorheic basins without outlets exist, but they're not that common (and lakes in them tend to be saline). So if a river enters such a basin, we want to give it a chance or several to exit again. We do this by simulating accumulation in a lake which raises the effective height of the current cell. On the next iteration, neighbors that were previously higher could be lower -- including that from which the river entered the lake cell, which is also by definition the closest lower centroid.

The neighbor.poly null checks in the select clause drive lake formation. When there's no valid neighbor, the river-in-progress increments its lake or effective height and stays still; otherwise, it proceeds into the neighboring cell, accumulating the neighbor's polygon and precalculating its centroid for later. Rivers get five chances to proceed at each iteration before they're excluded from the working table and terminate in an endorheic lake. At +2 per increment, this allows lakes to overcome a difference of up to 8 elevation points.
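The lake arithmetic can be checked without any geometry. This is a simplified sketch rather than the real river_poly: each scenario is one cell starting at height 46 with a single hypothetical neighbor, and the river simply stops once it could flow out instead of moving on.

```sql
with recursive lake as (
  select s.neighbor_height, 0 as lake_poly_depth, 46 as height
  from (values (51), (55)) as s (neighbor_height)
  union
  -- no lower neighbor yet: deepen the lake and raise the effective height
  select neighbor_height, lake_poly_depth + 1, height + 2
  from lake
  where lake_poly_depth < 5
    and neighbor_height >= height
)
select
  neighbor_height,
  max(lake_poly_depth) as final_depth,
  max(height) as final_height
from lake
group by neighbor_height
order by neighbor_height;
```

With the neighbor at 51, the lake stops growing at depth 3 (height 52) because the river can leave. With the neighbor at 55, the fifth and last attempt happens at height 54, which isn't enough; the row that reaches depth 5 is never fed back into the recursive term.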

The last thing river_poly does is detect whether the river has reached the edge of the continent. The "weird shortcut antijoin" keeps a river in the working table only as long as the last cell it moved into wasn't on the border.

The "finished" flow hasn't quite reached the border because the rivers are drawn centroid to centroid, but making that connection is an st_closestpoint away.

That worked nicely! There's got to be a catch.

There are a few catches.

First, a philosophical question: after a river joins another, how many rivers do you have? You might be able to make a case for two at the confluence of the Rio Negro and Solimões, but that's an exception. If we mean this map to be useful, we can't be having two rivers occupy the same space all the way to the sea. One has to end and the other has to keep going.

Second, the output of the recursion includes as many records per river as the river has cells, because the working table is pushed into the result set for each cycle. We need to remove all non-final records.

Third: using lake formation to increase effective height and allow rivers to proceed enables a very specific paradox. A river alpha can flow into beta, but beta could there or further downstream enter and exit a lake, thereby increasing its effective height, and flow back into alpha. Simplified:

alpha flows east 51>---50>---------49>-------48
                        \                      \
                         51-----<[46+6=52]-----<47-----<50 beta flows west

You can see this happening in the lower right of the gif, where two rivers start near each other just below the lower mountain range and meet in a lake -- actually, the triangular lake is filled first by the westward-flowing river, but is naturally downhill from the eastward-flowing river. On the next cycle, neither can exit, so the lake fills to depth 4 for westward and eastward forms its own depth-2 lake. The cycle after that, eastward's closest neighbor is westward's headwater; westward manages to flow out as well, but is trapped at the following cycle and has to fill an adjacent lake cell before continuing southwest.

We already have the maximum iter for each river id, and checking for cross-confluence is a matter of looking for common centroids in differing orders:

with river_pruned as (
  -- remove intermediary rows generated as river_poly recurses
  select river_poly.*
  from river_poly
  inner join (
    select id, max(iter) as iter
    from river_poly
    group by id
  ) as maxiter
    on maxiter.id = river_poly.id
    and maxiter.iter = river_poly.iter
  where not exists (
    -- if two rivers cross each other in opposite directions, pick the one with the lower
    -- id and eliminate the other
    select 1
    from river_poly as rp2
    where rp2.id < river_poly.id
      and array(select * from unnest(rp2.centroids) where unnest = any(river_poly.centroids)) <>
        array(select * from unnest(river_poly.centroids) where unnest = any(rp2.centroids))
  )
)
select id, st_asewkt(st_collect(centroids)) from river_pruned;
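The array comparison in that not exists is easier to see with letters standing in for centroids. A minimal sketch with invented data: a and b are the paths of two hypothetical rivers, and each side of the <> keeps only the shared cells, in that river's own order.

```sql
select
  pair.label,
  array(select * from unnest(pair.a) where unnest = any(pair.b))
    <> array(select * from unnest(pair.b) where unnest = any(pair.a)) as crosses
from (
  values
    ('same direction', array['p', 'q', 'r', 's'], array['t', 'r', 's']),
    ('opposite directions', array['p', 'q', 'r', 's'], array['t', 's', 'r'])
) as pair (label, a, b);
```

In the first pair both rivers pass through r and then s, so the filtered arrays match and crosses is false. In the second, one river sees r then s and the other s then r, so crosses is true and the higher-id river would be pruned.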

Pruning takes care of 2 and 3, but if you run this you'll see the same centroids appear several times in each river. We still need to ensure that a tributary is just a tributary and not the rest of the other river downstream from its confluence.

with cutoff as (
  -- find the first point at which a river "loses" a confluence and becomes
  -- subsumed in another river's flux
  select
    p.id,
    min(array_position(p.centroids, confluence.centroid)) as position
  from river_pruned as p
  join (
    -- centroids of all cells entered by more than one river; furthest upstream
    -- wins, with lower ids breaking ties
    select
      unnest as centroid,
      (array_agg(id::text order by ordinality, id))[1] as winner
    from river_pruned
    join lateral unnest(centroids) with ordinality on true
    group by unnest
    having count(*) > 1
  ) as confluence on confluence.winner <> p.id::text
    and array_position(p.centroids, confluence.centroid) > 0
  group by p.id
), river_line as (
  select
    river_pruned.id,
    river_pruned.length,
    river_pruned.poly,
    river_pruned.height,
    cutoff.position as cutoff,
    river_pruned.polys[1:coalesce(cutoff.position, river_pruned.length)] as polys,
    st_makeline(
      case
        -- only rivers which are not cut off and which come adjacent to the border
        -- require the additional segment connecting to the shore!
        when cutoff.position is null
          and st_touches(
              river_pruned.polys[coalesce(cutoff.position, river_pruned.length)],
              border.linestr
          )
        then array_cat(
          river_pruned.centroids[1:coalesce(cutoff.position, river_pruned.length)],
          array[st_closestpoint(
            border.linestr,
            st_centroid(river_pruned.polys[coalesce(cutoff.position, river_pruned.length)])
          )]
        )
        else river_pruned.centroids[1:coalesce(cutoff.position, river_pruned.length)]
      end
    ) as geom
  from river_pruned
  left outer join cutoff on cutoff.id = river_pruned.id
  join border on true
)
select id, length, cutoff, st_asewkt(geom)
from river_line;
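For a feel of what cutoff does, here's the same logic run over text arrays instead of centroid arrays. It's a sketch with invented data: two hypothetical rivers that merge at cell x, with the position column renamed cut_at and cardinality() standing in for the length column.

```sql
with river_pruned (id, centroids) as (
  values
    (1, array['a', 'b', 'x', 'y']),
    (2, array['c', 'x', 'y'])
), confluence as (
  -- cells entered by more than one river, and which river keeps each
  select
    unnest as centroid,
    (array_agg(id::text order by ordinality, id))[1] as winner
  from river_pruned
  join lateral unnest(centroids) with ordinality on true
  group by unnest
  having count(*) > 1
), cutoff as (
  select p.id, min(array_position(p.centroids, confluence.centroid)) as cut_at
  from river_pruned as p
  join confluence on confluence.winner <> p.id::text
    and array_position(p.centroids, confluence.centroid) > 0
  group by p.id
)
select
  p.id,
  p.centroids[1:coalesce(cutoff.cut_at, cardinality(p.centroids))] as kept
from river_pruned as p
left outer join cutoff on cutoff.id = p.id
order by p.id;
```

River 2 reaches x in fewer steps, so it wins both x and y and keeps its whole path {c,x,y}. River 1 is cut at its first lost cell and ends at the confluence: {a,b,x}.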

An aside: it took me a while to come up with the nearest-lower-neighbor approach, for no good reason. Before I got there, I still wanted to avoid a race to local minima, but did it with a truly random choice of lower neighbor for each river. Rivers crossed each other willy-nilly towards the sea, which I thought I'd address in a postprocessing step. This got absolutely cursed, involving another recursive CTE running window functions over unnested centroid arrays to eliminate confluence losers from all future contests. It's much, much better this way.

Anyway, at this point we're basically done! All that's left is to accumulate all the geometries for storage or display. Here's the final script:

fluviate.sql

pdot: Exploring Databases Visually, Part II

Sun, 13 Aug 2023 00:00:00 GMT

A couple years ago, I wrote about exploring a running database by plotting relevant subsets of the foreign key relationship graph in dot and piping the resulting images directly to the terminal. Things have progressed since then:

I supplemented the original fks shell script with others plotting view dependencies, role hierarchies and grants, and finally started to map the effects of triggers and functions;
I hit my personal ceiling of What It Is Reasonable To Do In Shell Scripts, and decided to pull all this stuff together into a single cross-platform program with a consistent interface;
I did that, and added mermaid support for good measure;
& then, I forgot to write anything about having released it for a couple of months, as you do

The fks side of things hasn't changed much from the earlier post (aside from some niceties around table inheritance), so here's the big new thing:

pdot is out for Linux (including the Arch AUR), Windows, and macOS universal. I'll be talking more about it and exploration as a documentation strategy at the Chicago PUG in November, and possibly elsewhere!

PGSQL Phriday #009 Roundup

Wed, 07 Jun 2023 00:00:00 GMT

Another Phriday in the books, and it's time to see what all happened:

Hari Kiran offered an introduction to the concepts and processes used in schema evolution with examples showing how to use Flyway, a popular Java-based tool.
Grant Fritchey discussed what it means to roll a change back, why catastrophic failures are the good kind of failure, and how to work with deployment processes to adapt to failures and roll the database forward to a good state instead of trying to turn back time.
Michael Christofides wrote about table stakes for database automation and his hopes for better integration of performance testing in automated change management.
Andy Atkinson ran through the entire prompt item by item! Read it for a detailed look at Rails migrations in situ -- who writes them, who reviews them, what kinds of problems happen, and how to validate successful changes.
Ryan Booz gave me ERWin flashbacks and proclaimed a decalogue for the aspiring database automator, from the foundational on up. No comment on whether I've ever achieved #7 without painstakingly restoring manual production dumps.
finally, I covered the ideas powering a few less-usual schema evolution tools.

Thanks everyone for participating, and look forward to Alicja's invitation coming around the end of the month!

PGSQL Phriday #009: Three Big Ideas in Schema Evolution

Fri, 02 Jun 2023 00:00:00 GMT

I've used several migration frameworks in my time. Most have been variations on a common theme dating back lo these past fifteen-twenty years: an ordered directory of SQL scripts with an in-database registry table recording those which have been executed. The good ones checksum each script and validate them every run to make sure nobody's trying to change the old files out from under you. But I've run into three so far, and used two in production, that do something different. Each revolves around a central idea that sets it apart and makes developing and deploying changes easier, faster, or better-organized than its competition -- provided you're able to work within the assumptions and constraints that idea implies.

sqitch: Orchestration

The first time I used sqitch, I screwed up by treating it like any other manager of an ordered directory of SQL scripts with fancier verification and release management capabilities. It does have those, but they weren't why I used sqitch the second and subsequent times.

sqitch wants you to treat your schema and the statements that define it as a supergraph of all your inter-database-object dependencies. Tables depend on types, on other tables via foreign keys, on functions with triggers or constraints; views depend on tables and functions; functions depend on tables, on other functions, on extensions. Each one, roughly, is a single named migration -- more on that in a bit.

So shipments depend on warehouses, since you have to have a source and a destination for the thing you're shipping, and warehouses depend on regions, because they exist at a physical address subject to various laws and business requirements. shipments also have no meaning independently from the thing being shipped, so in the case I'm filing the serial numbers from, that table also maintains a dependency on weather-stations. Both shipments and warehouses depend on the existence of the set_last_updated audit trigger function. The plan file looks like this:

trigger-set-updated-at 2020-03-19T17:20:30Z dian <> # trigger function for updated_at audit column
regions 2020-03-19T18:30:27Z dian <> # region/country lookup
warehouses [regions trigger-set-updated-at] 2020-03-20T16:34:56Z dian <> # storage for stuff
weather-stations [trigger-set-updated-at warehouses] 2020-03-20T17:46:36Z dian <> # stations!
shipping [warehouses weather-stations] 2020-03-20T18:56:49Z dian <> # move stuff around

Or, for the more visually inclined:

I have often kept tables and tightly coupled database objects such as types, junction tables, or (some) trigger functions in one file. Here, stations defines health and status types, a serial number sequence, and more, while warehouses includes a cluster of related tables representing inventory quantities.

There are two reasons behind this. First, I've mostly used sqitch on very small teams. If I'm the only person, or nearly the only person (I wrote 97% of the migrations in the weather-stations project) working on the database, the effort of factoring becomes pure overhead well before each database object has its own individual set of files.

Second, orchestration cuts both ways. Reworking and tracking the history of individual database objects is great as long as the changes stay local, but changes to a type or domain, for example, often involve a drop and replacement. The drop can't happen as long as there are columns of that type or domain anywhere else, so those have to be managed simultaneously. It's ugly no matter what, but in a linear "directory of scripts" framework, it's only as ugly as any other major change. Your script can create the new type, migrate dependent columns to it, drop the old type, and finally rename the new.
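As a rough sketch of that single linear script, assuming a hypothetical station_status enum that needs to lose a value (the table, column, and labels here are all invented for illustration):

```sql
-- setup: a hypothetical existing schema
create type station_status as enum ('ok', 'broken');
create table weather_stations (
  serial_number integer primary key,
  status station_status not null
);
insert into weather_stations values (1, 'ok'), (2, 'broken');

-- the migration itself
begin;
create type station_status_new as enum ('ok', 'degraded', 'offline');
-- migrate the dependent column, remapping the value that's going away
alter table weather_stations
  alter column status type station_status_new
  using (case status when 'broken' then 'offline' else status::text end)::station_status_new;
drop type station_status;
alter type station_status_new rename to station_status;
commit;
```

A real schema will likely have more than one dependent column, plus defaults, indexes, or views in the way, which is where the ugliness comes from.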

If you're using sqitch rigorously, the change is smeared across multiple sites and across time: rework the type to add the replacement, rework each dependent table to migrate its columns, tag, rework the type again to drop the old and rename the new, tag. Or you could hijack the typename.sql rework and do everything all at once in it -- undermining the sensible, well-delineated organization of schema objects that distinguishes sqitch in the first place. It's even worse when view dependencies change.

Using closely-related subgraphs instead of individual database objects as the "unit" of sqitch changes keeps many (not all) messy migrations contained, but there's no complete answer.

graphile-migrate: Idempotence

graphile-migrate is developed alongside but does not require Postgraphile, and hews a lot closer to the traditional directory-of-scripts style. Change scripts are numbered, checksummed, and validated per usual, but the development experience of graphile-migrate is unique.

Every other schema evolution framework I've used has expected me to run the next changeset once and only once on top of the previous and only the previous, even during active development. Any tweaks, fixes, or additions can't be applied until the database has been reset, whether by a revert or "down" migration, manually issuing DDL and deleting the run record from the change registry, or often as not dropping and recreating the dev database from scratch.

graphile-migrate expects you to run the migration you're actively working on over and over again. It even defaults to a file-watch mode which runs in the background and executes the "current" migration every time you save. I don't use that, because I save early and often, draft valid-but-destructive DDL with some frequency, and want to run tests, hence graphile-migrate watch --once && pg_prove; but the fact that executing the current migration just the one time is a special case kind of says it all.

It shouldn't matter whether you run the current migration once or a hundred times: the end database state must be identical. This can take some doing. On the easy side, it's always create or replace, never just create; but sometimes idempotent replacement isn't an option. Types and domains, constraints and options, row-level security policies, and more (views, if existing columns are changing) have to be handled with more care. And if not exists is a trap for the unwary.

create table if not exists warehouses (....) runs! The table is there, with the columns we've specified; next time we run the current migration, it skips warehouses seamlessly. It's great -- until we realize there's a column missing and add it in the create table definition, whereupon the next time we run the current migration, it skips warehouses seamlessly. The change needs to be this instead:

drop table if exists warehouses;
create table warehouses(....);

In the case where warehouses was created in an already-committed migration and we need to add the column without dropping existing data, it's time to break out do blocks:

do $maybe_add_active$
begin
  alter table warehouses add column is_active boolean not null default true;
exception
  -- the column already exists from an earlier run of the current migration
  when duplicate_column then null;
end
$maybe_add_active$;

The official examples suggest scanning the system catalogs to determine whether to run a statement, but I've often found it quicker and easier to damn the torpedoes and trap specific exceptions afterward.
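For comparison, a catalog-scanning version of the same change might look something like this. It's my own sketch of the general approach rather than anything copied from the graphile-migrate documentation, and the one-column warehouses table is a hypothetical stand-in for the already-committed one.

```sql
-- hypothetical stand-in for the table created in an earlier migration
create table warehouses (id integer primary key);

do $maybe_add_active$
begin
  -- look before leaping: only add the column if the catalog doesn't list it
  if not exists (
    select 1
    from pg_attribute
    where attrelid = 'warehouses'::regclass
      and attname = 'is_active'
      and not attisdropped
  ) then
    alter table warehouses add column is_active boolean not null default true;
  end if;
end
$maybe_add_active$;
```

Running the do block a second time is a no-op, same as the exception-trapping version, at the cost of spelling out the catalog query.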

migra: Comparison

migra makes me nervous. This is the one I've never deployed, which is largely down to one specific but complicated reason: you don't write migrations (yay!), because it magically infers the necessary changes between old and new schemata (cool!), which means it maintains an internal model or map of Postgres features (stands to reason), which cannot be complete as long as Postgres is actively developed.

Is that necessary incompleteness really a dealbreaker? I actually think it shouldn't be! migra's goal is to save you all the time you formerly spent writing migration scripts at the hopefully-much-reduced cost of reviewing them and revising the tricky or unsupported bits. Its automated playbook doesn't have to be complete to make database development significantly faster.

A legitimate dealbreaker in some situations is that migra does not maintain a registry or even a history of valid schema states. There's only previous and next, with the latest revised diff between the two tracked in source control. It's theoretically feasible to pull all versions of the diff between t0 and tn and apply them one by one to reproduce the schema of a customer on a database dating back to that tn, but at that point you're setting all your other time savings on fire.

I haven't been in such a situation for some time, having had only one production database instance per project. Even so, when it's been up to me I've reached for a known quantity instead of investigating migra any further. Why? Because it isn't just a question of how completely migra supports Postgres features.

Complexity varies from database to database and from change to change. Something like migra could save tons of time on one database and not another, or even from one schema evolution to the next. It's hard to know whether you're in a migra-friendly scenario or not until you've already committed yourself, and the risk of falling out of that state and into writing complex migrations from scratch with next-to-no tool support doesn't go away.

It's a fantastic idea -- I should be able to reshape a database interactively, then generate at least an outline of a migration by comparing it to an unmodified baseline! Better still if I could test my change against that baseline and evaluate progress by the items remaining in the diff. The risks and the lack of history and verification keep me from using migra, but I hope we'll see its DNA in some of the next generation of schema evolution tools.

PGSQL Phriday #009 Invitation: Making Changes

Fri, 26 May 2023 00:00:00 GMT

It's almost Phriday again! This is a monthly blogging event for the PostgreSQL community. The rules:

publish something on-theme on or near Friday, June 2nd
include "PGSQL Phriday #009" in your title or first paragraph, and link to this invitation post
share it! The best way to reach the greater Postgresphere is to get syndicated on Planet Postgres, but you can also share on #pgsqlphriday in the community Slack or post to social media with the #PGSQLPhriday hashtag

This month's topic is database change management, aka schema evolution. I've been doing this in one form or another, using one framework or another (and on one less-memorable-than-you'd-think occasion writing my own in a thousand lines of Ant XML) for almost as long as I've worked in software. If you interact with databases in more than a read-only capacity, you've probably done your share of it as well. It's common, it's necessary, it's not very glamorous.

Every now and then, someone will extol the benefits of version-controlling your schema -- Grant Fritchey discussed this at PGDay Chicago just last month -- or write a how-to for a specific framework. There's a slow current of academic interest in the topic which seems to have limited feedback into industry, publications tending toward the descriptive or the heavily specialized with only the occasional experiment like PRISM seeing daylight. But the people deploying changes day to day don't tend to talk much about the nitty-gritty details or the experience of modifying a running database, because change management is plumbing.

Plumbing is really important, and there are a lot of fascinating technical, procedural, social, even philosophical aspects to it. Let's haul a few of them into the spotlight!

Some starting points:

how does a change make it into production? Do you have a dev-QA-staging or similar series of environments it must pass through first? Who reviews changes and what are they looking for?
what's different about modifying huge tables with many millions or billions of rows? How do you tackle those changes? Do you use the same strategy for smaller tables?
how does Postgres make certain kinds of change easier or more difficult compared to other databases?
do you believe that "rolling back" a schema change is a useful and/or meaningful concept? When and why, or why not?
how do you validate a successful schema change? Do you have any useful processes, automated or manual, that have helped you track down problems with rollout, replication, data quality or corruption, and the like?
what schema evolution or migration tools have you used? What did you like about them, what do you wish they did better or (not) at all?
tales of terror in the Kletzian mode are also of course very welcome!

PGSQL Phriday #008: pg_stat_statements

Fri, 05 May 2023 00:00:00 GMT

pg_stat_statements for May. As luck would have it, it's been invaluable to me over the past few weeks as I've been solving some performance problems of the "tens of millions of rows, row-level security, inverted indices, tens of thousands of rows returned, oops I never did get around to double-checking work_mem in production did I?" variety. The big lesson this time around: pay attention to the standard deviation of timings!

The most often called (by far) and longest running (by a much closer margin) statements in this scenario were coming from an account synchronization daemon. Every fifteen seconds the daemon pulls user account information from Keycloak and overwrites the materialized local data, a pattern that sounds suspiciously like an inferior implementation of something RDS is not going to ship any time soon. postgres_fdw is there, of course, but then we'd be depending on Keycloak's schema rather than its API, and that's a much chancier proposition.

The initial user sync implementation wrote to three relevant tables in a single statement using CTEs, because why not? It's easy, convenient, and seemed to work just fine in non-production environments.

In production, though:

calls	min_exec_time	mean_exec_time	max_exec_time	stddev_exec_time
8,148,657	0.535	9.720	36,272.717	115.918
4,489,798	0.560	15.650	81,526.365	77.713

These are the same statement: with dataset as ([upsert dataset] returning *), person as (insert into person [with dataset membership] returning *) insert into account [reference to person]. For us, accounts are special cases of people and people have a tag array column linking them to datasets; we have reasons to avoid a junction table that don't make a difference here.

The daemon got a dedicated Postgres role 8 million executions after I enabled pg_stat_statements, and used that for 4.5 million more. At its fastest, it completes in about half a millisecond -- great! At worst, though, it takes over a minute, even almost two minutes. The means are decently low, but they're means and it's hard to tell just how many longer-running outliers are contributing to its drift.

All is revealed by the standard deviation, which is quite low in both cases. Most syncs happen within about a tenth of a second of the mean, which is itself closer to a hundredth of a second. 99.7% of timings should fall within three standard deviations, assuming a normal distribution, and an execution time over a second represents between nine and thirteen standard deviations from the mean. If I'm statisticking correctly, this means that out of the 12.5 million samples, the only timings over a second are almost certainly just the two known maxes. It's still not exactly wonderful that the statement can run a minute and a half when the stars align, but if you have a high max with a low min, mean, and deviation, the statement you're looking at isn't the problem.
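The arithmetic behind that range can be checked straight from the numbers in the table above. This is only a back-of-the-envelope sketch, with the means and deviations (in milliseconds) typed in as literals rather than read from pg_stat_statements:

```sql
-- how many standard deviations above the mean is a one-second (1000ms) execution?
select
  mean_exec_time,
  stddev_exec_time,
  round((1000 - mean_exec_time) / stddev_exec_time, 1) as stddevs_above_mean
from (
  values
    (9.720, 115.918),  -- first row of the timings table
    (15.650, 77.713)   -- second row
) as t (mean_exec_time, stddev_exec_time);
```

That comes out to roughly 8.5 and 12.7, which round up to the nine and thirteen quoted above.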

I don't know for sure what it is that got in the way. My chief suspect is a database function that adds dataset memberships to multiple records at a time, or its counterpart that removes them, both further down my top-20 list. Clients were initially configured to call these with batches of 25,000 records, which quickly exceeded the default 4MB work_mem and would churn for the better part of a minute at minimum. Modified records would all have had foreign keys to accounts -- forcing the sync daemon's changes to wait. Dataset membership management fits the "stars aligning" usage profile as well since mass changes like that aren't yet happening every day. With work_mem adjusted to 16MB, those functions have sped up dramatically, and I haven't noticed any other suspicious timings since.

I did split up the statement, since both accounts and datasets are quite high-traffic tables, the former being targeted by foreign keys all over the place and the latter governing row-level security on several other tables. Millions of syncs after the change, only the accounts insert has ever gone long, for significantly less time than the earlier outliers, and also probably only once. It's also faster and more consistent than the triple insert as might be expected, with a standard deviation of 11ms over practically nothing.

calls	min_exec_time	mean_exec_time	max_exec_time	stddev_exec_time
4,887,878	0.026	0.091	22,436.409	10.942

Here's my "leaderboard" query:

select
    userid::regrole::text,
    calls,
    min_exec_time,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    query
from pg_stat_statements
where calls > 100 and max_exec_time > 10000
order by round(calls, -2) desc, round(mean_exec_time::numeric, -2) desc, stddev_exec_time asc
limit 20;

Some Notes on ZSH Arrays

Tue, 02 May 2023 00:00:00 GMT

Here is a summary of the rules for substitution; this assumes that braces are present around the substitution, i.e. ${...}. Some particular examples are given below. Note that the Zsh Development Group accepts no responsibility for any brain damage which may occur during the reading of the following rules.

I'm doing inadvisably complicated things with zsh again; you'll need this to use it as well, if you dare. More on that in due course. What I'm here to write about now is the zsh part, and the parts of that part (die sich das Licht gebar) that were a struggle to get right, even with a quite useful cheatsheet.

This is a six-element zsh array, extracted from its natural habitat in a function (note the local):

local MYARRAY=("alpha beta" gamma delta gamma "epsilon" "alpha beta")

1. Deduplication

This turned out to be easy.

typeset -U MYARRAY

Done and dusted. It's also possible with a parameter expansion flag, though. Sometimes.

echo ${(u)MYARRAY} # alpha beta gamma delta epsilon
echo "${(u)MYARRAY}" # alpha beta gamma delta gamma epsilon alpha beta

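See, outside a string ${} expands an array element by element, so flags like u have something to work with. Inside double quotes it's still parameter expansion, but the array gets joined into a single string before the flag is applied, and a single string has nothing to deduplicate. Adding the @ flag, as in "${(@u)MYARRAY}", keeps the elements separate inside quotes.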

2. Passing Arrays to Functions

function otherfunction() {
  # local ARR=???
  echo "${#ARR} elements in $ARR[@]" # print count and contents
}

function main() {
  ....
  otherfunction MYARRAY
}

main

Okay, remember: the parentheses in the function signature are a total red herring, and arguments are numbered. Let's try filling in that blank the simplest possible way:

  local ARR=$1 # 7 elements in MYARRAY

Nope, that passed the variable name in as a string. We've got to use parameter expansion, specifically the P flag to interpret the value as a parameter name and the A flag to indicate it's an array. Take two:

  local ARR=${(PA)1} # 30 elements in alpha beta gamma delta epsilon

Well, we have the expected contents, but it's also obviously a string: 30 elements! The secret is to reconstitute the array into an array:

  local ARR=(${(P)1}) # 4 elements in alpha beta gamma delta epsilon

Success! The A flag can be included or not -- it makes no difference whatsoever.

3. Also, Watch Your Scopes

for TARGET in "${MYARRAY[@]}"; do
  if [ -n "$TARGET" ]; then echo "$TARGET is real!"; fi
done

If TARGET already contains a value you get a free spin through the loop that you probably don't want!

PGSQL Phriday #007: The Art of the Trigger

Fri, 07 Apr 2023 00:00:00 GMT

It's triggers this time! I've said it before and I'll say it again: if you need to compute, do it as close to your data as you can get away with. But programmed databases, and especially programmed databases that use triggers to encode automatic behaviors and responses, are infamously hard to understand, and the more programmed the more difficult. Why is this, and what can we do about it?

Trigger utility is limited first by the limits of database procedural languages. The other PLs like Python or JavaScript can't touch anything PL/pgSQL can't (it bears mentioning here: there's more than OLD and NEW! TG_OP, TG_TABLE_NAME, and TG_ARGV in particular) and are useful because they can express complex and specific manipulations in algorithmic instead of relational-calculus terms. Higher-level abstractions are not available to database functions in general unless built to that purpose, in database procedural languages, which is when I start feeling compelled to apologize to code reviewers in advance.
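As a quick illustration of those extra variables, here is a sketch of one generic audit function that could be shared between tables. The shipments and audit_log tables, the log_change function, and the 'inventory' label are all hypothetical names invented for the example, and execute function assumes Postgres 11 or later.

```sql
-- hypothetical tables, invented for this sketch
create table shipments (id integer primary key, destination text);
create table audit_log (
  logged_at timestamptz not null default now(),
  label text,
  table_name text,
  operation text,
  row_data jsonb
);

create or replace function log_change() returns trigger as $log_change$
declare
  changed jsonb;
begin
  -- TG_OP is 'INSERT', 'UPDATE', or 'DELETE'; only OLD is populated for deletes
  if TG_OP = 'DELETE' then
    changed := to_jsonb(OLD);
  else
    changed := to_jsonb(NEW);
  end if;

  -- TG_ARGV holds the arguments from the create trigger statement,
  -- TG_TABLE_NAME the table that fired the trigger
  insert into audit_log (label, table_name, operation, row_data)
  values (TG_ARGV[0], TG_TABLE_NAME, TG_OP, changed);

  return null; -- the return value of an after row trigger is ignored
end
$log_change$ language plpgsql;

create trigger shipments_audit
after insert or update or delete on shipments
for each row execute function log_change('inventory');
```

Because nothing in the function names shipments directly, the same function can be attached to other tables with a different label argument.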

The real limits, though, aren't purely technological. All things are possible with a Turing complete language and sufficient patience. But let's say we're adequately funded with all the time in the world, have a trusted and capable DBA at the helm, and they've judged that encoding the processes under consideration into the database will save our organization money and simplify our infrastructure. Someone in the room is going to be nervous, and it's not infrequently the DBA: why?

Any successful automation, mechanical or virtual, changes the structure and politics (but I repeat myself) of an organization, absorbing money, risks, responsibilities, jobs, entire professions, and reorganizing them into new, more efficient or more specialized forms; these projects only fail insofar as they do not take over operational territory. That's reason enough for nerves right there. Database automation in particular, though, is notably arcane and access to it is strictly controlled for very good reasons.

Other virtual automations are invisible compared to the mechanical sort, but they at least tend to have names: the such-and-such datafeed ETL, the new-member flow, the delivery queue. In a healthy organization, those names are backed up by teams or at least by relatively well-defined responsibilities. They have a recognizable surface area which can be examined or interacted with. People know when a given ETL job has crashed, they can often see exactly why (whether or not they can use that information), and they usually know whom to call.

The names of database-internal programs, by contrast, are invisible to the uninitiated. Experts can locate and analyze them, but from outside they inhabit The Database, an undifferentiated and undifferentiable space bordering every other territory on the organization's operational map. Responsibility for database programs is often more diffuse but is also harder to identify in the first place. Effects are visible, their causes are not. After The Database takes over a new operational area, both those previously responsible and others across the organization can no longer see what's going on. If any other department worked this way it'd be a sign of major dysfunction, but again: very good reasons.

And triggers are the acme of database programming. When the new-member flow becomes an after insert trigger and a series of database functions, this is in a very real sense the database encroaching on other operational demesnes. For the good of all, naturally: if much of the initial processing of new members can be made to happen in the database, with perhaps the necessary external data sources connected through foreign data wrappers, everyone's happier! Signups are much faster for members. The team currently responsible for setting the latest introductory rate every so often can devolve that to the database team, or even help design a self-service rate lever for the business people, and move on permanently. Ops can even take a node or two off the infrastructure-that-needs-watching graph.

But it also makes the signup process more opaque to everyone else. Downstream dependents are less able to reason about what is happening or has happened, and while the subsumption of the process into the database hopefully gives those dependents less cause to wonder than they used to have, it can't eliminate that need completely. "What happens during signup" is less knowable, less memorable, and less perceivable to the rest of the organization. That's also cause for concern: is encoding our institutional knowledge into this self-governing black box worth what we gain from computing close to the data? Will we be going all the way back to the drawing board if an acquisition or regulation or sheer signup volume forces us to store and process new members differently? Will we become uncertain about the results and ramifications of the encoded processes as they're performed internally? Will we be able to implement changes or respond to problems with appropriate efficiency?

Only experience can tell us whether our programmed-database strategy will be worth the sacrifices we make for speed and simplicity. Each automation project is unique, but there are common workflow adjustments and technical solutions which help improve the odds of success. Our goals on this tactical level are to speed up development and test feedback loops, keep implementors' options open in the face of unforeseen obstacles, and demystify database automation for everyone else who works with it.

priorities

Databases change more slowly than do their client programs. New or external processes moving into the database should be as completely defined as possible to avoid flurries of updates as requirements continue to evolve or edge cases and bugs are squashed. It's usually better to give young processes time to stabilize before incorporating them, just like it's less work in aggregate to refine queries embedded in client code before turning them into views.

fast iteration

Databases change more slowly than client programs, but during active development the latter change on the scale of seconds. Development databases need to be as close behind that as possible. It should be fast to stand up a clean schema from scratch, faster to reapply changes as implementation progresses.

When I'm writing triggers and functions, I'll often revise them directly in psql, making heavy use of conveniences like \ef. Once I'm happy with the result I'll "canonize" the final code in the schema migration I'm working on. This works best with very focused changes; if the work spreads out to more than one table-trigger-function it's too easy to lose track of individual elements.

Migration frameworks that encourage idempotence, like graphile-migrate, also save a step compared to frameworks with an apply/revert model. In my day job we do a lot with create or replace this, if not exists that, and attempted changes in do blocks ignoring known exceptions:

do $maybe_create$begin
  create domain checked_text as text ....
  -- there's no `create domain if not exists`, so trap the exception if the domain already exists
  exception when duplicate_object then null;
end$maybe_create$;

debug

I have never used pldebugger and in fact didn't know it existed until this week. I'm not going to be able to install it on every server I need to debug, although I'm certainly going to try it where I can. Where I can't, raise warning will always have my back (notice is too polite: the default client_min_messages prints it, but the default log_min_messages is stricter). Want to see variable values? raise warning. Not sure which execution path it's heading down? raise warning. Is my complicated when predicate even satisfied? raise warning first thing into the function and find out.
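A sketch of what that instrumentation can look like, assuming a hypothetical subscriptions table and trigger invented for the example (and Postgres 11 or later for execute function):

```sql
-- hypothetical table, invented for this sketch
create table subscriptions (
  id integer primary key,
  is_active boolean not null default true
);

create or replace function check_deactivation() returns trigger as $check$
begin
  -- first thing in: did the when predicate let us through at all?
  raise warning 'check_deactivation fired: % on %', TG_OP, TG_TABLE_NAME;

  -- which values are we actually looking at?
  raise warning 'old is_active = %, new is_active = %', OLD.is_active, NEW.is_active;

  if not NEW.is_active then
    -- which execution path did we take?
    raise warning 'subscription % is being deactivated', NEW.id;
  end if;

  return NEW;
end
$check$ language plpgsql;

create trigger subscriptions_check_deactivation
before update on subscriptions
for each row
when (OLD.is_active is distinct from NEW.is_active)
execute function check_deactivation();
```

An update that flips is_active should print all three warnings in psql; an update that leaves it alone should print none, which answers the when question by itself.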

Sometimes if there's more data in play than I want to dig through in psql or logs I'll create a temporary (sensu lato) table and have my trigger function write interesting things to it, whereupon I can sort, filter, and the rest. This does only work as long as there are no fatal errors that would roll back the transaction.
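Something along these lines, as a rough sketch: signups, trigger_debug, and debug_signup are hypothetical names, and the "temporary" table here is an ordinary one that gets dropped once the debugging is done.

```sql
-- hypothetical tables, invented for this sketch
create table signups (id integer primary key, email text not null);
create table trigger_debug (
  -- clock_timestamp() keeps moving within a transaction, unlike now()
  logged_at timestamptz not null default clock_timestamp(),
  note text,
  payload jsonb
);

create or replace function debug_signup() returns trigger as $debug$
begin
  -- record whatever looks interesting; the whole row is a reasonable start
  insert into trigger_debug (note, payload)
  values (TG_OP || ' on ' || TG_TABLE_NAME, to_jsonb(NEW));

  return NEW;
end
$debug$ language plpgsql;

create trigger signups_debug
before insert or update on signups
for each row execute function debug_signup();

-- afterward: sort and filter like any other table
select logged_at, note, payload ->> 'email' as email
from trigger_debug
where payload ->> 'email' like '%@example.com'
order by logged_at;
```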

And speaking of, transactions are great for testing triggers faster, fully operational or not. Fire off your DML statement, inspect the outcome, and roll back ready to do the same exact thing all over again without having to worry about unique constraint collisions or other consequences of the new database state. I often try to get into loops like this in a dedicated testing psql session, modifying the function separately:

rollback; begin;⏎
↑↑⏎
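Spelled out, one pass through that loop might look like the following sketch. The new_members table and the insert are hypothetical; the point is only the shape, since everything between begin and rollback can be repeated verbatim.

```sql
-- hypothetical table, invented for this sketch
create table new_members (id integer primary key, email text unique not null);

begin;

-- the DML that fires whatever trigger is under development
insert into new_members (id, email) values (1, 'someone@example.com');

-- inspect the outcome
select * from new_members;

-- throw it all away: the same insert succeeds again next time around,
-- with no unique constraint collisions
rollback;
```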

test

↑↑⏎ in a REPL is almost an automated test already -- all it's missing is a way to assert and report things about the outcome without human intervention. Trigger development is easier with the ability to evaluate assertions about everything in the database at your fingertips, but more importantly, true automated tests are legible to others as well. Anyone can look at a sufficiently descriptive test output with "success" or "failure" printed next to it and understand instantly what it means without having to know SQL.

For this reason alone, pgTAP may be the best thing since TOAST.

It's important to do two things with pgTAP tests: first, make sure they describe themselves adequately in their real context. Many checks are completely self-explanatory already, especially the "schema things" like has_table and policy_roles_are. Others, such as lives_ok and results_eq, usually want a note detailing exactly what just happened or why the comparison matters.
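A small sketch of the difference, assuming the pgTAP extension is installed and using a hypothetical members table and introductory rate invented for the example:

```sql
-- assumes pgTAP is available in this database: create extension pgtap;
begin;

-- hypothetical schema, invented for this sketch
create table members (
  id integer primary key,
  email text not null,
  rate numeric not null default 9.99
);

select plan(3);

-- a "schema thing": self-explanatory without a description
select has_table('members');

-- these two want a note on what just happened and why it matters
select lives_ok(
  $$insert into members (id, email) values (1, 'someone@example.com')$$,
  'a new member can sign up with only an email address'
);

select results_eq(
  $$select rate from members where id = 1$$,
  $$values (9.99)$$,
  'new members get the introductory rate by default'
);

select * from finish();
rollback;
```

The descriptions are what show up next to each ok or not ok line in the TAP output, so they are the part a non-SQL reader actually sees.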

Second, they need to be organized. The default TAP output is a list of files with status or error count, with the errors themselves included. The latter will be useless to external viewers, but it should be clear which major functional groups are being exercised and how they're doing. Splitting up test files also helps with state management. It's all too easy for tests to become implicitly dependent on writes made by previous tests, and innocently introducing a new one in between or reorganizing them can wreak havoc.

pgTAP does represent an extra logistical commitment! Integration tests (in that loose quasi-Bechdelian sense of "at least two programs talking to each other, and writing state to disk") or even well-honed manual test loops usually come first, depending on the purpose the database serves. Testing the whole system can tell you enough about the functioning of the database to get by initially. As the database becomes more extensively programmed, the debugging needs of external statements start to be outweighed by those of procedures and triggers, and there are enough of the latter as well that internal dependencies start to form and changes here can cause failures there. Any sufficiently internally complex subsystem benefits from testing in isolation, and the database is no exception.

After Massive

Sun, 19 Feb 2023 00:00:00 GMT

MassiveJS version 7 went places.

await db.select(
  db.libraries
    .join(db.holdings) // implicit join on foreign key holdings.library_id
    .join(db.books)    // implicit join on foreign key holdings.book_id
    .join(db.authors, db.$join.left, {[db.authors.$id]: db.books.$author_id})
    .filter({
      [db.libraries.$postcode]: '12345',
      [`${db.authors.$name} ilike`]: 'Lauren%Ipsum'
    })
    .project({
      $key: db.libraries.$id,
      $columns: [...db.libraries],
      authors: [{
        $key: db.authors.$id,
        $columns: [
          db.authors.$name,
          db.expr(
            `extract(year from age(coalesce(${db.authors.$death}, now()), ${db.authors.$birth}))`
          ).as('age')
        ],
        // notice `books` is a collection on authors, even though we join authors to books!
        books: [{
          $key: db.books.$id,
          $columns: [...db.books]
        }]
      }]
    })
);

It'd be stretching an ecological metaphor to say that the middle tier is being eaten, but GraphQL and the "app logic on the client" tendency in web development make a powerful combination. Together, they constitute a -- big, important, immediately useful -- local maximum on the software fitness landscape.

Of course, fitness one way comes at costs in others, and like any species of software system GraphQL backends are histories of decisions about what to make possible or impossible, simple or detailed, how to balance the correlated complexities of model and interface, fast good or cheap and all that. More important decisions may or may not be intentional but have in common that they exclude or foreclose ways of interacting with, here, your database and its contents. In a very roughly chronological order:

Classic object/relational mappers, including Hibernate and its kin but also and especially the ActiveRecord pattern, represent a choice to treat the database as a perfect, crystalline extrusion into time of the object graph and decisions on how best to patch over the resulting impedance mismatch. They also often hide or try to replace SQL and tend to target "lowest common denominator" database vendor compatibility.

Other data mappers and query builders, from MyBatis to Knex, identified a better corresponding structure to programmatic objects in the SQL statement, transforming those objects into parameters and from results, and made decisions about whether to generate, store, or construct statements and how.

There's an identifiable "query runner" tendency, projects like pg-promise, slonik, yesql, and aiosql, which offer more affordances than the plain database driver but ultimately decide the important thing is helping you write exactly the SQL you need. Everything before and after getting that hand-written SQL to the driver is best left up to you, even if it means you write your own boilerplate -- at least it's yours.

Finally-so-far, GraphQL backends like Postgraphile go all in on being an HTTP API for independent clients interacting statelessly, and minus a few caveats basically nail atomic create-retrieve-update-delete from that distance. Between database functions and custom resolvers, they can cover even quite complex data models and server-side logic as well, within the bounds of request and response.

The first category isn't dead by any means but its innate internal contradictions are well recognized; many examples of the second are a reaction to them, Massive included. What still unites the two tendencies is their competition on the territory of the web service, which must wane as that of the independent client application has waxed. Between GraphQL serving that use case so effectively, and query runners sufficing for cases that don't involve extensive manipulation of complex object graphs, the space for mappers of any stripe at least has not been getting much bigger, relatively speaking, in the past decade. A data access library of the older school therefore will have to do a lot more than CRUD to compete, or even to differentiate itself, on its traditional terrain. If it can be useful elsewhere too, so much the better.

Massive isn't, and can't be, that library.

"Make working with your data and your database as easy and intuitive as possible, then get out of your way" was and is a great mission statement, but the fact is Massive was largely built for simple CRUD. There's more to it, of course: full-text search, array and JSON field support, runtime document table generation, keyset pagination, sequence and matview management, but these are extras on a design rooted in intentionally chosen simplifications. Finding all fields by a criteria object goes a really long way!

Many of these extra ideas and tools Massive adds on top of that foundation, original and inherited alike, still point a useful way forward: abandoning compatibility to support Postgres in detail, using introspection to facilitate reasoning about and manipulating database objects directly, record schemata inferred from joins or declared as needed without the maintenance and synchronization burden of model classes, collapsing the distinctions between script files and database functions, and more. But it also includes a lot of decisions made for and in the very different context that entailed a decade ago, and for very different approaches to writing JavaScript as well (it antedates the Promise API!). Some of those decisions can't be grown past in a way that remains recognizably Massive.

For example:

An API surface of do-it-all functions like readable.find winds up with a fairly low complexity ceiling that can cover many to most common scenarios, but ultimately can't keep up with plenty of still fairly routine data access tasks that could benefit from dynamic construction in JavaScript.
Because a single function call has to convey everything from sort order to streaming to decomposition and beyond, all manner of functional and organizational purposes get crammed into options objects with little rhyme or reason. Some options are mutually exclusive; others contain arbitrarily complex nested objects and arrays.
Transaction clones are extremely heavyweight since they copy and substitute the dedicated connection across the entire database object tree.
CommonJS has become a dead end. I don't feel particularly strongly either way about the relative merits of CJS vs ESM, but I think it's better to pick one, and Node's use of CJS is the odd one out.

I started monstrous a few months ago, while working on my fourth or fifth really substantial project with Massive. I'd been finding its limitations harder and harder to ignore, and the many other options available didn't serve my goals either.

I do web stuff but I've no intention of trying to keep up with the Modern Frontend Stack. I support a Postgraphile API at my day job, and have only good things to say about it, but my day job is data architecture and Postgres wrangling on behalf of people who aren't me or even on the same team. GraphQL's a sensible choice there given the coordination and communication requirements in play, but my other projects don't have those pressures and constraints.

And I'm never going to write another model class again if I can help it, so that rules out almost everything in the first two categories. It's true Knex has always been around and doesn't force you to recapitulate your schema in classes, but if Knex organized my data model to the extent and in the direction I wanted, I'd already have been using it.

That leaves query runners, and if I'm going to use a query runner and maintain my own boilerplate -- well, that's kind of what this is, no?

I'd seen Penkala some time ago, and that in turn pointed to alf/bmg. If you're looking for something in Clojure or Ruby respectively you should check them out! The latter two implement a full relational algebra and translate it to the relational calculus of SQL, while Penkala extracts the core principle of composability from that approach -- something SQL has never done well. Other tools try to supply that missing piece, most commonly by supporting technically-separable subqueries, but few go as far as these two. However, I'm already locked in to writing JavaScript for my charmingly retro coupled frontends, so I default to writing it on the server as well.

monstrous takes after those two in emphasizing composability. Everything done to a relation is a contained transformation step: join specifies relation, type, and condition; filter, criteria; project, an output record shape. Each transformation yields a new joined or filtered or projected relation. You can attach any such derived relation to the database just as if it were an original table or view, and reference it in other joins or filters as a subquery.

Moreover, you can use the same relations in reads and writes. Possibly monstrous' most fundamental departure from Massive is the inversion of subject and verb, separating statement construction from execution. With Massive, you could pass a criteria object from a find into an update, although there aren't many reasons to. With monstrous, you can much more usefully select an attached relation here and update it there.

In short: still no models, but if a certain complex product is a common motif in your project, you can define it once and reuse it without repeating the same transformations every time it appears. Attached relations are akin to writable views that respect the object graphs you're working with in client code.

The construction-execution split also means that tasks and transactions, which in Massive deep clone the entire database structure to swap a dedicated connection into each attached relation, instead use a cheap, lightweight class comprising a dozen or so functions and practically no extra state.

For more, check out the readme and the tests!

As for Massive: it still exists, is still moderately popular going by weekly downloads, and even sees the odd issue or merge request. I'll continue to keep an eye on it into the near future, but I think it's developed about as much as it's going to; certainly I've developed it about as much as I'm going to. If there's interest from any extant contributors or users (email address is up top!) I'll see about spinning it out into its own group/organization and adding maintainers.

PGSQL Phriday #004: Scripting in the Industrial Age

Fri, 06 Jan 2023 00:00:00 GMT

While we are concentrating on our task, both our tools and our materials merge into one entity of perception which gives us feedback about state and progress of our work.... Software tools generalize our ways of handling aspects of the world around us; they organize our actions and condense them into gestures.

— Reinhard Budde & Heinz Züllighoven, Software Tools in a Programming Workshop

An internal combustion engine isn't a tool but a car can be, although it remains in the immediate material sense something else, a system demanding full bodily integration: you climb in, close the door, buckle your seatbelt, insert and turn the key or press the brake and ignition, move your hand from gearshift (tool for adjusting torque) to wheel (tool for sensing and adjusting orientation). You can do other things while driving, talk or listen or think, and choose how much of your attention and motion to divert to other tasks, some involving yet other tools. But driving itself represents an attentional and physical restriction of some variable but never-zero degree. The car is your pair of seven-league boots, a tool for working with time and distance, at the same time as it's a physical machine you're strapped into and which you successfully use only by altering your own thinking, perception, even your sense of your own mass, shape and size, velocity, and inertia.

Budde and Züllighoven on machines in this sort of activity-theoretical sense:

A machine is repeatable motion which is abstracted from its specific context and cast into construction.... [It] incorporates and reproduces the mechanical reproduction of human activities. It thus decontextualizes human activities.

You bring your tools to the work; you take your work to the machine. Machines may afford their operators the power of many tools and the speed of automation and sometimes parallelization, but also embody a fixity of purpose, an integrated way of conceiving and acting on the work materials that resists or forbids working in other ways. Tools and machines can even have intersecting domains: a pneumatic wrench is a machine component, fixed in place by its air hose, but does the same job as an ordinary manual wrench. When you have a sufficient number of things to loosen or tighten, other goals to accomplish related to the loosening or tightening, and/or reasons to guarantee a minimum torque, it's worth the extra effort of moving the work as well as the worker.

The classic example, which Budde and Züllighoven mention in passing on to a broader point about how automation makes humans and machines interchangeable, is the fixed station of an assembly-line worker, who may pick up and put down physical tools in the course of producing outputs from the inputs they're provided (this is, naturally, at the scale of individual human beings: the assembly line itself is a machine producing cars from the outputs of other materials processing machines, the auto company a machine generating capital from labor and extracting surplus value, the stock market a machine redirecting flows of capital, each hosting masses of people whose activity is governed and directed not by their own desires but by the social and physical construction of the line, the office, the trading floor). Integrated development environments are, like cars, a more ambiguous case. They're tools at the scale or in the context of software systems writ large, while in that of computer use they're machines you enter into that require nearly full mental or attentional integration to repeat the motions of type-build-test-package-run-debug operating its constituent smaller machines and tools.

While using software in our work, we wish to handle it like a tool; but while constructing it, we wish to design its parts like a machine.

In database work, Microsoft's SQL Server Management Studio illustrates this tension well. SSMS is an IDE for database administrators, architects, and analysts alike, and hews to about the same general outline as any other: here's your left-side tree with context menu upon context menu of tools for managing your tables and views and functions and roles, here to edit the definition, there to print out DDL; here's an SQL-interpreting machine, we'll put results or errors at the bottom. From the distance at which the DBA is forty person-hours and the database is a cylinder on an architecture diagram, SSMS itself is a tool for designing and detailing what that cylinder represents, but the activity of using it is machinic: you go to SSMS to perform database work. Before its release with SQL Server 2005, you would go to Enterprise Manager to define the schema and administer the server, or to Query Analyzer to write and run SQL. SSMS integrated both predecessor machines into a more consistent whole.

And SSMS was great! Even with one foot in application development I usually had both its predecessors open alongside Visual Studio anyway. The more interesting part of this historical digression is what resisted integration -- most notably ETL/ingestion suite SQL Server Integration Services (SSIS, née Data Transformation Services) and Profiler, which captures statement text and parameters on execution through a dizzying array of configurable filters.

Postgres doesn't have an SSIS. That's a good thing: even if the community wanted to support an official ETL machine, it's a bad direction to go for an open source project, with an unbounded and infinitely edge-cased panoply of input specifications. Controlling more of the backend can add value for commercial DBMSs, but for Postgres there's nothing to be gained. It's an interesting contrast with schema migration, where nobody has an official change management system, but that's getting beside the point.

Profiler, though, I miss almost every day. And although the distinction between tool and machine can be an especially slippery one for programs, it's more tool-like in use: it assists with whatever other thing you're doing that you need to peek at database activity rather than organizing tasks into a workflow, and you pick it up when you need it and put it away when you're done, or in other words, it's "ready to hand" rather than being a system you step into and operate. As an application developer it gave me instant insight into what I'd actually communicated to the database. Its filters were more customizable and more powerful than any reasonable grep invocation. Best of all, I could start and stop tracing without touching or knowing anything about the server's log settings or having SSH access.

pg_stat_statements of course exists, pgcenter has a top-style view that's of some use in tracking down frequent long-running statements, and EDB have an SQL profiler module that installs server-side, but as far as I know there's nothing even approaching a 1:1 equivalent client program.

A programming environment viewed as a workshop offers a set of tools, but does not implement an overall strategy of software development. However, it may be used to automate a selected set of familiar and routine activities (such as change management or compilation).

Users define working processes by drawing on their knowledge of tasks, materials and tools. The programming workshop "surrounds" the user with sets of tools and automata [machines with hidden internal processes that "appear as machines when in use"], each with its own specific application and suitability for a particular type of material.

Application developers use a lot of tools, but even when those tools don't come pre-integrated into an application development machine like an IDE, they build these machines for themselves anyway; the workshop is an environment which facilitates their design and construction ad hoc. Such machines might be distributed across multiple programs -- editor, shell, compiler, linker, debugger, version control -- each individually a tool or a smaller machine bringing tools together for a single purpose, connected and mechanized into an inhabitable whole in order to speed up and standardize the motions of software development: that is, the industrial production of software machines from other software machines. Developers use their meta-machines to combine machines for data access, machines for rendering text or graphics, machines for telling time or hashing strings or an infinite variety of other purposes into new machines that meet their own or their organization's goals.

And here, a thousand and some words in, I make it to the prompt. Database workers have plenty of machines at which we do our database-work, vast and comprehensive like SSMS or small and simple like psql, which repeats the motion read-evaluate-print and yields to external editing machines, source and destination machines connected through pipes, and tools like less or pspg when the user performs a different or a specialized task that isn't its core competency. pgTAP is another machine that exercises a database according to its input, a player piano that detects its own off notes. It's one of the few we have that connects to developers' meta-machines. Efforts to bridge the chasm from the other side have so far mostly resulted in pared-down implementations of the SSMS-type being bolted into their IDEs.

And database workers' tools?

Well, what are our tools? Profiler, there's one, SQL Server's virtual microscope. When it comes to Postgres, of course, we have to attach the pg_stat_statements machine to it or make do with SSH and grep, not database tools specifically. There are a smattering of mostly operations-focused tools like pgbackrest or postgresqlco.nf. Otherwise, we have SQL scripts: a script to calculate bloat, a script to check index statistics, a script to report outputs or patch up recurrent data quality issues or populate static tables.

We have so many of these tools it's difficult to keep track of them all.

We don't have a standard way to organize or remember or even name most of them.

We don't have a dedicated infrastructure to share and update and standardize them, except for one specific class of tool/machine: extensions have pgxn. And new extensions face an uphill climb to widespread adoption as more database workloads shift to cloud providers which allow them on a case-by-case basis.

Most of all we lack simple, well-defined ways for anyone else to use our tools without requiring them first to step out of their machines and into ours.

It reminds me a lot of (what I saw of) the state of *nix admin before most distros standardized on systemd. Linux had and has init daemons aplenty: SysVinit, upstart, runit, and more. Most of them orchestrate assemblages of more or less glorified shell scripts. The computer boots and starts process id 1, which in turn rummages around in /etc and runs anything that looks like it needs running, prioritized however the daemon prioritizes. Want to kick off some long-running service on boot? Write a shell script, season to your init daemon's taste, and drop it somewhere in /etc/init or /etc/rc.d. Every software vendor and every sysadmin seemed to have a slightly different approach to the infinite possibilities of upstart's script block or SysV full stop. Every boot rebuilt the runtime configuration -- the operating system machine -- from scratch by the automatic application of heterogeneous tool after heterogeneous tool. More than once I found myself in the shoes of the broomstick-multiplying sorcerer's apprentice of the poem as my adjustments went horribly awry.

These init systems, not unlike psql, are small, simple machines which defer to external tools wherever possible. systemd, meanwhile, integrated several other machines and components like login, networking, logging, and cron into a relatively maximalist operating system orchestrator. It restricted the infinite customizability of init scripts and more or less unified those several disparate ways of working, sacrificing "do one thing well" for "do many common things ~consistently". This, naturally, cuts in several directions, but from my perspective as an occasional or dilettante sysadmin it's been a huge improvement even only on grounds that my knowledge of service management and troubleshooting on Arch carries over to Ubuntu or Fedora out of the box. Instead of learning how to hand-assemble this particular Rolls, I can drive off the lot right away in a more basic car and get to my own goals immediately.

It's those goals that determine the contents of my SQL toolbox, same as with everyone else. The tools in it are not all created equal; some are inevitably too tightly bound to the specific context they originate from to justify adding them to the standard kit. But in other situations, having one decent answer is better than having five great answers, and some tools can usefully be mechanized, standardized, centralized. The trick is identifying them: which ones help database workers avoid reinventing wheels and integrate easily and usefully into other workers' machines?

Some of the tools I've written:

having thrown out too many brand new and already outdated entity-relationship diagrams ever to want to draw another one, I used Graphviz and an image-capable terminal to explore the foreign key graph. There's a similar script that analyzes view dependencies, but fks.zsh is the star of that show. As shell functions, they're available anywhere (there's that readiness-to-hand again), show the view from whatever vantage point you select, and get out of the way.
also in zsh, an autocompleting SQL file runner over a directory of scripts organized by database. This came in especially useful when I dealt with a lot of database dumps with other environments' security and FDW settings: bake the alter statements into a script one time, then sql dbname post-restore.sql forever after. I use SyncThing to keep scripts consistent across computers.
while dealing with the pain of multiple codebases interacting over multiple evolving database schemas, I developed an automated build module that indexes data access code and checks the blast radius of migration scripts across the entire organization. ectomigo is more a machine -- it centralizes the repeated motions of syntax analysis and comparison for individual connected repositories -- but itself connects to machines developers already use via review comments.
and I've built a few pgTAP checks for work recently (validating things like object comments and row-level security status) that I should probably look at upstreaming in the new year.

I'd love for tools like pspg, pgsql-tweaks, or the scripts we've all copied out of the Akashic records, and machines like pgcenter to become more integrated into the psql or Postgres machines. Not necessarily in the strict software sense of sharing repo space or aligning to Postgres' own release cycle (e.g. pgsql-tweaks belongs in core, but compatibility beyond Postgres is a semi-explicit goal of pspg) -- smaller projects are much more nimble. But I think there could be a role for the Postgres social machine to play even for the really independent projects in its orbit. There's a lot of redundant work that has to happen, such as packaging for different distros and operating systems, that right now happens as each project's maintainers have time, awareness of the need, and the resources necessary to fulfill it. A centralizing strategy could eliminate or at least contain a lot of that redundancy and make useful tools and affordances much more widely available to database workers and downstream developers alike.

Hanukkah of Data 2022/5783

Mon, 26 Dec 2022 00:00:00 GMT

Eight days, eight data analysis puzzles, eight solutions. After working out the password I imported the SQLite database into Postgres the simplest possible way (with a couple of tweaks at the end of the giant sed replacer; items.array seems to have been replaced by the orders_items junction table):

createdb hanukkah
sqlite3 noahs.sqlite .dump | sed -e 's/INTEGER PRIMARY KEY AUTOINCREMENT/SERIAL PRIMARY KEY/g;s/PRAGMA foreign_keys=OFF;//;s/unsigned big int/BIGINT/g;s/UNSIGNED BIG INT/BIGINT/g;s/BIG INT/BIGINT/g;s/UNSIGNED INT(10)/BIGINT/g;s/BOOLEAN/SMALLINT/g;s/boolean/SMALLINT/g;s/UNSIGNED BIG INT/INTEGER/g;s/INT(3)/INT2/g;s/DATETIME/TIMESTAMP/g;s/desc text/description text/g;s/items array/items text/g' | psql hanukkah

The key fields in orders got turned into text somewhere along the line but that's easily fixed with alter table orders alter column x type int using x::int.

I also imposed the following completely arbitrary constraints on myself:

read only, no changing information or writing intermediary data.
produce exactly the target information, no extra rows or columns.
do it in a single DML statement (common table expressions and subqueries okay).

day one: beehive

This is a fun one! We represent the number:letter correspondence with a common table expression, unnest the customer's last name (all customers have only a first and last name, no variations) into another table-like object, then join our keypad-simulating CTE to find the one customer whose last name converted into a phone number is their phone number.

with keys (num, vals) as (
  values
    (2, string_to_array('abc',  null)), -- null delimiter splits each character
    (3, string_to_array('def',  null)),
    (4, string_to_array('ghi',  null)),
    (5, string_to_array('jkl',  null)),
    (6, string_to_array('mno',  null)),
    (7, string_to_array('pqrs', null)),
    (8, string_to_array('tuv',  null)),
    (9, string_to_array('wxyz', null))
)
select customers.phone
from customers
join lateral unnest(
  string_to_array(
    -- get just the last name; Postgres uses 1-based indexing for arrays
    (regexp_split_to_array(lower(customers.name), '\s'))[2],
    null
  )
-- `with ordinality` is exactly what it sounds like: tack a numeric index on,
-- which string_agg() can use to keep the individual letters sorted; order is
-- not otherwise guaranteed!
) with ordinality as namearr (v, i) on true
join keys on vals @> array[namearr.v]
group by customers.phone
having regexp_replace(customers.phone, '-', '', 'g') =
  string_agg(keys.num::text, '' order by namearr.i);
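
As a cross-check outside the database, the same keypad conversion can be sketched in a few lines of Python. This is only an illustration of the logic, not part of the original solution: the helper names are made up, and the two customer rows are invented rather than taken from the puzzle data.

```python
# Keypad letters by digit, mirroring the `keys` CTE.
KEYS = {
    2: "abc", 3: "def", 4: "ghi", 5: "jkl",
    6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz",
}
# Invert to a letter -> digit lookup.
DIGIT_FOR = {letter: str(num) for num, letters in KEYS.items() for letter in letters}


def name_to_digits(name: str) -> str:
    """Convert the last name (second whitespace-separated word) to keypad digits."""
    last_name = name.lower().split()[1]
    # Characters with no key (apostrophes, hyphens) are dropped, just as the
    # inner join against `keys` would drop them.
    return "".join(DIGIT_FOR[ch] for ch in last_name if ch in DIGIT_FOR)


def matches(name: str, phone: str) -> bool:
    """True when the last name spells out the phone number."""
    return name_to_digits(name) == phone.replace("-", "")


# Hypothetical sample rows, not real puzzle data.
customers = [
    ("Alex Example", "555-010-0000"),
    ("Pat Tannenbaum", "826-636-2286"),
]
hits = [phone for name, phone in customers if matches(name, phone)]
```

The dictionary lookup plays the part of the join on `vals @> array[namearr.v]`, and Python's ordered string iteration stands in for `with ordinality`.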

day two: snail

Noah's is not selling enough coffee to be worth the effort involved, and this makes those who do order it easily findable with just a couple other dimensions to search on.

select c.phone from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.name like 'J% D%'
  and extract (year from ordered) = 2017
  and p.description ilike 'coffee,%';

day three: spider

Another "three clues, three predicates, one result" puzzle; no need even to check for orders having occurred more recently.

select phone
from customers
-- subtracting two from the year lines us up with the zodiacal dog; other animal
-- years won't divide evenly by 12
where ((extract(year from birthdate::date) - 2) / 12)::int =
       (extract(year from birthdate::date) - 2) / 12
  and to_char(birthdate::timestamptz, 'MMDD')::int between 0320 and 0420
  and citystatezip = 'South Ozone Park, NY 11420';
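
The year arithmetic is easy to sanity-check outside SQL. A minimal sketch, assuming only that Gregorian years are mapped to animals by offset (the lunar new year falls in January or February, so the offset alone is safe for the March-April birthdays this query keeps); the function names are illustrative.

```python
def is_dog_year(year: int) -> bool:
    # Shifting by two makes dog years exact multiples of twelve.
    return (year - 2) % 12 == 0


def in_birthday_window(month: int, day: int) -> bool:
    # Same comparison as the to_char(..., 'MMDD')::int predicate: the SQL
    # literals 0320 and 0420 are just the integers 320 and 420.
    return 320 <= month * 100 + day <= 420


dog_years = [year for year in range(1930, 2020) if is_dog_year(year)]
```

The modulo test is equivalent to the query's "does the division come out whole" comparison.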

day four: owl

Some refining of predicates involved in this one but it's still pretty straightforward to solve after a quick peek at the products table to find out how sku prefixes work: there are two people who've bought bakery items between 4 and 5 am ever, and only one of them makes a habit of it.

select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
where oi.sku ilike 'bky%'
  and numrange(4, 5, '[)') @> extract(hour from o.ordered)
  and numrange(4, 5, '[)') @> extract(hour from o.shipped)
group by c.phone
order by count(*) desc
limit 1;

day five: koala

Only one person has ever bought cat food more than one time, so we could use having and omit the limit entirely (we could also have done this yesterday), but someone might make a repeat purchase tomorrow so order-limit is a more reliable solution.

select phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.citystatezip ilike 'queens village%'
  and oi.sku ilike 'pet%'
  and p.description ilike '%cat%'
group by phone
-- count number of orders, not number of items bought
order by count(distinct o.orderid) desc
limit 1;

day six: squirrel

This one was far and away my worst score (20 attempts over four hours from opening the puzzle, although I probably only spent somewhere between one and two of those hours actually trying to solve it) because I got complacent and didn't think through computing savings. I initially tested order price vs wholesale price, i.e. margin, and went up a blind alley involving window functions trying to detect changes in order behavior. When I subtracted paid price from the maximum ever paid for each product I got an unambiguous result: one person has lifetime savings greater than their spending.

with max_prices as (
  select p.sku, max(oi.unit_price) as price
  from products as p
  join orders_items as oi using (sku)
  group by p.sku
)
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join max_prices as p using (sku)
group by c.customerid, c.name, c.phone
-- the standard maximum price * quantity is what _would_ have been paid without
-- any discounts or coupons
having sum(p.price * oi.qty - oi.unit_price * oi.qty) > sum(oi.unit_price * oi.qty);

day seven: toucan

Self-joining orders within a reasonable time window and filtering for different skus with similar descriptions (colors are always parenthesized) yields one match to an order from the customer in the previous puzzle.

select c.phone
from orders as o1
join orders_items as oi1 using (orderid)
join products as p1 using (sku)
join orders as o2
  on date_trunc('day', o2.ordered) = date_trunc('day', o1.ordered)
  and o2.ordered between o1.ordered - interval '1 hour' and o1.ordered + interval '1 hour'
  and o2.customerid <> o1.customerid
join orders_items as oi2 on oi2.orderid = o2.orderid
join products as p2 on p2.sku = oi2.sku
join customers as c on c.customerid = o2.customerid
where o1.customerid = 8342
  and p1.sku <> p2.sku
  and regexp_replace(p1.description, '\([^)]+\)', '') =
      regexp_replace(p2.description, '\([^)]+\)', '');
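
The description comparison hinges on stripping the parenthesized color before testing equality. Here is a small Python sketch of the same idea; the product descriptions are hypothetical stand-ins, not rows from the puzzle database.

```python
import re

# Matches one parenthesized run such as "(azure)".
COLOR = re.compile(r"\([^)]+\)")


def without_color(description: str) -> str:
    # count=1 mirrors regexp_replace without the 'g' flag: first match only.
    return COLOR.sub("", description, count=1)


# Hypothetical product descriptions for illustration.
a = without_color("Noah's Poster (azure)")
b = without_color("Noah's Poster (red)")
c = without_color("Noah's Jersey (red)")
same_item_different_color = a == b
different_item = a != c
```

Stripping leaves a trailing space behind, which is harmless here because both sides of the comparison are stripped the same way.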

day eight: snake

Another simple slicing problem to wrap it up: join everything in, filter for product descriptions, count, grab the highest.

select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where p.description ilike 'noah%'
group by c.name, c.phone
order by count(*) desc
limit 1;

retrospectively

I had fun! Most of the puzzles wound up being much more straightforward than I'd hoped, but then it's tough to come up with a reasonable challenge at the novice to intermediate level that isn't rendered trivial by expertise with a tool purpose-built for exactly this kind of information work. Other people are tackling this with VisiData, Excel, jq, or whatever else (I both want to see and absolutely do not want to solve day one in jq). That first puzzle set the bar super high, though, and variations on your basic join-where-sort-limit query had a really tough time following it. Honorable mention to days six and seven; it feels like on puzzle definition alone the smoothest difficulty curve in SQL would've been something like 2-8-3-4-5-7-6-1.

ectomigo: Safer Schema Migrations

Tue, 29 Mar 2022 00:00:00 GMT

The team I work with at my day job maintains many applications and processes interacting across a smaller number of databases. This is hardly exceptional. We are also constantly adding, subtracting, and refining not only the client programs but also the database schemas themselves. This too is hardly exceptional: business requirements change, external systems expose new information and deprecate old interfaces, von Moltke's Law ("no plan of operations remains certain once the armies have met") comes calling. Every now and again we just make a modeling or implementation mistake that manages to sneak through review and up to production. Sic semper startups.

So our database schemas are continually evolving. Each of those many applications and processes has to evolve along with them, or we get paged when the renamed column or dropped table breaks something we hadn't accounted for, and instant breakage is the best case. We've had schema incompatibilities lie in wait for over a month to catch us completely flatfooted. The complexities of even a single moderately-sized codebase are beyond the grasp of human memory. What hope do we have of recalling which relevant subset of database interactions appear where across two or ten or more?

What we need is a distinctly inhuman memory, one for which summoning up each and every reference to a changing table or view takes a moment's effort, and which cannot forget. A memory which operates at the level of the organization, rather than that of the project or of the individual developer/reviewer, who can only focus on a single target at a time. A memory we can consult when, or better yet before, code is ready to deploy -- "shifting left", as they say.

We need a database.

I built one.

ectomigo is a continuous integration module (initially a GitHub action) which parses your source files using tree-sitter to find data access code: SQL scripts and inline SQL in Java, JavaScript, and Python; MassiveJS calls; SQLAlchemy definitions; and more languages, data access patterns, analysis features, and platform support on the way after launch. Everything it finds it indexes, storing database object names and the file row-column positions of each reference.

When you submit schema changes for review, it parses that code as well, and matches the targets you're altering or dropping against every codebase your organization has enabled. If it does find any matches -- in other words, you still have live references to an affected database object, in this or another repository -- it leaves review comments alerting you to each potential problem.
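
The index-and-match idea can be illustrated with a toy model. To be clear, this is not ectomigo's implementation, which finds references by parsing source with tree-sitter; every name, path, and position below is hypothetical and entered by hand.

```python
from collections import defaultdict
from typing import NamedTuple


class Reference(NamedTuple):
    repo: str
    path: str
    row: int
    column: int


# Index: database object name -> every place it is referenced.
index = defaultdict(list)


def record(name: str, ref: Reference) -> None:
    index[name.lower()].append(ref)


# Hypothetical references, as if discovered while parsing two repositories.
record("orders", Reference("billing", "src/invoice.py", 42, 17))
record("orders", Reference("storefront", "lib/cart.js", 8, 23))
record("customers", Reference("storefront", "lib/account.js", 15, 9))


def affected(targets):
    """Return the live references to each altered or dropped object."""
    return {t: index[t.lower()] for t in targets if index.get(t.lower())}


# A migration that drops `orders` plus a table nothing references.
report = affected(["orders", "legacy_audit"])
```

Each entry in `report` corresponds to a place where a review comment would be warranted; objects with no recorded references drop out silently.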

ectomigo is launching on GitHub free for public and up to two private projects, with pricing available beyond that. The action code and the core code analysis library it integrates are open under the AGPL should you be interested in that.

We've been using early ectomigo builds at my workplace for a couple of months now, and it's already saved our bacon a few times with reports on database object usage in places we'd forgotten. If you're faced with migration risk yourself, I hope it can help you.

Exploring Databases Visually

Sun, 04 Apr 2021 00:00:00 GMT

In "things you can do with a terminal emulator that renders images":

One way to look at a database's structure is as a graph of foreign key relationships among tables. Two styles of visual representation predominate: models or entity-relationship diagrams (ERDs) created as part of requirements negotiation and design, and descriptive diagrams of an extant database. The former are drawn by hand on a whiteboard or in diagramming software; the latter are often generated by database management tools with some manual cleanup and organization. Both styles usually take the complete database as their object, and whether descriptive or prescriptive, their role in the software development process is as reference material, or documentation.

Documentation isn't disposable. Even though these diagrams are out of date practically as soon as they're saved off, they take effort to make, or at least to make legible -- automated tools are only so good at layout, especially as table and relationship counts grow. That effort isn't lightly discarded, and anyway a diagram that's still mostly accurate remains a useful reference.

Documentation isn't disposable. But documentation isn't the only tool we have for orienting ourselves in a system: we can also explore, view the system in parts and from different angles, follow individual paths through the model from concept to concept. Exploration depends on adopting a partial, mobile perspective from the inside of the model, with rapid feedback and enough context to navigate but not so much as to be overwhelmed. The view from a single point is more or less important depending on the point itself, but in order to facilitate exploration that view has to be generated and discarded on demand. Look, move, look, move.

This is a partial perspective of the pagila sample database, from the table film:

It's generated by this fks zsh function which queries Postgres' catalog of foreign keys using a recursive common table expression to identify and visualize everything connected in a straight line to the target. The query output is passed to the Graphviz suite's dot with a template, rendered to png, and the png displayed with wezterm imgcat. No files are created or harmed at any point in the process.

Why only a straight line, though? The graph above has obvious gaps: film_actor implies an actor, and film_category its own table on the other side of the junction. inventory probably wants a store, and rental and the payment tables aren't much use without a customer. The view from rental is markedly different, with half a dozen tables that weren't visible at all from film:

This graph is familiar in part: there's rental itself, the payment tables, inventory, film -- the last shorn of the junctions to the still-missing actor and category tables. Those have passed around a metaphorical corner, since in order to get from rental to film_actor you must travel first up foreign keys into film (via rental.inventory_id and inventory.film_id), then down by way of film_actor.film_id. language, meanwhile, is "upwards" of film and therefore remains visible from rental.

The reason fks restricts its search to straight lines from the target table is to keep context narrow. You can get a fuller picture of the table structure by navigating and viewing the graph from multiple perspectives; what fks shows is the set of tables which can affect the target, or which will be affected by changes in the target. If you delete a store or a film, rentals from that store or of that film are invalidated (and, unless the intermediary foreign keys are set to cascade, the delete fails). But deleting a film_actor has nothing to do with rental, and vice versa.
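
The straight-line rule can be sketched as two separate walks over the foreign key graph, one strictly upward and one strictly downward. This is an illustration in Python rather than the recursive CTE fks actually uses, and the edge list is a hand-picked, simplified subset of pagila's relationships.

```python
# (child table, parent table): the child holds a foreign key to the parent.
# A simplified subset of pagila, written out by hand for illustration.
FOREIGN_KEYS = [
    ("film", "language"),
    ("film_actor", "film"),
    ("film_actor", "actor"),
    ("film_category", "film"),
    ("film_category", "category"),
    ("inventory", "film"),
    ("inventory", "store"),
    ("rental", "inventory"),
    ("rental", "customer"),
    ("payment", "rental"),
    ("payment", "customer"),
    ("customer", "store"),
]


def walk(start, step):
    """Collect every table reachable from `start` by repeating `step`."""
    seen, frontier = set(), [start]
    while frontier:
        table = frontier.pop()
        for nxt in step(table):
            if nxt not in seen:  # also guards against foreign key cycles
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def parents(table):
    return [p for c, p in FOREIGN_KEYS if c == table]


def children(table):
    return [c for c, p in FOREIGN_KEYS if p == table]


def straight_line_view(target):
    # Walk only upward and only downward; never turn a corner.
    return {target} | walk(target, parents) | walk(target, children)


from_film = straight_line_view("film")
from_rental = straight_line_view("rental")
```

Because neither walk ever reverses direction, `film_actor` is absent from the rental view even though `film` is present, and `actor` never appears from `film`.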

There's an actual, serious problem with unrestricted traversal, too. If you recurse through all relationships, you wind up mapping entire subgraphs, or clusters of related tables. And clusters grow quickly. Stuart Kauffman has a great illustration of the principle in his book At Home in the Universe: The Search for the Laws of Self-Organization and Complexity.

Imagine 10,000 buttons scattered on a hardwood floor. Randomly choose two buttons and connect them with a thread. Now put this pair down and randomly choose two more buttons, pick them up, and connect them with a thread. As you continue to do this, at first you will almost certainly pick up buttons that you have not picked up before. After a while, however, you are more likely to pick at random a pair of buttons and find that you have already chosen one of the pair. So when you tie a thread between the two newly chosen buttons, you will find three buttons tied together. In short, as you continue to choose random pairs of buttons to connect with a thread, after a while the buttons start becoming interconnected into larger clusters.

When the ratio of threads to buttons, or relationships to tables, passes 0.5, there's a phase transition. Enough clusters exist that the next thread or relationship will likely connect one cluster to another, and the next, and the next. A supercluster emerges, nearly the size of the entire relationship graph. We can see what the relationship:table ratio looks like in a database by querying the system catalogs:

WITH tbls AS (
  SELECT count(*) AS num FROM information_schema.tables
  WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
), fks AS (
  SELECT count(*) AS num FROM pg_constraint WHERE contype = 'f'
)
SELECT fks.num AS f, tbls.num AS t, fks.num::decimal / tbls.num AS r
FROM tbls CROSS JOIN fks;

The lowest ratio I have in a real working database is 0.56, and it's a small one, with f=14 and t=25. Others range from 0.61 (f=78, t=126) all the way up to 1.96 (f=2171, t=1107 thanks to a heavily partitioned table with multiple foreign keys); pagila itself is in the middle at 1.08 (f=27, t=25). I don't have enough data to back this up, but I think it's reasonable to expect that the number of relationships tends to increase faster than the number of tables. Without restrictions on traversal, you might as well draw a regular ERD: superclusters are inevitable.
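Kauffman's buttons and threads are easy to simulate, which gives a feel for how abruptly the supercluster shows up. The sketch below is an illustration under stated assumptions rather than anything from the book or from fks: it connects uniformly random pairs (real foreign keys are anything but random), tracks clusters with a union-find structure, and uses a small seeded generator so runs are repeatable. All function names are made up for the example.

```javascript
// Small deterministic PRNG (mulberry32) so results are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Tie `threads` random pairs of buttons together and report what
// fraction of all buttons ends up in the largest cluster.
function largestClusterFraction(buttons, threads, random) {
  const parent = Array.from({ length: buttons }, (_, i) => i);
  const size = new Array(buttons).fill(1);
  const find = (x) => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  let largest = 1;
  for (let i = 0; i < threads; i++) {
    const a = Math.floor(random() * buttons);
    let b = Math.floor(random() * buttons);
    while (b === a) b = Math.floor(random() * buttons);
    let rootA = find(a);
    let rootB = find(b);
    if (rootA === rootB) continue; // already in the same cluster
    if (size[rootA] < size[rootB]) [rootA, rootB] = [rootB, rootA];
    parent[rootB] = rootA; // merge the smaller cluster into the larger
    size[rootA] += size[rootB];
    if (size[rootA] > largest) largest = size[rootA];
  }
  return largest / buttons;
}

const random = mulberry32(42);
for (const ratio of [0.25, 0.5, 0.75, 1.0]) {
  const fraction = largestClusterFraction(10000, Math.round(ratio * 10000), random);
  console.log(`threads/buttons = ${ratio}: largest cluster holds ${(fraction * 100).toFixed(1)}% of buttons`);
}
```

Below a ratio of 0.5 the largest cluster stays a tiny sliver of the whole; past it, the largest cluster quickly takes in most of the buttons.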

fks will draw a regular ERD if passed only the database name, but like I said earlier, automated tools are only so good at layout (and in a terminal of limited width, even a smallish database is liable to produce an illegibly zoomed-out model). With no way to add universal render hints, Graphviz does a lot better with the smaller, more restricted graphs from local perspectives inside the database -- and so do humans. Reading a full-scale data model is hard! Tens or hundreds of nodes have to be sorted by relevance to the problem at hand; nodes and relationships which matter have to be mapped, the irrelevant actively ignored, others tagged with a mental question mark. Often a given problem involves more relevant entities than the human mind can track unaided. fks doesn't resolve the issue completely, but making a database spatial and navigating that space visually goes some way to meet our limitations and those of our tools.

Extra-fuzzy History Searching with Mnem

Thu, 17 Sep 2020 00:00:00 GMT

Update: mcfly already existed, taking a slightly different approach (a neural network instead of structural analysis). Its only real downside was a lack of fuzzy searching, so I added that there. Use mcfly instead!

I use a lot of Rust command-line tools: ripgrep, fd, dust, and more. So when I had my own idea for a better command-line mousetrap, it seemed like the way to go.

Shells log the commands you enter to a history file. Bash has .bash_history, zsh uses .histfile. The EXTENDED_HISTORY option in the latter adds timestamps, but that's about as fancy as it gets. Both shells (and presumably others) also have "reverse search" functionality which lets you look backwards and forwards through it, one line at a time.

Functional! But not especially friendly. Only seeing one result at a time makes it difficult to evaluate multiple similar matches; matching is strictly linear, as you can see by my typos; and chronological order is only sometimes the most useful.

I do a lot with the AWS CLI, SaltStack, and other complicated command-line interfaces. I want to compare invocations to see how I've combined verbs and flags in the past, and for tasks I repeat just often enough to forget how to do them, sorting by overall frequency is more useful than sorting by time.

Enter Mnem (regrettably, I missed getting clio, the Muse of history, by a matter of weeks):

The idea is pretty simple: load the history file, and reduce every command to its syntactic structure. git commit -m "some message here" becomes git commit -m <val>; mv "hither" "thither" turns into mv <arg1> <arg2>. Many entries will have the same structure, especially if switches are sorted consistently, so counting up occurrences yields each structure's overall popularity.
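A toy version of that reduction, to show the shape of the idea. This is emphatically not Mnem's actual parser (Mnem is written in Rust and this sketch makes no attempt to mirror it): it assumes, for simplicity, that anything quoted is a value, that a quoted token right after a flag is that flag's value, and that unquoted words are verbs to keep as-is. `structureOf` and `rankStructures` are hypothetical names.

```javascript
// Reduce one command line to a structure string.
// Assumption: quoted tokens are values; bare words and flags are kept.
function structureOf(command) {
  const pattern = /"([^"]*)"|'([^']*)'|(\S+)/g;
  const parts = [];
  let argCount = 0;
  let previousWasFlag = false;
  for (const match of command.matchAll(pattern)) {
    const quoted = match[3] === undefined;
    if (!quoted) {
      parts.push(match[3]);
      previousWasFlag = match[3].startsWith('-');
    } else if (previousWasFlag) {
      parts.push('<val>'); // value belonging to the preceding flag
      previousWasFlag = false;
    } else {
      argCount += 1;
      parts.push(`<arg${argCount}>`); // positional argument
    }
  }
  return parts.join(' ');
}

// Count how often each structure occurs, most popular first.
function rankStructures(history) {
  const counts = new Map();
  for (const line of history) {
    const key = structureOf(line);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(rankStructures([
  'git commit -m "some message here"',
  'mv "hither" "thither"',
  'git commit -m "another message"',
]));
```

A real implementation has to cope with unquoted arguments, escapes, pipes, and everything else shell syntax allows; the point here is only that collapsing values into placeholders makes otherwise distinct history entries countable as one.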

Picking one such aggregate yields a second selector over the original incidences, and selecting one of those prints it to stdout. This can be referenced, copied and pasted, or even evaled in the shell.

So far I've released Mnem to the Arch AUR and a Homebrew tap:

brew tap dmfay/mnem https://gitlab.com/dmfay/homebrew-mnem.git
brew install dmfay/mnem/mnem

Plex: A Life

Fri, 06 Sep 2019 00:00:00 GMT

A little while back I got my hands on a copy of Software Development and Reality Construction, the output of a conference held in Berlin in 1988. Among a variety of other more or less philosophical treatments of the theory and practice of software development, Don Knuth analyzes errors he made in his work on TeX; Kristen "SIMULA" Nygaard reviews his collaboration with labor unions to ensure that software meant to coordinate and control work does not wind up controlling the workers as well, a rather grim read in the era of Uber and Amazon; Heinz Klein and Kalle Lyytinen embark on a discussion of data modeling as production rather than as interpretation or hermeneutics. In all, it's some of the most insightful writing about programming and software engineering I've encountered.

This isn't about those contributions.

There's an entry fairly early on from one Douglas T. Ross, called "From Scientific Practice to Epistemological Discovery". Ross, who died in 2007, was a computer scientist and engineer most remembered today for the influential APT machine tools programming language and for coining the term "computer-aided design".

This isn't about the things Doug Ross is remembered for.

Doug Ross had a system. The system began its public life as an early software engineering methodology in the Cambrian explosion of such methodologies enabled by the spread of high-level programming languages in the 60s and 70s. The system went by a few names. Ross's company, SofTech Inc., called it the Structured Analysis and Design Technique or SADT. The US Air Force, never wont to use merely one acronym where two will do, called it IDEF0: ICAM (Integrated Computer Aided Manufacturing) DEFinition for function modeling.

To Doug Ross, the system was Plex. And Plex was everything. When the Department of Defense cut the Structured Analysis data modeling approach from IDEF0 in favor of a simpler methodology to be developed by SofTech subcontractors and named IDEF1, Ross decried the decision as destroying the "mathematical elegance and symmetric completeness of SADT [...] IDEF0 became merely the best of a competing zoo of other software development CASE tools, none of which were scientifically founded". He saw his career, and, indeed, his life, as drawing him inevitably toward the discovery and promulgation of his "philosophy of problem-solving", and furthering Plex's development became more and more important to him as time went on. In the mid-80s, he stopped drawing a salary at SofTech and went back to MIT, lecturing part-time on electrical engineering in order to focus more of his efforts on Plex.

But even MIT was, in Ross's own words, "not yet ready for [Structured Analysis] much less Plex". A graduate seminar on Plex itself was briefly offered in 1984, but was canceled due to lack of student interest. In "From Scientific Practice" Ross bemoans his inability to gain traction for Plex, writing of feeling "an intolerable burden of responsibility to still be the only person in the world (to my knowledge) pursuing it". His only recourse was to turn inward and "generate book after book on Plex in my office at home, in order that Plex will be ready when the world is ready for it!"

At this point, Doug Ross might be sounding a little bit like a crank. Let me be clear: Douglas T. Ross, computer science pioneer, was absolutely a crank of the first water. This is just as absolutely to his credit; any fool can make it from the sublime to the ridiculous, but it takes real talent to go in the other direction. And Plex is sublime, if in its own dry, academic way. Ross is not the celestial paranoiac Francis E. Dec, ranting and raving about the Worldwide Deadly Gangster Communist Computer God and lunar brain storage depots; nor is Plex the gonzo experience of Nature's Harmonic Simultaneous 4-Day Time Cube. That said, Ross never devolves into the racist vituperations Dec and Time Cube's Gene Ray were sometimes given to, either. So it goes.

⁂

Plex itself is a sprawling, incoherent metaphysics built, according to Ross, on the foundation of a single pun (or, more properly, double entendre): "nothing can be left out". Thus inspired, Ross embarks upon the classic Cartesian thought experiment. But where Descartes discards every proposition except the cogito ("I think, therefore I am"), Ross's buck stops at "nothing doesn't exist".

Or, in Ross's own framing:

Nothing doesn't exist. That is the First Definition of Plex -- a scientific philosophy whose aim is understanding our understanding of the nature of nature. Plex does not attempt to understand nature itself, but only our understanding of it. We are included in nature as we do "our understanding", both scientific and informal, so we must understand ourselves as well -- not just what we think we are, but as we really are, as integral, natural beings of nature. How one "understand"s and even who "we" are as we do "our understanding" necessarily is left completely open, for all that must arise naturally from the very nature of nature.

All emphasis -- all of it, I assure you -- original. Ross's dedication to bold and italic text wavers from work to work and page to page, but on balance "From Scientific Practice to Epistemological Discovery" is in fine form. Early entries he refers to in his "thousands of C-pages" (that is, "chronological working pages", all of which may or may not have been lost) and lecture notes he prepared in 1985 sometimes switch between up to eight colors every few words. The lecture notes are of particular interest compared to the other extant materials, comprising a "study of an SADT Data Model which expresses all aspects of any object which obeys laws of physical cause and effect" delivered as a dialogue between Ross and a genie reminiscent of Gödel, Escher, Bach.

Having arrived at the First Definition, Ross next attempts to deduce everything else from it, claiming that Plex need make no assumptions. "Nothing doesn't exist" leads, expanded this way and that, to "Only that which is known by definition is known -- by definition", as, "without a definition for something, we only can know it as Nothing". Within the space of a few paragraphs, he's slammed what appears to be his own misinterpretation of Stephen Hawking and (unknowingly?) reinvented Spinoza's pantheism, on the grounds that "Nothing isn't; Plex is what Nothing isn't". And for what it's worth, this is all still in the first two pages of "From Scientific Practice".

⁂

In another instance, Plex guides Ross to enlightenment regarding questions of information theory. It turns out that a single bit actually requires 3/2 binary digits for encoding, "because the value of the half-bit is 3/4 !!!".

-- which ultimately results from the fact that in actuality, when you don't have something, it is not the case that you have it but it is Nothing -- it is that you don't have it; whereas when you do have something, that is because you don't have what it isn't!

At a closer reading, this isn't necessarily the gibberish it might seem at first blush. Plex's foundation in "Nothing" makes zero the default state. But one is only understandable when there's an understood meaning for one. The elaboration about nothings and somethings makes it seem like Ross is counting this other one -- that is, half a bit -- towards the cost of encoding any other bit. In semiotic terms, this is the interpretant or subjective value Charles Sanders Peirce sees implicit in signification. But if Ross ever investigated the ways logicians and linguists had already been exploring this territory, there's no indication that he attached any significance (as it were) to their work. And while including the interpretant for half the possible values may yield the same final figure, it does not account for the 3/4 half-bit; so in the face of storage hardware design as practiced, Ross's insistence on 3/2 seems more mystical than scientific.

I have no idea how au courant Ross was with the humanities in general, but it seems likely that the answer is "not very". He was, of course, quite well-versed in math and engineering. Even deep in the mire of Plex, one can find him struggling to accommodate the realization that he was, in essence, defining formal systems backwards (he settles this with the ingenious maneuver of declaring the distinction akin to chirality), but the only philosopher he mentions is Plato. His efforts at deductive logic too seem thoroughly warped, as evinced by his "proof that every point is the whole world". For reference, an object's "identity" is tautologically defined as above: the set of "that" which "this" isn't.

  I  n = 1: A world of one point is the whole world.
 II  Assume the theorem is true for (n - 1) points. (n > 1),
     i.e., for any collection of (n - 1) points, every point is the whole world.
     [ed: remember, Plex needs no assumptions, let alone "assume the theorem is true"]
III  To prove the theorem for n points given its truth for (n - 1) points
     (n > 1)
     (a) The identity of any one point, p, in the collection is a collection of (n -
         1) points, each of which is the whole world, by II.
     (b) The identity of any other point, q, i.e., a point of the identity of p, is
         a collection of (n - 1) points, each of which is the whole world, by II.
     (c) The identity of p and the identity of q are identical except that where
         the identity of p has q the identity of q has p. In any case p is the
         whole world by (b) and q is the whole world by (a).
     (d) Hence both p and q are the whole world, as are all the other points (if
         any) in their respective identities (and shared between them).
     (e) Hence all n points are the whole world.
 IV  For n = 2, I is used (via II) in IIIa and IIIb, q.e.d.
  V  Q.E.D. by natural induction.

As mentioned, Ross generated a wealth of C-pages, lecture notes, and other writings on Plex, but except for a small fraction apparently hosted on his last MIT faculty/program page, I have no idea where most of this collection ended up. If you're interested in reading further in Ross's own words, the best places to start are probably "From Scientific Practice to Epistemological Discovery" in Software Development and Reality Construction or The Plex Tract.

Coda

Doug Ross himself remains a rather cryptic figure. There's some biographical information out there, but after his birth to missionary parents in what's now Guangdong and childhood homecoming to the Finger Lakes region of New York it mostly concerns where, when, with whom, and on what he was working. In his writings he comes off somewhat full of himself, as tends to be the case with esoteric philosophers and visionaries for whom the world is not yet and will never be ready. But when Ross talks about the necessary perfection, or perfect necessity, of his marriage to his wife Pat, herself a human computer at MIT's Lincoln Laboratory, it's still a little bit charming. And when he writes, with complete seriousness, that "being a pioneer came naturally" to him, I can't exactly say otherwise.

I wonder what it was like in that conference hall in 1988. I don't know whether the attendees or the organizers knew what they were in for when Ross got up to talk about this beautiful, all-consuming nonsense that was driving him to desperation. But sense isn't everything; and as a project of reality construction Plex is a monumental accomplishment. And the reality we ourselves have collectively constructed, in which points are points, a bit corresponds to a single binary digit, and genies obstinately refuse to appear no matter how we manipulate bottles, is the richer for its existence.

JOIN Semiotics and MassiveJS v6

Tue, 13 Aug 2019 00:00:00 GMT

MassiveJS version 6 is imminent. This next release closes the widest remaining gap between Massive-generated APIs and everyday SQL, not to mention other higher-level data access libraries: JOINs.

This is something of a reversal for Massive, which until now has had very limited functionality for working with multiple database entities at once. I've even written about this as a constraint not without benefits (and, for the record, I think that still -- ad-hoc joins are a tool to be used judiciously in application code!).

But the main reason for this lack was always that I'd never come up with any solution that didn't fit awkwardly into an already-awkward options object. Deep insert and resultset decomposition were quite enough to keep track of. I am naturally loath to concede any inherent advantages to constructing models, but this really seemed like one for the longest time.

There are, however, ways. Here's what Massive joins look like, if we invade the imaginary privacy of an imaginary library system's imaginary patrons:

const whoCheckedOutCalvino = await db.libraries.join({
  books: {
    on: {library_id: 'id'},
    patron_books: {
      type: 'LEFT OUTER',
      pk: ['patron_id', 'book_id'],
      on: {book_id: 'books.id'},
      omit: true
    },
    who_checked_out: {
      type: 'LEFT OUTER',
      relation: 'patrons',
      on: {id: 'patron_books.patron_id'}
    }
  }
}).find({
  state: 'EV',
  'books.author ILIKE': 'calvino, %'
});

(relation in this sense indicates a table or view.)

And the output:

[{
  "id": 2,
  "name": "East Virginia State U",
  "state": "EV",
  "books": [{
    "author": "Calvino, Italo",
    "id": 1,
    "library_id": 2,
    "title": "Cosmicomics",
    "who_checked_out": [{
      "id": 1,
      "name": "Lauren Ipsum"
    }]
  }]
}, {
  "id": 3,
  "name": "Neitherfolk Public Library",
  "state": "EV",
  "books": [{
    "author": "Calvino, Italo",
    "id": 2,
    "library_id": 3,
    "title": "Cosmicomics",
    "who_checked_out": [{
      "id": 2,
      "name": "Daler S. Ahmet"
    }]
  }, {
    "author": "Calvino, Italo",
    "id": 4,
    "library_id": 3,
    "title": "Invisible Cities",
    "who_checked_out": []
  }]
}]

Or in other words, exactly what you'd hope it would look like -- and what, if you use Massive, you may previously have been dealing with a view and decomposition schema to achieve. This is a moderately complex example, and between defaults (e.g. type to INNER) and introspection, declaring a join can be as simple as naming the target: db.libraries.join('books').

The join schema is something of an evolution on the decomposition schema, sharing the same structure but inferring column lists, table primary keys, and even some on conditions where unambiguous foreign key relationships exist. It's more concise, less fragile, and still only defined exactly when and where it's needed. Even better, compound entities created from tables can use persistence methods, meaning that join() can replace many if not most existing usages of deep insert and resultset decomposition.

It might seem a little unconventional to just invent ersatz database entities out of whole cloth. There's some precedent -- Massive already treats scripts like database functions -- but the compound entities created by Readable.join() are a good bit more complex than that. There's a method to this madness though, and its origins date back to before Ted Codd came up with the idea of the relational database itself.

Semiotics from 30,000 Feet

Semiotics is, briefly, the study of meaning-making, with 19th-century roots in both linguistics and formal logic. It's also a sprawling intellectual tradition in dialogue with multifarious other sprawling intellectual traditions, so I am not remotely going to do it justice here. The foundational idea is credited on the linguistics side to Ferdinand de Saussure: meaning is produced in the relation of a signifier to a signified, or taken together a sign. Smoke to fire, letter to sound, and so forth. Everything else proceeds from that relationship. There is, of course, a lot more of that everything else, and like so many other foundational ideas the original Saussurean dyad is something of a museum piece.

But the idea of theorizing meaning itself in almost algebraic terms would outlive de Saussure. The logician Charles Sanders Peirce had already come to similar conclusions, and had realized to boot that the interpreted value of the signifier's relationship to its signified is as important as the other two. Peirce, following this line of reasoning, understood this "interpretant" itself to be a sign comprising its own signifier and signified which in turn yield their own interpretant, in infinite chains of signification. Louis Hjelmslev, meanwhile, reimagined de Saussure's dyad as a relation of expression to content, and added a second dimension of form and substance. To Hjelmslev, a sign is a function, in the mathematical sense, mapping the "form of expression" to the "form of content", naming as the "substance of expression" and "substance of content" the raw materials formed into the sign.

The use of the term "substance" sounds kind of like some sort of philosophically-détourned jargon, but there are no tricks here: it's just stuff. There's no more specific designation than the likes of "substance" for "that which has been made into a sign"; the category includes everything from physical materials to light, gesture, positioning, electricity, more, in endless combinations. A sign is created by these matters being selected and formed into content and expression: fuel, oxygen, and heat organized into fire and smoke, or sounds uttered in an order corresponding to a known linguistic quantity. It should be said also that consciousness need not enter into it: anything can make a sign, and even a plant can interpret one.

This all is to say: there's stuff out there, and what it has in common is that it is made to mean things. Most stuff, in fact, is constantly meaning many things at the same time, as long as there's an interpreting process -- and there's always something. The philosopher-psychologist tag team of Gilles Deleuze and Félix Guattari envisioned the primordial soup of matters-awaiting-further-formation as a spatial dimension: the plane of consistency or plane of immanence. Signification, as they proposed in A Thousand Plateaus, happens on and above the plane of consistency, as matters are selected and drawn up from it to become substance and sign. The recursive nature of signification means that these signs are then selected into the substance of yet other signs, becoming layers or strata on the plane in a fashion they compare to the formation of sedimentary rock.

Signs and Databases

A database management system, like any other program, is an immensely complex system of signs. However, what sets DBMSs (and some other categories of software, like ledgers and version control systems) apart is that they're designed to manage other systems of signs. Thanks to this recursive aspect, a database can be imagined as a plane of consistency, a space from which any combination of unformed bytes might be drawn up into column-signs and row-signs which in turn are gathered into table-signs and view-signs and query-signs.

And if tables and views and queries are all still signs at base, where exactly do the differences come in? Tables store persistent data and are therefore mutable, while views and queries do not and are not, and must be constituted from tables themselves and (in the case of views) from each other. Tables constitute a lower stratum of signs, with views forming table- and view-substance into signs on higher strata, and queries higher still, at a sufficient remove from the plane of consistency that they're no longer stored in the database itself.

This is, of course, arriving at inheritance the long way around. In Massive terms, database entities are first instances of a base Entity class, after which they inherit a second prototype: one of Sequence, Executable, or Readable. Some of the latter may be further articulated as Writables, as well; there are no Writables which are not also Readables.

But there's more than one thing happening here, and the ordering of tables, views, and database functions into class-strata is the second step -- matters must be chosen before they can be formed into signs. It's in this first step of stratification that Massive adds script files to the API system of signs, treating them (almost) identically to functions and procedures.

Readable.join() takes the same idea further to expand on the database's relations: before, a Readable mapped one-to-one with a single table or view. But as long as SQL can be generated to suit, there's no reason one Readable couldn't map to multiple relations. Writables too, for that matter:

const librariesWithBooks = db.libraries.join('books');
const libraryMembers = db.patrons.join('libraries');

// inserts work exactly like deep insert, persisting an
// entire object tree
const newLibrary = await librariesWithBooks.insert({
  name: 'Lichfield Public Library',
  state: 'EV',
  books: [{
    library_id: undefined,
    title: 'Jurgen: A Comedy of Justice',
    author: 'Cabell, James Branch'
  }, {
    library_id: undefined,
    title: 'If On a Winter\'s Night a Traveller',
    author: 'Calvino, Italo'
  }]
});

// updates make changes in the origin table, based on
// criteria which can reference the joined tables
const withCabell = await librariesWithBooks.update({
  'books.author ilike': 'cabell, %'
}, {
  has_cabell: true
});

// deletes, like updates, affect the origin table only
const iplPatrons = await libraryMembers.destroy({
  'libraries.name ilike': 'Imaginary Public Library'
});

Try it Out!

The first v6 prerelease is available now: npm i massive@next. There's now a prerelease section of the docs going over what's new and different in detail. But to sum up the other changes:

Node < 7.6 is no longer supported.
Implicit ordering has been dropped.
Resultset decomposition now yields arrays instead of objects by default. The array schema field is no longer recognized, and you'll need to remove it from your existing decomposition schemas. To yield objects, set decomposeTo: 'object' instead.
JSON and JSONB properties are now sorted as their original type instead of being processed as text.
The type property of the order option has been deprecated in favor of Postgres-style field::type casting as used elsewhere. It will continue to work through the 6.x lifecycle but may be removed in a subsequent major release.

This is a feature I've been wishing I could make happen somehow ever since I first published the original resultset decomposition Gist more than two years ago. It's involved extensive changes to table loading, criteria parsing, and statement generation. I've endeavored not to break these areas, and have informally experimented by dropping pre-prerelease versions into an existing codebase. Results have been good, but should you find an issue with this or any other Massive functionality, please let me know!

I'm really excited to see just how far joins expand Massive's capabilities, but in truth there's just one thing I think I and most other Massive users will get the most mileage out of: plain old query predicate generation with criteria objects, without having to define and manage a plethora of views to cover basic JOINs. Stratification is a useful way to think about the production of meaning -- but strata themselves can also be dead weight.

A Self-Sourcing Cassandra Cluster with SaltStack and EC2

Wed, 27 Mar 2019 00:00:00 GMT

Anybody doing something interesting to a production Cassandra cluster is generally advised, for a host of excellent reasons, to try it out in a test environment first. Here's how to make those environments effectively disposable.

The something interesting we're trying to do to our Cassandra cluster is actually two somethings: upgrading from v2 to v3, while also factoring Cassandra itself out from the group of EC2 servers that currently run Cassandra-and-also-some-other-important-stuff. We have a "pets" situation and want a "cattle" situation, per Bill Baker: pets have names and you care deeply about each one's welfare, while cattle are, not to put too fine a point on it, fungible. If we can bring new dedicated nodes into the cluster, start removing the original nodes as replication takes its course, and finally upgrade this Database of Theseus, that'll be some significant progress -- and without downtime, even! But it's going to take a lot of testing, to say nothing of managing the new nodes for real.

We already use SaltStack to monitor and manage other areas of our infrastructure besides the data pipeline, and SaltStack includes a "salt-cloud" module which can work with EC2. I'd rather have a single infra-as-code solution, so that part's all good. What isn't: the official Cassandra formula is geared more towards single-node instances or some-assembly-required clusters, and provisioning is a separate concern. I expect to be creating and destroying clusters with abandon, so I need this to be as automatic as possible.

Salt-Cloud Configuration
- etc/cloud.profiles.d/ec2.conf
- cassandra-test.map
Pillar and Mine
The Cassandra State
Highstate
- srv/salt/top.sls changes
Startup!

Salt-Cloud Configuration

The first part of connecting salt-cloud is to set up a provider and profile. On the Salt master, these are in /etc/cloud.providers.d and /etc/cloud.profiles.d. We keep everything in source control and symlink these directories.

Our cloud stuff is hosted on AWS, so we're using the EC2 provider. That part is basically stock, but in profiles we do need to define a template for the Cassandra nodes themselves.

etc/cloud.profiles.d/ec2.conf

cassandra_node:
  provider: [your provider from etc/cloud.providers.d/ec2.conf]
  image: ami-abc123
  ssh_interface: private_ips
  size: m5.large
  securitygroup:
    - default
    - others

cassandra-test.map

With the cassandra_node template defined in the profile configuration, we can establish the cluster layout in a map file. The filename doesn't matter; mine is cassandra-test.map. One important thing to note is that we're establishing a naming convention for our nodes: cassandra-*. Each node is also defined as t2.small size, overriding the default m5.large -- we don't need all that horsepower while we're just testing! t2.micro instances, however, did prove to be too underpowered to run Cassandra.

cassandra_node:
  - cassandra-1:
      size: t2.small
      cassandra-seed: true
  - cassandra-2:
      size: t2.small
      cassandra-seed: true
  - cassandra-3:
      size: t2.small

cassandra-seed (and size, for that matter) is a grain, a fact each Salt-managed "minion" knows about itself. When Cassandra comes up in a multi-node configuration, each node looks for help joining the cluster from a list of "seed" nodes. Without seeds, nothing can join the cluster; however, only non-seeds will bootstrap data from the seeds on joining, so it's not a good idea to make everything a seed. And the seed layout needs to toposort: if A has B and C for seeds, B has A and C, and C has A and B, it's the same situation as no seeds. If two instances know that they're special somehow, we can use grain matching to target them specifically.

Pillar and Mine

The Salt "pillar" is a centralized configuration database stored on the master. Minions make local copies on initialization, and their caches can be updated with salt minion-name saltutil.refresh_pillar. Pillars can target nodes based on name, grains, or other criteria, and are commonly used to store configuration. We have a lot of configuration, and most of it will be the same for all nodes, so using pillars is a natural fit.

srv/salt/pillar/top.sls

Like the top.sls for Salt itself, the Pillar top.sls defines a highstate or default state for new minions. First, we declare that the pillars we're adding apply to minions whose names match the pattern cassandra-*.

base:
  'cassandra-*':
    - system-user-ubuntu
    - mine-network-info
    - java
    - cassandra

srv/salt/pillar/system-user-ubuntu.sls

Nothing special here, just a user so we can ssh in and poke things. The private key for the user is defined in the cloud provider configuration.

system:
  user: ubuntu
  home: /home/ubuntu

srv/salt/pillar/mine-network-info.sls

The Salt "mine" is another centralized database, this one storing grain information so minions can retrieve facts about other minions from the master instead of dealing with peer-to-peer communication. Minions use a mine_functions pillar (or salt-minion configuration, but we're sticking with the pillar) to determine whether and what to store. For Cassandra nodes, we want internal network configuration and the public DNS name; each node has to get the latter by asking AWS where it is with curl.

mine_functions:
  network.interfaces: [eth0]
  network.ip_addrs: [eth0]
  # ask amazon's network config what we're public as
  public_dns:
    - mine_function: cmd.run
    - 'curl -s http://169.254.169.254/latest/meta-data/public-hostname'

srv/salt/pillar/java.sls

Cassandra requires Java 8 to be installed (prospective Java 9 support became prospective Java 11 support and is due with Cassandra 4). This pillar sets up the official Java formula accordingly -- or rather, it did until Oracle archived the Java 8 binaries in April 2019. We're now pulling it from Artifactory, which is a whole other thing.

java:
  # vitals
  release: '8'
  major: '0'
  minor: '202'
  development: false
  
  # tarball
  prefix: /usr/share/java # unpack here
  version_name: jdk1.8.0_202 # root directory name
  source_url: https://download.oracle.com/otn-pub/java/jdk/8u202-b08/1961070e4c9b4e26a04e7f5a083f551e/server-jre-8u202-linux-x64.tar.gz
  source_hash: sha256=61292e9d9ef84d9702f0e30f57b208e8fbd9a272d87cd530aece4f5213c98e4e
  dl_opts: -b oraclelicense=accept-securebackup-cookie -L

srv/salt/pillar/cassandra.sls

Finally, the Cassandra pillar defines properties common to all nodes in the cluster. My upgrade plan is to bring everything up on 2.2.12, switch the central pillar definition over, and then supply the new version number to each minion by refreshing its pillar as part of the upgrade process.

cassandra:
  version: '2.2.12'
  cluster_name: 'Test Cluster'
  authenticator: 'AllowAllAuthenticator'
  endpoint_snitch: 'Ec2Snitch'
  twcs_jar:
    '2.2.12': 'TimeWindowCompactionStrategy-2.2.5.jar'
    '3.0.8': 'TimeWindowCompactionStrategy-3.0.0.jar'

The twcs_jar dictionary gets into one of the reasons I'm not using the official formula: we're using the TimeWindowCompactionStrategy. TWCS was integrated into Cassandra starting in 3.0.8 or 3.8, but it has to be compiled and installed separately for earlier versions. Pre-integration versions of TWCS also have a different package name (com.jeffjirsa instead of org.apache). 3.0.8 is the common point, having the org.apache TWCS built in but also being a valid compilation target for the com.jeffjirsa TWCS. After upgrading to 3.0.8 I'll be able to ALTER TABLE to apply the org.apache version before proceeding.

With the provider, profile, map file, and pillar set up, we can actually spin up a barebones cluster of Ubuntu VMs now and retrieve the centrally-stored network information from the Salt mine:

sudo salt-cloud -m cassandra-test.map

sudo salt 'cassandra-1' 'mine.get' '*' 'public_dns'

We can't do much else, since we don't have anything installed on the nodes yet, but it's progress!

The Cassandra State

The state definition includes everything a Cassandra node has to have in order to be part of the cluster: the installed binaries, a cassandra group and user, a config file, a data directory, and a running SystemD unit. The definition itself is sort of an ouroboros of YAML and Jinja:

srv/salt/cassandra/defaults.yaml

First, there's a perfectly ordinary YAML file with some defaults. These could easily be in the pillar we set up above (or the pillar config could all be in this file); the principal distinction seems to be in whether you want to propagate changes via saltutil.refresh_pillar, or by (re)applying the Cassandra state either directly or via highstate. This is definitely more complicated than it needs to be right now, but given that this is my first major SaltStack project, I don't yet know enough to land on one side or the other, or if combining a defaults file with the pillar configuration will eventually be necessary.

cassandra:
  dc: dc1
  rack: rack1

srv/salt/cassandra/map.jinja

The map template loads the defaults file and merges it with the pillar, creating a server dictionary with all the Cassandra parameters we're setting.

{% import_yaml "cassandra/defaults.yaml" as default_settings %}

{% set server = salt['pillar.get']('cassandra', default=default_settings.cassandra, merge=True) %}
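As a rough model of what that merge amounts to, here is a small Python sketch. It is not Salt's implementation: merge_settings is a hypothetical helper, and it assumes that pillar values win over defaults wherever both define the same key. The values are copied from the defaults file and the cassandra pillar above.

```python
import copy


def merge_settings(defaults, overrides):
    """Recursively merge overrides into a copy of defaults; overrides win."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are dictionaries: merge them key by key
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# the "cassandra" key of defaults.yaml
defaults = {"dc": "dc1", "rack": "rack1"}

# part of the "cassandra" key of the pillar
pillar = {
    "version": "2.2.12",
    "cluster_name": "Test Cluster",
    "endpoint_snitch": "Ec2Snitch",
}

server = merge_settings(defaults, pillar)
print(server["dc"], server["version"])
```

Under that assumption, adding a dc or rack key to the pillar would take precedence over the defaults file without touching it.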

srv/salt/cassandra/init.sls

Finally, the Cassandra state entrypoint init.sls is another Jinja template that happens to look a lot like a YAML file and renders a YAML file, which for SaltStack is good enough. Jinja is required here since values from the server dictionary, like the server version or the TWCS JAR filename, need to be interpolated at the time the state is applied.

When the Cassandra state is applied to a fresh minion:

wget will be installed
A CASSANDRA_VERSION environment variable will be set to the value defined in the pillar
A user and group named cassandra will be created
A script named install.sh will download and extract Cassandra itself, once the above three conditions are met
A node configuration file named cassandra.yaml will be generated from a Jinja template and installed to /etc/cassandra
If necessary, the TWCS jar will be added to the Cassandra lib directory
The directory /var/lib/cassandra will be created and chowned to the cassandra user
A SystemD unit for Cassandra will be installed and started once all its prerequisites are in order

{% from "cassandra/map.jinja" import server with context %}

wget:
  pkg.installed

cassandra:
  environ.setenv:
    - name: CASSANDRA_VERSION
    - value: {{ server.version }}

  cmd.script:
    - require:
      - pkg: wget
      - user: cassandra
      - environ: CASSANDRA_VERSION
    - source: salt://cassandra/files/install.sh
    - user: root
    - cwd: ~

  group.present: []

  user.present:
    - require:
      - group: cassandra
    - gid_from_name: True
    - createhome: False

  service.running:
    - enable: True
    - require:
      - file: /etc/cassandra/cassandra.yaml
      - file: /etc/systemd/system/cassandra.service
{%- if server.twcs_jar[server.version] %}
      - file: /opt/cassandra/lib/{{ server.twcs_jar[server.version] }}
{%- endif %}

# Main configuration
/etc/cassandra/cassandra.yaml:
  file.managed:
    - source: salt://cassandra/files/{{ server.version }}/cassandra.yaml
    - template: jinja
    - makedirs: True
    - user: cassandra
    - group: cassandra
    - mode: 644

# Load TWCS jar if necessary
{%- if server.twcs_jar[server.version] %}
/opt/cassandra/lib/{{ server.twcs_jar[server.version] }}:
  file.managed:
    - require:
      - user: cassandra
      - group: cassandra
    - source: salt://cassandra/files/{{ server.version }}/{{ server.twcs_jar[server.version] }}
    - user: cassandra
    - group: cassandra
    - mode: 644
{%- endif %}

# Data directory
/var/lib/cassandra:
  file.directory:
    - user: cassandra
    - group: cassandra
    - mode: 755

# SystemD unit
/etc/systemd/system/cassandra.service:
  file.managed:
    - source: salt://cassandra/files/cassandra.service
    - user: root
    - group: root
    - mode: 644

srv/salt/cassandra/files/install.sh

This script downloads and extracts the target version of Cassandra and points the symlink /opt/cassandra to it. If the target version already exists, it just updates the symlink since everything else is already set up.

#!/bin/bash

update_symlink() {
  rm -f /opt/cassandra # -f: the link won't exist yet on a fresh install
  ln -s "/opt/apache-cassandra-$CASSANDRA_VERSION" /opt/cassandra

  echo "Updated symlink"
}

# already installed?
if [ -d "/opt/apache-cassandra-$CASSANDRA_VERSION" ]; then
  echo "Cassandra $CASSANDRA_VERSION is already installed!"

  update_symlink

  exit 0
fi

# download and extract
wget "https://archive.apache.org/dist/cassandra/$CASSANDRA_VERSION/apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"
tar xf "apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"
rm "apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"

# install to /opt and link /opt/cassandra
mv "apache-cassandra-$CASSANDRA_VERSION" /opt
update_symlink

# create log directory
mkdir -p /opt/cassandra/logs

# set ownership
chown -R cassandra:cassandra "/opt/apache-cassandra-$CASSANDRA_VERSION"
chown cassandra:cassandra /opt/cassandra

It's probably possible to do most of this, at least the symlink juggling and directory management, with "pure" Salt (and the environment variable could be eliminated by rendering install.sh as a Jinja template with the server dictionary), but the script does what I want it to and it's already idempotent and centrally managed.

srv/salt/cassandra/files/cassandra.service

This is a basic SystemD unit, with some system limits customized to give Cassandra enough room to run. It starts whatever Cassandra executable it finds at /opt/cassandra, so all that's necessary to resume operations after the symlink changes during the upgrade is to restart the service.

[Unit]
Description=Apache Cassandra database server
Documentation=http://cassandra.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=forking
User=cassandra
Group=cassandra
ExecStart=/opt/cassandra/bin/cassandra -Dcassandra.config=file:///etc/cassandra/cassandra.yaml
LimitNOFILE=100000
LimitNPROC=32768
LimitMEMLOCK=infinity
LimitAS=infinity

[Install]
WantedBy=multi-user.target

srv/salt/cassandra/files/2.2.12/cassandra.yaml

The full cassandra.yaml is enormous, so I won't reproduce it here. The interesting parts are where values are being automatically interpolated by Salt. Like the Cassandra state, this is actually a Jinja template which renders a YAML file.

First, we get a list of internal IP addresses corresponding to cassandra-seed minions from the Salt mine and build a list of known_seeds.

{%- from 'cassandra/map.jinja' import server with context -%}
{% set known_seeds = [] %}
{% for minion, ip_array in salt['mine.get']('cassandra-seed:true', 'network.ip_addrs', 'grain').items() if ip_array is not sameas false and known_seeds|length < 2 %}
{%   for ip in ip_array %}
{%     do known_seeds.append(ip) %}
{%   endfor %}
{% endfor %}

This becomes the list of seeds the node looks for when trying to join the cluster.

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "{{ known_seeds|unique|join(',') }}"

Listen and broadcast addresses are configured per node. The broadcast addresses are a little special due to our network configuration needs: each node has to get its public DNS name from the Salt mine. This is perhaps a bit overcomplicated compared to a custom grain or capturing the output from running the Salt modules at render time, but it's there and it works and at this point messing with it isn't a great use of time.

listen_address: {{ grains['fqdn'] }}
broadcast_address: {{ salt['mine.get'](grains['id'], 'public_dns').items()[0][1] }}
rpc_address: {{ grains['fqdn'] }}
broadcast_rpc_address: {{ salt['mine.get'](grains['id'], 'public_dns').items()[0][1] }}

The cluster name and other central settings are interpolated from the pillar+defaults server dictionary.

cluster_name: "{{ server.cluster_name }}"
...
authenticator: "{{ server.authenticator }}"
...
endpoint_snitch: "{{ server.endpoint_snitch }}"

The changes to the Cassandra 3.0.8 configuration are identical.

srv/salt/cassandra/files/2.2.12/TimeWindowCompactionStrategy-2.2.5.jar

See this post on TheLastPickle for directions on building the TWCS jar.

Highstate

Finally, the Salt highstate needs to ensure that our cassandra-* nodes have the Java and Cassandra states applied. Salt-Cloud minions come configured, however, so we also have to exclude the default salt.minion state from our Cassandra nodes; otherwise a highstate will blow away the cloud-specific configuration.

srv/salt/top.sls changes

base:
  'not cassandra-*':
    - match: compound
    - salt.minion
  'cassandra-*':
    - sun-java
    - sun-java.env
    - cassandra

Startup!

Set the Salt config dir to etc with -c and pass in the map file with -m:

sudo salt-cloud -c etc -m cassandra-test.map

To clean up:

sudo salt-cloud -d cassandra-1 cassandra-2 cassandra-3

Automatic Node Deploys to Elastic Beanstalk

Mon, 08 Oct 2018 00:00:00 GMT

One of my favorite good ideas to ignore is the maxim that you should have your deployment pipeline ready to go before you start writing code. There's always some wrinkle you couldn't have anticipated anyway, so while it sounds good on paper I just don't think it's the best possible use of time. But with anything sufficiently complicated, there's a point where you just have to buckle down and automate rather than waste time repeating the same steps yet again (or, worse, forgetting one). I hit that point recently: the application isn't in production yet, so I'd been "deploying" by means of pulling the repo on an EC2 server, installing dependencies and building in-place, then killing and restarting the node process with nohup. Good enough for demos, not sustainable long-term. Also, I might have in fact missed a step Friday before last and not realized things were mostly broken until the following Monday.

I'd been using CircleCI to build and test the application already, so I wanted to stick with it for deployment as well. However, this precluded using the same EC2 instance: the build container would need to connect to it to run commands over SSH, but this connection would be coming from any of a huge possible range of build container IP addresses. I didn't want to open the server up to the whole world to accommodate the build system. Eventually I settled on Elastic Beanstalk, which can be controlled through the AWS command-line interface with the proper credentials instead of the morass of VPCs and security groups. Just upload a zip file!

The cost of using Elastic Beanstalk, it turned out, was that while it made difficult things easy it also made easy things difficult. How do you deploy the same application to different environments? You don't. Everything has to be in that zip file, and if that includes any per-environment configuration then the right config files had better be where they're expected to be. This is less than ideal, but at least it can be scripted. Here's the whole thing (assuming awscli has already been installed):

# what time is it?
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# work around Elastic Beanstalk permissions for node-gyp (bcrypt)
echo "unsafe-perm=true" > .npmrc

# generate artifacts
npm run build

# download config
aws s3 cp s3://elasticbeanstalk-bucket-name/app/development.config.json .

# zip everything up
zip -r app-dev.zip . \
  --exclude "node_modules/*" ".git/*" "coverage/*" ".nyc_output/*" "test/*" ".circleci/*"

# upload to s3
aws s3 mv ./app-dev.zip s3://elasticbeanstalk-bucket-name/app/app-dev-$TIMESTAMP.zip

# create new version
aws elasticbeanstalk create-application-version --region us-west-2 \
  --application-name app --version-label development-$TIMESTAMP \
  --source-bundle S3Bucket=elasticbeanstalk-bucket-name,S3Key=app/app-dev-$TIMESTAMP.zip

# deploy to dev environment
# --application-name app is not specified because apt installs
# an older version of awscli which doesn't accept that option
aws elasticbeanstalk update-environment --region us-west-2 --environment-name app-dev \
  --version-label development-$TIMESTAMP

The TIMESTAMP ensures the build can be uniquely identified later. The .npmrc setting is for AWS reasons: as detailed in this StackOverflow answer, the unfortunately-acronymed node-gyp runs as the instance's ec2-user account and doesn't have the permissions it needs to compile bcrypt. If you're not using bcrypt (or another project that involves a node-gyp step on install), you don't need that line.

The zip is assembled in three steps:

npm run build compiles stylesheets, dynamic Pug templates, frontend JavaScript, and so forth.
The appropriate environment config is downloaded from an S3 bucket.
Everything is rolled together in the zip file, minus the detritus of source control and test results.

Finally, the Elastic Beanstalk deploy happens in two stages:

aws elasticbeanstalk create-application-version does what it sounds like: each timestamped zip file becomes a new "version". These don't map exactly to versions as more commonly understood thanks to the target environment configuration, so naming them for the target environment and giving the timestamp helps identify them.
aws elasticbeanstalk update-environment actually deploys the newly-created "version" to the destination environment.

Obviously, when it comes time to roll the project out to production, I'll factor the environment out into a variable to download and upload the appropriate artifacts. But even in its current state, this one small script has almost made deployment continuous: every pushed commit gets deployed to Elastic Beanstalk with no manual intervention, unless there are database changes. That's next.

Surrealist Remixes with Markov Chains

Sun, 05 Aug 2018 00:00:00 GMT

There's a new button at the bottom of this (and each) post. Try clicking it! (If you're reading this on dev.to or an RSS reader, you'll need to visit di.nmfay.com to see it)

By now everyone's run into Twitter bots and automated text generators that combine words in ways that almost compute. There's even a subreddit that runs the user-generated content of other subreddits through individual accounts which make posts that seem vaguely representative of their sources, but either defy comprehension or break through into a sublime silliness.

People have engaged in wordplay (and word-work) for as long as we've communicated with words. Taking language apart and putting it back together in novel ways has been the domain of poets, philosophers, and magicians alike for eons, to say nothing of puns, dad jokes, glossolalia, and word salad.

In the early 20th century, artists associated with the Surrealist movement played a game, variously for entertainment and inspiration, called "exquisite corpse". Each player writes a word (in this version, everyone is assigned a part of speech ahead of time) or draws on an exposed section of paper, then folds the sheet over to obscure their work from the next player. Once everyone's had a turn, the full sentence or picture is revealed. The game takes its name from its first recorded result: le cadavre exquis boira le vin nouveau, or "the exquisite corpse shall drink the new wine".

The Surrealist seeds fell on fertile ground and their ideas spread throughout the artistic and literary world, just as they themselves had been informed by earlier avant-garde movements like Symbolism and Dada. In the mid-century, writers and occultists like Brion Gysin and William Burroughs used similar techniques to discover new meanings in old texts. The only real difference in our modern toys is that they run on their own -- it's a little bit horror movie ouija board, except you can see the workings for yourself.

There are a variety of ways to implement this kind of functionality. On the more primitive side, you have "mad libs" algorithms which select random values to insert into known placeholders, as many Twitter bots such as @godtributes or @bottest_takes do. This method runs up against obvious limitations fairly quickly: the set of substitutions is finite, and the structure they're substituted into likewise becomes predictable.

More advanced text generators are predictive, reorganizing words or phrases from a body of text or corpus in ways which reflect the composition of the corpus itself: words aren't simply jumbled up at random, but follow each other in identifiable sequences. Many generators like these run on Markov chains, probabilistic state machines where the choice of next state depends only on the current state.

Implementing a Textual Markov Chain

The first order of business in using a Markov chain to generate text is to break up the original corpus. Regular expressions matching whitespace make that easy enough, turning it into an array of words. The next step is to establish the links between states, which is where things start getting a little complex.

Textual Markov chains have one important parameter: the prefix length, which defines how many previous states (words) comprise the current state and must be evaluated to find potential next states. Prefixes must comprise at least one word, but for the purposes of natural-seeming text generation the sweet spot tends to be between two and four words depending on corpus length. With too short a prefix length, the output tends to be simply garbled; too long a prefix or too short a corpus, and there may be too few potential next states for the chain to diverge from the original text.

Mapping prefixes to next states requires a sliding window on the array. This is more easily illustrated. Here's a passage from Les Chants de Maldoror, a 19th-century prose poem rediscovered and given new fame (or infamy) by the Surrealists, who identified in its obscene grandiosity a deconstruction of language and the still-developing format of the modern novel that prefigured their own artistic ideology:

He is as fair as the retractility of the claws of birds of prey; or again, as the uncertainty of the muscular movements in wounds in the soft parts of the lower cervical region; or rather, as that perpetual rat-trap always reset by the trapped animal, which by itself can catch rodents indefinitely and work even when hidden under straw; and above all, as the chance meeting on a dissecting-table of a sewing-machine and an umbrella!

Assuming a prefix length of 2, the mapping might start to take this shape:

"He is": ["as"],
"is as": ["fair"],
"as fair": ["as"],
"fair as": ["the"]

Starting from the first prefix ("He is"), there is only one next state possible since the words "He is" only appear once in the corpus. Upon reaching the next state, the active prefix is now "is as", which likewise has only one possible next state, and so forth. But when the current state reaches "as the", the next word to be added may be "retractility", "uncertainty", or "chance", and what happens after that depends on the route taken. Multiple next states introduce the potential for divergence; this is also why having too long a prefix length, or too short a corpus, results in uninteresting output!
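For the curious, here's a minimal sketch of the sliding window in Python. It's only an illustration under a few assumptions: build_chain is a hypothetical helper, tuples of words stand in for the prefix keys, and the corpus is split on whitespace alone, so punctuation stays attached to its word.

```python
from collections import defaultdict

CORPUS = (
    "He is as fair as the retractility of the claws of birds of prey; "
    "or again, as the uncertainty of the muscular movements in wounds in "
    "the soft parts of the lower cervical region; or rather, as that "
    "perpetual rat-trap always reset by the trapped animal, which by itself "
    "can catch rodents indefinitely and work even when hidden under straw; "
    "and above all, as the chance meeting on a dissecting-table of a "
    "sewing-machine and an umbrella!"
)


def build_chain(text, prefix_length=2):
    """Map each run of prefix_length words to the words that follow it."""
    words = text.split()  # split on any run of whitespace
    chain = defaultdict(list)
    # Slide a window of prefix_length words across the corpus; the word
    # just past the window is a possible next state for that prefix.
    for i in range(len(words) - prefix_length):
        prefix = tuple(words[i:i + prefix_length])
        chain[prefix].append(words[i + prefix_length])
    return dict(chain)


chain = build_chain(CORPUS)
print(chain[("He", "is")])   # ['as']
print(chain[("as", "the")])  # ['retractility', 'uncertainty', 'chance']
```

Raising prefix_length to 3 or 4 on a corpus this short leaves almost every prefix with a single follower, which is the "too few potential next states" problem in miniature.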

Because the prefix is constantly losing its earliest word and appending the next, it's kept as an array and stringified whenever it's needed as a lookup key, rather than being stored as a concatenated string. The order of operations goes like this:

Select one of the potential next states for the current stringified prefix array.
shift the earliest word out of the prefix array and push the selected next word onto the end.
Stringify the new prefix array.
Repeat until bored, or until there's no possible next state.
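Those steps can be sketched in Python along the following lines. This is an illustration rather than the code running on this site: generate, build_chain, and the choose parameter are hypothetical names, the prefix is "stringified" by joining its words with spaces, "until bored" is approximated with a max_words cap, and the sample sentence is made up.

```python
import random


def build_chain(words, prefix_length):
    """Map each stringified prefix to the list of words that follow it."""
    chain = {}
    for i in range(len(words) - prefix_length):
        key = " ".join(words[i:i + prefix_length])
        chain.setdefault(key, []).append(words[i + prefix_length])
    return chain


def generate(words, prefix_length=2, max_words=30, choose=random.choice):
    """Walk the chain from the opening prefix, picking next states via choose."""
    chain = build_chain(words, prefix_length)
    prefix = words[:prefix_length]  # a copy, so words itself is untouched
    output = list(prefix)
    while len(output) < max_words:
        # stringify the current prefix array to look up its next states
        candidates = chain.get(" ".join(prefix))
        if not candidates:
            break  # no possible next state
        next_word = choose(candidates)  # select one of the next states
        prefix.pop(0)                   # shift the earliest word out
        prefix.append(next_word)        # push the selected word onto the end
        output.append(next_word)
    return " ".join(output)


words = "the cat sat on the mat and the cat ran to the mat".split()
print(generate(words))
```

Passing a different choose function swaps the random pick for any other selection rule, such as always taking the first candidate or asking a person.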

Remix!

If you're interested in the actual code, it's remix.js in devtools, or you can find it in source control.

Markov chain generators aren't usually interactive; that's where the "probabilistic" part of "probabilistic state machine" comes into play. This makes the implementation here incomplete by design. Where only one possible next state exists, the state machine advances on its own, but where there are multiple, it allows the user to choose how to proceed. This, along with starting from the beginning instead of selecting a random opening prefix, gives it more of an exploratory direction than if it simply restructured the entire corpus at the push of a button. The jury's still out on whether any great insights lie waiting to be unearthed, as the more mystically-minded practitioners of aleatory editing hoped, but in the meantime, the results are at least good fun.

Summer 2018: Massive, Twice Over

Mon, 30 Jul 2018 00:00:00 GMT

NDC talks are up!

There's also the FullStack London version which is slightly condensed for a shorter timeslot, if you have a SkillsMatter account and want to get right to the fun parts.

If you've read (almost) anything I've written, text or code, odds are you've run into Massive.js. On the off chance you haven't, the elevator pitch is that PostgreSQL exclusivity lets you get a lot more mileage out of your database (as long as it's Postgres) and JavaScript being a dynamically typed, functional-ish language lets you get away with it really easily.

This talk goes over Massive in much more depth: first laying out a case for alternatives to the dominant object-relational mapping data access technique, in general and especially in JavaScript; and then diving into the architecture of Massive itself with plenty of examples. Also, there's some trivia about early 20th century Russian avant-garde art and another bit poking fun at French modernist architect Le Corbusier.

It's the second talk I've done, and overall I was pretty happy with how it went in Oslo and London both! I'm the furthest thing from a natural public speaker but I covered what I wanted to cover, finished at a reasonable time, and didn't screw anything up too badly -- so that's a success in my book. And after all, the only way to improve this particular skill is to keep doing it.

Centralize Your Query Logic!

Wed, 25 Jul 2018 00:00:00 GMT

At a talk I gave earlier this month, an audience member asked if Massive supported joining information from multiple tables together. It's come up on the issue tracker before as well. Massive does not currently have this functionality, and while I'm open to suggestions it's not on my own radar.

The central reason for this is that join logic can be tricky to manage from the application architecture side. The ability to correlate and combine what you need when you need it is certainly powerful, but it also embeds assumptions about your database layout in client code. As the database and application evolve, these assumptions can easily fall out of date and out of sync with each other. In real terms, if your application's "model" (whether implicit or explicit) of a user loaded from the database includes only the user record itself sometimes, but other times looks for information in a separate profile table, adds current statistics, et cetera, and you have functionality that operates on A User, either you understand that users come in different shapes and handle them accordingly across the board or you are living on borrowed uptime.

Some application architectures approach this scenario by grouping the query logic together. In the enterprise world, n-tier applications frequently pull related queries into "services" or Data Access Objects (DAOs) so there's at least some kind of organizational schema. This reduces the maintenance overhead somewhat, but it's an imperfect solution, not least because there's nothing but fallible code reviews (if that) standing in the way of someone dropping data access code somewhere else.

Fortunately, there's already part of the application-database ecosystem dedicated to organizing things -- the database itself! And as an organizing principle, it already has its own way to manage complex queries. Sure, it'll involve writing a little SQL, but let's face it: you were going to wind up writing SQL eventually anyway.

If you've only scratched the surface of working with databases, you might not be familiar with views. The good news is they're pretty straightforward: a view is a stored SQL query with a name, given life with the statement CREATE VIEW myview AS SELECT.... You can SELECT from a view just like you can a table, optionally with JOINs and a WHERE clause and all the other trimmings, whereupon the database executes the query. Results are not stored so the information you get out of a view is always current, unless you intentionally sacrifice realtime data for speed by creating a materialized view which does persist results and has to be manually refreshed.
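Here's a small, self-contained sketch of a view in action. To keep it runnable anywhere it uses SQLite through Python's built-in sqlite3 module rather than PostgreSQL, and the users and profiles tables and the user_details view are hypothetical names invented for the example. SQLite has no materialized views, so only the ordinary kind is shown.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE profiles (user_id INTEGER REFERENCES users (id), bio TEXT);

    -- the view is just a named query; no rows are stored for it
    CREATE VIEW user_details AS
    SELECT users.id, users.name, profiles.bio
    FROM users
    LEFT JOIN profiles ON profiles.user_id = users.id;

    INSERT INTO users (id, name) VALUES (1, 'alice');
""")

query = "SELECT id, name, bio FROM user_details WHERE name = 'alice'"

before = db.execute(query).fetchall()
print(before)  # [(1, 'alice', None)]

# Changes to the underlying tables show up the next time the view is queried.
db.execute("INSERT INTO profiles (user_id, bio) VALUES (1, 'hello')")
after = db.execute(query).fetchall()
print(after)  # [(1, 'alice', 'hello')]
```

The client code only ever names user_details; how a user is assembled from tables lives in the database.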

The reason views are underrated and underutilized in application development has mostly to do with the frameworks developers use to communicate with databases. When you have to provide a concrete implementation of a unary User model, odds are you only care about things you can both read and write to, so you back it up with tables instead of using views to shape data for your needs. There's little room for views in object/relational mapping, and when I've had to use O/RMs I've really only been able to take advantage of views to streamline the raw SQL queries you have to write anyway when you use O/RMs.

If you're not stuck with an object-relational mapper, though, you can really get your money's worth out of views! Retrieving user records from a view, or building more complex user-inclusive results by joining it into other views, ensures that you have a consistent definition of what information comprises a user built into your database. You can't always stop other developers from winging it, naturally, but having that central definition to point to eliminates at least one major potential ambiguity. Massive's omission of the join feature encourages developers using it to center their thinking on the database and the tools it offers for organizing information.

As with anything, there are tradeoffs. Here, it's flexibility. Views may be ephemeral stored queries, but they're still part of the database schema for all that, and the schema takes more planning and effort to change than does application code. But it's a good idea to be thinking carefully about this stuff in the first place.

Shell Bonsai with tree

Sun, 01 Jul 2018 00:00:00 GMT

The shell has just about all the tooling I need for day-to-day operation of a computer: navigating and managing directories and files, text editing, and building, testing, and running projects I'm working on. What it isn't so great at is layouts, or really, displaying anything that isn't a text file (as fun as it is, I'm unwilling to switch out a proper image viewer for tiv).

Directory trees are one of the more commonly-encountered layouts that don't do too well with monospaced ASCII. There's the venerable tree -- and that just about covers the possibilities, because there aren't many more ways to display that kind of structure under those constraints. Fortunately, tree comes with amenities, from pattern-matching to JSON output.

I also do a lot of work on projects which contain certain files I don't care about. With git, I use a .gitignore file in the project root to ensure I don't accidentally add and commit them. This file gets used by more than git, too: my search utility of choice, ripgrep, respects .gitignore rules, as do many other tools all the way up to graphical IDEs.

tree, which predates git by something like a decade at absolute minimum, does not care about your .gitignore. When inspecting the layout of a repository with a moderately-sized ignore ruleset and/or something like node_modules, this makes it all but unusable.

One of tree's features is the -I flag, which ignores files matching a wildcard pattern similar to that used in .gitignore. That means it should be possible to hack something together which respects .gitignore rules without mucking around in coreutils: other system tools output and manipulate files, xargs can manage other commands' arguments, and pipes hook the whole thing together.

Here's the full alias from my .zshrc, if you're just interested in that part (note it all needs to be on one line):

alias trii="(cat .gitignore & echo '.git') |
  sed 's/^\(.\+\)$/\1\|/' |
  tr -d '\n' |
  xargs printf \"-I '%s'\" |
  xargs tree -C"

With the exception of -I, you can still pass tree's arguments to trii, so the rest of its toolkit is still available. It's also safe if there's no ignore file in the current directory.

Now, in more depth:

(cat .gitignore & echo '.git')

cat dumps the ignore file to standard output and echo adds the string ".git" so that the full ruleset excludes the repository directory itself (only a problem with the -a switch, which displays hidden files and directories). The single & is a separator which starts cat in the background and moves straight on to echo, so both commands run whether or not cat succeeds, as opposed to the more common double && which stops at the first non-zero exit code. A semicolon would do the same job while guaranteeing that the ignore rules are written before ".git", but the order makes no difference to tree. The parentheses run the whole thing in a subshell, so the combined output of both commands is piped into the next segment.
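The difference between the three separators is easiest to see when the first command fails. Here is a minimal sketch, assuming a file name (no-such-gitignore) that does not exist in the current directory and stands in for a missing .gitignore:

```shell
# Hypothetical file name: it is assumed NOT to exist, so cat fails every time.
missing='./no-such-gitignore'

# ';' runs the commands one after another, whatever the first one's exit status.
with_semicolon=$(cat "$missing" 2>/dev/null; echo '.git')

# '&&' stops at the first failure, so echo never runs when cat fails.
with_and=$(cat "$missing" 2>/dev/null && echo '.git') || true

# '&' starts cat in the background and carries on to echo immediately;
# both write to the same output in whatever order they get there.
with_background=$(cat "$missing" 2>/dev/null & echo '.git')

printf 'semicolon:  [%s]\n' "$with_semicolon"
printf 'and:        [%s]\n' "$with_and"
printf 'background: [%s]\n' "$with_background"
```

Only the && variant comes back empty, which is why the alias still works in a directory with no ignore file.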

sed 's/^\(.\+\)$/\1\|/'

You can't specify multiple -I values: the last one always wins. Instead, -I can read multiple patterns which are joined together with pipe | characters. That's possible, but it's going to take a couple of steps.

sed is a stream editor which modifies each line coming from the previous segment. Here, it's simply appending the pipe character. Because sed operates on each line as a discrete entity, it can't join them together; that's up to the next segment:

tr -d '\n'

Unlike sed, tr (translate) operates on standard input as a stream of characters, instead of line by line. The -d switch deletes characters, here the newline. This completes the ignore pattern, with a sample project's .gitignore transformed into this:

.git|src|pkg|**/*.tar.xz|
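To try the sed and tr steps without a real repository, here is a minimal sketch that feeds a made-up ignore list through the same two commands. The sample rules are assumptions for illustration, and the \+ in the expression relies on GNU sed.

```shell
# Made-up ignore rules, including a blank line, standing in for a .gitignore.
sample_rules='src
pkg

**/*.tar.xz'

# Append a pipe to every non-empty line, then delete the newlines.
pattern=$(printf '%s\n' "$sample_rules" | sed 's/^\(.\+\)$/\1|/' | tr -d '\n')

echo "$pattern"
```

The blank line passes through sed untouched because .\+ needs at least one character to match, so it contributes nothing once tr removes its newline.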

There's a terminating pipe, but it doesn't make a difference to tree. This line gets passed to yet another command:

xargs printf "-I '%s'"

xargs passes lines from standard input to another command. Here there's only one line, since tr removed all the newline characters, and it's being passed to printf. This is not to be confused with the C standard library function printf: it's a standalone program in the GNU coreutils, although it does much the same thing as its near relative. The net effect of this command is to print the -I switch and the concatenated ignore list together.

xargs tree -C

Finally, it's time to invoke tree! The -C flag adds color to the output. xargs passes the combined -I and ignorelist into the command string, and the result is a tree that excludes everything from the .gitignore.
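If the one-line alias gets awkward to maintain, the same idea can be sketched as a pair of shell functions. This is a variation and not the alias above: it swaps sed and tr for paste, skips comment lines, and the name ignore_pattern is made up for the example. Like the alias, it assumes the patterns in .gitignore are simple enough for tree's -I to understand.

```shell
# Build a tree -I pattern from an ignore file (default: ./.gitignore).
# Blank lines and comment lines are dropped; a missing file is not an error.
ignore_pattern() {
  ignore_file="${1:-.gitignore}"
  {
    echo '.git'
    if [ -f "$ignore_file" ]; then
      grep -v -e '^#' -e '^$' "$ignore_file" || true
    fi
  } | paste -s -d '|' -
}

# Hypothetical function form of the alias; extra arguments go straight to tree.
trii() {
  tree -C -I "$(ignore_pattern)" "$@"
}
```

paste -s joins the lines with the delimiter between them, so there is no trailing pipe to explain away.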

Automating Maven Releases with CircleCI

Sat, 26 May 2018 00:00:00 GMT

Maven's probably the only all-in-one build tool I've ever really appreciated. I'll probably come to like make eventually and cement my status as old-before-her-time *nix crone, but I haven't had a reason to really dig into it yet so Maven it is. And I'm back at a mostly-Java shop, so let's have some fun!

This week's goal: automating releases from our CircleCI instance. Sounds simple enough, right? Bump the version, cut a tag, publish. How hard could it be?

Well, first off, we're using git-flow, or at least we're preserving master for releases and working off a separate verify branch. Budget git-flow, if you will. That's one complication, since the release has to be tagged on master but verify also needs to be updated so the two don't diverge.

If you're familiar with Maven you may already have guessed the second complication. It's trickier. Maven doesn't work in nice, straightforward semver: Maven accepts several different versioning schemes and has a special SNAPSHOT qualifier for non-release builds. If you're working towards a 1.0 release, your version number is 1.0-SNAPSHOT. After you cut the release, you resume development with 1.1-SNAPSHOT (or 2.0-SNAPSHOT if it really needs a rework already). And so on. It's not meant to be automated, because releases are a big deal in the Maven world and you're expected to have a plan for what you're going to do next instead of reacting to whether you fixed bugs, introduced features, or broke compatibility. And honestly, there are some compelling arguments for doing it this way.

I'm not going to go into them because I'm one half of the software team by myself and they're less applicable working on proprietary stuff at this scale. So let's get to automating!

Workflow

We're using Circle v2 and its workflow feature to organize the build. Every branch gets built: verify and master get deployed to Artifactory, while release triggers its own job, which is the linchpin of the whole structure.

workflows:
  version: 2
  build-and-deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build
          filters:
            branches:
              only: /^(master|verify)$/
      - release:
          requires:
            - build
          filters:
            branches:
              only: /^release$/

Just Build

I'll be honest, I copied & pasted most of this job definition right out of the docs:

steps:
  - checkout
  - restore_cache:
      keys:
      - v1-dependencies-{{ checksum "pom.xml" }}
      # fallback to using the latest cache if no exact match is found
      - v1-dependencies-
  - run: mvn clean install
  - save_cache:
      paths:
        - ~/.m2
      key: v1-dependencies-{{ checksum "pom.xml" }}
  - persist_to_workspace:
      <<: *source

We're caching our dependencies because that's how one does it; mvn clean install is likely overkill (we probably don't need to bother with installing the artifact into the local Maven repository) but it builds and runs our tests and generates the artifact. The only really interesting part here is that we're persisting the important files to a workspace so we can recover them later -- *source refers to another YAML block with a root string and list of paths.

And Deploy

steps:
  - attach_workspace:
      at: .
  - run:
      name: Deploy to Artifactory
      command: mvn deploy

Here's where we use that workspace. Whenever this job runs, it'll reattach the file structure we saved from the build job. mvn deploy still runs all the intermediary lifecycle stages because that's how Maven rolls, but we don't need to check out the code again.

We've got our POMs set up with the artifactory-maven-plugin so all we have to do to publish is issue mvn deploy. That makes that easy, at least; there's the Artifactory CLI if you prefer, but Maven's whole deal is managing everything so as far as I'm concerned we should let it.

There's just one piece missing, though: how do we actually release a new version of the artifact and set up to begin on the next?

The Release Trigger

One of the ideas of git-flow is that when you're gearing up for a release, you cut a new branch that only contains work towards that release. This is great if you're working on multiple versions of the code simultaneously and releases can take awhile, so you might cherry-pick a bugfix from current development into a legacy release branch to ensure it doesn't affect a subset of your users. Since we're not a product company, we don't really have to worry about that. We're always working on the next release, and it drops when it's ready to drop.

This is going to get complicated. Here are the release build steps in full:

steps:
  - checkout
  - run:
      name: Cut new release
      command: |
        # assemble current and new version numbers
        OLD_VERSION=$(mvn -s .circleci/settings.xml -q \
          -Dexec.executable="echo" -Dexec.args='${project.version}' \
          --non-recursive org.codehaus.mojo:exec-maven-plugin:1.3.1:exec)
        NEW_VERSION="${OLD_VERSION/-SNAPSHOT/}"
        echo "Releasing $OLD_VERSION as $NEW_VERSION"

        # ensure dependencies use release versions
        mvn -s .circleci/settings.xml versions:use-releases

        # write release version to POM
        mvn -s .circleci/settings.xml versions:set -DnewVersion="$NEW_VERSION"

        # setup git
        git config user.name "Release Script"
        git config user.email "builds@understoryweather.com"

        # commit and tag
        git add pom.xml
        git commit -m "release: $NEW_VERSION"
        git tag "$NEW_VERSION"

        # land on master and publish
        git checkout master
        git merge --no-edit release
        git push origin master --tags

        # increment minor version number
        MAJ_VERSION=$(echo "$NEW_VERSION" | cut -d '.' -f 1)
        MIN_VERSION=$(echo "$NEW_VERSION" | cut -d '.' -f 2)
        NEW_MINOR=$(($MIN_VERSION + 1))
        DEV_VERSION="$MAJ_VERSION.$NEW_MINOR-SNAPSHOT"

        # ready development branch
        git checkout verify
        git merge --no-edit release
        mvn -s .circleci/settings.xml versions:set -DnewVersion="$DEV_VERSION"
        git add pom.xml
        git commit -m "ready for development: $DEV_VERSION"
        git push origin verify

        # clean up release branch
        git push origin :release

It's not messy, but that's... a lot of bash script. But just like any sufficiently complicated database task involves writing SQL, any sufficiently complicated ops task involves bash. Let's break it down:

Getting Version Numbers

# assemble current and new version numbers
OLD_VERSION=$(mvn -s .circleci/settings.xml -q \
  -Dexec.executable="echo" -Dexec.args='${project.version}' \
  --non-recursive org.codehaus.mojo:exec-maven-plugin:1.3.1:exec)
NEW_VERSION="${OLD_VERSION/-SNAPSHOT/}"
echo "Releasing $OLD_VERSION as $NEW_VERSION"

Note the -s .circleci/settings.xml: since Circle's just spinning up a basic OpenJDK image, we have a settings.xml checked into source control. Credentials are interpolated through environment variables, but it's still not great; at some point, I'll want to come back and create a custom Docker image to centralize our configuration.

Maven stores version numbers in the POM. We could pull them out with XPath, but since this is Maven, there's a plugin for that. The OLD_VERSION is the current value; since we're always releasing from the verify branch, this is guaranteed to be a snapshot version, and we need to strip that qualifier off to get NEW_VERSION for the release.
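The ${OLD_VERSION/-SNAPSHOT/} form in the script is a bash pattern substitution, which removes the first -SNAPSHOT it finds anywhere in the value. As a slightly stricter sketch, POSIX suffix removal only strips the qualifier from the end and leaves release versions alone. The release_version function name is made up for this example.

```shell
# Strip a trailing -SNAPSHOT qualifier; versions without one pass through unchanged.
release_version() {
  printf '%s\n' "${1%-SNAPSHOT}"
}

release_version "1.0-SNAPSHOT"   # prints 1.0
release_version "2.3"            # prints 2.3
```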

Update Versions

# ensure dependencies use release versions
mvn -s .circleci/settings.xml versions:use-releases

# write release version to POM
mvn -s .circleci/settings.xml versions:set -DnewVersion="$NEW_VERSION"

We don't have a ton of Java libraries, but there are enough that release management is (obviously) a concern. The first statement here makes sure that when we release, we aren't depending on a snapshot version of another of our libraries. The second actually sets the version field in the POM to the release version we generated just now.

You may be asking: why didn't I just alias mvn to mvn -s .circleci/settings.xml? And the answer is: I did, and spent half a day trying to figure out why it didn't work. I don't know if it's this particular image or Circle in general or what, but aliases are just ignored.
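One likely explanation, offered as an assumption about the build environment and not something confirmed here: bash only expands aliases in interactive shells unless the expand_aliases option is switched on, and CI steps run non-interactively. A shell function has no such restriction. Here is a minimal sketch, with mvn_ci and the MVN variable both made up for the example:

```shell
# MVN is overridable purely so the wrapper can be tried without Maven installed.
MVN="${MVN:-mvn}"

# Hypothetical wrapper: every call picks up the checked-in settings file.
mvn_ci() {
  "$MVN" -s .circleci/settings.xml "$@"
}

# Usage inside the release step:
#   mvn_ci versions:use-releases
#   mvn_ci versions:set -DnewVersion="$NEW_VERSION"
```

The function would have to be defined in the same command block that uses it, since each run step starts a fresh shell.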

Release!

# setup git
git config user.name "Release Script"
git config user.email "builds@understoryweather.com"

# commit and tag
git add pom.xml
git commit -m "release: $NEW_VERSION"
git tag "$NEW_VERSION"

# land on master and publish
git checkout master
git merge --no-edit release
git push origin master --tags

Since we're going to be committing code, we need to do a little more git configuration to attribute the commits properly. This is another element I could streamline with a custom build image later on.

Next, we commit the updated POM and create a tag. When we merge (with --no-edit since the script can't change the commit message), the release commit and tag will land on the master branch. Then it's just a matter of pushing to the origin.

Next Up...

We've released, but we're not quite done. If we left it here, the next release from the verify branch would run into merge conflicts since master has an updated version in the POM. To prevent that, we have to merge back into verify. Preferably with a snapshot version qualifier, because Maven.

# increment minor version number
MAJ_VERSION=$(echo "$NEW_VERSION" | cut -d '.' -f 1)
MIN_VERSION=$(echo "$NEW_VERSION" | cut -d '.' -f 2)
NEW_MINOR=$(($MIN_VERSION + 1))
DEV_VERSION="$MAJ_VERSION.$NEW_MINOR-SNAPSHOT"

I switched us over to two-part version numbers strictly out of convenience. Since Maven expects you to know what you're working towards, going from 1.0 to 1.1 is a lot more realistic than trying to suss out whether you're looking at 1.0.1 or 1.1.0 next. We can always update the version ourselves if we decide the next release should actually be 2.0, but I'm trying to minimize human involvement here.
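The increment is small enough to check in isolation. Here is a sketch of the same cut-and-add arithmetic wrapped in a function, assuming a two-part release version as input; the name next_snapshot is made up for the example.

```shell
# Given a two-part release version such as 1.0, print the next snapshot version.
next_snapshot() {
  major=$(echo "$1" | cut -d '.' -f 1)
  minor=$(echo "$1" | cut -d '.' -f 2)
  echo "$major.$(($minor + 1))-SNAPSHOT"
}

next_snapshot "1.0"   # prints 1.1-SNAPSHOT
next_snapshot "1.9"   # prints 1.10-SNAPSHOT
```

The minor number is treated as an integer and not as a decimal fraction, so 1.9 rolls over to 1.10 and not 2.0.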

# ready development branch
git checkout verify
git merge --no-edit release
mvn -s .circleci/settings.xml versions:set -DnewVersion="$DEV_VERSION"
git add pom.xml
git commit -m "ready for development: $DEV_VERSION"
git push origin verify

Merging release into verify saves us from any potential merge conflicts down the line, since the same release commit now exists both on master and in verify. The script then adds a second commit to verify with the new snapshot version and sends it all up to the origin.

# clean up release branch
git push origin :release

Finally: when a trigger goes off, it resets. We don't want the release branch to hang around long-term. If we did, we'd have to push the release commit up to the origin to avoid merge conflicts in future, and doing that would kick off an infinite loop since the release job is watching this branch. So instead we just delete it from the origin, since it's done everything it needed to do.

Setting it Off

git checkout -b release
git push origin release

That's the payoff. Whenever we're ready to drop a new version, all that has to happen is a new branch named release. You can even do it through the GitHub UI if you're so inclined, in two clicks and seven letters. Once release builds and deletes itself, the ordinary build and deploy jobs take over on both updated master and verify branches. Within a few minutes we've got a release and the first snapshot towards the next landing in Artifactory!

The Ultimate Postgres vs MySQL Blog Post

Wed, 11 Apr 2018 00:00:00 GMT

I should probably say up front that I love working with Postgres and could die happy without ever seeing a mysql> prompt again. This is not an unbiased comparison -- but those are no fun anyway.

The scenario: two applications, using Massive.js to store and retrieve data. Massive is closely coupled to Postgres by design. Specializing lets it take advantage of features which only exist in some or no other relational databases to streamline data access in a lighter, more "JavaScripty" way than a more traditional object-relational mapper. It's great for getting things done, since the basics are easy and for the complicated stuff where you'd be writing SQL anyway.... you write SQL, you store it in one central place for reuse, and the API makes running it simple.

Where Massive is less useful is if you have to support another RDBMS. This is, ideally, something you know about up front. Anyway: things happen, and sometimes you find yourself having to answer the question "what's it going to look like if we need to run these applications with light but tightly coupled data layers on MySQL?"

Not good, was the obvious answer, but less immediately obvious was how not good. I knew there were some things Postgres did that MySQL didn't, but I also knew there were a ton of things I'd just never tried in the latter. So as I got to work on this, I started keeping notes. Here's everything I found.

Schema Layout

Now that we're all basically over the collective hallucination of a "schemaless" future, arguably the most important aspect of data storage is how information is modeled in a database. Postgres and MySQL are both relational databases, grouping records in strictly-defined tables. But there's a lot of room for variation within that theme.

Multiple Schemas

First things first: "schema" doesn't always mean the same thing. To MySQL, "schema" is synonymous with "database". For Postgres, a "schema" is a namespace within a database, which allows you to group tables, views, and functions together without having to break them apart into different databases.

MySQL's simplicity in this respect is ameliorated by its offering cross-database queries:

SELECT *
FROM db1.table1 t1
JOIN db2.table2 t2 ON t2.t1_id = t1.id;

With Postgres, you can work across schemas, but if you need to query information in a different database, that's a job for...

Foreign Data Wrappers

Foreign data wrappers let Postgres talk to practically anything that represents information as discrete records. You can create a "foreign table" in a Postgres database and SELECT or JOIN it like any other table -- only under the hood, it's actually reading a CSV, talking to another DBMS, or even querying a REST API. It's a powerful enough feature that NoSQL stalwart MongoDB sneakily built their BI Connector on top of Postgres with foreign data wrappers. You don't even need to know C to write a new FDW when Multicorn lets you do it in Python!

Oracle and SQL Server both have some functionality for registering external data sources, but Postgres' offering is the most extensible I'm aware of. MySQL, besides the inter-database query support mentioned above, has nothing.

Table Inheritance

Inheritance is more commonly thought of as an attribute of object-oriented programming languages rather than databases, but Postgres is technically an ORDBMS or object-relational database management system. So you can have a table cities with columns name and population, and a table capitals which inherits the definition of cities but adds an of_country column only relevant, of course, for capital cities. If you SELECT from cities, you get rows from capitals -- they're cities too! You can of course SELECT name FROM ONLY cities to exclude the capitals. This is something of a niche feature, but when you have the right use case it really shines.

MySQL, being a traditional RDBMS, doesn't do this.

Materialized Views

Materialized views are like regular views, except the results of the specifying query are physically stored ('materialized') and must be explicitly refreshed. This allows database developers to cache the results of slower queries when the results don't have to be realtime.

Oracle has materialized views, and SQL Server's indexed views are similar, but MySQL has no materialized view support.

Check Constraints

Constraints in general ensure that invalid data is not stored. The most common constraint is NOT NULL, which prevents records without a value for the non-nullable column from being inserted or updated. Foreign key constraints do likewise when a reference to a record in another table is invalid. Check constraints are the most flexible, and allow validation of any predicate you could put in a WHERE clause -- for example, asserting that prices have to be positive numbers, or that US zip codes have to be five digits.

Per the MySQL docs: the CHECK clause is parsed but ignored by all storage engines.

JSONB and Indexing

Postgres and MySQL both have a JSON column type (MySQL replacement MariaDB does too, but it's currently just an alias for LONGTEXT) and functions for building, processing, and querying JSON fields. Postgres actually goes a step further by offering a JSONB type which processes input data into a binary format. This means it's a little bit slower to write, but much faster to query.

It also means you can index the binary data. A GIN or Generalized INverted index allows queries checking for the existence of specific keys or key-value pairs to avoid scanning every single record for matches. This is huge if you run queries which dig into JSON fields in the WHERE clause.

Default Values Defined by Functions

DEFAULT is a useful specification for columns in a CREATE TABLE statement. At the simplest level, this could be used to baseline a boolean field to true or false if the INSERT statement doesn't give an explicit value. But you can do more than set a scalar value: a timestamp can default to now(), a UUID to any of a variety of UUID-generating functions, any other field to the value returned by whatever function you care to write -- the sky's the limit!

Unless you're using MySQL, in which case the only function you can reference in a DEFAULT clause is now().

Type Differences

Layout's only part of the story, though. Equally important is the difference in type support. The benefit of a robust type system is in enabling database architects to represent information with the greatest accuracy possible. If a value is difficult or impossible to represent with built-in types, it's harder for developers to work with in turn, and if compromises have to be made to cut the data to fit then they can affect entire applications. Some types can even affect the overall database design, such as arrays and enumerations. In general, the more options you have the better.

UUIDs

Postgres has a UUID type. MySQL does not. If you want to store a UUID in MySQL, your options are CHAR, if you want values to be as human-readable as UUIDs ever are, or BINARY, if you want it to be faster but more difficult to work with manually. Postgres also generates more types of UUIDs.

Booleans

Boolean seems like a pretty basic type to have! However, MySQL's boolean is actually an alias for TINYINT(1). This is why query results show 0 or 1 instead of true or false. It's also why you can set the value of an ostensibly boolean field to 2. Try it!

Postgres: has proper booleans.

Varlena and Lengths

MySQL isn't alone in aliasing standard types in strange ways, however. CHAR, VARCHAR, and TEXT types in Postgres are all aliased representations of the same structure -- the only distinction is that length constraints will be enforced if specified. The documentation notes that this is actually slower, and recommends that unbounded text simply be defined as the TEXT type instead of given an arbitrary maximum length.

What's happening here is that Postgres uses a data structure called a varlena, or VAriable LENgth Array, to store the information. A varlena's first four bytes store the length of the value, making it easy for the database to pick the whole thing out of storage. TEXT is only one of the types that uses this structure, but it's easily the most commonly encountered.

If a varlena is longer than would fit inline, the database uses a system called TOAST ("The Oversized Attribute Storage Technique") to offload it to extended storage transparently. Queries with predicates involving a TOASTable field might not be all that performant with large tables unless designed and indexed carefully, but when the database is returning records it's easy enough to follow the TOAST pointer that the overhead is barely noticeable for most cases.

The upshot of all this, as far as most people are concerned, is this: with Postgres, you only have to worry about establishing a length constraint on fields that have a reason for a length constraint. If there's no clear requirement to limit how much information can go into a field, you don't have to pick an arbitrary number and try to match it up with your page size.

Arrays

Non-scalar values in records! Madness! Dogs and cats living together! Anyone who's worked with JSON, XML, YAML, or even HTML understands that information isn't always flat. Relational architectures have traditionally mandated breaking out any vectors, let alone even more complex values, into new tables. Sometimes that's useful, but often enough it adds complexity to no real purpose. Inlining arrays makes many tasks -- such as tagging records -- much easier.

Postgres has arrays, as does Oracle; MySQL and SQL Server don't.

Customizing Types

If the built-in types aren't sufficient, you can always add your own. Custom types let you define a value to be exactly what you want. Domains are a related concept: types (custom or built-in) which enforce constraints on values. You might for example create a domain to represent a zip code as a TEXT value which uses regular expressions in a CHECK clause to ensure that values consist of five digits, optionally followed by a dash and four more digits.

If you're using Postgres, that is. Oracle and SQL Server both offer some custom type functionality, but MySQL has nothing. You can't even use table-level CHECK constraints because the engine simply ignores them.

Enums

Enumerations don't get enough love. If I had a dollar for every INT -- or worse, VARCHAR -- field I've seen representing one of a fixed set of potential values, I probably still couldn't retire but I could at least have a pretty nice evening out. There are drawbacks to using enums, to be sure: adding new values requires DDL, and you can't remove values at all. But appropriate use cases for them are still reasonably common.

MySQL and Postgres both offer enums. The critical distinction is that Postgres' enums are proper reusable types. MySQL's enums are more like the otherwise-ignored CHECK constraints and specify a valid value list for a single column in a single table. Possible improvement on allowing a boolean column to contain -100?

Querying Data

So that's data modeling covered. There's an entire other half to go: actually working with the information being stored. SQL itself is divided in two parts, the "data definition language" which defines the structure of a database and the "data manipulation language". This latter comprises the SELECT, INSERT, and other statements most people think of when they hear the name "SQL". And just as with modeling, there are substantial differences between Postgres and MySQL in querying.

RETURNING

Autogenerating primary keys takes a huge headache out of storing data. But there's one catch: when you insert a new record into a table, you don't know what its primary key value got set to. Most relational databases will tell you what the last autogenerated key was if you call a special function; some, like SQL Server, even let you filter down to the single table you're interested in.

Postgres goes above and beyond with the RETURNING clause. Any write statement -- INSERT, UPDATE, DELETE -- can end with a RETURNING [column-list], which acts as a SELECT on the affected records. RETURNING * gives you the entire recordset from whatever you just did, or you can restrict what you're interested in to certain columns.

That means you can do this:

INSERT INTO foos (name)
VALUES ('alpha'), ('beta')
RETURNING *;

 id │ name  
────┼───────
  1 │ alpha
  2 │ beta
(2 rows)

With MySQL, you're stuck with calling LAST_INSERT_ID() after you add a new record. If you added multiple, LAST_INSERT_ID only gives you the earliest new id, leaving you to work out the rest yourself. And of course, this is only good for integer primary keys.

MySQL also has no counterpart to this functionality for UPDATEs and DELETEs. Competitor MariaDB supports RETURNING on DELETE, but not on any other kind of statement.

Common Table Expressions

Common Table Expressions or CTEs allow complex queries to be broken up and assembled from self-contained parts. You might write this:

WITH page_visits AS (
  SELECT p.id, p.site_id, p.title, COUNT(*) AS visits
  FROM pages AS p
  JOIN page_visitors AS v ON v.page_id = p.id
  GROUP BY p.id, p.site_id, p.title
), max_visits AS (
  SELECT DISTINCT ON (site_id)
    site_id, title, visits
  FROM page_visits
  ORDER BY site_id, visits DESC
)
SELECT s.id, s.name,
  max_visits.title AS most_popular_page,
  SUM(page_visits.visits) AS total_visits
FROM sites AS s
JOIN page_visits ON page_visits.site_id = s.id
JOIN max_visits ON max_visits.site_id = s.id
GROUP BY s.id, s.name, max_visits.title
ORDER BY total_visits DESC;

In the first query, we aggregate visit counts; in the second, we use DISTINCT ON on the results of the first to filter out all but the most popular pages; finally, we join both of our intermediary results to provide the output we're looking for. CTEs are a really readable way to factor query logic out, and they let you do some things in one statement that you can't otherwise.

MySQL does have CTEs! However: thanks to the RETURNING clause, Postgres can write records in a CTE and operate on the results. This is huge for application logic. This next query writes a record in a CTE, then adds a corresponding entry to a junction table -- all in the same transaction.

WITH wine AS (
  INSERT INTO wines (name, year)
  VALUES ('Herrenreben', 2015)
  RETURNING id
), reviewer AS (
  SELECT id
  FROM reviewers
  WHERE name = 'Wine Enthusiast'
)
INSERT INTO wine_ratings (wine_id, reviewer_id, score)
SELECT wine.id, reviewer.id, 92
FROM wine
JOIN reviewer ON TRUE;

Casting

Sometimes a query needs to treat a value as if it has a different type, whether to store it or to operate on it somehow. Postgres even lets you define additional conversions between types with CREATE CAST.

MySQL supports casting values to binary, char/nchar, date/datetime/time, decimal, JSON, and signed and unsigned integers. Absent from this list: tinyints, which, since booleans are actually tinyints, means you're stuck with conditionals when you need to coerce a value to true or false for storage in a "boolean" column.

Lateral Joins

A lateral join is fundamentally similar to a correlated subquery, in that it executes for each row of the current result set. However, a correlated subquery is limited to returning a single value for a SELECT list or WHERE clause; subqueries in the FROM clause run in isolation. A lateral join can refer back to information in the rest of the result set:

CREATE TABLE docs (id serial, body jsonb);

INSERT INTO docs (body) VALUES ('{"a": "one", "b": "two"}'), ('{"c": "three"}');

SELECT docs.id, keys.*
FROM docs
JOIN LATERAL jsonb_each(docs.body) AS keys ON TRUE;

 id │ key │  value  
────┼─────┼─────────
  1 │ a   │ "one"
  1 │ b   │ "two"
  2 │ c   │ "three"
(3 rows)

It can also invoke table functions like unnest which return multiple rows and columns:

CREATE TABLE multiple_arrays(arr1 int[], arr2 int[]);

INSERT INTO multiple_arrays (arr1, arr2)
VALUES
	('{1,2,3}', '{4,5}'),
	('{6,7}', '{8,9,10}');

SELECT raw.*
FROM multiple_arrays
JOIN LATERAL unnest(arr1, arr2) AS raw ON TRUE;

 unnest │ unnest 
────────┼────────
      1 │      4
      2 │      5
      3 │ (null)
      6 │      8
      7 │      9
 (null) │     10
(6 rows)

Oracle and SQL Server offer similar functionality, with the LATERAL keyword in the former and CROSS APPLY/OUTER APPLY in the latter. MySQL does not.

Variadic Function Arguments

Functions! Procedures, if you believe in making that distinction! They're great! You can declare variadic arguments -- "varargs" or "rest parameters" in other languages -- to pull an arbitrary number of arguments into a single collection named for the final argument.

In Postgres.

Predicate Operations

A handful of useful operations which allow more expressive WHERE clauses with Postgres:

IS DISTINCT FROM and its counterpart IS NOT DISTINCT FROM offer a null-sensitive equality test. Null isn't ordinarily comparable since it represents the absence of a value, so the predicate WHERE field <> 1 will not return records where field is null. WHERE field IS DISTINCT FROM 1 returns all records where field is other-than-1, including where it's null.
ILIKE is a case-insensitive LIKE operation. MySQL does have the capability for case-insensitive pattern matching, but it depends on your collation and can't be toggled on a per-query basis (the default collation is case-insensitive, to be completely fair).
~, ~*, !~, and !~* form a set of POSIX regular expression tests: match, case-insensitive match, no match, and no case-insensitive match respectively. MySQL does have REGEXP and NOT REGEXP; however, Postgres' implementation has lookahead and lookbehind.

General Database Work

That's it for the architecture and query language feature gaps I discovered. I did run into a couple other things that bear mentioning, however:

Dependencies

MySQL doesn't care about dependencies among database objects. You can tell it to drop a table that a view or proc depends on and it will go right ahead and drop it. You'll have no idea something's gone wrong until the next time you try to invoke the view or proc. Postgres saves you from yourself, unless you're really sure and drop your dependents too with CASCADE.

Triggers and Table Writes

Just the mention of triggers is probably putting some people off their lunch. They're not that bad, honest (well, they can be, but it's not like it's their fault). Anyway, point is: sometimes you want to write a trigger that modifies other rows in the table it's being activated from.

Well, you can't in MySQL.

The End?

This may have exhausted me, but I'm pretty sure it's still not an exhaustive list of the feature gaps between Postgres and MySQL. I did cop to my preference up front, but after six weeks of effort put into converting, the comparison is pretty damning. I think there could still be reasons to pick MySQL -- but I'm not sure they could be technical.

The Orchid, the Wasp, and the Test Fixture

Sun, 25 Feb 2018 00:00:00 GMT

I write a lot of integration tests that operate on data. The usual format for this is a setup function which gets the database into a particular state, a test or tests which validate the appropriate application functionality, and then a teardown function which cleans everything up so the next test suite can do its thing. There are different names and some little complexities (Mocha and AVA offer a before and a beforeEach, for example) but generally speaking this is How It's Done in every language/framework I've written tests in. This seems less a product of conscious architecture than it does a natural evolution of testing processes; nobody's* really nailed down a formal model for test data management yet.

The end result is that these setup functions, or fixtures, tend to be developed ad-hoc and inconsistently. It's not difficult to wind up with two test suites taking completely different approaches to generate what's practically speaking the same data. It gets worse when something changes and a bunch of your tests become out of date with you none the wiser until a bug report lands in your lap. I've written a lot of fixtures like that, and I want to stop.

The only solution to inconsistency is centralization: there needs to be a single source of data. If there's one place to go for fixture data, that goes a long way toward ensuring tests stay current. However, just bringing all the fixtures under one roof isn't enough. If some tests exercise carryout orders and others exercise delivery orders, the database state could be 75% identical -- but one has a phone number and a pickup time attached, the other an address and a driver. One fixture alone won't do the job, and breaking it up is backsliding towards the original problem. Centralization is only part of the solution; fixtures have to be flexible as well.

Meanwhile, in Southwestern Australia

The hammer orchid has a very specific mechanism of reproduction. Each of the species in the Drakaea genus mimics the scent (not to mention color and shape) of the female of a symbiotic species of wasp. The scent attracts male wasps, which attempt to mate with the flower only to become covered in the orchid's pollen. Eventually they give up and fly off. Enough of them proceed to fall for the same trick again, rubbing the pollen off onto a new flower, to ensure the survival of the orchids; and, presumably, enough of them find actual mates to ensure the survival of their own species.

Of course, to say the orchid tricks the wasp is a blatant anthropomorphization. The orchid may be a marvel of evolutionary architecture, but it can't think and it can't plan. It is simply following a program which requires that it become, in a certain sense -- quite literally, smell -- a wasp. An orchid which fails to be a wasp does not reproduce. The wasp, too, is an orchid when it deposits pollen on the waiting stigma of another flower.

The poststructuralists Gilles Deleuze and Felix Guattari used the orchid and the wasp to exemplify what they called a rhizome. The rhizome is an organizational model, a way of thinking about structure and process and the structure of process, which counterpoints the more familiar hierarchical or arborescent model. A corporation is a hierarchy of power which flows top to bottom; meanwhile, a labor union may have officials and bureaucracy, but these local hierarchies don't define the entire organization. Power in a union flows in many directions. There's a lot to like about the rhizomatic model, but one of its principal attributes is just what we're looking for: flexibility.

Deleuze and Guattari identify six characteristics of a rhizome in A Thousand Plateaus. The first two and the last two form closely related pairs, and each pair is considered together.

Connection and Heterogeneity

A rhizome is a crowd or cluster of different (heterogeneous) things which can be and are connected non-hierarchically. This describes a lot of technological stuff, especially distributed systems! If you're thinking of serverless applications, Cassandra, or Kubernetes clusters: that's where we're going with this.

Our data consists, at an atomic level, of records in different tables. If we consider an "initializer" function which generates one of these records as an element of a rhizome, we can compose multiple initializers to generate any data state we need to test.

An initializer looks something like this:

async (db, data) => {
  return db.drivers.insert({name: 'Taylor', license: 'abc123'});
};

Other initializers may cover the franchises table, the destinations table, and the orders table. Each is as simple as possible, generating records of one and only one type. An initializer which creates records of multiple types is a throwback to the complex fixtures we're trying to avoid.

There are always some tests that need to do something specific with the data. What happens when a driver doesn't have a license? If Taylor always has one, we can't exercise that code. We have a few options here:

Update Taylor's record to remove her license at the beginning of the "drivers without licenses get ticketed" test
Create a second driver-without-license initializer which generates a record for Taylor's hapless compatriot Tyler, sans license
Generate records for both Taylor, with a license, and Tyler, without, in the single driver initializer

There's no cut and dried answer here; the best solution depends on the situation. Here, if there's only one test that depends on having a driver without a license, I'd go with the first option. If there are several, it might be time to consider the others.

Multiplicities

A rhizome must be thought of in terms of the discrete elements which make it up, and how those elements interact with the elements of other systems. The reproduction of the hammer orchid consists of flowers and wasps, and both flower and wasp interact with things outside. Deleuze and Guattari offer a more direct example: a puppet's strings, considered as a multiplicity, are connected not to the will of the puppeteer but to another multiplicity of nerves. The puppeteer's nervous system becomes a puppet in the same way that the hammer orchid becomes a wasp.

Thinking in multiplicities inverts the question of how fixture data is set up. It's no longer about the state for this or that test, but about the ability to describe and therefore build any data state. Each test suite selects the initializer functions it requires and builds a rhizome from them. The order of invocation does matter for local hierarchies; for example, we can't create a delivery order without a driver.

I have a ContextFactory to which I can pass the names of initializer functions. This factory returns a new function which, when executed, runs the initializers in sequence and collects the records each generates, passing the current state or context into each succeeding initializer so elements in local hierarchies can create their relationships correctly. Each test suite's before function creates a new ContextFactory in the global scope:

contextFactory = await ContextFactory('franchise', 'driver', 'destination', 'delivery-order');

This example contains two local hierarchies: franchise-driver-order and destination-order. The only constraint on ordering is that nothing can appear before its dependencies; for example, we could create the destination before anything else, but delivery-order has to be created last.
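
Here's a minimal sketch of how such a factory could work. It is not the real implementation: the in-memory createFakeDb, the initializers registry, the sample records, and passing db as the first argument are all assumptions made so the example can run on its own.

```javascript
// Hypothetical in-memory stand-in for a database handle: insert() assigns an
// id and resolves with the stored record.
function createFakeDb() {
  const tables = {};
  return {
    tables,
    insert: async (table, record) => {
      const rows = tables[table] || (tables[table] = []);
      const stored = Object.assign({id: rows.length + 1}, record);
      rows.push(stored);
      return stored;
    }
  };
}

// Hypothetical initializers, one record type each. Each receives the context
// built so far so it can point at records created by earlier initializers.
const initializers = {
  'franchise': (db, ctx) => db.insert('franchises', {name: 'Main Street'}),
  'driver': (db, ctx) => db.insert('drivers', {
    name: 'Taylor', license: 'abc123', franchise_id: ctx.franchise.id
  }),
  'destination': (db, ctx) => db.insert('destinations', {address: '123 Elm St'}),
  'delivery-order': (db, ctx) => db.insert('orders', {
    driver_id: ctx.driver.id, destination_id: ctx.destination.id
  })
};

// Resolves with a function which runs the named initializers in sequence and
// collects each generated record under the initializer's name.
async function ContextFactory(db, ...names) {
  const unknown = names.filter(name => !initializers[name]);
  if (unknown.length > 0) {
    throw new Error(`unknown initializers: ${unknown.join(', ')}`);
  }

  return async () => {
    const ctx = {};
    for (const name of names) {
      // A missing dependency (say, delivery-order listed before driver)
      // throws here and aborts the whole run.
      ctx[name] = await initializers[name](db, ctx);
    }
    return ctx;
  };
}
```

Because each initializer in this sketch is asynchronous, both the factory and the function it returns have to be awaited. Listing delivery-order ahead of driver fails immediately instead of producing a half-built context.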

Asignifying Rupture

Have I mentioned that poststructuralism takes a lot of heat for impenetrable jargon? In fairness, it's difficult to establish a vocabulary to talk about things as abstract as it does, but its reputation is still deserved to a certain extent. Think of asignifying rupture as a "self-healing" capability which kicks in if one of the components of the rhizome breaks down. If a single wasp doesn't make it to a second flower, it makes little difference; there are other wasps and other flowers. Political rhizomes especially have a way of recurring even under harsh repression, as does quackgrass.

This is a useful property for distributed architectures and concurrent processing: if a Spark job has incomplete results because something took an executor offline, the cluster manager can schedule other executors to cover the missing data. But for our purposes, a breakdown means inconsistency, so this is a point of departure for us -- we're better off raising an exception and aborting.

Cartography and Decalcomania

A rhizome is "a map and not a tracing". Where the latter creates an immutable still-life representation, a map is open to interpretation, interrogation, and most importantly, modification. Maps change all the time, because what they represent is permanently in flux. Territories declare independence, are recognized or not, are annexed; borders shift, connections are made and broken, cultures and languages ebb and flow. Maps do more than merely show this information: they transfer it ("decalcomania" is a process of reproducing images, the origin of the more common and subtly different word "decal"). A border defines the understood limits of a territory; a route on an atlas becomes a route in the mind of a driver.

When the ContextFactory is invoked, it returns an object mapping initializers to the data each has created.

ctx = await contextFactory();

assert.equal(ctx.driver.name, 'Taylor');

A monolithic fixture is a tracing: it freezes a snapshot of the data model as it appeared at one point in time. The initializers, by contrast, map out our application's data model bit by bit, each piece adding more definition. If the information which makes up a driver changes -- adding a last name or whether they're on shift -- that gets added to the initializer. Every test is automatically up to date. If one breaks, that's a good thing! It means the code being exercised can't handle the new information correctly, and needs to be fixed before we can ship.

End

The rhizomatic model makes test fixtures endlessly flexible. Where monolithic fixtures multiply complexity and fall out of date with little warning, a unified, composable set of discrete fixtures keeps data generation centralized and ensures that tests that exercise related functionality use a consistent and current data set.

* The Doctrine O/RM for PHP provides a framework for loading and executing discrete centralized test fixtures, making it the only example I've seen in the wild of what I'm about to cover, if you're the kind of person who skips down to read footnotes before continuing. Anyway, score one for PHP!

Decomposing Object Trees From Relational Results

Fri, 26 Jan 2018 00:00:00 GMT

This is a feature I added to Massive recently. I had cases where I was querying views on hierarchies of multiple JOINed tables to reference data. For an example, here's a query that returns a list of wineries, some of their wines, and the grapes that go into each:

SELECT ws.id, ws.name, ws.country, w.id AS wine_id, w.name AS wine_name, w.year,
  va.id AS varietal_id, va.name AS varietal_name
FROM wineries ws
JOIN wines w ON w.winery_id = ws.id
JOIN wine_varietals wv ON wv.wine_id = w.id
JOIN varietals va ON va.id = wv.varietal_id
ORDER BY w.year;

The result set looks like this:

 id |         name         | country | wine_id |       wine_name       | year | varietal_id |   varietal_name    
----+----------------------+---------+---------+-----------------------+------+-------------+--------------------
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           6 | Cabernet Franc
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           1 | Cabernet Sauvignon
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           5 | Merlot
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           1 | Cabernet Sauvignon
  3 | Domäne Wachau        | AT      |       6 | Terrassen Federspiel  | 2011 |           7 | Grüner Veltliner
  1 | Cass Vineyards       | US      |       1 | Grenache              | 2013 |           2 | Grenache
  1 | Cass Vineyards       | US      |       2 | Mourvedre             | 2013 |           3 | Mourvedre
  2 | Bodega Catena Zapata | AR      |       3 | Catena Alta           | 2013 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       4 | Catena Alta           | 2013 |           1 | Cabernet Sauvignon

This tells us a lot: we've got two single-varietal wines from Cass, two (note the differing wine_ids) and a blend from Catena, one grüner from Wachau, and one classic Bordeaux blend from Ducasse. But while I can pick out the information I'm interested in from this result set easily enough, it's not directly usable by application code which processes the records one at a time. If I needed to use these results to drive a site which offered winery profiles and allowed users to drill down into their offerings, I'd have a rough time of it. That structure looks more like this:

├── Bodega Catena Zapata
│   ├── Catena Alta
│   │   └── Cabernet Sauvignon
│   ├── Catena Alta
│   │   └── Malbec
│   └── Nicolás Catena Zapata
│       ├── Cabernet Sauvignon
│       └── Malbec
├── Cass Vineyards
│   ├── Grenache
│   │   └── Grenache
│   └── Mourvedre
│       └── Mourvedre
├── Chateau Ducasse
│   └── Graves
│       ├── Cabernet Franc
│       ├── Cabernet Sauvignon
│       └── Merlot
└── Domäne Wachau
    └── Terrassen Federspiel
        └── Grüner Veltliner

Relational databases don't do trees well at all. This is one of the compelling points of document databases like MongoDB, which would be able to represent this structure quite easily. However, our data really is relational: we've also got "search by grape" functionality, and it's a lot easier to pick out wines which match "Mourvedre" by starting with the single record in varietals and performing a foreign key scan. It's even indexable. By comparison, to do this with a document database you'd need to look in every document to see if its varietals had a match, and that still leaves the issue of ensuring that each winery only appears once in the output. Worse, there's no guarantee someone didn't typo "Moruvedre" somewhere.

There's an easy way to generate the profile-wine-varietal tree: just iterate the result set, see if we have a new winery and add it if so, see if the wine is new to this winery and add it if so, see if the varietal is new for this wine and add it if so. It's not very efficient, but this isn't the kind of thing one does at the millions-of-records scale anyway. The bigger problem is it only works for these specific results. Next time I run into this scenario, I'll have to start from scratch. I'm lazy. I only want to have to write this thing once.
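
For reference, the one-off approach just described might look something like this sketch. The function name buildWineryTree is made up, and the sample rows are a subset of the result set above.

```javascript
// A query-specific tree builder: it knows every column name up front, which
// is exactly why it can't be reused for the next query.
function buildWineryTree(rows) {
  const wineries = [];

  for (const row of rows) {
    // New winery? Add it.
    let winery = wineries.find(w => w.id === row.id);
    if (!winery) {
      winery = {id: row.id, name: row.name, country: row.country, wines: []};
      wineries.push(winery);
    }

    // New wine for this winery? Add it.
    let wine = winery.wines.find(w => w.id === row.wine_id);
    if (!wine) {
      wine = {id: row.wine_id, name: row.wine_name, year: row.year, varietals: []};
      winery.wines.push(wine);
    }

    // New varietal for this wine? Add it.
    if (!wine.varietals.some(v => v.id === row.varietal_id)) {
      wine.varietals.push({id: row.varietal_id, name: row.varietal_name});
    }
  }

  return wineries;
}

const rows = [
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7, wine_name: 'Graves',
    year: 2010, varietal_id: 6, varietal_name: 'Cabernet Franc'},
  {id: 2, name: 'Bodega Catena Zapata', country: 'AR', wine_id: 5, wine_name: 'Nicolás Catena Zapata',
    year: 2010, varietal_id: 4, varietal_name: 'Malbec'},
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7, wine_name: 'Graves',
    year: 2010, varietal_id: 5, varietal_name: 'Merlot'}
];

const tree = buildWineryTree(rows);
```

Every find and some call is a linear scan, which is where the inefficiency comes from.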

Location, Location, Location

The first problem is determining which columns belong where in the object tree. The query result doesn't say which table a given column came from, and even if it did, that's no guarantee that it really belongs there. Meaning is contextual: a developer might want to merge joined results from a 1:1 relationship into a single object, or do more complicated things I can't anticipate.

To place each column, Massive needs a schema. Defining any kind of data model was something I'd avoided in the project for as long as possible; coming as I do from a strongly-typed background, it's almost instinctive. Strong typing, its many good points aside, is one of the reasons the object-relational mapper pattern (O/RM) dominates data access in languages like Java and C#: the requirement to map out class definitions ahead of time lends itself all too easily to creating a parallel representation of your data model as an object graph. This is the "object-relational impedance mismatch", also known as the Vietnam of computer science. You now have two data models, each subtly out of sync with the other, each trying to shoehorn data into formats that don't quite fit it. By contrast, JavaScript basically doesn't care what an object is. That lets Massive get away without any kind of modeling: it builds an API out of Tables and Queryables and Executables, but after that it's all arrays of anonymous result objects.

In an early version of this code, I automatically generated the schema based on column aliasing. The field wines__id would be allocated to an element of a collection named wines in the output. I wound up dropping this: naming conventions require significant up-front work, and if you're trying to do this to a view that already exists, it probably doesn't follow conventions I just came up with. This is poison for Massive, which is supposed to be a versatile toolkit with few expectations about your model. Providing a schema on invocation is still a non-negligible effort, but you only have to do it when you absolutely need it.

A schema looks like this:

{
  "pk": "id",
  "columns": ["id", "name", "country"],
  "wines": {
    "pk": "wine_id",
    "columns": {"wine_id": "id", "wine_name": "name", "year": "year"},
    "array": true,
    "varietals": {
      "pk": "varietal_id",
      "columns": {"varietal_id": "id", "varietal_name": "name"},
      "array": true
    }
  }
}

Each nested element defines a pk field, which we'll use to distinguish records belonging to different objects at the appropriate level of the tree. columns may be an array or an object to allow renaming (every single one of our tables has a column called name, and prefixes only make sense for flat result sets). The array flag on inner schemas indicates whether objects created from the schema should be appended to a collection or added as a nested object on the parent. We don't have any instances of the latter, but it's something you'd use for a user with a rich profile object or another 1:1 relationship.

Making a Hash of Things

Given a resultset and a schema to apply to it, our first order of business is consolidation. Chateau Ducasse only has one wine in our dataset, but since it's a cabernet sauvignon/merlot/cabernet franc blend, it shows up in three rows. And through some quirk of the sorting engine, those three rows aren't even adjacent. We'd be in trouble if we just accumulated data until the id changed -- we'd have records for a 2010 Chateau Ducasse cab franc and a 2010 Ducasse merlot/cab sauv, neither of which actually exists. If we did it really badly, we'd have two distinct Chateaux Ducasse with one imaginary wine each.

Fortunately, our schema defines a primary key field which will ensure that Chateau Ducasse is the only Chateau Ducasse; and we have hashtables. We can represent the query results as a recursively nested dictionary matching each object's primary key with its values for fields defined by the schema. Even for a relatively small data set like we have, this mapping gets big fast. This is what Chateau Ducasse's section looks like in full:

{ ...,
  "4": {
    "id": 4,
    "name": "Chateau Ducasse",
    "country": "FR",
    "wines": {
      "7": {
        "id": 7,
        "name": "Graves",
        "year": 2010,
        "varietals": {
          "1": {
            "id": 1,
            "name": "Cabernet Sauvignon"
          },
          "5": {
            "id": 5,
            "name": "Merlot"
          },
          "6": {
            "id": 6,
            "name": "Cabernet Franc"
          }
        }
      }
    }
  }
}

To generate this, we iterate over the resultset and pass each row through a function which recursively steps through the schema tree to apply the record data. For this schema, we're starting from wineries so the id 4 corresponds to Chateau Ducasse. Inside that object, the wine id 7 in the wines mapping corresponds to their 2010 Bordeaux, and so on.
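
A minimal sketch of that recursive step follows, assuming the schema format shown earlier. The build function and the RESERVED list are illustrative names rather than Massive's actual code, and the rows are the three Chateau Ducasse records from the result set.

```javascript
const schema = {
  pk: 'id',
  columns: ['id', 'name', 'country'],
  wines: {
    pk: 'wine_id',
    columns: {wine_id: 'id', wine_name: 'name', year: 'year'},
    array: true,
    varietals: {
      pk: 'varietal_id',
      columns: {varietal_id: 'id', varietal_name: 'name'},
      array: true
    }
  }
};

// Schema keys with special meaning; anything else is treated as a nested schema.
const RESERVED = ['pk', 'columns', 'array'];

// Applies one flat row to the mapping at this level of the schema, then
// recurses into each nested schema with the same row.
function build(mapping, nodeSchema, row) {
  const id = row[nodeSchema.pk];
  if (id === null || id === undefined) return; // nothing at this level for this row

  if (!Object.prototype.hasOwnProperty.call(mapping, id)) mapping[id] = {};
  const node = mapping[id];

  // columns may be an array (keep names) or an object (rename source to target).
  const columns = Array.isArray(nodeSchema.columns)
    ? nodeSchema.columns.map(name => [name, name])
    : Object.entries(nodeSchema.columns);
  for (const [from, to] of columns) node[to] = row[from];

  for (const key of Object.keys(nodeSchema)) {
    if (RESERVED.includes(key)) continue;
    if (!node[key]) node[key] = {};
    build(node[key], nodeSchema[key], row);
  }
}

const rows = [
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7, wine_name: 'Graves',
    year: 2010, varietal_id: 6, varietal_name: 'Cabernet Franc'},
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7, wine_name: 'Graves',
    year: 2010, varietal_id: 5, varietal_name: 'Merlot'},
  {id: 4, name: 'Chateau Ducasse', country: 'FR', wine_id: 7, wine_name: 'Graves',
    year: 2010, varietal_id: 1, varietal_name: 'Cabernet Sauvignon'}
];

const mapping = {};
for (const row of rows) build(mapping, schema, row);
```

Since lookups are by key, it doesn't matter whether the rows for one winery are adjacent in the result set.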

Simplify!

However, the primary key mapping is obnoxious to work with. It's served its purpose of structuring our data in an arborescent rather than a tabular form; now it needs to go away, because it's an extra layer of complexity on top of our super-simple winery-wine-varietal tree. We need to break each winery value in the outer dictionary out into its own object, recurse into each of those to do the same for their wines, and finally recurse into the wines to handle the varietals.

If this sounds really similar to what we just did, that's because it is. It's technically possible to do this in one pass instead of two, but processing the raw results into a hashtable is much, much faster than the potential number of array scans we'd be doing.

To arrive at the final format, we reduce the mapping's key list; these are the primary keys of each winery in the example dataset. The corresponding values from the mapping go in the reduce accumulator. Since we're only dealing with arrays here, the accumulator will always be an array; if we had a subobject with a 1:1 relationship, we'd use an object accumulator instead by turning array off in the schema definition. This would result in the subobject being directly accessible as a property of its parent object.
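
As a sketch, and again not Massive's actual code, the second pass could be written like this. The transform function, its asArray parameter, and the RESERVED list are illustrative; the schema is the one shown earlier and the mapping is the pk-keyed structure for Catena.

```javascript
const schema = {
  pk: 'id',
  columns: ['id', 'name', 'country'],
  wines: {
    pk: 'wine_id',
    columns: {wine_id: 'id', wine_name: 'name', year: 'year'},
    array: true,
    varietals: {
      pk: 'varietal_id',
      columns: {varietal_id: 'id', varietal_name: 'name'},
      array: true
    }
  }
};

const RESERVED = ['pk', 'columns', 'array'];

// Reduces the key list of a pk-keyed mapping into an array of objects (or a
// single object when asArray is false), recursing into nested schemas.
function transform(nodeSchema, mapping, asArray = true) {
  const children = Object.keys(nodeSchema).filter(key => !RESERVED.includes(key));

  return Object.keys(mapping).reduce((acc, key) => {
    const node = Object.assign({}, mapping[key]);

    for (const name of children) {
      node[name] = transform(nodeSchema[name], mapping[key][name] || {}, nodeSchema[name].array === true);
    }

    if (asArray) {
      acc.push(node);
      return acc;
    }

    // 1:1 relationship: the accumulator is the subobject itself.
    return Object.assign(acc, node);
  }, asArray ? [] : {});
}

const mapping = {
  2: {
    id: 2, name: 'Bodega Catena Zapata', country: 'AR',
    wines: {
      3: {id: 3, name: 'Catena Alta', year: 2013,
        varietals: {4: {id: 4, name: 'Malbec'}}},
      5: {id: 5, name: 'Nicolás Catena Zapata', year: 2010,
        varietals: {1: {id: 1, name: 'Cabernet Sauvignon'}, 4: {id: 4, name: 'Malbec'}}}
    }
  }
};

const wineries = transform(schema, mapping);
```

The root level is always treated as an array in this sketch; only nested schemas consult the array flag.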

Here's Catena:

[ ...,
  {
    "id": 2,
    "name": "Bodega Catena Zapata",
    "country": "AR",
    "wines": [ {
      "id": 3,
      "name": "Catena Alta",
      "year": 2013,
      "varietals": [ {
        "id": 4,
        "name": "Malbec"
      } ]
    }, {
      "id": 4,
      "name": "Catena Alta",
      "year": 2013,
      "varietals": [ {
        "id": 1,
        "name": "Cabernet Sauvignon"
      } ]
    }, {
      "id": 5,
      "name": "Nicolás Catena Zapata",
      "year": 2010,
      "varietals": [ {
        "id": 1,
        "name": "Cabernet Sauvignon"
      }, {
        "id": 4,
        "name": "Malbec"
      } ]
    } ]
  },
... ]

Dead simple: we've got wineries, wineries have wines, wines have varietals. Everything lines up with the real primary key values from the original query result. We've turned a raw resultset with embedded relationships into a model of those relationships. This is much easier to manage outside the relational context in client code, and it's an accurate representation of the mental model we want our users to have. The schema does add a bit of overhead, but it's contained about as well as possible. Further automation only makes it less flexible from here out.

Behind the Curve: "New" vs "Compatible" in Node.js Package Development

Fri, 22 Dec 2017 00:00:00 GMT

The pace of Node.js development has created a complicated space for growing and maintaining reusable libraries. As new features are introduced, there's a certain pressure to keep up with the latest and greatest in order to simplify existing code and take advantage of new capabilities; but there's pressure in the opposite direction too, since projects which depend on the package aren't always themselves keeping up with Node.

My main open source project is Massive.js. It's a data access library for Node and the PostgreSQL relational database. I started participating in its development back before io.js merged back into Node and brought it up to ES6, and as of right now I'm still using it in one (not actively developed) product with an old-school callback-based API. I'm also relying on it in other projects with Node 8, the latest stable release line, so I've gotten to use a lot of the newer features which have collectively made Node development a lot more fun.

Given that libraries like mine are used with older projects and on older engines, the code has to run on as many of them as is practical. It's easy to assume with open source projects that if someone really needs to do whatever it is your package does in an engine from the stone age (better known as "yesterday" in Node) they can raise an issue or submit a pull request, or worst case fork your project and do whatever they have to to make it work. But in practice, the smaller the userbase for a package the less point there is to developing it in the first place, so there's a delicate balance to strike between currency and compatibility.

Important Numbers in Node.js History

0.12: The last version before io.js merged back into Node and brought the newest version of Google's V8 engine and the beginnings of ES6 implementation with it.
4: The major release series beginning with the reintegration of io.js in September 2015. Some ES6 language features such as promises and generators become natively available, freeing those Node developers able to upgrade from "callback hell". Node also moves to an "even major versions stable with long term support, odd major versions active development" release pattern.
6: The 2016 long term support (LTS) release series rounds out the ES6 feature set with proxies, destructuring, and default function parameters. The first is a brand new way of working with objects, while the latter two are big quality-of-life improvements for developers.
8: The 2017 LTS release series, current until Node 10 is released in April 2018. The big deal here is async functions: promises turned out to still be a bit unwieldy, leading to the rise of libraries like co exploiting generators to simplify asynchronous functionality. With async/await, these promise management libraries are no longer needed.

What Maximum Compatibility Means

For a utility library like Massive, the ideal scenario for end users is one where they don't have to care which engine they're using. Still on 0.12, or even before? Shouldn't matter, just drop it in and watch it go. Unfortunately, not only does this mean Massive can't take advantage of new language features, it affects what everyone else can do with the package themselves.

The most obvious impact is with promises, which only became standard in 4.0.0. Prior to that, there were multiple independent implementations like q or bluebird, most conforming to the A+ standard. For Massive to use promises internally while running on older engines, it would have to bundle one of these. And that still wouldn't make a promise-based API useful unless the project itself integrated a promise library, since the only API metaphor guaranteed available on pre-4.0.0 engines is the callback.

Some of the most popular features which have been added to the language specification are ways to get away from callbacks. This is with good reason, although I won't go into detail here; suffice to say, callbacks are unwieldy in the best of cases. Older versions of Massive even shipped with an optional "deasync" wrapper which would turn callback-based API methods into synchronous -- blocking -- calls. This usage was wholly unsuitable for production, but easier to get off the ground with.

A Breaking Point

With the version 4 update, actively developed projects started moving toward promises at a good clip. We started seeing the occasional request for a promise-based API on the issue tracker. My one older project even got a small "promisify" API wrapper around Massive as we upgraded the engine and started writing routes and reusable functions with promises and generators thanks to co. Eventually things got to the point where there was no reason not to move Massive over to promises: anything that still needed callbacks was likely stable with the current API, if not legacy code outright.
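
A wrapper of that sort doesn't take much. The sketch below shows the general technique, not the code from my project: findDriver is a made-up stand-in for a callback-style API method, and promisify is written by hand in syntax a version 4 engine can parse.

```javascript
// Hypothetical callback-style method following the Node convention of
// callback(err, result).
function findDriver(id, callback) {
  setImmediate(function () {
    if (id === 1) {
      callback(null, {id: 1, name: 'Taylor'});
    } else {
      callback(new Error('driver ' + id + ' not found'));
    }
  });
}

// Wraps a callback-style function so it returns a promise instead.
function promisify(fn) {
  return function () {
    const self = this;
    const args = Array.prototype.slice.call(arguments);

    return new Promise(function (resolve, reject) {
      // Append our own callback as the final argument.
      fn.apply(self, args.concat(function (err, result) {
        if (err) {
          reject(err);
        } else {
          resolve(result);
        }
      }));
    });
  };
}

const findDriverAsync = promisify(findDriver);
```

Calling findDriverAsync(1) returns a promise for the driver record, which also means it can be yielded inside a co generator.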

This meant a clean break. The new release of Massive could use promises exclusively, while anything relying on callbacks would have to stay on the older version. By semantic versioning standards, an incompatible API change requires a new major version. In addition to complying with semver, releasing the promise-based implementation as 3.0.0 would allow urgent patches to be made on the existing 2.x series concurrently with new and improved 3.x releases.

Multiple Concurrent Releases with Tags

The npm registry identifies specific release series with a "dist-tag" system. When I npm publish Massive, it updates the release version on the latest tag; when a user runs npm install massive, whatever latest points to is downloaded to their system. Package authors can create and publish to other tags if they don't want to change the default (since without an alternative tag, latest will be updated). This is frequently used to let users opt in to prereleases, but it can just as easily let legacy users opt out of updates.

Publishing from a legacy branch in the code repository to a second tag means installing the most recent callback-based release is as easy as npm i massive@legacy. Or it could be even simpler: npm i massive@2 resolves to the latest release with that major version. And of course, package.json disallows major version changes by default, so there's no worries about accidental upgrades.

You can list active dist-tags by issuing npm dist-tag ls, and manage them through other npm dist-tag commands.

The One Time I Kind of Screwed Up

In July, a user reported an issue using Massive 3.x on a version 4 series engine. The version 6 stable release had been out for a while, and my active projects had already been upgraded to that for some time. The even newer version 8 series, with full async and await support, had just been released. The problem turned out to be that I'd unwittingly used default function parameters to simplify the codebase. This feature was only introduced in the version 6 release series, which meant Massive no longer functioned with version 4 engines.
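
To illustrate the kind of change involved, here are two equivalent functions. Both are hypothetical and neither is taken from Massive; the first uses default parameters and so can't be parsed by a version 4 engine, while the second does the same job in older syntax.

```javascript
// Default function parameters: fine on version 6 and up.
function findWithDefaults(criteria = {}, options = {limit: 10}) {
  return {criteria: criteria, options: options};
}

// The same behavior in syntax a version 4 engine understands. Defaults only
// apply to undefined, not null, so the checks compare against undefined.
function findCompat(criteria, options) {
  if (criteria === undefined) criteria = {};
  if (options === undefined) options = {limit: 10};

  return {criteria: criteria, options: options};
}

console.log(findWithDefaults(), findCompat());
```

Multiply the second form across a codebase and the appeal of the first is obvious, as is how easy it is to reach for without thinking about the engine.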

Fixing the issue to allow Massive to run on the older engine would be a bit annoying, but possible. However, I had some ideas in the works that would require breaking compatibility with the version 4 series anyway: proxies are not backwards-compatible, so anything using them can only run on version 6 series and newer engines. Rather than fix compatibility with an engine which was now superseded twice over only to break it again later, I ultimately decided to leave well enough alone and clarify the engine version requirement instead.

Move Slowly and Deliberately and Try Not to Break Things

The main lesson of package development on Node is that you have to stay some distance behind current engine developments in order to reach the most users. How far behind is more subjective and depends on the project and the userbase. I think Massive is fine one full LTS version back, but a contrasting example can be found in the pg-promise driver it uses. Vitaly even goes as far as allowing non-native promise libraries to be dropped in, which hasn't strictly been necessary since 2015 -- unless you're stuck on an engine from before the io.js merge, which users of a more general-purpose query tool seem more likely to be.

Following semantic versioning practices not only ensures stability for users, but also makes legacy updates practical -- just check out the legacy branch, fix what needs fixing, and publish to the legacy tag instead of latest. One new feature and a couple of patches actually have landed on Massive v2 so far, but it's generally been quiet.

Having a clearly-defined standard for versioning has also helped manage the pace of continued development better: figuring out when and how to integrate breaking changes to minimize their impact is still tough, but it's vastly preferable to holding off on them indefinitely.

A Unified Multi-Tenant User Cache with PostgreSQL

Sun, 10 Dec 2017 00:00:00 GMT

I've been working on a multitenant Node.js product which recently moved its authentication into a Single Sign-On (SSO) system. With PostgreSQL we were able to structure and retrieve user data efficiently through an interesting combination of uncommon or unique database functionality:

Foreign data wrappers (FDWs)
Table inheritance
Materialized views

Foreign Data Wrappers

When we began redesigning the application's user infrastructure we wanted to avoid maintaining a copy of user data independent from the chosen SSO system, Keycloak. We knew we could represent data from other sources through a foreign data wrapper. This is (so far as I know) a feature unique to Postgres: it lets you represent data in other sources as tables by implementing a standard connecting API.

The bad news: Postgres is written in C, and while I could probably brush up on pointers and make it work eventually, higher-level languages have spoiled me. Fortunately, there's a project which enables FDW development in Python: Multicorn. With my coworker's efforts on foreign-keycloak-wrapper, that got us as far as being able to create a table representing a particular "realm" or user organization in Keycloak (we kept our organizations table in order to have referential integrity in our data ownership) and retrieve its users through the Keycloak API.

CREATE SERVER "myrealm_server"
FOREIGN DATA WRAPPER multicorn
OPTIONS (
  wrapper 'keycloak.Keycloak',
  url 'url to the Keycloak instance',
  username 'a realm admin username',
  password 'a realm admin password',
  realm 'myrealm',
  client_id 'a realm client id',
  grant_type 'password',
  client_secret 'a realm client secret',
  organization_id 'id of an entry in the organizations table'
);

CREATE FOREIGN TABLE "myrealm" (
  id uuid,
  username text,
  "firstName" text,
  "lastName" text,
  email text,
  organization_id uuid
) SERVER "myrealm_server";

SELECT * FROM myrealm;

We'd still have to create a foreign server and table for each realm, since the Keycloak API only retrieves users per realm by design. In a multitenant system, we want bootstrapping new organizations to be easy and automated on the backend. Here Keycloak has the ability to export realm connection information as JSON, which gives us everything required to issue the CREATE SERVER and CREATE FOREIGN TABLE statements on the fly. So, while it's possible to pull user information from new realms after creation, it will always be separated by realm. We didn't want to have to figure out which table to pull from in the JavaScript API -- best to keep that as straightforward as possible and manage data complexity in the database. It's what it's there for.

Table Inheritance

Table inheritance is another feature unique to Postgres among the four major RDBMSs. Setting up a base users table and declaring that the myrealm table INHERITS (users) accomplishes two things:

First, myrealm builds on top of users' column list. This mostly makes the CREATE FOREIGN TABLE statement shorter (it's also optional), since we have no new columns to add as long as the base users schema conforms to the Keycloak API contract.

Second, myrealm's data can be accessed through users with a simple SELECT. In fact, this is the default behavior, and a SELECT must specify FROM ONLY users in order to omit rows from descendant tables.

CREATE TABLE "users" (
  id uuid,
  username text,
  "firstName" text,
  "lastName" text,
  email text,
  organization_id uuid
);

CREATE FOREIGN TABLE "myrealm" (
) INHERITS ("users") SERVER "myrealm_server";

SELECT * FROM users;

The SELECT now combines information from every active realm, so for higher-level APIs the only question is one of ensuring the requesting user is authorized to see the information retrieved. This is exactly the way we had it with the local users table, so we've already got that authorization infrastructure in place and overall impact to the rest of the application is minimal.
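Bootstrapping a new realm then boils down to generating two statements. Here's a rough sketch of how that could look from a shell script. The realm_ddl function and the KEYCLOAK_* variables are hypothetical names made up for illustration, and the values are interpolated without any escaping, so this is only suitable for trusted input.

```shell
# Hypothetical helper: prints the DDL for one realm. The KEYCLOAK_* variables
# are assumed to hold the connection details exported from Keycloak.
# Values are NOT escaped; only feed this trusted input.
realm_ddl() {
  realm="$1"
  organization_id="$2"
  cat <<SQL
CREATE SERVER "${realm}_server"
FOREIGN DATA WRAPPER multicorn
OPTIONS (
  wrapper 'keycloak.Keycloak',
  url '${KEYCLOAK_URL:-}',
  username '${KEYCLOAK_ADMIN_USER:-}',
  password '${KEYCLOAK_ADMIN_PASSWORD:-}',
  realm '${realm}',
  client_id '${KEYCLOAK_CLIENT_ID:-}',
  grant_type 'password',
  client_secret '${KEYCLOAK_CLIENT_SECRET:-}',
  organization_id '${organization_id}'
);

CREATE FOREIGN TABLE "${realm}" (
) INHERITS ("users") SERVER "${realm}_server";
SQL
}

# Usage: realm_ddl myrealm "$ORGANIZATION_ID" | psql -v ON_ERROR_STOP=1
```

Since the foreign table inherits from users, nothing else has to change for the new realm's rows to show up in SELECT * FROM users.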

Materialized Views

The Keycloak server being separate from the application database server means longer roundtrip time in any query involving user records. There are some advantages to having the data stored locally, after all! However, the real problem isn't in having the data but in ensuring it stays current: what we need is a cache. A materialized view is exactly that.

Materialized views are found in Postgres, SQL Server, and Oracle. If you use MySQL, you're out of luck (but then, if you're following this whole thing, it's Postgres or bust anyway). It's defined just like a regular view; the important difference is the MATERIALIZED between CREATE and VIEW, which indicates that the results of the view query are to be stored until refreshed. The stored results can be indexed just like tables, too.

CREATE MATERIALIZED VIEW cached_users AS
SELECT * FROM users;

If we add a new realm and its foreign table, or if information inside an existing realm changes (such as if we see a previously-unknown user try to log in), we can REFRESH MATERIALIZED VIEW CONCURRENTLY cached_users; to... refresh the cache. The CONCURRENTLY means the refresh doesn't lock the view against reads, so SELECTs happening while the data is being retrieved see the old version; Postgres only allows it when the materialized view has at least one unique index. It's not staying as close to realtime as possible; we could do that with cron or a systemd timer if we really wanted to, but for our purposes refreshing on new organizations being created or unknown users authenticating suffices.
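For a sense of what triggering that from a script might look like, here's a minimal sketch. The function names, the index name, and the PSQL override are all hypothetical; it also assumes Keycloak user ids are unique across realms, since a concurrent refresh needs a unique index on the view. Connection details are left to the usual PG* environment variables.

```shell
# Hypothetical helpers; PSQL can be overridden, e.g. with a stub for testing.
PSQL="${PSQL:-psql}"

# One-time setup: a concurrent refresh requires a unique index on the view.
# Assumes user ids never collide between realms.
ensure_cache_index() {
  "$PSQL" -v ON_ERROR_STOP=1 -c \
    'CREATE UNIQUE INDEX IF NOT EXISTS cached_users_id_idx ON cached_users (id);'
}

# Rebuild the cache without blocking readers of the old contents.
refresh_user_cache() {
  "$PSQL" -v ON_ERROR_STOP=1 -c \
    'REFRESH MATERIALIZED VIEW CONCURRENTLY cached_users;'
}

# Usage: ensure_cache_index && refresh_user_cache
```

The same refresh_user_cache call could just as well be wired to cron or a systemd timer if a regular schedule turned out to be necessary.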

Wrap-up

Overall this has added some complexity to our database setup. We're no longer running stock Postgres since both Multicorn and foreign-keycloak-wrapper must be installed. Discrepancies between Python versions bundled with Postgres on various operating systems have also caused some issues -- universally resolved with a careful inspection of the Postgres configuration and the install logs, but annoying. Docker's taken some of the pain out of that, since we can ship an image with everything ready to go and use volumes to persist data.

Lastly, realms have to be created in Keycloak before we can do anything with them, so there are more moving parts to keep track of. Oh well; we have our unified user cache so the application logic stays simple, and that's all we wanted out of it. SSO is supposed to make life easier for users, not necessarily for architects!

Automate Your Way to Self-Assembling Documentation

Fri, 01 Dec 2017 00:00:00 GMT

Documentation is what makes it possible for people to use your software without having to put in almost as much work to understand it as you did to write it. It's also one of the dreariest chores of maintaining code, the kind of housekeeping work programmers are notoriously averse to. I'm no exception to that rule, but at the same time I run a moderately popular library, Massive.js, which absolutely needs docs if it's to be useful to anyone else on the planet. So in the spirit of Larry Wall's first virtue, I've gone to considerable lengths to do as little as possible about it.

What is Documentation?

Documentation has taken many forms over the years, from actual dead-tree books to man pages to API documentation sites generated from specially formatted comments and everything in between. There are various advantages and disadvantages to each: anything else beats the book in terms of searchability, but if you need a more structured introduction to something, or are working behind an air gap, books absolutely have their place. Format is something of an independent concern.

A more important question is: what makes documentation good? This is naturally subjective, but a few basic principles make sense:

Good documentation is current: new features and changes are documented at the time they're integrated, and documentation for the latest release is always up-to-date
Good documentation is complete: it covers every notable API function, configuration setting, option, and gotcha in the system that end users can expect to deal with
Good documentation is readable, even -- especially -- for people with limited experience (they need it more than the experts will!)
Good documentation takes as little time and effort to maintain as possible without sacrificing too much of the above three

Since the only ways to get Massive are from npm or from GitHub, it's a fairly safe assumption that anyone who needs the documentation will be online. This makes things easier: I can provide documentation as a static site. By "static", I don't mean that it's eternally unchanging, but that it's just plain HTML and CSS, maybe a little JavaScript to liven things up a bit. There's no database, no backend API, no server-side processing.

Full Automation

The absolute easiest way to get something up is to use a documentation generator. These have been around for ages; perldoc and JavaDoc are probably the best-known, but JSDoc has existed for almost 20 years too. With it, I can decorate every function and module with a comment block containing detailed usage information, then run a program which assembles those blocks into a static website.

The JSDoc comment blocks, like JavaDoc, are indicated by a /** header. This one shows a function, with @param and @return tags indicating its arguments and return value respectively. Other tags cover attributes of modules and classes, or provide hints for the JSDoc compiler to change how it organizes pages (distinguishing entities can be tricky in a language like JavaScript!).

/**
 * Perform a full-text search on queryable fields. If options.document is true,
 * looks in the document body fields instead of the table columns.
 *
 * @param {Object} plan - Search definition.
 * @param {Array} plan.fields - List of the fields to search.
 * @param {String} plan.term - Search term.
 * @param {Object} [options] - {@link https://massivejs.org/docs/options-objects|Select options}.
 * @return {Promise} An array containing any query results.
 */
Queryable.prototype.search = function (plan, options = {}) {

I don't need a complicated .jsdoc.json config for this:

{
  "source": {
    "include": ["index.js", "lib", "README.md"]
  },
  "opts": {
    "recurse": true
  }
}

All that's left is to add a script in my package.json to run JSDoc:

"docs": "rm -rf ./docs/api && jsdoc -d ./docs/api -c ./.jsdoc.json -r"

Now npm run docs generates a fresh API documentation site -- all I have to do is keep my comment blocks up to date and remember to run it!

There are two problems with this picture:

First, that particular bit of documentation raises as many questions as it answers. What are document body fields? I'm just assuming people know what those are. And the description of the options object is -- well, that's getting a bit ahead of myself. Queryable.search doesn't exist in a void: in order to understand what that function does, a developer needs to understand what the options object can do and what documents and their body fields are. That's a lot to dump into a single JSDoc comment. Especially when you consider that the options object applies to most of Massive's data access functions, many of which concern documents! Clearly, I need a second level of documentation which serves as a conceptual rather than a purely technical reference. But: I can't generate something like that automatically.

Second, I have to remember to run it. It's a one-line shell script. I shouldn't have to remember to run it. Let's get that one out of the way first:

Lifecycle Events

Several npm tasks provide hooks for you to execute scripts from your package.json before or after execution. Some, like npm test, require you to implement the task itself as a script. One such task with hooks is npm version. The preversion script executes before it bumps the version number; the version script executes after the bump, but before it commits the changed package definition into source control; and the postversion script executes after the commit.

I really only have to make sure the API documentation is up to date when I'm releasing a new version. Running JSDoc in preversion is perfect. If I want to keep the documentation update separate from the version bump, I can just put together a shell script that runs in the hook:

#!/bin/bash

echo "regenerating API docs"

npm run docs

echo "committing updated API docs"

git add docs/api

git commit -m "regenerate api docs"

Conceptual Reference: Jekyll and GitHub Pages

JSDoc is a great tool, but it can't introduce and connect the concepts users need to understand in order to work with Massive. The only way that's happening is if I write it myself, but I don't want to write raw HTML when I could work with the much more friendly Markdown instead. Fortunately, there's no shortage of static site generators which can convert Markdown to HTML. I use Fledermaus for my blog. Or I could use ReadTheDocs, a documentation-focused generator as a service, again. That's where the legacy docs are already hosted. But it's pretty much just me on Massive, so I want to centralize. GitHub Pages uses Jekyll; that makes that an easy decision.

I think the hardest part of using Jekyll is deciding on a theme. Other than that, the _config.yml is pretty basic, and once I figure out I can customize the layout by copying the theme's base to my own _layouts/default.html and get the path to my stylesheet straightened out, all that's left is writing the content.

Pages in a Jekyll site, like articles on dev.to and (probably) other platforms, are Markdown files with an optional "front matter" section at the top of the file (the front matter is required for blog posts).

Seeing what the documentation looks like locally takes a few steps:

Install Ruby via package manager
gem install bundler
Create a Gemfile which pulls in the github-pages Ruby gem
bundle install
Then, unless I add more dependencies to the Gemfile, I can bundle exec jekyll serve and point my browser to the local address Jekyll is running on

At this point, I have a docs/ directory in my working tree:

docs
├── api                 # JSDoc output
├── assets
│   └── css
│       └── style.scss  # Jekyll handles processing SCSS
├── _config.yml         # Main Jekyll config
├── Gemfile             # Jekyll dependency management
├── Gemfile.lock        # Auto-generated Jekyll dependency manifest
├── index.md            # Documentation landing page
├── _layouts
│   └── default.html    # Customized HTML layout template
├── some-docs.md        # Some documentation!
└── _site               # Jekyll output (this is .gitignored)

GitHub Pages can host an entire repository from the master branch, a docs directory in master, or a separate gh-pages branch. While I do have a docs directory, I don't want my documentation to update every time I land a commit on master. Massive's docs need to be current for the version of the library people get from npm install, not for every little change I make. So I create a gh-pages branch, clean it out, and copy my docs directory into the root (minus _site since GitHub Pages runs Jekyll itself). The JSDoc output is included so the static site is complete, containing both the conceptual and the technical references.

After pushing and a bit of trial and error, I have the site up and working! But I really, really don't want to have to do all this manually every time I cut a release.

Automating Documentation Management

My script for the preversion lifecycle event lets me basically ignore the JSDoc as long as I keep it up to date. If I can script out the steps to update the gh-pages branch, I can use another lifecycle event to take the work out of managing the rest of it. Since everything's happening in another branch, kicking off after the version bump with postversion is sufficient.

First things first: what version am I updating the docs for? That information is in a couple of places: I could look for the latest git tag, or I could pull it out of package.json. Which to use is mostly a matter of taste. I'm pretty familiar with jq (think sed for JSON), so I go with that over git describe:

type jq >/dev/null 2>&1 && { VERSION=$(jq -r .version package.json); } || exit 1

This line first ensures that jq exists on the system. If it does, it sets the VERSION variable to the version field in package.json; otherwise, it aborts with a failing error code to stop execution.
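If the one-liner is a bit dense, the same guard can be spelled out as a small function. This is only a sketch: require_cmd is a hypothetical name, not something my script actually defines, and it uses command -v rather than type to do the lookup.

```shell
# Hypothetical helper: succeeds if the named command is on the PATH,
# otherwise complains on stderr and returns a failing status.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required command: $1" >&2
    return 1
  fi
}

# Usage in a postversion script:
#   require_cmd jq || exit 1
#   VERSION=$(jq -r .version package.json)  # -r prints the raw string, without JSON quotes
```

Returning instead of exiting inside the function leaves the decision to abort with the calling script.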

The next step is to get the current branch name and the commit SHA for the version bump:

BRANCH=$(git symbolic-ref --short HEAD)
COMMIT=$(git rev-parse --short "$BRANCH")

Then, it's time to git checkout gh-pages and get to work. I want to make sure no old files are present in the working tree, but I do have a customized .gitignore that I need to keep.

git clean -dfqx
git ls-tree --name-only gh-pages | grep -v "\(.gitignore\)" | xargs -I {} rm -r {}

git clean deletes all untracked files from the working tree. Then I git ls-tree the branch's root directory, perform an inverse grep to filter out my .gitignore, and pass every other file in it into rm -r with xargs. At the end of this, the working tree should be completely empty except for the .gitignore. Now to pull the up-to-date documentation over from the original branch:

git checkout "$BRANCH" -- docs

mv docs/* .

rm -r docs

Fairly straightforward: it checks out only the docs directory, moves its contents into the working tree root, and cleans up the now-empty directory. This is the home stretch.

git add .

git commit -m "regenerate documentation for $VERSION ($BRANCH $COMMIT)"

git checkout "$BRANCH"

Add the files, commit them with the new version number and source commit information. Then with that all done, checkout the original branch again. I could push gh-pages, but I'm a little paranoid about automating uploads, so my script just echoes a reminder to do that manually.

This all goes in another shell script and then I just have to make sure that that script runs on postversion!

Start to Finish

Now, when I npm version to create a new release of Massive, my scripts fire on the lifecycle events. The preversion script updates my API documentation and commits it before anything else happens. The standard version functionality takes over at that point, setting the new version in package.json, committing the change, and tagging it with the new version. Finally, my postversion script assembles the latest documentation and commits it to the gh-pages branch. The only thing left for me to do manually is to push that branch along with master and the new tag. As long as I keep my JSDoc comments and reference documentation up to date, the rest of it takes care of itself!

Cluster Organization in Docker Compose

Wed, 22 Nov 2017 00:00:00 GMT

I'll make a long story short here: this time last year, I knew nothing about containers or orchestration save that Vagrant had sounded like a cool idea but hadn't done much for me in practice. But we had an architect who did know more, and who set up our applications with a really quite fancy Kubernetes- and Docker-based build and deploy system (more on that some other time, perhaps). Our dev and QA environments became Kubernetes clusters, I started learning how it all worked, things were good. Then he moved on, making me and one other coworker who knew around as much as I did the de facto experts on everything cloud here. Oops.

One thing neither the architect nor I had anticipated was that many of our enterprise clients turned out not to be on board with Kubernetes. At all. Some of them aren't even comfortable with Docker period, but there's not much to do on that count except wait. For the rest, we decided that orchestration made things so much easier we were going to do it wherever we could get away with it, so we needed to have Docker Compose definitions ready to go.

I did inherit some basic Compose configs, but they were badly out of date; in the interim, we'd added a couple of Postgres extensions, integrated a single sign-on service, done a bunch of restructuring -- you know how it goes. So I wound up going back to the drawing board for all the most complicated bits. And in the process I found out that there were some things I'd taken for granted with Kubernetes that I couldn't with Compose.

A Rocky Start

Like jobs. I was about to miss jobs a lot.

In Kubernetes, jobs let you run one-off tasks. We use this functionality to stand up the database, run migrations, and seed initial data for the dev environment since we regenerate that every day. It works alright: deployment pods bounce off until the database comes up initialized. If something unexpected happens, kill the pod and Kubernetes starts another for you. So far, so good.

Docker Compose doesn't do that. In Docker Compose, things that start up are supposed to stay up, or be replaced if they don't stay up. This was a problem. I was looking for a way to issue a single docker-compose up and have a brand new cluster with all the complicated once-off init stuff done for me. It'd be easy to expand the application server image entrypoint to do all the initialization, but each cluster runs two or three of those behind a load balancer, so just doing that could have inconsistent results from the spinup code firing multiple times.

Broken down, here's everything that needs to happen between the application services and the database when the cluster comes up, in order:

If Postgres isn't running yet, don't start any app services.
If the application database roles do not exist, create them.
If the application database does not exist, create it.
Deploy any outstanding migration scripts to the database.
If the database infrastructure for single sign-on does not exist, create it.
Deploy any new content to the database.
Update the locale files for new content in each application container.
Apply configuration to each application container.
Start the application services.

Everything up through #6 needs to happen once and only once. But with Docker Compose, we don't have one-offs. So it all has to go in the entrypoint script, or near enough; we just have to make sure only one of the application services can execute the sensitive parts, and that its peers wait for that to happen before they spin up.

Scheduling Startup

The first problem to solve is making sure nothing tries to come up until the database is there. With Kubernetes, we use init containers for this: both the setup job and the application server deployment declare an init container which tries to select 1 every few seconds until it succeeds. Docker Compose doesn't have anything like that to my knowledge; the most it does is generate a dependency graph from your links and depends_on and a couple other service attributes. This ensures that services are started in a particular order, but since Postgres takes a couple seconds to come up the dependent containers could in fact finish their startup before it's ready.

The way to ensure nothing tries to talk to Postgres until it's good and ready is to wrap the startup command. The Docker Compose documentation recommends a few options; I went with wait-for-it. It looks like this in the Compose config:

    entrypoint: ["./wait-for-it.sh", "postgres:5432", "--", "bash", "./entrypoint.sh"]

Our entrypoint.sh is not run unless and until the Postgres container starts listening on its default port 5432. That's great, but there's one other thing that makes this really useful: since we already have multiple application services defined (Swarm isn't guaranteed so we can't set replicated mode), we can pick one of those to wait for Postgres to come up, and have the rest wait for it to come up in turn. That's our init container.
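The waiting itself is nothing exotic. Here's a minimal sketch of the kind of retry loop such a wrapper boils down to; it is not how wait-for-it is actually implemented (that checks a TCP port), and wait_for is a hypothetical name. It reruns an arbitrary command until it succeeds or the attempts run out, much like our init containers do with select 1.

```shell
# Hypothetical helper: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Reruns COMMAND until it succeeds; gives up after ATTEMPTS failed tries.
wait_for() {
  wf_attempts="$1"
  wf_delay="$2"
  shift 2
  wf_tries=0
  until "$@"; do
    wf_tries=$((wf_tries + 1))
    if [ "$wf_tries" -ge "$wf_attempts" ]; then
      echo "gave up after $wf_attempts attempts: $*" >&2
      return 1
    fi
    sleep "$wf_delay"
  done
}

# Usage: wait_for 30 2 psql -h postgres -U postgres -c 'select 1'
```

The hostname postgres matches the Compose service name; the postgres user in the usage line is an assumption.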

Secrets

At this point we can ensure that nothing that depends on the database will start up before the database is ready, and that one of our app services will always finish its startup before any others begin theirs. What we need now is a way to distinguish that service from the others so it can execute our once-only tasks. That's where secrets come in.

Secrets are basically the same concept between Kubernetes and Docker Compose: files containing sensitive data which get loaded onto nodes and mounted to the container filesystem. It's more secure than using environment variables. Secrets are defined as a top-level block in the compose config:

secrets:
  db_owner_password:
    file: ./secrets/db_owner_password.txt

And then attached to each service definition:

  appserver:
    image: myimage
    links:
      - postgres
    secrets:
      - db_owner_password
    entrypoint: ["./wait-for-it.sh", "postgres:5432", "--", "bash", "./entrypoint.sh"]

The dependent application services don't need the db_owner_password; that's only required to initialize the database. So we can test for the presence of the secret in our entrypoint script, and kick all that off only if it's present:

if [ -e /run/secrets/db_owner_password ]; then
  # check and create the application roles and database, then run the migrations
  : # no-op placeholder so the block parses until the real commands go here
fi

Now the appserver service is unique, and we've restricted the ability to stage the database to it. We can't be completely careless -- if appserver blindly emits a createdb every time it starts, it'll fail with a "database already exists" error every time after the first -- but since we've guaranteed there will only ever be one container trying to create the database at a time, we can simply check up front.
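Checking up front could look something like the following. It's a sketch under a few assumptions: the function names, the PSQL and CREATEDB overrides, and the postgres superuser are all illustrative, and the database name is interpolated into the query unescaped, so it should only ever be a trusted constant.

```shell
# Hypothetical helpers; PSQL and CREATEDB can be overridden for testing.
PSQL="${PSQL:-psql}"
CREATEDB="${CREATEDB:-createdb}"

# Succeeds if the named database is listed in the pg_database catalog.
database_exists() {
  # -t and -A drop headers and alignment, leaving just "1" or nothing
  [ "$("$PSQL" -h postgres -U postgres -tAc \
    "SELECT 1 FROM pg_database WHERE datname = '$1'")" = "1" ]
}

# Creates the database only when it isn't there yet.
ensure_database() {
  if database_exists "$1"; then
    echo "database $1 already exists, skipping createdb"
  else
    "$CREATEDB" -h postgres -U postgres "$1"
  fi
}

# Usage, inside the secret check in entrypoint.sh: ensure_database myapp
```

The same pattern works for roles by querying pg_roles instead of pg_database.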

That leaves the shared configuration and content, which together are more than secrets are meant to deal with.

Volumes

Mounting information from the host system into containers is a pretty general use case. Secrets cover a specific subset of this. For everything else, there's volumes (and again, Kubernetes' version of the concept is a pretty close analogue). Since volumes can be much larger than secrets, they aren't automatically shared across nodes; you have to create a named volume explicitly, and use a driver which is multi-host aware.

Declare named volumes for config and content:

volumes:
  app_conf:
    driver: local # this is obviously not multi-host aware, but it's good enough for testing
  app_content:
    driver: local

Then in the appserver service definition, add a volumes block:

volumes:
  - app_conf:/home/appserver/app/conf
  - app_content:/home/appserver/app/content

Docker Compose will create the volumes if they do not exist, or you can docker volume create them ahead of time. Better to do the latter: otherwise, the first time you bring up the cluster everything will die horribly because the volumes are empty. If you create them manually, you can docker volume inspect them, find the mountpoint on the host system, and copy the instance configuration and content in before you start spinning things up.

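One caveat: the names app_conf and app_content are not actually the names Docker Compose looks for. Compose prefixes the names you supply with the project name, which defaults to the name of the directory containing the Compose file. With the config in a directory named docker, the volumes should be named docker_app_conf and docker_app_content.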
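Creating and seeding a volume by hand can be scripted along these lines. Treat it as a sketch: seed_volume and the DOCKER override are hypothetical names, the source directories in the usage line are placeholders, and writing into the mountpoint usually requires root on the host.

```shell
# Hypothetical helper; DOCKER can be overridden, e.g. with a stub for testing.
DOCKER="${DOCKER:-docker}"

# seed_volume VOLUME SOURCE_DIR
# Creates the named volume if needed, then copies SOURCE_DIR's contents into it.
seed_volume() {
  sv_volume="$1"
  sv_source="$2"
  "$DOCKER" volume create "$sv_volume" >/dev/null
  # Ask Docker where the volume lives on the host filesystem
  sv_mountpoint="$("$DOCKER" volume inspect --format '{{ .Mountpoint }}' "$sv_volume")"
  # The trailing /. copies the directory's contents, dotfiles included
  cp -R "$sv_source"/. "$sv_mountpoint"/
}

# Usage:
#   seed_volume docker_app_conf ./conf
#   seed_volume docker_app_content ./content
```

This only makes sense with the local driver, where the mountpoint is a plain directory on the same host.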

The End

Start to finish, it took me a few weeks to get my first real Compose cluster set up. It's rough getting started, even though the Docker and Compose documentation is quite good; it's a lot to wrap your head around, and there are a lot of concepts you really just have to sort of brute force your way into understanding. I had a lot of other stuff on my plate at the time (still do!), which certainly didn't help matters either.

The good news is, yesterday I had to set up another app with a similar Compose configuration from scratch. This time, I had it up and running within a couple hours. Once you've got the structure down and understand how the pieces fit together it's a lot more manageable.