PGSQL Phriday #004: Scripting in the Industrial Age

While we are concentrating on our task, both our tools and our materials merge into one entity of perception which gives us feedback about state and progress of our work.... Software tools generalize our ways of handling aspects of the world around us; they organize our actions and condense them into gestures.

— Reinhard Budde & Heinz Züllighoven, Software Tools in a Programming Workshop

An internal combustion engine isn't a tool but a car can be, although it remains in the immediate material sense something else, a system demanding full bodily integration: you climb in, close the door, buckle your seatbelt, insert and turn the key or press the brake and ignition, move your hand from gearshift (tool for adjusting torque) to wheel (tool for sensing and adjusting orientation). You can do other things while driving, talk or listen or think, and choose how much of your attention and motion to divert to other tasks, some involving yet other tools. But driving itself represents an attentional and physical restriction of some variable but never-zero degree. The car is your pair of seven-league boots, a tool for working with time and distance, at the same time as it's a physical machine you're strapped into and which you successfully use only by altering your own thinking, perception, even your sense of your own mass, shape and size, velocity, and inertia.

Budde and Züllighoven on machines in this sort of activity-theoretical sense:

machine is repeatable motion which is abstracted from its specific context and cast into construction.... [It] incorporates and reproduces the mechanical reproduction of human activities. It thus decontextualizes human activities.

You bring your tools to the work; you take your work to the machine. Machines may afford their operators the power of many tools and the speed of automation and sometimes parallelization, but also embody a fixity of purpose, an integrated way of conceiving and acting on the work materials that resists or forbids working in other ways. Tools and machines can even have intersecting domains: a pneumatic wrench is a machine component, fixed in place by its air hose, but does the same job as an ordinary manual wrench. When you have a sufficient number of things to loosen or tighten, other goals to accomplish related to the loosening or tightening, and/or reasons to guarantee a minimum torque, it's worth the extra effort of moving the work as well as the worker.

The classic example, which Budde and Züllighoven mention in passing on the way to a broader point about how automation makes humans and machines interchangeable, is the fixed station of an assembly-line worker, who may pick up and put down physical tools in the course of producing outputs from the inputs they're provided (this is, naturally, at the scale of individual human beings: the assembly line itself is a machine producing cars from the outputs of other materials-processing machines, the auto company a machine generating capital from labor and extracting surplus value, the stock market a machine redirecting flows of capital, each hosting masses of people whose activity is governed and directed not by their own desires but by the social and physical construction of the line, the office, the trading floor). Integrated development environments are, like cars, a more ambiguous case. They're tools at the scale or in the context of software systems writ large, while in that of computer use they're machines you enter into, requiring nearly full mental or attentional integration to repeat the motions of type-build-test-package-run-debug, operating their constituent smaller machines and tools.

While using software in our work, we wish to handle it like a tool; but while constructing it, we wish to design its parts like a machine.

In database work, Microsoft's SQL Server Management Studio illustrates this tension well. SSMS is an IDE for database administrators, architects, and analysts alike, and hews to about the same general outline as any other: here's your left-side tree with context menu upon context menu of tools for managing your tables and views and functions and roles, here to edit the definition, there to print out DDL; here's an SQL-interpreting machine, we'll put results or errors at the bottom. From the distance at which the DBA is forty person-hours and the database is a cylinder on an architecture diagram, SSMS is itself a tool for designing and detailing what that cylinder represents, but the activity of using it is machinic: you go to SSMS to perform database work. Before its release with SQL Server 2005, you would go to Enterprise Manager to define the schema and administer the server, or to Query Analyzer to write and run SQL. SSMS integrated both predecessor machines into a more consistent whole.

And SSMS was great! Even with one foot in application development I usually had both its predecessors open alongside Visual Studio anyway. The more interesting part of this historical digression is what resisted integration -- most notably ETL/ingestion suite SQL Server Integration Services (SSIS, née Data Transformation Services) and Profiler, which captures statement text and parameters on execution through a dizzying array of configurable filters.

Postgres doesn't have an SSIS. That's a good thing: even if the community wanted to support an official ETL machine, it's a bad direction for an open source project to go, given the unbounded and infinitely edge-cased panoply of input specifications involved. Controlling more of the backend can add value for commercial DBMSs, but for Postgres there's nothing to be gained. It's an interesting contrast with schema migration, where nobody has an official change management system, but that's beside the point.

Profiler, though, I miss almost every day. And although the distinction between tool and machine can be an especially slippery one for programs, Profiler is more tool-like in use: it assists with whatever else you're doing when you need to peek at database activity rather than organizing tasks into a workflow, and you pick it up when you need it and put it away when you're done. In other words, it's "ready to hand" rather than being a system you step into and operate. As an application developer it gave me instant insight into what I'd actually communicated to the database. Its filters were more customizable and more powerful than any reasonable grep invocation. Best of all, I could start and stop tracing without touching or knowing anything about the server's log settings or having SSH access.

pg_stat_statements of course exists; pgcenter has a top-style view that's of some use in tracking down frequent long-running statements; EDB has an SQL profiler module that installs server-side; but there's nothing even approaching a 1:1 equivalent client program as far as I know.
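
The contrast with Profiler's pick-up-and-go readiness is stark: pg_stat_statements has to be configured up front, in shared_preload_libraries, before the extension can do anything useful, and what you get back is an aggregate view rather than a live stream of statements. A quick sketch of typical usage (column names as of Postgres 13; older versions call these total_time and mean_time):

    -- requires pg_stat_statements in shared_preload_libraries first
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- the ten statements consuming the most cumulative execution time
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;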

programming environment viewed as a workshop offers a set of tools, but does not implement an overall strategy of software development. However, it may be used to automate a selected set of familiar and routine activities (such as change management or compilation).

Users define working processes by drawing on their knowledge of tasks, materials and tools. The programming workshop "surrounds" the user with sets of tools and automata [machines with hidden internal processes that "appear as machines when in use"], each with its own specific application and suitability for a particular type of material.

Application developers use a lot of tools, but even when those tools don't come pre-integrated into an application development machine like an IDE, they build these machines for themselves anyway; the workshop is an environment which facilitates their design and construction ad hoc. Such machines might be distributed across multiple programs -- editor, shell, compiler, linker, debugger, version control -- each individually a tool or a smaller machine bringing tools together for a single purpose, connected and mechanized into an inhabitable whole in order to speed up and standardize the motions of software development: that is, the industrial production of software machines from other software machines. Developers use their meta-machines to combine machines for data access, machines for rendering text or graphics, machines for telling time or hashing strings or an infinite variety of other purposes into new machines that meet their own or their organization's goals.

And here, a thousand and some words in, I make it to the prompt. Database workers have plenty of machines at which we do our database work, vast and comprehensive like SSMS or small and simple like psql, which repeats the motion of read-evaluate-print and yields to external editing machines, to source and destination machines connected through pipes, and to tools like less or pspg when the user performs a different or specialized task that isn't its core competency. pgTAP is another machine that exercises a database according to its input, a player piano that detects its own off notes. It's one of the few we have that connects to developers' meta-machines. Efforts to bridge the chasm from the other side have so far mostly resulted in pared-down implementations of the SSMS type being bolted into developers' IDEs.

And database workers' tools?

Well, what are our tools? Profiler, there's one: SQL Server's virtual microscope. When it comes to Postgres, of course, we have to attach the pg_stat_statements machine to the server or make do with SSH and grep, which aren't database tools specifically. There's a smattering of mostly operations-focused tools like pgbackrest or postgresqlco.nf. Otherwise, we have SQL scripts: a script to calculate bloat, a script to check index statistics, a script to report outputs or patch up recurrent data quality issues or populate static tables.
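
To make "a script to check index statistics" concrete, here's roughly the shape such a tool takes, a sketch built on the standard statistics views (the zero-scan cutoff and the ordering are judgment calls, not a canonical recipe):

    -- possibly-unused indexes: never scanned since statistics were last
    -- reset, ordered by the space they occupy
    SELECT schemaname,
           relname AS table_name,
           indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;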

We have so many of these tools it's difficult to keep track of them all.

We don't have a standard way to organize or remember or even name most of them.

We don't have a dedicated infrastructure to share and update and standardize them, outside of one specific class of tool/machine: extensions, which have pgxn. And new extensions face an uphill climb to widespread adoption as more database workloads shift to cloud providers, which allow them only on a case-by-case basis.

Most of all we lack simple, well-defined ways for anyone else to use our tools without requiring them first to step out of their machines and into ours.

It reminds me a lot of (what I saw of) the state of *nix admin before most distros standardized on systemd. Linux had and has init daemons aplenty: SysVinit, upstart, runit, and more. Most of them orchestrate assemblages of more or less glorified shell scripts. The computer boots and starts process ID 1, which in turn rummages around in /etc and runs anything that looks like it needs running, prioritized however the daemon prioritizes. Want to kick off some long-running service on boot? Write a shell script, season to your init daemon's taste, and drop it somewhere in /etc/init or /etc/rc.d. Every software vendor and every sysadmin seemed to have a slightly different approach to the infinite possibilities of upstart's script block, or of SysV full stop. Every boot rebuilt the runtime configuration -- the operating-system machine -- from scratch by the automatic application of heterogeneous tool after heterogeneous tool. More than once I found myself in the shoes of the broomstick-multiplying sorcerer's apprentice of the poem as my adjustments went horribly awry.

These init systems, not unlike psql, are small, simple machines which defer to external tools wherever possible. systemd, meanwhile, integrated several other machines and components like login, networking, logging, and cron into a relatively maximalist operating system orchestrator. It restricted the infinite customizability of init scripts and more or less unified those several disparate ways of working, sacrificing "do one thing well" for "do many common things ~consistently". This, naturally, cuts in several directions, but from my perspective as an occasional or dilettante sysadmin it's been a huge improvement, if only on the grounds that my knowledge of service management and troubleshooting on Arch carries over to Ubuntu or Fedora out of the box. Instead of learning how to hand-assemble this particular Rolls, I can drive off the lot right away in a more basic car and get to my own goals immediately.

It's those goals that determine the contents of my SQL toolbox, same as with everyone else. The tools in it are not all created equal; some are inevitably too tightly bound to the specific context they originate from to justify adding them to the standard kit. But in other situations, having one decent answer is better than having five great answers, and some tools can usefully be mechanized, standardized, centralized. The trick is identifying them: which ones help database workers avoid reinventing wheels and integrate easily and usefully into other workers' machines?

Some of the tools I've written:

  • having thrown out too many brand-new and already-outdated entity-relationship diagrams ever to want to draw another one, I used Graphviz and an image-capable terminal to explore the foreign key graph instead (the core query is sketched after this list). There's a similar script that analyzes view dependencies, but fks.zsh is the star of that show. As shell functions, they're available anywhere (there's that readiness-to-hand again), show the view from whatever vantage point you select, and get out of the way.
  • also in zsh, an autocompleting SQL file runner over a directory of scripts organized by database. This came in especially useful when I dealt with a lot of database dumps with other environments' security and FDW settings: bake the alter statements into a script one time, then sql dbname post-restore.sql forever after. I use SyncThing to keep scripts consistent across computers.
  • while dealing with the pain of multiple codebases interacting over multiple evolving database schemas, I developed an automated build module that indexes data access code and checks the blast radius of migration scripts across the entire organization. ectomigo is more a machine -- it centralizes the repeated motions of syntax analysis and comparison for individual connected repositories -- but itself connects to machines developers already use via review comments.
  • and I've built a few pgTAP checks for work recently (validating things like object comments and row-level security status) that I should probably look at upstreaming in the new year; a sketch of one follows below.
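
The foreign key explorer boils down to a query that emits Graphviz dot edges; everything else is rendering and filtering. A minimal sketch of the idea (the catalog query is standard Postgres; the real fks.zsh does considerably more):

    -- one dot edge per foreign key constraint; wrap the output in
    -- "digraph fks { ... }" and feed it to Graphviz to draw the graph
    SELECT format('"%s" -> "%s";', conrelid::regclass, confrelid::regclass)
    FROM pg_constraint
    WHERE contype = 'f';

And a row-level security check might look something like this as a pgTAP test (plan, is, and finish are pgTAP functions; the single aggregate assertion and the focus on the public schema are illustrative choices, not the actual checks described above):

    BEGIN;
    SELECT plan(1);

    -- fails if any table in public lacks row-level security; a real
    -- check might emit one test per table to name the offenders
    SELECT is(
      (SELECT count(*)::int FROM pg_tables
       WHERE schemaname = 'public' AND NOT rowsecurity),
      0,
      'every public table has row-level security enabled'
    );

    SELECT * FROM finish();
    ROLLBACK;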

I'd love for tools like pspg, pgsql-tweaks, or the scripts we've all copied out of the Akashic records, and machines like pgcenter, to become more integrated into the psql or Postgres machines. Not necessarily in the strict software sense of sharing repo space or aligning to Postgres' own release cycle (pgsql-tweaks, for example, belongs in core, but compatibility beyond Postgres is a semi-explicit goal of pspg) -- smaller projects are much more nimble. But I think there could be a role for the Postgres social machine to play even for the really independent projects in its orbit. There's a lot of redundant work that has to happen, such as packaging for different distros and operating systems, and right now it happens as each project's maintainers have time, awareness of the need, and the resources necessary to fulfill it. A centralizing strategy could eliminate or at least contain a lot of that redundancy and make useful tools and affordances much more widely available to database workers and downstream developers alike.