ectomigo: safer schema migrations

ectomigo: Safer Schema Migrations

The team I work with at my day job maintains many applications and processes interacting across a smaller number of databases. This is hardly exceptional. We are also constantly adding, subtracting, and refining not only the client programs but also the database schemas themselves. This too is hardly exceptional: business requirements change, external systems expose new information and deprecate old interfaces, von Moltke's Law ("no plan of operations remains certain once the armies have met") comes calling. Every now and again we just make a modeling or implementation mistake that manages to sneak through review and up to production. Sic semper startups.

So our database schemas are continually evolving. Each of those many applications and processes has to evolve along with them, or we get paged when the renamed column or dropped table breaks something we hadn't accounted for, and instant breakage is the best case. We've had schema incompatibilities lie in wait for over a month to catch us completely flatfooted. The complexities of even a single moderately-sized codebase are beyond the grasp of human memory. What hope do we have of recalling which relevant subset of database interactions appear where across two or ten or more?

What we need is a distinctly inhuman memory, one for which summoning up each and every reference to a changing table or view takes a moment's effort, and which cannot forget. A memory which operates at the level of the organization, rather than that of the project or of the individual developer/reviewer, only able to focus on a single target at a time. A memory we can consult when, or better yet before, code is ready to deploy -- "shifting left", as they say.

We need a database.

I built one.

a schema migration alters a table, renaming a column; ectomigo leaves a GitHub review comment pointing out references to that table in two repositories. Each reference includes the columns ectomigo has been able to identify. One reference uses the column's new name, indicating it's been updated, but another in the second repository still uses the old name and must be fixed.

ectomigo is a continuous integration module (initially a GitHub action) which parses your source files using tree-sitter to find data access code: SQL scripts and inline SQL in Java, JavaScript, and Python; MassiveJS calls; SQLAlchemy definitions; and more languages, data access patterns, analysis features, and platform support on the way after launch. Everything it finds it indexes, storing database object names and the file row-column positions of each reference.

When you submit schema changes for review, it parses that code as well, and matches the targets you're altering or dropping against every codebase your organization has enabled. If it does find any matches -- in other words, you still have live references to an affected database object, in this or another repository -- it leaves review comments alerting you to each potential problem.

ectomigo is launching on GitHub free for public and up to two private projects, with pricing available beyond that. The action code and the core code analysis library it integrates are open under the AGPL should you be interested in that.

We've been using early ectomigo builds at my workplace for a couple of months now, and it's already saved our bacon a few times with reports on database object usage in places we'd forgotten. If you're faced with migration risk yourself, I hope it can help you.