The Ultimate Postgres vs MySQL Blog Post

I should probably say up front that I love working with Postgres and could die happy without ever seeing a mysql> prompt again. This is not an unbiased comparison -- but those are no fun anyway.

The scenario: two applications, using Massive.js to store and retrieve data. Massive is closely coupled to Postgres by design. Specializing lets it take advantage of features which only exist in some or no other relational databases to streamline data access in a lighter, more "JavaScripty" way than a more traditional object-relational mapper. It's great for getting things done, since the basics are easy and for the complicated stuff where you'd be writing SQL anyway.... you write SQL, you store it in one central place for reuse, and the API makes running it simple.

Where Massive is less useful is if you have to support another RDBMS. This is, ideally, something you know about up front. Anyway: things happen, and sometimes you find yourself having to answer the question "what's it going to look like if we need to run these applications with light but tightly coupled data layers on MySQL?"

Not good, was the obvious answer, but less immediately obvious was how not good. I knew there were some things Postgres did that MySQL didn't, but I also knew there were a ton of things I'd just never tried in the latter. So as I got to work on this, I started keeping notes. Here's everything I found.

Schema Layout

Now that we're all basically over the collective hallucination of a "schemaless" future, arguably the most important aspect of data storage is how information is modeled in a database. Postgres and MySQL are both relational databases, grouping records in strictly-defined tables. But there's a lot of room for variation within that theme.

Multiple Schemas

First things first: "schema" doesn't always mean the same thing. To MySQL, "schema" is synonymous with "database". For Postgres, a "schema" is a namespace within a database, which allows you to group tables, views, and functions together without having to break them apart into different databases.

MySQL's simplicity in this respect is ameliorated by its offering cross-database queries:

SELECT *
FROM db1.table1 t1
JOIN db2.table2 t2 ON t2.t1_id = t1.id;

With Postgres, you can work across schemas, but if you need to query information in a different database, that's a job for...

Foreign Data Wrappers

Foreign data wrappers let Postgres talk to practically anything that represents information as discrete records. You can create a "foreign table" in a Postgres database and SELECT or JOIN it like any other table -- only under the hood, it's actually reading a CSV, talking to another DBMS, or even querying a REST API. It's a powerful enough feature that NoSQL stalwart MongoDB sneakily built their BI Connector on top of Postgres with foreign data wrappers. You don't even need to know C to write a new FDW when Multicorn lets you do it in Python!

Oracle and SQL Server both have some functionality for registering external data sources, but Postgres' offering is the most extensible I'm aware of. MySQL, besides the inter-database query support mentioned above, has nothing.

Table Inheritance

Inheritance is more commonly thought of as an attribute of object-oriented programming languages rather than databases, but Postgres is technically an ORDBMS or object-relational database management system. So you can have a table cities with columns name and population, and a table capitals which inherits the definition of cities but adds an of_country column only relevant, of course, for capital cities. If you SELECT from cities, you get rows from capitals -- they're cities too! You can of course SELECT name FROM ONLY cities to exclude the capitals. This is something of a niche feature, but when you have the right use case it really shines.

MySQL, being a traditional RDBMS, doesn't do this.

Materialized Views

Materialized views are like regular views, except the results of the specifying query are physically stored ('materialized') and must be explicitly refreshed. This allows database developers to cache the results of slower queries when the results don't have to be realtime.

Oracle has materialized views, and SQL Server's indexed views are similar, but MySQL has no materialized view support.

Check Constraints

Constraints in general ensure that invalid data is not stored. The most common constraint is NOT NULL, which prevents records without a value for the non-nullable column from being inserted or updated. Foreign key constraints do likewise when a reference to a record in another table is invalid. Check constraints are the most flexible, and allow validation of any predicate you could put in a WHERE clause -- for example, asserting that prices have to be positive numbers, or that US zip codes have to be five digits.

Per the MySQL docs: the CHECK clause is parsed but ignored by all storage engines.

JSONB and Indexing

Postgres and MySQL both have a JSON column type (MySQL replacement MariaDB does too, but it's currently just an alias for LONGTEXT) and functions for building, processing, and querying JSON fields. Postgres actually goes a step further by offering a JSONB type which processes input data into a binary format. This means it's a little bit slower to write, but much faster to query.

It also means you can index the binary data. A GIN or Generalized INverted index allows queries checking for the existence of specific keys or key-value pairs to avoid scanning every single record for matches. This is huge if you run queries which dig into JSON fields in the WHERE clause.

Default Values Defined by Functions

DEFAULT is a useful specification for columns in a CREATE TABLE statement. At the simplest level, this could be used to baseline a boolean field to true or false if the INSERT statement doesn't give an explicit value. But you can do more than set a scalar value: a timestamp can default to now(), a UUID to any of a variety of UUID-generating functions, any other field to the value returned by whatever function you care to write -- the sky's the limit!

Unless you're using MySQL, in which case the only function you can reference in a DEFAULT clause is now().

Type Differences

Layout's only part of the story, though. Equally important is the difference in type support. The benefit of a robust type system is in enabling database architects to represent information with the greatest accuracy possible. If a value is difficult or impossible to represent with built-in types, it's harder for developers to work with in turn, and if compromises have to be made to cut the data to fit then they can affect entire applications. Some types can even affect the overall database design, such as arrays and enumerations. In general, the more options you have the better.

UUIDs

Postgres has a UUID type. MySQL does not. If you want to store a UUID in MySQL, your options are CHAR, if you want values to be as human-readable as UUIDs ever are, or BINARY, if you want it to be faster but more difficult to work with manually. Postgres also generates more types of UUIDs.

Booleans

Boolean seems like a pretty basic type to have! However, MySQL's boolean is actualy an alias for TINYINT(1). This is why query results show 0 or 1 instead of true or false. It's also why you can set the value of an ostensibly boolean field to 2. Try it!

Postgres: has proper booleans.

Varlena and Lengths

MySQL isn't alone in aliasing standard types in strange ways, however. CHAR, VARCHAR, and TEXT types in Postgres are all aliased representations of the same structure -- the only distinction is that length constraints will be enforced if specified. The documentation notes that this is actually slower, and recommends that unbounded text simply be defined as the TEXT type instead of given an arbitrary maximum length.

What's happening here is that Postgres uses a data structure called a varlena, or VAriable LENgth Array, to store the information. A varlena's first four bytes store the length of the value, making it easy for the database to pick the whole thing out of storage. TEXT is only one of the types that uses this structure, but it's easily the most commonly encountered.

If a varlena is longer than would fit inline, the database uses a system called TOAST ("The Oversized Attribute Storage Technique") to offload it to extended storage transparently. Queries with predicates involving a TOASTable field might not be all that performant with large tables unless designed and indexed carefully, but when the database is returning records it's easy enough to follow the TOAST pointer that the overhead is barely noticeable for most cases.

The upshot of all this, as far as most people are concerned, is this: with Postgres, you only have to worry about establishing a length constraint on fields that have a reason for a length constraint. If there's no clear requirement to limit how much information can go into a field, you don't have to pick an arbitrary number and try to match it up with your page size.

Arrays

Non-scalar values in records! Madness! Dogs and cats living together! Anyone who's worked with JSON, XML, YAML, or even HTML understands that information isn't always flat. Relational architectures have traditionally mandated breaking out any vectors, let alone even more complex values, into new tables. Sometimes that's useful, but often enough it adds complexity to no real purpose. Inlining arrays makes many tasks -- such as tagging records -- much easier.

Postgres has arrays, as does Oracle; MySQL and SQL Server don't.

Customizing Types

If the built-in types aren't sufficient, you can always add your own. Custom types let you define a value to be exactly what you want. Domains are a related concept: types (custom or built-in) which enforce constraints on values. You might for example create a domain to represent a zip code as a TEXT value which uses regular expressions in a CHECK clause to ensure that values consist of five digits, optionally followed by a dash and four more digits.

If you're using Postgres, that is. Oracle and SQL Server both offer some custom type functionality, but MySQL has nothing. You can't even use table-level CHECK constraints because the engine simply ignores them.

Enums

Enumerations don't get enough love. If I had a dollar for every INT -- or worse, VARCHAR -- field I've seen representing one of a fixed set of potential values, I probably still couldn't retire but I could at least have a pretty nice evening out. There are drawbacks to using enums, to be sure: adding new values requires DDL, and you can't remove values at all. But appropriate use cases for them are still reasonably common.

MySQL and Postgres both offer enums. The critical distinction is that Postgres' enums are proper reusable types. MySQL's enums are more like the otherwise-ignored CHECK constraints and specify a valid value list for a single column in a single table. Possible improvement on allowing a boolean column to contain -100?

Querying Data

So that's data modeling covered. There's an entire other half to go: actually working with the information being stored. SQL itself is divided in two parts, the "data definition language" which defines the structure of a database and the "data manipulation language". This latter comprises the SELECT, INSERT, and other statements most people think of when they hear the name "SQL". And just as with modeling, there are substantial differences between Postgres and MySQL in querying.

RETURNING

Autogenerating primary keys takes a huge headache out of storing data. But there's one catch: when you insert a new record into a table, you don't know what its primary key value got set to. Most relational databases will tell you what the last autogenerated key was if you call a special function; some, like SQL Server, even let you filter down to the single table you're interested in.

Postgres goes above and beyond with the RETURNING clause. Any write statement -- INSERT, UPDATE, DELETE -- can end with a RETURNING [column-list], which acts as a SELECT on the affected records. RETURNING * gives you the entire recordset from whatever you just did, or you can restrict what you're interested in to certain columns.

That means you can do this:

INSERT INTO foos (name)
VALUES ('alpha'), ('beta')
RETURNING *;

 id │ name  
────┼───────
  1 │ alpha
  2 │ beta
(2 rows)

With MySQL, you're stuck with calling LAST_INSERT_ID() after you add a new record. If you added multiple, LAST_INSERT_ID only gives you the earliest new id, leaving you to work out the rest yourself. And of course, this is only good for integer primary keys.

MySQL also has no counterpart to this functionality for UPDATEs and DELETEs. Competitor MariaDB supports RETURNING on DELETE, but not on any other kind of statement.

Common Table Expressions

Common Table Expressions or CTEs allow complex queries to be broken up and assembled from self-contained parts. You might write this:

WITH page_visits AS (
  SELECT p.id, p.site_id, p.title, COUNT(*) AS visits
  FROM pages AS p
  JOIN page_visitors AS v ON v.page_id = p.id
  GROUP BY p.id, p.site_id, p.title
), max_visits AS (
  SELECT DISTINCT ON (site_id)
    site_id, title, visits
  FROM page_visits
  ORDER BY site_id, visits DESC
)
SELECT s.id, s.name,
  max_visits.title AS most_popular_page,
  SUM(page_visits.visits) AS total_visits
FROM sites AS s
JOIN page_visits ON page_visits.site_id = s.id
JOIN max_visits ON max_visits.site_id = s.id
GROUP BY s.id, s.name, max_visits.title
ORDER BY total_visits DESC;

In the first query, we aggregate visit counts; in the second, we use DISTINCT ON on the results of the first to filter out all but the most popular pages; finally, we join both of our intermediary results to provide the output we're looking for. CTEs are a really readable way to factor query logic out, and they let you do some things in one statement that you can't otherwise.

MySQL does have CTEs! However: thanks to the RETURNING clause, Postgres can write records in a CTE and operate on the results. This is huge for application logic. This next query writes a record in a CTE, then adds a corresponding entry to a junction table -- all in the same transaction.

WITH wine AS (
  INSERT INTO wines (name, year)
  VALUES ('Herrenreben', 2015)
  RETURNING id
), reviewer AS (
  SELECT id
  FROM reviewers
  WHERE name = 'Wine Enthusiast'
)
INSERT INTO wine_ratings (wine_id, reviewer_id, score)
SELECT wine.id, reviewer.id, 92
FROM wine
JOIN reviewer ON TRUE;

Casting

Sometimes a query needs to treat a value as if it has a different type, whether to store it or to operate on it somehow. Postgres even lets you define additional conversions between types with CREATE CAST.

MySQL supports casting values to binary, char/nchar, date/datetime/time, decimal, JSON, and signed and unsigned integers. Absent from this list: tinyints, which, since booleans are actually tinyints, means you're stuck with conditionals when you need to coerce a value to true or false for storage in a "boolean" column.

Lateral Joins

A lateral join is fundamentally similar to a correlated subquery, in that it executes for each row of the current result set. However, a correlated subquery is limited to returning a single value for a SELECT list or WHERE clause; subqueries in the FROM clause run in isolation. A lateral join can refer back to information in the rest of the result set:

CREATE TABLE docs (id serial, body jsonb);

INSERT INTO docs (body) VALUES ('{"a": "one", "b": "two"}'), ('{"c": "three"}');

SELECT docs.id, keys.*
FROM docs
JOIN LATERAL jsonb_each(docs.body) AS keys ON TRUE;

 id │ key │  value  
────┼─────┼─────────
  1 │ a   │ "one"
  1 │ b   │ "two"
  2 │ c   │ "three"
(3 rows)

It can also invoke table functions like unnest which return multiple rows and columns:

CREATE TABLE multiple_arrays(arr1 int[], arr2 int[]);

INSERT INTO multiple_arrays (arr1, arr2)
VALUES
	('{1,2,3}', '{4,5}'),
	('{6,7}', '{8,9,10}');

SELECT raw.*
FROM multiple_arrays
JOIN LATERAL unnest(arr1, arr2) AS raw ON TRUE;

 unnest │ unnest 
────────┼────────
      1 │      4
      2 │      5
      3 │ (null)
      6 │      8
      7 │      9
 (null) │     10
(6 rows)

Oracle and SQL Server offer similar functionality with the LATERAL keyword in the former, and CROSS APPLY/OUTER APPLY. MySQL does not.

Variadic Function Arguments

Functions! Procedures, if you believe in making that distinction! They're great! You can declare variadic arguments -- "varargs" or "rest parameters" in other languages -- to pull an arbitrary number of arguments into a single collection named for the final argument.

In Postgres.

Predicate Operations

A handful of useful operations which allow more expressive WHERE clauses with Postgres:

  • IS DISTINCT FROM and its counterpart IS NOT DISTINCT FROM offer a null-sensitive equality test. Null isn't ordinarily comparable since it represents the absence of a value, so the predicate WHERE field <> 1 will not return records where field is null. WHERE field IS DISTINCT FROM 1 returns all records where field is other-than-1, including where it's null.
  • ILIKE is a case-insensitive LIKE operation. MySQL does have the capability for case-insensitive pattern matching, but it depends on your collation and can't be toggled on a per-query basis (the default collation is case-insensitive, to be completely fair).
  • ~, ~*, !~, and !~* form a set of POSIX regular expression tests: match, case-insensitive match, no match, and no case-insensitive match respectively. MySQL does have REGEXP and NOT REGEXP; however, Postgres' implementation has lookahead and lookbehind.

General Database Work

That's it for the architecture and query language feature gaps I discovered. I did run into a couple other things that bear mentioning, however:

Dependencies

MySQL doesn't care about dependencies among database objects. You can tell it to drop a table a view or proc depends on and it will go right ahead and drop it. You'll have no idea something's gone wrong until the next time you try to invoke the view or proc. Postgres saves you from yourself, unless you're really sure and drop your dependents too with CASCADE.

Triggers and Table Writes

Just the mention of triggers is probably putting some people off their lunch. They're not that bad, honest (well, they can be, but it's not like it's their fault). Anyway, point is: sometimes you want to write a trigger that modifies other rows in the table it's being activated from.

Well, you can't in MySQL.

The End?

This may have exhausted me, but I'm pretty sure it's still not an exhaustive list of the feature gaps between Postgres and MySQL. I did cop to my preference up front, but having spent six weeks putting the effort into converting the comparison is pretty damning. I think there could still be reasons to pick MySQL -- but I'm not sure they could be technical.