Decomposing Object Trees From Relational Results

This is a feature I added to Massive recently. I had several cases where I was querying views that JOINed hierarchies of tables to reference data. As an example, here's a query that returns a list of wineries, some of their wines, and the grapes that go into each:

SELECT ws.id, ws.name, ws.country, w.id AS wine_id, w.name AS wine_name, w.year,
  va.id AS varietal_id, va.name AS varietal_name
FROM wineries ws
JOIN wines w ON w.winery_id = ws.id
JOIN wine_varietals wv ON wv.wine_id = w.id
JOIN varietals va ON va.id = wv.varietal_id
ORDER BY w.year;

The result set looks like this:

 id |         name         | country | wine_id |       wine_name       | year | varietal_id |   varietal_name    
----+----------------------+---------+---------+-----------------------+------+-------------+--------------------
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           6 | Cabernet Franc
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       5 | Nicolás Catena Zapata | 2010 |           1 | Cabernet Sauvignon
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           5 | Merlot
  4 | Chateau Ducasse      | FR      |       7 | Graves                | 2010 |           1 | Cabernet Sauvignon
  3 | Domäne Wachau        | AT      |       6 | Terrassen Federspiel  | 2011 |           7 | Grüner Veltliner
  1 | Cass Vineyards       | US      |       1 | Grenache              | 2013 |           2 | Grenache
  1 | Cass Vineyards       | US      |       2 | Mourvedre             | 2013 |           3 | Mourvedre
  2 | Bodega Catena Zapata | AR      |       3 | Catena Alta           | 2013 |           4 | Malbec
  2 | Bodega Catena Zapata | AR      |       4 | Catena Alta           | 2013 |           1 | Cabernet Sauvignon

This tells us a lot: we've got two single-varietal wines from Cass, two single-varietal bottlings (note the differing wine_ids) and a blend from Catena, one grüner from Wachau, and one classic Bordeaux blend from Ducasse. But while I can pick out the information I'm interested in from this result set easily enough, it's not directly usable by application code that processes records one at a time. If I needed to use these results to drive a site which offered winery profiles and allowed users to drill down into their offerings, I'd have a rough time of it. That structure looks more like this:

├── Bodega Catena Zapata
│   ├── Catena Alta
│   │   └── Cabernet Sauvignon
│   ├── Catena Alta
│   │   └── Malbec
│   └── Nicolás Catena Zapata
│       ├── Cabernet Sauvignon
│       └── Malbec
├── Cass Vineyards
│   ├── Grenache
│   │   └── Grenache
│   └── Mourvedre
│       └── Mourvedre
├── Chateau Ducasse
│   └── Graves
│       ├── Cabernet Franc
│       ├── Cabernet Sauvignon
│       └── Merlot
└── Domäne Wachau
    └── Terrassen Federspiel
        └── Grüner Veltliner

Relational databases don't do trees well at all. This is one of the compelling points of document databases like MongoDB, which would be able to represent this structure quite easily. However, our data really is relational: we've also got "search by grape" functionality, and it's a lot easier to pick out wines which match "Mourvedre" by starting with the single record in varietals and performing a foreign key scan. It's even indexable. By comparison, to do this with a document database you'd need to look in every document to see if its varietals had a match, and that still leaves the issue of ensuring that each winery only appears once in the output. Worse, there's no guarantee someone didn't typo "Moruvedre" somewhere.

There's an easy way to generate the profile-wine-varietal tree: just iterate the result set, see if we have a new winery and add it if so, see if the wine is new to this winery and add it if so, see if the varietal is new for this wine and add it if so. It's not very efficient, but this isn't the kind of thing one does at the millions-of-records scale anyway. The bigger problem is it only works for these specific results. Next time I run into this scenario, I'll have to start from scratch. I'm lazy. I only want to have to write this thing once.
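The one-off approach described above might be sketched like this (an illustration of the naive strategy, not code from Massive; the field names match the example query's aliases):

```javascript
// Walk the rows in order, adding each winery, wine, and varietal the first
// time it appears. Every row costs up to three linear scans of the output.
function naiveTree(rows) {
  const wineries = [];

  for (const row of rows) {
    let winery = wineries.find(w => w.id === row.id);
    if (!winery) {
      winery = {id: row.id, name: row.name, country: row.country, wines: []};
      wineries.push(winery);
    }

    let wine = winery.wines.find(w => w.id === row.wine_id);
    if (!wine) {
      wine = {id: row.wine_id, name: row.wine_name, year: row.year, varietals: []};
      winery.wines.push(wine);
    }

    if (!wine.varietals.some(v => v.id === row.varietal_id)) {
      wine.varietals.push({id: row.varietal_id, name: row.varietal_name});
    }
  }

  return wineries;
}
```

It works, but every table name and column alias is hardcoded, which is exactly the problem: next time the shape of the data changes, so does the code.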

Location, Location, Location

The first problem is determining which columns belong where in the object tree. The query result doesn't say which table a given column came from, and even if it did, that's no guarantee that it really belongs there. Meaning is contextual: a developer might want to merge joined results from a 1:1 relationship into a single object, or do more complicated things I can't anticipate.

To place each column, Massive needs a schema. Defining any kind of data model was something I'd avoided in the project for as long as possible; coming as I do from a strongly-typed background, it's almost instinctive. Strong typing, its many good points aside, is one of the reasons the object-relational mapper pattern (O/RM) dominates data access in languages like Java and C#: the requirement to map out class definitions ahead of time lends itself all too easily to creating a parallel representation of your data model as an object graph. This is the "object-relational impedance mismatch", also known as the Vietnam of computer science. You now have two data models, each subtly out of sync with the other, each trying to shoehorn data into formats that don't quite fit it. By contrast, JavaScript basically doesn't care what an object is. That lets Massive get away without any kind of modeling: it builds an API out of Tables and Queryables and Executables, but after that it's all arrays of anonymous result objects.

In an early version of this code, I automatically generated the schema based on column aliasing. The field wines__id would be allocated to an element of a collection named wines in the output. I wound up dropping this: naming conventions require significant up-front work, and if you're trying to do this to a view that already exists, it probably doesn't follow conventions I just came up with. This is poison for Massive, which is supposed to be a versatile toolkit with few expectations about your model. Providing a schema on invocation is still a non-negligible effort, but you only have to do it when you absolutely need it.

A schema looks like this:

{
  "pk": "id",
  "columns": ["id", "name", "country"],
  "wines": {
    "pk": "wine_id",
    "columns": {"wine_id": "id", "wine_name": "name", "year": "year"},
    "array": true,
    "varietals": {
      "pk": "varietal_id",
      "columns": {"varietal_id": "id", "varietal_name": "name"},
      "array": true
    }
  }
}

Each nested element defines a pk field, which we'll use to distinguish records belonging to different objects at the appropriate level of the tree. columns may be an array or an object to allow renaming (every single one of our tables has a column called name, and prefixes only make sense for flat result sets). The array flag on inner schemas indicates whether objects created from the schema should be appended to a collection or added as a nested object on the parent. We don't have any instances of the latter, but it's something you'd use for a user with a rich profile object or another 1:1 relationship.
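For that 1:1 case, a hypothetical schema for users joined to a profiles table might look like this, with array turned off so each profile attaches directly to its user as a nested object (the table and column names here are invented for illustration):

```json
{
  "pk": "id",
  "columns": ["id", "username"],
  "profile": {
    "pk": "profile_id",
    "columns": {"profile_id": "id", "bio": "bio"},
    "array": false
  }
}
```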

Making a Hash of Things

Given a resultset and a schema to apply to it, our first order of business is consolidation. Chateau Ducasse only has one wine in our dataset, but since it's a cabernet sauvignon/merlot/cabernet franc blend, it shows up in three rows. And through some quirk of the sorting engine, those three rows aren't even adjacent. We'd be in trouble if we just accumulated data until the id changed -- we'd have records for a 2010 Chateau Ducasse cab franc and a 2010 Ducasse merlot/cab sauv, neither of which actually exists. If we did it really badly, we'd have two distinct Chateaux Ducasse with one imaginary wine each.

Fortunately, our schema defines a primary key field which will ensure that Chateau Ducasse is the only Chateau Ducasse; and we have hashtables. We can represent the query results as a recursively nested dictionary matching each object's primary key with its values for fields defined by the schema. Even for a relatively small data set like we have, this mapping gets big fast. This is what Chateau Ducasse's section looks like in full:

{ ...,
  "4": {
    "id": 4,
    "name": "Chateau Ducasse",
    "country": "FR",
    "wines": {
      "7": {
        "id": 7,
        "name": "Graves",
        "year": 2010,
        "varietals": {
          "1": {
            "id": 1,
            "name": "Cabernet Sauvignon"
          },
          "5": {
            "id": 5,
            "name": "Merlot"
          },
          "6": {
            "id": 6,
            "name": "Cabernet Franc"
          }
        }
      }
    }
  }
}

To generate this, we iterate over the resultset and pass each row through a function which recursively steps through the schema tree to apply the record data. Since this schema starts from wineries, the id 4 corresponds to Chateau Ducasse. Inside that object, the wine id 7 in the wines mapping corresponds to their 2010 Bordeaux, and so on.
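That recursive application might look something like this (a simplified sketch of the technique, not Massive's actual implementation; buildMapping and applyRow are names I've made up here):

```javascript
// Fold each row into a nested dictionary keyed by the primary key value the
// schema defines at each level of the tree.
function buildMapping(rows, schema) {
  const mapping = {};

  for (const row of rows) {
    applyRow(mapping, row, schema);
  }

  return mapping;
}

function applyRow(node, row, schema) {
  const key = row[schema.pk];

  if (!node[key]) {
    node[key] = {};

    // columns may be an array of names, or an object mapping resultset
    // fields to output names
    if (Array.isArray(schema.columns)) {
      for (const c of schema.columns) { node[key][c] = row[c]; }
    } else {
      for (const [from, to] of Object.entries(schema.columns)) {
        node[key][to] = row[from];
      }
    }
  }

  // recurse into nested schemas: anything that isn't a reserved field
  for (const [name, child] of Object.entries(schema)) {
    if (name === 'pk' || name === 'columns' || name === 'array') { continue; }

    node[key][name] = node[key][name] || {};
    applyRow(node[key][name], row, child);
  }
}
```

Each row costs only a handful of hash lookups, one per level of the schema, no matter how large the output has grown.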

Simplify!

However, the primary key mapping is obnoxious to work with. It's served its purpose of structuring our data in an arborescent rather than a tabular form; now it needs to go away, because it's an extra layer of complexity on top of our super-simple winery-wine-varietal tree. We need to break each winery value in the outer dictionary out into its own object, recurse into each of those to do the same for their wines, and finally recurse into the wines to handle the varietals.

If this sounds really similar to what we just did, that's because it is. It's technically possible to do this in one pass instead of two, but processing the raw results into a hashtable first is much, much faster than repeatedly scanning the output arrays to find the object each row belongs to.

To arrive at the final format, we reduce the mapping's key list; these are the primary keys of each winery in the example dataset. The corresponding values from the mapping go in the reduce accumulator. Since we're only dealing with arrays here, the accumulator will always be an array; if we had a subobject with a 1:1 relationship, we'd use an object accumulator instead by turning array off in the schema definition. This would result in the subobject being directly accessible as a property of its parent object.
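Here's a sketch of that second pass under the same assumptions (again illustrative rather than Massive's real code, and it assumes no output column shares a name with a nested schema):

```javascript
// Unwind the primary key mapping: each level's keys become plain objects,
// and nested mappings are transformed recursively per their schemas.
function transform(mapping, schema) {
  const values = Object.keys(mapping).map(key => {
    const obj = {};

    for (const [name, value] of Object.entries(mapping[key])) {
      // a matching nested schema means this value is itself a mapping
      obj[name] = schema[name] && typeof schema[name] === 'object'
        ? transform(value, schema[name])
        : value;
    }

    return obj;
  });

  // turning array off collapses a level to a single object (1:1)
  return schema.array === false ? values[0] : values;
}
```

Chaining the two passes gives the whole feature: transform(buildMapping(rows, schema), schema) takes a flat resultset straight to the winery-wine-varietal tree.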

Here's Catena:

[ ...,
  {
    "id": 2,
    "name": "Bodega Catena Zapata",
    "country": "AR",
    "wines": [ {
      "id": 3,
      "name": "Catena Alta",
      "year": 2013,
      "varietals": [ {
        "id": 4,
        "name": "Malbec"
      } ]
    }, {
      "id": 4,
      "name": "Catena Alta",
      "year": 2013,
      "varietals": [ {
        "id": 1,
        "name": "Cabernet Sauvignon"
      } ]
    }, {
      "id": 5,
      "name": "Nicolás Catena Zapata",
      "year": 2010,
      "varietals": [ {
        "id": 1,
        "name": "Cabernet Sauvignon"
      }, {
        "id": 4,
        "name": "Malbec"
      } ]
    } ]
  },
... ]

Dead simple: we've got wineries, wineries have wines, wines have varietals. Everything lines up with the real primary key values from the original query result. We've turned a raw resultset with embedded relationships into a model of those relationships. This is much easier to manage outside the relational context in client code, and it's an accurate representation of the mental model we want our users to have. The schema does add a bit of overhead, but it's contained about as well as possible; further automation would only make it less flexible.