Hanukkah of Data 2022/5783

Eight days, eight data analysis puzzles, eight solutions. After working out the password, I imported the SQLite database into Postgres the simplest possible way (with a couple of tweaks at the end of the giant sed replacer; the items array column seems to have been replaced by the orders_items junction table):

createdb hanukkah
sqlite3 noahs.sqlite .dump | sed -e 's/INTEGER PRIMARY KEY AUTOINCREMENT/SERIAL PRIMARY KEY/g;s/PRAGMA foreign_keys=OFF;//;s/unsigned big int/BIGINT/g;s/UNSIGNED BIG INT/BIGINT/g;s/BIG INT/BIGINT/g;s/UNSIGNED INT(10)/BIGINT/g;s/BOOLEAN/SMALLINT/g;s/boolean/SMALLINT/g;s/UNSIGNED BIG INT/INTEGER/g;s/INT(3)/INT2/g;s/DATETIME/TIMESTAMP/g;s/desc text/description text/g;s/items array/items text/g' | psql hanukkah

The key fields in orders got turned into text somewhere along the line, but that's easily fixed with alter table orders alter column x type int using x::int.
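
Spelled out, and assuming customerid is among the columns that came through as text (\d orders in psql will show which ones actually need it), that's:

-- repeat for each key column that ended up as text
alter table orders alter column customerid type int using customerid::int;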

I also imposed the following completely arbitrary constraints on myself:

  • read only, no changing information or writing intermediary data.
  • produce exactly the target information, no extra rows or columns.
  • do it in a single DML statement (common table expressions and subqueries okay).

day one: beehive

This is a fun one! We represent the digit-to-letter correspondence with a common table expression, unnest the customer's last name (every customer has exactly a first and a last name, no variations) into another table-like object, then join against our keypad-simulating CTE to find the one customer whose last name, keyed in on a phone pad, spells out their own phone number.

with keys (num, vals) as (
  values
    (2, string_to_array('abc',  null)), -- null delimiter splits each character
    (3, string_to_array('def',  null)),
    (4, string_to_array('ghi',  null)),
    (5, string_to_array('jkl',  null)),
    (6, string_to_array('mno',  null)),
    (7, string_to_array('pqrs', null)),
    (8, string_to_array('tuv',  null)),
    (9, string_to_array('wxyz', null))
)
select customers.phone
from customers
join lateral unnest(
  string_to_array(
    -- get just the last name; Postgres uses 1-based indexing for arrays
    (regexp_split_to_array(lower(customers.name), '\s'))[2],
    null
  )
-- `with ordinality` is exactly what it sounds like: tack a numeric index on,
-- which string_agg() can use to keep the individual letters sorted; order is
-- not otherwise guaranteed!
) with ordinality as namearr (v, i) on true
join keys on vals @> array[namearr.v]
group by customers.phone
having regexp_replace(customers.phone, '-', '', 'g') =
  string_agg(keys.num::text, '' order by namearr.i);
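
If the null-delimiter/with ordinality combination is unfamiliar, a standalone check against a made-up string shows what the lateral side of that join produces:

-- one row per character, paired with its position:
-- (s,1), (a,2), (m,3), (p,4), (l,5), (e,6)
select v, i
from unnest(string_to_array('sample', null)) with ordinality as t (v, i);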

day two: snail

Noah's isn't selling enough coffee for it to be worth the effort involved, which makes the people who do order it easy to find with just a couple of other dimensions to search on.

select c.phone from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.name like 'J% D%'
  and extract (year from ordered) = 2017
  and p.description ilike 'coffee,%';

day three: spider

Another "three clues, three predicates, one result" puzzle; no need even to check for orders having occurred more recently.

select phone
from customers
-- subtracting two from the year lines us up with the zodiacal dog; other animal
-- years won't divide evenly by 12
where ((extract(year from birthdate::date) - 2) / 12)::int =
       (extract(year from birthdate::date) - 2) / 12
  and to_char(birthdate::timestamptz, 'MMDD')::int between 0320 and 0420
  and citystatezip = 'South Ozone Park, NY 11420';
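
In hindsight the rounding trick is just a roundabout modulus; this more direct version (untested against the data, but the arithmetic is the same) should land on the same rows:

-- dog years are 1970, 1982, 1994, ..., i.e. year mod 12 = 2
select phone
from customers
where mod(extract(year from birthdate::date)::int, 12) = 2
  and to_char(birthdate::timestamptz, 'MMDD')::int between 0320 and 0420
  and citystatezip = 'South Ozone Park, NY 11420';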

day four: owl

Some refining of predicates was involved in this one, but it's still pretty straightforward to solve after a quick peek at the products table to find out how sku prefixes work: only two people have ever bought bakery items between 4 and 5 am, and only one of them makes a habit of it.
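
That peek was nothing fancier than grouping on the first few characters of the sku (exploratory only, not part of the solution):

select left(sku, 3) as prefix, count(*) as n
from products
group by prefix
order by prefix;

The bakery prefix is bky, hence the filter in the actual solution: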

select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
where oi.sku ilike 'bky%'
  and numrange(4, 5, '[)') @> extract(hour from o.ordered)
  and numrange(4, 5, '[)') @> extract(hour from o.shipped)
group by c.phone
order by count(*) desc
limit 1;

day five: koala

Only one person has ever bought cat food more than once, so we could use having and omit the limit entirely (we could also have done that yesterday), but someone might make a repeat purchase tomorrow, so order-by-and-limit is the more robust solution (the having version is sketched below the query for comparison).

select phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.citystatezip ilike 'queens village%'
  and oi.sku ilike 'pet%'
  and p.description ilike '%cat%'
group by phone
-- count number of orders, not number of items bought
order by count(distinct o.orderid) desc
limit 1;
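
For the record, the having version mentioned above is the same query with the tail swapped out; it returns a single row today, but nothing stops a second repeat cat-food buyer from appearing later:

select phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.citystatezip ilike 'queens village%'
  and oi.sku ilike 'pet%'
  and p.description ilike '%cat%'
group by phone
having count(distinct o.orderid) > 1;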

day six: squirrel

This one was far and away my worst score (20 attempts over four hours from opening the puzzle, though I probably spent only one or two of those hours actually trying to solve it) because I got complacent and didn't think through what computing savings actually meant. I initially tested order price against wholesale price, i.e. margin, and went up a blind alley involving window functions trying to detect changes in order behavior. When I subtracted the price paid from the maximum ever paid for each product, I got an unambiguous result: one person has lifetime savings greater than their spending.

with max_prices as (
  select p.sku, max(oi.unit_price) as price
  from products as p
  join orders_items as oi using (sku)
  group by p.sku
)
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join max_prices as p using (sku)
group by c.customerid, c.name, c.phone
-- the standard maximum price * quantity is what _would_ have been paid without
-- any discounts or coupons
having sum(p.price * oi.qty - oi.unit_price * oi.qty) > sum(oi.unit_price * oi.qty);
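
For contrast, the margin-based dead end would have started from something like this rough sketch (assuming the wholesale column in products is named wholesale_cost), which measures how each customer's prices compare to cost rather than to the sticker price:

-- total margin Noah's makes on each customer; low or negative means they buy cheap
select c.phone, sum((oi.unit_price - p.wholesale_cost) * oi.qty) as margin
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
group by c.phone
order by margin;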

day seven: toucan

Self-joining orders within a reasonable time window and filtering for different skus whose descriptions match once the parenthesized colors are stripped yields exactly one match, an order placed by the customer from the previous puzzle.

select c.phone
from orders as o1
join orders_items as oi1 using (orderid)
join products as p1 using (sku)
join orders as o2
  on date_trunc('day', o2.ordered) = date_trunc('day', o1.ordered)
  and o2.ordered between o1.ordered - interval '1 hour' and o1.ordered + interval '1 hour'
  and o2.customerid <> o1.customerid
join orders_items as oi2 on oi2.orderid = o2.orderid
join products as p2 on p2.sku = oi2.sku
join customers as c on c.customerid = o2.customerid
where o1.customerid = 8342
  and p1.sku <> p2.sku
  and regexp_replace(p1.description, '\([^)]+\)', '') =
      regexp_replace(p2.description, '\([^)]+\)', '');
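
The description-matching regexp just deletes the parenthesized chunk, so two colorways of the same product compare equal. With a made-up product name:

-- both sides collapse to "Noah's Poster " (trailing space included), so this returns true
select regexp_replace('Noah''s Poster (orange)', '\([^)]+\)', '') =
       regexp_replace('Noah''s Poster (blue)',   '\([^)]+\)', '') as same_product;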

day eight: snake

Another simple slicing problem to wrap it up: join everything in, filter for product descriptions, count, grab the highest.

select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where p.description ilike 'noah%'
group by c.name, c.phone
order by count(*) desc
limit 1;

retrospectively

I had fun! Most of the puzzles wound up being much more straightforward than I'd hoped, but then it's tough to come up with a reasonable challenge at the novice-to-intermediate level that isn't rendered trivial by expertise with a tool purpose-built for exactly this kind of information work. Other people are tackling this with VisiData, Excel, jq, or whatever else (I both want to see and absolutely do not want to solve day one in jq). That first puzzle set the bar super high, though, and variations on your basic join-where-sort-limit query had a really tough time following it. Honorable mention to days six and seven; on puzzle definition alone it feels like the smoothest difficulty curve in SQL would've been something like 2-8-3-4-5-7-6-1.