Hanukkah of Data 2022/5783
Eight days, eight data analysis puzzles, eight solutions. After working out the password I imported the SQLite database into Postgres the simplest possible way (with a couple of tweaks at the end of the giant sed expression; items.array seems to have been replaced by the orders_items junction table):
createdb hanukkah
sqlite3 noahs.sqlite .dump \
  | sed -e 's/INTEGER PRIMARY KEY AUTOINCREMENT/SERIAL PRIMARY KEY/g;s/PRAGMA foreign_keys=OFF;//;s/unsigned big int/BIGINT/g;s/UNSIGNED BIG INT/BIGINT/g;s/BIG INT/BIGINT/g;s/UNSIGNED INT(10)/BIGINT/g;s/BOOLEAN/SMALLINT/g;s/boolean/SMALLINT/g;s/UNSIGNED BIG INT/INTEGER/g;s/INT(3)/INT2/g;s/DATETIME/TIMESTAMP/g;s/desc text/description text/g;s/items array/items text/g' \
  | psql hanukkah
The key fields in orders got turned into text somewhere along the line, but that's easily fixed with alter table orders alter column x type int using x::int.
I also imposed the following completely arbitrary constraints on myself:
- read only, no changing information or writing intermediary data.
- produce exactly the target information, no extra rows or columns.
- do it in a single DML statement (common table expressions and subqueries okay).
day one: beehive
This is a fun one! We represent the number:letter correspondence with a common table expression, unnest the customer's last name (all customers have just a first and last name, no variations) into another table-like object, then join our keypad-simulating CTE to find the one customer whose last name, converted into a phone number, is their phone number.
with keys (num, vals) as (
  values
    (2, string_to_array('abc', null)), -- null delimiter splits each character
    (3, string_to_array('def', null)),
    (4, string_to_array('ghi', null)),
    (5, string_to_array('jkl', null)),
    (6, string_to_array('mno', null)),
    (7, string_to_array('pqrs', null)),
    (8, string_to_array('tuv', null)),
    (9, string_to_array('wxyz', null))
)
select customers.phone
from customers
join lateral unnest(
  string_to_array(
    -- get just the last name; Postgres uses 1-based indexing for arrays
    (regexp_split_to_array(lower(customers.name), '\s'))[2],
    null
  )
  -- `with ordinality` is exactly what it sounds like: tack a numeric index on,
  -- which string_agg() can use to keep the individual letters sorted; order is
  -- not otherwise guaranteed!
) with ordinality as namearr (v, i) on true
join keys on vals @> array[namearr.v]
group by customers.phone
having regexp_replace(customers.phone, '-', '', 'g') = string_agg(keys.num::text, '' order by namearr.i);
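The keypad trick is easy to sanity-check outside the database. A minimal Python sketch of the same name-to-phone conversion, using a hypothetical name and number rather than any record from the puzzle data:

```python
import re

# phone keypad: each digit covers a run of letters, as in the CTE above
KEYS = {'abc': '2', 'def': '3', 'ghi': '4', 'jkl': '5',
        'mno': '6', 'pqrs': '7', 'tuv': '8', 'wxyz': '9'}
LETTER_TO_DIGIT = {ch: d for letters, d in KEYS.items() for ch in letters}

def name_matches_phone(name: str, phone: str) -> bool:
    """True if the last name, typed on a keypad, spells the phone number."""
    last = name.lower().split()[-1]
    digits = ''.join(LETTER_TO_DIGIT[ch] for ch in last)
    return digits == re.sub('-', '', phone)
```

The SQL version has to unnest letters and re-aggregate them in order; in Python the join over an ordered string does that implicitly.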
day two: snail
Noah's is not selling enough coffee to be worth the effort involved, and this makes those who do order it easily findable with just a couple other dimensions to search on.
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.name like 'J% D%'
  and extract(year from ordered) = 2017
  and p.description ilike 'coffee,%';
day three: spider
Another "three clues, three predicates, one result" puzzle; no need even to check for orders having occurred more recently.
select phone
from customers
-- subtracting two from the year lines us up with the zodiacal dog; other animal
-- years won't divide evenly by 12
where ((extract(year from birthdate::date) - 2) / 12)::int = (extract(year from birthdate::date) - 2) / 12
  and to_char(birthdate::timestamptz, 'MMDD')::int between 0320 and 0420
  and citystatezip = 'South Ozone Park, NY 11420';
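The same predicate is easy to state procedurally. A quick Python sketch, assuming (as the query does) that dog years are those where year minus two divides evenly by twelve, and using the query's MMDD window:

```python
from datetime import date

def is_aries_dog(birthdate: date) -> bool:
    """Year of the dog (year - 2 divisible by 12) with an Aries-range birthday."""
    dog_year = (birthdate.year - 2) % 12 == 0
    # mirror the query's MMDD comparison: March 20 through April 20 inclusive
    mmdd = birthdate.month * 100 + birthdate.day
    return dog_year and 320 <= mmdd <= 420
```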
day four: owl
Some refining of predicates was involved in this one, but it's still pretty straightforward to solve after a quick peek at the products table to find out how sku prefixes work: only two people have ever bought bakery items between 4 and 5 am, and only one of them makes a habit of it.
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
where oi.sku ilike 'bky%'
  and numrange(4, 5, '[)') @> extract(hour from o.ordered)
  and numrange(4, 5, '[)') @> extract(hour from o.shipped)
group by c.phone
order by count(*) desc
limit 1;
day five: koala
Only one person has ever bought cat food more than one time, so we could use having and omit the limit entirely (we could also have done this yesterday), but someone might make a repeat purchase tomorrow, so order-limit is a more reliable solution.
select phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where c.citystatezip ilike 'queens village%'
  and oi.sku ilike 'pet%'
  and p.description ilike '%cat%'
group by phone
-- count number of orders, not number of items bought
order by count(distinct o.orderid) desc
limit 1;
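The distinct-order count matters because one order can contain several line items, and counting rows would inflate a single multi-bag purchase. A toy Python illustration of count(distinct orderid), on invented line items:

```python
from collections import defaultdict

# hypothetical (customer, orderid) pairs; one row per line item in an order
line_items = [
    ('alice', 101), ('alice', 101),  # two cat-food items, but one order
    ('bob', 201), ('bob', 202),      # one item each in two separate orders
]

def orders_per_customer(rows):
    """Count distinct orders per customer, like count(distinct o.orderid)."""
    seen = defaultdict(set)
    for customer, orderid in rows:
        seen[customer].add(orderid)
    return {customer: len(orders) for customer, orders in seen.items()}
```

Counting rows instead would score alice at 2; counting distinct orders correctly ranks bob's repeat purchases first.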
day six: squirrel
This one was far and away my worst score (20 attempts over four hours from opening the puzzle, though probably only one to two of those hours were spent actually trying to solve it), because I got complacent and didn't think through computing savings. I initially tested order price against wholesale price, i.e. margin, and went up a blind alley involving window functions trying to detect changes in order behavior. When I instead subtracted the paid price from the maximum ever paid for each product, I got an unambiguous result: one person has lifetime savings greater than their spending.
with max_prices as (
  select p.sku, max(oi.unit_price) as price
  from products as p
  join orders_items as oi using (sku)
  group by p.sku
)
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join max_prices as p using (sku)
group by c.customerid, c.name, c.phone
-- the standard maximum price * quantity is what _would_ have been paid without
-- any discounts or coupons
having sum(p.price * oi.qty - oi.unit_price * oi.qty) > sum(oi.unit_price * oi.qty);
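The savings arithmetic can be restated on toy data: for each line item, savings are (maximum price ever paid for that sku minus price actually paid) times quantity, and we flag customers whose total savings exceed their total spending. A Python sketch with entirely invented skus and prices:

```python
# hypothetical line items: (customer, sku, unit_price, qty)
items = [
    ('alice', 'SKU001', 10.0, 1),
    ('alice', 'SKU001', 1.0, 2),   # deep discount on a repeat purchase
    ('bob',   'SKU001', 10.0, 1),
]

def big_savers(rows):
    """Customers whose lifetime savings vs the max price exceed their spending."""
    max_price = {}
    for _, sku, price, _ in rows:
        max_price[sku] = max(price, max_price.get(sku, price))
    savings, spent = {}, {}
    for cust, sku, price, qty in rows:
        savings[cust] = savings.get(cust, 0) + (max_price[sku] - price) * qty
        spent[cust] = spent.get(cust, 0) + price * qty
    return sorted(c for c in savings if savings[c] > spent[c])
```

Here alice saved 18 against 12 spent, so she's flagged; bob never got a discount.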
day seven: toucan
Self-joining orders within a reasonable time window and filtering for different skus with similar descriptions (colors are always parenthesized) yields one match to an order from the customer in the previous puzzle.
select c.phone
from orders as o1
join orders_items as oi1 using (orderid)
join products as p1 using (sku)
join orders as o2
  on date_trunc('day', o2.ordered) = date_trunc('day', o1.ordered)
  and o2.ordered between o1.ordered - interval '1 hour' and o1.ordered + interval '1 hour'
  and o2.customerid <> o1.customerid
join orders_items as oi2 on oi2.orderid = o2.orderid
join products as p2 on p2.sku = oi2.sku
join customers as c on c.customerid = o2.customerid
where o1.customerid = 8342
  and p1.sku <> p2.sku
  and regexp_replace(p1.description, '\([^)]+\)', '') = regexp_replace(p2.description, '\([^)]+\)', '');
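The color-stripping regex is worth checking in isolation: it deletes any parenthesized run, so two variants of the same product compare equal. The same replace in Python, on made-up product descriptions:

```python
import re

def base_description(desc: str) -> str:
    """Drop any parenthesized suffix, as the query's regexp_replace does."""
    return re.sub(r'\([^)]+\)', '', desc)
```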
day eight: snake
Another simple slicing problem to wrap it up: join everything in, filter for product descriptions, count, grab the highest.
select c.phone
from customers as c
join orders as o using (customerid)
join orders_items as oi using (orderid)
join products as p using (sku)
where p.description ilike 'noah%'
group by c.name, c.phone
order by count(*) desc
limit 1;
I had fun! Most of the puzzles wound up being much more straightforward than I'd hoped, but then it's tough to come up with a reasonable challenge at the novice-to-intermediate level that isn't rendered trivial by expertise with a tool purpose-built for exactly this kind of information work. Other people are tackling this with VisiData, Excel, jq, or whatever else (I both want to see and absolutely do not want to solve day one in jq). That first puzzle set the bar super high, though, and variations on your basic join-where-sort-limit query had a really tough time following it. Honorable mention to days six and seven; on puzzle definition alone, the smoothest difficulty curve in SQL would've been something like 2-8-3-4-5-7-6-1.