2023-11-16

UpSet plots, succinctly

UpSet plots are not widely known, but they are useful for understanding the intersections of categorised data: think of them as Venn diagrams on steroids. Let's go straight to an example:
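As a warm-up, here is a minimal sketch in plain Python of the numbers an UpSet plot typically draws as bars: for every combination of categories, how many items belong to exactly that combination. The category names and ids are made up purely for illustration.

```python
from collections import Counter

# Hypothetical toy data: which ids fall into which category.
categories = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 6},
}

# For each id, record the exact combination of categories it belongs to.
membership = {}
for name, ids in categories.items():
    for item in ids:
        membership.setdefault(item, set()).add(name)

# Count ids per exact combination; each count corresponds to one bar.
combo_sizes = Counter(frozenset(names) for names in membership.values())

for combo, size in sorted(combo_sizes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(combo), size)
```

A three-circle Venn diagram could show the same thing, but the bar-per-combination layout stays readable as the number of categories grows.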

First, let's get the data into a CSV file, in this case from Postgres:

psql postgres://... -t -A -F ',' -c \
'COPY ( SELECT ... ) TO STDOUT WITH CSV HEADER' > ~/example.csv

Our example data looks like:

id,animal,dob
1,cat,2021-01-01
2,dog,2021-01-02
3,cat,2021-01-03
4,cat,2021-01-04
5,snail,2021-01-05
6,dog,2021-01-06

Now we set up a Python environment with all the libraries we need:

mkdir i-can-do-data-science; cd i-can-do-data-science
python -m venv venv; source venv/bin/activate
pip install pandas jupyter upsetplot
jupyter notebook  # this will open a browser window

Let's import some junk in a new cell:

import warnings
warnings.filterwarnings('ignore')  # Disable all warnings
from collections import defaultdict
import pandas as pd
import upsetplot

Now let's plot our data.

This is maybe not the most efficient way of doing things, but I find it the most intuitive: we simply construct a dict of "which ids are in which category":

grouped = {
    "CATEGORY_1": {"id_1", "id_2", ...},
    "CATEGORY_2": {"id_2", ...},
}
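One way to build such a dict from the example rows is sketched below. It is an illustration under assumptions, not necessarily the exact code behind the plot: the rows are inlined so the snippet is self-contained, the `born_before_jan_4` category is hypothetical (added only so that some sets overlap), and `plot_grouped` is a made-up helper name wrapping `upsetplot.from_contents` and `upsetplot.UpSet`.

```python
import csv
import io
from collections import defaultdict

# The sample rows from above, inlined so the sketch is self-contained.
# With a real export, swap io.StringIO(SAMPLE) for open(path, newline="").
SAMPLE = """id,animal,dob
1,cat,2021-01-01
2,dog,2021-01-02
3,cat,2021-01-03
4,cat,2021-01-04
5,snail,2021-01-05
6,dog,2021-01-06
"""

grouped = defaultdict(set)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    # One category per animal.
    grouped[row["animal"]].add(row["id"])
    # Hypothetical second category so that some sets overlap.
    # ISO dates compare correctly as plain strings.
    if row["dob"] < "2021-01-04":
        grouped["born_before_jan_4"].add(row["id"])


def plot_grouped(grouped):
    """Draw an UpSet plot from a dict of category -> set of ids."""
    import upsetplot  # imported here so the grouping above runs without it

    # from_contents turns the dict into a frame indexed by category membership.
    data = upsetplot.from_contents(grouped)
    upsetplot.UpSet(data, subset_size="count", show_counts=True).plot()
```

In a notebook cell, calling `plot_grouped(grouped)` should render the figure inline.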

Our plot tells us some useful facts: