How do pandas DataFrames work? (kinda)

When you're used to plain ol' dicts, ints, lists etc., pandas.DataFrames exhibit some weird behaviour, particularly concerning assignment and operators. This page is a short walk-through of how some of these things happen (and a quick intro to Python's magic methods). You can see the outcome here.

Disclaimer: what is presented here is not exactly how pandas DataFrames work internally; it is intended as a rough guide to how they behave.

The examples below use Python type hints to help keep things a bit clearer. At the top, we have:

from typing import Any, Dict, List

DataFrames are a collection of Series (AKA columns). Let's start with a really dumb FakeSeries.

class FakeSeries:
    def __init__(self, name: str, data: Dict[int, Any]):
        self.name = name
        self.data = data

    def __repr__(self) -> str:
        return f'<FakeSeries: {self.name} {self.data}>'

>>> my_series = FakeSeries("some_column_name", {0: 5, 1: 7, 2: 9})
>>> print(my_series)
<FakeSeries: some_column_name {0: 5, 1: 7, 2: 9}>

Note how the __repr__ method is used by print()
There is a list of all the magic methods you can override on a class here
Note how we are storing the series as a map of indices (0, 1, 2) to values (5, 7, 9)
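To see why print() ends up using __repr__, here is a minimal sketch. The two classes are made up purely for illustration: print() goes through str(), which falls back to __repr__ when a class defines no __str__.

```python
class OnlyRepr:
    """Hypothetical class that defines __repr__ but not __str__."""

    def __repr__(self) -> str:
        return '<OnlyRepr>'


class Both:
    """Hypothetical class that defines both methods."""

    def __repr__(self) -> str:
        return '<Both repr>'

    def __str__(self) -> str:
        return 'Both str'


# str() (and therefore print()) falls back to __repr__ when __str__ is missing
only_repr_text = str(OnlyRepr())

# when __str__ exists, str() and print() prefer it
both_text = str(Both())

# repr() (and the interactive prompt) always use __repr__
both_repr = repr(Both())
```

FakeSeries only defines __repr__, so both print(my_series) and typing my_series at the prompt show the same text.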

Now we will define our FakeDataFrame. It similarly has a useful __init__ and __repr__ (although the latter is only fully fleshed out in the original). On initialisation, it sets self.series_map, which is a map of series names to series.

class FakeDataFrame:
    def __init__(self, d: Dict[str, List[Any]]):
        self.series_map = {
            k: FakeSeries(k, {i: v for i, v in enumerate(l)})
            for k, l in d.items()
        }
        self.length = len(list(d.values())[0])

    def __repr__(self):
        width = 5
        ...
        return '\n'.join((headers, divider) + rows) + '\n'

Already, we can see the beginnings of a pandas-like DataFrame interface.

>>> df = FakeDataFrame({
...     'a': [4, 5, 6],
...     'b': [7, 8, 9],
... })
>>> df
    a |     b
-------------
    4 |     7
    5 |     8
    6 |     9
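The body of __repr__ is elided above, so the following is only a guess at how such a table could be drawn, not the original implementation. The render_table helper, its parameters, and the choice to show missing indices as NaN are all assumptions made for this sketch.

```python
from typing import Any, Dict


def render_table(columns: Dict[str, Dict[int, Any]], length: int, width: int = 5) -> str:
    """Hypothetical helper: draw right-aligned columns separated by ' | '."""
    # one right-aligned header cell per column name
    headers = ' | '.join(f'{name:>{width}}' for name in columns)
    divider = '-' * len(headers)
    # one line per index; an index missing from a column is shown as NaN
    rows = tuple(
        ' | '.join(f"{str(data.get(i, 'NaN')):>{width}}" for data in columns.values())
        for i in range(length)
    )
    return '\n'.join((headers, divider) + rows) + '\n'


table = render_table({'a': {0: 4, 1: 5, 2: 6}, 'b': {0: 7, 1: 8, 2: 9}}, 3)
print(table)
```

A __repr__ built along these lines would pass each series' data dict and self.length to the helper.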

Now the clever stuff begins. Let's add two methods to FakeDataFrame so that we can retrieve and set its Series.

    # handle []
    def __getitem__(self, key: str) -> FakeSeries:
        return self.series_map[key]

    # handle [] =
    def __setitem__(self, key: str, value: FakeSeries) -> None:
        if key not in self.series_map:
            self.series_map[key] = FakeSeries(key, {})
        for i, v in value.data.items():
            self[key].data[i] = v
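If the link between square brackets and these two methods feels a bit magic, this small sketch may help. Recorder is a hypothetical class, not part of the FakeDataFrame code; it just logs which method Python calls for each use of [].

```python
from typing import Any, Dict, List, Tuple


class Recorder:
    """Hypothetical container that logs every [] access."""

    def __init__(self) -> None:
        self.items: Dict[str, Any] = {}
        self.calls: List[Tuple[str, str]] = []

    def __getitem__(self, key: str) -> Any:
        self.calls.append(('get', key))
        return self.items[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.calls.append(('set', key))
        self.items[key] = value


r = Recorder()
r['x'] = 1       # Python calls r.__setitem__('x', 1)
value = r['x']   # Python calls r.__getitem__('x')
print(r.calls)
```

The same translation applies to df['b'] and df['b'] = ... on our FakeDataFrame.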

Let's retrieve a series.

>>> df['b']
<FakeSeries: b {0: 7, 1: 8, 2: 9}>

And let's set one.

>>> df['b'] = FakeSeries("not_b", {1: 'foo', 2: 'bar'})
>>> df
    a |     b
-------------
    4 |     7
    5 |   foo
    6 |   bar

Note that the name of the series didn't need to match "b", and that we were able to assign to series b at only indices 1 and 2.

Now to add some more smarts to our FakeSeries.

    # handle *
    def __mul__(self, other: int) -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v * other for i, v in self.data.items()},
        )

    # handle >
    def __gt__(self, other: int) -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v > other for i, v in self.data.items()},
        )

    # handle []
    def __getitem__(self, key: 'FakeSeries') -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v for i, v in self.data.items() if key.data.get(i, False)},
        )

__mul__ takes an integer and returns a new FakeSeries with each of the values multiplied by it
__gt__ takes an integer and returns a new FakeSeries of booleans, each saying whether the value at that index is greater than it
__getitem__ takes another FakeSeries called key and returns a new FakeSeries containing only the values whose index maps to a truthy value in key (indices missing from key are dropped)
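The filtering rule in __getitem__ is the least obvious of the three, so here it is on plain dicts. The filter_by_mask function and the sample values are invented for this sketch; only the comprehension mirrors the method above.

```python
from typing import Any, Dict


def filter_by_mask(data: Dict[int, Any], mask: Dict[int, bool]) -> Dict[int, Any]:
    """Keep the entries of data whose index maps to a truthy value in mask."""
    return {i: v for i, v in data.items() if mask.get(i, False)}


values = {0: 4, 1: 5, 2: 6}
full_mask = {0: False, 1: True, 2: True}
partial_mask = {2: True}  # indices 0 and 1 are missing, so they count as False

kept_full = filter_by_mask(values, full_mask)
kept_partial = filter_by_mask(values, partial_mask)
print(kept_full, kept_partial)
```

Using mask.get(i, False) rather than mask[i] is what lets a mask cover only some of the indices without raising a KeyError.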

We can now do some super pandas-y stuff. Let's remind ourselves of the DataFrame we're working with (with column b back to its original values).

    a |     b
-------------
    4 |     7
    5 |     8
    6 |     9

>>> df['b'] > 7
<FakeSeries: b {0: False, 1: True, 2: True}>

>>> df['a'][df['b'] > 7]
<FakeSeries: a {1: 5, 2: 6}>

And to put it all together.

>>> df['mult'] = df['a'][df['b'] > 7] * 2
>>> df
    a |     b |  mult
---------------------
    4 |     7 |   NaN
    5 |     8 |    10
    6 |     9 |    12
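To make that one-liner less mysterious, here is a sketch that spells it out as explicit magic-method calls. The classes are condensed copies of the ones above with __repr__ left out, and the intermediate variable names are made up for this example.

```python
from typing import Any, Dict, List


class FakeSeries:
    def __init__(self, name: str, data: Dict[int, Any]):
        self.name = name
        self.data = data

    def __mul__(self, other: int) -> 'FakeSeries':
        return FakeSeries(self.name, {i: v * other for i, v in self.data.items()})

    def __gt__(self, other: int) -> 'FakeSeries':
        return FakeSeries(self.name, {i: v > other for i, v in self.data.items()})

    def __getitem__(self, key: 'FakeSeries') -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v for i, v in self.data.items() if key.data.get(i, False)},
        )


class FakeDataFrame:
    def __init__(self, d: Dict[str, List[Any]]):
        self.series_map = {
            k: FakeSeries(k, dict(enumerate(values))) for k, values in d.items()
        }

    def __getitem__(self, key: str) -> FakeSeries:
        return self.series_map[key]

    def __setitem__(self, key: str, value: FakeSeries) -> None:
        if key not in self.series_map:
            self.series_map[key] = FakeSeries(key, {})
        for i, v in value.data.items():
            self[key].data[i] = v


df = FakeDataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})

# df['mult'] = df['a'][df['b'] > 7] * 2, one magic-method call at a time
col_a = df.__getitem__('a')              # df['a']
mask = df.__getitem__('b').__gt__(7)     # df['b'] > 7
selected = col_a.__getitem__(mask)       # df['a'][mask]
doubled = selected.__mul__(2)            # selected * 2
df.__setitem__('mult', doubled)          # df['mult'] = doubled
print(df['mult'].data)
```

Index 0 never makes it into mult because the mask is False there, which is why the table shows NaN in that row.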

Pretty cool, huh!