2019-11-13
When you're used to plain ol' dicts, ints, lists etc., pandas.DataFrames exhibit some weirdo behaviour, particularly concerning assignment and operators. This page is a short walk-through of how some of these things happen (and a quick intro to Python's magic methods). You can see the outcome here.
Disclaimer: the things presented here are not exactly how pandas DataFrames work; they are intended more as a guide to how they do.
The examples below use Python type hints to help keep things a bit clearer. At the top, we have:
from typing import Any, Dict, List
DataFrames are a collection of Series (AKA columns). Let's start with a really dumb FakeSeries.
class FakeSeries:
    def __init__(self, name: str, data: Dict[int, Any]):
        self.name = name
        self.data = data

    def __repr__(self) -> str:
        return f'<FakeSeries: {self.name} {self.data}>'
>>> my_series = FakeSeries("some_column_name", {0: 5, 1: 7, 2: 9})
>>> my_series
<FakeSeries: some_column_name {0: 5, 1: 7, 2: 9}>
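The __repr__ method controls how the object is displayed at the interactive prompt; print() also falls back to it when no __str__ is defined.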
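A quick sketch of how __repr__ and __str__ interact. The two class names below are made up for illustration and are not part of the walk-through.

```python
class ReprOnly:
    def __repr__(self) -> str:
        return '<ReprOnly>'


class ReprAndStr:
    def __repr__(self) -> str:
        return '<ReprAndStr>'

    def __str__(self) -> str:
        return 'friendly text'


print(ReprOnly())          # print() uses str(), which falls back to __repr__
print(ReprAndStr())        # __str__ wins when it is defined
print(repr(ReprAndStr()))  # repr(), like the >>> prompt, always uses __repr__
```

FakeSeries only defines __repr__, so both the prompt and print() show the same text.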
Now we will define our FakeDataFrame. It similarly has a useful __init__ and __repr__ (although the latter is only fully fleshed out in the original). On initialisation, it sets self.series_map, which is a map of series names to series.
class FakeDataFrame:
    def __init__(self, d: Dict[str, List[Any]]):
        self.series_map = {
            k: FakeSeries(k, {i: v for i, v in enumerate(l)})
            for k, l in d.items()
        }
        self.length = len(list(d.values())[0])

    def __repr__(self):
        width = 5
        ...
        return '\n'.join((headers, divider) + rows) + '\n'
Already, we can see the beginnings of a pandas-like DataFrame interface.
>>> df = FakeDataFrame({
...     'a': [4, 5, 6],
...     'b': [7, 8, 9],
... })
>>> df
a | b
-------------
4 | 7
5 | 8
6 | 9
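The body of __repr__ is elided in this walk-through. As a rough guess at how such a table could be rendered (not the original code), here is a standalone sketch. The helper name render_table, the right-justified cells and the 'NaN' placeholder for missing indices are all assumptions.

```python
from typing import Any, Dict


def render_table(columns: Dict[str, Dict[int, Any]], length: int, width: int = 5) -> str:
    """Hypothetical helper: render {column name: {index: value}} as a text table."""
    names = list(columns)
    headers = ' | '.join(name.rjust(width) for name in names)
    divider = '-' * len(headers)
    rows = [
        # Any index a column lacks is shown as 'NaN' (an assumption).
        ' | '.join(str(columns[name].get(i, 'NaN')).rjust(width) for name in names)
        for i in range(length)
    ]
    return '\n'.join([headers, divider] + rows) + '\n'


print(render_table({'a': {0: 4, 1: 5, 2: 6}, 'b': {0: 7, 1: 8, 2: 9}}, 3))
```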
Now the clever stuff begins. Let's add two methods to FakeDataFrame so that we can retrieve and set its Series.
    # handle []
    def __getitem__(self, key: str) -> FakeSeries:
        return self.series_map[key]

    # handle [] =
    def __setitem__(self, key: str, value: FakeSeries) -> None:
        if key not in self.series_map:
            self.series_map[key] = FakeSeries(key, {})
        for i, v in value.data.items():
            self[key].data[i] = v
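To see which magic method each use of square brackets triggers, here is a small standalone sketch. The Recorder class is hypothetical and exists only to log the calls.

```python
from typing import Any, Dict, List


class Recorder:
    """Toy container that records which magic method each [] usage triggers."""

    def __init__(self) -> None:
        self.store: Dict[str, Any] = {}
        self.calls: List[str] = []

    def __getitem__(self, key: str) -> Any:
        self.calls.append(f'__getitem__({key!r})')
        return self.store[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.calls.append(f'__setitem__({key!r}, {value!r})')
        self.store[key] = value


r = Recorder()
r['b'] = 1      # same as r.__setitem__('b', 1)
value = r['b']  # same as r.__getitem__('b')
print(r.calls)
```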
Let's retrieve a series.
>>> df['b']
<FakeSeries: b {0: 7, 1: 8, 2: 9}>
And let's set one.
>>> df['b'] = FakeSeries("not_b", {1: 'foo', 2: 'bar'})
>>> df
a | b
-------------
4 | 7
5 | foo
6 | bar
Note that the name of the series didn't need to match "b", and that we were able to assign to series b at only indices 1 and 2.
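The partial assignment comes down to the per-index loop in __setitem__. A minimal sketch of that merge on plain dicts follows. The function name merge_by_index is hypothetical, and unlike the method it copies rather than mutating in place.

```python
from typing import Any, Dict


def merge_by_index(existing: Dict[int, Any], incoming: Dict[int, Any]) -> Dict[int, Any]:
    """Hypothetical helper mirroring the loop in FakeDataFrame.__setitem__."""
    merged = dict(existing)  # copy, so the input dict is left untouched
    for i, v in incoming.items():
        merged[i] = v        # overwrite only the indices that were supplied
    return merged


print(merge_by_index({0: 7, 1: 8, 2: 9}, {1: 'foo', 2: 'bar'}))
# {0: 7, 1: 'foo', 2: 'bar'}
```

This is one of the places where the fake diverges from the real thing. Real pandas also aligns on the index, but it replaces the whole column, so index 0 would come out as NaN rather than keeping 7.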
Now to add some more smarts to our FakeSeries.
    # handle *
    def __mul__(self, other: int) -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v * other for i, v in self.data.items()},
        )

    # handle >
    def __gt__(self, other: int) -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v > other for i, v in self.data.items()},
        )

    # handle []
    def __getitem__(self, key: 'FakeSeries') -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v for i, v in self.data.items() if key.data.get(i, False)},
        )
__mul__ takes an integer and returns a new FakeSeries with each of the values multiplied by it.
__gt__ takes an integer and returns a new FakeSeries of booleans, each saying whether the value at that index is greater than it.
__getitem__ takes another FakeSeries called key and returns a new FakeSeries containing only the values whose index maps to a truthy value in key (indices missing from key are dropped).
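A small sketch of the operator dispatch on its own. MiniSeries is a cut-down, hypothetical stand-in for FakeSeries, kept short so the block runs by itself.

```python
from typing import Any, Dict


class MiniSeries:
    """Cut-down stand-in for FakeSeries, just enough to show operator dispatch."""

    def __init__(self, data: Dict[int, Any]):
        self.data = data

    def __mul__(self, other: int) -> 'MiniSeries':
        return MiniSeries({i: v * other for i, v in self.data.items()})

    def __gt__(self, other: int) -> 'MiniSeries':
        return MiniSeries({i: v > other for i, v in self.data.items()})


s = MiniSeries({0: 7, 1: 8, 2: 9})
doubled = s * 2  # same as s.__mul__(2)
mask = s > 7     # same as s.__gt__(7)
print(doubled.data, mask.data)

# With the series on the right-hand side Python looks for __rmul__,
# which this class does not define, so 2 * s raises TypeError.
try:
    2 * s
except TypeError:
    reversed_supported = False
else:
    reversed_supported = True
```

Only the methods a class actually defines take part: each operator maps to exactly one magic method name on the left-hand operand, with a reflected variant tried on the right-hand operand as a fallback.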
We can now do some super pandas-y stuff. Let's remind ourselves of the DataFrame we're working with (recreated from the original data, before the assignment to b above).
a | b
-------------
4 | 7
5 | 8
6 | 9
>>> df['b'] > 7
<FakeSeries: b {0: False, 1: True, 2: True}>
>>> df['a'][df['b'] > 7]
<FakeSeries: a {1: 5, 2: 6}>
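The filtering step rests on key.data.get(i, False), which treats any index missing from the mask as False. A sketch on plain dicts, with the hypothetical helper name select_by_mask:

```python
from typing import Any, Dict


def select_by_mask(data: Dict[int, Any], mask: Dict[int, bool]) -> Dict[int, Any]:
    """Hypothetical helper: keep values whose index is truthy in mask."""
    return {i: v for i, v in data.items() if mask.get(i, False)}


values = {0: 4, 1: 5, 2: 6}
full = select_by_mask(values, {0: False, 1: True, 2: True})
partial = select_by_mask(values, {2: True})  # indices 0 and 1 count as False
print(full, partial)
```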
And to put it all together.
>>> df['mult'] = df['a'][df['b'] > 7] * 2
>>> df
a | b | mult
---------------------
4 | 7 | NaN
5 | 8 | 10
6 | 9 | 12
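To make the chain explicit, here is the same one-liner unrolled into one magic-method call per step. The classes are condensed copies of the ones in this walk-through (no __repr__, and the intermediate variable names are mine), so the block runs on its own.

```python
from typing import Any, Dict, List


class FakeSeries:
    def __init__(self, name: str, data: Dict[int, Any]):
        self.name = name
        self.data = data

    def __mul__(self, other: int) -> 'FakeSeries':
        return FakeSeries(self.name, {i: v * other for i, v in self.data.items()})

    def __gt__(self, other: int) -> 'FakeSeries':
        return FakeSeries(self.name, {i: v > other for i, v in self.data.items()})

    def __getitem__(self, key: 'FakeSeries') -> 'FakeSeries':
        return FakeSeries(
            self.name,
            {i: v for i, v in self.data.items() if key.data.get(i, False)},
        )


class FakeDataFrame:
    def __init__(self, d: Dict[str, List[Any]]):
        self.series_map = {
            k: FakeSeries(k, dict(enumerate(values))) for k, values in d.items()
        }

    def __getitem__(self, key: str) -> FakeSeries:
        return self.series_map[key]

    def __setitem__(self, key: str, value: FakeSeries) -> None:
        if key not in self.series_map:
            self.series_map[key] = FakeSeries(key, {})
        for i, v in value.data.items():
            self[key].data[i] = v


df = FakeDataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})

# df['mult'] = df['a'][df['b'] > 7] * 2, one call at a time:
mask = df.__getitem__('b').__gt__(7)              # df['b'] > 7
selected = df.__getitem__('a').__getitem__(mask)  # df['a'][mask]
doubled = selected.__mul__(2)                     # selected * 2
df.__setitem__('mult', doubled)                   # df['mult'] = doubled

print(df['mult'].data)  # {1: 10, 2: 12}
```

Index 0 never makes it into the new series, which is why the table shows NaN in that row.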
Pretty cool, huh!