In data science one often needs to bin continuous data together to generalise noisy observations. Histograms are the prime example of data binning, allowing you to quickly identify patterns in data.
The particulars of how one groups these bins can vary. For the cases when you’re just using central values to be ceiled/floored to, I’ve come up with an elegant and performative solution using Python’s container data model.
Containers are the objects that store stuff and use the square brackets []
notation for access:
-
Elements of a list
my_list
can be accessed and modified via an indexi
with amy_list[i]
statement. -
Values of a dictionary
my_dict
can be accessed and mofied via a keyk
with amy_dict[k]
statement.
I’ve created a custom container type Bins
to abstract data bins.
Elements of the data bins my_bins
can be accessed and modified
via any real number n
with a my_bins[n]
.
If n
is not an actual key of my_bins
,
then it will be rounded to closest actual key.
Bins
is initialised with the desired intervals.
Each interval is essentially a key paired up with a count,
where the count starts at 0.
>>> bins = Bins([-6, -3, 0, 3, 6])
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 0, 6: 0}
>>> bins[3] += 1 # n = 3
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 1, 6: 0}
>>> bins[7] += 1 # n = 6
>>> bins[11] += 1 # n = 6
>>> bins[6.5] += 1 # n = 6
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 1, 6: 3}
>>> bins[-1000000] += 1 # n = -6
>>> bins
{-6: 1, -3: 0, 0: 0, 3: 1, 6: 3}
>>> bins[0.5] += 1 # n = 0
>>> bins
{-6: 1, -3: 0, 0: 1, 3: 1, 6: 3}
As you can see, this feels an awful lot like a dictionary
—infact, Bins
inherits the abstract base class MutableMapping
to mirror the interface one would expect from a dict
.
A minimal subclass would require the following methods to be overridden.
class Bins(MutableMapping):
def __init__(self, *args, **kwargs):
...
def __getitem__(self, key):
...
def __setitem__(self, key, value):
...
def __delitem__(self, key):
...
def __iter__(self):
...
def __len__(self):
...
First we want to initialise Bins
with the the desired intervals.
We can can store this as in an internal dict
that will hold the contents of Bins
.
def __init__(self, intervals):
empty_bins = {interval: 0 for interval in intervals}
self._dict = empty_bins
The __getitem__()
and __setitem__()
methods
define the behaviour of using
the square brackets []
notation.
We want to intersect key
and round it to the closest interval,
before applying a valid key (an interval) to our internal dictionary.
def __getitem__(self, key):
interval = self._roundkey(key)
return self._dict[interval]
def __setitem__(self, key, value):
interval = self._roundkey(key)
self._dict[interval] = value
def _roundkey(self, key):
intervals = list(self._dict.keys())
minkey = intervals[0]
midkeys = intervals[1:-1]
maxkey = intervals[-1]
if key <= minkey:
return minkey
elif key >= maxkey:
return maxkey
elif key in midkeys:
return key
else:
i = bisect_left(intervals, key)
leftkey = intervals[i - 1]
rightkey = intervals[i]
if abs(leftkey - key) < abs(rightkey - key):
return leftkey
else:
return rightkey
As you can see,
the method _roundkey()
rounds the key
to the closest interval.
We check whether key
is an actual interval first,
before using Python’s bisect_left
to find the ceil and floor intervals relative to key
and return the nearest one.
And what do we have to do to get inplace operators such as +=
working?
Nothing!
All they’re doing is retrieving a value from the passed key via __getitem__()
,
applying an operation to value,
and assigning the new value to the key via __setitem__()
.
So if you just have __delitem__()
, __iter__()
and __len__()
directly interface with self._dict
,
you’ll have a working Bins
of your own!
Production readiness
But I’m not quite happy with this yet.
First of, if we passed intervals which are out-of-order,
then we completely screw up how _roundkey()
works
in finding the closest interval.
We can just sort the intervals ourselves at initialisation.
def __init__(self, intervals):
empty_bins = {interval: 0 for interval in sorted(intervals)}
self._dict = empty_bins
Before Python 3.7 this wouldn’t be ideal as
dictionaries did not guarantee insertion order.
They do now, however we can still end up with an unordered self._dict
if someone used the update()
method.
I can’t forsee a situation someone would want to do this,
but update()
is used in data serialisation and anywho is exposed by MutableMapping
,
so it’s a good idea to update our internal dictionary’s order as well.
Thankfully we have the SortedDict
from the sortedcontainers
package
which will guantee the dictionary’s keys will always be sorted.
def __init__(self, intervals):
empty_bins = {interval: 0 for interval in intervals}
self._sdict = SortedDict(empty_bins)
We can also cache the _roundkey()
result of frequently passed keys.
I found in some situations I was using Bins
in an algorithm
that this could drastically improve performance,
as rounding a float
-type key to an interval
is a bit expensive
but the exact same key is being passed regularly.
Python’s @lru_cache()
decorator makes this simple to implement.
You can just pop it on a function with hashable arguments,
and it will use a LRU cache
to store the results of frequent function calls
As the act of rounding keys is only interested in the intervals of Bins
,
I created a static method (i.e. instance-agnostic)
which takes only the intervals and passed key
to find the closest interval,
and is conveniently wrapped by _roundkey()
.
@property
def intervals(self):
return tuple(self._sdict.keys())
def _roundkey(self, key):
return Bins._find_closest_interval(self.intervals, key)
@staticmethod
@lru_cache()
def _find_closest_interval(intervals, key):
minkey = intervals[0]
midkeys = intervals[1:-1]
maxkey = intervals[-1]
if key <= minkey:
return minkey
elif key >= maxkey:
return maxkey
elif key in midkeys:
return key
else:
i = bisect_left(intervals, key)
leftkey = intervals[i - 1]
rightkey = intervals[i]
if abs(leftkey - key) < abs(rightkey - key):
return leftkey
else:
return rightkey
Fin
I hope you’ve learnt a thing or two today, and maybe even have a new tool in your data science workbench.
I’ve exposed the Bins
implementation I’ve been using in my
randomness testing library coinflip,
available in the coinflip.collections.Bins
namespace.
I believe it’s production ready,
but possibly there are quirks I have yet to encounter and test for!
The source code is available on
GitHub
(tests too).
A really cool project would be implementing interval ranges
so that binning doesn’t have to rely on “central values”.
This could be achieved very nicely with slices
(e.g. the obj[a:b:c]
syntax)
—this is not unprecedented as pandas uses slices for
expressing operations quite nicely.
For any Raymond Hettinger fans out there, you’ll know that every trick and tool I’m using here was heavily influenced by him. Thanks also to redditor ElevenPhonons for giving me some great feedback!