Simplify data binning with a custom dict


In data science one often needs to bin continuous data together to generalise noisy observations. Histograms are the prime example of data binning, allowing you to quickly identify patterns in data.

The particulars of how one groups these bins can vary. For the cases when you’re just using central values to be ceiled/floored to, I’ve come up with an elegant and performative solution using Python’s container data model.

Containers are the objects that store stuff and use the square brackets [] notation for access:

  • Elements of a list my_list can be accessed and modified via an index i with a my_list[i] statement.

  • Values of a dictionary my_dict can be accessed and mofied via a key k with a my_dict[k] statement.

I’ve created a custom container type Bins to abstract data bins. Elements of the data bins my_bins can be accessed and modified via any real number n with a my_bins[n]. If n is not an actual key of my_bins, then it will be rounded to closest actual key.

Bins is initialised with the desired intervals. Each interval is essentially a key paired up with a count, where the count starts at 0.

>>> bins = Bins([-6, -3, 0, 3, 6])
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 0, 6: 0}
>>> bins[3] += 1                    # n = 3
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 1, 6: 0}
>>> bins[7] += 1                    # n = 6
>>> bins[11] += 1                   # n = 6
>>> bins[6.5] += 1                  # n = 6
>>> bins
{-6: 0, -3: 0, 0: 0, 3: 1, 6: 3}
>>> bins[-1000000] += 1             # n = -6
>>> bins
{-6: 1, -3: 0, 0: 0, 3: 1, 6: 3}
>>> bins[0.5] += 1                  # n = 0
>>> bins
{-6: 1, -3: 0, 0: 1, 3: 1, 6: 3}

As you can see, this feels an awful lot like a dictionary —infact, Bins inherits the abstract base class MutableMapping to mirror the interface one would expect from a dict. A minimal subclass would require the following methods to be overridden.

class Bins(MutableMapping):
    def __init__(self, *args, **kwargs):
        ...

    def __getitem__(self, key):
        ...

    def __setitem__(self, key, value):
        ...

    def __delitem__(self, key):
        ...

    def __iter__(self):
        ...
    
    def __len__(self):
        ...

First we want to initialise Bins with the the desired intervals. We can can store this as in an internal dict that will hold the contents of Bins.

    def __init__(self, intervals):
        empty_bins = {interval: 0 for interval in intervals}
        self._dict = empty_bins

The __getitem__() and __setitem__() methods define the behaviour of using the square brackets [] notation. We want to intersect key and round it to the closest interval, before applying a valid key (an interval) to our internal dictionary.

    def __getitem__(self, key):
        interval = self._roundkey(key)
        return self._dict[interval]

    def __setitem__(self, key, value):
        interval = self._roundkey(key)
        self._dict[interval] = value
        
    def _roundkey(self, key):
        intervals = list(self._dict.keys())
        minkey = intervals[0]
        midkeys = intervals[1:-1]
        maxkey = intervals[-1]

        if key <= minkey:
            return minkey
        elif key >= maxkey:
            return maxkey
        elif key in midkeys:
            return key
        else:
            i = bisect_left(intervals, key)
            leftkey = intervals[i - 1]
            rightkey = intervals[i]

            if abs(leftkey - key) < abs(rightkey - key):
                return leftkey
            else:
                return rightkey

As you can see, the method _roundkey() rounds the key to the closest interval. We check whether key is an actual interval first, before using Python’s bisect_left to find the ceil and floor intervals relative to key and return the nearest one.

And what do we have to do to get inplace operators such as += working? Nothing! All they’re doing is retrieving a value from the passed key via __getitem__(), applying an operation to value, and assigning the new value to the key via __setitem__().

So if you just have __delitem__(), __iter__() and __len__() directly interface with self._dict, you’ll have a working Bins of your own!

Production readiness

But I’m not quite happy with this yet. First of, if we passed intervals which are out-of-order, then we completely screw up how _roundkey() works in finding the closest interval. We can just sort the intervals ourselves at initialisation.

    def __init__(self, intervals):
        empty_bins = {interval: 0 for interval in sorted(intervals)}
        self._dict = empty_bins

Before Python 3.7 this wouldn’t be ideal as dictionaries did not guarantee insertion order. They do now, however we can still end up with an unordered self._dict if someone used the update() method. I can’t forsee a situation someone would want to do this, but update() is used in data serialisation and anywho is exposed by MutableMapping, so it’s a good idea to update our internal dictionary’s order as well.

Thankfully we have the SortedDict from the sortedcontainers package which will guantee the dictionary’s keys will always be sorted.

    def __init__(self, intervals):
        empty_bins = {interval: 0 for interval in intervals}
        self._sdict = SortedDict(empty_bins)

We can also cache the _roundkey() result of frequently passed keys. I found in some situations I was using Bins in an algorithm that this could drastically improve performance, as rounding a float-type key to an interval is a bit expensive but the exact same key is being passed regularly.

Python’s @lru_cache() decorator makes this simple to implement. You can just pop it on a function with hashable arguments, and it will use a LRU cache to store the results of frequent function calls

As the act of rounding keys is only interested in the intervals of Bins, I created a static method (i.e. instance-agnostic) which takes only the intervals and passed key to find the closest interval, and is conveniently wrapped by _roundkey().

    @property
    def intervals(self):
        return tuple(self._sdict.keys())

    def _roundkey(self, key):
        return Bins._find_closest_interval(self.intervals, key)

    @staticmethod
    @lru_cache()
    def _find_closest_interval(intervals, key):
        minkey = intervals[0]
        midkeys = intervals[1:-1]
        maxkey = intervals[-1]

        if key <= minkey:
            return minkey
        elif key >= maxkey:
            return maxkey
        elif key in midkeys:
            return key
        else:
            i = bisect_left(intervals, key)
            leftkey = intervals[i - 1]
            rightkey = intervals[i]

            if abs(leftkey - key) < abs(rightkey - key):
                return leftkey
            else:
                return rightkey

Fin

I hope you’ve learnt a thing or two today, and maybe even have a new tool in your data science workbench.

I’ve exposed the Bins implementation I’ve been using in my randomness testing library coinflip, available in the coinflip.collections.Bins namespace. I believe it’s production ready, but possibly there are quirks I have yet to encounter and test for! The source code is available on GitHub (tests too).

A really cool project would be implementing interval ranges so that binning doesn’t have to rely on “central values”. This could be achieved very nicely with slices (e.g. the obj[a:b:c] syntax) —this is not unprecedented as pandas uses slices for expressing operations quite nicely.

For any Raymond Hettinger fans out there, you’ll know that every trick and tool I’m using here was heavily influenced by him. Thanks also to redditor ElevenPhonons for giving me some great feedback!