Python’s
defaultdict
is an incredibly useful tool
to specify default values for missing keys
(RealPython has a
great primer
on them).
I found I was taking up so much memory to store the value 0 for an algorithm which
counted rather infrequent elements in a huge sequence split into many chunks.
Using a defaultdict(int)
for each distinct element
to store which indices had a non-zero count of the respective element
greatly optimised it,
but I yearned for a defaultlist
equivalent to standardise this approach and
generally make my code more expressive.
Humble beginnings
There are some opinionated decisions of what constitutes a “defaultlist”,
so whilst an
excellent implementation already exists by c0fec0de,
I disagreed with some behaviours that led me to implement my own.
My idea of a defaultlist
is to be a glorified wrapper of the defaultdict
.
We can use the
MutableSequence
abstract base class to mirror the interface one would expect from a builtin list
.
class defaultlist(MutableSequence):
def __getitem__(self, i):
...
def __setitem__(self, i, value):
...
def __delitem__(self, i):
...
def __len__(self):
...
def insert(self):
...
Firstly we can initialise an internal defaultdict
in self._ddict
,
taking a default_facotry
argument.
Each key of self._ddict
will represent an index of a list.
Note that defaultdict
does not require a default_factory
and will raise a KeyError
when you try to access a missing key in such scenarios.
For a list variant this behaviour would be nonsensical as
missing elements between existing ones really aren’t erroneous,
so we can just make default_factory
return None
if not specified.
def __init__(self, default_factory=None):
self._ddict = defaultdict(default_factory or defaultlist._none_factory)
@staticmethod
def _none_factory():
return None
The get and set item methods,
which determine the behaviour of square brackets notation (e.g. my_dlist[i]
),
can also directly interface with the internal defaultdict
we initialised.
This means that all elements of our defaultlist
will default
just like its dictionary counter-part.
def __getitem__(self, i):
return self._ddict[i]
def __setitem__(self, i, v):
self._ddict[i] = v
A delete item method
(called like del my_dlist[i]
)
requires a bit more work.
The list
builtin will reindex elements above the deleted index,
and so we can emulate that behaviour
by decrementing the keys of our internal defaultdict
.
Conversely an insert method can increment the keys of self._ddict
before safely assigning the value at i
.
def __delitem__(self, i):
# I decided to be agnostic if the element exists or not
try:
del self._ddict[i]
except KeyError:
pass
larger_keys = [k for k in self._ddict.keys() if k > i]
reindexed_subdict = {k - 1: self._ddict[k] for k in larger_keys}
for k in larger_keys:
del self._ddict[k]
self._ddict.update(reindexed_subdict)
def insert(self, i, value):
larger_keys = [k for k in self._ddict.keys() if k >= i]
reindexed_subdict = {k + 1: self._ddict[k] for k in larger_keys}
for k in larger_keys:
del self._ddict[k]
self._ddict.update(reindexed_subdict)
self._ddict[i] = value
As we’re treating keys in the self._ddict
as indexes,
we can treat the highest existing index as the deciding factor
for what constitutes the length
(e.g. result of len(my_dlist)
)
of our defaultlist
.
def __len__(self):
try:
return max(self._ddict.keys()) + 1
except ValueError:
return 0
We are currently assuming positive indices are being passed,
but list
does take in negative indices
that translate to the length of the list n
minus (or rather plus) the negative index.
We can support this behaviour by inspecting and modifying the passed index
before we actually use it on our internal defaultdict
.
def _actualise_index(self, i):
if i >= 0:
return i
else:
n = len(self)
if i >= -n:
return n + i
else:
raise IndexError("negative list index larger than list length, unresolvable")
def __getitem__(self, i):
i = self._actualise_index(i)
return self._ddict[i]
def __setitem__(self, i, v):
i = self._actualise_index(i)
self._ddict[i] = v
def __delitem__(self, key: Union[int, slice]):
i = self._actualise_index(key)
try:
del self._ddict[i]
except KeyError:
pass
...
As we’ve successfully overridden the MutableSequence
interface,
it will conveniently deal with second-order features of a list
such as iteration (for x in my_dlist
)
and appending (my_dlist.append(x)
).
Bounding infinity
Currently, iterating over a defaultlist
will go on forever.
Iteration of an object is determined by it’s __iter__()
method,
and inheriting MutableSequence
comes with such a method
which will __getitem__(i)
from 0 until an IndexError
is raised.
As defaultlist
is an infinite field of elements,
no such error is ever raised.
I decided its iteration capabilities should be bounded to the length property (defined earlier).
def __iter__(self):
for k in range(len(self)):
yield self._ddict[k]
This provides an easy way for user to specify iteration bounds
using slice notation.
Say you were implementing an algorithm that divvied-up a sequence into 100 chunks
and counted in each chunk using counts = defaultlist(int)
(i.e. default the count to 0),
then you can specify the first 100 elements of counts
when it’s time for a for loop.
for count in counts[:100]:
...
…that is, if slicing is supported.
We can actually detect the stop:start:step
notation—which
is infact a builtin
slice
object—in
our __getitem__()
method
by inspecting the passed key
with an instance check (i.e. isinstance()
).
A builtin list
handles slices as a range bounded by the list’s length—for
the most part,
check out my
list reference implementation
for a complete specification of how list
handles slicing.
We however don’t need to worry about going out-of-bounds
as the defaultlist
is essentially an infinite field of elements.
All we need to do then is handle None
values
before making a 1-to-1 translation from slice
to the range
builtin.
def __getitem__(self, key):
if isinstance(key, int):
i = self._actualise_index(key)
return self._ddict[i]
elif isinstance(key, slice):
start = key.start or 0
stop = key.start or 0
step = key.step or 1
if key.start is None:
start = 0 if step > 0 else len(self)
else:
start = self._actualise_index(key.start)
if key.stop is None:
stop = len(self) if step > 0 else 0
else:
stop = self._actualise_index(key.stop)
srange = range(start, stop, step)
dlist = defaultlist(self._ddict.default_factory)
dlist.extend(self._ddict[i] for i in srange)
return dlist
At this point, we’ve got ourselves a pretty functional defaultlist
!
Filling in the gaps
Complete slicing support
We can also support slices for setting and deleting elements.
Like with the insert
method,
it’s simply a matter of reindexing the internal defaultdict
to accommodate for the addition and/or removal of existing elements
… but as you can see,
it’s a bit of a headache to keep track of things.
def _determine_srange(self, slice_):
step = slice_.step or 1
if slice_.start is None:
start = 0 if step > 0 else len(self)
else:
start = self._actualise_index(slice_.start)
if slice_.stop is None:
stop = len(self) if step > 0 else 0
else:
stop = self._actualise_index(slice_.stop)
return range(start, stop, step)
def __getitem__(self, key):
if isinstance(key, int):
i = self._actualise_index(key)
return self._ddict[i]
elif isinstance(key, slice):
srange = self._determine_srange(key)
dlist = defaultlist(self._ddict.default_factory)
dlist.extend(self._ddict[i] for i in srange)
return dlist
def __setitem__(self, key, value):
if isinstance(key, int):
i = self._actualise_index(key)
self._ddict[i] = value
elif isinstance(key, slice):
values = list(value) if isinstance(value, Iterable) else [value]
nvalues = len(values)
srange = self._determine_srange(key)
if srange:
srange_min = min(srange[0], srange[-1])
srange_max = max(srange[0], srange[-1])
else:
srange_min = srange_max = srange.start
for i in srange:
try:
del self._ddict[i]
except KeyError:
pass
diff = nvalues - len(srange)
if diff:
keys = list(self._ddict.keys())
threshold = srange_min + nvalues
reindexed_subdict = {}
mid_keys = [k for k in keys if srange_min <= k < srange_max]
for i, k in enumerate(sorted(mid_keys), start=threshold):
reindexed_subdict[i] = self._ddict[k]
larger_keys = [k for k in self._ddict.keys() if k > srange_max]
for k in larger_keys:
i = k + diff
reindexed_subdict[i] = self._ddict[k]
self._ddict.update(reindexed_subdict)
for i, v in enumerate(values, start=srange_min):
self[i] = v
def __delitem__(self, key):
if isinstance(key, int):
i = self._actualise_index(key)
try:
del self._ddict[i]
except KeyError:
pass
larger_keys = [k for k in self._ddict.keys() if k > i]
reindexed_subdict = {k - 1: self._ddict[k] for k in larger_keys}
for k in larger_keys:
del self._ddict[k]
self._ddict.update(reindexed_subdict)
elif isinstance(key, slice):
srange = self._determine_srange(key)
if srange:
for i in srange:
try:
del self._ddict[i]
except KeyError:
pass
srange_min = min(srange[0], srange[-1])
srange_max = max(srange[0], srange[-1])
reindexed_subdict = {}
mid_keys = [k for k in self._ddict.keys() if srange_min <= k < srange_max]
for i, k in enumerate(values, start=srange_min):
reindexed_subdict[i] = self._ddict[k]
larger_keys = [k for k in self._ddict.keys() if k > srange_max]
for k in larger_keys:
i = k - len(srange)
reindexed_subdict[i] = self._ddict[k]
for k in chain(mid_keys, larger_keys):
del self._ddict[k]
self._ddict.update(reindexed_subdict)
Equality checks
We might also want to emulate how list
can be equality checked (i.e. a == b
notation)
with other list
objects.
Equality check behaviour is specified
in the __eq__(self, other)
dunder method
of the first instance (i.e. a
in a == b
).
Rather than caring if a defaultlist
is being checked with another defaultlist
however,
I found broadly check against all
Sequence
-like types
to make it much more versatile.
def __eq__(self, other):
if isinstance(other, Sequence):
if len(self) != len(other):
return False
else:
return all(a == b for a, b in zip(self, other))
else:
return False
Note that zip(self, other)
is using the __iter__()
methods
of both the defaultlist
instance and other
.
This means you can check if an iteration of a defaultlist
would actually match up with your expectations
using the nice brackets notation that Python uses for its list
.
>>> my_dlist = defaultlist(int)
>>> my_dlist[2] = 2
>>> my_dlist == [0, 0, 2]
True
Finding first instances
MutableSequence
also adds the
index()
method to our defaultlist()
automatically,
but like with __iter__()
it expects an IndexError
from our get method—as
this never comes to fruition,
index()
can go on forever.
All this means is we need to override index()
to bound the iteration to it’s pseudo-length.
Like with our __eq__()
method above,
note that enumerate(self)
is calling __iter__()
.
def index(self, x):
for i, v in enumerate(self):
if v == x:
return i
else:
raise ValueError(f"'{x}' is not in defaultlist")
This also fixes the automatic remove()
method too,
as that relies upon index()
to find the i
to then delete.
Fin
A hopefully production-ready defaultlist
is available in my coinflip library,
available in the coinflip.collections.defaultlist
namespace.
Its source code is available on
GitHub
(and don’t forget tests).
Huge thanks to c0fec0de for their defaultlist
implementation,
which really helped me think deeply about what I wanted the API to be for my own take.
Also I suppose a shout out to Guido, who I think implemented the much beloved defaultdict
. Maybe it was Raymond Hettinger…?
I couldn’t work this one out—git blame and all—so if anyone knows this then please let me know!