Gotchas where NumPy differs from straight Python?

Because ndarray.__eq__ returns an array of bools rather than a single bool, storing numpy arrays in any kind of container breaks that container's equality test unless you write a container-specific workaround.

Example:

>>> import numpy
>>> a = numpy.array(range(3))
>>> b = numpy.array(range(3))
>>> a == b
array([ True,  True,  True], dtype=bool)
>>> x = (a, 'banana')
>>> y = (b, 'banana')
>>> x == y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This is a horrible problem. For example, unittests cannot use TestCase.assertEqual() on containers that hold arrays, so you must write custom comparison functions instead. Suppose we write a workaround function special_eq_for_numpy_and_tuples. Now we can do this in a unittest:

x = (array1, deserialized)
y = (array2, deserialized)
self.assertTrue( special_eq_for_numpy_and_tuples(x, y) )

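Now we must do this for every container type we might use to store numpy arrays. Furthermore, when the shapes do not match, __eq__ might return a single bool rather than an array of bools (this is the behaviour of the numpy version used here; other releases may warn or raise instead):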

>>> a = numpy.array(range(3))
>>> b = numpy.array(range(5))
>>> a == b
False

Now each of our container-specific equality comparison functions must also handle that special case.
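One possible shape for such a workaround is sketched below. The function name comes from the text above, but the body is only an assumption about how it could be written: it leans on numpy.array_equal, which returns a single bool and treats a shape mismatch as "not equal", and it recurses into tuples and lists only.

```python
import numpy


def special_eq_for_numpy_and_tuples(x, y):
    """Illustrative helper: equality that tolerates numpy arrays nested in tuples/lists."""
    if isinstance(x, numpy.ndarray) or isinstance(y, numpy.ndarray):
        # array_equal collapses the comparison to one bool and
        # returns False when the shapes differ.
        return bool(numpy.array_equal(x, y))
    if isinstance(x, (tuple, list)) and isinstance(y, (tuple, list)):
        return (type(x) is type(y)
                and len(x) == len(y)
                and all(special_eq_for_numpy_and_tuples(p, q)
                        for p, q in zip(x, y)))
    # Anything else falls back to ordinary equality.
    return x == y


a = numpy.array(range(3))
b = numpy.array(range(3))
c = numpy.array(range(5))

same = special_eq_for_numpy_and_tuples((a, 'banana'), (b, 'banana'))
different_shape = special_eq_for_numpy_and_tuples((a, 'banana'), (c, 'banana'))
```

Dicts, sets and user-defined containers would each still need their own branch, which is exactly the maintenance burden described above.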

Maybe we can patch over this wart with a subclass?

>>> class SaneEqualityArray(numpy.ndarray):
...   def __eq__(self, other):
...     # True only for same type, same shape, and all elements equal
...     return (isinstance(other, SaneEqualityArray)
...             and self.shape == other.shape
...             and bool(numpy.ndarray.__eq__(self, other).all()))
... 
>>> a = SaneEqualityArray( (2, 3) )
>>> a.fill(7)
>>> b = SaneEqualityArray( (2, 3) )
>>> b.fill(7)
>>> a == b
True
>>> x = (a, 'banana')
>>> y = (b, 'banana')
>>> x == y
True
>>> c = SaneEqualityArray( (7, 7) )
>>> c.fill(7)
>>> a == c
False

That seems to do the right thing. The class should also explicitly export elementwise comparison, since that is often useful.
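As a rough sketch of what "explicitly export elementwise comparison" could look like: the method name elementwise_eq is made up for illustration, and the __ne__ override is an addition, included on the assumption that != should stay consistent with the new ==.

```python
import numpy


class SaneEqualityArray(numpy.ndarray):
    def __eq__(self, other):
        # Whole-array equality: same type, same shape, all elements equal.
        return (isinstance(other, SaneEqualityArray)
                and self.shape == other.shape
                and bool(self.elementwise_eq(other).all()))

    def __ne__(self, other):
        # Without this, != would still be ndarray's elementwise version.
        return not self.__eq__(other)

    def elementwise_eq(self, other):
        # Hypothetical name: the old elementwise ==, returned as a plain ndarray.
        return numpy.equal(numpy.asarray(self), numpy.asarray(other))


a = numpy.full((2, 3), 7).view(SaneEqualityArray)
b = numpy.full((2, 3), 7).view(SaneEqualityArray)

whole = (a == b)                 # a single bool
per_element = a.elementwise_eq(b)  # a (2, 3) array of bools
```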

The biggest gotcha for me was that almost every standard operator is overloaded to distribute across the array.

Define a list and an array:

>>> l = list(range(10))
>>> l
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> import numpy
>>> a = numpy.array(l)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Multiplication duplicates the python list, but distributes over the numpy array

>>> l * 2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> a * 2
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Adding a scalar and dividing by one are not defined on python lists, but both distribute over the numpy array

>>> l + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
>>> a + 2
array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> l / 2.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'list' and 'float'
>>> a / 2.0
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5])

When one operand is an array, numpy converts a list operand to an array and works elementwise

>>> a + a
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
>>> a + l
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
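A small script version of the same point, as a sketch: the variable names are chosen here for illustration. It contrasts list + list with the mixed cases, and shows that the operand order does not matter once an array is involved.

```python
import numpy

l = list(range(10))
a = numpy.array(l)

concatenated = l + l   # two plain lists: concatenation, 20 items
summed = a + l         # array on the left: the list is converted, elementwise sum
also_summed = l + a    # list on the left: the array still takes over the operation

print(len(concatenated))            # 20
print(type(also_summed).__name__)   # ndarray
```

So the "sometimes" really means "whenever at least one operand is already an array"; two bare lists keep their ordinary Python semantics.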

I think this one is funny:

>>> import numpy as n
>>> a = n.array([[1,2],[3,4]])
>>> a[1], a[0] = a[0], a[1]
>>> a
array([[1, 2],
       [1, 2]])

For Python lists on the other hand this works as intended:

>>> b = [[1,2],[3,4]]
>>> b[1], b[0] = b[0], b[1]
>>> b
[[3, 4], [1, 2]]

Funny side note: numpy itself had a bug in the shuffle function, because it used that notation 🙂 (see here).

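The reason is that a[0] and a[1] are views into the array, not copies. The right-hand side builds a tuple of two views, the first assignment overwrites row 1 with the values of row 0 in place, and by the time row 0 is assigned, the view of row 1 already reads [1, 2].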
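Two ways to swap rows that avoid the problem are sketched below; both work by forcing a copy of the right-hand side before anything is written. The variable names are only for illustration.

```python
import numpy

a = numpy.array([[1, 2], [3, 4]])

# Basic indexing returns a view that shares memory with the array.
row_is_view = numpy.shares_memory(a[0], a)

# Option 1: index with a list. This kind of indexing returns a copy,
# so the right-hand side is fully evaluated before the assignment.
a[[0, 1]] = a[[1, 0]]

# Option 2: keep the tuple-swap notation but copy the rows explicitly.
b = numpy.array([[1, 2], [3, 4]])
b[0], b[1] = b[1].copy(), b[0].copy()

print(a.tolist())   # [[3, 4], [1, 2]]
print(b.tolist())   # [[3, 4], [1, 2]]
```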
