
poor performance of sum with sub-machine-word integer types


poor performance of sum with sub-machine-word integer types

Zachary Pincus-2
Hello all,

As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes, and only when the dtype is smaller than the machine word. I checked in 32-bit and 64-bit modes, and in both cases the speed difference only went away once the dtype reached the machine-word size. See below...

Is this something to do with numpy, or something unavoidable about machine / memory architecture?

Zach

Timings -- 64-bit mode:
----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 2.57 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.75 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 6.37 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
100 loops, best of 3: 16.6 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 15.1 ms per loop



Timings -- 32-bit mode:
----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 138 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 3.68 ms per loop

In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.17 ms per loop

In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 22.4 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.2 ms per loop

In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
10 loops, best of 3: 29.2 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
10 loops, best of 3: 23.8 ms per loop

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: poor performance of sum with sub-machine-word integer types

Charles R Harris


On Tue, Jun 21, 2011 at 10:46 AM, Zachary Pincus <[hidden email]> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes, and only when the dtype is smaller than the machine word. I checked in 32-bit and 64-bit modes, and in both cases the speed difference only went away once the dtype reached the machine-word size. See below...
>
> Is this something to do with numpy, or something unavoidable about machine / memory architecture?


It's because of the type conversion sum uses by default for greater precision.

In [8]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop

In [9]: timeit i.sum(axis=-1, dtype=int8)
100 loops, best of 3: 16.2 ms per loop

If you have numpy 1.6, einsum is faster and also preserves the input type:

In [10]: timeit einsum('ijk->ij', i)
100 loops, best of 3: 5.95 ms per loop


We could probably make better loops for summing within kinds, i.e., accumulate in higher precision, then cast to specified precision.
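Chuck's two points above — that sum() upcasts by default, that an explicit dtype avoids it, and that one could accumulate wide and cast back — can be sketched in a few lines (a small illustrative array stands in for the (1024, 1024, 4) one used in the timings):

```python
import numpy as np

# Small stand-in for the (1024, 1024, 4) array used in the thread.
i = np.ones((4, 4, 4), dtype=np.int8)

# By default, sum() upcasts small integer dtypes to the default
# platform integer for extra precision -- this is the slow path.
assert i.sum(axis=-1).dtype.itemsize > i.dtype.itemsize

# Passing dtype= keeps the input type (and, per the timings above,
# most of the speed) -- at the price of possible overflow.
assert i.sum(axis=-1, dtype=np.int8).dtype == np.dtype(np.int8)

# Chuck's suggested compromise, sketched by hand: accumulate in
# higher precision, then cast the result to the requested precision.
s = i.sum(axis=-1, dtype=np.int64).astype(np.int8)
assert s.dtype == np.dtype(np.int8)
assert (s == 4).all()
```

This is only a user-level sketch of the idea; the loops Chuck describes would do the wide accumulation and narrowing cast inside a single C-level pass.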

<snip>

Chuck



Re: poor performance of sum with sub-machine-word integer types

Keith Goodman
In reply to this post by Zachary Pincus-2
On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <[hidden email]> wrote:

> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes smaller than the machine word.
>
> <snip>

One difference is that i.sum() changes the output dtype when the input's
integer dtype is smaller than the default integer dtype:

    >> i.dtype
       dtype('int32')
    >> i.sum(axis=-1).dtype
       dtype('int64') #  <-- dtype changed
    >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
       dtype('int32')

Here are my timings:

    >> i = numpy.ones((1024,1024,4), numpy.int32)
    >> timeit i.sum(axis=-1)
    1 loops, best of 3: 278 ms per loop
    >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
    100 loops, best of 3: 12.1 ms per loop
    >> import bottleneck as bn
    >> timeit bn.func.nansum_3d_int32_axis2(i)
    100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed
difference (is this what np.sum does internally?):

    >> timeit i.astype(numpy.int64)
    10 loops, best of 3: 29.2 ms per loop

No.

Initializing the output also adds some time:

    >> timeit np.empty((1024,1024,4), dtype=np.int32)
    100000 loops, best of 3: 2.67 us per loop
    >> timeit np.empty((1024,1024,4), dtype=np.int64)
    100000 loops, best of 3: 12.8 us per loop

Shuttling data between the input and output arrays also costs more
"memory" time with int64 arrays than with int32.
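Keith's dtype observation is easy to check directly — elementwise addition keeps the input dtype, while sum() may upcast, which also means a wider output buffer to allocate and fill (a small array stands in for the thread's (1024, 1024, 4) one):

```python
import numpy as np

# Small stand-in for the (1024, 1024, 4) int32 array in the timings.
i = np.ones((8, 8, 4), dtype=np.int32)

summed = i.sum(axis=-1)
explicit = i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]

# Elementwise addition never changes the dtype...
assert explicit.dtype == np.dtype(np.int32)

# ...while sum() may upcast to the default platform integer, so its
# output can be twice as wide -- more bytes to allocate and touch.
assert summed.dtype.itemsize >= explicit.dtype.itemsize

# The values themselves agree, of course.
assert (summed == explicit).all()
```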

Re: poor performance of sum with sub-machine-word integer types

Charles R Harris


On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman <[hidden email]> wrote:
On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <[hidden email]> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing it explicitly, but only with integer dtypes smaller than the machine word.
>
> <snip>

<snip>
> Does making an extra copy of the input explain all of the speed
> difference (is this what np.sum does internally?):
>
>    >> timeit i.astype(numpy.int64)
>    10 loops, best of 3: 29.2 ms per loop
>
> No.


I think you can see the overhead here:

In [14]: timeit einsum('ijk->ij', i, dtype=int32)
100 loops, best of 3: 17.6 ms per loop

In [15]: timeit einsum('ijk->ij', i, dtype=int64)
100 loops, best of 3: 18 ms per loop

In [16]: timeit einsum('ijk->ij', i, dtype=int16)
100 loops, best of 3: 18.3 ms per loop

In [17]: timeit einsum('ijk->ij', i, dtype=int8)
100 loops, best of 3: 5.87 ms per loop
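The dtype-preserving behaviour behind these einsum timings can be verified directly. Note that the int8/int16 cases above rely on a narrowing cast; newer NumPy releases may reject that under the default casting='safe' rule, so this sketch sticks to the safe upcast:

```python
import numpy as np

# Small stand-in for the (1024, 1024, 4) int32 array in the timings.
i = np.ones((8, 8, 4), dtype=np.int32)

# Unlike sum(), einsum keeps the input dtype by default...
assert np.einsum('ijk->ij', i).dtype == np.dtype(np.int32)

# ...and an explicit dtype= selects the accumulator/output type.
# int32 -> int64 is a safe cast, so it works under default casting.
out = np.einsum('ijk->ij', i, dtype=np.int64)
assert out.dtype == np.dtype(np.int64)
assert (out == 4).all()
```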
 
<snip>

Chuck


Re: poor performance of sum with sub-machine-word integer types

Zachary Pincus-2
In reply to this post by Charles R Harris
On Jun 21, 2011, at 1:16 PM, Charles R Harris wrote:

> It's because of the type conversion sum uses by default for greater precision.

Aah, makes sense. Thanks for the detailed explanations and timings!