# odd performance of sum?

23 messages
## odd performance of sum?

Hi,

Observing the following performance:

```
In []: m = 1e5
In []: n = 1e2
In []: o = ones(n)
In []: M = randn(m, n)
In []: timeit M.sum(1)
10 loops, best of 3: 38.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.1 ms per loop

In []: m = 1e2
In []: n = 1e5
In []: o = ones(n)
In []: M = randn(m, n)
In []: timeit M.sum(1)
100 loops, best of 3: 18.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.2 ms per loop
```

One would expect sum to outperform dot by a clear margin. Do any 'tricks' exist to increase the performance of sum?

```
In []: sys.version
Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]'  # installed binaries from http://python.org/
In []: np.version.version
Out[]: '1.5.1'  # installed binaries from http://scipy.org/
```

Regards,
eat

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

## Re: odd performance of sum?

On Thu, Feb 10, 2011 at 8:29 AM, eat wrote:
[clip]
> One would expect sum to outperform dot with a clear marginal. Does
> there exixts any 'tricks' to increase the performance of sum?

I'm not surprised, much depends on the version of ATLAS or MKL you are linked to. If you aren't linked to either and are just using numpy's version then the results are a bit strange. With numpy development I get

```
In [1]: m = 1e5
In [2]: n = 1e2
In [3]: o = ones(n)
In [4]: M = randn(m, n)
In [5]: timeit M.sum(1)
100 loops, best of 3: 19.2 ms per loop
In [6]: timeit dot(M, o)
100 loops, best of 3: 15 ms per loop

In [7]: m = 1e2
In [8]: n = 1e5
In [9]: o = ones(n)
In [10]: M = randn(m, n)
In [11]: timeit M.sum(1)
100 loops, best of 3: 17.4 ms per loop
In [12]: timeit dot(M, o)
100 loops, best of 3: 14.2 ms per loop
```

Chuck

## Re: odd performance of sum?

Thanks Chuck,

for replying. But don't you still find it very odd that dot outperforms sum on your machine? To put it simply: why can't sum outperform dot? Whatever architecture (computer, cache) you have, it doesn't make any sense at all that performing significantly fewer instructions should end up taking more time ;-).

Regards,
eat

On Thu, Feb 10, 2011 at 7:10 PM, Charles R Harris wrote:
[clip]
> I'm not surprised, much depends on the version of ATLAS or MKL you are
> linked to. If you aren't linked to either and just using numpy's version
> then the results are a bit strange.

## Re: odd performance of sum?

On Thu, Feb 10, 2011 at 11:53, eat <[hidden email]> wrote:
> Thanks Chuck,
>
> for replying. But don't you still feel very odd that dot outperforms sum in
> your machine? Just to get it simply; why sum can't outperform dot? Whatever
> architecture (computer, cache) you have, it don't make any sense at all that
> when performing significantly less instructions, you'll reach to spend more
> time ;-).

These days, the determining factor is less often instruction count than memory latency, and the optimized BLAS implementations of dot() heavily optimize the memory access patterns.

Additionally, the number of instructions in your dot() probably isn't that many more than the sum(). The sum() is pretty dumb and just does a linear accumulation using the ufunc reduce mechanism, so (m*n-1) ADDs plus quite a few instructions for traversing the array in a generic manner. With fused multiply-adds, being able to assume contiguous data and ignore the numpy iterator overhead, and applying divide-and-conquer kernels to arrange sums, the optimized dot() implementations could have a comparable instruction count.

If you were willing to spend that amount of developer time and code complexity to make platform-specific backends to sum(), you could make it go really fast, too. Typically, it's not all that important to make it worthwhile, though. One thing that might be worthwhile is to make implementations of sum() and cumsum() that avoid the ufunc machinery and do their iterations more quickly, at least for some common combinations of dtype and contiguity.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
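Robert's contrast between sum()'s linear accumulation and the divide-and-conquer arrangement used by optimized kernels can be sketched in a few lines of plain Python. This is a toy illustration of the idea only, not NumPy's actual implementation:

```python
# Two ways to arrange the same additions: a linear left-to-right reduction
# (what the ufunc reduce mechanism does) versus a divide-and-conquer
# (pairwise) arrangement, whose independent partial sums an optimized
# kernel can exploit for instruction-level parallelism.

def linear_sum(xs):
    acc = 0.0
    for x in xs:          # (n - 1) dependent ADDs, one long chain
        acc += x
    return acc

def pairwise_sum(xs):
    n = len(xs)
    if n <= 8:            # small blocks: fall back to the linear loop
        return linear_sum(xs)
    mid = n // 2          # split; the two halves are independent
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

data = [0.1] * 1000
print(linear_sum(data), pairwise_sum(data))
```

A side benefit of the same restructuring is accuracy: the pairwise arrangement grows floating-point rounding error roughly like O(log n) instead of O(n).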

## Re: odd performance of sum?

Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
[clip]
> One thing that might be worthwhile is to make implementations of sum()
> and cumsum() that avoid the ufunc machinery and do their iterations more
> quickly, at least for some common combinations of dtype and contiguity.

I wonder what the balance is between the iterator overhead and the time taken in the reduction inner loop. This should be straightforward to benchmark.

Apparently, some overhead decreased with the new iterators, since current Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:

```
In [8]: %timeit M.sum(1)     # Numpy 1.5.1
10 loops, best of 3: 85 ms per loop

In [8]: %timeit M.sum(1)     # Numpy master
10 loops, best of 3: 49.5 ms per loop
```

I don't think this is explainable by the new memory layout optimizations, since M is C-contiguous. Perhaps there would be room for more optimization, even within the ufunc framework?

--
Pauli Virtanen
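The "straightforward to benchmark" measurement can be sketched with the stock timeit module. The shapes below are illustrative assumptions, and the absolute timings will differ per machine and NumPy version; only the ratio is of interest:

```python
# Probe iterator overhead vs. inner-loop work: both arrays hold the same
# number of elements, but the narrow one forces many more outer-iterator
# steps, each reducing a much shorter row.
import timeit
import numpy as np

wide = np.random.randn(100, 10_000)    # few rows, long inner loops
narrow = np.random.randn(10_000, 100)  # many rows, short inner loops

t_wide = min(timeit.repeat(lambda: wide.sum(1), number=20, repeat=3))
t_narrow = min(timeit.repeat(lambda: narrow.sum(1), number=20, repeat=3))
print(f"wide rows:   {t_wide:.4f} s")
print(f"narrow rows: {t_narrow:.4f} s")
```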

## Re: odd performance of sum?

In reply to this post by Pauli Virtanen-3

Hi Pauli,

On Thu, Feb 10, 2011 at 8:31 PM, Pauli Virtanen wrote:
[clip]
> Perhaps there would be room for more optimization, even within the ufunc
> framework?

I hope so. Please suggest if there's anything that I can do to further advance this. (My C skills are already a bit rusty, but at any higher level I'll try my best to contribute.)

Thanks,
eat

## Re: odd performance of sum?

On Thu, 10 Feb 2011 22:38:52 +0200, eat wrote:
[clip]
> I hope so. Please suggest if there's anything that I can do to further
> advance this.

If someone wants to try to improve the situation, here's a possible plan of attack:

1. Check first if the bottleneck is in the inner reduction loop (function DOUBLE_add in loops.c.src:712) or in the outer iteration (function PyUFunc_ReductionOp in ufunc_object.c:2781).

2. If it's in the inner loop, some optimizations are possible, e.g. specialized cases for sizeof(item) strides. Think how to add them cleanly.

3. If it's in the outer iteration, try to think how to make it faster. This will be a more messy problem to solve.

--
Pauli Virtanen

## Re: odd performance of sum?

Maybe I'm missing something, but why not just implement sum() using dot() and ones()?

--Josh

On Thu, Feb 10, 2011 at 11:49 AM, Pauli Virtanen <[hidden email]> wrote:
[clip]
> If someone wants to try to improve the situation, here's a possible plan
> of attack:
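For the common contiguous-float case, Josh's suggestion is exactly the dot(M, o) from the timings above; a minimal sketch:

```python
# Row sums written as a matrix-vector product with a ones vector, so the
# reduction runs through the BLAS-backed dot() instead of the ufunc path.
import numpy as np

M = np.random.randn(1000, 50)
o = np.ones(M.shape[1])

row_sums = M.dot(o)                  # same values as M.sum(axis=1)
assert np.allclose(row_sums, M.sum(axis=1))
```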

## Re: odd performance of sum?

In reply to this post by Joshua Holbrook

On Thu, Feb 10, 2011 at 14:51, Joshua Holbrook <[hidden email]> wrote:
> Maybe I'm missing something, but why not just implement sum() using
> dot() and ones() ?

You can't do everything that sum() does with just dot() and ones().

--
Robert Kern
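A couple of concrete cases behind Robert's point; these are illustrative examples, not an exhaustive list:

```python
# Where dot(..., ones) falls short of sum(): dtype handling, and
# reductions over arbitrary axes of N-d arrays.
import numpy as np

a = np.arange(6).reshape(2, 3)       # integer array

s = a.sum(axis=1)                    # stays integer: [3, 12]
d = a.dot(np.ones(3))                # promoted to float64
assert s.dtype.kind == 'i' and d.dtype.kind == 'f'

b = np.ones((2, 3, 4))
# sum() reduces any axis of an N-d array directly...
assert b.sum(axis=1).shape == (2, 4)
# ...while dot() with a ones vector only contracts the last axis.
assert b.dot(np.ones(4)).shape == (2, 3)
```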

## Re: odd performance of sum?

In reply to this post by Pauli Virtanen-3

On Thu, Feb 10, 2011 at 10:31 AM, Pauli Virtanen wrote:
[clip]
> Perhaps there would be room for more optimization, even within the ufunc
> framework?

I played around with this in einsum, where it's a bit easier to specialize this case than in the ufunc machinery. What I found made the biggest difference is to use SSE prefetching instructions to prepare the cache in advance. Here are the kind of numbers I get, all from the current Numpy master:

```
In [7]: timeit M.sum(1)
10 loops, best of 3: 44.6 ms per loop
In [8]: timeit dot(M, o)
10 loops, best of 3: 36.8 ms per loop
In [9]: timeit einsum('ij->i', M)
10 loops, best of 3: 32.1 ms per loop
...
In [14]: timeit M.sum(1)
10 loops, best of 3: 41.5 ms per loop
In [15]: timeit dot(M, o)
10 loops, best of 3: 42.1 ms per loop
In [16]: timeit einsum('ij->i', M)
10 loops, best of 3: 30 ms per loop
```

-Mark
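For reference, the einsum call in Mark's timings computes the same row sums as the other two expressions; a quick check (the shapes here are arbitrary):

```python
# 'ij->i' keeps index i and sums away index j, i.e. a row-sum reduction.
import numpy as np

M = np.random.randn(200, 300)
o = np.ones(300)

r_sum = M.sum(axis=1)
r_dot = M.dot(o)
r_ein = np.einsum('ij->i', M)
assert np.allclose(r_sum, r_dot) and np.allclose(r_sum, r_ein)
```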

## Re: odd performance of sum?

In reply to this post by Mark Wiebe

On Thu, Feb 10, 2011 at 2:26 PM, Mark Wiebe wrote:
[clip]
> I played around with this in einsum, where it's a bit easier to
> specialize this case than in the ufunc machinery. [...] Here are the
> kind of numbers I get, all from the current Numpy master:

I get an even bigger speedup:

```
In [5]: timeit M.sum(1)
10 loops, best of 3: 19.2 ms per loop
In [6]: timeit dot(M, o)
100 loops, best of 3: 15.2 ms per loop
In [7]: timeit einsum('ij->i', M)
100 loops, best of 3: 11.4 ms per loop
```

Chuck

## Re: odd performance of sum?

In reply to this post by Pauli Virtanen-3

Thu, 10 Feb 2011 20:49:28 +0000, Pauli Virtanen wrote:
[clip]
> 1. Check first if the bottleneck is in the inner reduction loop
> (function DOUBLE_add in loops.c.src:712) or in the outer iteration
> (function PyUFunc_ReductionOp in ufunc_object.c:2781).
> 2. If it's in the inner loop, some optimizations are possible, e.g.
> specialized cases for sizeof(item) strides. Think how to add them
> cleanly.

A quick check (just replace the inner loop with a no-op) shows that for 100 items, the bottleneck is in the inner loop. The cross-over between inner-loop time and strided-iterator overhead apparently occurs around ~20-30 items (on the machine I used for testing).

Anyway, spending time optimizing the inner loop for a 30% speed gain (max) seems questionable...

--
Pauli Virtanen

## Re: odd performance of sum?

In reply to this post by eat-3

On 10.02.2011 16:29, eat wrote:
> One would expect sum to outperform dot with a clear marginal. Does
> there exixts any 'tricks' to increase the performance of sum?

I see that others have answered already. The ufunc np.sum is not going to beat np.dot. You are racing the heavy machinery of NumPy (array iterators, type checks, bound checks, etc.) against the level-3 BLAS routine DGEMM, the most heavily optimized numerical kernel ever written. Also beware that computation is much cheaper than memory access. Although DGEMM does more arithmetic, and is even O(N^3) in that respect, it is always faster except for very sparse arrays.

If you need fast loops, you can always write your own Fortran or C, and even insert OpenMP pragmas. But don't expect that to beat optimized high-level BLAS kernels by any margin. The first chapters of "Numerical Methods in Fortran 90" might be worth reading. It deals with several of these issues, including dimensional expansion, which is important for writing fast numerical code -- but not intuitively obvious.

"I expect this to be faster because it does less work" is a fundamental misconception in numerical computing. Whatever causes less traffic on the memory bus (the real bottleneck) will almost always be faster, regardless of the amount of work done by the CPU. Good advice is to use high-level BLAS whenever you can. The only exception, as mentioned, is when matrices get very sparse.

Sturla
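Sturla's point that memory traffic, not instruction count, is the bottleneck can be probed directly: summing every second element halves the additions but touches the same cache lines, so it is typically nowhere near twice as fast. A sketch, with machine-dependent timings:

```python
import timeit
import numpy as np

x = np.random.randn(10_000_000)

# Full sum vs. a strided sum over half the elements. The strided read
# still pulls the same cache lines from memory, so halving the CPU's
# additions usually buys little.
t_all = min(timeit.repeat(lambda: x.sum(), number=10, repeat=3))
t_half = min(timeit.repeat(lambda: x[::2].sum(), number=10, repeat=3))
print(f"all elements:      {t_all:.4f} s")
print(f"every 2nd element: {t_half:.4f} s")
```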