

Hi,
Observing the following performance:
In []: m= 1e5
In []: n= 1e2
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
10 loops, best of 3: 38.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.1 ms per loop
In []: m= 1e2
In []: n= 1e5
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
100 loops, best of 3: 18.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.2 ms per loop
One would expect sum to outperform dot by a clear margin. Do there exist any 'tricks' to increase the performance of sum?
In []: sys.version
Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]'
In []: np.version.version
Out[]: '1.5.1'
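For anyone who wants to reproduce the comparison, here is a self-contained version of the benchmark using the stdlib timeit module (a sketch only; the absolute numbers depend heavily on the machine and on which BLAS numpy is linked against):

```python
import timeit
import numpy as np

m, n = 100_000, 100          # same shape as the first case above
M = np.random.randn(m, n)
o = np.ones(n)

# Both expressions compute the per-row sums of M.
t_sum = min(timeit.repeat(lambda: M.sum(axis=1), number=10, repeat=3)) / 10
t_dot = min(timeit.repeat(lambda: np.dot(M, o), number=10, repeat=3)) / 10

print("M.sum(1):  %.1f ms per loop" % (1e3 * t_sum))
print("dot(M, o): %.1f ms per loop" % (1e3 * t_dot))

# Sanity check: identical results up to floating-point rounding.
assert np.allclose(M.sum(axis=1), np.dot(M, o))
```

On a BLAS-linked build, dot() often wins, which is exactly the observation reported here.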
Regards,
eat
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion


On Thu, Feb 10, 2011 at 8:29 AM, eat <[hidden email]> wrote:
Hi,
Observing the following performance:
In []: m= 1e5
In []: n= 1e2
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
10 loops, best of 3: 38.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.1 ms per loop
In []: m= 1e2
In []: n= 1e5
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
100 loops, best of 3: 18.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.2 ms per loop
One would expect sum to outperform dot by a clear margin. Do there exist any 'tricks' to increase the performance of sum?
I'm not surprised; much depends on the version of ATLAS or MKL you are linked to. If you aren't linked to either and are just using numpy's own version, then the results are a bit strange. With numpy development I get:
In [1]: m= 1e5
In [2]: n= 1e2
In [3]: o= ones(n)
In [4]: M= randn(m, n)
In [5]: timeit M.sum(1)
100 loops, best of 3: 19.2 ms per loop
In [6]: timeit dot(M, o)
100 loops, best of 3: 15 ms per loop
In [7]: m= 1e2
In [8]: n= 1e5
In [9]: o= ones(n)
In [10]: M= randn(m, n)
In [11]: timeit M.sum(1)
100 loops, best of 3: 17.4 ms per loop
In [12]: timeit dot(M, o)
100 loops, best of 3: 14.2 ms per loop
Chuck


Thanks Chuck,
for replying. But don't you still find it very odd that dot outperforms sum on your machine? To put it simply: why can't sum outperform dot? Whatever architecture (computer, cache) you have, it doesn't make any sense at all that performing significantly fewer instructions ends up taking more time ;).
Regards,
eat
On Thu, Feb 10, 2011 at 7:10 PM, Charles R Harris <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 8:29 AM, eat <[hidden email]> wrote:
Hi,
Observing the following performance:
In []: m= 1e5
In []: n= 1e2
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
10 loops, best of 3: 38.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.1 ms per loop
In []: m= 1e2
In []: n= 1e5
In []: o= ones(n)
In []: M= randn(m, n)
In []: timeit M.sum(1)
100 loops, best of 3: 18.3 ms per loop
In []: timeit dot(M, o)
10 loops, best of 3: 21.2 ms per loop
One would expect sum to outperform dot by a clear margin. Do there exist any 'tricks' to increase the performance of sum?
I'm not surprised; much depends on the version of ATLAS or MKL you are linked to. If you aren't linked to either and are just using numpy's own version, then the results are a bit strange. With numpy development I get:
In [1]: m= 1e5
In [2]: n= 1e2
In [3]: o= ones(n)
In [4]: M= randn(m, n)
In [5]: timeit M.sum(1)
100 loops, best of 3: 19.2 ms per loop
In [6]: timeit dot(M, o)
100 loops, best of 3: 15 ms per loop
In [7]: m= 1e2
In [8]: n= 1e5
In [9]: o= ones(n)
In [10]: M= randn(m, n)
In [11]: timeit M.sum(1)
100 loops, best of 3: 17.4 ms per loop
In [12]: timeit dot(M, o)
100 loops, best of 3: 14.2 ms per loop
Chuck


On Thu, Feb 10, 2011 at 11:53, eat < [hidden email]> wrote:
> Thanks Chuck,
>
> for replying. But don't you still feel very odd that dot outperforms sum in
> your machine? Just to get it simply; why sum can't outperform dot? Whatever
> architecture (computer, cache) you have, it don't make any sense at all that
> when performing significantly less instructions, you'll reach to spend more
> time ;).
These days, the determining factor is less often instruction count
than memory latency, and the optimized BLAS implementations of dot()
heavily optimize the memory access patterns. Additionally, the number
of instructions in your dot() probably isn't that many more than the
sum(). The sum() is pretty dumb and just does a linear accumulation
using the ufunc reduce mechanism, so (m*n-1) ADDs plus quite a few
instructions for traversing the array in a generic manner. With fused
multiply-adds, being able to assume contiguous data and ignore the
numpy iterator overhead, and applying divide-and-conquer kernels to
arrange sums, the optimized dot() implementations could have a
comparable instruction count.
If you were willing to spend that amount of developer time and code
complexity to make platform-specific backends to sum(), you could make
it go really fast, too. Typically, it's not all that important to make
it worthwhile, though. One thing that might be worthwhile is to make
implementations of sum() and cumsum() that avoid the ufunc machinery
and do their iterations more quickly, at least for some common
combinations of dtype and contiguity.
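The divide-and-conquer arrangement of sums mentioned above can be sketched in a few lines of pure Python (a toy illustration with a made-up block size, not how an optimized C kernel would actually be written):

```python
import numpy as np

def linear_sum(a):
    """The 'dumb' reduction: len(a) - 1 dependent ADDs in a row."""
    s = 0.0
    for x in a:
        s += x
    return s

def pairwise_sum(a, block=64):
    """Divide-and-conquer reduction: split in half, sum each half,
    add the two partial results.  Same number of ADDs overall, but
    the additions form a tree, which shortens the dependency chain
    (letting hardware overlap independent adds) and also reduces
    accumulated rounding error."""
    if len(a) <= block:
        return linear_sum(a)
    mid = len(a) // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

a = np.random.randn(10_000)
assert np.isclose(pairwise_sum(a), a.sum())
assert np.isclose(linear_sum(a), a.sum())
```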

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
[clip]
> One thing that might be worthwhile is to make
> implementations of sum() and cumsum() that avoid the ufunc machinery and
> do their iterations more quickly, at least for some common combinations
> of dtype and contiguity.
I wonder what is the balance between the iterator overhead and the time
taken in the reduction inner loop. This should be straightforward to
benchmark.
Apparently, some overhead decreased with the new iterators, since current
Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
In [8]: %timeit M.sum(1) # Numpy 1.5.1
10 loops, best of 3: 85 ms per loop
In [8]: %timeit M.sum(1) # Numpy master
10 loops, best of 3: 49.5 ms per loop
I don't think this is explainable by the new memory layout optimizations,
since M is C-contiguous.
Perhaps there would be room for more optimization, even within the ufunc
framework?

Pauli Virtanen


Hi Robert,
On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 11:53, eat <[hidden email]> wrote:
> Thanks Chuck,
>
> for replying. But don't you still feel very odd that dot outperforms sum in
> your machine? Just to get it simply; why sum can't outperform dot? Whatever
> architecture (computer, cache) you have, it don't make any sense at all that
> when performing significantly less instructions, you'll reach to spend more
> time ;).
These days, the determining factor is less often instruction count
than memory latency, and the optimized BLAS implementations of dot()
heavily optimize the memory access patterns.
Can't we have this as well with simple sum?
Additionally, the number of instructions in your dot() probably isn't that many more than the sum(). The sum() is pretty dumb
But does it need to be?
and just does a linear accumulation using the ufunc reduce mechanism, so
(m*n-1) ADDs plus quite a few instructions for traversing the array in a
generic manner. With fused multiply-adds, being able to assume contiguous
data and ignore the numpy iterator overhead, and applying divide-and-conquer
kernels to arrange sums, the optimized dot() implementations could have a
comparable instruction count.
Couldn't sum benefit with similar logic?
If you were willing to spend that amount of developer time and code complexity to make platform-specific backends to sum()
Actually I would, but I'm not at all competent at that level of detail (:. But I'm willing to spend more of my own time, for example on testing, debugging, and analysing whatever improvements and suggestions emerge.
, you could make it go really fast, too. Typically, it's not all that important to make it worthwhile, though. One thing that might be worthwhile is to make
implementations of sum() and cumsum() that avoid the ufunc machinery and do their iterations more quickly, at least for some common combinations of dtype and contiguity.
Well, I'm already perplexed before even reaching that 'ufunc machinery'; it's not at all trivial (for us mere mortals ;) to figure out what's happening with sum in fromnumeric.py!
Regards,
eat
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


Hi Pauli,
On Thu, Feb 10, 2011 at 8:31 PM, Pauli Virtanen <[hidden email]> wrote:
Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
[clip]
> One thing that might be worthwhile is to make
> implementations of sum() and cumsum() that avoid the ufunc machinery and
> do their iterations more quickly, at least for some common combinations
> of dtype and contiguity.
I wonder what is the balance between the iterator overhead and the time taken in the reduction inner loop. This should be straightforward to benchmark.
Apparently, some overhead decreased with the new iterators, since current
Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
In [8]: %timeit M.sum(1)  # Numpy 1.5.1
10 loops, best of 3: 85 ms per loop
In [8]: %timeit M.sum(1)  # Numpy master
10 loops, best of 3: 49.5 ms per loop
I don't think this is explainable by the new memory layout optimizations, since M is C-contiguous.
Perhaps there would be room for more optimization, even within the ufunc framework?
I hope so. Please suggest if there's anything that I can do to further advance this. (My C skills are already a bit rusty, but at any higher level I'll try my best to contribute).
Thanks,
eat
 Pauli Virtanen


On Thu, 10 Feb 2011 22:38:52 +0200, eat wrote:
[clip]
> I hope so. Please suggest if there's anything that I can do to further
> advance this. (My C skills are already a bit rusty, but at any higher
> level I'll try my best to contribute).
If someone wants to try to improve the situation, here's a possible plan
of attack:
1. Check first if the bottleneck is in the inner reduction loop
(function DOUBLE_add in loops.c.src:712) or in the outer iteration
(function PyUFunc_ReductionOp in ufunc_object.c:2781).
2. If it's in the inner loop, some optimizations are possible, e.g.
specialized cases for sizeof(item) strides. Think how to add them cleanly.
3. If it's in the outer iteration, try to think how to make it faster.
This will be a more messy problem to solve.
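As a rough black-box version of step 1, one can hold the total element count fixed and shift work between the inner reduction loop (a few long rows) and the outer iteration (many short rows); if the many-short-rows case is much slower per element, the outer iteration overhead dominates. A sketch (shapes chosen arbitrarily for illustration):

```python
import timeit
import numpy as np

def row_sum_time(m, n, number=20):
    """Best-of-3 time for M.sum(1) on an m-by-n array."""
    M = np.random.randn(m, n)
    return min(timeit.repeat(lambda: M.sum(axis=1), number=number, repeat=3))

# Both cases sum exactly 10**6 elements.
t_long_rows = row_sum_time(100, 10_000)    # work mostly in the inner loop
t_short_rows = row_sum_time(10_000, 100)   # outer iteration runs 100x as often

print("100 x 10000 (inner-loop bound): %.4f s" % t_long_rows)
print("10000 x 100 (outer-loop bound): %.4f s" % t_short_rows)
```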

Pauli Virtanen


Maybe I'm missing something, but why not just implement sum() using
dot() and ones() ?
Josh
On Thu, Feb 10, 2011 at 11:49 AM, Pauli Virtanen < [hidden email]> wrote:
> On Thu, 10 Feb 2011 22:38:52 +0200, eat wrote:
> [clip]
>> I hope so. Please suggest if there's anything that I can do to further
>> advance this. (My C skills are already a bit rusty, but at any higher
>> level I'll try my best to contribute).
>
> If someone wants to try to improve the situation, here's a possible plan
> of attack:
>
> 1. Check first if the bottleneck is in the inner reduction loop
> (function DOUBLE_add in loops.c.src:712) or in the outer iteration
> (function PyUFunc_ReductionOp in ufunc_object.c:2781).
>
> 2. If it's in the inner loop, some optimizations are possible, e.g.
> specialized cases for sizeof(item) strides. Think how to add them cleanly.
>
> 3. If it's in the outer iteration, try to think how to make it faster.
> This will be a more messy problem to solve.
>
> 
> Pauli Virtanen
>


On Thu, Feb 10, 2011 at 14:29, eat < [hidden email]> wrote:
> Hi Robert,
>
> On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern < [hidden email]> wrote:
>>
>> On Thu, Feb 10, 2011 at 11:53, eat < [hidden email]> wrote:
>> > Thanks Chuck,
>> >
>> > for replying. But don't you still feel very odd that dot outperforms sum
>> > in
>> > your machine? Just to get it simply; why sum can't outperform dot?
>> > Whatever
>> > architecture (computer, cache) you have, it don't make any sense at all
>> > that
>> > when performing significantly less instructions, you'll reach to spend
>> > more
>> > time ;).
>>
>> These days, the determining factor is less often instruction count
>> than memory latency, and the optimized BLAS implementations of dot()
>> heavily optimize the memory access patterns.
>
> Can't we have this as well with simple sum?
It's technically feasible to accomplish, but as I mention later, it
entails quite a large cost. Those optimized BLASes represent many
man-years of effort and cause substantial headaches for people
building and installing numpy. However, they are frequently worth it
because those operations are often bottlenecks in whole applications.
sum(), even in its stupidest implementation, rarely is. In the places
where it is a significant bottleneck, an ad hoc implementation in C or
Cython or even FORTRAN for just that application is pretty easy to
write. You can gain speed by specializing to just your use case, e.g.
contiguous data, summing down to one number, or summing along one axis
of only 2D data, etc. There's usually no reason to try to generalize
that implementation to put it back into numpy.
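Staying in Python for a moment, such a specialization can even be sketched without any C or Cython, by special-casing the common layout and delegating to the BLAS-backed dot() (rowsum here is a hypothetical helper for illustration, not a numpy API):

```python
import numpy as np

def rowsum(M):
    """Row sums, specialized for the common case.

    For C-contiguous 2-D float64 arrays, delegate to the BLAS-backed
    dot(), which optimizes the memory access pattern; anything else
    falls back to the fully general ufunc-based sum().
    """
    if M.ndim == 2 and M.dtype == np.float64 and M.flags['C_CONTIGUOUS']:
        return np.dot(M, np.ones(M.shape[1]))
    return M.sum(axis=1)

M = np.random.randn(300, 40)
assert np.allclose(rowsum(M), M.sum(axis=1))
```

The point of the paragraph above stands, though: this shortcut only covers one use case and is not a general replacement for sum().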
>> Additionally, the number
>> of instructions in your dot() probably isn't that many more than the
>> sum(). The sum() is pretty dumb
>
> But does it need to be?
As I also allude to later in my email, no, but there are still costs involved.
>> and just does a linear accumulation
>> using the ufunc reduce mechanism, so (m*n-1) ADDs plus quite a few
>> instructions for traversing the array in a generic manner. With fused
>> multiply-adds, being able to assume contiguous data and ignore the
>> numpy iterator overhead, and applying divide-and-conquer kernels to
>> arrange sums, the optimized dot() implementations could have a
>> comparable instruction count.
>
> Couldn't sum benefit with similar logic?
Etc. I'm not going to keep repeating myself.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


On Thu, Feb 10, 2011 at 14:51, Joshua Holbrook < [hidden email]> wrote:
> Maybe I'm missing something, but why not just implement sum() using
> dot() and ones() ?
You can't do everything that sum() does with just dot() and ones().

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco


On Thu, Feb 10, 2011 at 10:31 AM, Pauli Virtanen <[hidden email]> wrote:
Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
[clip]
> One thing that might be worthwhile is to make
> implementations of sum() and cumsum() that avoid the ufunc machinery and
> do their iterations more quickly, at least for some common combinations
> of dtype and contiguity.
I wonder what is the balance between the iterator overhead and the time
taken in the reduction inner loop. This should be straightforward to
benchmark.
Apparently, some overhead decreased with the new iterators, since current
Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
In [8]: %timeit M.sum(1) # Numpy 1.5.1
10 loops, best of 3: 85 ms per loop
In [8]: %timeit M.sum(1) # Numpy master
10 loops, best of 3: 49.5 ms per loop
I don't think this is explainable by the new memory layout optimizations,
since M is C-contiguous.
Perhaps there would be room for more optimization, even within the ufunc
framework?
I played around with this in einsum, where it's a bit easier to specialize this case than in the ufunc machinery. What I found made the biggest difference is to use SSE prefetching instructions to prepare the cache in advance. Here are the kind of numbers I get, all from the current Numpy master:
In [7]: timeit M.sum(1)
10 loops, best of 3: 44.6 ms per loop
In [8]: timeit dot(M, o)
10 loops, best of 3: 36.8 ms per loop
In [9]: timeit einsum('ij->i', M)
10 loops, best of 3: 32.1 ms per loop
...
In [14]: timeit M.sum(1)
10 loops, best of 3: 41.5 ms per loop
In [15]: timeit dot(M, o)
10 loops, best of 3: 42.1 ms per loop
In [16]: timeit einsum('ij->i', M)
10 loops, best of 3: 30 ms per loop
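For reference, the einsum call used here computes exactly the same row reduction as M.sum(1): the subscripts 'ij->i' keep the i axis and sum away j. A quick sanity check:

```python
import numpy as np

M = np.random.randn(200, 50)

# 'ij->i' sums out the j axis, matching M.sum(1);
# 'ij->j' likewise matches the column sums M.sum(0).
assert np.allclose(np.einsum('ij->i', M), M.sum(axis=1))
assert np.allclose(np.einsum('ij->j', M), M.sum(axis=0))
```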
Mark


Hi Robert,
On Thu, Feb 10, 2011 at 10:58 PM, Robert Kern <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 14:29, eat <[hidden email]> wrote:
> Hi Robert,
>
> On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern <[hidden email]> wrote:
>>
>> On Thu, Feb 10, 2011 at 11:53, eat <[hidden email]> wrote:
>> > Thanks Chuck,
>> >
>> > for replying. But don't you still feel very odd that dot outperforms sum
>> > in your machine? Just to get it simply; why sum can't outperform dot?
>> > Whatever architecture (computer, cache) you have, it don't make any sense
>> > at all that when performing significantly less instructions, you'll reach
>> > to spend more time ;).
>>
>> These days, the determining factor is less often instruction count
>> than memory latency, and the optimized BLAS implementations of dot()
>> heavily optimize the memory access patterns.
>
> Can't we have this as well with simple sum?
It's technically feasible to accomplish, but as I mention later, it
entails quite a large cost. Those optimized BLASes represent many
man-years of effort
Yes, I acknowledge this. But didn't they then ignore something simpler, like sum (which could actually benefit from exactly similar optimizations)?
and cause substantial headaches for people building and installing numpy.
I appreciate this. No doubt at all.
However, they are frequently worth it because those operations are often bottlenecks in whole applications.
sum(), even in its stupidest implementation, rarely is. In the places where it is a significant bottleneck, an ad hoc implementation in C or Cython or even FORTRAN for just that application is pretty easy to write.
But here I have to disagree; I think that at least I (if not the majority of numpy users) don't want to (nor am I capable of, nor do I have the time and resources to) dwell on such details. I'm sorry, but I have to restate that it's quite reasonable to expect sum to outperform dot in any case. Let's now start making the kind of changes that will enable sum to outperform dot.
You can gain speed by specializing to just your use case, e.g. contiguous data, summing down to one number, or summing along one axis
of only 2D data, etc. There's usually no reason to try to generalize that implementation to put it back into numpy.
Yes, I would really like to specialize for my case, but 'without leaving the Python realm'.
Thanks,
eat
>> Additionally, the number
>> of instructions in your dot() probably isn't that many more than the
>> sum(). The sum() is pretty dumb
>
> But does it need to be?
As I also allude to later in my email, no, but there are still costs involved.
>> and just does a linear accumulation
>> using the ufunc reduce mechanism, so (m*n-1) ADDs plus quite a few
>> instructions for traversing the array in a generic manner. With fused
>> multiply-adds, being able to assume contiguous data and ignore the
>> numpy iterator overhead, and applying divide-and-conquer kernels to
>> arrange sums, the optimized dot() implementations could have a
>> comparable instruction count.
>
> Couldn't sum benefit with similar logic?
Etc. I'm not going to keep repeating myself.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


On Thu, Feb 10, 2011 at 15:32, eat < [hidden email]> wrote:
> Hi Robert,
>
> On Thu, Feb 10, 2011 at 10:58 PM, Robert Kern < [hidden email]> wrote:
>>
>> On Thu, Feb 10, 2011 at 14:29, eat < [hidden email]> wrote:
>> > Hi Robert,
>> >
>> > On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern < [hidden email]>
>> > wrote:
>> >>
>> >> On Thu, Feb 10, 2011 at 11:53, eat < [hidden email]> wrote:
>> >> > Thanks Chuck,
>> >> >
>> >> > for replying. But don't you still feel very odd that dot outperforms
>> >> > sum
>> >> > in
>> >> > your machine? Just to get it simply; why sum can't outperform dot?
>> >> > Whatever
>> >> > architecture (computer, cache) you have, it don't make any sense at
>> >> > all
>> >> > that
>> >> > when performing significantly less instructions, you'll reach to
>> >> > spend
>> >> > more
>> >> > time ;).
>> >>
>> >> These days, the determining factor is less often instruction count
>> >> than memory latency, and the optimized BLAS implementations of dot()
>> >> heavily optimize the memory access patterns.
>> >
>> > Can't we have this as well with simple sum?
>>
>> It's technically feasible to accomplish, but as I mention later, it
>> entails quite a large cost. Those optimized BLASes represent many
>> man-years of effort
>
> Yes I acknowledge this. But didn't they then ignore them something simpler,
> like sum (but which actually could benefit exactly similar optimizations).
Let's set aside the fact that the people who optimized the
implementation of dot() (the authors of ATLAS or the MKL or whichever
optimized BLAS library you linked to) are different from those who
implemented sum() (the numpy devs). Let me repeat a reason why one
would put a lot of effort into optimizing dot() but not sum():
"""
>> However, they are frequently worth it
>> because those operations are often bottlenecks in whole applications.
>> sum(), even in its stupidest implementation, rarely is.
"""
I don't know if I'm just not communicating very clearly, or if you
just reply to individual statements before reading the whole email.
>> and cause substantial headaches for people
>> building and installing numpy.
>
> I appreciate this. No doubt at all.
>>
>> However, they are frequently worth it
>> because those operations are often bottlenecks in whole applications.
>> sum(), even in its stupidest implementation, rarely is. In the places
>> where it is a significant bottleneck, an ad hoc implementation in C or
>> Cython or even FORTRAN for just that application is pretty easy to
>> write.
>
> But here I have to disagree; I'll think that at least I (if not even the
> majority of numpy users) don't like (nor I'm be capable/ or have enough
> time/ resources) go to dwell such details.
And you think we have the time and resources to do it for you?
> I'm sorry but I'll have to
> restate that it's quite reasonable to expect that sum outperforms dot in any
> case.
You don't optimize a function just because you are capable of it. You
optimize a function because it is taking up a significant portion of
total runtime in your real application. Anything else is a waste of
time.
> Lets now to start make such movements, which enables sum to outperform
> dot.
Sorry, you don't get to volunteer anyone's time or set anyone's
priorities but your own. There are some sensible things one could do
to optimize sums or even general ufunc reductions, as outlined by
Mark, Pauli and myself, but please stop using the accelerated-BLAS
dot() as your benchmark. It's a completely inappropriate comparison.
>> You can gain speed by specializing to just your use case, e.g.
>> contiguous data, summing down to one number, or summing along one axis
>> of only 2D data, etc. There's usually no reason to try to generalize
>> that implementation to put it back into numpy.
>
> Yes, I would really like to specialize into my case, but 'without going out
> the python realm.'
The Bottleneck project is a good place for such things. It's a nice
middleground for somewhatspecialized routines that are still pretty
common but not general enough to be in numpy yet.
http://pypi.python.org/pypi/Bottleneck
If you're not willing to learn how to implement it yourself in Cython,
I'm afraid that you're stuck waiting for someone to do it for you. But
please don't expect anyone to feel obligated to do so.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


On Thu, Feb 10, 2011 at 2:26 PM, Mark Wiebe <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 10:31 AM, Pauli Virtanen <[hidden email]> wrote:
Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
[clip]
> One thing that might be worthwhile is to make
> implementations of sum() and cumsum() that avoid the ufunc machinery and
> do their iterations more quickly, at least for some common combinations
> of dtype and contiguity.
I wonder what is the balance between the iterator overhead and the time
taken in the reduction inner loop. This should be straightforward to
benchmark.
Apparently, some overhead decreased with the new iterators, since current
Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
In [8]: %timeit M.sum(1) # Numpy 1.5.1
10 loops, best of 3: 85 ms per loop
In [8]: %timeit M.sum(1) # Numpy master
10 loops, best of 3: 49.5 ms per loop
I don't think this is explainable by the new memory layout optimizations,
since M is C-contiguous.
Perhaps there would be room for more optimization, even within the ufunc
framework?
I played around with this in einsum, where it's a bit easier to specialize this case than in the ufunc machinery. What I found made the biggest difference is to use SSE prefetching instructions to prepare the cache in advance. Here are the kind of numbers I get, all from the current Numpy master:
In [7]: timeit M.sum(1)
10 loops, best of 3: 44.6 ms per loop
In [8]: timeit dot(M, o)
10 loops, best of 3: 36.8 ms per loop
In [9]: timeit einsum('ij->i', M)
10 loops, best of 3: 32.1 ms per loop
...
I get an even bigger speedup:
In [5]: timeit M.sum(1)
10 loops, best of 3: 19.2 ms per loop
In [6]: timeit dot(M, o)
100 loops, best of 3: 15.2 ms per loop
In [7]: timeit einsum('ij->i', M)
100 loops, best of 3: 11.4 ms per loop
<snip>
Chuck


On Thu, Feb 10, 2011 at 3:08 PM, Robert Kern <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 15:32, eat < [hidden email]> wrote:
> Hi Robert,
>
> On Thu, Feb 10, 2011 at 10:58 PM, Robert Kern < [hidden email]> wrote:
>>
>> On Thu, Feb 10, 2011 at 14:29, eat < [hidden email]> wrote:
>> > Hi Robert,
>> >
>> > On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern < [hidden email]>
>> > wrote:
>> >>
>> >> On Thu, Feb 10, 2011 at 11:53, eat < [hidden email]> wrote:
>> >> > Thanks Chuck,
>> >> >
>> >> > for replying. But don't you still feel very odd that dot outperforms
>> >> > sum
>> >> > in
>> >> > your machine? Just to get it simply; why sum can't outperform dot?
>> >> > Whatever
>> >> > architecture (computer, cache) you have, it don't make any sense at
>> >> > all
>> >> > that
>> >> > when performing significantly less instructions, you'll reach to
>> >> > spend
>> >> > more
>> >> > time ;).
>> >>
>> >> These days, the determining factor is less often instruction count
>> >> than memory latency, and the optimized BLAS implementations of dot()
>> >> heavily optimize the memory access patterns.
>> >
>> > Can't we have this as well with simple sum?
>>
>> It's technically feasible to accomplish, but as I mention later, it
>> entails quite a large cost. Those optimized BLASes represent many
>> man-years of effort
>
> Yes I acknowledge this. But didn't they then ignore them something simpler,
> like sum (but which actually could benefit exactly similar optimizations).
Let's set aside the fact that the people who optimized the
implementation of dot() (the authors of ATLAS or the MKL or whichever
optimized BLAS library you linked to) are different from those who
implemented sum() (the numpy devs). Let me repeat a reason why one
would put a lot of effort into optimizing dot() but not sum():
"""
>> However, they are frequently worth it
>> because those operations are often bottlenecks in whole applications.
>> sum(), even in its stupidest implementation, rarely is.
"""
I don't know if I'm just not communicating very clearly, or if you
just reply to individual statements before reading the whole email.
>> and cause substantial headaches for people
>> building and installing numpy.
>
> I appreciate this. No doubt at all.
>>
>> However, they are frequently worth it
>> because those operations are often bottlenecks in whole applications.
>> sum(), even in its stupidest implementation, rarely is. In the places
>> where it is a significant bottleneck, an ad hoc implementation in C or
>> Cython or even FORTRAN for just that application is pretty easy to
>> write.
>
> But here I have to disagree; I'll think that at least I (if not even the
> majority of numpy users) don't like (nor I'm be capable/ or have enough
> time/ resources) go to dwell such details.
And you think we have the time and resources to do it for you?
> I'm sorry but I'll have to
> restate that it's quite reasonable to expect that sum outperforms dot in any
> case.
You don't optimize a function just because you are capable of it. You
optimize a function because it is taking up a significant portion of
total runtime in your real application. Anything else is a waste of
time.
Heh. Reminds me of a passage in General Bradley's A Soldier's Story where he admonished one of his officers in North Africa for taking a hill and suffering casualties, telling him that one didn't take a hill because one could, but because doing so served a purpose in the larger campaign.
<snip>
Chuck
_______________________________________________
NumPyDiscussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpydiscussion


Hi,
On Fri, Feb 11, 2011 at 12:08 AM, Robert Kern <[hidden email]> wrote:
On Thu, Feb 10, 2011 at 15:32, eat <[hidden email]> wrote:
> Hi Robert,
>
> On Thu, Feb 10, 2011 at 10:58 PM, Robert Kern <[hidden email]> wrote:
>> On Thu, Feb 10, 2011 at 14:29, eat <[hidden email]> wrote:
>> > Hi Robert,
>> >
>> > On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern <[hidden email]>
>> > wrote:
>> >>
>> >> On Thu, Feb 10, 2011 at 11:53, eat <[hidden email]> wrote:
>> >> > Thanks Chuck,
>> >> >
>> >> > for replying. But don't you still feel it very odd that dot
>> >> > outperforms sum on your machine? Just to put it simply: why can't
>> >> > sum outperform dot? Whatever architecture (computer, cache) you
>> >> > have, it doesn't make any sense at all that when performing
>> >> > significantly fewer instructions, you end up spending more time ;).
>> >>
>> >> These days, the determining factor is less often instruction count
>> >> than memory latency, and the optimized BLAS implementations of dot()
>> >> heavily optimize the memory access patterns.
>> >
>> > Can't we have this as well with simple sum?
>>
>> It's technically feasible to accomplish, but as I mention later, it
>> entails quite a large cost. Those optimized BLASes represent many
>> man-years of effort
>
> Yes I acknowledge this. But didn't they then ignore something simpler,
> like sum (which actually could benefit from exactly similar optimizations)?
Let's set aside the fact that the people who optimized the
implementation of dot() (the authors of ATLAS or the MKL or whichever
optimized BLAS library you linked to) are different from those who
implemented sum() (the numpy devs). Let me repeat a reason why one
would put a lot of effort into optimizing dot() but not sum():
""" >> However, they are frequently worth it >> because those operations are often bottlenecks in whole applications. >> sum(), even in its stupidest implementation, rarely is.
"""
I don't know if I'm just not communicating very clearly, or if you just reply to individual statements before reading the whole email.
You are communicating very well. It's me who's not communicating so well.
>> and cause substantial headaches for people
>> building and installing numpy.
>
> I appreciate this. No doubt at all.
>
>> However, they are frequently worth it
>> because those operations are often bottlenecks in whole applications.
>> sum(), even in its stupidest implementation, rarely is. In the places
>> where it is a significant bottleneck, an ad hoc implementation in C or
>> Cython or even FORTRAN for just that application is pretty easy to
>> write.
>
> But here I have to disagree; I think that at least I (if not the
> majority of numpy users) don't want (nor am I capable, nor have enough
> time/resources) to dwell on such details.
And you think we have the time and resources to do it for you?
My intention was not to suggest anything like that.
> I'm sorry but I'll have to restate that it's quite reasonable to expect
> that sum outperforms dot in any case.
You don't optimize a function just because you are capable of it. You
optimize a function because it is taking up a significant portion of total runtime in your real application. Anything else is a waste of time.
> Let's now start making the kind of moves that would enable sum to
> outperform dot.
Sorry, you don't get to volunteer anyone's time or set anyone's
priorities but your own. There are some sensible things one could do
to optimize sums or even general ufunc reductions, as outlined by Mark,
Pauli, and myself, but please stop using the accelerated-BLAS dot() as
your benchmark. It's a completely inappropriate comparison.
OK. Let's compare sum to itself:
In []: M= randn(1e5, 1e2)
In []: timeit M.sum(0)
10 loops, best of 3: 169 ms per loop
In []: timeit M.sum(1)
10 loops, best of 3: 37.5 ms per loop
In []: timeit M.sum()
10 loops, best of 3: 18.1 ms per loop
In []: timeit sum(M.sum(0))
10 loops, best of 3: 169 ms per loop
In []: timeit sum(M.sum(1))
10 loops, best of 3: 37.7 ms per loop
In []: timeit M.T.sum(0)
10 loops, best of 3: 37.7 ms per loop
In []: timeit M.T.sum(1)
10 loops, best of 3: 171 ms per loop
In []: timeit M.T.sum()
1 loops, best of 3: 288 ms per loop
In []: timeit sum(M.T.sum(0))
10 loops, best of 3: 37.7 ms per loop
In []: timeit sum(M.T.sum(1))
10 loops, best of 3: 170 ms per loop
In []: 288./ 18.1
Out[]: 15.91160220994475
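[Editor's note: the asymmetry above, where axis-0 reductions on a C-ordered array run several times slower than axis-1 reductions, comes down to traversal order over memory. A minimal sketch reproducing the comparison with plain NumPy; sizes are reduced and exact timings will vary with machine and NumPy version:]

```python
import timeit
import numpy as np

# A C-contiguous (row-major) array: elements within a row are adjacent
# in memory, so reducing along axis 1 streams through memory
# sequentially, while reducing along axis 0 takes long strides.
M = np.random.randn(10_000, 100)

t_axis1 = timeit.timeit(lambda: M.sum(axis=1), number=50)
t_axis0 = timeit.timeit(lambda: M.sum(axis=0), number=50)
print(f"sum(axis=1): {t_axis1:.4f}s  sum(axis=0): {t_axis0:.4f}s")

# Transposing flips which axis is contiguous, which is why M.T.sum(1)
# above mirrors the timing of M.sum(0). The results themselves agree
# up to floating-point rounding; only the traversal order differs.
assert np.allclose(M.sum(axis=0), M.T.sum(axis=1))
assert np.allclose(M.sum(axis=1), M.T.sum(axis=0))
```
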
>> You can gain speed by specializing to just your use case, e.g.
>> contiguous data, summing down to one number, or summing along one axis
>> of only 2D data, etc. There's usually no reason to try to generalize
>> that implementation to put it back into numpy.
>
> Yes, I would really like to specialize to my case, but 'without going
> out of the python realm.'
The Bottleneck project is a good place for such things. It's a nice
middle ground for somewhat-specialized routines that are still pretty
common but not general enough to be in numpy yet.
http://pypi.python.org/pypi/Bottleneck
If you're not willing to learn how to implement it yourself in Cython, I'm afraid that you're stuck waiting for someone to do it for you. But please don't expect anyone to feel obligated to do so.
Perhaps my wording was poor, but I do not expect anyone to do it for me.
I also think that there is some real substance to the issue I addressed
(as demonstrated now by comparing sum to itself).
Regards,
eat

Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


Thu, 10 Feb 2011 20:49:28 +0000, Pauli Virtanen wrote:
[clip]
> 1. Check first if the bottleneck is in the inner reduction loop
> (function DOUBLE_add in loops.c.src:712) or in the outer iteration
> (function PyUFunc_ReductionOp in ufunc_object.c:2781).
> 2. If it's in the inner loop, some optimizations are possible, e.g.
> specialized cases for sizeof(item) strides. Think how to add them
> cleanly.
A quick check (just replace the inner loop with a no-op) shows that for
100 items, the bottleneck is in the inner loop. The crossover between
inner-loop time and strided-iterator overhead apparently occurs around
~20-30 items (on the machine I used for testing).
Anyway, spending time optimizing the inner loop for a speed gain of at
most 30% seems questionable...
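[Editor's note: a rough way to see this overhead from Python, as a hypothetical illustration rather than a measurement of the C internals discussed above, is to compare the per-element cost of summing a tiny array against a large one:]

```python
import timeit
import numpy as np

# For a tiny array, per-call setup (argument parsing, iterator
# construction) dominates sum(); for a large array, the inner reduction
# loop is essentially the whole cost.
small = np.random.randn(10)
large = np.random.randn(1_000_000)

t_small = timeit.timeit(small.sum, number=2000)
t_large = timeit.timeit(large.sum, number=20)

per_elem_small = t_small / (2000 * small.size)
per_elem_large = t_large / (20 * large.size)
print(f"per-element cost, 10 items:  {per_elem_small:.2e} s")
print(f"per-element cost, 1e6 items: {per_elem_large:.2e} s")
```

The per-element cost of the tiny sum should come out orders of magnitude higher, which is the fixed overhead being amortized over only ten elements.
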

Pauli Virtanen


Den 10.02.2011 16:29, skrev eat:
> One would expect sum to outperform dot by a clear margin. Does
> there exist any 'tricks' to increase the performance of sum?
I see that others have answered already. The ufunc np.sum is not
going to beat np.dot. You are racing the heavy machinery of NumPy (array
iterators, type checks, bounds checks, etc.) against the level-3 BLAS
routine DGEMM, the most heavily optimized numerical kernel ever written.
Also beware that computation is much cheaper than memory access. Although
DGEMM does more arithmetic, and is even O(N^3) in that respect, it is
always faster except for very sparse arrays. If you need fast loops, you
can always write your own Fortran or C, and even insert OpenMP pragmas.
But don't expect that to beat optimized high-level BLAS kernels by any
margin. The first chapters of "Numerical Methods in Fortran 90" might be
worth reading. It deals with several of these issues, including
dimensional expansion, which is important for writing fast numerical
code, but not intuitively obvious. "I expect this to be faster because
it does less work" is a fundamental misconception in numerical
computing. Whatever causes less traffic on the memory bus (the real
bottleneck) will almost always be faster, regardless of the amount of
work done by the CPU. Good advice is to use high-level BLAS whenever
you can. The only exception, as mentioned, is when matrices get very
sparse.
Sturla


Hi Sturla,
On Sat, Feb 12, 2011 at 5:38 PM, Sturla Molden <[hidden email]> wrote:
Den 10.02.2011 16:29, skrev eat:
> One would expect sum to outperform dot by a clear margin. Does
> there exist any 'tricks' to increase the performance of sum?
First of all, thanks for still replying. Well, I'm still a little bit
unsure how I should proceed with this discussion... I may have used bad
wording and created unnecessary mayhem with my original question (:.
Trust me, I'm only trying to discuss here with positive criticism in mind.
Now, I'm not pretending to know what kind of person a 'typical' numpy
user is. But I'm assuming there exist more than just me with roughly
similar questions in their (our) minds, who wish to utilize numpy in a
more 'pythonic; all batteries included' way. Occasionally I (we) may ask
really stupid questions, but please bear with us.
That said, I'm still very confident that (from a user's point of view)
there's some real substance to the issue I addressed.
I see that others have answered already. The ufunc np.sum is not
going to beat np.dot. You are racing the heavy machinery of NumPy (array
iterators, type checks, bounds checks, etc.) against the level-3 BLAS
routine DGEMM, the most heavily optimized numerical kernel ever written.
Fair enough.
Also beware that computation is much cheaper than memory access.
Sure, that's exactly where I expected the performance boost to emerge.
Although DGEMM does more arithmetic, and is even O(N^3) in that respect,
it is always faster except for very sparse arrays. If you need fast
loops, you can always write your own Fortran or C, and even insert
OpenMP pragmas.
That's a very important potential, but surely not all numpy users are expected to master that ;)
But don't expect that to beat optimized high-level BLAS kernels by any
margin. The first chapters of "Numerical Methods in Fortran 90" might be
worth reading. It deals with several of these issues, including
dimensional expansion, which is important for writing fast numerical
code, but not intuitively obvious. "I expect this to be faster because
it does less work" is a fundamental misconception in numerical computing.
Whatever causes less traffic on the memory bus (the real bottleneck) will
almost always be faster, regardless of the amount of work done by the CPU.
And I'm totally aware of that; it was exactly the intended logic of my
original question: "what if sum could follow in the footsteps of dot?
Then, since it executes fewer instructions, its execution time should be
bounded above by that of dot." But as R. Kern gently pointed out already,
that's not a fruitful enough avenue to pursue. And I'm able to live with
that.
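[Editor's note: for reference, the equivalence the whole thread rests on, namely that a BLAS matrix-vector product with a vector of ones computes exactly the row sums, can be checked directly. A small sketch with illustrative sizes:]

```python
import numpy as np

M = np.random.randn(10_000, 100)
o = np.ones(M.shape[1])

# dot() hands the work to an optimized BLAS kernel; it performs more
# arithmetic than sum() (an extra multiply by 1.0 per element) yet can
# still win on memory-bound workloads, as the timings in this thread show.
row_sums_blas = M.dot(o)
row_sums_ufunc = M.sum(axis=1)

# Same result, up to floating-point rounding.
assert np.allclose(row_sums_blas, row_sums_ufunc)
```
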
Regards,
eat
Good advice is to use high-level BLAS whenever you can. The only
exception, as mentioned, is when matrices get very sparse.
Sturla

