Quick Question about Optimization

Quick Question about Optimization

jbsnyder
Hi -

First off, I know that premature optimization is evil, and I should
make sure everything works as expected before bothering with squeezing
out extra performance.  The situation is that this particular block of
code works, but it is about half as fast with numpy as in matlab, and
I'm wondering if there's a better approach than what I'm doing.

I have a chunk of code, below, that generally runs for 2000
iterations, and the vectors being worked on at a given step generally
have ~14000 elements each.

In matlab, doing pretty much exactly the same thing takes about 6-7
seconds; with numpy it's always around 13-14 seconds on the same
machine.  I've seen this on both Linux and Mac OS X.

self.aff_input has the bulk of the data in it (2000x14000 array), and
the various steps are for computing the state of some afferent neurons
(if there's any interest, here's a paper that includes the model:
Brandman, R. and Nelson ME (2002) A simple model of long-term spike
train regularization. Neural Computation 14, 1575-1597.)

I've imported numpy as np.

Is there anything in practice here that could be done to speed this
up?  I'm looking for general numpy usage tips that I can apply while
writing further code, not things that would be obscure or difficult to
maintain in the future.

Also, the results of this are a binary array; I'm wondering if there's
anything more compact for storing it than using 8 bits to represent
each single bit.  I've poked around, but I haven't come up with any
clean and unhackish ideas :-)

Thanks!

I can provide the rest of the code if needed, but it's basically just
filling some vectors with random and empty data and initializing a few
things.

        for n in range(0, time_milliseconds):
            self.u = self.expfac_m * self.prev_u + (1 - self.expfac_m) * self.aff_input[n, :]
            self.v = self.u + self.sigma * np.random.standard_normal(size=(1, self.naff))
            self.theta = self.expfac_theta * self.prev_theta - (1 - self.expfac_theta)

            idx_spk = np.where(self.v >= self.theta)

            self.S[n, idx_spk] = 1
            self.theta[idx_spk] = self.theta[idx_spk] + self.b

            self.prev_u = self.u
            self.prev_theta = self.theta

--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

Re: Quick Question about Optimization

Robin-62
On Mon, May 19, 2008 at 7:08 PM, James Snyder <[hidden email]> wrote:

>
>         for n in range(0, time_milliseconds):
>             self.u = self.expfac_m * self.prev_u + (1 - self.expfac_m) * self.aff_input[n, :]
>             self.v = self.u + self.sigma * np.random.standard_normal(size=(1, self.naff))
>             self.theta = self.expfac_theta * self.prev_theta - (1 - self.expfac_theta)
>
>             idx_spk = np.where(self.v >= self.theta)
>
>             self.S[n, idx_spk] = 1
>             self.theta[idx_spk] = self.theta[idx_spk] + self.b
>
>             self.prev_u = self.u
>             self.prev_theta = self.theta

Hello,

The only thoughts I had were that, depending on how 'vectorised'
random.standard_normal is, it might be better to calculate a big block
of random data outside the loop and index into it the same way as
aff_input.  I'm not certain the indexing would be faster than the
function call, but it could be worth a try if you have enough memory.
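As a minimal sketch of that suggestion (sizes are scaled down from the
post's 2000x14000, and the variable names are stand-ins):

```python
import numpy as np

time_ms = 200   # stand-in for the 2000 time steps in the post
naff = 1400     # stand-in for the ~14000 afferents

# Draw all the normal deviates up front, one row per time step,
# then index a row inside the loop instead of calling the RNG each time.
stdnormrvs = np.random.standard_normal(size=(time_ms, naff))

for n in range(time_ms):
    noise = stdnormrvs[n, :]  # a view: no RNG call, no new allocation
```

Row indexing with a basic slice returns a view of the big block, so the
per-step cost is just the index arithmetic, at the price of holding the
whole block in memory.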

The other thing is there are a lot of self.'s there.  I don't have a
lot of practical experience, but I've read (
http://wiki.python.org/moin/PythonSpeed/PerformanceTips#head-aa6c07c46a630a2fa10bd6502510e532806f1f62
) that attribute lookups are slower than local variables, so another
thing to try would be to rebind everything to a local variable outside
the loop:
u = self.u
v = self.v
etc., which, although a bit unsightly, can actually make the inner
loop more readable and might speed things up.
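The rebinding pattern looks like this in full (a toy sketch; the class,
its attribute values, and the constant input are made up to keep it
self-contained):

```python
class Model(object):
    # A toy stand-in for the poster's class; attribute values are illustrative.
    def __init__(self):
        self.expfac_m = 0.9
        self.prev_u = 0.0

    def run(self, steps):
        # Rebind attributes to locals once, outside the loop: each
        # `self.x` inside the loop would repeat an attribute lookup.
        expfac_m = self.expfac_m
        prev_u = self.prev_u
        u = prev_u
        for n in range(steps):
            u = expfac_m * prev_u + (1 - expfac_m) * 1.0  # constant input
            prev_u = u
        # Write the state back to the instance once at the end.
        self.prev_u = prev_u
        return u
```

With a constant input of 1.0 the recurrence converges toward 1, which
makes the rewrite easy to sanity-check against the original version.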

The only other thing is to be careful with things like this when
translating from matlab:

>            self.prev_u = self.u

since this makes prev_u a reference to, not a copy of, the data.  It's
OK here because recreating u as a product creates a new object, but if
you changed u in another way, e.g. self.u[:100] = 10, then self.prev_u
would still point to the same array and would also reflect those
changes.  In this case it doesn't look like you explicitly need the
prev_ values, so you could do the u and theta updates in place
(although I'm not sure if that's quicker):

u *= expfac_m
u += (1 - expfac_m) * aff_input[n, :]

etc.  Of course you can also move the (1 - ...) factors outside the
loop, although again I'm not sure how much difference that would make.
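The reference-versus-copy point can be seen directly in a few lines:

```python
import numpy as np

u = np.zeros(5)
prev_u = u           # a reference: both names point at the same data
snapshot = u.copy()  # an independent copy of the data

u[:3] = 10           # modify u in place
# prev_u sees the change (it is the same array), while the copy
# keeps the old values.

u = u * 2            # rebinding u to a *new* array breaks the link;
                     # prev_u still points at the old (modified) array
```

This is why the original loop happens to be safe: each `self.u = ...`
creates a fresh array, so `prev_u` never gets silently overwritten.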

Sorry I can't give any concrete advice, but I hope I've given some ideas...

Cheers

Robin

Re: Quick Question about Optimization

Robin-62
Also you could use xrange instead of range...

Again, I'm not sure of the size of the effect, but it seems to be
recommended by the docstring.

Robin

Re: Quick Question about Optimization

Hoyt Koepke
In reply to this post by jbsnyder
>         for n in range(0, time_milliseconds):
>             self.u = self.expfac_m * self.prev_u + (1 - self.expfac_m) * self.aff_input[n, :]
>             self.v = self.u + self.sigma * np.random.standard_normal(size=(1, self.naff))
>             self.theta = self.expfac_theta * self.prev_theta - (1 - self.expfac_theta)
>
>             idx_spk = np.where(self.v >= self.theta)
>
>             self.S[n, idx_spk] = 1
>             self.theta[idx_spk] = self.theta[idx_spk] + self.b
>
>             self.prev_u = self.u
>             self.prev_theta = self.theta

Copying elements into array objects that already exist will always be
faster than creating a new object with separate data.  However, in
this case you don't need to do any copying or creation at all if you
use a flip-flopping index to keep track of the previous values.  If I
drop the selfs, the above translates into (untested):


curidx = 0  # prev will be 1 - curidx

u = np.empty((2, naff))
v = np.empty(naff)
theta = np.empty((2, naff))

stdnormrvs = np.random.standard_normal(size=(time_milliseconds, naff))

for n in xrange(time_milliseconds):
    u[curidx, :] = expfac_m * u[1 - curidx, :] + (1 - expfac_m) * aff_input[n, :]
    v[:] = u[curidx, :] + sigma * stdnormrvs[n, :]
    theta[curidx, :] = expfac_theta * theta[1 - curidx, :] - (1 - expfac_theta)

    idx_spk = np.where(v >= theta[curidx, :])
    S[n, idx_spk] = 1
    theta[curidx, idx_spk] += b

    # Flip so the current row becomes the previous one
    curidx = 1 - curidx

This should give you a substantial speedup.  Also, I have to say that
this is begging for weave.blitz, which compiles such statements using
templated C++ code to avoid temporaries.  It doesn't work on all
systems, but if it does in your case, here's what your code might look
like:

import scipy.weave as wv

curidx = 0

u = np.empty((2, naff))
v = np.empty((1, naff))
theta = np.empty((2, naff))

stdnormrvs = np.random.standard_normal(size=(time_milliseconds, naff))

for n in xrange(time_milliseconds):
    wv.blitz("u[curidx, :] = expfac_m * u[1 - curidx, :] + (1 - expfac_m) * aff_input[n, :]")
    wv.blitz("v[:] = u[curidx, :] + sigma * stdnormrvs[n, :]")
    wv.blitz("theta[curidx, :] = expfac_theta * theta[1 - curidx, :] - (1 - expfac_theta)")

    idx_spk = np.where(v >= theta[curidx, :])
    S[n, idx_spk] = 1
    theta[curidx, idx_spk] += b

    # Flip so the current row becomes the previous one
    curidx = 1 - curidx



--
+++++++++++++++++++++++++++++++++++
Hoyt Koepke
UBC Department of Computer Science
http://www.cs.ubc.ca/~hoytak/
[hidden email]
+++++++++++++++++++++++++++++++++++



Re: Quick Question about Optimization

Robin-62
Hi,

I think my understanding is somehow incomplete...  It's not clear to
me why (simplified case)

a[curidx, :] = scalar * a[1 - curidx, :]

should be faster than

a = scalar * b

In both cases I thought the scalar multiplication results in a new
array (new memory allocated), and then the difference between copying
that result into the existing array a[curidx, :] versus reassigning
the reference a to that result should be very small?

If anything, I would have thought the direct assignment would be
quicker, since then there is no copying.

What am I missing?

> This should give you a substantial speedup.  Also, I have to say that
> this is begging for weave.blitz, which compiles such statements using
> templated c++ code to avoid temporaries.  It doesn't work on all
> systems, but if it does in your case, here's what your code might look
> like:

If you haven't seen it, this page gives useful examples of methods to
speed up python code (including weave.blitz), which, as Hoyt says,
would be ideal in this case:
http://scipy.org/PerformancePython

Robin

Re: Quick Question about Optimization

Anne Archibald
In reply to this post by jbsnyder
2008/5/19 James Snyder <[hidden email]>:

> First off, I know that optimization is evil, and I should make sure
> that everything works as expected prior to bothering with squeezing
> out extra performance, but the situation is that this particular block
> of code works, but it is about half as fast with numpy as in matlab,
> and I'm wondering if there's a better approach than what I'm doing.
>
> I have a chunk of code, below, that generally iterates over 2000
> iterations, and the vectors that are being worked on at a given step
> generally have ~14000 elements in them.

With arrays this size, I wouldn't worry about python overhead - things
like range versus xrange or self lookups.

> Is there anything in practice here that could be done to speed this
> up?  I'm looking more for general numpy usage tips, that I can use
> while writing further code and not so things that would be obscure or
> difficult to maintain in the future.

Try using a profiler to find which steps are using most of your time.
With such a simple function it may not be very informative, but it's
worth a try.

> Also, the results of this are a binary array, I'm wondering if there's
> anything more compact for expressing than using 8 bits to represent
> each single bit.  I've poked around, but I haven't come up with any
> clean and unhackish ideas :-)

There's a tradeoff between compactness and speed here. The *fastest*
is probably one boolean per 32-bit integer. It sounds awful, I know,
but most modern CPUs have to work harder to access bytes individually
than they do to access them four at a time. On the other hand, cache
performance can make a huge difference, so compactness might actually
amount to speed. I don't think numpy has a packed bit array data type
(which is a shame, but would require substantial implementation
effort).
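One workable compromise, if space ever matters more than speed, is to
keep the raster as one byte per sample during the run and pack it
afterwards with np.packbits (a function, not a bit-array dtype; sizes
below are made-up stand-ins):

```python
import numpy as np

# A toy spike raster: 0/1 values stored one byte each (scaled-down sizes).
S = (np.random.random((200, 64)) < 0.1).astype(np.uint8)

packed = np.packbits(S, axis=1)  # 8 columns per byte: 1/8 the storage
# Unpack when the full raster is needed again (trim in case the
# column count was not a multiple of 8):
restored = np.unpackbits(packed, axis=1)[:, :S.shape[1]]
```

The round trip is lossless for 0/1 data, so packing can be deferred to
whenever the array is archived rather than paid inside the hot loop.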

> I can provide the rest of the code if needed, but it's basically just
> filling some vectors with random and empty data and initializing a few
> things.

It would kind of help, since it would make it clearer what's a scalar
and what's an array, and what the dimensions of the various arrays
are.

>        for n in range(0, time_milliseconds):
>            self.u = self.expfac_m * self.prev_u + (1 - self.expfac_m) * self.aff_input[n, :]
>            self.v = self.u + self.sigma * np.random.standard_normal(size=(1, self.naff))

You can use "scale" to rescale the random numbers on creation; that'll
save you a temporary.
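For example, the deviates can be drawn at the right scale in one call
via np.random.normal's scale parameter (sigma and the size here are
stand-in values):

```python
import numpy as np

sigma = 0.5
naff = 1400  # stand-in size

# Scaled at creation: one call, one array
noise = np.random.normal(loc=0.0, scale=sigma, size=(1, naff))

# Equivalent two-step version, which builds an extra temporary array:
noise2 = sigma * np.random.standard_normal(size=(1, naff))
```

Both produce N(0, sigma^2) samples; the first just skips the separate
multiply-by-sigma pass over the array.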

>            self.theta = self.expfac_theta * self.prev_theta - (1 - self.expfac_theta)
>
>            idx_spk = np.where(self.v >= self.theta)

You can probably skip the "where"; the result of the expression
self.v>=self.theta is a boolean array, which you can use directly for
indexing.

>            self.S[n,idx_spk] = 1
>            self.theta[idx_spk] = self.theta[idx_spk] + self.b

+= here might speed things up, not just in terms of temporaries but by
saving a fancy-indexing operation.
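Both suggestions together, as a small sketch (array sizes and the
threshold values are stand-ins):

```python
import numpy as np

naff = 1400  # stand-in size
v = np.random.standard_normal(naff)
theta = np.zeros(naff)
b = 0.05
S_row = np.zeros(naff)

mask = v >= theta   # boolean array; no where() needed
S_row[mask] = 1     # boolean arrays can index directly
theta[mask] += b    # in-place add: one indexing pass, no temporary
```

The `theta[mask] += b` form reads the masked elements, adds, and writes
them back in a single fancy-indexing operation, instead of the
read-index / add / write-index pair in the original.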

>            self.prev_u = self.u
>            self.prev_theta = self.theta



Anne

Re: Quick Question about Optimization

Eric Firing
In reply to this post by Robin-62
Robin wrote:
> Also you could use xrange instead of range...
>
> Again, not sure of the size of the effect but it seems to be
> recommended by the docstring.

No, it is going away in Python 3.0, and its only real benefit is a
memory saving in extreme cases.

From the Python library docs:
"The advantage of xrange() over range() is minimal (since xrange() still
has to create the values when asked for them) except when a very large
range is used on a memory-starved machine or when all of the range's
elements are never used (such as when the loop is usually terminated
with break)."

Eric

Re: Quick Question about Optimization

Hoyt Koepke
In reply to this post by Robin-62
On Mon, May 19, 2008 at 12:53 PM, Robin <[hidden email]> wrote:

> Hi,
>
> I think my understanding is somehow incomplete... It's not clear to me
> why (simplified case)
>
> a[curidx, :] = scalar * a[1 - curidx, :]
> should be faster than
> a = scalar * b
>
> In both cases I thought the scalar multiplication results in a new
> array (new memory allocated), and then the difference between copying
> that result into the existing array a[curidx, :] versus reassigning
> the reference a to that result should be very small?
>
> If anything, I would have thought the direct assignment would be
> quicker, since then there is no copying.
>
> What am I missing?

Actually, I think you are correct.  My bad.  I was mainly thinking in
terms of weave.blitz, where it would make a difference, then
translating back...

--Hoyt


Re: Quick Question about Optimization

Chris Barker - NOAA Federal
In reply to this post by Anne Archibald
Anne Archibald wrote:
> 2008/5/19 James Snyder <[hidden email]>:
>> I can provide the rest of the code if needed, but it's basically just
>> filling some vectors with random and empty data and initializing a few
>> things.
>
> It would kind of help, since it would make it clearer what's a scalar
> and what's an array, and what the dimensions of the various arrays
> are.

It would also help if you provided a complete example (with as little
code as possible), so we could try out and time our ideas before
suggesting them.

>> np.random.standard_normal(size=(1,self.naff))

Anyone know how fast this is compared to Matlab? That could be the
difference right there.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]

Re: Quick Question about Optimization

Charles R Harris


On Mon, May 19, 2008 at 2:53 PM, Christopher Barker <[hidden email]> wrote:
> >> np.random.standard_normal(size=(1,self.naff))
>
> Anyone know how fast this is compared to Matlab? That could be the
> difference right there.

The latest versions of Matlab use the ziggurat method to generate random normals and it is faster than the method used in numpy. I have ziggurat code at hand, but IIRC, Robert doesn't trust the method ;) I don't know if it would actually speed things up, though.

Chuck




Re: Quick Question about Optimization

Robert Kern-2
On Mon, May 19, 2008 at 5:27 PM, Charles R Harris
<[hidden email]> wrote:
> The latest versions of Matlab use the ziggurat method to generate random
> normals and it is faster than the method used in numpy. I have ziggurat code
> at hand, but IIRC, Robert doesn't trust the method ;)

Well, I outlined the tests that would satisfy me, but I don't think
you ever responded.

  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004405.html
  which references
  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004400.html

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco

Re: Quick Question about Optimization

jbsnyder
In reply to this post by Chris Barker - NOAA Federal
Separating the response into two emails, here's the part that comes
from the implementation of random:

In short, that's part of the difference. I ran these a few times to
check for consistency.

MATLAB (R2008a):

tic
for i = 1:2000
    a = randn(1,13857);
end
toc

Runtime: ~0.733489 s

NumPy (1.1.0rc1):

import numpy as np
import time

t1 = time.time()
for n in xrange(0, 2000):
    a = np.random.standard_normal(size=(1, 14000))
t2 = time.time()

print 'Runtime: %1.3f s' % (t2 - t1)

Runtime: ~2.716 s




--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]

Re: Quick Question about Optimization

Charles R Harris
In reply to this post by Robert Kern-2


On Mon, May 19, 2008 at 4:36 PM, Robert Kern <[hidden email]> wrote:
> Well, I outlined the tests that would satisfy me, but I don't think
> you ever responded.
>
>  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004405.html
>  which references
>  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004400.html

It's been tested in the literature.  It's basically just sampling a
Gaussian, but done so that most of the tests for being under the curve
are trivial and there are few misses: the Gaussian is covered with a
stack (ziggurat) of slices of equal area, a slice is chosen at random,
then a position along the slice is chosen at random.  Most of those
points will be under the curve, except near the ends, and it is only
the remainder that require computation.  However, like all sampling it
has to be carefully implemented, and the samples are discretized
differently than with the current method.  Floats are strange that way
because they are on a log scale.  The tails will be fine; the real
question is how much precision you want when doubles are returned,
i.e., how fine the discretization of the resulting samples should be.
The same method also works for the exponential distribution.

I don't feel this is a pressing issue, when I need fast normals I use my own code. But if we are in competition with Matlab maybe we should give it a shot.

Chuck




Re: Quick Question about Optimization

Robert Kern-2
On Mon, May 19, 2008 at 6:39 PM, Charles R Harris
<[hidden email]> wrote:

> It's been tested in the literature.

And it happened to fail such tests. Hence the Doornik paper which
improves Marsaglia's method to pass the appropriate tests.
Consequently, I want to see the tests performed on the actual
implementation before using it. It's a complicated algorithm that is
*demonstrably* easy to get wrong.

--
Robert Kern


Re: Quick Question about Optimization

jbsnyder
In reply to this post by Chris Barker - NOAA Federal
On to the code: here's a current implementation, attached.  I make no
claims about it being great code; I've modified it so that there is a
weave version and a sans-weave version.

Many of the suggestions make things a bit faster.  The weave version
bombs out with a rather long log, which can be found at:
http://pastebin.com/m79699c04

I can tell it's failing on the second weave.blitz line, but I don't
understand exactly why.

What does this mean?:
error: no match for call to '(blitz::FastArrayIterator<double, 1>)
(const blitz::TinyVector<int, 2>&)'

Also note, I'm not asking to match MATLAB performance.  It'd be nice,
but again I'm just trying to put together decent, fairly efficient
numpy code.



--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]


np_afftest.py (5K) Download Attachment

Re: Quick Question about Optimization

Charles R Harris
In reply to this post by Robert Kern-2


On Mon, May 19, 2008 at 5:52 PM, Robert Kern <[hidden email]> wrote:

> And it happened to fail such tests. Hence the Doornik paper which
> improves Marsaglia's method to pass the appropriate tests.

Exactly.  Doornik was more careful about using independent samples and
also used a better random number generator (MWC8222), not exactly
rocket science.  Believe it or not, I had read Doornik's paper before
I did my implementation.  I also used a better ziggurat, IMHO, than
Marsaglia and Doornik.

MWC8222 is also about twice as fast on AMD hardware as the Mersenne
Twister, but it does require more careful initialization.  On Pentium V
they are about the same.  I haven't benchmarked either on Core2.

Chuck



Re: Quick Question about Optimization

Robert Kern-2
In reply to this post by jbsnyder
On Mon, May 19, 2008 at 6:55 PM, James Snyder <[hidden email]> wrote:
> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
> but again I'm just trying to put together decent, fairly efficient
> numpy code.

I can cut the time by about a quarter by just using the boolean mask
directly instead of using where().

            for n in range(0,time_milliseconds):
                u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
                v = u + sigma * stdnormrvs[n, :]
                theta = expfac_theta * prev_theta - (1-expfac_theta)

                mask = (v >= theta)

                S[n,np.squeeze(mask)] = 1
                theta[mask] += b

                prev_u = u
                prev_theta = theta


There aren't any good line-by-line profiling tools in Python, but you
can fake it by making a local function for each line:

            def f1():
                u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
                return u
            def f2():
                v = u + sigma * stdnormrvs[n, :]
                return v
            def f3():
                theta = expfac_theta * prev_theta - (1-expfac_theta)
                return theta
            def f4():
                mask = (v >= theta)
                return mask
            def f5():
                S[n,np.squeeze(mask)] = 1
            def f6():
                theta[mask] += b

            # Run Standard, Unoptimized Model
            for n in range(0,time_milliseconds):
                u = f1()
                v = f2()
                theta = f3()
                mask = f4()
                f5()
                f6()

                prev_u = u
                prev_theta = theta

I get f6() as being the biggest bottleneck, followed by the general
time spent in the loop (about the same), followed by f5(), f1(), and
f3() (each about half of f6()), followed by f2() (about half of f5()).
f4() is negligible.

Masked operations are inherently slow. They mess up the CPU's branch
prediction. Worse, the use of iterators in that part of the code
frustrates the compiler's attempts to optimize it away in the case of
contiguous arrays.
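The two indexing styles give identical results; a minimal sketch of the
substitution (array contents here are made up):

```python
import numpy as np

v = np.random.standard_normal(14000)
theta = np.zeros(14000)

# where(): builds an integer index array first, then indexes with it.
S1 = np.zeros(14000)
S1[np.where(v >= theta)[0]] = 1

# Boolean mask: indexes directly, skipping the intermediate index array.
S2 = np.zeros(14000)
S2[v >= theta] = 1

assert (S1 == S2).all()
```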

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco

Re: Quick Question about Optimization

Robert Kern-2
In reply to this post by Charles R Harris
On Mon, May 19, 2008 at 7:30 PM, Charles R Harris
<[hidden email]> wrote:

>
> On Mon, May 19, 2008 at 5:52 PM, Robert Kern <[hidden email]> wrote:
>>
>> On Mon, May 19, 2008 at 6:39 PM, Charles R Harris
>> <[hidden email]> wrote:
>> >
>> > On Mon, May 19, 2008 at 4:36 PM, Robert Kern <[hidden email]>
>> > wrote:
>> >>
>> >> On Mon, May 19, 2008 at 5:27 PM, Charles R Harris
>> >> <[hidden email]> wrote:
>> >> > The latest versions of Matlab use the ziggurat method to generate
>> >> > random
>> >> > normals and it is faster than the method used in numpy. I have
>> >> > ziggurat
>> >> > code
>> >> > at hand, but IIRC, Robert doesn't trust the method ;)
>> >>
>> >> Well, I outlined the tests that would satisfy me, but I don't think
>> >> you ever responded.
>> >>
>> >>
>> >>  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004405.html
>> >>  which references
>> >>
>> >>  http://projects.scipy.org/pipermail/scipy-dev/2005-December/004400.html
>> >
>> > It's been tested in the literature.
>>
>> And it happened to fail such tests. Hence the Doornik paper which
>> improves Marsaglia's method to pass the appropriate tests.
>
> Exactly. Doornik was more careful about using independent samples and also
> used a better random number generator (MWC8222), not exactly rocket
> science.  Believe it or not, I had read Doornik's paper before I did my
> implementation.

Good! I believe this is the first time you've mentioned it.

> I also used a better ziggurat, IMHO, than Marsaglia and
> Doornik.

Ah. HOs need testing. dieharder is probably de rigueur these days, and
it will accept input from a file.

  http://www.phy.duke.edu/~rgb/General/dieharder.php

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco

Re: Quick Question about Optimization

Eric Firing
In reply to this post by Robert Kern-2
Robert Kern wrote:

> On Mon, May 19, 2008 at 6:55 PM, James Snyder <[hidden email]> wrote:
>> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
>> but again I'm just trying to put together decent, fairly efficient
>> numpy code.
>
> I can cut the time by about a quarter by just using the boolean mask
> directly instead of using where().
> [snip the loop and per-line profiling code]
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
>
> Masked operations are inherently slow. They mess up CPU's branch
> prediction. Worse, the use of iterators in that part of the code
> frustrates compilers' attempts to optimize that away in the case of
> contiguous arrays.
>

f6 can be sped up by more than a factor of 2 by using putmask:

In [10]:xx = np.random.rand(100000)

In [11]:mask = xx > 0.5

In [12]:timeit xx[mask] += 2.34
100 loops, best of 3: 4.06 ms per loop

In [14]:timeit np.putmask(xx, mask, xx+2.34)
1000 loops, best of 3: 1.4 ms per loop

I think that
        xx += 2.34*mask
will be similarly quick, but I can't get ipython timeit to work with it.
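For the record, the three update styles should all agree; a self-contained
check (same sizes as above — note that with putmask the expression
xx + 2.34 is computed for every element, masked or not):

```python
import numpy as np

xx = np.random.rand(100000)
mask = xx > 0.5

a = xx.copy()
a[mask] += 2.34                  # fancy-index in-place update

b = xx.copy()
np.putmask(b, mask, b + 2.34)    # b + 2.34 is evaluated for ALL elements

c = xx.copy()
c += 2.34 * mask                 # False multiplies to 0.0, so only True moves

assert np.allclose(a, b) and np.allclose(a, c)
```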

Eric



Re: Quick Question about Optimization

jbsnyder
In reply to this post by Robert Kern-2
I've done a little profiling with cProfile as well as with dtrace,
since the bindings exist in Mac OS X and you can use a lot of the D
scripts that apply to Python.  Previously I found that the
np.random call and the where() (in the original code) were heavy hitters
as far as amount of time consumed.

The time has now been shaved down to ~9 seconds with this suggestion,
from the original 13-14s, with the inclusion of Eric Firing's
suggestions.  This is without scipy.weave, which at the moment I can't
get to work for all lines, and when I just replace one of them
successfully it seems to run more slowly, I assume because it is
converting data back and forth.

Quick question regarding the pointer abstraction that's going on, the
following seems to work:
np.putmask(S[n,:],np.squeeze(mask),1)

with that section of S being worked on.  Is it safe to assume in most
cases while working with NumPy that, without additional operations
aside from indexing, a reference rather than a copy is being
passed?  It certainly seems like this sort of thing, including stuff
like:

        u = self.u
        v = self.v
        theta = self.theta
        ...

without having to repack those data into self later, since u, v, theta
are just references to the existing data, which saves on code and whatnot,
but I'm a little worried about not being explicit.

Are there any major pitfalls to be aware of?  It sounds like if I do:
f = a[n,:] I get a reference, but if I did something like g = a[n,:]*2
I would get a copy.
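A quick way to settle the view-versus-copy question case by case is
np.may_share_memory (a sketch; the array and names here are made up):

```python
import numpy as np

a = np.zeros((4, 5))
n = 2

f = a[n, :]        # basic slicing -> a view onto a's data
g = a[n, :] * 2    # arithmetic allocates a new array -> a copy
h = a[[0, 3], :]   # fancy (integer-list) indexing -> also a copy

assert np.may_share_memory(a, f)
assert not np.may_share_memory(a, g)
assert not np.may_share_memory(a, h)

f[:] = 7           # writes through the view into a
assert (a[n] == 7).all()
```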

Thanks guys.  This is definitely useful, especially in combination
with using PyPar on my dual core system I'm getting pretty good
performance :-)

If anyone has any clues on why scipy.weave is blowing up
(http://pastebin.com/m79699c04) using the code I attached, I wouldn't
mind knowing.  Most of the times I've attempted using weave, I've been
a little baffled about why things aren't working.  I also don't have a
sense for whether all numpy functions should be "weavable" or if it's
just general array operations that can be put through weave.

I know this is the numpy list, so I can take things over to the scipy
list if that's more appropriate.


On Mon, May 19, 2008 at 7:36 PM, Robert Kern <[hidden email]> wrote:

> [snip: full quote of Robert Kern's message, reproduced earlier in the
> thread]



--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]