

Hi -
First off, I know that optimization is evil, and I should make sure
that everything works as expected prior to bothering with squeezing
out extra performance, but the situation is that this particular block
of code works, yet it is about half as fast with numpy as in matlab,
and I'm wondering if there's a better approach than what I'm doing.
I have a chunk of code, below, that generally runs for 2000
iterations, and the vectors being worked on at a given step
generally have ~14000 elements in them.
In matlab, doing pretty much exactly the same thing takes about 6-7
seconds; it's always around 13-14 s with numpy on the same machine. I've
gotten this on Linux & Mac OS X.
self.aff_input has the bulk of the data in it (2000x14000 array), and
the various steps are for computing the state of some afferent neurons
(if there's any interest, here's a paper that includes the model:
Brandman, R. and Nelson, M.E. (2002) A simple model of long-term spike
train regularization. Neural Computation 14, 1575-1597.)
I've imported numpy as np.
Is there anything in practice here that could be done to speed this
up? I'm looking more for general numpy usage tips that I can use
while writing further code, not for things that would be obscure or
difficult to maintain in the future.
Also, the results of this are a binary array; I'm wondering if there's
anything more compact for expressing it than using 8 bits to represent
each single bit. I've poked around, but I haven't come up with any
clean and unhackish ideas :)
Thanks!
I can provide the rest of the code if needed, but it's basically just
filling some vectors with random and empty data and initializing a few
things.
for n in range(0, time_milliseconds):
    self.u = (self.expfac_m * self.prev_u +
              (1 - self.expfac_m) * self.aff_input[n, :])
    self.v = (self.u + self.sigma *
              np.random.standard_normal(size=(1, self.naff)))
    self.theta = (self.expfac_theta * self.prev_theta -
                  (1 - self.expfac_theta))

    idx_spk = np.where(self.v >= self.theta)

    self.S[n, idx_spk] = 1
    self.theta[idx_spk] = self.theta[idx_spk] + self.b

    self.prev_u = self.u
    self.prev_theta = self.theta

James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion


On Mon, May 19, 2008 at 7:08 PM, James Snyder < [hidden email]> wrote:
>
> for n in range(0, time_milliseconds):
>     self.u = (self.expfac_m * self.prev_u +
>               (1 - self.expfac_m) * self.aff_input[n, :])
>     self.v = (self.u + self.sigma *
>               np.random.standard_normal(size=(1, self.naff)))
>     self.theta = (self.expfac_theta * self.prev_theta -
>                   (1 - self.expfac_theta))
>
>     idx_spk = np.where(self.v >= self.theta)
>
>     self.S[n, idx_spk] = 1
>     self.theta[idx_spk] = self.theta[idx_spk] + self.b
>
>     self.prev_u = self.u
>     self.prev_theta = self.theta
Hello,
The only thoughts I had were that, depending on how 'vectorised'
random.standard_normal is, it might be better to calculate a big block
of random data outside the loop and index it in the same way as
aff_input. I'm not certain the indexing would be faster than the
function call, but it could be worth a try if you have enough memory.
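That pre-generation idea looks roughly like this (a minimal sketch; `n_steps` and `n_aff` are stand-ins for time_milliseconds and naff):

```python
import numpy as np

n_steps, n_aff = 2000, 14000   # stand-ins for time_milliseconds, naff

# One big call up front instead of one call per loop iteration.
stdnormrvs = np.random.standard_normal(size=(n_steps, n_aff))

for n in range(n_steps):
    noise = stdnormrvs[n, :]   # basic slicing: a cheap view, no new allocation
```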
The other thing is there are a lot of self.'s there. I don't have a
lot of practical experience, but I've read
(http://wiki.python.org/moin/PythonSpeed/PerformanceTips#head-aa6c07c46a630a2fa10bd6502510e532806f1f62)
that attribute (dot-based) lookups are slower than local variables, so another
thing to try would be to rebind everything to a local variable outside
the loop:
u = self.u
v = self.v
etc., which although a bit unsightly actually can make the inner loop
more readable and might speed things up.
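A sketch of that rebinding pattern (the class, sizes, and values here are made up for illustration, not the original model's):

```python
import numpy as np

class Model(object):
    def __init__(self):
        # Illustrative sizes/values only.
        self.expfac_m = 0.95
        self.prev_u = np.zeros(1000)
        self.aff_input = np.random.standard_normal((100, 1000))

    def run(self):
        # Rebind attributes to locals once, outside the hot loop;
        # each self.x inside the loop would be an extra attribute lookup.
        expfac_m = self.expfac_m
        prev_u = self.prev_u
        aff_input = self.aff_input
        for n in range(aff_input.shape[0]):
            prev_u = expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]
        self.prev_u = prev_u
        return prev_u
```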
The only other thing is to be careful with things like this when
translating from matlab:
> self.prev_u = self.u
since this is a reference, not a copy of the data. This is OK here because
when you recreate u as a product it creates a new object, but if you
changed u in another way, e.g. self.u[:100] = 10, then self.prev_u would
still be pointing to the same array and would also reflect those changes.
In this case it doesn't look like you explicitly need the prev_ values,
so it's possible you could do the v and theta updates in place
(although I'm not sure if that's quicker):
u *= expfac_m
u += (1 - expfac_m) * aff_input[n, :]  # etc.
Of course you can also take the (1 - ...) factors outside the loop, although
again I'm not sure how much difference it would make.
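The reference-versus-copy point can be seen in a few lines (independent of the model code):

```python
import numpy as np

a = np.zeros(5)
alias = a            # another name for the SAME array
snapshot = a.copy()  # an independent copy of the data

a[:3] = 10           # in-place change: the alias sees it, the copy doesn't

b = a * 2            # a product creates a brand-new array object
```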
Sorry I can't give any concrete advice, but I hope I've given some ideas...
Cheers
Robin


> for n in range(0, time_milliseconds):
>     self.u = (self.expfac_m * self.prev_u +
>               (1 - self.expfac_m) * self.aff_input[n, :])
>     self.v = (self.u + self.sigma *
>               np.random.standard_normal(size=(1, self.naff)))
>     self.theta = (self.expfac_theta * self.prev_theta -
>                   (1 - self.expfac_theta))
>
>     idx_spk = np.where(self.v >= self.theta)
>
>     self.S[n, idx_spk] = 1
>     self.theta[idx_spk] = self.theta[idx_spk] + self.b
>
>     self.prev_u = self.u
>     self.prev_theta = self.theta
Copying elements into array objects that already exist will always be
faster than creating a new object with separate data. However, in
this case, you don't need to do any copying or creation at all if you use a
flip-flopping index to keep track of the previous values. If I drop
the selfs, you can translate the above into (untested):
curidx = 0   # prev will be 1 - curidx
u = np.empty((2, naff))
v = np.empty(naff)
theta = np.empty((2, naff))
stdnormrvs = np.random.standard_normal(size=(time_milliseconds, naff))

for n in xrange(time_milliseconds):
    u[curidx, :] = expfac_m * u[1 - curidx, :] + (1 - expfac_m) * aff_input[n, :]
    v[:] = u[curidx, :] + sigma * stdnormrvs[n, :]
    theta[curidx, :] = expfac_theta * theta[1 - curidx, :] - (1 - expfac_theta)

    idx_spk = np.where(v >= theta[curidx, :])
    S[n, idx_spk] = 1
    theta[curidx, idx_spk] += b

    # Flip-flop to make the current row the previous one
    curidx = 1 - curidx
This should give you a substantial speedup. Also, I have to say that
this is begging for weave.blitz, which compiles such statements using
templated c++ code to avoid temporaries. It doesn't work on all
systems, but if it does in your case, here's what your code might look
like:
import scipy.weave as wv

curidx = 0
u = np.empty((2, naff))
v = np.empty((1, naff))
theta = np.empty((2, naff))
stdnormrvs = np.random.standard_normal(size=(time_milliseconds, naff))

for n in xrange(time_milliseconds):
    wv.blitz("u[curidx, :] = expfac_m * u[1 - curidx, :]"
             " + (1 - expfac_m) * aff_input[n, :]")
    wv.blitz("v[:] = u[curidx, :] + sigma * stdnormrvs[n, :]")
    wv.blitz("theta[curidx, :] = expfac_theta * theta[1 - curidx, :]"
             " - (1 - expfac_theta)")

    idx_spk = np.where(v >= theta[curidx, :])
    S[n, idx_spk] = 1
    theta[curidx, idx_spk] += b

    # Flip-flop to make the current row the previous one
    curidx = 1 - curidx

+++++++++++++++++++++++++++++++++++
Hoyt Koepke
UBC Department of Computer Science
http://www.cs.ubc.ca/~hoytak/
[hidden email]
+++++++++++++++++++++++++++++++++++


Hi,
I think my understanding is somehow incomplete... It's not clear to me
why (simplified case)
a[curidx, :] = scalar * a[1 - curidx, :]
should be faster than
a = scalar * b
In both cases I thought the scalar multiplication results in a new
array (new memory allocated), and then the difference between copying
that result into the existing array a[curidx, :] or rebinding the
reference a to that result should be very similar?
If anything I would have thought the direct assignment would be
quicker since then there is no copying.
What am I missing?
> This should give you a substantial speedup. Also, I have to say that
> this is begging for weave.blitz, which compiles such statements using
> templated c++ code to avoid temporaries. It doesn't work on all
> systems, but if it does in your case, here's what your code might look
> like:
If you haven't seen it, this page gives useful examples of methods to
speed up python code (including weave.blitz), which as Hoyt says would
be ideal in this case:
http://scipy.org/PerformancePython
Robin


2008/5/19 James Snyder < [hidden email]>:
> First off, I know that optimization is evil, and I should make sure
> that everything works as expected prior to bothering with squeezing
> out extra performance, but the situation is that this particular block
> of code works, but it is about half as fast with numpy as in matlab,
> and I'm wondering if there's a better approach than what I'm doing.
>
> I have a chunk of code, below, that generally iterates over 2000
> iterations, and the vectors that are being worked on at a given step
> generally have ~14000 elements in them.
With arrays this size, I wouldn't worry about python overhead - things
like range versus xrange or self lookups.
> Is there anything in practice here that could be done to speed this
> up? I'm looking more for general numpy usage tips, that I can use
> while writing further code and not so things that would be obscure or
> difficult to maintain in the future.
Try using a profiler to find which steps are using most of your time.
With such a simple function it may not be very informative, but it's
worth a try.
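For example, the standard-library cProfile module can break a run down by function (the `simulate` function here is a made-up stand-in for the model loop):

```python
import cProfile

import numpy as np

def simulate():
    # Stand-in workload for the neuron-model loop.
    total = 0.0
    for n in range(100):
        a = np.random.standard_normal(1000)
        total += a.sum()
    return total

# Run under the profiler and print per-function timings,
# sorted by cumulative time.
profiler = cProfile.Profile()
result = profiler.runcall(simulate)
profiler.print_stats(sort='cumulative')
```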
> Also, the results of this are a binary array, I'm wondering if there's
> anything more compact for expressing than using 8 bits to represent
> each single bit. I've poked around, but I haven't come up with any
> clean and unhackish ideas :)
There's a tradeoff between compactness and speed here. The *fastest*
is probably one boolean per 32-bit integer. It sounds awful, I know,
but most modern CPUs have to work harder to access bytes individually
than to access them four at a time. On the other hand, cache
performance can make a huge difference, so compactness might actually
amount to speed. I don't think numpy has a packed bit array data type
(which is a shame, but would require substantial implementation
effort).
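One caveat worth noting: while there is indeed no bit-array dtype, numpy (at least in current versions) does ship np.packbits/np.unpackbits for compact storage of a 0/1 array (8 flags per byte), at the cost of unpacking before use:

```python
import numpy as np

spikes = np.array([1, 0, 1, 1, 0, 0, 0, 1], dtype=np.uint8)

packed = np.packbits(spikes)       # 8 values -> 1 byte (0b10110001)
unpacked = np.unpackbits(packed)   # back to one uint8 per flag
```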
> I can provide the rest of the code if needed, but it's basically just
> filling some vectors with random and empty data and initializing a few
> things.
It would kind of help, since it would make it clearer what's a scalar
and what's an array, and what the dimensions of the various arrays
are.
> for n in range(0,time_milliseconds):
> self.u = self.expfac_m * self.prev_u +
> (1self.expfac_m) * self.aff_input[n,:]
> self.v = self.u + self.sigma *
> np.random.standard_normal(size=(1,self.naff))
You can use "scale" to rescale the random numbers on creation; that'll
save you a temporary.
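In other words, instead of drawing standard normals and multiplying by sigma (which builds one extra temporary), np.random.normal can draw at the desired scale directly (the sizes and sigma here are illustrative):

```python
import numpy as np

sigma, naff = 0.25, 1000   # illustrative values

# Two steps, one temporary array for the standard normals:
noise_scaled = sigma * np.random.standard_normal(size=naff)

# One step, drawn at the right scale from the start:
noise_direct = np.random.normal(loc=0.0, scale=sigma, size=naff)
```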
> self.theta = (self.expfac_theta * self.prev_theta -
>               (1 - self.expfac_theta))
>
> idx_spk = np.where(self.v>=self.theta)
You can probably skip the "where"; the result of the expression
self.v>=self.theta is a boolean array, which you can use directly for
indexing.
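A small illustration of indexing with the boolean result directly:

```python
import numpy as np

v = np.array([0.1, 0.9, 0.4, 1.2])
theta = np.array([0.5, 0.5, 0.5, 0.5])

mask = v >= theta     # boolean array; no np.where() needed
theta[mask] += 1.0    # boolean indexing selects the "spiking" elements
```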
> self.S[n,idx_spk] = 1
> self.theta[idx_spk] = self.theta[idx_spk] + self.b
+= here might speed things up, not just in terms of temporaries but by
saving a fancy-indexing operation.
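The difference is between one fancy-indexing pass and two; a tiny sketch:

```python
import numpy as np

theta = np.zeros(6)
idx = np.array([1, 3, 5])

# Two passes over idx: gather theta[idx], add, then scatter back.
theta[idx] = theta[idx] + 0.5

# One pass: augmented assignment updates the selected elements in place.
theta[idx] += 0.5
```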
> self.prev_u = self.u
> self.prev_theta = self.theta
Anne


Robin wrote:
> Also you could use xrange instead of range...
>
> Again, not sure of the size of the effect but it seems to be
> recommended by the docstring.
No, it is going away in Python 3.0, and its only real benefit is a
memory saving in extreme cases.
From the Python library docs:
"The advantage of xrange() over range() is minimal (since xrange() still
has to create the values when asked for them) except when a very large
range is used on a memory-starved machine or when all of the range's
elements are never used (such as when the loop is usually terminated
with break)."
Eric


On Mon, May 19, 2008 at 12:53 PM, Robin < [hidden email]> wrote:
> I think my understanding is somehow incomplete... It's not clear to me
> why (simplified case)
>
> a[curidx, :] = scalar * a[1 - curidx, :]
> should be faster than
> a = scalar * b
>
> What am I missing?
Actually, I think you are correct. My bad. I was mainly thinking in
terms of weave.blitz, where it would make a difference, then
translating back...
Hoyt


Anne Archibald wrote:
> 2008/5/19 James Snyder < [hidden email]>:
>> I can provide the rest of the code if needed, but it's basically just
>> filling some vectors with random and empty data and initializing a few
>> things.
>
> It would kind of help, since it would make it clearer what's a scalar
> and what's an array, and what the dimensions of the various arrays
> are.
It would also help if you provided a complete example (as little code as
possible), so we could try out and time our ideas before suggesting them.
>> np.random.standard_normal(size=(1,self.naff))
Anyone know how fast this is compared to Matlab? That could be the
difference right there.
Chris

Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
[hidden email]


On Mon, May 19, 2008 at 2:53 PM, Christopher Barker < [hidden email]> wrote:
>> I can provide the rest of the code if needed, but it's basically just
>> filling some vectors with random and empty data and initializing a few
>> things.
>
> It would kind of help, since it would make it clearer what's a scalar
> and what's an array, and what the dimensions of the various arrays
> are.
> It would also help if you provided a complete example (as little code as
> possible), so we could try out and time our ideas before suggesting them.
>
> >> np.random.standard_normal(size=(1,self.naff))
>
> Anyone know how fast this is compared to Matlab? That could be the
> difference right there.
The latest versions of Matlab use the ziggurat method to generate random
normals, and it is faster than the method used in numpy. I have ziggurat
code at hand, but IIRC, Robert doesn't trust the method ;) I don't know
if it would actually speed things up, though.
Chuck


Separating the response into 2 emails, here's the aspect that comes
from implementations of random:
In short, that's part of the difference. I ran these a few times to
check for consistency.
MATLAB (R2008a):

tic
for i = 1:2000
    a = randn(1,13857);
end
toc

Runtime: ~0.733489 s
NumPy (1.1.0rc1):
import numpy as np
import time

t1 = time.time()
for n in xrange(0, 2000):
    a = np.random.standard_normal(size=(1, 14000))
t2 = time.time()
print 'Runtime: %1.3f s' % (t2 - t1)

Runtime: ~2.716 s


On Mon, May 19, 2008 at 4:36 PM, Robert Kern < [hidden email]> wrote:
> On Mon, May 19, 2008 at 5:27 PM, Charles R Harris
> < [hidden email]> wrote:
> > The latest versions of Matlab use the ziggurat method to generate random
> > normals and it is faster than the method used in numpy. I have ziggurat code
> > at hand, but IIRC, Robert doesn't trust the method ;)
>
> Well, I outlined the tests that would satisfy me, but I don't think
> you ever responded.
>
> http://projects.scipy.org/pipermail/scipy-dev/2005-December/004405.html
> which references
> http://projects.scipy.org/pipermail/scipy-dev/2005-December/004400.html
It's been tested in the literature. It's basically just sampling a
Gaussian, but done so that most of the tests for being under the curve
are trivial and there are few misses; i.e., the Gaussian is covered with
a stack (ziggurat) of slices of equal area, a slice is chosen at random,
then the position along the slice is chosen at random. Most of those
points will be under the curve, except at the ends, and it is those last
that require computation.

However, like all sampling it has to be carefully implemented, and the
samples are discretized differently than in the current method. Floats
are strange that way because they are on a log scale. The tails will be
fine; the real question is how much precision you want when doubles are
returned, i.e., how fine the discretization of the resulting samples
should be. The same method also works for the exponential distribution.
I don't feel this is a pressing issue, when I need fast normals I use my own code. But if we are in competition with Matlab maybe we should give it a shot.
Chuck


On Mon, May 19, 2008 at 6:39 PM, Charles R Harris
< [hidden email]> wrote:
>
> On Mon, May 19, 2008 at 4:36 PM, Robert Kern < [hidden email]> wrote:
>>
>> On Mon, May 19, 2008 at 5:27 PM, Charles R Harris
>> < [hidden email]> wrote:
>> > The latest versions of Matlab use the ziggurat method to generate random
>> > normals and it is faster than the method used in numpy. I have ziggurat
>> > code
>> > at hand, but IIRC, Robert doesn't trust the method ;)
>>
>> Well, I outlined the tests that would satisfy me, but I don't think
>> you ever responded.
>>
>> http://projects.scipy.org/pipermail/scipy-dev/2005-December/004405.html
>> which references
>> http://projects.scipy.org/pipermail/scipy-dev/2005-December/004400.html
>
> It's been tested in the literature.
And it happened to fail such tests. Hence the Doornik paper which
improves Marsaglia's method to pass the appropriate tests.
Consequently, I want to see the tests performed on the actual
implementation before using it. It's a complicated algorithm that is
*demonstrably* easy to get wrong.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco


On to the code: here's a current implementation, attached. I make no
claims about it being great code; I've modified it so that there is a
weave version and a sans-weave version.
Many of the suggestions make things a bit faster. The weave version
bombs out with a rather long log, which can be found at:
http://pastebin.com/m79699c04
I can tell it's failing for the second weave.blitz line, but I don't
understand why exactly.
What does this mean?
error: no match for call to '(blitz::FastArrayIterator<double, 1>)
(const blitz::TinyVector<int, 2>&)'
Also note, I'm not asking to match MATLAB performance. It'd be nice,
but again I'm just trying to put together decent, fairly efficient
numpy code.


On Mon, May 19, 2008 at 5:52 PM, Robert Kern < [hidden email]> wrote:
> > It's been tested in the literature.
>
> And it happened to fail such tests. Hence the Doornik paper which
> improves Marsaglia's method to pass the appropriate tests.

Exactly. Doornik was more careful about using independent samples and
also used a better random number generator (MWC8222), not exactly rocket
science. Believe it or not, I had read Doornik's paper before I did my
implementation. I also used a better ziggurat, IMHO, than Marsaglia and
Doornik.
MWC8222 is also about twice as fast on AMD hardware as the Mersenne
Twister, but does require more careful initialization. On Pentium V they
are about the same. I haven't benchmarked either on Core2.
Chuck


On Mon, May 19, 2008 at 6:55 PM, James Snyder < [hidden email]> wrote:
> Also note, I'm not asking to match MATLAB performance. It'd be nice,
> but again I'm just trying to put together decent, fairly efficient
> numpy code.
I can cut the time by about a quarter by just using the boolean mask
directly instead of using where().
for n in range(0, time_milliseconds):
    u = expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]
    v = u + sigma * stdnormrvs[n, :]
    theta = expfac_theta * prev_theta - (1 - expfac_theta)

    mask = (v >= theta)
    S[n, np.squeeze(mask)] = 1
    theta[mask] += b

    prev_u = u
    prev_theta = theta
There aren't any good line-by-line profiling tools in Python, but you
can fake it by making a local function for each line:
def f1():
    u = expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]
    return u

def f2():
    v = u + sigma * stdnormrvs[n, :]
    return v

def f3():
    theta = expfac_theta * prev_theta - (1 - expfac_theta)
    return theta

def f4():
    mask = (v >= theta)
    return mask

def f5():
    S[n, np.squeeze(mask)] = 1

def f6():
    theta[mask] += b

# Run standard, unoptimized model
for n in range(0, time_milliseconds):
    u = f1()
    v = f2()
    theta = f3()
    mask = f4()
    f5()
    f6()

    prev_u = u
    prev_theta = theta
I get f6() as being the biggest bottleneck, followed by the general
time spent in the loop (about the same), followed by f5(), f1(), and
f3() (each about half of f6()), followed by f2() (about half of f5()).
f4() is negligible.
Masked operations are inherently slow. They mess up CPU's branch
prediction. Worse, the use of iterators in that part of the code
frustrates compilers' attempts to optimize that away in the case of
contiguous arrays.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco


On Mon, May 19, 2008 at 7:30 PM, Charles R Harris
< [hidden email]> wrote:
> Exactly. Doornik was more careful about using independent samples and also
> used a better random number generator (MWC8222), not exactly rocket
> science. Believe it or not, I had read Doornik's paper before I did my
> implementation.
Good! I believe this is the first time you've mentioned it.
> I also used a better ziggurat, IMHO, than Marsaglia and
> Doornik.
Ah. HOs need testing. dieharder is probably de rigueur these days, and
it will accept input from a file.
http://www.phy.duke.edu/~rgb/General/dieharder.php
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


Robert Kern wrote:
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
f6 can be sped up by more than a factor of 2 by using putmask:
In [10]: xx = np.random.rand(100000)
In [11]: mask = xx > 0.5
In [12]: timeit xx[mask] += 2.34
100 loops, best of 3: 4.06 ms per loop
In [14]: timeit np.putmask(xx, mask, xx+2.34)
1000 loops, best of 3: 1.4 ms per loop
I think that
    xx += 2.34*mask
will be similarly quick, but I can't get ipython's timeit to work with it.
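As a cross-check, all three variants can be compared with the stdlib timeit module rather than IPython's magic. This is just a sketch: the array size matches the session above, the helper names are made up, and the copy() in each helper adds the same constant overhead to every timing, so only the relative numbers mean anything:

```python
import timeit
import numpy as np

base = np.random.rand(100000)  # same size as the IPython session above
mask = base > 0.5

def fancy():
    x = base.copy()
    x[mask] += 2.34          # boolean fancy indexing, read-modify-write
    return x

def put():
    x = base.copy()
    np.putmask(x, mask, x + 2.34)
    return x

def mult():
    x = base.copy()
    x += 2.34 * mask         # bool mask upcasts to 0.0/1.0
    return x

# All three update the masked elements identically...
assert np.allclose(fancy(), put())
assert np.allclose(fancy(), mult())

# ...so only speed differs; time each 100 times.
for f in (fancy, put, mult):
    t = timeit.timeit(f, number=100)
    print(f"{f.__name__}: {1000 * t / 100:.2f} ms per loop")
```

Exact timings will of course vary by machine and numpy version.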
Eric


I've done a little profiling with cProfile as well as with dtrace,
since the bindings exist in Mac OS X and you can use a lot of the D
scripts that apply to Python. Previously I've found that the
np.random call and the where() (in the original code) were heavy
hitters as far as amount of time consumed.
The time has now been shaved down to ~9 seconds with this suggestion,
from the original 13-14 s, with the inclusion of Eric Firing's
suggestions. This is without scipy.weave, which at the moment I can't
get to work for all lines, and when I just replace one of them
successfully it seems to run more slowly, I assume because it is
converting data back and forth.
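For what it's worth, the function-per-line trick can also be driven under cProfile with scaled-down dummy data. Everything below (array sizes, constants, and the f1-f6 names) is made up for illustration, not taken from the real model:

```python
import cProfile
import numpy as np

# Scaled-down dummy data standing in for the real 2000x14000 arrays.
n_steps, n_aff = 500, 1400
aff_input = np.random.rand(n_steps, n_aff)
stdnormrvs = np.random.standard_normal((n_steps, n_aff))
S = np.zeros((n_steps, n_aff))
expfac_m, expfac_theta, sigma, b = 0.9, 0.95, 0.1, 0.2

# One tiny function per line of the inner loop, so the profiler can
# attribute time to each step separately.
def f1(prev_u, n):
    return expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]

def f2(u, n):
    return u + sigma * stdnormrvs[n, :]

def f3(prev_theta):
    return expfac_theta * prev_theta - (1 - expfac_theta)

def f4(v, theta):
    return v >= theta

def f5(mask, n):
    S[n, mask] = 1

def f6(theta, mask):
    theta[mask] += b

def run_model():
    prev_u = np.zeros(n_aff)
    prev_theta = np.ones(n_aff)
    for n in range(n_steps):
        u = f1(prev_u, n)
        v = f2(u, n)
        theta = f3(prev_theta)
        mask = f4(v, theta)
        f5(mask, n)
        f6(theta, mask)
        prev_u, prev_theta = u, theta

prof = cProfile.Profile()
prof.enable()
run_model()
prof.disable()
prof.print_stats(sort="tottime")  # f1-f6 each get their own row
```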
Quick question regarding the pointer abstraction that's going on: the
following seems to work:
np.putmask(S[n, :], np.squeeze(mask), 1)
and it modifies that section of S in place. Is it safe to assume, in
most cases while working with NumPy, that plain indexing (without
additional operations) hands back a reference rather than a copy? It
certainly seems like this sort of thing, including stuff like:
u = self.u
v = self.v
theta = self.theta
...
without having to repack those data into self later (since u, v, theta
are just references to the existing data) saves on code and whatnot,
but I'm a little worried about not being explicit.
Are there any major pitfalls to be aware of? It sounds like if I do:
f = a[n,:] I get a reference, but if I did something like g = a[n,:]*2
I would get a copy.
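That's the right mental model: basic slices are views, while arithmetic and fancy indexing produce new arrays. A quick way to confirm it (np.shares_memory is a much newer addition than this thread, but the view/copy rules it checks are the same):

```python
import numpy as np

a = np.zeros((4, 5))

f = a[2, :]        # basic slicing -> a view; shares memory with a
g = a[2, :] * 2    # arithmetic allocates a new array -> a copy
h = a[[2], :]      # fancy (list) indexing -> a copy as well

f[0] = 99          # writing through the view changes a itself
print(a[2, 0])                 # 99.0
print(np.shares_memory(a, f))  # True
print(np.shares_memory(a, g))  # False
print(np.shares_memory(a, h))  # False
```

So u = self.u style local aliases are safe as long as you only index and assign in place; any expression that computes a new array has to be stored back explicitly.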
Thanks guys. This is definitely useful, especially in combination
with PyPar on my dual-core system; I'm getting pretty good
performance :)
If anyone has any clues on why scipy.weave is blowing up
(http://pastebin.com/m79699c04) using the code I attached, I wouldn't
mind knowing. Most of the times I've attempted using weave, I've been
a little baffled about why things aren't working. I also don't have a
sense for whether all numpy functions should be "weavable" or if it's
just general array operations that can be put through weave.
I know this is the numpy list, so I can take things over to the scipy
list if that's more appropriate.
On Mon, May 19, 2008 at 7:36 PM, Robert Kern < [hidden email]> wrote:
> [Robert's message quoted in full; trimmed, see above]

James Snyder
Biomedical Engineering
Northwestern University
[hidden email]

