searchsorted() and memory cache


Andrew Straw
I've got a big array (25 million int64s) that searchsorted()
takes a long time to grind through. After a bit of digging in the
literature and the numpy source code, I believe that searchsorted() is
implementing a classic binary search, which is pretty bad in terms of
cache misses. There are several modern implementations of binary search
which arrange items in memory such that cache misses are much more rare.
Clearly making such an indexing arrangement would take time, but in my
particular case, I can spare the time to create an index if searching
was faster, since I'd make the index once but do the searching many times.

Is there an implementation of such an algorithm that works easily with
numpy? Also, can you offer any advice, suggestions, or comments if I
attempted to implement such an algorithm?

Thanks,
Andrew

Re: searchsorted() and memory cache

Robert Kern
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw <[hidden email]> wrote:
> I've got a big array (25 million int64s) that searchsorted()
> takes a long time to grind through. After a bit of digging in the
> literature and the numpy source code, I believe that searchsorted() is
> implementing a classic binary search,

Yes.

> which is pretty bad in terms of
> cache misses. There are several modern implementations of binary search
> which arrange items in memory such that cache misses are much more rare.
> Clearly making such an indexing arrangement would take time, but in my
> particular case, I can spare the time to create an index if searching
> was faster, since I'd make the index once but do the searching many times.
>
> Is there an implementation of such an algorithm that works easily with
> numpy? Also, can you offer any advice, suggestions, or comments if I
> attempted to implement such an algorithm?

I'm no help. You seem to know more than I do. Sadly, the first few
Google hits I get for "binary search minimize cache misses" are
patents. I don't know what the substantive content of those patents
is; I have a strict policy of not reading patents.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco

Re: searchsorted() and memory cache

Charles R Harris


On Thu, May 8, 2008 at 9:55 PM, Robert Kern <[hidden email]> wrote:
[snip]

I'm no help. You seem to know more than I do. Sadly, the first few
Google hits I get for "binary search minimize cache misses" are
patents. I don't know what the substantive content of those patents
is; I have a strict policy of not reading patents.

I would be interested in adding such a thing if it weren't patent
encumbered. A good start would be a prototype in Python to show how it
all went together and whether it needed a separate indexing/lookup
function or could fit into the current setup.

Chuck




Re: searchsorted() and memory cache

Charles R Harris


On Thu, May 8, 2008 at 10:30 PM, Charles R Harris <[hidden email]> wrote:
[snip]
I would be interested in adding such a thing if it weren't patent
encumbered. A good start would be a prototype in Python to show how it
all went together and whether it needed a separate indexing/lookup
function or could fit into the current setup.

One way I can think of doing this is to have two indices. One is the
usual sorted list; the second consists of, say, every 1024th entry in
the first list. Then search the second list first to find the part of
the first list to search. That won't get you into the very best cache,
but it could buy you a factor of 2x-4x in speed. It's sort of splitting
the binary tree into two levels.
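
A minimal NumPy sketch of that two-level scheme (names here are just
illustrative; for simplicity it assumes no run of duplicate values
straddles a block boundary, otherwise the left-insertion point can fall
in an earlier block):

import numpy as np

def make_coarse_index(sorted_arr, step=1024):
    # Second-level index: every step-th entry of the sorted array.
    # With step=1024 and int64, each fine block is an 8 KB window.
    return sorted_arr[::step]

def two_level_searchsorted(sorted_arr, coarse, key, step=1024):
    # Coarse search picks the block; the fine search stays inside it.
    block = int(np.searchsorted(coarse, key, side='right')) - 1
    if block < 0:
        return 0
    lo = block * step
    hi = min(lo + step, len(sorted_arr))
    return lo + int(np.searchsorted(sorted_arr[lo:hi], key))

The coarse array costs only n/step extra memory, and each lookup does
two short searches over cache-friendly ranges instead of one long one.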

Chuck



Re: searchsorted() and memory cache

Charles R Harris


On Thu, May 8, 2008 at 11:06 PM, Charles R Harris <[hidden email]> wrote:
[snip]
One way I can think of doing this is to have two indices. One is the
usual sorted list; the second consists of, say, every 1024th entry in
the first list. Then search the second list first to find the part of
the first list to search. That won't get you into the very best cache,
but it could buy you a factor of 2x-4x in speed. It's sort of splitting
the binary tree into two levels.

You could even do it with your data from Python, generating the second
list and calling searchsorted multiple times. If you are searching for
a bunch of values, it's probably good to also sort them first so they
bunch together in the same part of the big list.
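
A quick sketch of that query-sorting trick (names illustrative;
np.searchsorted already accepts an array of values, so sorting just
improves locality across successive lookups):

import numpy as np

def searchsorted_sorted_queries(haystack, needles):
    # Sort the queries so consecutive lookups touch nearby parts of
    # the table, then undo the permutation on the results.
    order = np.argsort(needles)
    hits = np.searchsorted(haystack, needles[order])
    out = np.empty_like(hits)
    out[order] = hits  # scatter back to the original query order
    return out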

Chuck



Re: searchsorted() and memory cache

Francesc Alted
On Friday, 09 May 2008, Andrew Straw wrote:

> [snip]

Well, if you can afford extra space for the hashes you can always use a
dictionary for doing the lookups.  In pure Python they are around 3x
faster (for arrays of 8 million elements) than binary searches.  If
your space is tight, you can build an extension (for example in Pyrex)
for doing binary search for your specific type (int64), for a small
improvement.  Finally, if you combine this approach with what Charles
Harris is suggesting (i.e. creating several levels of caches, but not
more than two, which in my experience works best), you can have pretty
optimal lookups with relatively low space overhead.
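
A tiny sketch of the dictionary idea (illustrative; unlike
searchsorted it only answers exact-match lookups for values that are
actually present, and the hash table costs a few times the array's
memory):

import numpy as np

# Build once: value -> position in the sorted array (values unique here).
arr = np.unique(np.random.randint(0, 2**40, size=1000000).astype(np.int64))
lookup = {int(v): i for i, v in enumerate(arr)}

key = int(arr[123456])
assert lookup[key] == np.searchsorted(arr, key)  # O(1) vs O(log n)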

See this thread for a discussion of the binary/hash lookup approaches:

http://mail.python.org/pipermail/python-list/2007-November/465900.html

Hope that helps,

--
Francesc Alted

Re: searchsorted() and memory cache

Bruce Southey
Hi,
I don't know if it helps, but Ulrich Drepper had a 9-part series about memory in Linux Weekly News (http://lwn.net). You can find all 9 linked under his name in the Guest archives (http://lwn.net/Archives/GuestIndex/), as not all are linked together. The first one is:
http://lwn.net/Articles/250967/

What every programmer should know about memory, Part 1 (September 21, 2007)

In part 5 there was a comment on 'Cache-oblivious algorithms' by akapoor:
"i guess it's worth mentioning Harald Prokop's 1999 thesis on "cache oblivious algorithms"
(http://citeseer.ist.psu.edu/prokop99cacheobliviou.html)."
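
One patent-free arrangement from that cache-oblivious line of work is
the BFS ("Eytzinger") layout: store the sorted values in breadth-first
order of the implicit search tree, so the first several probes of every
search land on the same few cache lines. A rough pure-Python/NumPy
sketch (function names are just illustrative):

import numpy as np

def eytzinger_layout(sorted_arr):
    # Place sorted values in BFS order of the implicit binary tree
    # (1-based heap indexing: children of node k are 2k and 2k+1).
    n = len(sorted_arr)
    out = np.empty_like(sorted_arr)
    it = iter(sorted_arr)
    def fill(k):
        if k <= n:
            fill(2 * k)            # left subtree gets smaller values
            out[k - 1] = next(it)  # in-order visit consumes sorted input
            fill(2 * k + 1)
    fill(1)
    return out

def eytzinger_lower_bound(tree, key):
    # Index (into the BFS-ordered array) of the first element >= key,
    # or -1 if every element is smaller.
    n, k, best = len(tree), 1, -1
    while k <= n:
        if tree[k - 1] >= key:
            best = k - 1  # remember candidate, keep looking left
            k = 2 * k
        else:
            k = 2 * k + 1
    return best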

Regards
Bruce


Francesc Alted wrote:
> [snip]



Re: searchsorted() and memory cache

Charles R Harris


On Thu, May 8, 2008 at 9:51 PM, Andrew Straw <[hidden email]> wrote:
[snip]

What sort of algorithm is best also depends on the use. If you have a
25e6 sized table that you want to interpolate through with another set
of 25e6 indices, then binary search is the wrong approach. In that case
you really want to start from the last position and search forward with
increasing steps to bracket the next value. Basically, binary search is
order n*ln(m), where n is the number of keys searched for and m the
size of the table. The sequential way is nearly n + m, which will be
much better if n and m are of comparable size.
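
A sketch of that forward search (illustrative; it assumes the keys are
sorted ascending, and does roughly O(n + m) total work by resuming from
the previous hit and doubling the step to bracket each key):

import numpy as np

def gallop_searchsorted(table, sorted_keys):
    # Left insertion points of ascending sorted_keys in sorted table.
    out = np.empty(len(sorted_keys), dtype=np.intp)
    m = len(table)
    start = 0
    for i, key in enumerate(sorted_keys):
        # Gallop forward with doubling steps until table[hi] >= key.
        lo, hi, step = start, start, 1
        while hi < m and table[hi] < key:
            lo = hi + 1
            hi += step
            step *= 2
        hi = min(hi, m)
        # Finish with an ordinary bisection inside the small bracket.
        start = lo + int(np.searchsorted(table[lo:hi], key))
        out[i] = start
    return out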

Chuck




Re: searchsorted() and memory cache

Stéfan van der Walt
Hi Andrew

2008/5/9 Andrew Straw <[hidden email]>:

> [snip]

I found Francesc Altet's Pyrex implementation at

http://mail.python.org/pipermail/python-list/2007-November/466503.html

I modified it for use with Cython and added some tests:

https://code.launchpad.net/~stefanv/+junk/my_bisect

That may be a good starting point for further experimentation.  As it
is, it is already about 10 times faster than the built-in version
(since I can assume we're working with int64's, so no special type
checking is done).

Regards
Stéfan

Re: searchsorted() and memory cache

Charles R Harris


On Sat, May 10, 2008 at 9:31 AM, Stéfan van der Walt <[hidden email]> wrote:
[snip]
That may be a good starting point for further experimentation.  As it
is, it is already about 10 times faster than the built-in version.
The built-in version is in C, but not type specific. It could be moved
to the generated code base easily enough. The slow part is here:

            if (compare(parr + elsize*imid, pkey, key) < 0)

Chuck




Re: searchsorted() and memory cache

Andrew Straw
Thanks for all the comments on my original question. I was more offline
than intended after I sent it until now, so I'm sorry I wasn't
immediately able to participate in the discussion.

Anyhow, after working on this a bit more, I came up with a few
implementations of search algorithms doing just what I needed with the
same interface available using bazaar and launchpad at
http://launchpad.net/~astraw/+junk/fastsearch (MIT license). I have
attached the output of the plot_comparisons.py benchmarking script to
this email (note that this benchmarking is pretty crude).

For the problem I originally wrote about, I get a nearly unbelievable
speedup of ~250x using the
fastsearch.downsamp.DownSampledPreSearcher class, which is very similar
in spirit to Charles' suggestion. It takes 1000 values from the original
array to create a new first-level array that is itself localized in
memory and points to a more localized region of the full original array.
Also, I get a similar (though slightly slower) result using AVL trees
using the fastsearch.avlsearch.AvlSearcher class, which uses pyavl (
http://sourceforge.net/projects/pyavl ).

Using the benchmarking code included in the bzr branch, I don't get
anything like this speedup (e.g. the attached figure), so I'm not sure
exactly what's going on at this point, but I'm not going to argue with a
250x speedup, so the fastsearch.downsamp code is now being put to use in
one of my projects.

Stefan -- I think your code simply implements the classic binary search
-- I don't see how it will reduce cache misses.

Anyhow, perhaps someone will find the above useful. I guess it would
still be a substantial amount of work to make a numpy-types-aware
implementation of AVL trees or similar algorithms. These sorts of binary
search trees seem like the right way to solve this problem and thus
there might be an interesting project in this. I imagine that a
numpy-types-aware Cython might make such an implementation significantly
easier and still blazingly fast compared to the binary search
implemented in searchsorted() given today's cached memory architectures.

-Andrew


Andrew Straw wrote:

> [snip]


[Attachment: fastsearch.png (19K)]

Re: searchsorted() and memory cache

Charles R Harris


On Tue, May 13, 2008 at 5:59 PM, Andrew Straw <[hidden email]> wrote:
[snip]

That's pretty amazing, but I don't understand the graph. The
DownSampled search looks like the worst. Are the curves mislabeled? Are
the axes correct? I'm assuming smaller is better here.

Chuck




Re: searchsorted() and memory cache

Andrew Straw
Charles R Harris wrote:

>
>
> On Tue, May 13, 2008 at 5:59 PM, Andrew Straw <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Thanks for all the comments on my original question. I was more
>     offline
>     than intended after I sent it until now, so I'm sorry I wasn't
>     immediately able to participate in the discussion.
>
>     Anyhow, after working on this a bit more, I came up with a few
>     implementations of search algorithms doing just what I needed with the
>     same interface available using bazaar and launchpad at
>     http://launchpad.net/~astraw/+junk/fastsearch
>     <http://launchpad.net/%7Eastraw/+junk/fastsearch> (MIT license). I
>     have
>     attached the output of the plot_comparisons.py benchmarking script to
>     this email (note that this benchmarking is pretty crude).
>
>     For the problem I originally wrote about, I get what a nearly
>     unbelievable speedup of ~250x using the
>     fastsearch.downsamp.DownSampledPreSearcher class, which is very
>     similar
>     in spirit to Charles' suggestion. It takes 1000 values from the
>     original
>     array to create a new first-level array that is itself localized in
>     memory and points to a more localized region of the full original
>     array.
>     Also, I get a similar (though slightly slower) result using AVL trees
>     using the fastsearch.avlsearch.AvlSearcher class, which uses pyavl (
>     http://sourceforge.net/projects/pyavl ).
>
>     Using the benchmarking code included in the bzr branch, I don't get
>     anything like this speedup (e.g. the attached figure), so I'm not sure
>     exactly what's going on at this point, but I'm not going to argue
>     with a
>     250x speedup, so the fastsearch.downsamp code is now being put to
>     use in
>     one of my projects.
>
>     Stefan -- I think your code simply implements the classic binary
>     search
>     -- I don't see how it will reduce cache misses.
>
>     Anyhow, perhaps someone will find the above useful. I guess it would
>     still be a substantial amount of work to make a numpy-types-aware
>     implementation of AVL trees or similar algorithms. These sorts of
>     binary
>     search trees seem like the right way to solve this problem and thus
>     there might be an interesting project in this. I imagine that a
>     numpy-types-aware Cython might make such implementation significantly
>     easier and still blazingly fast compared to the binary search
>     implemented in searchsorted() given today's cached memory
>     architectures.
>
>
> That's pretty amazing, but I don't understand the graph. The
> DownSampled search looks like the worst. Are the curves mislabled? Are
> the axis correct? I'm assuming smaller is better here.
The lines are labeled properly -- the graph is inconsistent with the
findings on my real data (not shown), which is what I meant by "Using
the benchmarking code included in the bzr branch, I don't get anything
like this speedup (e.g. the attached figure)". My guess is that the
BinarySearcher's performance degrades terribly under some usage pattern
that isn't being exhibited with this test. I'm really not sure yet what
the important difference is between my real data and these synthetic
data. I will keep the list posted as I find out more. Clearly, on the
synthetic data for the benchmark, the BinarySearcher does pretty well
when N items is large. This is quite contrary to my theory about cache
misses being the root of my problem with the binary search, so I don't
understand it at the moment, but certainly both of the other searchers
perform better on my real data.

I will post any new insights as I continue to work on this...

-Andrew

Re: searchsorted() and memory cache

Nathan Bell
On Tue, May 13, 2008 at 6:59 PM, Andrew Straw <[hidden email]> wrote:
>  easier and still blazingly fast compared to the binary search
>  implemented in searchsorted() given today's cached memory architectures.

Andrew, I looked at your code and I don't quite understand something.
Why are you looking up single values?

Is it not the case that the overhead of several Python calls
completely dominates the actual cost of performing a binary search
(e.g. in C code)?  For instance, the 'v' argument of searchsorted(a,v)
can be an array:

    from scipy import *
    haystack = rand(1e7)
    needles = haystack[:10000].copy()
    haystack.sort()
    timeit searchsorted(haystack, needles)

    100 loops, best of 3: 12.7 ms per loop

Which seems to be much faster than:

    timeit for k in needles: x = searchsorted(haystack,k)

    10 loops, best of 3: 43.8 ms per loop


The other thing to consider is that a reasonably smart CPU cache
manager may retain the first few levels that are explored in a binary
search tree.  Clearly you could speed this up by increasing the
branching factor of the tree, or using a fast index into the large
array.  However, I think that these effects are being masked by Python
overhead in your tests.

I whipped up a weave implementation of searchsorted() that uses the
STL.  It clocks in at 9.72ms per loop, so I think NumPy's
searchsorted() is fairly good.


import scipy
import scipy.weave

def searchsorted2(a, v):
    # One C++ std::lower_bound per query; the loop over the queries runs
    # in compiled code, so there is no per-query Python overhead.
    N_a = len(a)
    N_v = len(v)
    indices = scipy.empty(N_v, dtype='intc')

    code = """
    for(int i = 0; i < N_v; i++){
      indices(i) = std::lower_bound(&a(0), &a(0) + N_a, v(i)) - &a(0);
    }
    """

    err = scipy.weave.inline(code,
                             ['a', 'v', 'N_a', 'N_v', 'indices'],
                             type_converters = scipy.weave.converters.blitz,
                             compiler = 'gcc',
                             support_code = '#include <algorithm>')

    return indices


--
Nathan Bell [hidden email]
http://graphics.cs.uiuc.edu/~wnbell/

Re: searchsorted() and memory cache

Andrew Straw
Nathan Bell wrote:
> On Tue, May 13, 2008 at 6:59 PM, Andrew Straw <[hidden email]> wrote:
>> easier and still blazingly fast compared to the binary search
>> implemented in searchsorted() given today's cached memory architectures.
>
> Andrew, I looked at your code and I don't quite understand something.
> Why are you looking up single values?
Hi Nathan,

The Python overhead was nothing compared to the speed problems I was
having... Now I'm quite sure that some optimization could go a little
further. Nevertheless, for my motivating use case, it wouldn't be
trivial to vectorize this, and a "little further" in this case is too
little to justify the investment of my time at the moment.

-Andrew

Re: searchsorted() and memory cache

Andrew Straw

> I will post any new insights as I continue to work on this...
OK, I have isolated a sample of my data that illustrates the terrible
performance with the binarysearch. I have uploaded it as a pytables file
to http://astraw.com/framenumbers.h5 in case anyone wants to have a look
themselves. Here's an example of the type of benchmark I've been running:

import fastsearch.downsamp
import fastsearch.binarysearch
import tables

h5=tables.openFile('framenumbers.h5',mode='r')
framenumbers=h5.root.framenumbers.read()
keys=h5.root.keys.read()
h5.close()

def bench( implementation ):
    for key in keys:
        implementation.index( key )

downsamp = fastsearch.downsamp.DownSampledPreSearcher( framenumbers )
binary = fastsearch.binarysearch.BinarySearcher( framenumbers )

# The next two lines are IPython-specific, and the 2nd takes a looong time:

%timeit bench(downsamp)
%timeit bench(binary)



Running the above gives:

In [14]: %timeit bench(downsamp)
10 loops, best of 3: 64 ms per loop

In [15]: %timeit bench(binary)

10 loops, best of 3: 184 s per loop

Quite a difference (a factor of about 3000)! At this point, I haven't
delved into the dataset to see what makes it so pathological --
performance is nowhere near this bad for the binary search algorithm
with other sets of keys.


Re: searchsorted() and memory cache

Andrew Straw
Andrew Straw wrote:
> I have uploaded it as a pytables file to http://astraw.com/framenumbers.h5 
Ahh, forgot to mention a potentially important point -- this data file
is 191 MB.

Re: searchsorted() and memory cache

Charles R Harris


On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <[hidden email]> wrote:
[snip]
Quite a difference (a factor of about 3000)! At this point, I haven't
delved into the dataset to see what makes it so pathological --
performance is nowhere near this bad for the binary search algorithm
with other sets of keys.

It can't be that bad, Andrew; something else is going on. And 191 MB
isn't *that* big, I expect it should fit in memory with no problem.

Chuck




Re: searchsorted() and memory cache

Andrew Straw
Charles R Harris wrote:
> On Wed, May 14, 2008 at 8:09 AM, Andrew Straw <[hidden email]> wrote:
>> Quite a difference (a factor of about 3000)! At this point, I haven't
>> delved into the dataset to see what makes it so pathological --
>> performance is nowhere near this bad for the binary search algorithm
>> with other sets of keys.
>
> It can't be that bad, Andrew; something else is going on. And 191 MB
> isn't *that* big, I expect it should fit in memory with no problem.
I agree the performance difference seems beyond what one would expect
due to cache misses alone. I'm at a loss to propose other explanations,
though. Ideas?
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

Re: searchsorted() and memory cache

Charles R Harris


On Wed, May 14, 2008 at 2:00 PM, Andrew Straw <[hidden email]> wrote:
[snip]
I agree the performance difference seems beyond what one would expect
due to cache misses alone. I'm at a loss to propose other explanations,
though. Ideas?

I just searched for 2**25/10 keys in a 2**25 array of reals. It took
less than a second when vectorized. In a Python loop it took about 7.7
seconds. The only thing I can think of is that the search isn't getting
any CPU cycles for some reason. How much memory is it using? Do you
have any NaNs and such in the data?
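
Something like this reproduces that comparison (a sketch; exact
timings will vary by machine):

import numpy as np

table = np.sort(np.random.random(2**25))  # 2**25 sorted uniform reals
keys = np.random.random(2**25 // 10)

# Vectorized: one call, the loop over keys runs in C.
idx = np.searchsorted(table, keys)

# Python loop: per-call overhead dominates, orders of magnitude slower.
idx_loop = np.array([np.searchsorted(table, k) for k in keys])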

Chuck
