Proposal: np.search() to complement np.searchsorted()

Proposal: np.search() to complement np.searchsorted()

Martin Spacek-3
Hello,

I've opened up a pull request to add a function called np.search(), or something
like it, to complement np.searchsorted():

https://github.com/numpy/numpy/pull/9055

There's also this issue I opened before starting the PR:

https://github.com/numpy/numpy/issues/9052

Proposed API changes require discussion on the list, so here I am!

This proposed function (and perhaps array method?) does the same as
np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
checks if all the values in `v` are a subset of those in `a`. If not, it
currently raises an error, but that could be controlled via a kwarg.
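
In terms of existing NumPy, the proposed semantics are roughly the following (a slow reference sketch of the intent, not the PR's implementation):

```python
import numpy as np

def search(a, v):
    """For each value in v, return the index of its first occurrence in a.

    `a` need not be sorted; a ValueError is raised if any value in v is
    missing from a.  Reference semantics only -- O(k*n) per-needle scan,
    not the PR's implementation.
    """
    a, v = np.asarray(a), np.asarray(v)
    out = np.empty(len(v), dtype=np.intp)
    for i, val in enumerate(v):
        hits = np.flatnonzero(a == val)   # all positions where val occurs
        if hits.size == 0:
            raise ValueError(f"{val!r} not found in a")
        out[i] = hits[0]                  # index of the first occurrence
    return out
```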

As I mentioned in the PR, I often find myself abusing np.searchsorted() by not
explicitly checking these assumptions. The temptation to use it is great,
because it's such a fast and convenient function, and most of the time that I
use it, the assumptions are indeed valid. Explicitly checking those assumptions
each and every time before I use np.searchsorted() is tedious, and easy to
forget to do. I wouldn't be surprised if many others abuse np.searchsorted() in
the same way.
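
To illustrate the failure mode (a toy example, not from the PR): on unsorted input, np.searchsorted silently returns a meaningless insertion index rather than erroring out.

```python
import numpy as np

a = np.array([50, 10, 30])            # NOT sorted
print(np.searchsorted(a, 10))         # 0 -- a silently wrong "insertion index"
print(np.flatnonzero(a == 10)[0])     # 1 -- where 10 actually lives
```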

Looking at my own habits and uses, it seems to me that finding the indices of
matching values of one array in another is a more common use case than finding
insertion indices of one array into another sorted array. So, I propose that
np.search(), or something like it, could be even more useful than np.searchsorted().

Thoughts?

Martin
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: Proposal: np.search() to complement np.searchsorted()

Stephan Hoyer-2
On Tue, May 9, 2017 at 9:46 AM, Martin Spacek <[hidden email]> wrote:
Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array. So, I propose that np.search(), or something like it, could be even more useful than np.searchsorted().

The current version of this PR only returns the indices of the first match (rather than all matches), which is an important detail. I would strongly consider including that detail in the name (e.g., by calling this "find_first" rather than "search"), because my naive expectation for a method called "search" is to find all matches.

In any case, I agree that this functionality would be welcome. Getting the details right for a high performance solution is tricky, and there is strong evidence of interest given the 200+ upvotes on this StackOverflow question:
http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array


Re: Proposal: np.search() to complement np.searchsorted()

Eric Wieser

I don’t think that np.search is really solving the same problem as find_first would.

IMO, we should solve that problem with an argfirst(bool_array, axis=0, keepdims=False) -> intp function, with almost the same semantics as argmax, but special-casing an array of Falses, to return bool_array.shape[axis].
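
A minimal sketch of those semantics, built on argmax (the name, signature, and all-False special case are as Eric describes; the implementation is a guess):

```python
import numpy as np

def argfirst(bool_array, axis=0, keepdims=False):
    """Index of the first True along `axis`; bool_array.shape[axis] if none.

    Sketch of the proposed semantics: like argmax, but an all-False lane
    yields the out-of-range index shape[axis] instead of 0.
    """
    bool_array = np.asarray(bool_array, dtype=bool)
    idx = np.argmax(bool_array, axis=axis)        # first True, or 0 if all False
    none_found = ~bool_array.any(axis=axis)       # lanes with no True at all
    idx = np.where(none_found, bool_array.shape[axis], idx)
    if keepdims:
        idx = np.expand_dims(idx, axis)
    return idx
```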

Eric



Re: Proposal: np.search() to complement np.searchsorted()

Martin Spacek-3
In reply to this post by Stephan Hoyer-2
On 2017-05-09 07:39 PM, Stephan Hoyer wrote:

> On Tue, May 9, 2017 at 9:46 AM, Martin Spacek <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Looking at my own habits and uses, it seems to me that finding the indices
>     of matching values of one array in another is a more common use case than
>     finding insertion indices of one array into another sorted array. So, I
>     propose that np.search(), or something like it, could be even more useful
>     than np.searchsorted().
>
>
> The current version of this PR only returns the indices of the /first/ match
> (rather than all matches), which is an important detail. I would strongly
> consider including that detail in the name (e.g., by calling this "find_first"
> rather than "search"), because my naive expectation for a method called "search"
> is to find all matches.
>
> In any case, I agree that this functionality would be welcome. Getting the
> details right for a high performance solution is tricky, and there is strong
> evidence of interest given the 200+ upvotes on this StackOverflow question:
> http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array
>

Good point about it only finding the first hit. However, `np.find_first` sounds
a bit awkward to me. I've updated the PR to have a `which` kwarg that specifies
whether it should return the first or last hit for values in `v` that have
multiple hits in `a`.

https://github.com/numpy/numpy/pull/9055

Not sure if I already mentioned this somewhere, but we might also consider
naming this `np.index` due to its analogous behaviour to the Python list method
`.index()`.

Martin

Re: Proposal: np.search() to complement np.searchsorted()

Nathaniel Smith
In reply to this post by Martin Spacek-3
It's worth noting though that the "sorted" part is a critical part of what makes it fast. If we're looking for k needles in an n-item haystack, then:

If the haystack is already sorted and we know it, using searchsorted does it in k*log2(n) comparisons. (Could be reduced to average case O(k log log n) for simple scalars by using interpolation search, but I don't think searchsorted is that clever atm.)

If the haystack is not sorted, then sorting it and then using searchsorted requires a total of O(n log n) + k*log2(n) comparisons.

And if the haystack is not sorted, then doing a linear search to find the first item, like list.index does, requires on average 0.5*k*n comparisons.

This analysis ignores memory effects, which are important -- linear memory access is faster than random access, and searchsorted is all about making memory access maximally unpredictable. But even so, I think sorting-then-searching will be reasonably competitive pretty much from the start, and for moderately large k and n values the difference between (n + k)*log(n) and n*k is huge.

Another issue is that sorting requires an O(n)-sized temporary buffer (assuming you can't mutate the haystack in place). But if your haystack is a large enough fraction of memory that you can't afford this buffer, then it's likely large enough that you can't afford linear searching either...
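
The trade-off above is easy to probe empirically; a rough benchmark sketch (array sizes and the crossover point are machine-dependent, and the names here are ad hoc):

```python
import numpy as np
import timeit

n, k = 20_000, 500
rng = np.random.default_rng(0)
haystack = rng.permutation(n)                      # unsorted, unique values
needles = rng.choice(haystack, size=k, replace=False)

def linear():
    # ~0.5*k*n comparisons on average (vectorized as a k x n boolean matrix)
    return (haystack == needles[:, None]).argmax(axis=1)

def sort_then_search():
    # O(n log n) argsort, then k binary searches of ~log2(n) steps each
    sorter = np.argsort(haystack)
    return sorter[np.searchsorted(haystack, needles, sorter=sorter)]

assert np.array_equal(haystack[linear()], needles)
assert np.array_equal(haystack[sort_then_search()], needles)
print("linear:           ", timeit.timeit(linear, number=5))
print("sort+searchsorted:", timeit.timeit(sort_then_search, number=5))
```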


Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array. So, I propose that np.search(), or something like it, could be even more useful than np.searchsorted().

My main concern here would be creating a trap for the unwary, where people use search() naively because it's so nice and convenient, and then eventually get surprised by a nasty quadratic slowdown. There's a whole blog about these traps :-) https://accidentallyquadratic.tumblr.com/

Otoh there are also a huge number of numpy use cases where it doesn't matter if some calculation is 1000x slower than it should be, as long as it works and is discoverable...

So it sounds like one obvious thing would be to have a version of searchsorted that checks for matches (maybe side="exact"? Though that's not easy to find...). That's clearly useful, and orthogonal to the linear/binary search issue, so we shouldn't make it a reason people are tempted to choose the inferior algorithm.

...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with options "linear", "searchsorted", and "auto". Linear does the obvious thing, searchsorted generates a sorter array using argsort (unless the user provided one) and then calls searchsorted, and auto picks one of them depending on whether a sorter array was provided and how large the arrays are. The default is auto. In all cases it looks for exact matches.

I guess by default "not found" should be signaled with an exception, and then there should be some option to have it return a sentinel value instead? The problem is that since we're returning integers then there's no sentinel value that's necessarily an invalid index (e.g. indexing will happily accept -1).
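
Under those assumptions, the strategy= API might look something like this (a sketch only; the 'auto' threshold is made up and raising on missing values is just one possible choice):

```python
import numpy as np

def search(a, v, strategy='auto', sorter=None):
    """Sketch of the strategy= idea: indices of exact matches of v in a.

    'linear' scans a for every needle; 'searchsorted' argsorts a (unless a
    sorter is given) and binary-searches.  'auto' picks between them with a
    crude, made-up heuristic.  Raises ValueError if any value is missing.
    """
    a, v = np.asarray(a), np.asarray(v)
    if strategy == 'auto':
        # made-up threshold: binary search once there are "enough" needles
        strategy = 'searchsorted' if (sorter is not None or v.size > 8) else 'linear'
    if strategy == 'linear':
        hits = a == v[:, None]              # k x n boolean comparison matrix
        if not hits.any(axis=1).all():
            raise ValueError("value not found")
        return hits.argmax(axis=1)          # first match along each row
    if sorter is None:
        sorter = np.argsort(a, kind='stable')   # stable: first hit wins
    idx = np.searchsorted(a, v, sorter=sorter)
    clipped = np.minimum(idx, a.size - 1)       # avoid out-of-range indexing
    if ((idx == a.size) | (a[sorter[clipped]] != v)).any():
        raise ValueError("value not found")
    return sorter[idx]
```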

-n


Re: Proposal: np.search() to complement np.searchsorted()

Stephan Hoyer-2
I like the idea of a strategy keyword argument. strategy='auto' leaves the door open for future improvements, e.g., if we ever add hash tables to numpy.

For the algorithm, I think we actually want to sort the needles array as well in most (all?) cases.

If haystack is also sorted, advancing through both arrays at once brings the cost of the actual search itself down to O(n+k). (Possibly this is worth exposing as np.searchbothsorted or something similar?)

If we don't sort haystack, we can still use binary search on the needles to bring the search cost down to O(n log k).
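
That merge-style walk over two sorted arrays might be sketched as follows (np.searchbothsorted is only a suggested name, and a real version would live in C rather than a Python loop):

```python
import numpy as np

def search_both_sorted(haystack, needles):
    """Indices into sorted `haystack` of each value in sorted `needles`.

    Advances through both arrays once, so the search is O(n + k).
    Raises ValueError if a needle is absent.  Sketch only.
    """
    out = np.empty(len(needles), dtype=np.intp)
    h = 0
    for i, val in enumerate(needles):
        # advance the haystack pointer; never rewind, since needles are sorted
        while h < len(haystack) and haystack[h] < val:
            h += 1
        if h == len(haystack) or haystack[h] != val:
            raise ValueError(f"{val!r} not found")
        out[i] = h
    return out
```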

Re: Proposal: np.search() to complement np.searchsorted()

Peter Creasey
In reply to this post by Martin Spacek-3
> From: Stephan Hoyer <[hidden email]>
>
> I like the idea of a strategy keyword argument. strategy='auto' leaves the
> door open for future improvements, e.g., if we ever add hash tables to
> numpy.
>
> For the algorithm, I think we actually want to sort the needles array as
> well in most (all?) cases.
>
> If haystack is also sorted, advancing through both arrays at once brings
> the cost of the actual search itself down to O(n+k). (Possibly this is
> worth exposing as np.searchbothsorted or something similar?)

I actually suggest reducing the scope of the problem to

>>> search_unique(haystack, needles)

which finds the index* in a haystack (of unique values) of each needle,
and can exploit the fact (as Stephan Hoyer points out) that if
len(needles) << len(haystack), you can just sort the needles and iterate
through the haystack in O(H log N) steps (otherwise you sort the haystack
and iterate through the needles instead).

Whilst you could make this function work in the more general non-unique
case (i.e. using stable sorts and finding the first occurrence), I notice
this is often an accident, and users would be better served with an
exception. I also prefer raising an exception if a needle is unfound
(perhaps with an unfound_index=None kwarg).

Finally I am a little wary of strategy=‘hashtable’ (though I love
hash-tables!) since choosing the one hash that is great for everyone’s
data does not seem an easy problem.

Peter

* ‘index_’ sounds more attractive than ‘search_’, but we’re stuck with
the python connotation that it would find the first occurrence, and the
numpy connotation that you are giving rather than receiving indices.

Re: Proposal: np.search() to complement np.searchsorted()

Jaime Fernández del Río
In reply to this post by Stephan Hoyer-2
I have several old branches implementing most of this. Some of the speed-ups are considerable, but I guess I found the extra complexity didn't justify sending any of those as a PR.

But if we really want to make that search thing happen, then maybe it's time to bring them back from the dead.

Jaime

 
--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
