Hello,
I've opened a pull request to add a function called np.search(), or something like it, to complement np.searchsorted(): https://github.com/numpy/numpy/pull/9055

There's also this issue I opened before starting the PR: https://github.com/numpy/numpy/issues/9052

Proposed API changes require discussion on the list, so here I am!

This proposed function (and perhaps array method?) does the same as np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly checks whether all the values in `v` are a subset of those in `a`. If not, it currently raises an error, but that could be controlled via a kwarg.

As I mentioned in the PR, I often find myself abusing np.searchsorted() by not explicitly checking these assumptions. The temptation to use it is great, because it's such a fast and convenient function, and most of the time that I use it, the assumptions are indeed valid. Explicitly checking those assumptions each and every time before using np.searchsorted() is tedious and easy to forget. I wouldn't be surprised if many others abuse np.searchsorted() in the same way.

Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array. So, I propose that np.search(), or something like it, could be even more useful than np.searchsorted().

Thoughts?

Martin

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
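For readers following along, here is a rough sketch of the proposed behaviour, built on argsort + searchsorted. This is only an illustration, not the implementation in the PR; the error message and `search` name are assumptions.

```python
import numpy as np

def search(a, v):
    """Find the index in `a` of each value in `v`, without requiring
    `a` to be sorted. Raises if any value in `v` is absent from `a`.
    Sketch only -- not the PR's implementation."""
    a = np.asarray(a)
    v = np.asarray(v)
    sorter = np.argsort(a, kind='stable')
    i = np.searchsorted(a, v, sorter=sorter)
    # Values greater than a.max() get index len(a); clip before indexing.
    i = np.clip(i, 0, len(a) - 1)
    if not (a[sorter[i]] == v).all():
        raise ValueError("some values in `v` are not present in `a`")
    return sorter[i]

print(search([30, 10, 20], [20, 30]))  # -> [2 0]
```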
On Tue, May 9, 2017 at 9:46 AM, Martin Spacek <[hidden email]> wrote:

> Looking at my own habits and uses, it seems to me that finding the indices
> of matching values of one array in another is a more common use case than
> finding insertion indices of one array into another sorted array. So, I
> propose that np.search(), or something like it, could be even more useful
> than np.searchsorted().

The current version of this PR only returns the indices of the /first/ match (rather than all matches), which is an important detail. I would strongly consider including that detail in the name (e.g., by calling this "find_first" rather than "search"), because my naive expectation for a method called "search" is to find all matches.

In any case, I agree that this functionality would be welcome. Getting the details right for a high performance solution is tricky, and there is strong evidence of interest given the 200+ upvotes on this StackOverflow question:
http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array
I don’t think that [...] IMO, we should solve that problem with an [...]

Eric

On Tue, 9 May 2017 at 18:39 Stephan Hoyer <[hidden email]> wrote:
On 2017-05-09 07:39 PM, Stephan Hoyer wrote:
> On Tue, May 9, 2017 at 9:46 AM, Martin Spacek <[hidden email]> wrote:
>
>> Looking at my own habits and uses, it seems to me that finding the indices
>> of matching values of one array in another is a more common use case than
>> finding insertion indices of one array into another sorted array. So, I
>> propose that np.search(), or something like it, could be even more useful
>> than np.searchsorted().
>
> The current version of this PR only returns the indices of the /first/ match
> (rather than all matches), which is an important detail. I would strongly
> consider including that detail in the name (e.g., by calling this "find_first"
> rather than "search"), because my naive expectation for a method called
> "search" is to find all matches.
>
> In any case, I agree that this functionality would be welcome. Getting the
> details right for a high performance solution is tricky, and there is strong
> evidence of interest given the 200+ upvotes on this StackOverflow question:
> http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array

Good point about it only finding the first hit. However, `np.find_first` sounds a bit awkward to me. I've updated the PR to have a `which` kwarg that specifies whether it should return the first or last hit for values in `v` that have multiple hits in `a`:

https://github.com/numpy/numpy/pull/9055

Not sure if I already mentioned this somewhere, but we might also consider naming this `np.index` due to its analogous behaviour to the Python list method `.index()`.

Martin
On May 9, 2017 9:47 AM, "Martin Spacek" <[hidden email]> wrote:

> Hello,
> [...]

It's worth noting though that the "sorted" part is a critical part of what makes it fast. If we're looking for k needles in an n-item haystack, then:

- If the haystack is already sorted and we know it, using searchsorted does it in k*log2(n) comparisons. (This could be reduced to average case O(k log log n) for simple scalars by using interpolation search, but I don't think searchsorted is that clever atm.)

- If the haystack is not sorted, then sorting it and then using searchsorted requires a total of O(n log n) + k*log2(n) comparisons.

- And if the haystack is not sorted, then doing a linear search to find the first item like list.index does requires on average 0.5*k*n comparisons.

This analysis ignores memory effects, which are important -- linear memory access is faster than random access, and searchsorted is all about making memory access maximally unpredictable. But even so, I think sorting-then-searching will be reasonably competitive pretty much from the start, and for moderately large k and n values the difference between (n + k)*log(n) and n*k is huge.

Another issue is that sorting requires an O(n)-sized temporary buffer (assuming you can't mutate the haystack in place). But if your haystack is a large enough fraction of memory that you can't afford this buffer, then it's likely large enough that you can't afford linear searching either...
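For illustration, the two unsorted-haystack approaches above can be run side by side and checked against each other (the sizes and needle positions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
haystack = rng.permutation(1000)       # unsorted, unique values
needles = haystack[[5, 500, 995]]      # values we know are present

# Linear scan: first match, like list.index -- ~0.5*k*n comparisons on average.
linear = np.array([np.flatnonzero(haystack == x)[0] for x in needles])

# Sort-then-searchsorted: O(n log n) to argsort once, then k*log2(n) lookups.
sorter = np.argsort(haystack)
pos = np.searchsorted(haystack, needles, sorter=sorter)
binary = sorter[pos]

assert (linear == binary).all()
print(linear)  # -> [  5 500 995]
```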
My main concern here would be creating a trap for the unwary, where people use search() naively because it's so nice and convenient, and then eventually get surprised by a nasty quadratic slowdown. There's a whole blog about these traps :-) https://accidentallyquadratic.tumblr.com/

OTOH there are also a huge number of numpy use cases where it doesn't matter if some calculation is 1000x slower than it should be, as long as it works and is discoverable...

So it sounds like one obvious thing would be to have a version of searchsorted that checks for matches (maybe side="exact"? Though that's not easy to find...). That's clearly useful, and orthogonal to the linear/binary search issue, so we shouldn't make it a reason people are tempted to choose the inferior algorithm.

...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with options "linear", "searchsorted", and "auto". Linear does the obvious thing, searchsorted generates a sorter array using argsort (unless the user provided one) and then calls searchsorted, and auto picks one of them depending on whether a sorter array was provided and how large the arrays are. The default is auto. In all cases it looks for exact matches.

I guess by default "not found" should be signaled with an exception, and then there should be some option to have it return a sentinel value instead? The problem is that since we're returning integers, there's no sentinel value that's necessarily an invalid index (e.g. indexing will happily accept -1).

-n
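A minimal pure-Python sketch of the strategy= API described above; the "auto" heuristic, error messages, and all details here are illustrative assumptions, not anything from the PR:

```python
import numpy as np

def search(a, v, strategy='auto', sorter=None):
    """Hypothetical np.search with a strategy= kwarg. 'linear' scans
    for first matches; 'searchsorted' argsorts `a` (unless `sorter` is
    given) and binary-searches; 'auto' picks between them. Sketch only."""
    a, v = np.asarray(a), np.asarray(v)
    if strategy == 'auto':
        # Rough heuristic: the sort-then-search path pays off once
        # there are more than ~log2(n) needles, or a sorter is free.
        strategy = ('searchsorted'
                    if sorter is not None or v.size > np.log2(a.size + 1)
                    else 'linear')
    if strategy == 'linear':
        out = []
        for x in v:
            hits = np.flatnonzero(a == x)
            if hits.size == 0:
                raise ValueError(f"{x!r} not found in `a`")
            out.append(hits[0])
        return np.array(out, dtype=np.intp)
    if sorter is None:
        sorter = np.argsort(a, kind='stable')  # stable => first hit wins
    i = np.clip(np.searchsorted(a, v, sorter=sorter), 0, a.size - 1)
    if not (a[sorter[i]] == v).all():
        raise ValueError("some values in `v` not found in `a`")
    return sorter[i]

# Both strategies agree, returning the index of the *first* hit:
print(search([3, 1, 2, 1], [1, 2], strategy='linear'))        # -> [1 2]
print(search([3, 1, 2, 1], [1, 2], strategy='searchsorted'))  # -> [1 2]
```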
I like the idea of a strategy keyword argument. strategy='auto' leaves the door open for future improvements, e.g., if we ever add hash tables to numpy.
For the algorithm, I think we actually want to sort the needles array as well in most (all?) cases.

If haystack is also sorted, advancing through both arrays at once brings the cost of the actual search itself down to O(n+k). (Possibly this is worth exposing as np.searchbothsorted or something similar?)

If we don't sort haystack, we can still use binary search on the needles to bring the search cost down to O(n log k).

On Mon, May 15, 2017 at 5:00 PM Nathaniel Smith <[hidden email]> wrote:
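The merge-style search over two sorted arrays could be sketched like this; the name `search_both_sorted` is hypothetical, and a real implementation would do this loop in C rather than Python:

```python
import numpy as np

def search_both_sorted(haystack, needles):
    """O(n + k) search: both arrays sorted, advance through them
    together like a merge. Sketch of the idea only."""
    out = np.empty(len(needles), dtype=np.intp)
    i = 0
    for j, x in enumerate(needles):
        # haystack is sorted, so we never need to back up.
        while i < len(haystack) and haystack[i] < x:
            i += 1
        if i == len(haystack) or haystack[i] != x:
            raise ValueError(f"{x!r} not found in haystack")
        out[j] = i
    return out

print(search_both_sorted([10, 20, 30, 40], [20, 40]))  # -> [1 3]
```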
> From: Stephan Hoyer <[hidden email]>
> I like the idea of a strategy keyword argument. strategy='auto' leaves the
> door open for future improvements, e.g., if we ever add hash tables to
> numpy.
>
> For the algorithm, I think we actually want to sort the needles array as
> well in most (all?) cases.
>
> If haystack is also sorted, advancing through both arrays at once brings
> the cost of the actual search itself down to O(n+k). (Possibly this is
> worth exposing as np.searchbothsorted or something similar?)

I actually suggest reducing the scope of the problem to

>>> search_unique(haystack, needles)

which finds the index* in a haystack (of unique values) of each needle, and can make the assumption (e.g. as Stephan Hoyer points out) that if len(needles) << len(haystack), then you can just sort the needles and iterate through the haystack in O(H log(N)) steps (otherwise you sort the haystack and iterate through the needles instead).

Whilst you could make this function work in the more general non-unique case (i.e. using stable sorts and finding the first occurrence), I notice this is often an accident, and users would be better served with an exception. I also prefer raising an exception if a needle is unfound (perhaps with an unfound_index=None kwarg).

Finally, I am a little wary of strategy='hashtable' (though I love hash tables!), since choosing the one hash that is great for everyone's data does not seem an easy problem.

Peter

* 'index_' sounds more attractive than 'search_', but we're stuck with the Python connotation that it would be the first occurrence, and the numpy connotation that you are giving rather than receiving indices.
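The sort-the-needles variant described above might look roughly like this; the `search_unique` name comes from the message, but every detail of the body is an illustrative assumption:

```python
import numpy as np

def search_unique(haystack, needles):
    """Find the index in `haystack` (assumed to hold unique values)
    of each needle. Sorts the N needles, then makes one vectorized
    pass over the H-element haystack: O(H log N) comparisons, which
    wins when len(needles) << len(haystack). Sketch only."""
    haystack = np.asarray(haystack)
    needles = np.asarray(needles)
    order = np.argsort(needles)
    ns = needles[order]
    # Binary-search every haystack element into the sorted needles.
    j = np.clip(np.searchsorted(ns, haystack), 0, len(ns) - 1)
    hit = ns[j] == haystack
    out = np.full(len(needles), -1, dtype=np.intp)
    # Assign in reverse so that, if haystack did contain duplicates,
    # the first occurrence would win.
    out[order[j[hit][::-1]]] = np.flatnonzero(hit)[::-1]
    if (out < 0).any():
        raise ValueError("some needles were not found in the haystack")
    return out

print(search_unique([5, 3, 9, 1], [9, 3]))  # -> [2 1]
```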
On Tue, May 16, 2017 at 1:25 AM, Stephan Hoyer <[hidden email]> wrote:

> I like the idea of a strategy keyword argument. strategy='auto' leaves the
> door open for future improvements, e.g., if we ever add hash tables to
> numpy.

I have several old branches implementing most of this:
Some of the speed-ups are considerable, but I guess I found the extra complexity didn't justify sending any of those as a PR. But if we really want to make that search thing happen, then maybe it's time to bring them back from the dead.

Jaime
(\__/)
( O.o)
( > <)

This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.