Speedups for intersect1d in numpy.lib

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Speedups for intersect1d in numpy.lib

Stamm, Felix

Hello mailing list,

I would like to open a PR that improves the speed of intersect1d in particular for large arrays of potentially unequal sizes. As parts of the change would involve a new keyword I think I should write to this mailing list first (correct me if I am wrong).

Lets start with the current implementation of intersect1d. The steps are (roughly): I) concatenate both arrays II) sort that large array and III) keep only duplicates. This is a simple but not necessary fast implementation. In particular it has has complexity O( (len1 + len2) log(len1 + len2)).

I have two suggestions one of which can greatly improve the runtime (1) while the other has some minor improvements (2). Lets start with the major one.

1) Adding assume_sorted as keyword argument.
    If you can already assume that the arrays are sorted you can improve the runtime to min(len1, len2)*log(max(len1, len2)) which is orders of magnitude faster especially when the arrays intersected are unequal in size.
    The implementation basically is I) figure out which is the smaller array II) make that array unique III) use np.searchsorted to find duplicates of the smaller in the larger.

    I think that having assume_sorted as a keyword argument is quite useful, especially as the result of assume_sorted is always sorted. So chaining multiple intersections together can be made much faster.

2) Improve the performance in the case that we cannot assume sorted arrays

     In the case that we cannot assume that both arrays are already sorted we can still gain advantages. Mainly by sorting the arrays separately and then using np.searchsorted again. This brings the complexity down to O( len1 log(len1)  + len2 log(len2))min(len1, len2)*log(max(len1, len2)).

I already mention this in the issue 16964: https://github.com/numpy/numpy/issues/16964

Looking forward for your feedback.

Have a great day and stay safe

Felix Stamm

NumPy-Discussion mailing list
[hidden email]