random.choice(replace=False) very slow


random.choice(replace=False) very slow

Matthew Brett
Hi,

I noticed that numpy.random.choice was very slow, with the
replace=False option, and then I noticed it can (for most cases) be
made many hundreds of times faster in Python code:

In [18]: sample = np.random.uniform(size=1000000)
In [19]: timeit np.random.choice(sample, 500, replace=False)
        42.1 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10
loops each)
In [22]: def rc(x, size):
    ...:     n = np.prod(size)
    ...:     n_plus = n * 2
    ...:     inds = np.unique(np.random.randint(0, n_plus+1, size=n_plus))[:n]
    ...:     return x[inds].reshape(size)
In [23]: timeit rc(sample, 500)
86.5 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Is there a reason why it's so slow in C?  Could something more
intelligent than the above be used to speed it up?

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: random.choice(replace=False) very slow

einstein.edison
Hi!

The standard algorithm for sampling without replacement (rejection sampling) runs in expected ``O(N)`` time when ``N < 0.5 * M``, where ``M`` is the length of the original set, but is ``O(N^2)`` in the worst case. When that condition doesn't hold, a Durstenfeld-Fisher-Yates shuffle [1] (``O(M)``) can be applied to the original set and the first ``N`` items taken. Although this is fast, it uses a large amount of memory (``O(M)`` extra rather than ``O(N)``), and I'm not sure where the best trade-off is. It also can't be used with an arbitrary probability distribution.
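As a rough sketch of the shuffle approach (the function name is illustrative, and this is not the actual NumPy implementation), note that the Fisher-Yates loop can stop after ``N`` swaps, so only the copy of the pool costs ``O(M)``:

```python
import numpy as np

def fisher_yates_sample(x, n):
    # Sketch: Durstenfeld-Fisher-Yates, stopped after n swaps.
    # Copying the pool costs O(M) time and memory; the partial
    # shuffle itself is O(n).  After the i-th swap, pool[:i+1] is
    # a uniform sample without replacement from x.
    pool = np.array(x, copy=True)
    m = len(pool)
    for i in range(n):
        j = np.random.randint(i, m)  # inclusive low, exclusive high
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]
```

The swap loop is what keeps the time cost at ``O(n)``; the ``O(M)`` copy is the memory trade-off mentioned above.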

One way to handle this would be to draw a maximum of ``M // 2`` samples: when ``N > M // 2``, sample the ``M - N`` "unselected" items instead and return the complement. Although this has a faster expected run-time than the standard algorithm in all cases, it would break backwards-compatibility guarantees.
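A sketch of that complement idea, in terms of indices (the function name and the set-based rejection loop are illustrative stand-ins, not the actual NumPy code):

```python
import numpy as np

def sample_indices(m, n):
    # Sketch: never rejection-sample more than m // 2 indices.
    if n > m // 2:
        # Draw the m - n "unselected" indices and return the rest.
        drop = sample_indices(m, m - n)
        mask = np.ones(m, dtype=bool)
        mask[drop] = False
        return np.flatnonzero(mask)
    # Rejection sampling: expected O(n) draws while n <= m // 2,
    # since each draw hits an already-chosen index < half the time.
    chosen = set()
    while len(chosen) < n:
        chosen.add(int(np.random.randint(0, m)))
    return np.fromiter(chosen, dtype=np.intp, count=n)
```

The point of the sketch is only that the expensive branch is capped at ``m // 2`` draws; the order of the returned indices differs from what the current implementation produces, which is where the compatibility concern comes in.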

On Wednesday, Oct 17, 2018 at 7:48 PM, Matthew Brett <[hidden email]> wrote:
Hi,

I noticed that numpy.random.choice was very slow, with the
replace=False option, and then I noticed it can (for most cases) be
made many hundreds of times faster in Python code:

In [18]: sample = np.random.uniform(size=1000000)
In [19]: timeit np.random.choice(sample, 500, replace=False)
42.1 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10
loops each)
In [22]: def rc(x, size):
    ...:     n = np.prod(size)
    ...:     n_plus = n * 2
    ...:     inds = np.unique(np.random.randint(0, n_plus+1, size=n_plus))[:n]
    ...:     return x[inds].reshape(size)
In [23]: timeit rc(sample, 500)
86.5 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Is there a reason why it's so slow in C? Could something more
intelligent than the above be used to speed it up?

Cheers,

Matthew
