align `choices` and `sample` with Python `random` module


align `choices` and `sample` with Python `random` module

Alan Isaac
I believe this was proposed in the past to little enthusiasm,
with the response, "you're using a library; learn its functions".

Nevertheless, given the addition of `choices` to the Python
random module in 3.6, it would be nice to have the *same name*
for parallel functionality in numpy.random.

And given the redundancy of numpy.random.sample, it would be
nice to deprecate it with the intent to reintroduce
the name later, better aligned with Python's usage.

Obviously numpy.random.choice exists for both cases,
so this comment is not about functionality.
And I accept that some will think it is not about anything.
Perhaps it might be at least seen as being about this:
using the same function (`choice`) with a boolean argument
(`replace`) to switch between sampling strategies at least
appears to violate the proposal floated at times on this
list that called for two separate functions in apparently
similar cases.  (I am not at all trying to claim that the
argument against flag parameters is definitive; I'm just
mentioning that this viewpoint has already been
promulgated on this list.)
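
For readers comparing the two modules, the naming gap under discussion can be sketched as follows. This is only an illustration using calls that exist in both libraries; the variable names are made up for the example.

```python
import random

import numpy as np

items = [10, 20, 30, 40, 50]

# Standard library: two separate functions, no flag.
with_repl = random.choices(items, k=3)    # with replacement (Python 3.6+)
without_repl = random.sample(items, k=3)  # without replacement

# NumPy: one function, switched by the boolean `replace` argument.
np_with = np.random.choice(items, size=3, replace=True)
np_without = np.random.choice(items, size=3, replace=False)

# numpy.random.sample is unrelated to random.sample:
# it returns uniform floats in [0, 1).
floats = np.random.sample(3)
```

The last call is the "redundancy" mentioned above: the name `sample` means something different in each module.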

Cheers, Alan Isaac
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: align `choices` and `sample` with Python `random` module

ralfgommers



On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <[hidden email]> wrote:
> I believe this was proposed in the past to little enthusiasm,
> with the response, "you're using a library; learn its functions".

Not only that, NumPy and the core libraries around it are the standard for numerical/statistical computing. If core Python devs want to replicate a small subset of that functionality in a new Python version like 3.6, it would be sensible for them to choose compatible names. I don't think there's any justification for us to bother our users based on new things that get added to the stdlib.


> Nevertheless, given the addition of `choices` to the Python
> random module in 3.6, it would be nice to have the *same name*
> for parallel functionality in numpy.random.
>
> And given the redundancy of numpy.random.sample, it would be
> nice to deprecate it with the intent to reintroduce
> the name later, better aligned with Python's usage.

No, there is nothing wrong with the current API, so I'm -10 on deprecating it.

Ralf



Re: align `choices` and `sample` with Python `random` module

Alan Isaac
On 12/10/2018 11:20 AM, Ralf Gommers wrote:
> there is nothing wrong with the current API

Just to be clear: you completely reject the past
cautions on this list against creating APIs
with flag parameters.  Is that correct?

Or is "nothing wrong" just a narrow approval in
this particular case?

Alan Isaac

Re: align `choices` and `sample` with Python `random` module

Tyler Reddy
I think the current random infrastructure is mostly considered frozen anyway, even for bugfixes, given the pending NEP for a new random infrastructure and its commitment to guarantee that old random streams keep behaving the same way, since they are used extensively in testing and so on.
Maybe there are opportunities to make fruitful suggestions for the new system going forward.
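
To illustrate why frozen streams matter for testing, here is a minimal sketch. The seed value is arbitrary; the example only shows that two identically seeded legacy generators agree, not how any stream guarantee is implemented.

```python
import numpy as np

# Two legacy generators seeded identically produce identical streams,
# which is what test suites with hard-coded expected values rely on.
first = np.random.RandomState(12345).choice(10, size=5, replace=False)
second = np.random.RandomState(12345).choice(10, size=5, replace=False)

print(np.array_equal(first, second))  # True
```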


Re: align `choices` and `sample` with Python `random` module

ralfgommers
In reply to this post by Alan Isaac


On Mon, Dec 10, 2018 at 8:27 AM Alan Isaac <[hidden email]> wrote:
> On 12/10/2018 11:20 AM, Ralf Gommers wrote:
>> there is nothing wrong with the current API
>
> Just to be clear: you completely reject the past
> cautions on this list against creating APIs
> with flag parameters.  Is that correct?

There's no such caution in general. There are particular cases of keyword arguments that behave in certain ways that are best avoided in the future, for example `full_output=False` to return extra arguments. In this case, the `replace` keyword just switches between two methods, which seems perfectly normal to me.
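
The distinction drawn here can be sketched in a few lines. `find_root` is a hypothetical function invented for this illustration (it is not a NumPy or SciPy API); only the `np.random.choice` calls are real.

```python
import numpy as np


def find_root(value, full_output=False):
    """Hypothetical example: the flag changes what kind of object is returned."""
    root = float(np.sqrt(value))
    if full_output:
        # Callers now receive a tuple instead of a float.
        return root, {"iterations": 1}
    return root


plain = find_root(9.0)                       # a float
detailed = find_root(9.0, full_output=True)  # a (float, dict) tuple

# By contrast, `replace` only switches the sampling method;
# the return type and shape stay the same.
with_repl = np.random.choice(5, size=3, replace=True)
without_repl = np.random.choice(5, size=3, replace=False)
```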

Either way, even things like `full_output` are not a good reason to deprecate something. We deprecate things because they're buggy, have severe usability issues, or some similar reason that translates to user pain.

Cheers,
Ralf

 


Re: align `choices` and `sample` with Python `random` module

Stephan Hoyer-2
In reply to this post by Alan Isaac
On Mon, Dec 10, 2018 at 8:26 AM Alan Isaac <[hidden email]> wrote:
> On 12/10/2018 11:20 AM, Ralf Gommers wrote:
>> there is nothing wrong with the current API
>
> Just to be clear: you completely reject the past
> cautions on this list against creating APIs
> with flag parameters.  Is that correct?
>
> Or is "nothing wrong" just a narrow approval in
> this particular case?

I agree with you that numpy.random.sample is redundant, that APIs based on flags are generally poorly designed, and that, all things being equal, it would be desirable for NumPy and Python's standard library to be aligned.

That said, "replacing a function/parameter with something totally different by the same name" is a really painful/slow deprecation process that is best avoided if at all possible in mature projects.

Personally, I would be +1 for issuing a deprecation warning for np.random.sample, and removing it after a good amount of notice (maybe several years). This is a similar deprecation cycle to what you see in Python itself (e.g., for ABCs in collections vs collections.abc). If you look at NumPy's docs for "Simple random data" [1] we have four different names for this same function ("random_sample", "random", "ranf" and "sample"), which is frankly absurd. Some cleanup is long overdue.
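
The redundancy is easy to check. The sketch below reseeds the global legacy generator before calling each of the four names and compares the results; the `getattr` guard is an assumption-driven precaution in case a given NumPy version does not expose one of the aliases.

```python
import numpy as np

names = ["random_sample", "random", "ranf", "sample"]
draws = {}
for name in names:
    func = getattr(np.random, name, None)  # skip a name this NumPy lacks
    if func is None:
        continue
    np.random.seed(0)        # same starting state for every alias
    draws[name] = func(3)    # three uniform floats in [0, 1)

reference = draws["random_sample"]
all_same = all(np.array_equal(reference, d) for d in draws.values())
print(sorted(draws), all_same)
```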

But we should be extremely hesitant to actually reuse these names for something else. People depend on NumPy for stability, and there is plenty of code written against NumPy from five years ago that still runs just fine today. It's one thing to break code noisily by removing a function, but if there's any chance of introducing silent errors that would be inexcusable.
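
As a rough sketch of the "noisy" route, a deprecated alias would warn on every call while still returning the old result. `deprecated_sample` is a hypothetical wrapper written for this illustration, not NumPy's actual code.

```python
import warnings

import numpy as np


def deprecated_sample(size=None):
    """Hypothetical deprecated alias that forwards to random_sample."""
    warnings.warn(
        "sample is deprecated; use random_sample instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return np.random.random_sample(size)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = deprecated_sample(3)

print(len(caught), caught[0].category.__name__)  # 1 DeprecationWarning
```

Removing the function later raises an `AttributeError`, which is loud. Reusing the name for different behavior is the case that could fail silently.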

Best,
Stephan



Re: align `choices` and `sample` with Python `random` module

Warren Weckesser-2
In reply to this post by ralfgommers


On 12/10/18, Ralf Gommers <[hidden email]> wrote:

> No, there is nothing wrong with the current API, so I'm -10 on deprecating
> it.

Actually, the `numpy.random.choice` API has one major weakness.  When `replace` is False and `size` is greater than 1, the function is actually drawing *one* sample from a multivariate distribution.  For the other multivariate distributions (multinomial, multivariate_normal and dirichlet), `size` sets the number of samples to draw from the distribution.  With `replace=False` in `choice`, `size` becomes a *parameter* of the distribution, and it is only possible to draw one (multivariate) sample.
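
A short sketch of the contrast, using the legacy `RandomState` interface with an arbitrary seed:

```python
import numpy as np

rng = np.random.RandomState(0)

# multinomial: `size` is the number of samples; each sample has length 3.
multi = rng.multinomial(10, [0.2, 0.3, 0.5], size=4)

# choice without replacement: `size` is the length of the one sample drawn.
picked = rng.choice(5, size=3, replace=False)

print(multi.shape, picked.shape)  # (4, 3) (3,)
```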

I thought about this some time ago, and came up with an API that eliminates the boolean flag, and separates the `size` argument from the number of items drawn in one sample, which I'll call `nsample`. To avoid creating a "false friend" with the standard library and with numpy's `choice`, I'll call this function `select`.

Here's the proposed signature and docstring.  (A prototype implementation is in a gist at https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)  The key feature is the `nsample` argument, which sets how many items to select from the given collection.  Also note that this function is *always* drawing *without replacement*.  It covers the `replace=True` case because drawing one item without replacement is the same as drawing one item with replacement.

Whether or not an API like the following is used, it would be nice if there was some way to get multiple samples in the `replace=False` case in one function call.

def select(items, nsample=None, p=None, size=None):
    """
    Select random samples from `items`.

    The function randomly selects `nsample` items from `items` without
    replacement.

    Parameters
    ----------
    items : sequence
        The collection of items from which the selection is made.
    nsample : int, optional
        Number of items to select without replacement in each draw.
        It must be between 0 and len(items), inclusive.
    p : array-like of floats, same length as `items`, optional
        Probabilities of the items.  If this argument is not given,
        the elements in `items` are assumed to have equal probability.
    size : int or tuple of ints, optional
        Number of variates to draw.

    Notes
    -----
    `size=None` means "generate a single selection".

    If `size` is None, the result is equivalent to
        numpy.random.choice(items, size=nsample, replace=False)

    `nsample=None` means draw one (scalar) sample.
    If `nsample` is None, the function acts (almost) like nsample=1 (see
    below for more information), and the result is equivalent to
        numpy.random.choice(items, size=size)
    In effect, it does choice with replacement.  The case `nsample=None`
    can be interpreted as each sample is a scalar, and `nsample=k`
    means each sample is a sequence with length k.

    If `nsample` is not None, it must be a nonnegative integer with
    0 <= nsample <= len(items).

    If `size` is not None, it must be an integer or a tuple of integers.
    When `size` is an integer, it is treated as the tuple ``(size,)``.

    When both `nsample` and `size` are not None, the result
    has shape ``size + (nsample,)``.

    Examples
    --------
    Make 6 choices with replacement from [10, 20, 30, 40].  (This is
    equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
    do it six times.")

    >>> select([10, 20, 30, 40], size=6)
    array([20, 20, 40, 10, 40, 30])

    Choose two items from [10, 20, 30, 40] without replacement.  Do it six
    times.

    >>> select([10, 20, 30, 40], nsample=2, size=6)
    array([[40, 10],
           [20, 30],
           [10, 40],
           [30, 10],
           [10, 30],
           [10, 20]])

    When `nsample` is an integer, there is always an axis at the end of the
    result with length `nsample`, even when `nsample=1`.  For example, the
    shape of the array returned in the following call is (2, 3, 1)

    >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
    array([[[10],
            [30],
            [20]],

           [[10],
            [40],
            [20]]])

    When `nsample` is None, it acts like `nsample=1`, but the trivial
    dimension is not included.  The shape of the array returned in the
    following call is (2, 3).

    >>> select([10, 20, 30, 40], size=(2, 3))
    array([[20, 40, 30],
           [30, 20, 40]])

    """



Warren


Re: align `choices` and `sample` with Python `random` module

ralfgommers


On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <[hidden email]> wrote:



> Actually, the `numpy.random.choice` API has one major weakness.  When `replace` is False and `size` is greater than 1, the function is actually drawing *one* sample from a multivariate distribution.  For the other multivariate distributions (multinomial, multivariate_normal and dirichlet), `size` sets the number of samples to draw from the distribution.  With `replace=False` in `choice`, `size` becomes a *parameter* of the distribution, and it is only possible to draw one (multivariate) sample.

I'm not sure I follow. `choice` draws samples from a given 1-D array, more than 1:

In [12]: np.random.choice(np.arange(5), size=2, replace=True)
Out[12]: array([2, 2])

In [13]: np.random.choice(np.arange(5), size=2, replace=False)
Out[13]: array([3, 0])

The multivariate distribution you're talking about is for generating the indices, I assume. Does the current implementation actually give a result for size>1 that has different statistical properties from calling the function N times with size=1? If so, that's definitely worth a bug report at least (I don't think there is one for this).
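
The two cases being compared can be written out as a quick check. This sketch only illustrates the question; it assumes nothing beyond `RandomState.choice`, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
items = np.arange(5)

# One call with size=5 and replace=False: always a permutation of `items`.
joint = rng.choice(items, size=5, replace=False)

# Five independent calls with size=1: values may repeat across calls.
separate = np.array(
    [rng.choice(items, size=1, replace=False)[0] for _ in range(5)]
)

print(sorted(joint.tolist()))        # [0, 1, 2, 3, 4]
print(len(set(separate.tolist())))   # can be fewer than 5
```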

Cheers,
Ralf




Re: align `choices` and `sample` with Python `random` module

Warren Weckesser-2


On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <[hidden email]> wrote:



> I'm not sure I follow. `choice` draws samples from a given 1-D array, more than 1:
>
> In [12]: np.random.choice(np.arange(5), size=2, replace=True)
> Out[12]: array([2, 2])
>
> In [13]: np.random.choice(np.arange(5), size=2, replace=False)
> Out[13]: array([3, 0])
>
> The multivariate distribution you're talking about is for generating the indices, I assume. Does the current implementation actually give a result for size>1 that has different statistical properties from calling the function N times with size=1? If so, that's definitely worth a bug report at least (I don't think there is one for this).


There is no bug, just a limitation in the API.

When I draw without replacement, say, three values from a collection of length five, the three values that I get are not independent.  So really, this is *one* sample from a three-dimensional (discrete-valued) distribution.  The problem with the current API is that I can't get multiple samples from this three-dimensional distribution in one call.  If I need to repeat the process six times, I have to use a loop, e.g.:

    >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3) for _ in range(6)]

With the `select` function I described in my previous email, which I'll call `random_select` here, the parameter that determines the number of items per sample, `nsample`, is separate from the parameter that determines the number of samples, `size`:

    >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
    >>> samples
    array([[30, 40, 50],
           [40, 50, 30],
           [10, 20, 40],
           [20, 30, 50],
           [40, 20, 50],
           [20, 10, 30]])
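
For anyone who wants to experiment with these semantics, here is a loop-based sketch. It is not the prototype from the gist; `random_select_sketch` and its `rng` argument are names assumed for this illustration, and it simply wraps the existing `choice` method.

```python
import math

import numpy as np


def random_select_sketch(items, nsample=None, p=None, size=None, rng=None):
    """Loop-based sketch of the nsample/size semantics described above."""
    rng = np.random.RandomState() if rng is None else rng
    items = np.asarray(items)
    if nsample is None:
        # Scalar samples: same as choice with replacement.
        return rng.choice(items, size=size, p=p)
    if size is None:
        # A single selection of `nsample` items without replacement.
        return rng.choice(items, size=nsample, replace=False, p=p)
    shape = (size,) if np.ndim(size) == 0 else tuple(size)
    count = math.prod(shape)
    out = np.empty((count, nsample), dtype=items.dtype)
    for row in range(count):
        # Each row is one independent draw without replacement.
        out[row] = rng.choice(items, size=nsample, replace=False, p=p)
    return out.reshape(shape + (nsample,))


samples = random_select_sketch([10, 20, 30, 40, 50], nsample=3, size=6)
print(samples.shape)  # (6, 3)
```

The Python loop keeps the sketch short and lets `p` pass straight through to `choice`; it makes no attempt at the efficiency a real implementation would want.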


(`select` is a really bad name, since `numpy.select` already exists and is something completely different.  I had the longer name `random.select` in mind when I started using it. "There are only two hard problems..." etc.)

Warren

 

Re: align `choices` and `sample` with Python `random` module

Warren Weckesser-2


On Tue, Dec 11, 2018 at 1:37 PM Warren Weckesser <[hidden email]> wrote:


On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <[hidden email]> wrote:


On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <[hidden email]> wrote:


On 12/10/18, Ralf Gommers <[hidden email]> wrote:

> On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <[hidden email]> wrote:
>
>> I believe this was proposed in the past to little enthusiasm,
>> with the response, "you're using a library; learn its functions".
>>
>
> Not only that, NumPy and the core libraries around it are the standard for
> numerical/statistical computing. If core Python devs want to replicate a
> small subset of that functionality in a new Python version like 3.6, it
> would be sensible for them to choose compatible names. I don't think
> there's any justification for us to bother our users based on new things
> that get added to the stdlib.
>
>
>> Nevertheless, given the addition of `choices` to the Python
>> random module in 3.6, it would be nice to have the *same name*
>> for parallel functionality in numpy.random.
>>
>> And given the redundancy of numpy.random.sample, it would be
>> nice to deprecate it with the intent to reintroduce
>> the name later, better aligned with Python's usage.
>>
>
> No, there is nothing wrong with the current API, so I'm -10 on deprecating
> it.

Actually, the `numpy.random.choice` API has one major weakness.  When `replace` is False and `size` is greater than 1, the function is actually drawing a *one* sample from a multivariate distribution.  For the other multivariate distributions (multinomial, multivariate_normal and dirichlet), `size` sets the number of samples to draw from the distribution.  With `replace=False` in `choice`, size becomes a *parameter* of the distribution, and it is only possible to draw one (multivariate) sample.

I'm not sure I follow. `choice` draws samples from a given 1-D array, more than 1:

In [12]: np.random.choice(np.arange(5), size=2, replace=True)
Out[12]: array([2, 2])

In [13]: np.random.choice(np.arange(5), size=2, replace=False)
Out[13]: array([3, 0])

The multivariate distribution you're talking about is for generating the indices, I assume. Does the current implementation actually give a result for size>1 that has different statistical properties from calling the function N times with size=1? If so, that's definitely worth a bug report at least (I don't think there is one for this).


There is no bug, just a limitation in the API.

When I draw without replacement, say, three values from a collection of length five, the three values that I get are not independent.  So really, this is *one* sample from a three-dimensional (discrete-valued) distribution.  The problem with the current API is that I can't get multiple samples from this three-dimensional distribution in one call.  If I need to repeat the process six times, I have to use a loop, e.g.:

    >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3) for _ in range(6)]

With the `select` function I described in my previous email, which I'll call `random_select` here, the parameter that determines the number of items per sample, `nsample`, is separate from the parameter that determines the number of samples, `size`:

    >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
    >>> samples
    array([[30, 40, 50],
           [40, 50, 30],
           [10, 20, 40],
           [20, 30, 50],
           [40, 20, 50],
           [20, 10, 30]])


(`select` is a really bad name, since `numpy.select` already exists and is something completely different.  I had the longer name `random.select` in mind when I started using it. "There are only two hard problems..." etc.)
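For the equal-probability case, the behaviour of `random_select` shown above can be approximated without a Python-level loop. The following is only a sketch under my own assumptions: the name `random_select_uniform` is hypothetical, it ignores `p`, and the random-keys trick (argsort of uniform keys) is a swapped-in technique, not necessarily what the gist does.

```python
import numpy as np

def random_select_uniform(items, nsample, size, rng=None):
    """Draw `size` samples of `nsample` distinct items each (uniform only)."""
    rng = np.random if rng is None else rng
    items = np.asarray(items)
    n = len(items)
    if not 0 <= nsample <= n:
        raise ValueError("nsample must be between 0 and len(items)")
    # One row of i.i.d. uniform keys per sample. Sorting a row yields a
    # uniformly random permutation of the indices 0..n-1.
    keys = rng.random_sample((size, n))
    # The first `nsample` entries of each permutation form an ordered
    # sample without replacement.
    idx = np.argsort(keys, axis=1)[:, :nsample]
    return items[idx]

samples = random_select_uniform([10, 20, 30, 40, 50], nsample=3, size=6)
print(samples.shape)  # (6, 3)
```

This generates a full permutation per row, so it is wasteful when `nsample` is much smaller than `len(items)`; it is meant to show the shape semantics, not to be efficient.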



As I reread this, I see another naming problem:  "sample" is used to mean different things.  In my description above,  I referred to one "sample" as the length-3 sequence generated by one call to `numpy.random.choice([10, 20, 30, 40, 50], replace=False, size=3)`, but in `random_select`, `nsample` refers to the length of each sequence generated.   I use the name 'nsample' to be consistent with `numpy.random.hypergeometric`.  I hope the output of the `random_select` call shown above makes clear the desired behavior.

Warren


Warren

 
Cheers,
Ralf



I thought about this some time ago, and came up with an API that eliminates the boolean flag, and separates the `size` argument from the number of items drawn in one sample, which I'll call `nsample`. To avoid creating a "false friend" with the standard library and with numpy's `choice`, I'll call this function `select`.

Here's the proposed signature and docstring.  (A prototype implementation is in a gist at https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)  The key feature is the `nsample` argument, which sets how many items to select from the given collection.  Also note that this function is *always* drawing *without replacement*.  It covers the `replace=True` case because drawing one item without replacement is the same as drawing one item with replacement.

Whether or not an API like the following is used, it would be nice if there was some way to get multiple samples in the `replace=False` case in one function call.

def select(items, nsample=None, p=None, size=None):
    """
    Select random samples from `items`.

    The function randomly selects `nsample` items from `items` without
    replacement.

    Parameters
    ----------
    items : sequence
        The collection of items from which the selection is made.
    nsample : int, optional
        Number of items to select without replacement in each draw.
        It must be between 0 and len(items), inclusive.
    p : array-like of floats, same length as `items`, optional
        Probabilities of the items.  If this argument is not given,
        the elements in `items` are assumed to have equal probability.
    size : int, optional
        Number of variates to draw.

    Notes
    -----
    `size=None` means "generate a single selection".

    If `size` is None, the result is equivalent to
        numpy.random.choice(items, size=nsample, replace=False)

    `nsample=None` means draw one (scalar) sample.
    If `nsample` is None, the function acts (almost) like nsample=1 (see
    below for more information), and the result is equivalent to
        numpy.random.choice(items, size=size)
    In effect, it does choice with replacement.  The case `nsample=None`
    can be interpreted as each sample is a scalar, and `nsample=k`
    means each sample is a sequence with length k.

    If `nsample` is not None, it must be a nonnegative integer with
    0 <= nsample <= len(items).

    If `size` is not None, it must be an integer or a tuple of integers.
    When `size` is an integer, it is treated as the tuple ``(size,)``.

    When both `nsample` and `size` are not None, the result
    has shape ``size + (nsample,)``.

    Examples
    --------
    Make 6 choices with replacement from [10, 20, 30, 40].  (This is
    equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
    do it six times.")

    >>> select([10, 20, 30, 40], size=6)
    array([20, 20, 40, 10, 40, 30])

    Choose two items from [10, 20, 30, 40] without replacement.  Do it six
    times.

    >>> select([10, 20, 30, 40], nsample=2, size=6)
    array([[40, 10],
           [20, 30],
           [10, 40],
           [30, 10],
           [10, 30],
           [10, 20]])

    When `nsample` is an integer, there is always an axis at the end of the
    result with length `nsample`, even when `nsample=1`.  For example, the
    shape of the array returned in the following call is (2, 3, 1)

    >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
    array([[[10],
            [30],
            [20]],

           [[10],
            [40],
            [20]]])

    When `nsample` is None, it acts like `nsample=1`, but the trivial
    dimension is not included.  The shape of the array returned in the
    following call is (2, 3).

    >>> select([10, 20, 30, 40], size=(2, 3))
    array([[20, 40, 30],
           [30, 20, 40]])

    """
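The shape rules in the docstring are specific enough to check with a loop-based reference version built on `numpy.random.choice`. This is a sketch only: `select_sketch` is a hypothetical name, it is not the prototype from the gist, and it simply repeats `choice(..., replace=False)` once per requested sample.

```python
import numpy as np

def select_sketch(items, nsample=None, p=None, size=None):
    """Loop-based stand-in that follows the docstring's shape rules."""
    items = np.asarray(items)
    if nsample is None:
        # Scalar samples: independent draws, i.e. choice with replacement.
        return np.random.choice(items, size=size, p=p)
    if size is None:
        # A single selection of `nsample` distinct items.
        return np.random.choice(items, size=nsample, replace=False, p=p)
    # An integer `size` is treated as the tuple (size,).
    shape = (size,) if np.isscalar(size) else tuple(size)
    out = np.empty(shape + (nsample,), dtype=items.dtype)
    for index in np.ndindex(*shape):
        # Each position along the leading axes holds one independent
        # sample of `nsample` items drawn without replacement.
        out[index] = np.random.choice(items, size=nsample, replace=False, p=p)
    return out

print(select_sketch([10, 20, 30, 40], nsample=1, size=(2, 3)).shape)  # (2, 3, 1)
print(select_sketch([10, 20, 30, 40], size=(2, 3)).shape)             # (2, 3)
```

The two printed shapes correspond to the last two docstring examples: `nsample=1` keeps the trailing axis, and `nsample=None` drops it.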



Warren


Re: align `choices` and `sample` with Python `random` module

Stephan Hoyer-2
In reply to this post by Warren Weckesser-2
On Tue, Dec 11, 2018 at 10:39 AM Warren Weckesser <[hidden email]> wrote:
There is no bug, just a limitation in the API.

When I draw without replacement, say, three values from a collection of length five, the three values that I get are not independent.  So really, this is *one* sample from a three-dimensional (discrete-valued) distribution.  The problem with the current API is that I can't get multiple samples from this three-dimensional distribution in one call.  If I need to repeat the process six times, I have to use a loop, e.g.:

    >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3) for _ in range(6)]

With the `select` function I described in my previous email, which I'll call `random_select` here, the parameter that determines the number of items per sample, `nsample`, is separate from the parameter that determines the number of samples, `size`:

    >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
    >>> samples
    array([[30, 40, 50],
           [40, 50, 30],
           [10, 20, 40],
           [20, 30, 50],
           [40, 20, 50],
           [20, 10, 30]])


(`select` is a really bad name, since `numpy.select` already exists and is something completely different.  I had the longer name `random.select` in mind when I started using it. "There are only two hard problems..." etc.)

Warren

This is an issue for the probability distributions from scipy.stats, too. 

The only library that I know handles this well is TensorFlow Probability, which has a notion of "batch" vs "events" dimensions in distributions. It's actually pretty comprehensive, and makes it easy to express these sorts of operations:

>>> import tensorflow_probability as tfp
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> dist = tfp.distributions.Categorical(tf.zeros((3, 5)))
>>> dist
<tfp.distributions.Categorical 'Categorical/' batch_shape=(3,) event_shape=() dtype=int32>
>>> dist.sample(6)
<tf.Tensor: id=299, shape=(6, 3), dtype=int32, numpy= array([[1, 2, 1], [2, 1, 3], [4, 4, 2], [0, 1, 1], [0, 2, 2], [2, 0, 4]], dtype=int32)>



Re: align `choices` and `sample` with Python `random` module

Warren Weckesser-2


On Tue, Dec 11, 2018 at 2:27 PM Stephan Hoyer <[hidden email]> wrote:
On Tue, Dec 11, 2018 at 10:39 AM Warren Weckesser <[hidden email]> wrote:
There is no bug, just a limitation in the API.

When I draw without replacement, say, three values from a collection of length five, the three values that I get are not independent.  So really, this is *one* sample from a three-dimensional (discrete-valued) distribution.  The problem with the current API is that I can't get multiple samples from this three-dimensional distribution in one call.  If I need to repeat the process six times, I have to use a loop, e.g.:

    >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3) for _ in range(6)]

With the `select` function I described in my previous email, which I'll call `random_select` here, the parameter that determines the number of items per sample, `nsample`, is separate from the parameter that determines the number of samples, `size`:

    >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
    >>> samples
    array([[30, 40, 50],
           [40, 50, 30],
           [10, 20, 40],
           [20, 30, 50],
           [40, 20, 50],
           [20, 10, 30]])


(`select` is a really bad name, since `numpy.select` already exists and is something completely different.  I had the longer name `random.select` in mind when I started using it. "There are only two hard problems..." etc.)

Warren

This is an issue for the probability distributions from scipy.stats, too. 

The only library that I know handles this well is TensorFlow Probability, which has a notion of "batch" vs "events" dimensions in distributions. It's actually pretty comprehensive, and makes it easy to express these sorts of operations:

>>> import tensorflow_probability as tfp
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> dist = tfp.distributions.Categorical(tf.zeros((3, 5)))
>>> dist
<tfp.distributions.Categorical 'Categorical/' batch_shape=(3,) event_shape=() dtype=int32>
>>> dist.sample(6)
<tf.Tensor: id=299, shape=(6, 3), dtype=int32, numpy= array([[1, 2, 1], [2, 1, 3], [4, 4, 2], [0, 1, 1], [0, 2, 2], [2, 0, 4]], dtype=int32)>



Yes, tensorflow-probability includes broadcasting of the parameters and generating multiple variates in one call, but note that your example is not sampling without replacement.  For sampling 3 items without replacement from a population, the *event_shape* (to use tensorflow-probability terminology) would have to be (3,).
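The difference can be shown in plain NumPy, without TensorFlow. This is a sketch by analogy only: the batch/event labels in the comments are my mapping of the tensorflow-probability terms onto NumPy arrays, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for illustration only
n_categories, n_draws = 5, 6

# Analogue of batch_shape=(3,), event_shape=(): three independent uniform
# categorical distributions, so a row may contain repeated values.
independent = rng.randint(n_categories, size=(n_draws, 3))

# Analogue of batch_shape=(), event_shape=(3,): one event is a length-3
# vector of distinct values, i.e. a draw without replacement.
without_replacement = np.stack(
    [rng.choice(n_categories, size=3, replace=False) for _ in range(n_draws)]
)

print(independent.shape, without_replacement.shape)  # (6, 3) (6, 3)
```

Both arrays have the same shape, which is why the two cases are easy to confuse; only the second guarantees that every row holds three distinct values.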

Warren



Re: align `choices` and `sample` with Python `random` module

ralfgommers
In reply to this post by Stephan Hoyer-2


On Mon, Dec 10, 2018 at 9:26 AM Stephan Hoyer <[hidden email]> wrote:
On Mon, Dec 10, 2018 at 8:26 AM Alan Isaac <[hidden email]> wrote:
On 12/10/2018 11:20 AM, Ralf Gommers wrote:
> there is nothing wrong with the current API

Just to be clear: you completely reject the past
cautions on this list against creating APIs
with flag parameters.  Is that correct?

Or is "nothing wrong" just a narrow approval in
this particular case?

I agree with you that numpy.random.sample is redundant, that APIs based on flags are generally poorly designed

That argument is for things like 3 boolean variables that together create 8 states. It then becomes hard to spot bugs like 1 of the 8 cases not being handled or not being well-defined. In this case, the meaning is clear and imho the API is better like this than if we'd create two almost identical functions.

and that all things being equal it would be desirable for NumPy and Python's standard library to be aligned.

That said, "replacing a function/parameter with something totally different by the same name" is a really painful/slow deprecation process that is best avoided if at all possible in mature projects.

+1


Personally, I would be +1 for issuing a deprecation warning for np.random.sample, and removing it after a good amount of notice (maybe several years). This is a similar deprecation cycle to what you see in Python itself (e.g., for ABCs in collections vs collections.abc). If you look at NumPy's docs for "Simple random data" [1] we have four different names for this same function ("random_sample", "random", "ranf" and "sample"), which is frankly absurd. Some cleanup is long overdue.

These aliases have always been there, since before 2005: https://github.com/numpy/numpy/commit/9338dea9. Deprecating some of those seems fine to me: `ranf` is a really bad name, and `random` is too. I don't see a good reason to remove `sample` and `random_sample`; they're sane names, the function is not buggy, and it just seems like bothering our users for no good reason.
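For reference, deprecating an alias while keeping it working usually amounts to a thin wrapper that warns and then delegates. The helper below is a generic sketch: `deprecated_alias` is a hypothetical name, and this is not NumPy's actual deprecation machinery.

```python
import warnings
import numpy as np

def deprecated_alias(func, old_name, new_name):
    """Wrap `func` so that calling it under `old_name` emits a warning."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{} is deprecated; use {} instead".format(old_name, new_name),
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper

# Local stand-in for the `ranf` alias; it does not touch numpy.random itself.
ranf = deprecated_alias(np.random.random_sample, "ranf", "random_sample")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = ranf(3)

print(caught[0].category.__name__)  # DeprecationWarning
```

Existing code keeps running during the notice period and only sees a warning, which is the "noisy" kind of change that users can act on.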

Cheers,
Ralf



But we should be extremely hesitant to actually reuse these names for something else. People depend on NumPy for stability, and there is plenty of code written against NumPy from five years ago that still runs just fine today. It's one thing to break code noisily by removing a function, but if there's any chance of introducing silent errors that would be inexcusable.
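A short sketch of why name reuse risks silent errors: the two existing meanings of "sample" accept overlapping call patterns but return unrelated things. The NumPy call below uses `random_sample`, which the thread describes as the function that `sample` aliases; the population list is an arbitrary example.

```python
import random
import numpy as np

population = [10, 20, 30, 40, 50]

# Standard library: sample(population, k) picks k distinct items.
stdlib_result = random.sample(population, 3)

# NumPy: random_sample(size) returns `size` floats in [0, 1).
numpy_result = np.random.random_sample(3)

print(type(stdlib_result).__name__, len(stdlib_result))  # list 3
print(numpy_result.dtype, numpy_result.shape)            # float64 (3,)
```

If the NumPy name were ever given the standard-library meaning, a call written for today's behaviour could fail or, worse, return a value of a different kind without raising anything.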

Best,
Stephan
