Proposal: add `force=` or `copy=` kwarg to `__array__` interface

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Juan Nunez-Iglesias-2
Hello NumPy-ers!

The __array__ method is a great little tool to allow interoperability with NumPy. Briefly, calling `np.array()` or `np.asarray()` on an object with an `__array__` method, one can get a NumPy representation of that object, which may or may not involve data copying (this is up to the object’s implementation of `__array__`). Some references:




(I couldn’t find an authoritative guide on good and bad practices with `__array__`, btw.)

For people writing e.g. visualisation libraries, this is a wonderful thing, because if we know how to visualise NumPy arrays, we can suddenly visualise anything with an `__array__` method. As an example, napari, while not being aware of dask, can visualise large dask arrays out of the box, which allows us to view 100GB out-of-core datasets easily.

However, in many cases, instantiating a NumPy array is an expensive operation, for example copying an array from GPU to CPU memory, or involves substantial loss of information. Some library authors are reluctant to allow implicit execution of such an operation, such as PyOpenCL [1], PyTorch [2], or even scipy.sparse.

My proposal is to add an optional argument to `__array__` that would signal to the downstream library that we *really* want a NumPy array and are willing to wait for it. In the PyTorch issue I proposed `force=True`, and they are somewhat receptive of this, but, reading more about the existing NumPy APIs, I think `copy=True` would be a nice alternative:

- np.array already has a copy= keyword argument. Under this proposal, it would attempt to pass it to the downstream library, and, if that failed, it would try again without it and run its own copy.
- np.asarray could get a new copy= keyword argument that would match np.array’s.
- It would neatly express the idea that the array is going to e.g. get passed around between devices.

Or, we could just go with `force=`.

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

What do people think? I would be happy to write a PR and/or NEP for this if there is general consensus that this would be useful.

Thanks,

Juan.

Refs:


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Charles R Harris


On Tue, Apr 21, 2020 at 1:07 AM Juan Nunez-Iglesias <[hidden email]> wrote:
Hello NumPy-ers!

The __array__ method is a great little tool to allow interoperability with NumPy. Briefly, calling `np.array()` or `np.asarray()` on an object with an `__array__` method, one can get a NumPy representation of that object, which may or may not involve data copying (this is up to the object’s implementation of `__array__`). Some references:




(I couldn’t find an authoritative guide on good and bad practices with `__array__`, btw.)

For people writing e.g. visualisation libraries, this is a wonderful thing, because if we know how to visualise NumPy arrays, we can suddenly visualise anything with an `__array__` method. As an example, napari, while not being aware of dask, can visualise large dask arrays out of the box, which allows us to view 100GB out-of-core datasets easily.

However, in many cases, instantiating a NumPy array is an expensive operation, for example copying an array from GPU to CPU memory, or involves substantial loss of information. Some library authors are reluctant to allow implicit execution of such an operation, such as PyOpenCL [1], PyTorch [2], or even scipy.sparse.

My proposal is to add an optional argument to `__array__` that would signal to the downstream library that we *really* want a NumPy array and are willing to wait for it. In the PyTorch issue I proposed `force=True`, and they are somewhat receptive of this, but, reading more about the existing NumPy APIs, I think `copy=True` would be a nice alternative:

- np.array already has a copy= keyword argument. Under this proposal, it would attempt to pass it to the downstream library, and, if that failed, it would try again without it and run its own copy.
- np.asarray could get a new copy= keyword argument that would match np.array’s.
- It would neatly express the idea that the array is going to e.g. get passed around between devices.

Or, we could just go with `force=`.

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

What do people think? I would be happy to write a PR and/or NEP for this if there is general consensus that this would be useful.


This sounds like the sort of thing that is use case driven. If enough projects want to use it, then I have no objections to adding the keyword. OTOH, we need to be careful about adding too many interoperability tricks as they complicate the code and makes it hard for folks to determine the best solution. Interoperability is a hot topic and we need to be careful not put too leave behind too many experiments in the NumPy code.  Do you have any other ideas of how to achieve the same effect?

Chuck

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Juan Nunez-Iglesias-2
Hi everyone,

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

After a side discussion with Stéfan van der Walt, we came up with `allow_copy=True`, which would express to the downstream library that we don’t mind waiting, but that zero-copy would also be ok.

This sounds like the sort of thing that is use case driven. If enough projects want to use it, then I have no objections to adding the keyword. OTOH, we need to be careful about adding too many interoperability tricks as they complicate the code and makes it hard for folks to determine the best solution. Interoperability is a hot topic and we need to be careful not put too leave behind too many experiments in the NumPy code.  Do you have any other ideas of how to achieve the same effect?

Personally, I don’t have any other ideas, but would be happy to hear some!

My view regarding API/experiment creep is that `__array__` is the oldest and most basic of all the interop tricks and that this can be safely maintained for future generations. Currently it only takes `dtype=` as a keyword argument, so it is a very lean API. I think this particular use case is very natural and I’ve encountered the reluctance to implicitly copy twice, so I expect it is reasonably common.

Regarding difficulty in determining the best solution, I would be happy to contribute to the dispatch basics guide together with the new kwarg. I agree that the protocols are getting quite numerous and I couldn’t find a single place that gathers all the best practices together. But, to reiterate my point: `__array__` is the simplest of these and I think this keyword is pretty safe to add.

For ease of discussion, here are the API options discussed so far, as well as a few extra that I don’t like but might trigger other ideas:

np.asarray(my_duck_array, allow_copy=True)  # default is False, or None -> leave it to the duck array to decide
np.asarray(my_duck_array, copy=True)  # always copies, but, if supported by the duck array, defers to it for the copy
np.asarray(my_duck_array, copy=‘allow’)  # could take values ‘allow’, ‘force’, ’no’, True(=‘force’), False(=’no’)
np.asarray(my_duck_array, force_copy=False, allow_copy=True)  # separate concepts, but unclear what force_copy=True, allow_copy=False means!
np.asarray(my_duck_array, force=True)

Juan.

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Eric Wieser
Perhaps worth mentioning that we've discussed this sort of API before, in https://github.com/numpy/numpy/pull/11897.

Under that proposal, the api would be something like:

* `copy=True` - always copy, like it is today
* `copy=False` - copy if needed, like it is today
* `copy=np.never_copy` - never copy, throw an exception if not possible

I think the discussion stalled on the precise spelling of the third option.

`__array__` was not discussed there, but it seems like adding the `copy` argument to `__array__` would be a perfectly reasonable extension.

Eric

On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <[hidden email]> wrote:
Hi everyone,

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

After a side discussion with Stéfan van der Walt, we came up with `allow_copy=True`, which would express to the downstream library that we don’t mind waiting, but that zero-copy would also be ok.

This sounds like the sort of thing that is use case driven. If enough projects want to use it, then I have no objections to adding the keyword. OTOH, we need to be careful about adding too many interoperability tricks as they complicate the code and makes it hard for folks to determine the best solution. Interoperability is a hot topic and we need to be careful not put too leave behind too many experiments in the NumPy code.  Do you have any other ideas of how to achieve the same effect?

Personally, I don’t have any other ideas, but would be happy to hear some!

My view regarding API/experiment creep is that `__array__` is the oldest and most basic of all the interop tricks and that this can be safely maintained for future generations. Currently it only takes `dtype=` as a keyword argument, so it is a very lean API. I think this particular use case is very natural and I’ve encountered the reluctance to implicitly copy twice, so I expect it is reasonably common.

Regarding difficulty in determining the best solution, I would be happy to contribute to the dispatch basics guide together with the new kwarg. I agree that the protocols are getting quite numerous and I couldn’t find a single place that gathers all the best practices together. But, to reiterate my point: `__array__` is the simplest of these and I think this keyword is pretty safe to add.

For ease of discussion, here are the API options discussed so far, as well as a few extra that I don’t like but might trigger other ideas:

np.asarray(my_duck_array, allow_copy=True)  # default is False, or None -> leave it to the duck array to decide
np.asarray(my_duck_array, copy=True)  # always copies, but, if supported by the duck array, defers to it for the copy
np.asarray(my_duck_array, copy=‘allow’)  # could take values ‘allow’, ‘force’, ’no’, True(=‘force’), False(=’no’)
np.asarray(my_duck_array, force_copy=False, allow_copy=True)  # separate concepts, but unclear what force_copy=True, allow_copy=False means!
np.asarray(my_duck_array, force=True)

Juan.
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
On Fri, 2020-04-24 at 11:34 +0100, Eric Wieser wrote:

> Perhaps worth mentioning that we've discussed this sort of API
> before, in
> https://github.com/numpy/numpy/pull/11897.
>
> Under that proposal, the api would be something like:
>
> * `copy=True` - always copy, like it is today
> * `copy=False` - copy if needed, like it is today
> * `copy=np.never_copy` - never copy, throw an exception if not
> possible
>
> I think the discussion stalled on the precise spelling of the third
> option.
>
> `__array__` was not discussed there, but it seems like adding the
> `copy`
> argument to `__array__` would be a perfectly reasonable extension.
>

One thing to note is that `__array__` is actually asked to return a
copy AFAIK. I doubt it always does, but if it does not I assume the
object should and could provide `__array_interface__`.
Under that assumption, it would be an opt-out right now since NumPy
allows copies by default here.
Defining things along copy does seem sensible, though I do not know how
it would play with some of the current array-likes choosing to refuse
`__array__`.

- Sebastian



> Eric
>
> On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <[hidden email]>
> wrote:
>
> > Hi everyone,
> >
> > One bit of expressivity we would miss is “copy if necessary, but
> > otherwise
> > > don’t bother”, but there are workarounds to this.
> > >
> >
> > After a side discussion with Stéfan van der Walt, we came up with
> > `allow_copy=True`, which would express to the downstream library
> > that we
> > don’t mind waiting, but that zero-copy would also be ok.
> >
> > This sounds like the sort of thing that is use case driven. If
> > enough
> > projects want to use it, then I have no objections to adding the
> > keyword.
> > OTOH, we need to be careful about adding too many interoperability
> > tricks
> > as they complicate the code and makes it hard for folks to
> > determine the
> > best solution. Interoperability is a hot topic and we need to be
> > careful
> > not put too leave behind too many experiments in the NumPy
> > code.  Do you
> > have any other ideas of how to achieve the same effect?
> >
> >
> > Personally, I don’t have any other ideas, but would be happy to
> > hear some!
> >
> > My view regarding API/experiment creep is that `__array__` is the
> > oldest
> > and most basic of all the interop tricks and that this can be
> > safely
> > maintained for future generations. Currently it only takes `dtype=`
> > as a
> > keyword argument, so it is a very lean API. I think this particular
> > use
> > case is very natural and I’ve encountered the reluctance to
> > implicitly copy
> > twice, so I expect it is reasonably common.
> >
> > Regarding difficulty in determining the best solution, I would be
> > happy to
> > contribute to the dispatch basics guide together with the new
> > kwarg. I
> > agree that the protocols are getting quite numerous and I couldn’t
> > find a
> > single place that gathers all the best practices together. But, to
> > reiterate my point: `__array__` is the simplest of these and I
> > think this
> > keyword is pretty safe to add.
> >
> > For ease of discussion, here are the API options discussed so far,
> > as well
> > as a few extra that I don’t like but might trigger other ideas:
> >
> > np.asarray(my_duck_array, allow_copy=True)  # default is False, or
> > None ->
> > leave it to the duck array to decide
> > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > supported
> > by the duck array, defers to it for the copy
> > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > ‘allow’,
> > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > separate
> > concepts, but unclear what force_copy=True, allow_copy=False means!
> > np.asarray(my_duck_array, force=True)
> >
> > Juan.
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Stephan Hoyer-2
On Fri, Apr 24, 2020 at 6:31 AM Sebastian Berg <[hidden email]> wrote:
One thing to note is that `__array__` is actually asked to return a
copy AFAIK.

The documentation on __array__ seems to quite limited, unfortunately. The most I can find are a few sentences here:

I don't see anything about returning copies. My interpretation has always been that __array__ can return either a copy or a view, like the np.asarray() constructor.
 
I doubt it always does, but if it does not I assume the
object should and could provide `__array_interface__`.

Objects like xarray.DataArray and pandas.Series sometimes directly wrap NumPy arrays and sometimes don't.

They both implement __array__ but not __array_inferace__. It's very obvious how to implement a "forwarding" __array__ method (just call `np.asarray()` on an argument that might implement it). I guess something similar could be done for __array_interface__, but it's not clear to me that it's right to implement __array_interface__ when doing so might require a copy.
 
Under that assumption, it would be an opt-out right now since NumPy
allows copies by default here.
Defining things along copy does seem sensible, though I do not know how
it would play with some of the current array-likes choosing to refuse
`__array__`.

- Sebastian



> Eric
>
> On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <[hidden email]>
> wrote:
>
> > Hi everyone,
> >
> > One bit of expressivity we would miss is “copy if necessary, but
> > otherwise
> > > don’t bother”, but there are workarounds to this.
> > >
> >
> > After a side discussion with Stéfan van der Walt, we came up with
> > `allow_copy=True`, which would express to the downstream library
> > that we
> > don’t mind waiting, but that zero-copy would also be ok.
> >
> > This sounds like the sort of thing that is use case driven. If
> > enough
> > projects want to use it, then I have no objections to adding the
> > keyword.
> > OTOH, we need to be careful about adding too many interoperability
> > tricks
> > as they complicate the code and makes it hard for folks to
> > determine the
> > best solution. Interoperability is a hot topic and we need to be
> > careful
> > not put too leave behind too many experiments in the NumPy
> > code.  Do you
> > have any other ideas of how to achieve the same effect?
> >
> >
> > Personally, I don’t have any other ideas, but would be happy to
> > hear some!
> >
> > My view regarding API/experiment creep is that `__array__` is the
> > oldest
> > and most basic of all the interop tricks and that this can be
> > safely
> > maintained for future generations. Currently it only takes `dtype=`
> > as a
> > keyword argument, so it is a very lean API. I think this particular
> > use
> > case is very natural and I’ve encountered the reluctance to
> > implicitly copy
> > twice, so I expect it is reasonably common.
> >
> > Regarding difficulty in determining the best solution, I would be
> > happy to
> > contribute to the dispatch basics guide together with the new
> > kwarg. I
> > agree that the protocols are getting quite numerous and I couldn’t
> > find a
> > single place that gathers all the best practices together. But, to
> > reiterate my point: `__array__` is the simplest of these and I
> > think this
> > keyword is pretty safe to add.
> >
> > For ease of discussion, here are the API options discussed so far,
> > as well
> > as a few extra that I don’t like but might trigger other ideas:
> >
> > np.asarray(my_duck_array, allow_copy=True)  # default is False, or
> > None ->
> > leave it to the duck array to decide
> > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > supported
> > by the duck array, defers to it for the copy
> > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > ‘allow’,
> > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > separate
> > concepts, but unclear what force_copy=True, allow_copy=False means!
> > np.asarray(my_duck_array, force=True)
> >
> > Juan.
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
On Fri, 2020-04-24 at 10:12 -0700, Stephan Hoyer wrote:

> On Fri, Apr 24, 2020 at 6:31 AM Sebastian Berg <
> [hidden email]>
> wrote:
>
> > One thing to note is that `__array__` is actually asked to return a
> > copy AFAIK.
>
> The documentation on __array__ seems to quite limited, unfortunately.
> The
> most I can find are a few sentences here:
> https://numpy.org/doc/stable/reference/arrays.classes.html#numpy.class.__array__
>
> I don't see anything about returning copies. My interpretation has
> always
> been that __array__ can return either a copy or a view, like the
> np.asarray() constructor.
>

Hmmm, right, I am not quite sure why I thought this was the case.

The more important part is behaviour. And the fact is that if you do
`np.array(array_like)` with an array like that implements `__array__`
then we ensure a copy is made by default (`copy=True` by default), even
though `__array__()` may already return a copy.

In any case, the current default for `np.asarray`, i.e. `copy=False` is
"copy if necessary". So if PyTorch uses a new parameter to Opt-In to
copying, the default behaviour will depend on the object. The
definition would then be:

    Copy if necessary but error if a copy is necessary and the
    object doesn't want to be copied silently.

To be honest, that seems not totally terrible to me... The old
statement remains true with the small caveat that it will sometimes
cause a loud error explaining things. The only problem is that some
users may want an the explicit `np.copy_if_necessary` to get PyTorch to
do what most already do on `copy=False`.

I guess the new behaviour would then be:

if copy is np.never_copy:  # or however we signal it
    try:
        arr = obj.__array__(copy=np.no_copy)
    except TypeError as e:
        raise TypeError("no copy appears unsupported by ...!") from e
elif copy is np.copy_if_necessary:
    # Some users may want to tell PyTorch not to error, but
    # tell pandas, that a view is OK:
    try:
        arr = np.array(copy=np.copy_if_necessary)
    except TypeError:
        arr = obj.__array__()
elif not copy:
    # Behaviour here may depend on the array-like!
    # current array likes may or may not return a copy,
    # new ones may choose to raise an error when a view
    # is not possible.
    arr = obj.__array__()
else:
    try:
        arr = obj.__array__(copy=True)
    except TypeError:
        arr = obj.__array__()
        arr = arr.copy()  # make sure its a copy

PyTorch can then implement copy, but raise an error if `copy=False`
(which must be the default). Current objects will error for
`np.never_copy` but otherwise be fine. And they can implement `copy` to
avoid an unnecessary double copy if they wish so.
We could add the `np.copy_if_necessary` to be an explicit replacement
for the current `copy=False`. This will be necessary, or nicer, unless
everyone is happy to copy by default.

Another side note: calls such as `np.array([arr1, arr2])` probably must
always fail if `copy=np.never_copy` since a view is not guaranteed.

- Sebastian


>
> > I doubt it always does, but if it does not I assume the
> > object should and could provide `__array_interface__`.
> >
>
> Objects like xarray.DataArray and pandas.Series sometimes directly
> wrap
> NumPy arrays and sometimes don't.
>
> They both implement __array__ but not __array_inferace__. It's very
> obvious
> how to implement a "forwarding" __array__ method (just call
> `np.asarray()`
> on an argument that might implement it). I guess something similar
> could be
> done for __array_interface__, but it's not clear to me that it's
> right to
> implement __array_interface__ when doing so might require a copy.
>

Yes, I do not think you should implement __array_interface__ then,
unless "simplifying the array" is for some reason beneficial for
yourself. I suppose you could raise an AttributeError, but it is
questionable if thats good.


>
> > Under that assumption, it would be an opt-out right now since NumPy
> > allows copies by default here.
> > Defining things along copy does seem sensible, though I do not know
> > how
> > it would play with some of the current array-likes choosing to
> > refuse
> > `__array__`.
> >
> > - Sebastian
> >
> >
> >
> > > Eric
> > >
> > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > [hidden email]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > One bit of expressivity we would miss is “copy if necessary,
> > > > but
> > > > otherwise
> > > > > don’t bother”, but there are workarounds to this.
> > > > >
> > > >
> > > > After a side discussion with Stéfan van der Walt, we came up
> > > > with
> > > > `allow_copy=True`, which would express to the downstream
> > > > library
> > > > that we
> > > > don’t mind waiting, but that zero-copy would also be ok.
> > > >
> > > > This sounds like the sort of thing that is use case driven. If
> > > > enough
> > > > projects want to use it, then I have no objections to adding
> > > > the
> > > > keyword.
> > > > OTOH, we need to be careful about adding too many
> > > > interoperability
> > > > tricks
> > > > as they complicate the code and makes it hard for folks to
> > > > determine the
> > > > best solution. Interoperability is a hot topic and we need to
> > > > be
> > > > careful
> > > > not put too leave behind too many experiments in the NumPy
> > > > code.  Do you
> > > > have any other ideas of how to achieve the same effect?
> > > >
> > > >
> > > > Personally, I don’t have any other ideas, but would be happy to
> > > > hear some!
> > > >
> > > > My view regarding API/experiment creep is that `__array__` is
> > > > the
> > > > oldest
> > > > and most basic of all the interop tricks and that this can be
> > > > safely
> > > > maintained for future generations. Currently it only takes
> > > > `dtype=`
> > > > as a
> > > > keyword argument, so it is a very lean API. I think this
> > > > particular
> > > > use
> > > > case is very natural and I’ve encountered the reluctance to
> > > > implicitly copy
> > > > twice, so I expect it is reasonably common.
> > > >
> > > > Regarding difficulty in determining the best solution, I would
> > > > be
> > > > happy to
> > > > contribute to the dispatch basics guide together with the new
> > > > kwarg. I
> > > > agree that the protocols are getting quite numerous and I
> > > > couldn’t
> > > > find a
> > > > single place that gathers all the best practices together. But,
> > > > to
> > > > reiterate my point: `__array__` is the simplest of these and I
> > > > think this
> > > > keyword is pretty safe to add.
> > > >
> > > > For ease of discussion, here are the API options discussed so
> > > > far,
> > > > as well
> > > > as a few extra that I don’t like but might trigger other ideas:
> > > >
> > > > np.asarray(my_duck_array, allow_copy=True)  # default is False,
> > > > or
> > > > None ->
> > > > leave it to the duck array to decide
> > > > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > > > supported
> > > > by the duck array, defers to it for the copy
> > > > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > > > ‘allow’,
> > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > > > separate
> > > > concepts, but unclear what force_copy=True, allow_copy=False
> > > > means!
> > > > np.asarray(my_duck_array, force=True)
> > > >
> > > > Juan.
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

ralfgommers
In reply to this post by Eric Wieser


On Fri, Apr 24, 2020 at 12:35 PM Eric Wieser <[hidden email]> wrote:
Perhaps worth mentioning that we've discussed this sort of API before, in https://github.com/numpy/numpy/pull/11897.

Under that proposal, the api would be something like:

* `copy=True` - always copy, like it is today
* `copy=False` - copy if needed, like it is today
* `copy=np.never_copy` - never copy, throw an exception if not possible

There's a couple of issues I see with using `copy` for __array__:
- copy is already weird (False doesn't mean no), and a [bool, some_obj_or_str] keyword isn't making that better
- the behavior we're talking about can do more than copying, e.g. for PyTorch it would modify the autograd graph by adding detach(), and for sparse it's not just "make a copy" (which implies doubling memory use) but it densifies which can massively blow up the memory.
- I'm -1 on adding things to the main namespace (never_copy) for something that can be handled differently (like a string, or a new keyword)

tl;dr a new `force` keyword would be better

Cheers,
Ralf


I think the discussion stalled on the precise spelling of the third option.

`__array__` was not discussed there, but it seems like adding the `copy` argument to `__array__` would be a perfectly reasonable extension.

Eric

On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <[hidden email]> wrote:
Hi everyone,

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

After a side discussion with Stéfan van der Walt, we came up with `allow_copy=True`, which would express to the downstream library that we don’t mind waiting, but that zero-copy would also be ok.

This sounds like the sort of thing that is use case driven. If enough projects want to use it, then I have no objections to adding the keyword. OTOH, we need to be careful about adding too many interoperability tricks as they complicate the code and makes it hard for folks to determine the best solution. Interoperability is a hot topic and we need to be careful not put too leave behind too many experiments in the NumPy code.  Do you have any other ideas of how to achieve the same effect?

Personally, I don’t have any other ideas, but would be happy to hear some!

My view regarding API/experiment creep is that `__array__` is the oldest and most basic of all the interop tricks and that this can be safely maintained for future generations. Currently it only takes `dtype=` as a keyword argument, so it is a very lean API. I think this particular use case is very natural and I’ve encountered the reluctance to implicitly copy twice, so I expect it is reasonably common.

Regarding difficulty in determining the best solution, I would be happy to contribute to the dispatch basics guide together with the new kwarg. I agree that the protocols are getting quite numerous and I couldn’t find a single place that gathers all the best practices together. But, to reiterate my point: `__array__` is the simplest of these and I think this keyword is pretty safe to add.

For ease of discussion, here are the API options discussed so far, as well as a few extra that I don’t like but might trigger other ideas:

np.asarray(my_duck_array, allow_copy=True)  # default is False, or None -> leave it to the duck array to decide
np.asarray(my_duck_array, copy=True)  # always copies, but, if supported by the duck array, defers to it for the copy
np.asarray(my_duck_array, copy=‘allow’)  # could take values ‘allow’, ‘force’, ’no’, True(=‘force’), False(=’no’)
np.asarray(my_duck_array, force_copy=False, allow_copy=True)  # separate concepts, but unclear what force_copy=True, allow_copy=False means!
np.asarray(my_duck_array, force=True)

Juan.
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Stephan Hoyer-2


On Sat, Apr 25, 2020 at 10:40 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Apr 24, 2020 at 12:35 PM Eric Wieser <[hidden email]> wrote:
Perhaps worth mentioning that we've discussed this sort of API before, in https://github.com/numpy/numpy/pull/11897.

Under that proposal, the api would be something like:

* `copy=True` - always copy, like it is today
* `copy=False` - copy if needed, like it is today
* `copy=np.never_copy` - never copy, throw an exception if not possible

There's a couple of issues I see with using `copy` for __array__:
- copy is already weird (False doesn't mean no), and a [bool, some_obj_or_str] keyword isn't making that better
- the behavior we're talking about can do more than copying, e.g. for PyTorch it would modify the autograd graph by adding detach(), and for sparse it's not just "make a copy" (which implies doubling memory use) but it densifies which can massively blow up the memory.
- I'm -1 on adding things to the main namespace (never_copy) for something that can be handled differently (like a string, or a new keyword)

tl;dr a new `force` keyword would be better

I agree, “copy” is not a good description of this desired coercion behavior.

A new keyword argument like “force” would be much clearer.


Cheers,
Ralf


I think the discussion stalled on the precise spelling of the third option.

`__array__` was not discussed there, but it seems like adding the `copy` argument to `__array__` would be a perfectly reasonable extension.

Eric

On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <[hidden email]> wrote:
Hi everyone,

One bit of expressivity we would miss is “copy if necessary, but otherwise don’t bother”, but there are workarounds to this.

After a side discussion with Stéfan van der Walt, we came up with `allow_copy=True`, which would express to the downstream library that we don’t mind waiting, but that zero-copy would also be ok.

This sounds like the sort of thing that is use case driven. If enough projects want to use it, then I have no objections to adding the keyword. OTOH, we need to be careful about adding too many interoperability tricks as they complicate the code and makes it hard for folks to determine the best solution. Interoperability is a hot topic and we need to be careful not put too leave behind too many experiments in the NumPy code.  Do you have any other ideas of how to achieve the same effect?

Personally, I don’t have any other ideas, but would be happy to hear some!

My view regarding API/experiment creep is that `__array__` is the oldest and most basic of all the interop tricks and that this can be safely maintained for future generations. Currently it only takes `dtype=` as a keyword argument, so it is a very lean API. I think this particular use case is very natural and I’ve encountered the reluctance to implicitly copy twice, so I expect it is reasonably common.

Regarding difficulty in determining the best solution, I would be happy to contribute to the dispatch basics guide together with the new kwarg. I agree that the protocols are getting quite numerous and I couldn’t find a single place that gathers all the best practices together. But, to reiterate my point: `__array__` is the simplest of these and I think this keyword is pretty safe to add.

For ease of discussion, here are the API options discussed so far, as well as a few extra that I don’t like but might trigger other ideas:

np.asarray(my_duck_array, allow_copy=True)  # default is False, or None -> leave it to the duck array to decide
np.asarray(my_duck_array, copy=True)  # always copies, but, if supported by the duck array, defers to it for the copy
np.asarray(my_duck_array, copy=‘allow’)  # could take values ‘allow’, ‘force’, ’no’, True(=‘force’), False(=’no’)
np.asarray(my_duck_array, force_copy=False, allow_copy=True)  # separate concepts, but unclear what force_copy=True, allow_copy=False means!
np.asarray(my_duck_array, force=True)

Juan.
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
On Sat, 2020-04-25 at 10:52 -0700, Stephan Hoyer wrote:

> On Sat, Apr 25, 2020 at 10:40 AM Ralf Gommers <[hidden email]
> >
> wrote:
>
> >
> > On Fri, Apr 24, 2020 at 12:35 PM Eric Wieser <
> > [hidden email]>
> > wrote:
> >
> > > Perhaps worth mentioning that we've discussed this sort of API
> > > before, in
> > > https://github.com/numpy/numpy/pull/11897.
> > >
> > > Under that proposal, the api would be something like:
> > >
> > > * `copy=True` - always copy, like it is today
> > > * `copy=False` - copy if needed, like it is today
> > > * `copy=np.never_copy` - never copy, throw an exception if not
> > > possible
> > >
> >
> > There's a couple of issues I see with using `copy` for __array__:
> > - copy is already weird (False doesn't mean no), and a [bool,
> > some_obj_or_str] keyword isn't making that better
> > - the behavior we're talking about can do more than copying, e.g.
> > for
> > PyTorch it would modify the autograd graph by adding detach(), and
> > for
> > sparse it's not just "make a copy" (which implies doubling memory
> > use) but
> > it densifies which can massively blow up the memory.
> > - I'm -1 on adding things to the main namespace (never_copy) for
> > something
> > that can be handled differently (like a string, or a new keyword)
> >
> > tl;dr a new `force` keyword would be better
> >
>
> I agree, “copy” is not a good description of this desired coercion
> behavior.
>
> A new keyword argument like “force” would be much clearer.
>

That seems fine and practical. But, in the end it seems to me that the
`force=` keyword, just means that some projects want to teach their
users that:

1. `np.asarray()` can be expensive (and may always copy)
2. `np.asarray()` always loses type properties

while others do not choose to teach about it.  There seems very little
or even no "promise" attached to either `force=True` or `force=False`.


In the end, the question is whether sparse will actually want to
implement `force=True` if the main reason we add is for library use.
There is no difference between a visualization library or numpy. In
both cases the users memory will blow up just the same.

As for PyTorch, is `.detach()` even a good reason?  Maybe I am missing
things, but:

>>> torch.ones(10, requires_grad=True) + np.arange(10)
# RuntimeError: Can't call numpy() on Variable that requires grad. Use
var.detach().numpy() instead.

So arguably, there is no type-safety concern due to `.detach()`. There
is an (obvious) general loss of type information that always occurs
with an `np.asarray` call.
But I do not see that creating any openings for bugs here, due to the
wisdom of not allowing the above operation.
In fact, it actually seems much worse for for xarray, or pandas. They
do support the above operation and will potentially mess up if the
arange was previously an xarray with a matching index, but different
order.


I am very much in favor of adding such things, but I still lack a bit
of clarity as to whom we would be helping?

If end-users will actually use `np.asarray(..., force=True)` over
special methods, then great! But I am currently not sure the type-
safety argument is all that big of a point.  And the performance or
memory-blowup argument remains true even for visualization libraries
(where the array is purely input and never output as such).


But yes, "never copy" is a somewhat different extension to `__array__`
and `np.asarray`. It guarantees high speed and in-place behaviour which
is useful for different settings.

- Sebastian


>
> > Cheers,
> > Ralf
> >
> >
> > > I think the discussion stalled on the precise spelling of the
> > > third
> > > option.
> > >
> > > `__array__` was not discussed there, but it seems like adding the
> > > `copy`
> > > argument to `__array__` would be a perfectly reasonable
> > > extension.
> > >
> > > Eric
> > >
> > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > [hidden email]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > One bit of expressivity we would miss is “copy if necessary,
> > > > but
> > > > > otherwise don’t bother”, but there are workarounds to this.
> > > > >
> > > >
> > > > After a side discussion with Stéfan van der Walt, we came up
> > > > with
> > > > `allow_copy=True`, which would express to the downstream
> > > > library that we
> > > > don’t mind waiting, but that zero-copy would also be ok.
> > > >
> > > > This sounds like the sort of thing that is use case driven. If
> > > > enough
> > > > projects want to use it, then I have no objections to adding
> > > > the keyword.
> > > > OTOH, we need to be careful about adding too many
> > > > interoperability tricks
> > > > as they complicate the code and makes it hard for folks to
> > > > determine the
> > > > best solution. Interoperability is a hot topic and we need to
> > > > be careful
> > > > not put too leave behind too many experiments in the NumPy
> > > > code.  Do you
> > > > have any other ideas of how to achieve the same effect?
> > > >
> > > >
> > > > Personally, I don’t have any other ideas, but would be happy to
> > > > hear
> > > > some!
> > > >
> > > > My view regarding API/experiment creep is that `__array__` is
> > > > the oldest
> > > > and most basic of all the interop tricks and that this can be
> > > > safely
> > > > maintained for future generations. Currently it only takes
> > > > `dtype=` as a
> > > > keyword argument, so it is a very lean API. I think this
> > > > particular use
> > > > case is very natural and I’ve encountered the reluctance to
> > > > implicitly copy
> > > > twice, so I expect it is reasonably common.
> > > >
> > > > Regarding difficulty in determining the best solution, I would
> > > > be happy
> > > > to contribute to the dispatch basics guide together with the
> > > > new kwarg. I
> > > > agree that the protocols are getting quite numerous and I
> > > > couldn’t find a
> > > > single place that gathers all the best practices together. But,
> > > > to
> > > > reiterate my point: `__array__` is the simplest of these and I
> > > > think this
> > > > keyword is pretty safe to add.
> > > >
> > > > For ease of discussion, here are the API options discussed so
> > > > far, as
> > > > well as a few extra that I don’t like but might trigger other
> > > > ideas:
> > > >
> > > > np.asarray(my_duck_array, allow_copy=True)  # default is False,
> > > > or None
> > > > -> leave it to the duck array to decide
> > > > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > > > supported
> > > > by the duck array, defers to it for the copy
> > > > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > > > ‘allow’,
> > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > > > separate
> > > > concepts, but unclear what force_copy=True, allow_copy=False
> > > > means!
> > > > np.asarray(my_duck_array, force=True)
> > > >
> > > > Juan.
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

ralfgommers


On Mon, Apr 27, 2020 at 12:10 AM Sebastian Berg <[hidden email]> wrote:
On Sat, 2020-04-25 at 10:52 -0700, Stephan Hoyer wrote:
> On Sat, Apr 25, 2020 at 10:40 AM Ralf Gommers <[hidden email]
> >
> wrote:
>
> >
> > On Fri, Apr 24, 2020 at 12:35 PM Eric Wieser <
> > [hidden email]>
> > wrote:
> >
> > > Perhaps worth mentioning that we've discussed this sort of API
> > > before, in
> > > https://github.com/numpy/numpy/pull/11897.
> > >
> > > Under that proposal, the api would be something like:
> > >
> > > * `copy=True` - always copy, like it is today
> > > * `copy=False` - copy if needed, like it is today
> > > * `copy=np.never_copy` - never copy, throw an exception if not
> > > possible
> > >
> >
> > There's a couple of issues I see with using `copy` for __array__:
> > - copy is already weird (False doesn't mean no), and a [bool,
> > some_obj_or_str] keyword isn't making that better
> > - the behavior we're talking about can do more than copying, e.g.
> > for
> > PyTorch it would modify the autograd graph by adding detach(), and
> > for
> > sparse it's not just "make a copy" (which implies doubling memory
> > use) but
> > it densifies which can massively blow up the memory.
> > - I'm -1 on adding things to the main namespace (never_copy) for
> > something
> > that can be handled differently (like a string, or a new keyword)
> >
> > tl;dr a new `force` keyword would be better
> >
>
> I agree, “copy” is not a good description of this desired coercion
> behavior.
>
> A new keyword argument like “force” would be much clearer.
>

That seems fine and practical. But, in the end it seems to me that the
`force=` keyword, just means that some projects want to teach their
users that:

1. `np.asarray()` can be expensive (and may always copy)
2. `np.asarray()` always loses type properties

while others do not choose to teach about it.  There seems very little
or even no "promise" attached to either `force=True` or `force=False`.


In the end, the question is whether sparse will actually want to
implement `force=True` if the main reason we add is for library use.

That's for PyData Sparse and scipy.sparse devs to answer. Maybe Hameer can answer for the former here. For SciPy that should be decided on the scipy-dev list, but my opinion would be: yes to adding __array__ that raises TypeError by default, and converts with `force=True`.
 
There is no difference between a visualization library or numpy. In
both cases the users memory will blow up just the same.

As for PyTorch, is `.detach()` even a good reason?  Maybe I am missing
things, but:

>>> torch.ones(10, requires_grad=True) + np.arange(10)
# RuntimeError: Can't call numpy() on Variable that requires grad. Use
var.detach().numpy() instead.

So arguably, there is no type-safety concern due to `.detach()`.

I'm not sure what the question is here; no one mentioned type-safety. The PyTorch maintainers have already said they're fine with adding a force keyword.

There
is an (obvious) general loss of type information that always occurs
with an `np.asarray` call.
But I do not see that creating any openings for bugs here, due to the
wisdom of not allowing the above operation.
In fact, it actually seems much worse for for xarray, or pandas. They
do support the above operation and will potentially mess up if the
arange was previously an xarray with a matching index, but different
order.


I am very much in favor of adding such things, but I still lack a bit
of clarity as to whom we would be helping?

See Juan's first email. I personally am ambivalent on this proposal, but if Juan and the Napari devs really want it, that's good enough for me.

Cheers,
Ralf



If end-users will actually use `np.asarray(..., force=True)` over
special methods, then great! But I am currently not sure the type-
safety argument is all that big of a point.  And the performance or
memory-blowup argument remains true even for visualization libraries
(where the array is purely input and never output as such).


But yes, "never copy" is a somewhat different extension to `__array__`
and `np.asarray`. It guarantees high speed and in-place behaviour which
is useful for different settings.

- Sebastian


>
> > Cheers,
> > Ralf
> >
> >
> > > I think the discussion stalled on the precise spelling of the
> > > third
> > > option.
> > >
> > > `__array__` was not discussed there, but it seems like adding the
> > > `copy`
> > > argument to `__array__` would be a perfectly reasonable
> > > extension.
> > >
> > > Eric
> > >
> > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > [hidden email]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > One bit of expressivity we would miss is “copy if necessary,
> > > > but
> > > > > otherwise don’t bother”, but there are workarounds to this.
> > > > >
> > > >
> > > > After a side discussion with Stéfan van der Walt, we came up
> > > > with
> > > > `allow_copy=True`, which would express to the downstream
> > > > library that we
> > > > don’t mind waiting, but that zero-copy would also be ok.
> > > >
> > > > This sounds like the sort of thing that is use case driven. If
> > > > enough
> > > > projects want to use it, then I have no objections to adding
> > > > the keyword.
> > > > OTOH, we need to be careful about adding too many
> > > > interoperability tricks
> > > > as they complicate the code and makes it hard for folks to
> > > > determine the
> > > > best solution. Interoperability is a hot topic and we need to
> > > > be careful
> > > > not put too leave behind too many experiments in the NumPy
> > > > code.  Do you
> > > > have any other ideas of how to achieve the same effect?
> > > >
> > > >
> > > > Personally, I don’t have any other ideas, but would be happy to
> > > > hear
> > > > some!
> > > >
> > > > My view regarding API/experiment creep is that `__array__` is
> > > > the oldest
> > > > and most basic of all the interop tricks and that this can be
> > > > safely
> > > > maintained for future generations. Currently it only takes
> > > > `dtype=` as a
> > > > keyword argument, so it is a very lean API. I think this
> > > > particular use
> > > > case is very natural and I’ve encountered the reluctance to
> > > > implicitly copy
> > > > twice, so I expect it is reasonably common.
> > > >
> > > > Regarding difficulty in determining the best solution, I would
> > > > be happy
> > > > to contribute to the dispatch basics guide together with the
> > > > new kwarg. I
> > > > agree that the protocols are getting quite numerous and I
> > > > couldn’t find a
> > > > single place that gathers all the best practices together. But,
> > > > to
> > > > reiterate my point: `__array__` is the simplest of these and I
> > > > think this
> > > > keyword is pretty safe to add.
> > > >
> > > > For ease of discussion, here are the API options discussed so
> > > > far, as
> > > > well as a few extra that I don’t like but might trigger other
> > > > ideas:
> > > >
> > > > np.asarray(my_duck_array, allow_copy=True)  # default is False,
> > > > or None
> > > > -> leave it to the duck array to decide
> > > > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > > > supported
> > > > by the duck array, defers to it for the copy
> > > > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > > > ‘allow’,
> > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > > > separate
> > > > concepts, but unclear what force_copy=True, allow_copy=False
> > > > means!
> > > > np.asarray(my_duck_array, force=True)
> > > >
> > > > Juan.
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Hameer Abbasi
Hi! Yes, I would advocate for a `force=` kwarg but personally I don't think it's explicit enough, but probably as explicit as can be given NumPy's API.

Personally, I'd also raise a warning within PyData/Sparse and I hope it's in big bold letters in the docs in NumPy to be careful with this.


From: NumPy-Discussion <numpy-discussion-bounces+einstein.edison=[hidden email]> on behalf of Ralf Gommers <[hidden email]>
Sent: Tuesday, April 28, 2020 11:51:13 AM
To: Discussion of Numerical Python <[hidden email]>
Subject: Re: [Numpy-discussion] Proposal: add `force=` or `copy=` kwarg to `__array__` interface
 


On Mon, Apr 27, 2020 at 12:10 AM Sebastian Berg <[hidden email]> wrote:
On Sat, 2020-04-25 at 10:52 -0700, Stephan Hoyer wrote:
> On Sat, Apr 25, 2020 at 10:40 AM Ralf Gommers <[hidden email]
> >
> wrote:
>
> >
> > On Fri, Apr 24, 2020 at 12:35 PM Eric Wieser <
> > [hidden email]>
> > wrote:
> >
> > > Perhaps worth mentioning that we've discussed this sort of API
> > > before, in
> > > https://github.com/numpy/numpy/pull/11897.
> > >
> > > Under that proposal, the api would be something like:
> > >
> > > * `copy=True` - always copy, like it is today
> > > * `copy=False` - copy if needed, like it is today
> > > * `copy=np.never_copy` - never copy, throw an exception if not
> > > possible
> > >
> >
> > There's a couple of issues I see with using `copy` for __array__:
> > - copy is already weird (False doesn't mean no), and a [bool,
> > some_obj_or_str] keyword isn't making that better
> > - the behavior we're talking about can do more than copying, e.g.
> > for
> > PyTorch it would modify the autograd graph by adding detach(), and
> > for
> > sparse it's not just "make a copy" (which implies doubling memory
> > use) but
> > it densifies which can massively blow up the memory.
> > - I'm -1 on adding things to the main namespace (never_copy) for
> > something
> > that can be handled differently (like a string, or a new keyword)
> >
> > tl;dr a new `force` keyword would be better
> >
>
> I agree, “copy” is not a good description of this desired coercion
> behavior.
>
> A new keyword argument like “force” would be much clearer.
>

That seems fine and practical. But, in the end it seems to me that the
`force=` keyword, just means that some projects want to teach their
users that:

1. `np.asarray()` can be expensive (and may always copy)
2. `np.asarray()` always loses type properties

while others do not choose to teach about it.  There seems very little
or even no "promise" attached to either `force=True` or `force=False`.


In the end, the question is whether sparse will actually want to
implement `force=True` if the main reason we add is for library use.

That's for PyData Sparse and scipy.sparse devs to answer. Maybe Hameer can answer for the former here. For SciPy that should be decided on the scipy-dev list, but my opinion would be: yes to adding __array__ that raises TypeError by default, and converts with `force=True`.
 
There is no difference between a visualization library or numpy. In
both cases the users memory will blow up just the same.

As for PyTorch, is `.detach()` even a good reason?  Maybe I am missing
things, but:

>>> torch.ones(10, requires_grad=True) + np.arange(10)
# RuntimeError: Can't call numpy() on Variable that requires grad. Use
var.detach().numpy() instead.

So arguably, there is no type-safety concern due to `.detach()`.

I'm not sure what the question is here; no one mentioned type-safety. The PyTorch maintainers have already said they're fine with adding a force keyword.

There
is an (obvious) general loss of type information that always occurs
with an `np.asarray` call.
But I do not see that creating any openings for bugs here, due to the
wisdom of not allowing the above operation.
In fact, it actually seems much worse for for xarray, or pandas. They
do support the above operation and will potentially mess up if the
arange was previously an xarray with a matching index, but different
order.


I am very much in favor of adding such things, but I still lack a bit
of clarity as to whom we would be helping?

See Juan's first email. I personally am ambivalent on this proposal, but if Juan and the Napari devs really want it, that's good enough for me.

Cheers,
Ralf



If end-users will actually use `np.asarray(..., force=True)` over
special methods, then great! But I am currently not sure the type-
safety argument is all that big of a point.  And the performance or
memory-blowup argument remains true even for visualization libraries
(where the array is purely input and never output as such).


But yes, "never copy" is a somewhat different extension to `__array__`
and `np.asarray`. It guarantees high speed and in-place behaviour which
is useful for different settings.

- Sebastian


>
> > Cheers,
> > Ralf
> >
> >
> > > I think the discussion stalled on the precise spelling of the
> > > third
> > > option.
> > >
> > > `__array__` was not discussed there, but it seems like adding the
> > > `copy`
> > > argument to `__array__` would be a perfectly reasonable
> > > extension.
> > >
> > > Eric
> > >
> > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > [hidden email]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > One bit of expressivity we would miss is “copy if necessary,
> > > > but
> > > > > otherwise don’t bother”, but there are workarounds to this.
> > > > >
> > > >
> > > > After a side discussion with Stéfan van der Walt, we came up
> > > > with
> > > > `allow_copy=True`, which would express to the downstream
> > > > library that we
> > > > don’t mind waiting, but that zero-copy would also be ok.
> > > >
> > > > This sounds like the sort of thing that is use case driven. If
> > > > enough
> > > > projects want to use it, then I have no objections to adding
> > > > the keyword.
> > > > OTOH, we need to be careful about adding too many
> > > > interoperability tricks
> > > > as they complicate the code and makes it hard for folks to
> > > > determine the
> > > > best solution. Interoperability is a hot topic and we need to
> > > > be careful
> > > > not put too leave behind too many experiments in the NumPy
> > > > code.  Do you
> > > > have any other ideas of how to achieve the same effect?
> > > >
> > > >
> > > > Personally, I don’t have any other ideas, but would be happy to
> > > > hear
> > > > some!
> > > >
> > > > My view regarding API/experiment creep is that `__array__` is
> > > > the oldest
> > > > and most basic of all the interop tricks and that this can be
> > > > safely
> > > > maintained for future generations. Currently it only takes
> > > > `dtype=` as a
> > > > keyword argument, so it is a very lean API. I think this
> > > > particular use
> > > > case is very natural and I’ve encountered the reluctance to
> > > > implicitly copy
> > > > twice, so I expect it is reasonably common.
> > > >
> > > > Regarding difficulty in determining the best solution, I would
> > > > be happy
> > > > to contribute to the dispatch basics guide together with the
> > > > new kwarg. I
> > > > agree that the protocols are getting quite numerous and I
> > > > couldn’t find a
> > > > single place that gathers all the best practices together. But,
> > > > to
> > > > reiterate my point: `__array__` is the simplest of these and I
> > > > think this
> > > > keyword is pretty safe to add.
> > > >
> > > > For ease of discussion, here are the API options discussed so
> > > > far, as
> > > > well as a few extra that I don’t like but might trigger other
> > > > ideas:
> > > >
> > > > np.asarray(my_duck_array, allow_copy=True)  # default is False,
> > > > or None
> > > > -> leave it to the duck array to decide
> > > > np.asarray(my_duck_array, copy=True)  # always copies, but, if
> > > > supported
> > > > by the duck array, defers to it for the copy
> > > > np.asarray(my_duck_array, copy=‘allow’)  # could take values
> > > > ‘allow’,
> > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > np.asarray(my_duck_array, force_copy=False, allow_copy=True)  #
> > > > separate
> > > > concepts, but unclear what force_copy=True, allow_copy=False
> > > > means!
> > > > np.asarray(my_duck_array, force=True)
> > > >
> > > > Juan.
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
In reply to this post by ralfgommers
On Tue, 2020-04-28 at 11:51 +0200, Ralf Gommers wrote:
<snip>
> > So arguably, there is no type-safety concern due to `.detach()`.
>
> I'm not sure what the question is here; no one mentioned type-safety.
> The
> PyTorch maintainers have already said they're fine with adding a
> force
> keyword.

But type-safety is the reason to distinguish between:

* np.asarrau(tensor)
* np.asarray(tensor, force=True)

Similar to:

* operator.index(obj)
* int(obj)   # convert less type-safe (strings, floats)!

I actually mentioned 3 reasons in my email:

1. Teach and Inform users (about the next two mainly)
2. Type-safety
3. Expensive conversion

And only type-safety is related to `.detach()` mentioning that there
may not be clear story about the usage in that case.

(continued below)

>
<snip>

> >
> >
> > I am very much in favor of adding such things, but I still lack a
> > bit
> > of clarity as to whom we would be helping?
> >
>
> See Juan's first email. I personally am ambivalent on this proposal,
> but if
> Juan and the Napari devs really want it, that's good enough for me.

Of course I read it, twice, but it is only good enough for me if we
actually *solve the issue*, and for that I want to know which issue we
are solving :), it seems obvious, but I am not so sure...

That brings us to the other two reasons:

Teaching and Informing users:

If Napari uses `force=True` indiscriminately, it is not very clear to
the user about whether or not the operation is expensive.  I.e. the
user can learn it is when using `np.asarray(sparse_arr)` with other
libraries. But they are not notified that `napari.vis_func(sparse_arr)`
might kill their computer.

So the "Teaching" part can still partially work, but it does not inform
the user well anymore on whether or not a function will blow-up memory.

Expensive Conversion:

If the main reason is expensive conversions, however, than, as a
library I would probably just use it for half my API, since copying
from GPU to CPU will still be much faster than my own function.


Generally:

I want to help Napari, but it seems like there may be more to this, and
it may be good to finish these thoughts before making a call.

E.g. Napari wants to use it, but do the array-providers want Napari to
use it?

For sparse Hameer just mentioned that he still would want big warnings
both during the operation and in the `np.asarray` documentation.
If we put such big warnings there, we should have an idea of who we
want to ignore that warning? (Napari yes, sklearn sometimes, ...?)

   -> Is "whatever the library feels right" good enough?

And if the conversion still gives warnings for some array-objects, have
we actually gained much?

  -> Maybe we do, end-users may be happy to ignore those warnings...


The one clear use-case for `force=True` is the end-user. Just like no
library uses `int(obj)`, but end-users can use it very nicely.
I am happy to help the end-user in this case, but if that is the target
audience we may want to _discourage_ Napari from using `force=True` and
encourage sparse not to put any RuntimeWarnings on it!

- Sebastian


> Cheers,
> Ralf
>
>
>
> > If end-users will actually use `np.asarray(..., force=True)` over
> > special methods, then great! But I am currently not sure the type-
> > safety argument is all that big of a point.  And the performance or
> > memory-blowup argument remains true even for visualization
> > libraries
> > (where the array is purely input and never output as such).
> >
> >
> > But yes, "never copy" is a somewhat different extension to
> > `__array__`
> > and `np.asarray`. It guarantees high speed and in-place behaviour
> > which
> > is useful for different settings.
> >
> > - Sebastian
> >
> >
> > > > Cheers,
> > > > Ralf
> > > >
> > > >
> > > > > I think the discussion stalled on the precise spelling of the
> > > > > third
> > > > > option.
> > > > >
> > > > > `__array__` was not discussed there, but it seems like adding
> > > > > the
> > > > > `copy`
> > > > > argument to `__array__` would be a perfectly reasonable
> > > > > extension.
> > > > >
> > > > > Eric
> > > > >
> > > > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > One bit of expressivity we would miss is “copy if
> > > > > > necessary,
> > > > > > but
> > > > > > > otherwise don’t bother”, but there are workarounds to
> > > > > > > this.
> > > > > > >
> > > > > >
> > > > > > After a side discussion with Stéfan van der Walt, we came
> > > > > > up
> > > > > > with
> > > > > > `allow_copy=True`, which would express to the downstream
> > > > > > library that we
> > > > > > don’t mind waiting, but that zero-copy would also be ok.
> > > > > >
> > > > > > This sounds like the sort of thing that is use case driven.
> > > > > > If
> > > > > > enough
> > > > > > projects want to use it, then I have no objections to
> > > > > > adding
> > > > > > the keyword.
> > > > > > OTOH, we need to be careful about adding too many
> > > > > > interoperability tricks
> > > > > > as they complicate the code and makes it hard for folks to
> > > > > > determine the
> > > > > > best solution. Interoperability is a hot topic and we need
> > > > > > to
> > > > > > be careful
> > > > > > not put too leave behind too many experiments in the NumPy
> > > > > > code.  Do you
> > > > > > have any other ideas of how to achieve the same effect?
> > > > > >
> > > > > >
> > > > > > Personally, I don’t have any other ideas, but would be
> > > > > > happy to
> > > > > > hear
> > > > > > some!
> > > > > >
> > > > > > My view regarding API/experiment creep is that `__array__`
> > > > > > is
> > > > > > the oldest
> > > > > > and most basic of all the interop tricks and that this can
> > > > > > be
> > > > > > safely
> > > > > > maintained for future generations. Currently it only takes
> > > > > > `dtype=` as a
> > > > > > keyword argument, so it is a very lean API. I think this
> > > > > > particular use
> > > > > > case is very natural and I’ve encountered the reluctance to
> > > > > > implicitly copy
> > > > > > twice, so I expect it is reasonably common.
> > > > > >
> > > > > > Regarding difficulty in determining the best solution, I
> > > > > > would
> > > > > > be happy
> > > > > > to contribute to the dispatch basics guide together with
> > > > > > the
> > > > > > new kwarg. I
> > > > > > agree that the protocols are getting quite numerous and I
> > > > > > couldn’t find a
> > > > > > single place that gathers all the best practices together.
> > > > > > But,
> > > > > > to
> > > > > > reiterate my point: `__array__` is the simplest of these
> > > > > > and I
> > > > > > think this
> > > > > > keyword is pretty safe to add.
> > > > > >
> > > > > > For ease of discussion, here are the API options discussed
> > > > > > so
> > > > > > far, as
> > > > > > well as a few extra that I don’t like but might trigger
> > > > > > other
> > > > > > ideas:
> > > > > >
> > > > > > np.asarray(my_duck_array, allow_copy=True)  # default is
> > > > > > False,
> > > > > > or None
> > > > > > -> leave it to the duck array to decide
> > > > > > np.asarray(my_duck_array, copy=True)  # always copies, but,
> > > > > > if
> > > > > > supported
> > > > > > by the duck array, defers to it for the copy
> > > > > > np.asarray(my_duck_array, copy=‘allow’)  # could take
> > > > > > values
> > > > > > ‘allow’,
> > > > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > > > np.asarray(my_duck_array, force_copy=False,
> > > > > > allow_copy=True)  #
> > > > > > separate
> > > > > > concepts, but unclear what force_copy=True,
> > > > > > allow_copy=False
> > > > > > means!
> > > > > > np.asarray(my_duck_array, force=True)
> > > > > >
> > > > > > Juan.
> > > > > > _______________________________________________
> > > > > > NumPy-Discussion mailing list
> > > > > > [hidden email]
> > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > >
> > > > > _______________________________________________
> > > > > NumPy-Discussion mailing list
> > > > > [hidden email]
> > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > >
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
On Tue, 2020-04-28 at 09:58 -0500, Sebastian Berg wrote:

> On Tue, 2020-04-28 at 11:51 +0200, Ralf Gommers wrote:
> <snip>
> > > So arguably, there is no type-safety concern due to `.detach()`.
> >
> > I'm not sure what the question is here; no one mentioned type-
> > safety.
> > The
> > PyTorch maintainers have already said they're fine with adding a
> > force
> > keyword.
>
> But type-safety is the reason to distinguish between:
>
> * np.asarrau(tensor)
> * np.asarray(tensor, force=True)
>
> Similar to:
>
> * operator.index(obj)
> * int(obj)   # convert less type-safe (strings, floats)!
>
> I actually mentioned 3 reasons in my email:
>
> 1. Teach and Inform users (about the next two mainly)
> 2. Type-safety
> 3. Expensive conversion
>
> And only type-safety is related to `.detach()` mentioning that there
> may not be clear story about the usage in that case.
>

(Sorry something got broken here)

The question is what PyTorch's reasons are to feel `np.asarray(tensor)`
should not work generally.
I for one thought it was type-safety with regard to `.detach()`. And
then I was surprised to realize that type-safety might not be a great
reason to reject an implicit `.detach()` within `np.asarray(tensor)`.


In any case, all the long talk is simply that I first want to be clear
on what the concerns are why libraries reject `np.asarray(tensor)`.
And then, I want to be clear that adding `force=True` will actually
solves those concerns.
And I was surprised myself that this became very much unclear to me.

Again, one reason for it being not clear to me is half the ecosystem
could potentially can just always use `force=True`.  So there must be
some "good usage" and some "bad usage" and I would like to know what
that is.

- Sebastian


> (continued below)
>
> <snip>
> > >
> > > I am very much in favor of adding such things, but I still lack a
> > > bit
> > > of clarity as to whom we would be helping?
> > >
> >
> > See Juan's first email. I personally am ambivalent on this
> > proposal,
> > but if
> > Juan and the Napari devs really want it, that's good enough for me.
>
> Of course I read it, twice, but it is only good enough for me if we
> actually *solve the issue*, and for that I want to know which issue
> we
> are solving :), it seems obvious, but I am not so sure...
>
> That brings us to the other two reasons:
>
> Teaching and Informing users:
>
> If Napari uses `force=True` indiscriminately, it is not very clear to
> the user about whether or not the operation is expensive.  I.e. the
> user can learn it is when using `np.asarray(sparse_arr)` with other
> libraries. But they are not notified that
> `napari.vis_func(sparse_arr)`
> might kill their computer.
>
> So the "Teaching" part can still partially work, but it does not
> inform
> the user well anymore on whether or not a function will blow-up
> memory.
>
> Expensive Conversion:
>
> If the main reason is expensive conversions, however, than, as a
> library I would probably just use it for half my API, since copying
> from GPU to CPU will still be much faster than my own function.
>
>
> Generally:
>
> I want to help Napari, but it seems like there may be more to this,
> and
> it may be good to finish these thoughts before making a call.
>
> E.g. Napari wants to use it, but do the array-providers want Napari
> to
> use it?
>
> For sparse Hameer just mentioned that he still would want big
> warnings
> both during the operation and in the `np.asarray` documentation.
> If we put such big warnings there, we should have an idea of who we
> want to ignore that warning? (Napari yes, sklearn sometimes, ...?)
>
>    -> Is "whatever the library feels right" good enough?
>
> And if the conversion still gives warnings for some array-objects,
> have
> we actually gained much?
>
>   -> Maybe we do, end-users may be happy to ignore those warnings...
>
>
> The one clear use-case for `force=True` is the end-user. Just like no
> library uses `int(obj)`, but end-users can use it very nicely.
> I am happy to help the end-user in this case, but if that is the
> target
> audience we may want to _discourage_ Napari from using `force=True`
> and
> encourage sparse not to put any RuntimeWarnings on it!
>
> - Sebastian
>
>
> > Cheers,
> > Ralf
> >
> >
> >
> > > If end-users will actually use `np.asarray(..., force=True)` over
> > > special methods, then great! But I am currently not sure the
> > > type-
> > > safety argument is all that big of a point.  And the performance
> > > or
> > > memory-blowup argument remains true even for visualization
> > > libraries
> > > (where the array is purely input and never output as such).
> > >
> > >
> > > But yes, "never copy" is a somewhat different extension to
> > > `__array__`
> > > and `np.asarray`. It guarantees high speed and in-place behaviour
> > > which
> > > is useful for different settings.
> > >
> > > - Sebastian
> > >
> > >
> > > > > Cheers,
> > > > > Ralf
> > > > >
> > > > >
> > > > > > I think the discussion stalled on the precise spelling of
> > > > > > the
> > > > > > third
> > > > > > option.
> > > > > >
> > > > > > `__array__` was not discussed there, but it seems like
> > > > > > adding
> > > > > > the
> > > > > > `copy`
> > > > > > argument to `__array__` would be a perfectly reasonable
> > > > > > extension.
> > > > > >
> > > > > > Eric
> > > > > >
> > > > > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > One bit of expressivity we would miss is “copy if
> > > > > > > necessary,
> > > > > > > but
> > > > > > > > otherwise don’t bother”, but there are workarounds to
> > > > > > > > this.
> > > > > > > >
> > > > > > >
> > > > > > > After a side discussion with Stéfan van der Walt, we came
> > > > > > > up
> > > > > > > with
> > > > > > > `allow_copy=True`, which would express to the downstream
> > > > > > > library that we
> > > > > > > don’t mind waiting, but that zero-copy would also be ok.
> > > > > > >
> > > > > > > This sounds like the sort of thing that is use case
> > > > > > > driven.
> > > > > > > If
> > > > > > > enough
> > > > > > > projects want to use it, then I have no objections to
> > > > > > > adding
> > > > > > > the keyword.
> > > > > > > OTOH, we need to be careful about adding too many
> > > > > > > interoperability tricks
> > > > > > > as they complicate the code and makes it hard for folks
> > > > > > > to
> > > > > > > determine the
> > > > > > > best solution. Interoperability is a hot topic and we
> > > > > > > need
> > > > > > > to
> > > > > > > be careful
> > > > > > > not put too leave behind too many experiments in the
> > > > > > > NumPy
> > > > > > > code.  Do you
> > > > > > > have any other ideas of how to achieve the same effect?
> > > > > > >
> > > > > > >
> > > > > > > Personally, I don’t have any other ideas, but would be
> > > > > > > happy to
> > > > > > > hear
> > > > > > > some!
> > > > > > >
> > > > > > > My view regarding API/experiment creep is that
> > > > > > > `__array__`
> > > > > > > is
> > > > > > > the oldest
> > > > > > > and most basic of all the interop tricks and that this
> > > > > > > can
> > > > > > > be
> > > > > > > safely
> > > > > > > maintained for future generations. Currently it only
> > > > > > > takes
> > > > > > > `dtype=` as a
> > > > > > > keyword argument, so it is a very lean API. I think this
> > > > > > > particular use
> > > > > > > case is very natural and I’ve encountered the reluctance
> > > > > > > to
> > > > > > > implicitly copy
> > > > > > > twice, so I expect it is reasonably common.
> > > > > > >
> > > > > > > Regarding difficulty in determining the best solution, I
> > > > > > > would
> > > > > > > be happy
> > > > > > > to contribute to the dispatch basics guide together with
> > > > > > > the
> > > > > > > new kwarg. I
> > > > > > > agree that the protocols are getting quite numerous and I
> > > > > > > couldn’t find a
> > > > > > > single place that gathers all the best practices
> > > > > > > together.
> > > > > > > But,
> > > > > > > to
> > > > > > > reiterate my point: `__array__` is the simplest of these
> > > > > > > and I
> > > > > > > think this
> > > > > > > keyword is pretty safe to add.
> > > > > > >
> > > > > > > For ease of discussion, here are the API options
> > > > > > > discussed
> > > > > > > so
> > > > > > > far, as
> > > > > > > well as a few extra that I don’t like but might trigger
> > > > > > > other
> > > > > > > ideas:
> > > > > > >
> > > > > > > np.asarray(my_duck_array, allow_copy=True)  # default is
> > > > > > > False,
> > > > > > > or None
> > > > > > > -> leave it to the duck array to decide
> > > > > > > np.asarray(my_duck_array, copy=True)  # always copies,
> > > > > > > but,
> > > > > > > if
> > > > > > > supported
> > > > > > > by the duck array, defers to it for the copy
> > > > > > > np.asarray(my_duck_array, copy=‘allow’)  # could take
> > > > > > > values
> > > > > > > ‘allow’,
> > > > > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > > > > np.asarray(my_duck_array, force_copy=False,
> > > > > > > allow_copy=True)  #
> > > > > > > separate
> > > > > > > concepts, but unclear what force_copy=True,
> > > > > > > allow_copy=False
> > > > > > > means!
> > > > > > > np.asarray(my_duck_array, force=True)
> > > > > > >
> > > > > > > Juan.
> > > > > > > _______________________________________________
> > > > > > > NumPy-Discussion mailing list
> > > > > > > [hidden email]
> > > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > > >
> > > > > > _______________________________________________
> > > > > > NumPy-Discussion mailing list
> > > > > > [hidden email]
> > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > >
> > > > > _______________________________________________
> > > > > NumPy-Discussion mailing list
> > > > > [hidden email]
> > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > >
> > > >
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

ralfgommers
In reply to this post by Sebastian Berg


On Tue, Apr 28, 2020 at 5:03 PM Sebastian Berg <[hidden email]> wrote:
On Tue, 2020-04-28 at 11:51 +0200, Ralf Gommers wrote:
<snip>
> > So arguably, there is no type-safety concern due to `.detach()`.
>
> I'm not sure what the question is here; no one mentioned type-safety.
> The
> PyTorch maintainers have already said they're fine with adding a
> force
> keyword.

But type-safety is the reason to distinguish between:

* np.asarrau(tensor)
* np.asarray(tensor, force=True)

No it's not, the rationale given by library authors is expensive conversion / memory copies / side effects. `np.asarray(x)` is used all over the place, and can/will continue to be used by library authors. `force=True` is for cases where things like expensive conversion don't matter, like visualization - if you need a picture of an array then it helps, while the downside of writing inefficient/unreliable numerical code isn't present.


Similar to:

* operator.index(obj)
* int(obj)   # convert less type-safe (strings, floats)!

I actually mentioned 3 reasons in my email:

1. Teach and Inform users (about the next two mainly)
2. Type-safety
3. Expensive conversion

And only type-safety is related to `.detach()` mentioning that there
may not be clear story about the usage in that case.

(continued below)

>
<snip>
> >
> >
> > I am very much in favor of adding such things, but I still lack a
> > bit
> > of clarity as to whom we would be helping?
> >
>
> See Juan's first email. I personally am ambivalent on this proposal,
> but if
> Juan and the Napari devs really want it, that's good enough for me.

Of course I read it, twice, but it is only good enough for me if we
actually *solve the issue*, and for that I want to know which issue we
are solving :), it seems obvious, but I am not so sure...

That brings us to the other two reasons:

Teaching and Informing users:

If Napari uses `force=True` indiscriminately, it is not very clear to
the user about whether or not the operation is expensive.  I.e. the
user can learn it is when using `np.asarray(sparse_arr)` with other
libraries. But they are not notified that `napari.vis_func(sparse_arr)`
might kill their computer.

So the "Teaching" part can still partially work, but it does not inform
the user well anymore on whether or not a function will blow-up memory.

Expensive Conversion:

If the main reason is expensive conversions, however, than, as a
library I would probably just use it for half my API, since copying
from GPU to CPU will still be much faster than my own function.


Generally:

I want to help Napari, but it seems like there may be more to this, and
it may be good to finish these thoughts before making a call.

E.g. Napari wants to use it, but do the array-providers want Napari to
use it?

For sparse Hameer just mentioned that he still would want big warnings
both during the operation and in the `np.asarray` documentation.
If we put such big warnings there, we should have an idea of who we
want to ignore that warning? (Napari yes, sklearn sometimes, ...?)

There clearly should not be warnings. And sklearn is irrelevant, it cannot use `force=True`.

Ralf



   -> Is "whatever the library feels right" good enough?

And if the conversion still gives warnings for some array-objects, have
we actually gained much?

  -> Maybe we do, end-users may be happy to ignore those warnings...


The one clear use-case for `force=True` is the end-user. Just like no
library uses `int(obj)`, but end-users can use it very nicely.
I am happy to help the end-user in this case, but if that is the target
audience we may want to _discourage_ Napari from using `force=True` and
encourage sparse not to put any RuntimeWarnings on it!

- Sebastian


> Cheers,
> Ralf
>
>
>
> > If end-users will actually use `np.asarray(..., force=True)` over
> > special methods, then great! But I am currently not sure the type-
> > safety argument is all that big of a point.  And the performance or
> > memory-blowup argument remains true even for visualization
> > libraries
> > (where the array is purely input and never output as such).
> >
> >
> > But yes, "never copy" is a somewhat different extension to
> > `__array__`
> > and `np.asarray`. It guarantees high speed and in-place behaviour
> > which
> > is useful for different settings.
> >
> > - Sebastian
> >
> >
> > > > Cheers,
> > > > Ralf
> > > >
> > > >
> > > > > I think the discussion stalled on the precise spelling of the
> > > > > third
> > > > > option.
> > > > >
> > > > > `__array__` was not discussed there, but it seems like adding
> > > > > the
> > > > > `copy`
> > > > > argument to `__array__` would be a perfectly reasonable
> > > > > extension.
> > > > >
> > > > > Eric
> > > > >
> > > > > On Fri, 24 Apr 2020 at 03:00, Juan Nunez-Iglesias <
> > > > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > One bit of expressivity we would miss is “copy if
> > > > > > necessary,
> > > > > > but
> > > > > > > otherwise don’t bother”, but there are workarounds to
> > > > > > > this.
> > > > > > >
> > > > > >
> > > > > > After a side discussion with Stéfan van der Walt, we came
> > > > > > up
> > > > > > with
> > > > > > `allow_copy=True`, which would express to the downstream
> > > > > > library that we
> > > > > > don’t mind waiting, but that zero-copy would also be ok.
> > > > > >
> > > > > > This sounds like the sort of thing that is use case driven.
> > > > > > If
> > > > > > enough
> > > > > > projects want to use it, then I have no objections to
> > > > > > adding
> > > > > > the keyword.
> > > > > > OTOH, we need to be careful about adding too many
> > > > > > interoperability tricks
> > > > > > as they complicate the code and makes it hard for folks to
> > > > > > determine the
> > > > > > best solution. Interoperability is a hot topic and we need
> > > > > > to
> > > > > > be careful
> > > > > > not put too leave behind too many experiments in the NumPy
> > > > > > code.  Do you
> > > > > > have any other ideas of how to achieve the same effect?
> > > > > >
> > > > > >
> > > > > > Personally, I don’t have any other ideas, but would be
> > > > > > happy to
> > > > > > hear
> > > > > > some!
> > > > > >
> > > > > > My view regarding API/experiment creep is that `__array__`
> > > > > > is
> > > > > > the oldest
> > > > > > and most basic of all the interop tricks and that this can
> > > > > > be
> > > > > > safely
> > > > > > maintained for future generations. Currently it only takes
> > > > > > `dtype=` as a
> > > > > > keyword argument, so it is a very lean API. I think this
> > > > > > particular use
> > > > > > case is very natural and I’ve encountered the reluctance to
> > > > > > implicitly copy
> > > > > > twice, so I expect it is reasonably common.
> > > > > >
> > > > > > Regarding difficulty in determining the best solution, I
> > > > > > would
> > > > > > be happy
> > > > > > to contribute to the dispatch basics guide together with
> > > > > > the
> > > > > > new kwarg. I
> > > > > > agree that the protocols are getting quite numerous and I
> > > > > > couldn’t find a
> > > > > > single place that gathers all the best practices together.
> > > > > > But,
> > > > > > to
> > > > > > reiterate my point: `__array__` is the simplest of these
> > > > > > and I
> > > > > > think this
> > > > > > keyword is pretty safe to add.
> > > > > >
> > > > > > For ease of discussion, here are the API options discussed
> > > > > > so
> > > > > > far, as
> > > > > > well as a few extra that I don’t like but might trigger
> > > > > > other
> > > > > > ideas:
> > > > > >
> > > > > > np.asarray(my_duck_array, allow_copy=True)  # default is
> > > > > > False,
> > > > > > or None
> > > > > > -> leave it to the duck array to decide
> > > > > > np.asarray(my_duck_array, copy=True)  # always copies, but,
> > > > > > if
> > > > > > supported
> > > > > > by the duck array, defers to it for the copy
> > > > > > np.asarray(my_duck_array, copy=‘allow’)  # could take
> > > > > > values
> > > > > > ‘allow’,
> > > > > > ‘force’, ’no’, True(=‘force’), False(=’no’)
> > > > > > np.asarray(my_duck_array, force_copy=False,
> > > > > > allow_copy=True)  #
> > > > > > separate
> > > > > > concepts, but unclear what force_copy=True,
> > > > > > allow_copy=False
> > > > > > means!
> > > > > > np.asarray(my_duck_array, force=True)
> > > > > >
> > > > > > Juan.
> > > > > > _______________________________________________
> > > > > > NumPy-Discussion mailing list
> > > > > > [hidden email]
> > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > >
> > > > > _______________________________________________
> > > > > NumPy-Discussion mailing list
> > > > > [hidden email]
> > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > >
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > [hidden email]
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Juan Nunez-Iglesias-2
Hi everyone, and thank you Ralf for carrying the flag in my absence. =D

Sebastian, the *primary* motivation behind avoiding detach() in PyTorch is listed in original post of the PyTorch issue:

People not very familiar with requires_grad and cpu/gpu Tensors might go back and forth with numpy. For example doing pytorch -> numpy -> pytorch and backward on the last Tensor. This will backward without issue but not all the way to the first part of the code and won’t raise any error.


The PyTorch team are concerned that they will be overwhelmed with help requests if np.array() silently succeeds on a tensor with gradients. I definitely get that.

Avoiding .gpu() is more straightforwardly about avoiding implicit expensive computation.

while others do not choose to teach about it.  There seems very little
or even no "promise" attached to either `force=True` or `force=False`.

NumPy can set a precedent through policy. The *only* reason client libraries would implement `__array__` is to play well with NumPy, so if NumPy documents that `force=True` should *always* succeed, we can expect client libraries to follow suit. At least the PyTorch devs have indicated that they would be open to this.

E.g. Napari wants to use it, but do the array-providers want Napari to use it?

As Ralf pointed out, the PyTorch devs have already agreed to it.

From the napari perspective, we'd be ok with leaving the decision on warnings to client libraries. We may or may not suppress them depending on user requests. ;) But the point is to have a way of saying "give me a NumPy array DAMMIT" without having to know about all the possible array libraries. Which are numerous and getting numerouser.

Ralf, you said you don't want warnings — even for sparse arrays? That was an area of concern for you on the PyTorch discussion side.

And if the conversion still gives warnings for some array-objects, have we actually gained much?

Yes.

Hameer,

I would advocate for a `force=` kwarg but personally I don't think it's explicit enough, but probably as explicit as can be given NumPy's API.

Yeah, I agree that force is kind of vague, which is why I was looking for things like `allow_copy`. But it is hard to be general enough here: sparse requires an expensive instantiation, cupy requires copying from gpu to cpu, dask requires arbitrary computation, xarray requires information loss... I'm inclined to agree with Ralf that force= is the only generic-enough term, but I'm happy to entertain other options!

Juan.

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

ralfgommers


On Wed, Apr 29, 2020 at 12:27 PM Juan Nunez-Iglesias <[hidden email]> wrote:


Ralf, you said you don't want warnings — even for sparse arrays? That was an area of concern for you on the PyTorch discussion side.

Providing a boolean keyword argument and then raising a warning every time anyone uses that keyword makes very little sense.

This should simply be handled by clear documentation: use only in code like visualization where the array arrives at its "end station", never in functions that are again built on top of.

Cheers,
Ralf




_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: add `force=` or `copy=` kwarg to `__array__` interface

Sebastian Berg
In reply to this post by Juan Nunez-Iglesias-2
On Wed, 2020-04-29 at 05:26 -0500, Juan Nunez-Iglesias wrote:

> Hi everyone, and thank you Ralf for carrying the flag in my absence.
> =D
>
> Sebastian, the *primary* motivation behind avoiding detach() in
> PyTorch is listed in original post of the PyTorch issue:
>
> > People not very familiar with `requires_grad` and cpu/gpu Tensors
> > might go back and forth with numpy. For example doing pytorch ->
> > numpy -> pytorch and backward on the last Tensor. This will
> > backward without issue but not all the way to the first part of the
> > code and won’t raise any error.
>
> The PyTorch team are concerned that they will be overwhelmed with
> help requests if np.array() silently succeeds on a tensor with
> gradients. I definitely get that.

Sorry for playing advocatus diaboli...

I guess it is simply that before the end, it would be nice to have a
short list with projects:

* Napari, matplotlib on the "user" side
* PyTorch, ...? on the "provider" side

And maybe what their expectations on `force=True` are, to make sure
they roughly align.

The best definition for when to use `force=True` at this time seems to
be "end-point" users (such as visualization or maybe writing to disk?).

I still think performance can be just as valid of an issue there. For
example it may be better to convert to a numpy array earlier in the
computation.  Or someone could be surprised that saving their gpu array
to an hdf5 file is by far the slowest part of the computation.

Maybe I have the feeling the definition we want is actually:

   There is definitely no way to do this computation faster or better
   than by converting it to a NumPy array.

Since currently the main reason to reject it seems a bit to be:

   Wait, are you sure there is not a much better way than using NumPy
   arrays, be careful!

And while that distinction is clear for PyTorch + visualization, I am
not quite sure yet, that it is clear for various combinations of
`force=True` and array-like users.
Maybe CuPy does not want h5py to use `force=True`, because cupy has its
own super fast "stream-to-file" functionality... But it wants to to do
it for napari.

- Sebastian


>
> Avoiding .gpu() is more straightforwardly about avoiding implicit
> expensive computation.
>
> > while others do not choose to teach about it. There seems very
> > little
> > or even no "promise" attached to either `force=True` or
> > `force=False`.
>
> NumPy can set a precedent through policy. The *only* reason client
> libraries would implement `__array__` is to play well with NumPy, so
> if NumPy documents that `force=True` should *always* succeed, we can
> expect client libraries to follow suit. At least the PyTorch devs
> have indicated that they would be open to this.
>
> > E.g. Napari wants to use it, but do the array-providers want Napari
> > to use it?
>
> As Ralf pointed out, the PyTorch devs have already agreed to it.
>
> From the napari perspective, we'd be ok with leaving the decision on
> warnings to client libraries. We may or may not suppress them
> depending on user requests. ;) But the point is to have a way of
> saying "give me a NumPy array DAMMIT" without having to know about
> all the possible array libraries. Which are numerous and getting
> numerouser.
>
> Ralf, you said you don't want warnings — even for sparse arrays? That
> was an area of concern for you on the PyTorch discussion side.
>
> > And if the conversion still gives warnings for some array-objects,
> > have we actually gained much?
>
> Yes.
>
> Hameer,
>
> > I would advocate for a `force=` kwarg but personally I don't think
> > it's explicit enough, but probably as explicit as can be given
> > NumPy's API.
>
> Yeah, I agree that force is kind of vague, which is why I was looking
> for things like `allow_copy`. But it is hard to be general enough
> here: sparse requires an expensive instantiation, cupy requires
> copying from gpu to cpu, dask requires arbitrary computation, xarray
> requires information loss... I'm inclined to agree with Ralf that
> force= is the only generic-enough term, but I'm happy to entertain
> other options!
>
> Juan.
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion