Proposal to accept NEP-18, __array_function__ protocol


Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
I propose to accept NEP-18, "A dispatch mechanism for NumPy’s high level array functions":
http://www.numpy.org/neps/nep-0018-array-function-protocol.html

Since the last round of discussion, we added a new section on "Callable objects generated at runtime" clarifying that to handle such objects is out of scope for the initial proposal in the NEP.
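
For anyone who hasn't read the NEP recently: the protocol consists of a single method defined by array-like classes, sketched below (see the NEP itself for the full semantics):

    class MyDuckArray:
        def __array_function__(self, func, types, args, kwargs):
            # func: the NumPy function that was called, e.g. np.concatenate
            # types: the collection of array types found in the call
            # args, kwargs: the original arguments to func
            ...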

If there are no substantive objections within 7 days from this email, then the NEP will be accepted; see NEP 0 for more details.

Cheers,
Stephan


Re: Proposal to accept NEP-18, __array_function__ protocol

Nathaniel Smith
I'm sorry, I've been trying to find the time to read what you ended up
with, and haven't managed yet – could I get a few days extension? :-)

-n


Re: Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
OK, I can give you a few more days. I'll be camping through the weekend, but hope to accept it when I get back on Monday!


Re: Proposal to accept NEP-18, __array_function__ protocol

Nathaniel Smith
In reply to this post by Stephan Hoyer-2
Hey all,

So I've finally read through NEP 18 (__array_function__). Sorry again
for the delay!

It's an impressive piece of work! Thanks to the many authors; there's
clearly been a lot of thought put into this.

# The trade-off between comprehensive APIs versus clean APIs

At a high-level, what makes me nervous about this proposal is that it
reminds me of a classic software design pattern that... I don't know a
name for. You might call it the "structured monkeypatching" approach
to extensibility. The pattern is: a project decides they want to allow
some kind of extensions or addons or plugins or something, but
defining a big structured API for this is too difficult. So they take
their current API surface area and declare that that's the plugin API
(plus some mechanism for plugins to hook in etc.). Is this pattern
good? It's... hard to say. What generally happens is:

1. You get a very complete, powerful, flexible plugin API with minimal work.
2. This quickly leads to a rich system of powerful plugins, which
drives quick uptake of the project, sometimes even driving out
competitors.
3. The maintainers slowly realize that committing to such a large and
unstructured API is horribly unwieldy and makes changes difficult.
4. The maintainers spend huge amounts of effort trying to crawl out
from under the weight of their commitments, with mixed success.

Examples:

pytest, sphinx: For both of these projects, writing plugins is a
miserable experience, and you never really know if they'll work with
new releases or when composed with random other plugins. Both projects
are absolutely the dominant players in their niche, far better than
the competition, largely thanks to their rich plugin ecosystems.

CPython: the C extension API is basically just... all of CPython's
internals dumped into a header file. Without this numpy wouldn't
exist. A key ingredient in Python's miraculous popularity. Also, at
this point, possibly the largest millstone preventing further
improvements in Python – this is why we can't have multicore support,
JITs, etc. etc.; all the most ambitious discussions at the Python
language summit the last few years have circled back to "...but we
can't do that b/c it will break the C API". See also:
https://mail.python.org/pipermail/python-dev/2018-July/154814.html

Firefox: their original extension API was basically just "our UI is
written in javascript, extension modules get to throw more javascript
in the pot". One of Firefox's original USPs, and a key part of like...
how Mozilla even exists instead of having gone out of business a
decade ago. Eventually the extension API started blocking critical
architectural changes (e.g. for better sandboxing), and they had to go
through an *immensely* painful migration to a properly designed API,
which took years and burned huge amounts of goodwill.

So this is like... an extreme version of technical debt. You're making
a deal with the devil for wealth and fame, and then eventually the
bill becomes due. It's hard for me to say categorically that this is a
bad idea – empirically, it can be very successful! But there are real
trade-offs. And it makes me a bit nervous that Matt is the one
proposing this, because I'm pretty sure if you asked him he'd say he's
absolutely focused on how to get something working ASAP and has no
plans to maintain numpy in the future.

The other approach would be to incrementally add clean, well-defined
dunder methods like __array_ufunc__, __array_concatenate__, etc. This
way we end up putting some thought into each interface, making sure
that it's something we can support, protecting downstream libraries
from unnecessary complexity (e.g. they can implement
__array_concatenate__ instead of hstack, vstack, row_stack,
column_stack, ...), or avoiding adding new APIs entirely (e.g., by
converting existing functions into ufuncs so __array_ufunc__ starts
automagically working). And in the end we get a clean list of dunder
methods that new array container implementations have to define. It's
plausible to imagine a generic test suite for array containers. (I
suspect that every library that tries to implement __array_function__
will end up with accidental behavioral differences, just because the
numpy API is so vast and contains so many corner cases.) So the
clean-well-defined-dunders approach has lots of upsides. The big
downside is that this is a much longer road to go down.
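
(To make the hstack/vstack example concrete, here's a rough sketch of how a
single hook could back several public functions. __array_concatenate__ is a
made-up name, not something that exists:)

    import numpy as np

    def _dispatch_concatenate(arrays, axis=0):
        # defer to the first argument whose type defines the hypothetical
        # __array_concatenate__ hook; otherwise fall back to plain NumPy
        for a in arrays:
            hook = getattr(type(a), '__array_concatenate__', None)
            if hook is not None:
                return hook(a, arrays, axis)
        return np.concatenate(arrays, axis=axis)

    def hstack(tup):
        arrs = [np.atleast_1d(a) for a in tup]
        axis = 0 if arrs[0].ndim == 1 else 1
        return _dispatch_concatenate(arrs, axis)

    def vstack(tup):
        return _dispatch_concatenate([np.atleast_2d(a) for a in tup], axis=0)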

I am genuinely uncertain which of these approaches is better on net,
or whether we should do both. But because I'm uncertain, I'm nervous
about committing to the NEP 18 approach -- it feels risky.

## Can we mitigate that risk?

One thing that helps is the way the proposal makes it all-or-nothing:
if you have an __array_function__ method, then you are committing to
reimplementing *all* the numpy API (or at least all the parts that you
want to work at all). This is arguably a bad thing in the long run,
because only large and well-resourced projects can realistically hope
to implement __array_function__. But for now it does somewhat mitigate
the risks, because the fewer users we have the easier it is to work
with them to change course later. But that's probably not enough --
"don't worry, if we change it we'll only break large, important
projects with lots of users" isn't actually *that* reassuring :-).

The proposal also bills itself as an unstable, provisional experiment
("this protocol should be considered strictly experimental. We reserve
the right to change the details of this protocol and how specific
NumPy functions use it at any time in the future – even in otherwise
bug-fix only releases of NumPy."). This mitigates a lot of risk! If we
aren't committing to anything, then sure, why not experiment.

But... this is wishful thinking. No matter what the NEP says, I simply
don't believe that we'll actually go break dask, sparse arrays,
xarray, and sklearn in a numpy point release. Or any numpy release.
Nor should we. If we're serious about keeping this experimental – and
I think that's an excellent idea for now! – then IMO we need to do
something more to avoid getting trapped by backwards compatibility.

My suggestion: at numpy import time, check for an envvar, like say
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
__array_function__ dispatches turn into no-ops. This lets interested
downstream libraries and users try this out, but makes sure that we
won't have a hundred thousand end users depending on it without
realizing. Other advantages:

- makes it easy for end-users to check how much overhead this adds (by
running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just
remove the envvar.
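
(Rough sketch of the guard, assuming NumPy wraps each overridable public
function with a decorator; the decorator name here is made up:)

    import os

    ENABLED = bool(int(os.environ.get('NUMPY_EXPERIMENTAL_ARRAY_FUNCTION', '0')))

    def array_function_dispatch(dispatcher):
        # hypothetical decorator applied to every overridable public function
        def decorator(implementation):
            if not ENABLED:
                # env var unset: return the plain function unchanged, so the
                # dispatch really is a no-op with zero per-call overhead
                return implementation
            def public_api(*args, **kwargs):
                # here we'd scan dispatcher(*args, **kwargs) for objects
                # defining __array_function__ and delegate to them if found
                return implementation(*args, **kwargs)
            return public_api
        return decorator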

With this change, I'm overall +1 on the proposal. Without it, I...
would like more convincing, at least :-).

# Minor quibbles

I don't really understand the 'types' frozenset. The NEP says "it will
be used by most __array_function__ methods, which otherwise would need
to extract this information themselves"... but they still need to
extract the information themselves, because they still have to examine
each object and figure out what type it is. And, simply creating a
frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
possibly optimize later...
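
(If you want to reproduce that number, it came from something like:)

    import timeit

    # total seconds for one million two-element frozensets; ~0.2 s on my
    # laptop, i.e. roughly 0.2 µs per call (your machine will vary)
    print(timeit.timeit("frozenset(ts)", setup="ts = (int, float)",
                        number=1_000_000))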

-n


Re: Proposal to accept NEP-18, __array_function__ protocol

einstein.edison
Hi Nathaniel,

Very well written summary; it provides a lot of perspective on the different ways that this could go wrong. Here is a little commentary.

On 13. Aug 2018, at 11:44, Nathaniel Smith <[hidden email]> wrote:

Hey all,

So I've finally read through NEP 18 (__array_function__). Sorry again
for the delay!

It's an impressive piece of work! Thanks to the many authors; there's
clearly been a lot of thought put into this.

# The trade-off between comprehensive APIs versus clean APIs

At a high-level, what makes me nervous about this proposal is that it
reminds me of a classic software design pattern that... I don't know a
name for. You might call it the "structured monkeypatching" approach
to extensibility. The pattern is: a project decides they want to allow
some kind of extensions or addons or plugins or something, but
defining a big structured API for this is too difficult. So they take
their current API surface area and declare that that's the plugin API
(plus some mechanism for plugins to hook in etc.). Is this pattern
good? It's... hard to say. What generally happens is:

1. You get a very complete, powerful, flexible plugin API with minimal work.
2. This quickly leads to a rich system of powerful plugins, which
drives quick uptake of the project, sometimes even driving out
competitors.
3. The maintainers slowly realize that committing to such a large and
unstructured API is horribly unwieldy and makes changes difficult.
4. The maintainers spend huge amounts of effort trying to crawl out
from under the weight of their commitments, with mixed success.

Examples:

pytest, sphinx: For both of these projects, writing plugins is a
miserable experience, and you never really know if they'll work with
new releases or when composed with random other plugins. Both projects
are absolutely the dominant players in their niche, far better than
the competition, largely thanks to their rich plugin ecosystems.

Ah, yes. I’ve been affected by this in both instances. For example, there’s a bug where, with doctests enabled, you can’t select tests to run; and with coverage enabled, the pydev debugger stops working.

However, composition (at least) is better handled with this protocol. This is answered in more detail later on.


CPython: the C extension API is basically just... all of CPython's
internals dumped into a header file. Without this numpy wouldn't
exist. A key ingredient in Python's miraculous popularity. Also, at
this point, possibly the largest millstone preventing further
improvements in Python – this is why we can't have multicore support,
JITs, etc. etc.; all the most ambitious discussions at the Python
language summit the last few years have circled back to "...but we
can't do that b/c it will break the C API". See also:
https://mail.python.org/pipermail/python-dev/2018-July/154814.html

Firefox: their original extension API was basically just "our UI is
written in javascript, extension modules get to throw more javascript
in the pot". One of Firefox's original USPs, and a key part of like...
how Mozilla even exists instead of having gone out of business a
decade ago. Eventually the extension API started blocking critical
architectural changes (e.g. for better sandboxing), and they had to go
through an *immensely* painful migration to a properly designed API,
which took years and burned huge amounts of goodwill.

Ah, yes. I remember heated debates about this. A lot of good add-ons (as Mozilla calls them) were lost because of alienated developers or missing APIs.


So this is like... an extreme version of technical debt. You're making
a deal with the devil for wealth and fame, and then eventually the
bill becomes due. It's hard for me to say categorically that this is a
bad idea – empirically, it can be very successful! But there are real
trade-offs. And it makes me a bit nervous that Matt is the one
proposing this, because I'm pretty sure if you asked him he'd say he's
absolutely focused on how to get something working ASAP and has no
plans to maintain numpy in the future.

The other approach would be to incrementally add clean, well-defined
dunder methods like __array_ufunc__, __array_concatenate__, etc. This
way we end up putting some thought into each interface, making sure
that it's something we can support, protecting downstream libraries
from unnecessary complexity (e.g. they can implement
__array_concatenate__ instead of hstack, vstack, row_stack,
column_stack, ...), or avoiding adding new APIs entirely (e.g., by
converting existing functions into ufuncs so __array_ufunc__ starts
automagically working).

Yes, this is the way I’d prefer to go as well, but the machinery required for converting something to a ufunc is rather complex. Take, for example, the three-argument np.where. In order to preserve full backward compatibility, we need to support structured arrays and strings… which is rather hard to do. Same with the np.identity ufunc that was proposed for casting to a given dtype. This isn’t necessarily an argument against using ufuncs, I’d actually take it as an argument for better documenting these sorts of things. In fact, I wanted to do both of these myself at one point, but I found the documentation for writing ufuncs insufficient for handling these corner cases.
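
For instance, ufuncs don’t handle string dtypes at all, so a np.where ufunc would struggle to preserve currently-working calls like this one:

    >>> import numpy as np
    >>> np.where([True, False], ['yes', 'no'], ['a', 'b'])
    array(['yes', 'b'], dtype='<U3')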


And in the end we get a clean list of dunder
methods that new array container implementations have to define. It's
plausible to imagine a generic test suite for array containers. (I
suspect that every library that tries to implement __array_function__
will end up with accidental behavioral differences, just because the
numpy API is so vast and contains so many corner cases.) So the
clean-well-defined-dunders approach has lots of upsides. The big
downside is that this is a much longer road to go down.

It all comes down to how stable we consider the NumPy API to be. And I’d say pretty stable. Once a function is overridden, we generally do not let NumPy handle it at all. We generally tend to preserve backward compatibility in the API, and the API is all we’re exposing.


I am genuinely uncertain which of these approaches is better on net,
or whether we should do both. But because I'm uncertain, I'm nervous
about committing to the NEP 18 approach -- it feels risky.

## Can we mitigate that risk?

One thing that helps is the way the proposal makes it all-or-nothing:
if you have an __array_function__ method, then you are committing to
reimplementing *all* the numpy API (or at least all the parts that you
want to work at all). This is arguably a bad thing in the long run,
because only large and well-resourced projects can realistically hope
to implement __array_function__. But for now it does somewhat mitigate
the risks, because the fewer users we have the easier it is to work
with them to change course later. But that's probably not enough --
"don't worry, if we change it we'll only break large, important
projects with lots of users" isn't actually *that* reassuring :-).

There was a proposal that we elected to leave out; maybe this is a reason to put it back in. The proposal was to have a sentinel (np.NotImplementedButCoercible) that would take the regular route via coercion, whereas returning NotImplemented would raise a TypeError as usual. This way, someone could implement the protocol incrementally.
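
A sketch of what that would look like (np.NotImplementedButCoercible doesn’t exist; it’s the proposed sentinel):

    import numpy as np

    class MyPartialArray:
        HANDLED = {}  # the subset of NumPy functions we actually implement

        def __array_function__(self, func, types, args, kwargs):
            if func in self.HANDLED:
                return self.HANDLED[func](*args, **kwargs)
            # proposed sentinel: ask NumPy to coerce us with np.asarray()
            # and use its own implementation, instead of the TypeError that
            # returning NotImplemented would produce
            return np.NotImplementedButCoercible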

Also, another thing we could do (as Stephan Hoyer, I, Travis Oliphant, Tyler Reddy, Saul Shanabrook and a few others discussed in the SciPy 2018 sprints) is to go through the NumPy API, convert as much of them into ufuncs as possible, and identify a “base set” from the rest. Then duck array implementations would only need to implement this “base set” of operations along with __array_ufunc__. My idea is that we get started on identifying these things as soon as possible, and only allow this “base set” under __array_function__, and the rest should simply use these to create all NumPy functionality. I might take on identifying this “base set” of functionality in early September if enough people are interested. I might even go as far as to say that we shouldn’t allow any function under this protocol unless it’s in the "base set”, and rewrite the rest of NumPy to use the “base set”. It’s a big project for sure, but identifying the “base set”/ufunc-able functions shouldn’t take too long. The rewriting part might.

The downside would be that some things (such as np.stack) could have better implementations if a custom one was allowed, rather than for example, error checking + concatenate + reshape or error-checking + introduce extra dimension + concatenate. I’m willing to live with this downside, personally, in favour of a cleaner API.


The proposal also bills itself as an unstable, provisional experiment
("this protocol should be considered strictly experimental. We reserve
the right to change the details of this protocol and how specific
NumPy functions use it at any time in the future – even in otherwise
bug-fix only releases of NumPy."). This mitigates a lot of risk! If we
aren't committing to anything, then sure, why not experiment.

But... this is wishful thinking. No matter what the NEP says, I simply
don't believe that we'll actually go break dask, sparse arrays,
xarray, and sklearn in a numpy point release. Or any numpy release.
Nor should we. If we're serious about keeping this experimental – and
I think that's an excellent idea for now! – then IMO we need to do
something more to avoid getting trapped by backwards compatibility.

My suggestion: at numpy import time, check for an envvar, like say
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
__array_function__ dispatches turn into no-ops. This lets interested
downstream libraries and users try this out, but makes sure that we
won't have a hundred thousand end users depending on it without
realizing. Other advantages:

- makes it easy for end-users to check how much overhead this adds (by
running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just
remove the envvar.

We also have to consider that this might hinder adoption. But I’m fine with that. Properly > Quickly, as long as it doesn’t take too long. I’m +0 on this until we properly hammer out this stuff, then we remove it and make this the default.

However, I also realise that pydata/sparse is in the early stages, and can probably wait. Other duck array implementations such as Dask and XArray might need this soon-ish.


With this change, I'm overall +1 on the proposal. Without it, I...
would like more convincing, at least :-).

# Minor quibbles

I don't really understand the 'types' frozenset. The NEP says "it will
be used by most __array_function__ methods, which otherwise would need
to extract this information themselves"... but they still need to
extract the information themselves, because they still have to examine
each object and figure out what type it is. And, simply creating a
frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
possibly optimize later…

The rationale here is that most implementations would check whether the types in the call are actually supported by their implementation. If not, they’d return NotImplemented. If this weren’t done here, every implementation would need to extract that information individually, and this may take a lot of time.

I do agree that it violates DRY a bit though… The types are already present in the passed-in arguments, and this can be inferred from those.



Hope that clarifies things!

Best regards,
Hameer Abbasi



Re: Proposal to accept NEP-18, __array_function__ protocol

Nathaniel Smith
In reply to this post by Nathaniel Smith
On Mon, Aug 13, 2018 at 2:44 AM, Nathaniel Smith <[hidden email]> wrote:
> So this is like... an extreme version of technical debt. You're making
> a deal with the devil for wealth and fame, and then eventually the
> bill becomes due. It's hard for me to say categorically that this is a
> bad idea – empirically, it can be very successful! But there are real
> trade-offs. And it makes me a bit nervous that Matt is the one
> proposing this, because I'm pretty sure if you asked him he'd say he's
> absolutely focused on how to get something working ASAP and has no
> plans to maintain numpy in the future.

Rereading this today I realized that it could come across like I have
an issue with Matt specifically. I apologize to anyone who got that
impression (esp. Matt!) -- that definitely wasn't my intent. Matt is
awesome. I should stop writing these things at 2 am.

What I should have said is:

We have an unusual decision to make here, where there are two
plausible approaches that both have significant upsides and downsides,
and whose effects are going to be distributed in a complicated way
across different parts of our community over time. So the big
challenge is to figure out how to take all that into account and weigh
the needs of different stakeholders against each other.

One major argument for the __array_function__ approach is that it has
an actual NEP, which happened because we have a contributor who took
the lead on making it happen, and who's deeply involved in some of the
target projects like dask and sparse, so can make sure that the
proposal will work well for them. That's a huge advantage! But... it
also makes me a *little* nervous, because when you have really
talented and productive contributors like this it's easy to get swept
up in their perspective. So I want to double-check that we're also
thinking about the stakeholders who can't be as active in the
discussion, like "numpy maintainers from the future".

(And I mostly mean this as a "we should keep this in mind" kind of
thing – like I said in my original post, I think moving forward
implementing __array_function__ is a great idea; I just want to be
cautious about getting more experience before committing.)

-n


Re: Proposal to accept NEP-18, __array_function__ protocol

Matthew Rocklin
Hi Nathaniel, 

I appreciate the clarification.  Thank you for that.  For what it's worth, I think that you may overestimate my involvement in the writing of that NEP.  I sat down with Stephan during a Numpy dev meeting and we hacked something together.  Afterwards several other people poured their thoughts into the process.  I'd like to think that my perspective helped to inform this NEP, but it wasn't, by far, the driving force.  If anyone had the strongest hand in the writing process it would probably be Stephan, who I find generally has a more conservative and careful perspective than I do.

That being said, I do think that Numpy would be wise to move quickly here.  I think that the growing fragmentation we see in array computing around Numpy (Tensorflow, Torch, Dask, Sparse, CuPy) is largely due to Numpy moving slowly in the past.  There is, I think, a systemic problem slowly erupting now that the community needs to respond to quickly, if it is possible to do so safely.  I believe that Numpy should absolutely be willing to try something experimental, and then say "nope, that was a bad idea" and retract it if it doesn't work out well.  I think that figuring out all of __array_concatenate__, __array_stack__, __array_foo__, etc. for each of the many cases will take too long to respond to in an effective timeframe.  I believe that we simply don't move quickly enough for this piece-by-piece careful handling of the API to result in Numpy's API becoming a meaningful standard in the broader community in the near future.

That being said, I think that we should engage in this piece-by-piece discussion, and as we figure them out we should slowly encroach on __array_function__ and remove functionality from it, much as __array_ufunc__ is not included in it in the current NEP.  Ideally we should get to exactly where you want to get to.  I perceive the __array_function__ protocol as a sort of necessary stop-gap.

All that being said, this is just my personal stance.  I suspect that each of the authors of the NEP and others who engaged in its careful review have a different perspective, which should probably carry more weight than my own.

Best,
-matt


Re: Proposal to accept NEP-18, __array_function__ protocol

Matthew Brett
In reply to this post by Nathaniel Smith
Hi,

Thanks Nathaniel for this thoughtful response.

On Mon, Aug 13, 2018 at 10:44 AM, Nathaniel Smith <[hidden email]> wrote:
...

> The other approach would be to incrementally add clean, well-defined
> dunder methods like __array_ufunc__, __array_concatenate__, etc. This
> way we end up putting some thought into each interface, making sure
> that it's something we can support, protecting downstream libraries
> from unnecessary complexity (e.g. they can implement
> __array_concatenate__ instead of hstack, vstack, row_stack,
> column_stack, ...), or avoiding adding new APIs entirely (e.g., by
> converting existing functions into ufuncs so __array_ufunc__ starts
> automagically working). And in the end we get a clean list of dunder
> methods that new array container implementations have to define. It's
> plausible to imagine a generic test suite for array containers. (I
> suspect that every library that tries to implement __array_function__
> will end up with accidental behavioral differences, just because the
> numpy API is so vast and contains so many corner cases.) So the
> clean-well-defined-dunders approach has lots of upsides. The big
> downside is that this is a much longer road to go down.

Does everyone agree that, if we had infinite time and resources, this
would be the better solution?

If we devoted all the resources of the current Numpy grant to taking
this track, could we complete it in a reasonable time?

Cheers,

Matthew

Re: Proposal to accept NEP-18, __array_function__ protocol

einstein.edison
On 15. Aug 2018, at 18:25, Matthew Brett <[hidden email]> wrote:

Hi,

Thanks Nathaniel for this thoughtful response.

On Mon, Aug 13, 2018 at 10:44 AM, Nathaniel Smith <[hidden email]> wrote:
...

Does everyone agree that, if we had infinite time and resources, this
would be the better solution?


More resources means (given NumPy’s consensus system) that more people have to agree on the overall design, so in my mind, it might even be slower.

If we devoted all the resources of the current Numpy grant to taking
this track, could we complete it in a reasonable time?

I somehow think that just the design of all these different protocols (even just ironing them all out, ignoring implementation) would take an unreasonably long amount of time, as evidenced by this one NEP.

I’m more in favour of using this one rather conservatively: Perhaps a mailing list consensus before actually adding a function to __array_function__, making sure it won’t hinder too much progress.

I also differ with Nathaniel on one minor point in his comparisons to Firefox, CPython, pytest and Sphinx: we’re not talking about monkey-patching NumPy internals, we’re just talking about monkey-patching the public API. Of course, this is still a cost and can still hinder development, but it’s definitely better than exposing all internals.



Best Regards,
Hameer Abbasi





Re: Proposal to accept NEP-18, __array_function__ protocol

Charles R Harris
In reply to this post by Matthew Brett


On Wed, Aug 15, 2018 at 10:25 AM, Matthew Brett <[hidden email]> wrote:
Hi,

Thanks Nathaniel for this thoughtful response.

On Mon, Aug 13, 2018 at 10:44 AM, Nathaniel Smith <[hidden email]> wrote:
...

Does everyone agree that, if we had infinite time and resources, this
would be the better solution? 

If we devoted all the resources of the current Numpy grant to taking
this track, could we complete it in a reasonable time?

I think it is further down the road than that. Determining a core set of functions would depend on feedback (need), as well as on implementation details for other functions. I think a dependency list for common NumPy functions might be interesting.

That said, I don't think the current proposal is orthogonal to this. I don't expect every NumPy function to make calls through this API, and we should probably try to limit, and list, the functions that implement the proposed mechanism.

Chuck 



Re: Proposal to accept NEP-18, __array_function__ protocol

Matthew Brett
In reply to this post by einstein.edison
Hi,

On Wed, Aug 15, 2018 at 5:36 PM, Hameer Abbasi
<[hidden email]> wrote:

> On 15. Aug 2018, at 18:25, Matthew Brett <[hidden email]> wrote:
>
> Hi,
>
> Thanks Nathaniel for this thoughtful response.
>
> On Mon, Aug 13, 2018 at 10:44 AM, Nathaniel Smith <[hidden email]> wrote:
> ...
> Does everyone agree that, if we had infinite time and resources, this
> would be the better solution?
>
>
> More resources means (given NumPy’s consensus system) that more people have to
> agree on the overall design, so in my mind, it might even be slower.

I don't think that's likely.  As far as I can see, past discussions
have been slow because several people need to get deep down into the
details in order to understand the problem, and then stay focused
through the discussion.  When there is no-one working on that
full-time, it's easy for the discussion to drift into the background,
and the shared understanding is lost.  My suspicion is, to the extent
that Matti and Tyler can devote time and energy to shepherding the
discussion, these will become quicker and more productive.

Cheers,

Matthew

Re: Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
In reply to this post by Nathaniel Smith
Nathaniel,

Thanks for raising these thoughtful concerns. Your independent review of this proposal is greatly appreciated!

See my responses inline below:

On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <[hidden email]> wrote:
The other approach would be to incrementally add clean, well-defined
dunder methods like __array_ufunc__, __array_concatenate__, etc. This
way we end up putting some thought into each interface, making sure
that it's something we can support, protecting downstream libraries
from unnecessary complexity (e.g. they can implement
__array_concatenate__ instead of hstack, vstack, row_stack,
column_stack, ...), or avoiding adding new APIs entirely (e.g., by
converting existing functions into ufuncs so __array_ufunc__ starts
automagically working). And in the end we get a clean list of dunder
methods that new array container implementations have to define. It's
plausible to imagine a generic test suite for array containers. (I
suspect that every library that tries to implement __array_function__
will end up with accidental behavioral differences, just because the
numpy API is so vast and contains so many corner cases.) So the
clean-well-defined-dunders approach has lots of upsides. The big
downside is that this is a much longer road to go down.

RE: accidental differences in behavior:

I actually think that the __array_function__ approach is *less* prone to accidental differences in behavior, because we require implementing every function directly (or it raises an error).

This avoids a classic subclassing problem that has plagued NumPy for years, where overriding the behavior of method A causes apparently unrelated method B to break, because it relied on method A internally. In NumPy, this constrained our implementation of np.median(), because it needed to call np.mean() in order for subclasses implementing units to work properly.
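
(A toy illustration of that hidden coupling; the exact internal call chain varies across NumPy versions:)

    import numpy as np

    class LoggingArray(np.ndarray):
        def mean(self, *args, **kwargs):
            print('mean() called')
            return super().mean(*args, **kwargs)

    x = np.arange(4).view(LoggingArray)
    np.median(x)  # prints 'mean() called': median() delegates to mean()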

There will certainly be accidental differences in behavior for third-party code that *uses* NumPy, but this is basically inevitable for any proposal that allows NumPy's public API to be overloaded. It's also avoided by default by third-party libraries that follow the current best practice of casting all input arrays with np.asarray().

--------------

RE: a hypothetical simplified interface:

The need to implement everything you want to use in NumPy's public API could certainly be onerous, but on the other hand there is a long list of projects that have already done this today -- and these are the projects that most need __array_function__.

I'm sure there are cases where simplification would be warranted, but in particular I don't think __array_concatenate__ has significant advantages over simply implementing __array_function__ for np.concatenate. It's a slightly different spelling, but it basically does the same thing. The level of complexity to implement hstack, vstack, row_stack and column_stack in terms of np.concatenate is pretty minimal. __array_function__ implementors could easily copy and paste code from NumPy or use a third-party helpers library (like NDArrayOperatorsMixin) that provides such implementations.
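
(For example, a faithful hstack is only a few lines on top of concatenate:)

    import numpy as np

    def hstack(tup):
        arrs = [np.atleast_1d(a) for a in tup]
        # NumPy's rule: 1-d inputs join along axis 0, everything else axis 1
        axis = 0 if arrs and arrs[0].ndim == 1 else 1
        return np.concatenate(arrs, axis)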

I also have other concerns about the "simplified API" approach beyond the difficulty of figuring it out, but those are already mentioned in the NEP:

But... this is wishful thinking. No matter what the NEP says, I simply
don't believe that we'll actually go break dask, sparse arrays,
xarray, and sklearn in a numpy point release. Or any numpy release.
Nor should we. If we're serious about keeping this experimental – and
I think that's an excellent idea for now! – then IMO we need to do
something more to avoid getting trapped by backwards compatibility.

I agree, but to be clear, development for dask, sparse and xarray (and even broadly supported machine learning libraries like TensorFlow) still happens at a much faster pace than is currently the case for "core" projects in the SciPy stack like NumPy. It would not be a big deal to encounter breaking changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).

(Side note: sklearn doesn't directly implement any array types, so I don't think it would make use of __array_function__ in any way, except possibly to implement overloadable functions.)
 
My suggestion: at numpy import time, check for an envvar, like say
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
__array_function__ dispatches turn into no-ops. This lets interested
downstream libraries and users try this out, but makes sure that we
won't have a hundred thousand end users depending on it without
realizing.  
 
- makes it easy for end-users to check how much overhead this adds (by
running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just
remove the envvar.

I'm slightly concerned that the cost of reading an environment variable with os.environ could exaggerate the performance cost of __array_function__. It takes about 1 microsecond to read an environment variable on my laptop, which is comparable to the full overhead of __array_function__. So we may want to switch to an explicit Python API instead, e.g., np.enable_experimental_array_function().
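
(That estimate comes from a quick measurement along these lines:)

    import timeit

    # total seconds for one million lookups: ~1 s, i.e. about 1 µs each
    print(timeit.timeit(
        "os.environ.get('NUMPY_EXPERIMENTAL_ARRAY_FUNCTION')",
        setup="import os", number=1_000_000))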

My bigger concern is when/how we decide to graduate __array_function__ from requiring an explicit opt-in. We don't need to make a final decision now, but it would be good to be clear about what specifically we are waiting for.

I see three types of likely scenarios for changing __array_function__:
1. We decide that overloading the NumPy namespace in general is a bad idea, based on either performance or predictability consequences for third-party libraries. In this case, I imagine we would probably keep __array_function__, but revert to a separate namespace for explicitly overloaded functions, e.g., numpy.api.
2. We want to keep __array_function__, but need a breaking change to the interface (and we're really attached to keeping the name __array_function__).
3. We decide that specific functions should use a different interface (e.g., switch from __array_function__ to __array_ufunc__).

(1) and (2) are the sort of major concerns that in my mind would warrant hiding a feature behind an experimental flag. For the most part, I expect (1) could be resolved relatively quickly by running benchmark suites after we have a working version of __array_function__. To be honest, I don't see either of these rollback scenarios as terribly likely, but the downside risk is large enough that we may want to protect ourselves for a major release or two (6-12 months).

(3) will be a much longer process, likely to stretch out over years at the current pace of NumPy development. I don't think we'll want to keep an opt-in flag for this long of a period. Rather, we may want to accept a shorter deprecation cycle than usual. In most cases, I suspect we could incrementally switch to new overloads while preserving the __array_function__ overload for a release or two.

I don't really understand the 'types' frozenset. The NEP says "it will
be used by most __array_function__ methods, which otherwise would need
to extract this information themselves"... but they still need to
extract the information themselves, because they still have to examine
each object and figure out what type it is. And, simply creating a
frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
possibly optimize later...

The most flexible alternative would be to just say that we provide a fixed-length iterable, and return a tuple object. (In my microbenchmarks, it's faster to make a tuple than a list or set.) In an early draft of the NEP, I proposed exactly this, but the speed difference seemed really marginal to me.

I included 'types' in the interface because I really do think it's something that almost all __array_function__ implementations should use. It preserves a nice separation of concerns between dispatching logic and implementations for a new type. At least as long as __array_function__ is experimental, I don't think we should be encouraging people to write functions that could return NotImplemented directly and to rely entirely on the NumPy interface.

Many but not all implementations will need to look at argument types. This is only really essential for cases where mixed operations between NumPy arrays and another type are allowed. If you only implement the NumPy interface for MyArray objects, then in the usual Python style you wouldn't need isinstance checks.

It's also important from an ecosystem perspective. If we don't make it easy to get type information, my guess is that many __array_function__ authors wouldn't bother to return NotImplemented for unexpected types, which means that __array_function__ will break in weird ways when used with objects from unrelated libraries.
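
Concretely, the pattern I'd expect nearly every implementation to start with looks something like this sketch:

    import numpy as np

    class MyArray:
        HANDLED_FUNCTIONS = {}  # filled in with per-function implementations

        def __array_function__(self, func, types, args, kwargs):
            # 'types' makes this up-front check cheap and uniform
            if not all(issubclass(t, (np.ndarray, MyArray)) for t in types):
                return NotImplemented
            if func not in self.HANDLED_FUNCTIONS:
                return NotImplemented
            return self.HANDLED_FUNCTIONS[func](*args, **kwargs)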

Cheers,
Stephan


Re: Proposal to accept NEP-18, __array_function__ protocol

mattip
In reply to this post by Matthew Brett
On 15/08/18 19:44, Matthew Brett wrote:
My suspicion is, to the extent
that Matti and Tyler can devote time and energy to shepherding the
discussion, these will become quicker and more productive.
Since my name was mentioned ...
Even if we could implement pull requests immediately when issues are reported, and reply to all mails, careful review and community consensus take time. See for instance the merge-umath-and-multiarray PR #10915 (NEP 15) and the generalized-ufunc PR #11175 (NEP 20). Progress within NumPy is and always will be slow. If there is an expectation from the community that we take a more active role in moving things forward, that is feedback we need to hear, hopefully with concrete suggestions, preferably in a separate email thread.

Whatever direction __array_function__ or other protocols take, we should create a page of links to implementations of __array_function__ (or for that matter a dtype, gufunc, ndarray, or an __array_ufunc__) so that we can gather data on how these are being used. It would also mean we could test downstream packages when changes are proposed, and we would know who to reach out to when issues arise.

Matti


Re: Proposal to accept NEP-18, __array_function__ protocol

Charles R Harris
In reply to this post by Stephan Hoyer-2
Ping to finish up this discussion so we can come to a conclusion. I'm in favor of the NEP, as I don't see it as orthogonal to Nathaniel's concerns. However, we might want to be selective as to which functions we expose via the `__array_function__` method.


On Wed, Aug 15, 2018 at 10:45 AM, Stephan Hoyer <[hidden email]> wrote:
Nathaniel,

Thanks for raising these thoughtful concerns. Your independent review of this proposal is greatly appreciated!

See my responses inline below:

On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <[hidden email]> wrote:
The other approach would be to incrementally add clean, well-defined
dunder methods like __array_ufunc__, __array_concatenate__, etc. This
way we end up putting some thought into each interface, making sure
that it's something we can support, protecting downstream libraries
from unnecessary complexity (e.g. they can implement
__array_concatenate__ instead of hstack, vstack, row_stack,
column_stack, ...), or avoiding adding new APIs entirely (e.g., by
converting existing functions into ufuncs so __array_ufunc__ starts
automagically working). And in the end we get a clean list of dunder
methods that new array container implementations have to define. It's
plausible to imagine a generic test suite for array containers. (I
suspect that every library that tries to implement __array_function__
will end up with accidental behavioral differences, just because the
numpy API is so vast and contains so many corner cases.) So the
clean-well-defined-dunders approach has lots of upsides. The big
downside is that this is a much longer road to go down.

RE: accidental differences in behavior:

I actually think that the __array_function__ approach is *less* prone to accidental differences in behavior, because we require implementing every function directly (or it raises an error).

This avoids a classic subclassing problem that has plagued NumPy for years, where overriding the behavior of method A causes apparently unrelated method B to break, because it relied on method A internally. In NumPy, this constrained our implementation of np.median(), because it needed to call np.mean() in order for subclasses implementing units to work properly.

There will certainly be accidental differences in behavior for third-party code that *uses* NumPy, but this is basically inevitable for any proposal to allow's NumPy's public API to be overloaded. It's also avoided by default by third-party libraries that follow the current best practice of casting all input arrays with np.asarray().

--------------

RE: a hypothetical simplified interface:

The need to implement everything you want to use in NumPy's public API could certainly be onerous, but on the other hand there are a long list of projects that have already done this today -- and these are the projects that most need __array_function__.

I'm sure there are cases were simplification would be warranted, but in particular I don't think __array_concatenate__ has significant advantages over simply implementing __array_function__ for np.concatenate. It's a slightly different way of spelling, but it basically does the same thing. The level of complexity to implement hstack, vstack, row_stack and column_stack in terms of np.concatenate is pretty minimal. __array_function__ implementors could easily copy and paste code from NumPy or use a third-party helpers library (like NDArrayOperatorsMixin) that provides such implementations.

I also have other concerns about the "simplified API" approach beyond the difficulty of figuring it out, but those are already mentioned in the NEP:

But... this is wishful thinking. No matter what the NEP says, I simply
don't believe that we'll actually go break dask, sparse arrays,
xarray, and sklearn in a numpy point release. Or any numpy release.
Nor should we. If we're serious about keeping this experimental – and
I think that's an excellent idea for now! – then IMO we need to do
something more to avoid getting trapped by backwards compatibility.

I agree, but to be clear, development for dask, sparse and xarray (and even broadly supported machine learning libraries like TensorFlow) still happens at a much faster pace than is currently the case for "core" projects in the SciPy stack like NumPy. It would not be a big deal to encounter breaking changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).

(Side note: sklearn doesn't directly implement any array types, so I don't think it would make use of __array_function__ in any way, except possibly to implement overloadable functions.)

Here is Travis Oliphant's talk at PyBay, where he talks about the proliferation of arrays and interfaces in the ML/AI ecosystem among other things. I think that we should definitely try to get NumPy out there as an option in the near future.

 
My suggestion: at numpy import time, check for an envvar, like say
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
__array_function__ dispatches turn into no-ops. This lets interested
downstream libraries and users try this out, but makes sure that we
won't have a hundred thousand end users depending on it without
realizing.  
 
- makes it easy for end-users to check how much overhead this adds (by
running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just
remove the envvar.

I'm slightly concerned that the cost of reading an environment variable with os.environ could exaggerate the performance cost of __array_function__. It takes about 1 microsecond to read an environment variable on my laptop, which is comparable to the full overhead of __array_function__. So we may want to switch to an explicit Python API instead, e.g., np.enable_experimental_array_function().

My bigger concern is when/how we decide to graduate __array_function__ from requiring an explicit opt-in. We don't need to make a final decision now, but it would be good to be clear about what specifically we are waiting for.

I see three types of likely scenarios for changing __array_function__:
1. We decide that overloading the NumPy namespace in general is a bad idea, based on either performance or predictability consequences for third-party libraries. In this case, I imagine we would probably keep __array_function__, but revert to a separate namespace for explicitly overloaded functions, e.g., numpy.api.
2. We want to keep __array_function__, but need a breaking change to the interface (and we're really attached to keeping the name __array_function__).
3. We decide that specific functions should use a different interface (e.g., switch from __array_function__ to __array_ufunc__).

(1) and (2) are the sort of major concerns that in my mind would warrant hiding a feature behind an experimental flag. For the most part, I expect (1) could be resolved relatively quickly by running benchmark suites after we have a working version of __array_function__. To be honest, I don't see either of these rollback scenarios as terribly likely, but the downside risk is large enough that we may want to protect ourselves for a major release or two (6-12 months).

(3) will be a much longer process, likely to stretch out over years at the current pace of NumPy development. I don't think we'll want to keep an opt-in flag for this long of a period. Rather, we may want to accept a shorter deprecation cycle than usual. In most cases, I suspect we could incrementally switch to new overloads while preserving the __array_function__ overload for a release or two.

I don't really understand the 'types' frozenset. The NEP says "it will
be used by most __array_function__ methods, which otherwise would need
to extract this information themselves"... but they still need to
extract the information themselves, because they still have to examine
each object and figure out what type it is. And, simply creating a
frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
possibly optimize later...

The most flexible alternative would be to just say that we provide a fixed-length iterable, and return a tuple object. (In my microbenchmarks, it's faster to make a tuple than a list or set.) In an early draft of the NEP, I proposed exactly this, but the speed difference seemed really marginal to me.

I included 'types' in the interface because I really do think it's something that almost all __array_function__ implementations should use. It preserves a nice separation of concerns between dispatching logic and implementations for a new type. At least as long as __array_function__ is experimental, I don't think we should be encouraging people to write functions that could return NotImplemented directly and to rely entirely on the NumPy interface.

Many but not all implementations will need to look at argument types. This is only really essential for cases where mixed operations between NumPy arrays and another type are allowed. If you only implement the NumPy interface for MyArray objects, then in the usual Python style you wouldn't need isinstance checks.

It's also important from an ecosystem perspective. If we don't make it easy to get type information, my guess is that many __array_function__ authors wouldn't bother to return NotImplemented for unexpected types, which means that __array_function__ will break in weird ways when used with objects from unrelated libraries.
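
Extending the sketch above, the separation of concerns described here is a couple of lines at the top of __array_function__, shared by every overloaded function:

    class MyArray:
        def __array_function__(self, func, types, args, kwargs):
            # 'types' is collected by NumPy's dispatcher; this one check
            # covers every overload, so objects from unrelated libraries
            # get a clean NotImplemented instead of a confusing failure
            if not all(issubclass(t, (MyArray, np.ndarray)) for t in types):
                return NotImplemented
            if func not in HANDLED:
                return NotImplemented
            return HANDLED[func](*args, **kwargs)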
 

Chuck 


Re: Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
On Mon, Aug 20, 2018 at 8:27 AM Charles R Harris <[hidden email]> wrote:
Ping to finish up this discussion so we can come to a conclusion. I'm in favor of the NEP, as I don't see it as orthogonal to Nathaniel's concerns. However, we might want to be selective as to which functions we expose via the `__array_function__` method.

Chuck -- thanks for bringing this back up.

My proposal is to hide this feature behind an environment variable for now, or something morally equivalent to an environment variable if that's too slow (i.e., an explicit Python variable). I don't think Nathaniel's concerns are entirely warranted for the reasons I went into in my earlier reply, but I do really want to get this moving forward now in whatever way is necessary. We can figure out the rest down the road.

Nathaniel -- are you OK with that?
 
Here is Travis Oliphant's talk at PyBay, where he talks about the proliferation of arrays and interfaces in the ML/AI ecosystem among other things. I think that we should definitely try to get NumPy out there as an option in the near future.

Yes, there is an urgent need for this :).

Cheers,
Stephan


Re: Proposal to accept NEP-18, __array_function__ protocol

Nathaniel Smith
In reply to this post by Stephan Hoyer-2
On Wed, Aug 15, 2018 at 9:45 AM, Stephan Hoyer <[hidden email]> wrote:

> RE: accidental differences in behavior:
>
> I actually think that the __array_function__ approach is *less* prone to
> accidental differences in behavior, because we require implementing every
> function directly (or it raises an error).
>
> This avoids a classic subclassing problem that has plagued NumPy for years,
> where overriding the behavior of method A causes apparently unrelated method
> B to break, because it relied on method A internally. In NumPy, this
> constrained our implementation of np.median(), because it needed to call
> np.mean() in order for subclasses implementing units to work properly.

I don't think I follow... if B uses A internally, then overriding A
shouldn't cause B to break, unless the overridden A is buggy.

The median() case was different: it wasn't overriding A that caused B
to break, that part worked fine. It was when we changed the
implementation of B that we had problems.

...actually, this made me realize that I was uncertain about what
exactly happened in that case. I just checked, and AFAICT with current
astropy the call to mean() is unnecessary. I tried modifying np.median
to remove the call to mean, and it still gave the same result for
np.median([1, 2, 3] * u.m). I also checked np.percentile, and it seems
to work fine on units-arrays if you make it call np.asanyarray instead
of np.asarray.
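
(For reference, the check described above amounts to something like this, assuming astropy is installed:)

    import numpy as np
    import astropy.units as u

    q = [1, 2, 3] * u.m   # Quantity, an ndarray subclass carrying units
    np.median(q)          # Quantity(2.0, unit='m'), with or without the
                          # internal call to mean()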

The issue is here if anyone else wants to refresh their memory:
https://github.com/numpy/numpy/issues/3846

Reading the comments there it looks like we might have hacked
np.median so it (a) uses np.asanyarray, and (b) always calls np.mean,
and actually the 'asanyarray' change is what astropy needed and the
'mean' part was just a red herring all along! Whoops. And here I've
been telling this story for years now...

> RE: a hypothetical simplified interface:
>
> The need to implement everything you want to use in NumPy's public API could
> certainly be onerous, but on the other hand there are a long list of
> projects that have already done this today -- and these are the projects
> that most need __array_function__.
>
> I'm sure there are cases where simplification would be warranted, but in
> particular I don't think __array_concatenate__ has significant advantages
> over simply implementing __array_function__ for np.concatenate. It's a
> slightly different way of spelling, but it basically does the same thing.
> The level of complexity to implement hstack, vstack, row_stack and
> column_stack in terms of np.concatenate is pretty minimal.
> __array_function__ implementors could easily copy and paste code from NumPy
> or use a third-party helpers library (like NDArrayOperatorsMixin) that
> provides such implementations.

And when we fix a bug in row_stack, this means we also have to fix it
in all the copy-paste versions, which won't happen, so np.row_stack
has different semantics on different objects, even if they started out
matching. The NDArrayOperatorsMixin reduces the number of duplicate
copies of the same code that need to be updated, but 2 copies is still
a lot worse than 1 copy :-).

> I also have other concerns about the "simplified API" approach beyond the
> difficulty of figuring it out, but those are already mentioned in the NEP:
> http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api

Yeah, there are definitely trade-offs. I don't have like, knock-down
rebuttals to these or anything, but since I didn't comment on them
before I might as well say a few words :-).

> 1. The details of how NumPy implements a high-level function in terms of overloaded functions now becomes an implicit part of NumPy’s public API. For example, refactoring stack to use np.block() instead of np.concatenate() internally would now become a breaking change.

The way I'm imagining this would work is, we guarantee not to take a
function that used to be implemented in terms of overridable
operations, and refactor it so it's implemented in terms of
overridable operations. So long as people have correct implementations
of __array_concatenate__ and __array_block__, they shouldn't care
which one we use. In the interim period where we have
__array_concatenate__ but there's no such thing as __array_block__,
then that refactoring would indeed break things, so we shouldn't do
that :-). But we could fix that by adding __array_block__.

> 2. Array libraries may prefer to implement high level functions differently than NumPy. For example, a library might prefer to implement a fundamental operation like mean() directly rather than relying on sum() followed by division. More generally, it’s not clear yet what exactly qualifies as core functionality, and figuring this out could be a large project.

True. And this is a very general problem... for example, the
appropriate way to implement logistic regression is very different
in-core versus out-of-core. You're never going to be able to take code
written for ndarray, drop in an arbitrary new array object, and get
optimal results in all cases -- that's just way too ambitious to hope
for. There will be cases where reducing to operations like sum() and
division is fine. There will be cases where you have a high-level
operation like logistic regression, where reducing to sum() and
division doesn't work, but reducing to slightly-higher-level
operations like np.mean also doesn't work, because you need to redo
the whole high-level operation. And then there will be cases where
sum() and division are too low-level, but mean() is high-level enough
to make the critical difference. It's that last one where it's
important to be able to override mean() directly. Are there a lot of
cases like this?

For mean() in particular I doubt it. But then, mean() in particular is
irrelevant here, because mean() is already directly overridable,
regardless of __array_function__ :-). So really the question is about
the larger landscape of numpy APIs: What traps are lurking in the
underbrush that we don't know about? And yeah, the intuition behind
the "simplified API" approach is that we need to do the work to clear
out that underbrush, and the downside is exactly that that will be a
lot of work and take time. So... I think this criticism is basically
that restated?

> 3. We don’t yet have an overloading system for attributes and methods on array objects, e.g., for accessing .dtype and .shape. This should be the subject of a future NEP, but until then we should be reluctant to rely on these properties.

This one I don't understand. If you have a duck-array object, and you
want to access its .dtype or .shape attributes, you just... write
myobj.dtype or myobj.shape? That doesn't need a NEP though so I must
be missing something :-).

>> But... this is wishful thinking. No matter what the NEP says, I simply
>> don't believe that we'll actually go break dask, sparse arrays,
>> xarray, and sklearn in a numpy point release. Or any numpy release.
>> Nor should we. If we're serious about keeping this experimental – and
>> I think that's an excellent idea for now! – then IMO we need to do
>> something more to avoid getting trapped by backwards compatibility.
>
>
> I agree, but to be clear, development for dask, sparse and xarray (and even
> broadly supported machine learning libraries like TensorFlow) still happens
> at a much faster pace than is currently the case for "core" projects in the
> SciPy stack like NumPy. It would not be a big deal to encounter breaking
> changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).
>
> (Side note: sklearn doesn't directly implement any array types, so I don't
> think it would make use of __array_function__ in any way, except possibly to
> implement overloadable functions.)

They don't implement array types, but they do things like use sparse
arrays internally, so from the user's point of view you could have
some code that only uses numpy and sklearn, and then the new numpy
release breaks sklearn (because it broke the sparse package that
sklearn was using internally).

>>
>> My suggestion: at numpy import time, check for an envvar, like say
>> NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
>> __array_function__ dispatches turn into no-ops. This lets interested
>> downstream libraries and users try this out, but makes sure that we
>> won't have a hundred thousand end users depending on it without
>> realizing.
>>
>>
>>
>> - makes it easy for end-users to check how much overhead this adds (by
>> running their code with it enabled vs disabled)
>> - if/when we decide to commit to supporting it for real, we just
>> remove the envvar.
>
>
> I'm slightly concerned that the cost of reading an environment variable with
> os.environ could exaggerate the performance cost of __array_function__. It
> takes about 1 microsecond to read an environment variable on my laptop,
> which is comparable to the full overhead of __array_function__.

That's why I said "at numpy import time" :-). I was imagining we'd
check it once at import, and then from then on it'd be stashed in some
C global, so after that the overhead would just be a single
predictable branch 'if (array_function_is_enabled) { ... }'.
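
In pure Python, that import-time check looks roughly like this -- a sketch of the idea only, not NumPy's actual code (the real flag would live in a C global):

    import os

    # evaluated exactly once, when numpy is first imported
    _ENABLED = os.environ.get('NUMPY_EXPERIMENTAL_ARRAY_FUNCTION', '0') == '1'

    def _full_dispatch(public_func, args, kwargs):
        ...  # stand-in for the actual __array_function__ machinery
        return public_func(*args, **kwargs)

    def _maybe_dispatch(public_func, args, kwargs):
        if not _ENABLED:
            # a single predictable branch; no per-call os.environ lookup
            return public_func(*args, **kwargs)
        return _full_dispatch(public_func, args, kwargs)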

> So we may
> want to switch to an explicit Python API instead, e.g.,
> np.enable_experimental_array_function().

If we do this, then libraries that want to use __array_function__ will
just call it themselves at import time. The point of the env-var is
that our policy is not to break end-users, so if we want an API to be
provisional and experimental then it's end-users who need to be aware
of that before using it. (This is also an advantage of checking the
envvar only at import time: it means libraries can't easily just
setenv() to enable the functionality behind users' backs.)

> My bigger concern is when/how we decide to graduate __array_function__ from
> requiring an explicit opt-in. We don't need to make a final decision now,
> but it would be good to be clear about what specifically we are waiting for.

The motivation for keeping it provisional is that we'll know more
after we have some implementation experience, so our future selves
will be in a better position to make this decision. If I knew what I
was waiting for, I might not need to wait :-).

But yeah, to be clear, I'm totally OK with the possibility that we'll
do this for a few releases and then look again and be like "eh... now
that we have more experience, it looks like the original plan was fine
after all, let's remove the envvar and document some kind of
accelerated deprecation cycle".

>> I don't really understand the 'types' frozenset. The NEP says "it will
>> be used by most __array_function__ methods, which otherwise would need
>> to extract this information themselves"... but they still need to
>> extract the information themselves, because they still have to examine
>> each object and figure out what type it is. And, simply creating a
>> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
>> possibly optimize later...
>
>
> The most flexible alternative would be to just say that we provide a
> fixed-length iterable, and return a tuple object. (In my microbenchmarks,
> it's faster to make a tuple than a list or set.) In an early draft of the
> NEP, I proposed exactly this, but the speed difference seemed really
> marginal to me.
>
> I included 'types' in the interface because I really do think it's something
> that almost all __array_function__ implementations should use. It
> preserves a nice separation of concerns between dispatching logic and
> implementations for a new type. At least as long as __array_function__ is
> experimental, I don't think we should be encouraging people to write
> functions that could return NotImplemented directly and to rely entirely on
> the NumPy interface.
>
> Many but not all implementations will need to look at argument types. This
> is only really essential for cases where mixed operations between NumPy
> arrays and another type are allowed. If you only implement the NumPy
> interface for MyArray objects, then in the usual Python style you wouldn't
> need isinstance checks.
>
> It's also important from an ecosystem perspective. If we don't make it easy
> to get type information, my guess is that many __array_function__ authors
> wouldn't bother to return NotImplemented for unexpected types, which means
> that __array_function__ will break in weird ways when used with objects from
> unrelated libraries.

This is much more of a detail as compared to the rest of the
discussion, so I don't want to quibble too much about it. (Especially
since if we keep things really-provisional, we can change our mind
about the argument later :-).) Mostly I'm just confused, because there
are lots of __dunder__ functions in Python (and NumPy), and none of
them take a special 'types' argument... so what's special about
__array_function__ that makes it necessary/worthwhile?

Any implementation of, say, concatenate-via-array_function is going to
involve iterating through all the arguments and looking at each of
them to figure out what kind of object it is and how to handle it,
right? That's true whether or not they've done a "pre-check" using the
types set, so in theory it's just as easy to return NotImplemented at
that point. But I guess your point in the last paragraph is that this
means there will be lots of chances to mess up the
NotImplemented-returning code in particular, especially since it's
less likely to be tested than the happy path, which seems plausible.
So basically the point of the types set is to let people factor out
that little bit of lots of functions into one common place? I guess
some careful devs might be unhappy with paying extra so that other
lazier devs can get away with being lazy, but maybe it's a good
tradeoff for us (esp. since as numpy devs, we'll be getting the bug
reports regardless :-)).

If that's the goal, then it does make me wonder if there might be a
more direct way to accomplish it -- like, should we let classes define
an __array_function_types__ attribute that numpy would check before
even trying to dispatch to __array_function__?
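
An entirely hypothetical sketch of that idea (__array_function_types__ is not part of NEP 18), with the dispatcher doing the pre-check so individual overloads don't have to:

    def _dispatch(func, relevant_args, args, kwargs):
        arg_types = tuple({type(arg) for arg in relevant_args})
        for arg in relevant_args:
            handled = getattr(type(arg), '__array_function_types__', None)
            if handled is not None and not all(
                    issubclass(t, handled) for t in arg_types):
                continue  # declared types don't cover all arguments
            overload = getattr(type(arg), '__array_function__', None)
            if overload is None:
                continue
            result = overload(arg, func, arg_types, args, kwargs)
            if result is not NotImplemented:
                return result
        return func(*args, **kwargs)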

-n

--
Nathaniel J. Smith -- https://vorpus.org

Re: Proposal to accept NEP-18, __array_function__ protocol

Marten van Kerkwijk

>> I don't really understand the 'types' frozenset. The NEP says "it will
>> be used by most __array_function__ methods, which otherwise would need
>> to extract this information themselves"... but they still need to
>> extract the information themselves, because they still have to examine
>> each object and figure out what type it is. And, simply creating a
>> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
>> possibly optimize later...
>
>
> The most flexible alternative would be to just say that we provide a
> fixed-length iterable, and return a tuple object. (In my microbenchmarks,
> it's faster to make a tuple than a list or set.) In an early draft of the
> NEP, I proposed exactly this, but the speed difference seemed really
> marginal to me.
>
> I included 'types' in the interface because I really do think it's something
> that almost all __array_function__ implementations should use. It
> preserves a nice separation of concerns between dispatching logic and
> implementations for a new type. At least as long as __array_function__ is
> experimental, I don't think we should be encouraging people to write
> functions that could return NotImplemented directly and to rely entirely on
> the NumPy interface.
>
> Many but not all implementations will need to look at argument types. This
> is only really essential for cases where mixed operations between NumPy
> arrays and another type are allowed. If you only implement the NumPy
> interface for MyArray objects, then in the usual Python style you wouldn't
> need isinstance checks.
>
> It's also important from an ecosystem perspective. If we don't make it easy
> to get type information, my guess is that many __array_function__ authors
> wouldn't bother to return NotImplemented for unexpected types, which means
> that __array_function__ will break in weird ways when used with objects from
> unrelated libraries.

This is much more of a detail as compared to the rest of the
discussion, so I don't want to quibble too much about it. (Especially
since if we keep things really-provisional, we can change our mind
about the argument later :-).) Mostly I'm just confused, because there
are lots of __dunder__ functions in Python (and NumPy), and none of
them take a special 'types' argument... so what's special about
__array_function__ that makes it necessary/worthwhile?

Any implementation of, say, concatenate-via-array_function is going to
involve iterating through all the arguments and looking at each of
them to figure out what kind of object it is and how to handle it,
right? That's true whether or not they've done a "pre-check" using the
types set, so in theory it's just as easy to return NotImplemented at
that point. But I guess your point in the last paragraph is that this
means there will be lots of chances to mess up the
NotImplemented-returning code in particular, especially since it's
less likely to be tested than the happy path, which seems plausible.
So basically the point of the types set is to let people factor out
that little bit of lots of functions into one common place? I guess
some careful devs might be unhappy with paying extra so that other
lazier devs can get away with being lazy, but maybe it's a good
tradeoff for us (esp. since as numpy devs, we'll be getting the bug
reports regardless :-)).

If that's the goal, then it does make me wonder if there might be a
more direct way to accomplish it -- like, should we let classes define
an __array_function_types__ attribute that numpy would check before
even trying to dispatch to __array_function__?

I quite like that idea; I've not been enchanted by the extra `types` either - like `method` in `__array_ufunc__`, it could become quite superfluous.

-- Marten


Re: Proposal to accept NEP-18, __array_function__ protocol

einstein.edison
I’m +0 on removing it, so mostly neutral, but slightly in favour. While I see the argument for having it, I also see it as a violation of DRY... The information is already available in the relevant arguments.

I doubt anyone implementing this protocol is going to be lazy enough not to implement a type check. So far, we’ve been good on __array_ufunc__.

The part of me that says it’s good to have it is mostly the “squeeze every bit of performance out” part.

Best Regards
Hameer Abbasi
Sent from my iPhone

On 21. Aug 2018, at 10:34, Marten van Kerkwijk <[hidden email]> wrote:


>> I don't really understand the 'types' frozenset. The NEP says "it will
>> be used by most __array_function__ methods, which otherwise would need
>> to extract this information themselves"... but they still need to
>> extract the information themselves, because they still have to examine
>> each object and figure out what type it is. And, simply creating a
>> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
>> possibly optimize later...
>
>
> The most flexible alternative would be to just say that we provide a
> fixed-length iterable, and return a tuple object. (In my microbenchmarks,
> it's faster to make a tuple than a list or set.) In an early draft of the
> NEP, I proposed exactly this, but the speed difference seemed really
> marginal to me.
>
> I included 'types' in the interface because I really do think it's something
> that almost all __array_function__ implementations should use. It
> preserves a nice separation of concerns between dispatching logic and
> implementations for a new type. At least as long as __array_function__ is
> experimental, I don't think we should be encouraging people to write
> functions that could return NotImplemented directly and to rely entirely on
> the NumPy interface.
>
> Many but not all implementations will need to look at argument types. This
> is only really essential for cases where mixed operations between NumPy
> arrays and another type are allowed. If you only implement the NumPy
> interface for MyArray objects, then in the usual Python style you wouldn't
> need isinstance checks.
>
> It's also important from an ecosystem perspective. If we don't make it easy
> to get type information, my guess is that many __array_function__ authors
> wouldn't bother to return NotImplemented for unexpected types, which means
> that __array_function__ will break in weird ways when used with objects from
> unrelated libraries.

This is much more of a detail as compared to the rest of the
discussion, so I don't want to quibble too much about it. (Especially
since if we keep things really-provisional, we can change our mind
about the argument later :-).) Mostly I'm just confused, because there
are lots of __dunder__ functions in Python (and NumPy), and none of
them take a special 'types' argument... so what's special about
__array_function__ that makes it necessary/worthwhile?

Any implementation of, say, concatenate-via-array_function is going to
involve iterating through all the arguments and looking at each of
them to figure out what kind of object it is and how to handle it,
right? That's true whether or not they've done a "pre-check" using the
types set, so in theory it's just as easy to return NotImplemented at
that point. But I guess your point in the last paragraph is that this
means there will be lots of chances to mess up the
NotImplemented-returning code in particular, especially since it's
less likely to be tested than the happy path, which seems plausible.
So basically the point of the types set is to let people factor out
that little bit of lots of functions into one common place? I guess
some careful devs might be unhappy with paying extra so that other
lazier devs can get away with being lazy, but maybe it's a good
tradeoff for us (esp. since as numpy devs, we'll be getting the bug
reports regardless :-)).

If that's the goal, then it does make me wonder if there might be a
more direct way to accomplish it -- like, should we let classes define
an __array_function_types__ attribute that numpy would check before
even trying to dispatch to __array_function__?

I quite like that idea; I've not been enchanted by the extra `types` either - like `method` in `__array_ufunc__`, it could become quite superfluous.

-- Marten

Re: Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
In reply to this post by Nathaniel Smith
On Tue, Aug 21, 2018 at 12:21 AM Nathaniel Smith <[hidden email]> wrote:
On Wed, Aug 15, 2018 at 9:45 AM, Stephan Hoyer <[hidden email]> wrote:
> This avoids a classic subclassing problem that has plagued NumPy for years,
> where overriding the behavior of method A causes apparently unrelated method
> B to break, because it relied on method A internally. In NumPy, this
> constrained our implementation of np.median(), because it needed to call
> np.mean() in order for subclasses implementing units to work properly.

I don't think I follow... if B uses A internally, then overriding A
shouldn't cause B to break, unless the overridden A is buggy.

Let me try another example with arrays with units. My understanding of the contract provided by unit implementations is that their behavior should never deviate from NumPy unless an operation raises an error. (This is more explicit for arrays with units because they raise errors for operations with incompatible units, but practically speaking almost all duck arrays will have at least some unsupported operations in NumPy's giant API.)

It is quite possible that NumPy functions could be (re)written in a way that is incompatible with some unit implementations but is perfectly valid for "full" duck arrays. We actually see this even within NumPy already -- for example, see the recent PR adding support for the datetime64 dtype to percentile.

A lesser form of this is when changes in NumPy cause performance issues for users of duck arrays, which is basically inevitable if we share implementations.

I don't think it's possible to anticipate all of these cases, and I don't want NumPy to be unduly constrained in its internal design. I want our user support answer to be simple: if you care about performance for a particular array operation on your type of arrays, you should implement it yourself (i.e., with __array_function__).

This definitely doesn't preclude the careful, systematic overriding approach. But I think we'll almost always want NumPy's external API to be overridable.

And when we fix a bug in row_stack, this means we also have to fix it
in all the copy-paste versions, which won't happen, so np.row_stack
has different semantics on different objects, even if they started out
matching. The NDArrayOperatorsMixin reduces the number of duplicate
copies of the same code that need to be updated, but 2 copies is still
a lot worse than 1 copy :-).

I see your point, but in all seriousness if we encounter a bug in np.row_stack at this point we might just call it a feature instead.
 
> 1. The details of how NumPy implements a high-level function in terms of overloaded functions now becomes an implicit part of NumPy’s public API. For example, refactoring stack to use np.block() instead of np.concatenate() internally would now become a breaking change.

The way I'm imagining this would work is, we guarantee not to take a
function that used to be implemented in terms of overridable
operations, and refactor it so it's implemented in terms of
overridable operations. So long as people have correct implementations
of __array_concatenate__ and __array_block__, they shouldn't care
which one we use. In the interim period where we have
__array_concatenate__ but there's no such thing as __array_block__,
then that refactoring would indeed break things, so we shouldn't do
that :-). But we could fix that by adding __array_block__.

""we guarantee not to take a function that used to be implemented in terms of overridable operations, and refactor it so it's implemented in terms of overridable operations"
Did you miss a "not" in here somewhere, e.g., "refactor it so it's NOT implemented"?

If we ever tried to do something like this, I'm pretty sure that it just wouldn't happen -- unless we also change NumPy's extremely conservative approach to breaking third-party code. np.block() is much more complex to implement than np.concatenate(), and users would resist being forced to handle that complexity if they don't need it. (Example: TensorFlow has a concatenate function, but not block.)
 
> 2. Array libraries may prefer to implement high level functions differently than NumPy. For example, a library might prefer to implement a fundamental operation like mean() directly rather than relying on sum() followed by division. More generally, it’s not clear yet what exactly qualifies as core functionality, and figuring this out could be a large project.

True. And this is a very general problem... for example, the
appropriate way to implement logistic regression is very different
in-core versus out-of-core. You're never going to be able to take code
written for ndarray, drop in an arbitrary new array object, and get
optimal results in all cases -- that's just way too ambitious to hope
for. There will be cases where reducing to operations like sum() and
division is fine. There will be cases where you have a high-level
operation like logistic regression, where reducing to sum() and
division doesn't work, but reducing to slightly-higher-level
operations like np.mean also doesn't work, because you need to redo
the whole high-level operation. And then there will be cases where
sum() and division are too low-level, but mean() is high-level enough
to make the critical difference. It's that last one where it's
important to be able to override mean() directly. Are there a lot of
cases like this?

mean() is not entirely hypothetical. TensorFlow and Eigen actually do implement mean separately from sum, though to be honest it's not entirely clear to me why.

I do think this probably will come up with some frequency for other operations, but the bigger answer here really is consistency -- it allows projects and their users to have very clearly defined dependencies on NumPy's API. They don't need to worry about any implementation details from NumPy leaking into their override of a function.
 
> 3. We don’t yet have an overloading system for attributes and methods on array objects, e.g., for accessing .dtype and .shape. This should be the subject of a future NEP, but until then we should be reluctant to rely on these properties.

This one I don't understand. If you have a duck-array object, and you
want to access its .dtype or .shape attributes, you just... write
myobj.dtype or myobj.shape? That doesn't need a NEP though so I must
be missing something :-).

We don't have np.asduckarray() yet (or whatever we'll end up calling our proposed casting function from NEP 22), so we don't have a fully fleshed out mechanism for NumPy to declare "this object needs to support .shape and .dtype, or I'm going to cast it into something that does".
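
To make the gap concrete, one possible spelling (hypothetical -- NEP 22 hasn't settled on a name or semantics):

    import numpy as np

    def asduckarray(obj):
        # pass anything that opts into the protocol through unchanged,
        # and coerce everything else, so that .shape and .dtype can
        # safely be relied on afterwards
        if hasattr(type(obj), '__array_function__'):
            return obj
        return np.asarray(obj)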

More comments on the environment variable and the interface to come in my next email...

Cheers,
Stephan


Re: Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer-2
In reply to this post by Nathaniel Smith
On Tue, Aug 21, 2018 at 12:21 AM Nathaniel Smith <[hidden email]> wrote:
>> My suggestion: at numpy import time, check for an envvar, like say
>> NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
>> __array_function__ dispatches turn into no-ops. This lets interested
>> downstream libraries and users try this out, but makes sure that we
>> won't have a hundred thousand end users depending on it without
>> realizing.
>>
>>
>>
>> - makes it easy for end-users to check how much overhead this adds (by
>> running their code with it enabled vs disabled)
>> - if/when we decide to commit to supporting it for real, we just
>> remove the envvar.
>
>
> I'm slightly concerned that the cost of reading an environment variable with
> os.environ could exaggerate the performance cost of __array_function__. It
> takes about 1 microsecond to read an environment variable on my laptop,
> which is comparable to the full overhead of __array_function__.

That's why I said "at numpy import time" :-). I was imagining we'd
check it once at import, and then from then on it'd be stashed in some
C global, so after that the overhead would just be a single
predictable branch 'if (array_function_is_enabled) { ... }'.

Indeed, I missed the "at numpy import time" bit :).

In that case, I'm concerned that it isn't always possible to set environment variables once before importing NumPy. The environment variable solution works great if users have full control of their own Python binaries, but that isn't always the case today in this era of server-less infrastructure and online notebooks.

One example offhand is Google's Colaboratory (https://research.google.com/colaboratory), a web-based Jupyter notebook. NumPy is always loaded when a notebook is opened, as you can check from inspecting sys.modules. Now, I work with the developers of Colaboratory, so we could probably figure out a work-around together, but I'm pretty sure this would also come up in the context of other tools.

Another problem is unit testing. Does pytest use a separate Python process for running the tests in each file? I don't know, and that feels like an implementation detail that I shouldn't have to know :). Yes, in principle I could use a subprocess for my __array_function__ unit tests, but that would be really awkward.

> So we may
> want to switch to an explicit Python API instead, e.g.,
> np.enable_experimental_array_function().

If we do this, then libraries that want to use __array_function__ will
just call it themselves at import time. The point of the env-var is
that our policy is not to break end-users, so if we want an API to be
provisional and experimental then it's end-users who need to be aware
of that before using it. (This is also an advantage of checking the
envvar only at import time: it means libraries can't easily just
setenv() to enable the functionality behind users' backs.)

I'm in complete agreement that only authors of end-user applications should invoke this option, but just because something is technically possible doesn't mean that people will actually do it or that we need to support that use case :).

numpy.seterr() is a good example. It allows users to globally set how NumPy does error handling, but well-written libraries still don't do that.
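
(Both calls below are existing NumPy APIs; the contrast is the point:)

    import numpy as np

    old = np.seterr(divide='raise')     # global: fine at application level,
    np.seterr(**old)                    # rude from a library unless restored

    with np.errstate(divide='ignore'):  # what a well-written library does:
        np.array([1.0]) / 0.0           # scoped change, global state untouched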

TensorFlow has a similar function, tf.enable_eager_execution(), for enabling "eager mode", which is also worth examining.

To solve the testing issue, they wrote a decorator for use with tests, run_in_graph_and_eager_modes(): https://www.tensorflow.org/api_docs/python/tf/contrib/eager/run_test_in_graph_and_eager_modes
