asanyarray vs. asarray

Re: asanyarray vs. asarray

Chris Barker - NOAA Federal
On Tue, Oct 30, 2018 at 2:22 PM, Stephan Hoyer <[hidden email]> wrote:
The Liskov substitution principle (LSP) suggests that the reasonable ndarray subclasses are exactly those that could also in principle correspond to a new dtype. Of np.ndarray subclasses in widespread use, I think only the various "array with units" types come close to satisfying this criterion. They only fall short insofar as they present a misleading dtype (without unit information).

How about subclasses that only add functionality? My only use case of subclassing is exactly that:

I have a "bounding box" object (it probably could have been called a rectangle) that is a subclass of ndarray, always has shape (2,2), and has various methods for merging two such boxes, adding a point, etc.

I did it that way, 'cause I had a lot of code already that simply used a (2,2) array to represent a bounding box, and I wanted all that code to still work.

I have had zero problems with it.

Maybe that's too trivial to be worth talking about, but this kind of use case can be handy.

It is a bit awkward to write the code, though -- it would be nice to have a cleaner API for this sort of subclassing (not that I have any idea how to do that).
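For readers who want something concrete, here is a minimal sketch of what such a class could look like. It is not the code described above: the name BBox, the [[xmin, ymin], [xmax, ymax]] layout, and the methods merge and add_point are all assumptions made for illustration.

```python
import numpy as np


class BBox(np.ndarray):
    """Hypothetical (2, 2) box laid out as [[xmin, ymin], [xmax, ymax]]."""

    def __new__(cls, data):
        arr = np.array(data, dtype=float)  # always copies the input
        if arr.shape != (2, 2):
            raise ValueError("a bounding box must have shape (2, 2)")
        return arr.view(cls)

    def merge(self, other):
        """Return the smallest box that contains both boxes."""
        a, b = np.asarray(self), np.asarray(other)
        return type(self)([np.minimum(a[0], b[0]), np.maximum(a[1], b[1])])

    def add_point(self, point):
        """Return a box grown, if needed, to contain `point`."""
        p = np.asarray(point, dtype=float)
        return self.merge([p, p])


def width(box):
    # Older code that only knows about plain (2, 2) arrays keeps working.
    box = np.asarray(box)
    return box[1, 0] - box[0, 0]


b1 = BBox([[0.0, 0.0], [1.0, 1.0]])
b2 = BBox([[0.5, -1.0], [2.0, 0.5]])
merged = b1.merge(b2)             # [[0, -1], [2, 1]]
grown = b1.add_point([3.0, 0.5])  # [[0, 0], [3, 1]]
```

Some of the awkwardness shows up even here: the shape check lives in __new__, but views such as b1[0] are created without going through __new__, so they are still BBox instances even though their shape is (2,). Note also that np.asarray(merged) hands back a plain ndarray, while np.asanyarray(merged) passes the BBox through unchanged.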

The main problem with subclassing for numpy.ndarray is that it guarantees too much: a large set of operations/methods along with a specific memory layout exposed as part of its public API.

This is a big deal -- we really have two concepts here:
 - a Python class (type) with certain behaviors in Python code
 - a wrapper around a strided memory block.

Maybe it's possible to be clear about that distinction:

"Duck arrays" are the Python API.

Maybe a C-API object that shares the memory layout, but could have completely different functionality at the Python level, would be useful.
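The two concepts can be seen side by side in a small sketch. The subclass Tagged and its describe method are made up for the example; the point is only that a view with a different Python type still wraps the same strided memory block.

```python
import numpy as np


class Tagged(np.ndarray):
    """Hypothetical subclass that adds only Python-level behaviour."""

    def describe(self):
        return f"{type(self).__name__}(shape={self.shape}, strides={self.strides})"


base = np.arange(6, dtype=np.int64).reshape(2, 3)
view = base.view(Tagged)      # new Python type, same memory block

view[0, 0] = 99               # writing through the view changes `base` too
same_block = np.shares_memory(base, view)
same_layout = (view.shape, view.strides) == (base.shape, base.strides)
```

Here the Python-level type differs (Tagged has an extra method) while the shape, strides, and underlying buffer are identical, which is the distinction drawn in the two bullets above.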

- CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: asanyarray vs. asarray

Matthew Harrigan
In reply to this post by Stephan Hoyer-2
Would the extended dtypes also violate the Liskov substitution principle? In-place operations that would mutate the dtype are one potential issue. Would a single dtype for an array be sufficient, e.g. for np.polynomial coefficients? Compared to ndarray subclasses, the memory layout issue goes away, but there is still a large set of operations exposed as part of a public API with various quirks. I can imagine a new function "asunitless" scattered around downstream projects.
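The in-place concern already has an analogue with today's built-in dtypes, which may help make it concrete. This is only an illustration with plain integer and float arrays, not a statement about how extended dtypes would behave.

```python
import numpy as np

counts = np.arange(3)          # integer dtype

try:
    counts += 0.5              # the result would need a float dtype
    outcome = "dtype changed"
except TypeError:              # NumPy raises a TypeError subclass here
    outcome = "refused"

shifted = counts + 0.5         # out-of-place: a new float64 array
```

An array cannot change its dtype in place, so the in-place addition is refused under the default casting rule, while the out-of-place version simply returns an array with a different dtype. A dtype whose "kind" would change under an operation (a unit, for instance) would run into the same question.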

On Tue, Oct 30, 2018 at 5:23 PM Stephan Hoyer <[hidden email]> wrote:
On Mon, Oct 29, 2018 at 9:49 PM Eric Wieser <[hidden email]> wrote:

The latter - changing the behavior of multiplication breaks the principle.

But this is not the main reason for deprecating matrix - almost all of the problems I’ve seen have been caused by the way that matrices behave when sliced. The way that m[i][j] and m[i,j] differ is just one example of this; the fact that they must be 2d is another.

Matrices behaving differently on multiplication isn’t super different in my mind to how string arrays fail to multiply at all.

Eric

It's certainly fine for arithmetic to work differently on an element-wise basis or even to error. But np.matrix changes the shape of results from various ndarray operations (e.g., both multiplication and indexing), which is more than any dtype can do.
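A short sketch of the behaviours mentioned here, for anyone who has not run into them. The helper row_sums is hypothetical; it stands in for any library function that has to choose between np.asarray and np.asanyarray.

```python
import warnings

import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.matrix may emit a pending-deprecation warning
    m = np.matrix([[1, 2], [3, 4]])

a = np.asarray(m)                     # plain ndarray with the same values

row_shape_matrix = m[0].shape         # (1, 2): a row of a matrix is still 2-d
row_shape_array = a[0].shape          # (2,)

elementwise = (a * a).tolist()        # [[1, 4], [9, 16]]
matrix_product = (m * m).tolist()     # [[7, 10], [15, 22]]

try:
    m[0][1]                           # asks for row 1 of a one-row matrix
    chained_ok = True
except IndexError:
    chained_ok = False


def row_sums(x, convert):
    # `convert` is either np.asarray or np.asanyarray.
    return convert(x).sum(axis=1)


shape_with_asarray = row_sums(m, np.asarray).shape        # (2,)
shape_with_asanyarray = row_sums(m, np.asanyarray).shape  # (2, 1)
```

The last two lines are where the thread's title comes in: with np.asanyarray the matrix passes through and the function's output shape changes, whereas np.asarray gives the function the ndarray semantics it was written for.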

The Liskov substitution principle (LSP) suggests that the reasonable ndarray subclasses are exactly those that could also in principle correspond to a new dtype. Of np.ndarray subclasses in widespread use, I think only the various "array with units" types come close to satisfying this criterion. They only fall short insofar as they present a misleading dtype (without unit information).

The main problem with subclassing for numpy.ndarray is that it guarantees too much: a large set of operations/methods along with a specific memory layout exposed as part of its public API. Worse, ndarray itself is a little quirky (e.g., with indexing, and its handling of scalars vs. 0d arrays). In practice, it's basically impossible to layer on complex behavior with these exact semantics, so only extremely minimal ndarray subclasses don't violate LSP.
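As one small illustration of the scalar vs. 0d quirk, the following sketch records which type comes back from a few basic operations on a plain ndarray. It only shows base-class behaviour; what a given subclass returns in the same situations is exactly the kind of detail it would have to get right.

```python
import numpy as np

zero_d = np.array(5.0)            # a 0-d array
one_d = np.array([5.0, 6.0])

kinds = {
    "zero_d": type(zero_d),            # numpy.ndarray
    "zero_d + 1": type(zero_d + 1),    # numpy.float64, a scalar
    "zero_d[()]": type(zero_d[()]),    # numpy.float64, a scalar
    "one_d[0]": type(one_d[0]),        # numpy.float64, a scalar
    "one_d[:1]": type(one_d[:1]),      # numpy.ndarray
}
```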

Once we have more easily extended dtypes, I suspect most of the good use cases for subclassing will have gone away.

Re: asanyarray vs. asarray

ralfgommers
In reply to this post by Stephan Hoyer-2


On Tue, Oct 30, 2018 at 2:22 PM Stephan Hoyer <[hidden email]> wrote:
On Mon, Oct 29, 2018 at 9:49 PM Eric Wieser <[hidden email]> wrote:

The latter - changing the behavior of multiplication breaks the principle.

But this is not the main reason for deprecating matrix - almost all of the problems I’ve seen have been caused by the way that matrices behave when sliced. The way that m[i][j] and m[i,j] differ is just one example of this; the fact that they must be 2d is another.

Matrices behaving differently on multiplication isn’t super different in my mind to how string arrays fail to multiply at all.

Eric

It's certainly fine for arithmetic to work differently on an element-wise basis or even to error. But np.matrix changes the shape of results from various ndarray operations (e.g., both multiplication and indexing), which is more than any dtype can do.

The Liskov substitution principle (LSP) suggests that the reasonable ndarray subclasses are exactly those that could also in principle correspond to a new dtype.

I don't think so. Dtypes have nothing to do with a whole set of use cases that add extra methods or attributes. Random made-up example: user has a system with 1000 sensor signals, some of which should be treated with robust statistics for <reasons like unreliable hardware>. So user writes a subclass robust_ndarray, adds a bunch of methods like median/iqr/mad, and uses isinstance checks in functions that accept both ndarray and robust_ndarray to figure out how to preprocess sensor signals.

Of course you can do everything you can do with subclasses also in other ways, but such "let's add some methods or attributes" cases are much more common (I think, hard to prove) than "let's change how indexing or multiplication works" in end user code.
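A rough sketch of the made-up example, to show how little code such a subclass needs. The method bodies and the function centre_and_spread are assumptions for illustration only.

```python
import numpy as np


class robust_ndarray(np.ndarray):
    """Hypothetical subclass that only adds robust-statistics methods."""

    def median(self):
        return float(np.median(np.asarray(self)))

    def iqr(self):
        q25, q75 = np.percentile(np.asarray(self), [25, 75])
        return float(q75 - q25)

    def mad(self):
        a = np.asarray(self)
        return float(np.median(np.abs(a - np.median(a))))


def centre_and_spread(signal):
    signal = np.asanyarray(signal)   # np.asarray would drop the subclass here
    if isinstance(signal, robust_ndarray):
        return signal.median(), signal.mad()
    return float(signal.mean()), float(signal.std())


readings = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
plain = centre_and_spread(readings)                        # mean and std
robust = centre_and_spread(readings.view(robust_ndarray))  # median and MAD
```

The isinstance check only works if every function along the way uses np.asanyarray (or no conversion at all); a single np.asarray upstream silently turns the signal back into a plain ndarray.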

Cheers,
Ralf

 

Re: asanyarray vs. asarray

Matthew Harrigan
I don't think so. Dtypes have nothing to do with a whole set of use cases that add extra methods or attributes. Random made-up example: user has a system with 1000 sensor signals, some of which should be treated with robust statistics for <reasons like unreliable hardware>. So user writes a subclass robust_ndarray, adds a bunch of methods like median/iqr/mad, and uses isinstance checks in functions that accept both ndarray and robust_ndarray to figure out how to preprocess sensor signals.

Of course you can do everything you can do with subclasses also in other ways, but such "let's add some methods or attributes" cases are much more common (I think, hard to prove) than "let's change how indexing or multiplication works" in end user code.

Cheers,
Ralf

To build on Ralf's thought, a common subclass use case would be to add logging to various methods and attributes. That might actually be useful for ndarray for understanding what is under the hood of some function in a downstream project. It would satisfy SOLID and not be related at all to dtype subclasses.
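One possible shape for such a logging subclass, sketched with __array_ufunc__ so that every ufunc applied to the array is recorded. The names LoggedArray and CALL_LOG are hypothetical, and this covers ufuncs only, not attribute access or other methods.

```python
import numpy as np

CALL_LOG = []   # hypothetical sink; a real version might use the logging module


class LoggedArray(np.ndarray):
    """Hypothetical subclass that records every ufunc applied to it."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        CALL_LOG.append((ufunc.__name__, method))

        def unwrap(x):
            return x.view(np.ndarray) if isinstance(x, LoggedArray) else x

        inputs = tuple(unwrap(x) for x in inputs)
        if "out" in kwargs:
            kwargs["out"] = tuple(unwrap(x) for x in kwargs["out"])
        result = getattr(ufunc, method)(*inputs, **kwargs)
        # Re-wrap array results so that later operations are logged as well.
        if isinstance(result, np.ndarray):
            return result.view(LoggedArray)
        return result


x = np.arange(4.0).view(LoggedArray)
y = np.sqrt(x + 1.0)   # logs ("add", "__call__") and then ("sqrt", "__call__")
```

Passing x into a downstream function would then leave a trace of the ufuncs it used, provided that function does not strip the subclass with np.asarray first.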



Re: asanyarray vs. asarray

Marten van Kerkwijk

The substitution principle is interesting (and, being trained as an astronomer, not a computer scientist, I had not heard of it before). I think matrix is indeed obviously wrong here (with indexing being more annoying, but multiplication being a good example as well).

Perhaps more interesting as an example to consider is MaskedArray, which is much closer to a sensible subclass, though different from Quantity in that what is masked can itself be an ndarray subclass. In a sense, it is more of a container class, in which the operations are done on what is inside it, with some care taken about which elements are fixed. This becomes quite clear when one thinks of implementing __array_ufunc__ or __array_function__: for Quantity, calling super after dealing with the units is very logical; for MaskedArray, it makes more sense to call the (universal) function again on the contents [1].
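The two __array_ufunc__ styles can be contrasted in a toy sketch. Meters and Masked below are stand-ins invented for this example; they are not astropy's Quantity or numpy.ma.MaskedArray, and they ignore keyword arguments, reductions, and broadcasting of masks.

```python
import numpy as np


class Meters(np.ndarray):
    """Toy stand-in for a unit-carrying subclass."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # "Deal with the units": only adding or subtracting lengths is allowed.
        if (method != "__call__" or kwargs
                or ufunc not in (np.add, np.subtract)
                or not all(isinstance(x, Meters) for x in inputs)):
            return NotImplemented
        plain = tuple(x.view(np.ndarray) for x in inputs)
        # Subclass style: hand the plain arrays to ndarray's own implementation.
        result = super().__array_ufunc__(ufunc, method, *plain, **kwargs)
        return result.view(Meters)


class Masked:
    """Toy container holding data plus a mask."""

    def __init__(self, data, mask):
        self.data = np.asanyarray(data)      # keeps subclasses such as Meters
        self.mask = np.asarray(mask, dtype=bool)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__" or kwargs:
            return NotImplemented
        contents = [x.data if isinstance(x, Masked) else x for x in inputs]
        masks = [x.mask for x in inputs if isinstance(x, Masked)]
        # Container style: call the ufunc again on what is inside.
        result = ufunc(*contents)
        return Masked(result, np.logical_or.reduce(masks))


lengths = np.array([1.0, 2.0, 3.0]).view(Meters)
total = np.add(Masked(lengths, [False, True, False]),
               Masked(lengths, [False, False, True]))
```

Because Masked re-dispatches on its contents, total.data comes back as a Meters array without Masked knowing anything about units, while the mask is combined separately.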

For this particular class, if reimplemented, it might make most sense as a "mixin" since its attributes depend both on the masked class (.mask, etc.) and on what is being masked (say, .unit for a quantity). Thus, the final class might be an auto-generated new class (e.g., MaskedQuantity(MaskedArray, Quantity)). We have just added a new Distribution class to astropy which is based on this idea [2] (since this uses casting from structured dtypes which hold the samples to real arrays on which functions are evaluated, this probably could be done just as well or better with more flexible dtypes, but we have to deal with what's available in the real world, not the ideal one...).
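A rough sketch of the auto-generated mixin idea follows. It is not how astropy implements it: Quantity here is a toy subclass, and MaskedMixin, masked_class_for, and masked are hypothetical names. The mask is also not propagated to slices or results, which a real implementation would need to handle.

```python
from functools import lru_cache

import numpy as np


class Quantity(np.ndarray):
    """Toy unit-carrying subclass; not astropy's Quantity."""

    unit = None

    def __array_finalize__(self, obj):
        # Carry the unit over to views and to arrays derived from `obj`.
        self.unit = getattr(obj, "unit", None)


class MaskedMixin:
    """Hypothetical mixin holding what belongs to the mask, not to the data."""

    mask = None

    def filled(self, fill_value):
        return np.where(self.mask, fill_value, self.view(np.ndarray))


@lru_cache(maxsize=None)
def masked_class_for(data_cls):
    # Generate e.g. MaskedQuantity(MaskedMixin, Quantity) once per data class.
    return type("Masked" + data_cls.__name__, (MaskedMixin, data_cls), {})


def masked(data, mask):
    out = data.view(masked_class_for(type(data)))
    out.mask = np.asarray(mask, dtype=bool)
    return out


q = np.array([1.0, 2.0, 3.0]).view(Quantity)
q.unit = "m"
mq = masked(q, [False, True, False])   # an instance of the generated MaskedQuantity
```

The generated class gets .mask and filled() from the mixin and .unit from the class being masked, which is the split of attributes described above.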

-- Marten

[1] http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies
[2] https://github.com/astropy/astropy/pull/6945
