asarray/anyarray; matrix/subclass


asarray/anyarray; matrix/subclass

einstein.edison
Begin forwarded message:

From: Stephan Hoyer
Date: Friday, Nov 09, 2018 at 3:19 PM
To: Hameer Abbasi
Cc: Stefan van der Walt , Marten van Kerkwijk
Subject: asarray/anyarray; matrix/subclass

This is a great discussion, but let's try to have it in public (e.g., on the NumPy mailing list).
On Fri, Nov 9, 2018 at 8:42 AM Hameer Abbasi <[hidden email]> wrote:
Hi Stephan,

The issue I have with writing another function is that asarray/asanyarray are so widely used that it’d be a huge maintenance task to update them throughout NumPy, let alone in other codebases, which would also have to rely on newer NumPy versions for this. In short, it would dramatically reduce adoption of this function.

One path we can take is to allow asarray/asanyarray to be overridable via __array_function__ (the former is debatable). This solves most of our duck-array related issues without introducing another protocol.

Regardless of which path we choose, I would recommend changing asanyarray to stop passing np.matrix through, returning mat.view(type=np.ndarray) instead, which has O(1) cost in time and memory. In the vast majority of contexts, asanyarray is used to ensure an array-ish structure for another operation, and usually there’s no guarantee that what comes out will be a matrix anyway. I suggest we raise a FutureWarning and then change this behaviour.

There have been a number of discussions about deprecating np.matrix (and a few about MaskedArray as well, though there are less compelling reasons for that one). I suggest we start down that path as soon as possible. The biggest (only?) user I know of blocking that is scipy.sparse, and we’re on our way to replacing that with PyData/Sparse.

Best Regards,
Hameer Abbasi

On Friday, Nov 09, 2018 at 1:26 AM, Stephan Hoyer <[hidden email]> wrote:
Hi Hameer,

I'd love to talk about this in more detail. I agree that something like this is needed.

The challenge with reusing an existing function like asanyarray() is that there is at least one (somewhat?) widely used ndarray subclass that badly violates the Liskov Substitution Principle: np.matrix.

NumPy can't really use np.asanyarray() widely for internal purposes until we don't have to worry about np.matrix. We might special case np.matrix in some way, but then asanyarray() would do totally opposite things on different versions of NumPy. It's almost certainly a better idea to just write a new function with the desired semantics, and "soft deprecate" asanyarray(). The new function can explicitly blacklist np.matrix, as well as any other subclasses we know of that badly violate LSP.

Cheers,
Stephan
On Thu, Nov 8, 2018 at 5:06 PM Hameer Abbasi <[hidden email]> wrote:
No, Stefan, I’ll do that now. Putting you in the cc.

It slipped my mind among the million other things I had in mind — Namely: My job visa. It was only done this Monday.

Hi, Marten, Stephan:

Stefan wants me to write up a NEP that allows a given object to specify that it is a duck array — Namely, that it follows duck-array semantics.

We were thinking of changing asanyarray to pass through anything that implements the duck-array protocol, along with ndarray subclasses. I’m sure this would help XArray and Quantity work better with existing codebases, along with PyData/Sparse arrays.

Would you be interested?

Best Regards,
Hameer Abbasi

On Thursday, Nov 08, 2018 at 9:09 PM, Stefan van der Walt <[hidden email]> wrote:
Hi Hameer,

In last week's meeting, we had the following in the notes:

Hameer is contacting Marten & Stephan and will write up a draft NEP for
clarifying the asarray/asanyarray and matrix/subclass path forward.

Did any of that happen that you could share?

Thanks and best regards,
Stéfan

Hello, everyone,

Stefan van der Walt, Stephan Hoyer, Marten van Kerkwijk and I were having a discussion about the state of matrix, asarray and asanyarray. Our thoughts are summarised above (in the quoted text that I’m forwarding).

Basically, this grew out of a discussion relating to asanyarray/asarray inconsistencies in NumPy about which to use where. Historically, asarray was used in many libraries/places instead of asanyarray usually because np.matrix caused problems due to its special behaviour with regard to indexing (it always returns a 2-D object when eliminating one dimension, but a 0-D one when eliminating both), its behaviour regarding __mul__ (the multiplication operator represents matrix multiplication rather than element-wise multiplication) and its fixed dimensionality (matrix is 2D only). Because of these three things, as Stephan accurately pointed out, it violates the Liskov Substitution Principle.
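
To make the three differences concrete, here is a small sketch comparing np.matrix with a plain ndarray. The variable names are only illustrative, and the warning filter is there because recent NumPy versions emit a PendingDeprecationWarning when a matrix is constructed.

```python
import warnings
import numpy as np

# Constructing an np.matrix emits a PendingDeprecationWarning in recent NumPy.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    m = np.matrix([[1, 2], [3, 4]])
a = np.array([[1, 2], [3, 4]])

# Indexing away one dimension: ndarray becomes 1-D, matrix stays 2-D.
row_a_shape = a[0].shape
row_m_shape = m[0].shape

# Indexing away both dimensions gives a 0-D scalar in either case.
scalar_ndim = np.ndim(m[0, 0])

# `*` is element-wise for ndarray but matrix multiplication for matrix.
elementwise = a * a
matmul = m * m
```

Code written against the ndarray behaviour (for example, code expecting `x[0]` to be 1-D) can therefore break when handed a matrix.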

Because of this behaviour, many libraries switched from using asanyarray to asarray, as np.matrix wouldn’t work with their code. This shut out other ndarray subclasses from being used as well, such as MaskedArray and astropy.Quantity. Even if asanyarray is used, there is usually no guarantee that a matrix will be returned instead of an array.

The changes I’m proposing are twofold, but simple:
  • asanyarray should return mat.view(type=np.ndarray) instead of matrices. This preserves performance (creating a view is O(1) in both memory and time) and the mutability of the original matrix. This change should happen after a FutureWarning and the usual grace period.
  • In the spirit of allowing duck arrays to work with existing NumPy code, asanyarray should be overridable via __array_function__, so that duck arrays can decide whether to pass themselves through. If subclasses are allowed through, duck arrays should be as well.
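
As a rough illustration of the first point, the sketch below shows that viewing a matrix as a base ndarray shares the underlying buffer, so no data is copied and writes go through to the original. The names `m` and `v` are only illustrative.

```python
import numpy as np

m = np.matrix([[1, 2], [3, 4]])

# Reinterpret the same buffer as a base ndarray; no data is copied.
v = m.view(type=np.ndarray)

# Writing through the view mutates the original matrix.
v[0, 0] = 99

# The view has ordinary ndarray semantics: `*` is element-wise.
squared = v * v
```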
This is a part of a larger effort to deprecate np.matrix. As far as I’m aware, it has one big customer (scipy.sparse). The effort to replace that is already underway at PyData/Sparse.

Best Regards,
Hameer Abbasi


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: asarray/anyarray; matrix/subclass

Stephan Hoyer-2
I’m still not sure I agree with the advantages of reusing asanyarray(), even if matrix did not exist. Yes, asanyarray will exist in old NumPy versions, but you can’t use it with sparse arrays anyways because it will have the wrong semantics. I expect this would be a bug magnet, with inadvertent loading of sparse arrays into memory if you’re accidentally using old NumPy.

With regard to the protocol, I would suggest a dedicated method, e.g., __asanyarray__ (or something similar based on the final chosen name of the function). Coercing to arrays is special enough to have its own dedicated protocol, and it could be useful for libraries like xarray to check for __asanyarray__ attributes before deciding which coercion mechanism to use.

Re: asarray/anyarray; matrix/subclass

Nathaniel Smith
In reply to this post by einstein.edison
But matrix isn't the only problem with asanyarray. np.ma also violates
Liskov. No doubt there are other problematic ndarray subclasses out
there too...

If we were going to try to reuse asanyarray through some deprecation
mechanism, I think we'd need to deprecate allowing asanyarray to
return *any* ndarray subclass, unless they explicitly provided an
__asanyarray__ dunder. But at that point I'm not sure what the point
would be of reusing it.

--
Nathaniel J. Smith -- https://vorpus.org

Re: asarray/anyarray; matrix/subclass

Stephan Hoyer-2
On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith <[hidden email]> wrote:
But matrix isn't the only problem with asanyarray. np.ma also violates
Liskov. No doubt there are other problematic ndarray subclasses out
there too...

Please forgive my ignorance (I don't really use masked arrays), but how specifically do masked arrays violate Liskov? In most cases shouldn't they work the same as base numpy arrays, except with operations keeping track of masks?

I'm sure there are some cases where masked arrays have different semantics than NumPy arrays, but are any of these intentional?

I would guess that the worst current violation is that there is a risk of losing mask information in some operations, but implementing __array_function__ would presumably make it possible to fix most of these.


Re: asarray/anyarray; matrix/subclass

Nathaniel Smith
On Fri, Nov 9, 2018 at 4:59 PM, Stephan Hoyer <[hidden email]> wrote:

> On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith <[hidden email]> wrote:
>>
>> But matrix isn't the only problem with asanyarray. np.ma also violates
>> Liskov. No doubt there are other problematic ndarray subclasses out
>> there too...
>
>
> Please forgive my ignorance (I don't really use mask arrays), but how
> specifically do masked arrays violate Liskov? In most cases shouldn't they
> work the same as base numpy arrays, except with operations keeping track of
> masks?

Since many operations silently skip over masked values, the
computation semantics are different. For example, in a regular array,
sum()/size() == mean(), but with a masked array these are totally
different operations. So if you have code that was written for regular
arrays, but pass in a masked array, there's a solid chance that it
will silently return nonsensical results.

(This is why it's better for NAs to propagate by default.)
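
A minimal sketch of the mismatch described above, assuming a helper (here called `naive_mean`, a made-up name) that was written with plain arrays in mind:

```python
import numpy as np

def naive_mean(x):
    # Written for plain ndarrays: total divided by the number of elements.
    return x.sum() / x.size

plain = np.array([1.0, 2.0, 3.0, 4.0])
masked = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])

# For a plain array the helper agrees with mean().
plain_ok = naive_mean(plain) == plain.mean()

# For a masked array, sum() skips the masked value but size still counts it,
# so the helper silently disagrees with the array's own mean().
masked_naive = naive_mean(masked)
masked_builtin = masked.mean()
```

The helper raises no error on the masked input; it just returns a number that matches neither the masked mean nor the unmasked one.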

-n

--
Nathaniel J. Smith -- https://vorpus.org

Re: asarray/anyarray; matrix/subclass

mattip
Echoes of the discussions in NEPs 12, 24, 25, 26. http://www.numpy.org/neps


Matti


Re: asarray/anyarray; matrix/subclass

Marten van Kerkwijk
Hi Hameer,

I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).

I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
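
A rough sketch of how such a protocol could look. Everything here is hypothetical: `__anyarray__` is not an existing NumPy protocol, `as_any_array` is a made-up helper, and the two subclasses are stand-ins (since `ndarray` itself cannot be given a new method from Python, the default is emulated by a fallback to `np.asanyarray`).

```python
import numpy as np

def as_any_array(obj):
    """Hypothetical coercion helper; not part of NumPy."""
    hook = getattr(type(obj), "__anyarray__", None)
    if hook is not None:
        return hook(obj)
    # Fallback: current behaviour, which passes ndarray subclasses through.
    return np.asanyarray(obj)

class WellBehaved(np.ndarray):
    """Stand-in subclass that keeps ndarray semantics and returns itself."""
    def __anyarray__(self):
        return self

class MatrixLike(np.ndarray):
    """Stand-in for a subclass that breaks ndarray semantics."""
    def __anyarray__(self):
        # Hand out a base-class view instead of the subclass instance.
        return self.view(np.ndarray)

good = np.arange(4).view(WellBehaved)
odd = np.arange(4).view(MatrixLike)
```

With this shape, the decision about what to pass through lives in each class rather than in a special case inside the coercion function.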

All the best,

Marten

p.s. Note that we are already giving PendingDeprecationWarning for matrix; https://github.com/numpy/numpy/pull/10142.




Re: asarray/anyarray; matrix/subclass

Stephan Hoyer-2
On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk <[hidden email]> wrote:
> Hi Hameer,
>
> I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).
>
> I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.

Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.

Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write a new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).

Changing the default aggregation behavior for the existing MaskedArray is also an option, but that would be a serious annoyance to users and a backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.asanyarray() or any new coercion function throughout the internal NumPy code base.

To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
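
For option 1, the skipna transition could look roughly like the sketch below. The function `masked_mean` and its `skipna` parameter are hypothetical and only illustrate the FutureWarning pattern; MaskedArray has no such argument today.

```python
import warnings
import numpy as np

def masked_mean(arr, skipna=None):
    """Hypothetical aggregation with an explicit skipna switch."""
    if skipna is None:
        # Not passing skipna keeps today's behaviour but warns about the change.
        warnings.warn(
            "skipna will default to False in the future; pass skipna explicitly",
            FutureWarning,
            stacklevel=2,
        )
        skipna = True
    if skipna:
        # Current MaskedArray behaviour: masked entries are ignored.
        return arr.mean()
    # Propagating behaviour: any masked entry makes the result masked.
    if np.ma.is_masked(arr):
        return np.ma.masked
    return arr.mean()

x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
skipped = masked_mean(x, skipna=True)       # mean of the unmasked values
propagated = masked_mean(x, skipna=False)   # masked result
```

Callers who pass skipna explicitly never see the warning, which is what makes the later default change tolerable.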

P.S. I'm just glad pandas stopped subclassing ndarray a while ago -- there's no way pandas.Series() could be fixed up to not violate Liskov :).


Re: asarray/anyarray; matrix/subclass

Eric Wieser
> If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable

One of the ways to fix these Liskov substitution problems is just to introduce more base classes - for instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact Liskov substitution, but np.matrix would not.

Eric

On Sat, 10 Nov 2018 at 12:17 Stephan Hoyer <[hidden email]> wrote:
On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk <[hidden email]> wrote:
Hi Hameer,

I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).

I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self`  and to override it for `matrix` to return an ndarray view.

Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.

Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).

Changing the default aggregation behavior for the existing MaskedArray is also an option but that would be a serious annoyance to users and backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.anyarray() or any new coercion function throughout the internal NumPy code base.

To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
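
To make the skipna transition in option 1 concrete, here is a minimal sketch. `masked_sum` is a hypothetical helper, not a NumPy function, and the choice to return `np.ma.masked` when not skipping is an assumption about what "ndarray-like" propagation would mean for a masked array.

```python
import warnings

import numpy as np


def masked_sum(marr, skipna=None):
    """Hypothetical transition helper; not part of NumPy."""
    if skipna is None:
        # Warn only when the caller relies on the default.
        warnings.warn(
            "the default will change to skipna=False; pass skipna explicitly",
            FutureWarning,
            stacklevel=2,
        )
        skipna = True  # today's MaskedArray behaviour
    if skipna:
        return marr.sum()  # masked elements are ignored
    # Propagating semantics: any masked element masks the result.
    return np.ma.masked if np.ma.getmaskarray(marr).any() else marr.sum()


data = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
explicit_skip = masked_sum(data, skipna=True)        # no warning
explicit_propagate = masked_sum(data, skipna=False)  # no warning
```

Passing skipna explicitly silences the warning in either direction, which is the point of adding the argument before changing the default.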

P.S. I'm just glad pandas stopped subclassing ndarray a while ago -- there's no way pandas.Series() could be fixed up to not violate Liskov :).
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: asarray/anyarray; matrix/subclass

einstein.edison
In reply to this post by Stephan Hoyer-2
On Saturday, Nov 10, 2018 at 9:16 PM, Stephan Hoyer <[hidden email]> wrote:
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.

My vote is still for 1. I don’t have an issue with PyData/Sparse depending on recent-ish NumPy versions; it’ll need a lot of the recent protocols anyway. I could be convinced otherwise, though, if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy of supporting old NumPy versions).




Re: asarray/anyarray; matrix/subclass

Stephan Hoyer-2
On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi <[hidden email]> wrote:
My vote is still for 1. I don’t have an issue with PyData/Sparse depending on recent-ish NumPy versions; it’ll need a lot of the recent protocols anyway. I could be convinced otherwise, though, if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy of supporting old NumPy versions).

I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.

I don't know how people who currently use MaskedArray would feel about that. I would love to hear their thoughts.

Re: asarray/anyarray; matrix/subclass

Marten van Kerkwijk


On Sat, Nov 10, 2018 at 5:39 PM Stephan Hoyer <[hidden email]> wrote:
I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.

Might be good to try before worrying too much - MaskedArray already overrides *a lot*; it is not at all obvious to me that things wouldn't "just work" if we bulk-replaced `asarray` with `asanyarray`.  And with `__array_function__` we now have the option to fix code paths that do not work immediately.
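
For reference, the difference such a bulk replacement would expose can be seen with today's functions; this small example uses only existing NumPy calls:

```python
import numpy as np

m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

plain = np.asarray(m)    # base ndarray over the data; the mask is dropped
kept = np.asanyarray(m)  # subclasses pass through, so this is m itself

plain_total = plain.sum()  # includes the masked 2.0
kept_total = kept.sum()    # skips the masked 2.0
```

Any code path that currently calls `asarray` silently includes the masked values, so switching it to `asanyarray` changes results as well as types.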

-- Marten



Re: asarray/anyarray; matrix/subclass

Charles R Harris
In reply to this post by Eric Wieser


On Sat, Nov 10, 2018 at 2:15 PM Eric Wieser <[hidden email]> wrote:
If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable

One of the ways to fix these Liskov substitution problems is just to introduce more base classes - for instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact Liskov substitution, but np.matrix would not.

Eric

I've had the same thought and wouldn't be surprised if others have considered that possibility. Travis would be a good guy to ask about that.

<snip>

Chuck 


Re: asarray/anyarray; matrix/subclass

Eric Firing
In reply to this post by Stephan Hoyer-2
On 2018/11/10 12:39 PM, Stephan Hoyer wrote:

> I agree that option (1) is fine for PyData/sparse. The bigger issue is
> that this change should be conditional on making breaking changes (at
> least raising FutureWarning for now) to np.ma.MaskedArray.
>
> I don't know how people who currently use MaskedArray would feel about
> that. I would love to hear their thoughts.

Thank you.  I am a user of masked arrays, and have been since pre-numpy
days.  I introduced their extensive use in matplotlib long ago.  I have
been a bit concerned, indeed, that all of the discussion of modifying
masked arrays seems to be by people who don't actually use them
explicitly (though they might be using them without knowing it via
internal operations in matplotlib, or they might be quickly getting rid
of them after they are yielded by netCDF4.Dataset()).

I think that those of us who do use masked arrays recognize that they
are not perfect; they have some quirks and gotchas, and one has to be
careful to use numpy.ma functions instead of numpy functions in most
cases.  But we use them because they have real advantages over the
alternatives, which are using nans and/or manually tracking independent
masks throughout calculations.  These advantages are largely because
masked values *don't* behave like nan, *don't* propagate.  This is
fundamental to the design, and motivated by real-life use cases.
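
A small illustration of that difference, using only existing NumPy and
numpy.ma calls:

```python
import numpy as np

with_nan = np.array([1.0, np.nan, 3.0])
with_mask = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

nan_total = with_nan.sum()         # nan propagates into the result
nan_skipped = np.nansum(with_nan)  # skipping needs a separate function
masked_total = with_mask.sum()     # masked element is skipped by default
masked_mean = with_mask.mean()     # mean over the two unmasked values

shifted = with_mask + 1            # elementwise ops carry the mask along
```

With nans, every reduction needs its nan-aware counterpart; with a mask,
the ordinary method already does the skipping.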

The proposal to add a skipna kwarg to MaskedArray looks to me like it is
giving purity priority over practicality.  It will force ma users to
insert skipna kwargs all over the place--because the default will be
contrary to the primary purposes of using masked arrays, in most cases.
How many people will it actually benefit?  How many people are being
bitten, and how badly, by masked array behavior?

If there were a prospect of truly integrating missing/masked value
handling into numpy, simplifying or phasing out numpy.ma, I would be
delighted--I think it is the biggest single fundamental improvement that
could be made, from the user's standpoint.  I was sad to see Mark
Wiebe's work in that direction come to grief.

If there are ways of gradually improving numpy.ma and its
interoperability with the rest of numpy and with the proliferation of
duck arrays, I'm all in favor--so long as they don't effectively wreck
numpy.ma for its present intended purposes.

Eric


Re: asarray/anyarray; matrix/subclass

Marten van Kerkwijk
Hi Eric,

Thanks very much for the detailed response; it is good to be reminded that `MaskedArray` is used in a package that, indeed, (nearly?) all of us use!

But I do think that those of us who have been trying to change MaskedArray are generally good at making sure the tests continue to pass, i.e., that the behaviour does not change (the main exception in the last few years was that views should be taken of masks too, not just the data).

I also think that between __array_ufunc__ and __array_function__, it has become quite easy to ensure that one no longer has to rely on `np.ma` functions, i.e., that the regular numpy functions will do the right thing. But it will need work to actually implement that.
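
As a minimal sketch of what that could look like, here is the registration pattern described in NEP 18 applied to a toy masked container. `MaskedDuck`, `HANDLED` and `implements` are hypothetical names for illustration, not part of `np.ma`, and the example assumes a NumPy version where `__array_function__` is enabled by default (1.17 or later).

```python
import numpy as np

HANDLED = {}  # maps NumPy functions to masked-aware implementations


def implements(numpy_function):
    """Register an implementation for a NumPy function."""
    def decorator(func):
        HANDLED[numpy_function] = func
        return func
    return decorator


class MaskedDuck:
    """Toy masked container; a stand-in, not np.ma.MaskedArray."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED:
            return NotImplemented  # NumPy then raises TypeError
        return HANDLED[func](*args, **kwargs)


@implements(np.sum)
def _masked_sum(arr, **kwargs):
    # Sum only the unmasked elements.
    return np.sum(arr.data[~arr.mask], **kwargs)


duck = MaskedDuck([1.0, 2.0, 3.0], [False, True, False])
total = np.sum(duck)  # dispatched to _masked_sum
```

Functions without a registered implementation fail loudly with a TypeError instead of silently ignoring the mask, which makes the remaining work easy to find.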

All the best,

Marten


Re: asarray/anyarray; matrix/subclass

Stephan Hoyer-2
In reply to this post by Eric Firing
On Sat, Nov 10, 2018 at 10:45 PM Eric Firing <[hidden email]> wrote:
<snip>

Eric -- thank you for sharing your perspective! I guess it should not be surprising that the semantics of MaskedArray intentionally deviate from the semantics of base NumPy arrays.

This deviation is fortunately less severe than deviations in the behavior of np.matrix, but it still presents some difficulties for duck typing. We're in a position to reduce (but still not eliminate) these differences with new protocols like __array_function__.

I think Nathaniel actually summarized these issues pretty well in NEP 16 (http://www.numpy.org/neps/nep-0016-abstract-array.html). If we want a coercion function that guarantees an object is a "full duck array", then it can't pass on either np.matrix or MaskedArray in their current state. Anything less than full compatibility provides a shaky foundation for use in downstream projects or inside NumPy itself.

In theory (certainly if we were starting from scratch) it would make sense to make asabstractarray() pass on any ndarray subclass, but this would require willingness to make breaking changes to both np.matrix and MaskedArray.

I would suggest adopting a variation of the proposal in NEP 16, except using a protocol rather than an abstract base class per NEP 22, e.g.,

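# names still to be determined
import numpy as np

def asabstractarray(array, dtype=None):
    # Defer to the object's own hook when it implements the protocol.
    if hasattr(array, '__abstractarray__'):
        return array.__abstractarray__(dtype=dtype)
    # Otherwise fall back to ordinary coercion to a base ndarray.
    return np.asarray(array, dtype=dtype)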
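
To show how this might be used, the sketch below repeats the function so that it runs on its own, calling the hook as a bound method, and adds a hypothetical `Duck` class that opts in. All names here are illustrative assumptions; `__abstractarray__` is not an existing NumPy protocol.

```python
import numpy as np


def asabstractarray(array, dtype=None):
    # Objects that implement the hook decide what they coerce to.
    if hasattr(array, "__abstractarray__"):
        return array.__abstractarray__(dtype=dtype)
    # Everything else is coerced to a base ndarray.
    return np.asarray(array, dtype=dtype)


class Duck:
    """Hypothetical duck array that claims full ndarray compatibility."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __abstractarray__(self, dtype=None):
        if dtype is None:
            return self
        return Duck(self.values.astype(dtype))


duck = Duck([1, 2, 3])
same = asabstractarray(duck)                   # passes through
as_float = asabstractarray(duck, dtype=float)  # new Duck with float values
from_list = asabstractarray([1, 2, 3])         # plain ndarray

# MaskedArray defines no such hook, so it is coerced to a base ndarray.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
coerced = asabstractarray(masked)
```

Under this scheme, existing subclasses such as np.matrix and MaskedArray are excluded by default and would only pass through once they explicitly implement the hook.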


