

Begin forwarded message:

From: Stephan Hoyer
Date: Friday, Nov 09, 2018 at 3:19 PM
To: Hameer Abbasi
Cc: Stefan van der Walt, Marten van Kerkwijk
Subject: asarray/anyarray; matrix/subclass

This is a great discussion, but let's try to have it in public (e.g., on the NumPy mailing list).

On Fri, Nov 9, 2018 at 8:42 AM Hameer Abbasi < [hidden email]> wrote:
> Hi Stephan,
>
> The issue I have with writing another function is that asarray/asanyarray are so widely used that it’d be a huge maintenance task to update them throughout NumPy, not to mention other codebases, not to mention other codebases having to rely on newer NumPy versions for this. In short, it would dramatically reduce adaptability of this function.
>
> One path we can take is to allow asarray/asanyarray to be overridable via __array_function__ (the former is debatable). This solves most of our duckarray related issues without introducing another protocol.
>
> Regardless of what path we choose, I would recommend changing asanyarray to not pass through np.matrix regardless, instead passing through mat.view(type=np.ndarray) instead, which has O(1) cost and memory. In the vast majority of contexts, it’s used to ensure an arrayish structure for another operation, and usually there’s no guarantee that what comes out will be a matrix anyway. I suggest we raise a FutureWarning and then change this behaviour.
>
> There have been a number of discussions about deprecating np.matrix (and a few about MaskedArray as well, though there are less compelling reasons for that one). I suggest we start down that path as soon as possible. The biggest (only?) user I know of blocking that is scipy.sparse, and we’re on our way to replacing that with PyData/Sparse.
>
> Best Regards,
> Hameer Abbasi
>
> On Friday, Nov 09, 2018 at 1:26 AM, Stephan Hoyer < [hidden email]> wrote:
> Hi Hameer,
>
> I'd love to talk about this in more detail. I agree that something like this is needed.
>
> The challenge with reusing an existing function like asanyarray() is that there is at least one (somewhat?) widely used ndarray subclass that badly violates the Liskov Substitution Principle: np.matrix.
>
> NumPy can't really use np.asanyarray() widely for internal purposes until we don't have to worry about np matrix. We might special case np.matrix in some way, but then asanyarray() would do totally opposite things on different versions of NumPy. It's almost certainly a better idea to just write a new function with the desired semantics, and "soft deprecate" asanyarray(). The new function can explicitly black list np.matrix, as well as any other subclasses we know of that badly violate LSP.
>
> Cheers,
> Stephan
>
> On Thu, Nov 8, 2018 at 5:06 PM Hameer Abbasi < [hidden email]> wrote:
>> No, Stefan, I’ll do that now. Putting you in the cc.
>>
>> It slipped my mind among the million other things I had in mind, namely my job visa. It was only done this Monday.
>>
>> Hi, Marten, Stephan:
>>
>> Stefan wants me to write up a NEP that allows a given object to specify that it is a duck array, namely, that it follows duckarray semantics.
>>
>> We were thinking of switching asanyarray to switch to passing through anything that implements the duckarray protocol along with ndarray subclasses. I’m sure this would help XArray and Quantity work better with existing codebases, along with PyData/Sparse arrays.
>>
>> Would you be interested?
>>
>> Best Regards,
>> Hameer Abbasi
>>
>> On Thursday, Nov 08, 2018 at 9:09 PM, Stefan van der Walt < [hidden email]> wrote:
>> Hi Hameer,
>>
>> In last week's meeting, we had the following in the notes:
>>
>> Hameer is contacting Marten & Stephan and write up a draft NEP for clarifying the asarray/asanyarray and matrix/subclass path forward.
>>
>> Did any of that happen that you could share?
>>
>> Thanks and best regards,
>> Stéfan

Hello, everyone,

Stefan van der Walt, Stephan Hoyer, Marten van Kerkwijk and I have been discussing the state of matrix, asarray and asanyarray. Our thoughts are summarised above (in the quoted text that I’m forwarding).
Basically, this grew out of a discussion relating to asanyarray/asarray inconsistencies in NumPy about which to use where. Historically, asarray was used in many libraries/places instead of asanyarray usually because np.matrix caused problems due to its special behaviour with regard to indexing (it always returns a 2D object when eliminating one dimension, but a 0D one when eliminating both), its behaviour regarding __mul__ (the multiplication operator represents matrix multiplication rather than elementwise multiplication) and its fixed dimensionality (matrix is 2D only). Because of these three things, as Stephan accurately pointed out, it violates the Liskov Substitution Principle.
Because of this behaviour, many libraries switched from using asanyarray to asarray, as np.matrix wouldn’t work with their code. This shut out other ndarray subclasses as well, such as MaskedArray and astropy.Quantity. Even if asanyarray is used, there is usually no guarantee that a matrix will be returned instead of an array.
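For readers who have not used np.matrix, the three behaviours named in this message (indexing, __mul__ and fixed dimensionality) can be reproduced in a few lines. This is only an illustrative sketch: the variable names are made up here, and the warning filter is an assumption added because recent NumPy releases emit a PendingDeprecationWarning whenever a matrix is created.

```python
import warnings
import numpy as np

# Recent NumPy warns when a matrix is created; silence that for the demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

m = np.matrix([[1, 2], [3, 4]])
a = np.asarray(m)  # plain ndarray holding the same values

# 1. Indexing: dropping one dimension still yields a 2D matrix,
#    while dropping both yields a 0D scalar.
row_m = m[0]        # matrix of shape (1, 2)
row_a = a[0]        # ndarray of shape (2,)
elem_m = m[0, 0]    # scalar

# 2. __mul__: matrix product for np.matrix, elementwise for ndarray.
prod_m = m * m
prod_a = a * a

# 3. Fixed dimensionality: a matrix stays 2D even after ravel().
flat_m = m.ravel()  # shape (1, 4)
flat_a = a.ravel()  # shape (4,)

print(row_m.shape, row_a.shape, np.ndim(elem_m))
print(prod_m.tolist(), prod_a.tolist())
print(flat_m.ndim, flat_a.ndim)
```

Code written for ndarray that assumes, say, that a[0] is one-dimensional will therefore misbehave when it is handed a matrix.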
The changes I’m proposing are twofold, but simple:

1. asanyarray should return mat.view(type=np.ndarray) instead of matrices, after the usual grace period with a FutureWarning. This preserves both the performance (creating a view is O(1) in memory and time) and the mutability of the original matrix.
2. In the spirit of allowing duck arrays to work with existing NumPy code, asanyarray should be overridable via __array_function__, so that duck arrays can decide whether to pass themselves through. If subclasses are allowed through, duck arrays should be as well.

This is a part of a larger effort to deprecate np.matrix. As far as I’m aware, it has one big customer (scipy.sparse). The effort to replace that is already underway at PyData/Sparse.

Best Regards,
Hameer Abbasi
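The claim that mat.view(type=np.ndarray) costs O(1) and keeps the original matrix mutable can be checked by hand. The sketch below does not change asanyarray; it only compares today's pass-through with the view the proposal would return. The variable names are illustrative.

```python
import warnings
import numpy as np

# Recent NumPy warns when a matrix is created; silence that for the demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

mat = np.matrix([[1.0, 2.0], [3.0, 4.0]])

# Current behaviour: asanyarray passes the matrix through untouched.
passed_through = np.asanyarray(mat)

# Proposed behaviour, done by hand here: return a base-class view.
as_view = mat.view(type=np.ndarray)

# The view shares the matrix's buffer, so no data is copied.
shares = np.shares_memory(as_view, mat)

# Writes through the view show up in the original matrix.
as_view[0, 0] = 99.0

print(type(passed_through).__name__, type(as_view).__name__, shares, mat[0, 0])
```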
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion


I’m still not sure I agree with the advantages of reusing asanyarray(), even if matrix did not exist. Yes, asanyarray will exist in old NumPy versions, but you can’t use it with sparse arrays anyway because it will have the wrong semantics. I expect this would be a bug magnet, with inadvertent loading of sparse arrays into memory if you’re accidentally using old NumPy.

With regard to the protocol, I would suggest a dedicated method, e.g., __asanyarray__ (or something similar based on the final chosen name of the function). Coercing to arrays is special enough to have its own dedicated protocol, and it could be useful for libraries like xarray to check for __asanyarray__ attributes before deciding which coercion mechanism to use.
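The "inadvertent loading of sparse arrays into memory" concern can be illustrated with a toy class. ToySparse below is purely hypothetical and is not the PyData/Sparse API; it assumes only that asanyarray falls back to the generic __array__ hook for objects it does not recognise, which is how NumPy coerces arbitrary objects today.

```python
import numpy as np

class ToySparse:
    """Hypothetical stand-in for a sparse array: stores non-zero entries only."""

    def __init__(self, shape, entries):
        self.shape = shape
        self.entries = entries  # {(row, col): value}
        self.densified = 0      # counts how often a dense copy was built

    def __array__(self, dtype=None, copy=None):
        # Generic coercion hook: materialises the full dense array.
        self.densified += 1
        dense = np.zeros(self.shape, dtype=dtype or float)
        for index, value in self.entries.items():
            dense[index] = value
        return dense

s = ToySparse((1000, 1000), {(0, 0): 1.0, (999, 999): 2.0})

# asanyarray knows nothing about ToySparse, so it silently densifies:
# two stored values become a million-element ndarray.
result = np.asanyarray(s)

print(type(result).__name__, result.shape, s.densified)
```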


But matrix isn't the only problem with asanyarray. np.ma also violates
Liskov. No doubt there are other problematic ndarray subclasses out
there too...
If we were going to try to reuse asanyarray through some deprecation
mechanism, I think we'd need to deprecate allowing asanyarray to
return *any* ndarray subclass, unless they explicitly provided an
__asanyarray__ dunder. But at that point I'm not sure what the point
would be of reusing it.
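A rough sketch of what such an opt-in deprecation could look like follows. Everything here is hypothetical: asanyarray_opt_in is not a NumPy function, __asanyarray__ is only the name floated in this thread, and the warning text is invented. During the deprecation phase, subclasses without the dunder would still pass through, but with a FutureWarning.

```python
import warnings
import numpy as np

def asanyarray_opt_in(obj):
    """Hypothetical coercion: subclasses pass through only if they opt in."""
    hook = getattr(type(obj), "__asanyarray__", None)
    if hook is not None:
        return hook(obj)
    if isinstance(obj, np.ndarray) and type(obj) is not np.ndarray:
        # Deprecation phase: warn, but keep today's pass-through behaviour.
        warnings.warn(
            f"{type(obj).__name__} does not define __asanyarray__; "
            "it will be converted to a base ndarray in the future",
            FutureWarning,
            stacklevel=2,
        )
        return obj
    return np.asarray(obj)

class OptedIn(np.ndarray):
    def __asanyarray__(self):
        return self

class Legacy(np.ndarray):
    pass

opted = np.arange(3).view(OptedIn)
legacy = np.arange(3).view(Legacy)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out_opted = asanyarray_opt_in(opted)    # silent
    out_legacy = asanyarray_opt_in(legacy)  # warns
    out_list = asanyarray_opt_in([1, 2, 3]) # ordinary coercion
```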

Nathaniel J. Smith -- https://vorpus.org


On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith < [hidden email]> wrote:
> But matrix isn't the only problem with asanyarray. np.ma also violates
> Liskov. No doubt there are other problematic ndarray subclasses out
> there too...

Please forgive my ignorance (I don't really use mask arrays), but how specifically do masked arrays violate Liskov? In most cases shouldn't they work the same as base numpy arrays, except with operations keeping track of masks?

I'm sure there are some cases where masked arrays have different semantics than NumPy arrays, but are any of these intentional?

I would guess that the worst current violation is that there is a risk of losing mask information in some operations, but implementing __array_function__ would presumably make it possible to fix most of these.
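To make the __array_function__ idea concrete, here is a minimal sketch of a duck array whose mask survives np.concatenate. MaskedDuck is a hypothetical class, not np.ma.MaskedArray, and the example assumes a NumPy version in which the __array_function__ protocol (NEP 18) is enabled; only one function is handled.

```python
import numpy as np

class MaskedDuck:
    """Hypothetical duck array carrying a boolean mask alongside its data."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        # Only np.concatenate is handled in this sketch.
        if func is np.concatenate:
            arrays = args[0]
            data = np.concatenate([x.data for x in arrays], **kwargs)
            mask = np.concatenate([x.mask for x in arrays], **kwargs)
            return MaskedDuck(data, mask)
        return NotImplemented

a = MaskedDuck([1, 2], [False, True])
b = MaskedDuck([3, 4], [False, False])

# NumPy dispatches to MaskedDuck.__array_function__, so the mask is kept.
joined = np.concatenate([a, b])

print(joined.data.tolist(), joined.mask.tolist())
```

Any function the class does not handle returns NotImplemented, which NumPy turns into a TypeError instead of silently dropping the mask.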


On Fri, Nov 9, 2018 at 4:59 PM, Stephan Hoyer < [hidden email]> wrote:
> On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith < [hidden email]> wrote:
>>
>> But matrix isn't the only problem with asanyarray. np.ma also violates
>> Liskov. No doubt there are other problematic ndarray subclasses out
>> there too...
>
>
> Please forgive my ignorance (I don't really use mask arrays), but how
> specifically do masked arrays violate Liskov? In most cases shouldn't they
> work the same as base numpy arrays, except with operations keeping track of
> masks?
Since many operations silently skip over masked values, the
computation semantics are different. For example, in a regular array,
sum()/size() == mean(), but with a masked array these are totally
different operations. So if you have code that was written for regular
arrays, but pass in a masked array, there's a solid chance that it
will silently return nonsensical results.
(This is why it's better for NAs to propagate by default.)
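The sum()/size versus mean() point is easy to reproduce. The helper naive_mean below is a made-up example of code "written for regular arrays"; the data values are arbitrary.

```python
import numpy as np

def naive_mean(arr):
    # Written with plain ndarrays in mind: total divided by element count.
    return arr.sum() / arr.size

plain = np.array([1.0, 2.0, 3.0, 4.0])
masked = np.ma.masked_array(plain, mask=[False, False, True, False])

plain_naive, plain_mean = naive_mean(plain), plain.mean()
masked_naive, masked_mean = naive_mean(masked), masked.mean()

print(plain_naive, plain_mean)    # identical for a regular array
print(masked_naive, masked_mean)  # differ: sum() skips the masked 3.0, size does not
```

For the masked input, sum() adds only the three unmasked values while size still counts four elements, so the naive helper and mean() disagree without any error being raised.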
n

Nathaniel J. Smith -- https://vorpus.org


On 9/11/18 5:09 pm, Nathaniel Smith wrote:
> On Fri, Nov 9, 2018 at 4:59 PM, Stephan Hoyer < [hidden email]> wrote:
>> On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith < [hidden email]> wrote:
>>> But matrix isn't the only problem with asanyarray. np.ma also violates
>>> Liskov. No doubt there are other problematic ndarray subclasses out
>>> there too...
>>
>>
>> Please forgive my ignorance (I don't really use mask arrays), but how
>> specifically do masked arrays violate Liskov? In most cases shouldn't they
>> work the same as base numpy arrays, except with operations keeping track of
>> masks?
> Since many operations silently skip over masked values, the
> computation semantics are different. For example, in a regular array,
> sum()/size() == mean(), but with a masked array these are totally
> different operations. So if you have code that was written for regular
> arrays, but pass in a masked array, there's a solid chance that it
> will silently return nonsensical results.
>
> (This is why it's better for NAs to propagate by default.)
>
> n
Echoes of the discussions in NEPs 12, 24, 25, 26. http://www.numpy.org/neps

Matti


Hi Hameer,
I do not think we should change `asanyarray` itself to special-case
matrix; rather, we could start converting `asarray` to `asanyarray` and
solve the problems that produces for matrices in `matrix` itself (e.g., by
overriding the relevant function with `__array_function__`).
I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
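One way to picture the `__anyarray__` idea is sketched below. Since methods cannot be added to `ndarray` from Python, the "default returns `self`" part is emulated inside a hypothetical helper `anyarray()`, and the `matrix` override lives on an illustrative subclass `PatchedMatrix`. None of these names exist in NumPy.

```python
import warnings
import numpy as np

# Recent NumPy warns when a matrix is created; silence that for the demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

def anyarray(obj):
    """Hypothetical coercion function that consults __anyarray__ first."""
    hook = getattr(type(obj), "__anyarray__", None)
    if hook is not None:
        return hook(obj)
    if isinstance(obj, np.ndarray):
        # Stand-in for the proposed ndarray default: return self.
        return obj
    return np.asarray(obj)

class PatchedMatrix(np.matrix):
    """np.matrix with the proposed override added for illustration."""

    def __anyarray__(self):
        return self.view(type=np.ndarray)

class WellBehaved(np.ndarray):
    """A subclass that keeps ndarray semantics and inherits the default."""

m = PatchedMatrix([[1, 2], [3, 4]])
w = np.arange(4).view(WellBehaved)

from_matrix = anyarray(m)    # base ndarray view of the matrix data
from_subclass = anyarray(w)  # passed through unchanged
```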
All the best,
Marten


On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk < [hidden email]> wrote:
> Hi Hameer,
>
> I do not think we should change `asanyarray` itself to special-case
> matrix; rather, we could start converting `asarray` to `asanyarray` and
> solve the problems that produces for matrices in `matrix` itself (e.g., by
> overriding the relevant function with `__array_function__`).
>
> I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.

Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.
Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).
Changing the default aggregation behavior for the existing MaskedArray is also an option but that would be a serious annoyance to users and backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.anyarray() or any new coercion function throughout the internal NumPy code base.
To summarize, I think these are our options:

1. Change the behavior of np.anyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.

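The skipna transition mentioned under option 1 might look roughly like the following. This is a pure-Python sketch under stated assumptions: masked_sum is a hypothetical function and not part of np.ma, NaN stands in for a propagating NA, and the warning wording is invented.

```python
import math
import warnings

_UNSET = object()  # sentinel: the caller did not pass skipna

def masked_sum(values, mask, skipna=_UNSET):
    """Hypothetical aggregation over values with a parallel boolean mask."""
    if skipna is _UNSET:
        if any(mask):
            warnings.warn(
                "masked values are skipped by default for now; "
                "the default will become skipna=False",
                FutureWarning,
                stacklevel=2,
            )
        skipna = True  # today's behaviour
    if skipna:
        return sum(v for v, hidden in zip(values, mask) if not hidden)
    # skipna=False: a masked value propagates (NaN stands in for NA here).
    return math.nan if any(mask) else sum(values)

values = [1.0, 2.0, 3.0, 4.0]
mask = [False, False, True, False]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_result = masked_sum(values, mask)             # warns
    skipping = masked_sum(values, mask, skipna=True)      # silent
    propagating = masked_sum(values, mask, skipna=False)  # silent
```

Passing skipna explicitly is what silences the warning, which is why the argument has to be added before the default can change.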
P.S. I'm just glad pandas stopped subclassing ndarray a while ago - there's no way pandas.Series() could be fixed up to not violate Liskov :).


> If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable

One of the ways to fix these Liskov substitution problems is just to introduce more base classes - for instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact Liskov substitution, but np.matrix would not.
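A small sketch of the `NDContainer` idea follows. The class is hypothetical, virtual registration stands in for real inheritance, and the "contract" checked here (indexing with one integer removes exactly one dimension) is an assumption about what "only slicing support" would promise.

```python
import warnings
from abc import ABC
import numpy as np

# Recent NumPy warns when a matrix is created; silence that for the demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

class NDContainer(ABC):
    """Hypothetical base class that promises shape and slicing, nothing else."""

# Virtual registration stands in for real inheritance in this sketch.
NDContainer.register(np.ndarray)

def honours_slicing_contract(x):
    # The only promise: indexing with one integer removes one dimension.
    return x[0].ndim == x.ndim - 1

plain = np.arange(6).reshape(2, 3)
masked = np.ma.masked_array(plain, mask=plain > 3)
matrix = np.matrix(plain)

for name, obj in [("ndarray", plain), ("masked", masked), ("matrix", matrix)]:
    print(name, isinstance(obj, NDContainer), honours_slicing_contract(obj))
```

Under this narrow contract the masked array substitutes cleanly for an ndarray, while the matrix does not because matrix[0] is still two-dimensional.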
Eric


On Saturday, Nov 10, 2018 at 9:16 PM, Stephan Hoyer < [hidden email]> wrote: On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk < [hidden email]> wrote: Hi Hameer,
I do not think we should change `asanyarray` itself to specialcase matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).
I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.
Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).
Changing the default aggregation behavior for the existing MaskedArray is also an option but that would be a serious annoyance to users and backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.anyarray() or any new coercion function throughout the internal NumPy code base.
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue with PyData/Sparse depending on recent-ish NumPy versions; it’ll need a lot of the recent protocols anyway. I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
P.S. I'm just glad pandas stopped subclassing ndarray a while ago; there's no way pandas.Series() could be fixed up to not violate Liskov :).


I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.
I don't know how people who currently use MaskedArray would feel about that. I would love to hear their thoughts.


Might be good to try before worrying too much: MaskedArray already overrides *a lot*; it is not at all obvious to me that things wouldn't "just work" if we bulk-replaced `asarray` with `asanyarray`. And with `__array_function__` we now have the option to fix code paths that do not work immediately.
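For readers less familiar with the difference between the two coercion functions, here is a small illustration using only existing NumPy calls. The sample values are made up for this sketch:

```python
import numpy as np

# Second element is masked.
m = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

as_base = np.asarray(m)     # plain ndarray: the mask is dropped
as_any = np.asanyarray(m)   # subclass passes through: still a MaskedArray

base_sum = as_base.sum()    # 6, the masked value is counted
kept_sum = as_any.sum()     # 4, the masked value is skipped
```

This is the behavioural gap that a bulk replacement of `asarray` with `asanyarray` would expose inside NumPy functions.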
-- Marten


> If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable
One of the ways to fix these Liskov substitution problems is just to introduce more base classes. For instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact Liskov substitution, but np.matrix would not.
Eric
I've had the same thought and wouldn't be surprised if others have considered that possibility. Travis would be a good guy to ask about that.
<snip>
Chuck


On 2018/11/10 12:39 PM, Stephan Hoyer wrote:
> On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi < [hidden email]
> <mailto: [hidden email]>> wrote:
>
> To summarize, I think these are our options:
>
> 1. Change the behavior of np.asanyarray() to check for an
> __anyarray__() protocol. Change np.matrix.__anyarray__() to
> return a base numpy array (this is a minor backwards
> compatibility break, but probably for the best). Start issuing a
> FutureWarning for any MaskedArray operations that violate Liskov
> and add a skipna argument that in the future will default to
> skipna=False.
>
> 2. Introduce a new coercion function, e.g., np.duckarray(). This
> is the easiest option because we don't need to cleanup NumPy's
> existing ndarray subclasses.
>
>
> My vote is still for 1. I don’t have an issue for PyData/Sparse
> depending on recentish NumPy versions — It’ll need a lot of the
> recent protocols anyway, although I could be convinced otherwise if
> major package devs (scikits, SciPy, Dask) were to weigh in and say
> they’ll jump on it (which seems unlikely given SciPy’s policy to
> support old NumPy versions).
>
>
> I agree that option (1) is fine for PyData/sparse. The bigger issue is
> that this change should be conditional on making breaking changes (at
> least raising FutureWarning for now) to np.ma.MaskedArray.
>
> I don't know how people who currently use MaskedArray would feel about
> that. I would love to hear their thoughts.
Thank you. I am a user of masked arrays, and have been since pre-numpy
days. I introduced their extensive use in matplotlib long ago. I have
been a bit concerned, indeed, that all of the discussion of modifying
masked arrays seems to be by people who don't actually use them
explicitly (though they might be using them without knowing it via
internal operations in matplotlib, or they might be quickly getting rid
of them after they are yielded by netCDF4.Dataset()).
I think that those of us who do use masked arrays recognize that they
are not perfect; they have some quirks and gotchas, and one has to be
careful to use numpy.ma functions instead of numpy functions in most
cases. But we use them because they have real advantages over the
alternatives, which are using nans and/or manually tracking independent
masks throughout calculations. These advantages are largely because
masked values *don't* behave like nan, *don't* propagate. This is
fundamental to the design, and motivated by real-life use cases.
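A short illustration of that difference, using only existing NumPy calls. The sample values are made up for this sketch:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Same "missing" third element, expressed two ways.
masked = np.ma.masked_array(data, mask=[False, False, True, False])
with_nan = data.copy()
with_nan[2] = np.nan

masked_mean = masked.mean()           # masked value is skipped: (1 + 2 + 4) / 3
nan_mean = with_nan.mean()            # nan propagates: result is nan
nan_skipping = np.nanmean(with_nan)   # skipping nan needs an explicit function
```

With masked arrays the skipping behaviour is the default, whereas with nan it has to be requested call by call.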
The proposal to add a skipna kwarg to MaskedArray looks to me like it is
giving purity priority over practicality. It will force ma users to
insert skipna kwargs all over the place, because the default will be
contrary to the primary purposes of using masked arrays, in most cases.
How many people will it actually benefit? How many people are being
bitten, and how badly, by masked array behavior?
If there were a prospect of truly integrating missing/masked value
handling into numpy, simplifying or phasing out numpy.ma, I would be
delighted; I think it is the biggest single fundamental improvement that
could be made, from the user's standpoint. I was sad to see Mark
Wiebe's work in that direction come to grief.
If there are ways of gradually improving numpy.ma and its
interoperability with the rest of numpy and with the proliferation of
duck arrays, I'm all in favor, so long as they don't effectively wreck
numpy.ma for its present intended purposes.
Eric


Hi Eric,
Thanks very much for the detailed response; it is good to be reminded that `MaskedArray` is used in a package that, indeed, (nearly?) all of us use!
But I do think that those of us who have been trying to change MaskedArray are generally good at making sure the tests continue to pass, i.e., that the behaviour does not change (the main exception in the last few years was that views should be taken of masks too, not just the data).
I also think that between __array_ufunc__ and __array_function__, it has become quite easy to ensure that one no longer has to rely on `np.ma` functions, i.e., that the regular numpy functions will do the right thing. But it will need work to actually implement that.
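As a rough sketch of what that could look like, the toy class below overrides `np.sum` through `__array_function__` so that the regular NumPy function skips masked values. The class name `SimpleMasked` is hypothetical, and this is not how `np.ma` is implemented. It assumes a NumPy version in which `__array_function__` dispatch is enabled by default (1.17 or later):

```python
import numpy as np


class SimpleMasked:
    """Hypothetical duck array: data plus a boolean mask (True means masked)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data, dtype=float)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            # Only unmasked entries contribute to the total.
            return self.data[~self.mask].sum()
        # Anything not handled explicitly is rejected by NumPy with a TypeError.
        return NotImplemented


x = SimpleMasked([1, 2, 3], [False, True, False])
total = np.sum(x)  # 4.0, computed by SimpleMasked rather than by ndarray code
```

Each additional function has to be handled explicitly in the same way, which is the implementation work referred to above.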
All the best,
Marten


Eric, thank you for sharing your perspective! I guess it should not be surprising that the semantics of MaskedArray intentionally deviate from the semantics of base NumPy arrays.
This deviation is fortunately less severe than the deviations in the behavior of np.matrix, but it still presents some difficulties for duck typing. We're in a position to reduce (but still not eliminate) these differences with new protocols like __array_function__.
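To make the np.matrix deviations concrete, here is a small illustration using only existing NumPy calls. The sample values are made up, and the warning filter is only there because constructing an np.matrix emits a PendingDeprecationWarning:

```python
import warnings

import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    mat = np.matrix([[1, 2], [3, 4]])

arr = mat.view(np.ndarray)   # same memory, base ndarray semantics

mat_product = mat * mat      # matrix product: [[7, 10], [15, 22]]
arr_product = arr * arr      # elementwise product: [[1, 4], [9, 16]]

mat_row = mat[0]             # indexing keeps two dimensions: shape (1, 2)
arr_row = arr[0]             # indexing drops a dimension: shape (2,)
```

Code written against ndarray semantics gets different results for both the `*` operator and indexing, which is why a coercion function that promises a "full duck array" cannot pass np.matrix through unchanged.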
I think Nathaniel actually summarized these issues pretty well in NEP 16 (http://www.numpy.org/neps/nep-0016-abstract-array.html). If we want a coercion function that guarantees an object is a "full duck array", then it can't pass on either np.matrix or MaskedArray in their current state. Anything less than full compatibility provides a shaky foundation for use in downstream projects or inside NumPy itself. In theory (certainly if we were starting from scratch) it would make sense to make asabstractarray() pass on any ndarray subclass, but this would require willingness to make breaking changes to both np.matrix and MaskedArray.
I would suggest adopting a variation of the proposal in NEP 16, except using a protocol rather than an abstract base class per NEP 22, e.g.,

    # names still to be determined
    from numpy import asarray

    def asabstractarray(array, dtype):
        if hasattr(array, '__abstractarray__'):
            # bound method call: the array itself is passed implicitly as self
            return array.__abstractarray__(dtype=dtype)
        return asarray(array, dtype)

