new MaskedArray class

classic Classic list List threaded Threaded
38 messages Options
12
Reply | Threaded
Open this post in threaded view
|

new MaskedArray class

Allan Haldane
Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using
`__array_function__`, which I got mostly working. It seems like a good
time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a
pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
What is missing? It needs even more tests for new functionality, and a
couple numpy-API functions are missing, in particular `np.median`,
`np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying
the logic by always storing the mask in full, it also has new features.
For instance it allows the use of a "X" variable to mark masked
locations during array construction, and I solve the issue of how to
mask individual fields of a structured array differently.

At this point I would by happy to get some feedback on the design and
what seems good or bad. If it seems like a good start, I'd be happy to
move it into a numpy repo of some sort for further collaboration &
discussion, and maybe into 1.18. At the least I hope it can serve as a
design study of what we could do.





Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays,
and similarly about field-assignment of structured arrays. The problem
is that we have a single mask bool per element of a complex array, but
what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using
`__array_function__` which I think many ducktype implementors will have
to face. For MaskedArray, I created an associated "MaskedScalar" type.
However, MaskedScalar has to behave differently from normal numpy
scalars in a number of ways: It is not part of the numpy scalar
hierarchy, it fails checks `isinstance(var, np.floating)`, and
np.isscalar returns false. Numpy scalar types cannot be subclassed. We
have discussed before the need to have distinction between 0d-arrays and
scalars, so we shouldn't just use a 0d (in fact, this makes printing
very difficult). This leads me to think that in future dtype-overhaul
plans, we should consider creating a subclassable `np.scalar` base type
to wrap all numpy scalar variables, and code like `isinstance(var,
np.floating)` should be replaced by `isinstance(var.dtype.type,
np.floating)` or similar. That is, the numeric dtype of the scalar is no
longer encoded in `type(var)` but in `var.dtype`: The fact that the
variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars
in common with arrays which have nothing to do with the dtype, which
ducktype implementors want to touch. I imagine this will come up a lot:
In that repo I also have an "ArrayCollection" ducktype which required a
"CollectionScalar" scalar, and similarly I imagine people implementing
units want the units attached to the scalar, independently of the dtype.

Cheers,
Allan




_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk
Hi Allen,

Thanks for the message and link! In astropy, we've been struggling with masking a lot, and one of the main conclusions I have reached is that ideally one has a more abstract `Masked` class that can take any type of data (including `ndarray`, of course), and behaves like that data as much as possible, to the extent that if, e.g., I create a `Masked(Quantity(..., unit), mask=...)`, the instance will have a `.unit` attribute and perhaps even `isinstance(..., Quantity)` will hold. And similarly for `Masked(Time(...), mask=...)`, `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a kind of Mixin-class that just tracks a mask attribute.

This may be too much to ask from the initializer, but, if so, it still seems most useful if it is made as easy as possible to do, say, `class MaskedQuantity(Masked, Quantity): <very few overrides>`.

Even if this impossible, I think it is conceptually useful to think about what the masking class should do. My sense is that, e.g., it should not attempt to decide when an operation succeeds or not, but just "or together" input masks for regular, multiple-input functions, and let the underlying arrays skip elements for reductions by using `where` (hey, I did implement that for a reason... ;-). In particular, it suggests one should not have things like domains and all that (I never understood why `MaskedArray` did that). If one wants more, the class should provide a method that updates the mask (a sensible default might be `mask |= ~np.isfinite(result)` - here, the class being masked should logically support ufuncs and functions, so it can decide what "isfinite" means).

In any case, I would think that a basic truth should be that everything has a mask with a shape consistent with the data, so
1. Each complex numbers has just one mask, and setting `a.imag` with a masked array should definitely propagate the mask.
2. For a masked array with structured dtype, I'd similarly say that the default is for a mask to have the same shape as the array. But that something like your collection makes sense for the case where one wants to mask items in a structure.

All the best,

Marten

p.s. I started trying to implement the above "Mixin" class; will try to clean that up a bit so that at least it uses `where` and push it up.



On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane <[hidden email]> wrote:
Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using
`__array_function__`, which I got mostly working. It seems like a good
time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a
pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
What is missing? It needs even more tests for new functionality, and a
couple numpy-API functions are missing, in particular `np.median`,
`np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying
the logic by always storing the mask in full, it also has new features.
For instance it allows the use of a "X" variable to mark masked
locations during array construction, and I solve the issue of how to
mask individual fields of a structured array differently.

At this point I would by happy to get some feedback on the design and
what seems good or bad. If it seems like a good start, I'd be happy to
move it into a numpy repo of some sort for further collaboration &
discussion, and maybe into 1.18. At the least I hope it can serve as a
design study of what we could do.





Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays,
and similarly about field-assignment of structured arrays. The problem
is that we have a single mask bool per element of a complex array, but
what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using
`__array_function__` which I think many ducktype implementors will have
to face. For MaskedArray, I created an associated "MaskedScalar" type.
However, MaskedScalar has to behave differently from normal numpy
scalars in a number of ways: It is not part of the numpy scalar
hierarchy, it fails checks `isinstance(var, np.floating)`, and
np.isscalar returns false. Numpy scalar types cannot be subclassed. We
have discussed before the need to have distinction between 0d-arrays and
scalars, so we shouldn't just use a 0d (in fact, this makes printing
very difficult). This leads me to think that in future dtype-overhaul
plans, we should consider creating a subclassable `np.scalar` base type
to wrap all numpy scalar variables, and code like `isinstance(var,
np.floating)` should be replaced by `isinstance(var.dtype.type,
np.floating)` or similar. That is, the numeric dtype of the scalar is no
longer encoded in `type(var)` but in `var.dtype`: The fact that the
variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars
in common with arrays which have nothing to do with the dtype, which
ducktype implementors want to touch. I imagine this will come up a lot:
In that repo I also have an "ArrayCollection" ducktype which required a
"CollectionScalar" scalar, and similarly I imagine people implementing
units want the units attached to the scalar, independently of the dtype.

Cheers,
Allan




_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Allan Haldane
On 6/18/19 10:06 AM, Marten van Kerkwijk wrote:

> Hi Allen,
>
> Thanks for the message and link! In astropy, we've been struggling with
> masking a lot, and one of the main conclusions I have reached is that
> ideally one has a more abstract `Masked` class that can take any type of
> data (including `ndarray`, of course), and behaves like that data as
> much as possible, to the extent that if, e.g., I create a
> `Masked(Quantity(..., unit), mask=...)`, the instance will have a
> `.unit` attribute and perhaps even `isinstance(..., Quantity)` will
> hold. And similarly for `Masked(Time(...), mask=...)`,
> `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a
> kind of Mixin-class that just tracks a mask attribute.
>
> This may be too much to ask from the initializer, but, if so, it still
> seems most useful if it is made as easy as possible to do, say, `class
> MaskedQuantity(Masked, Quantity): <very few overrides>`.
Currently MaskedArray does not accept ducktypes as underlying arrays,
but I think it shouldn't be too hard to modify it to do so. Good idea!

I already partly navigated this mixin-issue in the
"MaskedArrayCollection" class, which essentially does
ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
boilerplate. That's the backwards encapsulation order from what you want
though.

> Even if this impossible, I think it is conceptually useful to think
> about what the masking class should do. My sense is that, e.g., it
> should not attempt to decide when an operation succeeds or not, but just
> "or together" input masks for regular, multiple-input functions, and let
> the underlying arrays skip elements for reductions by using `where`
> (hey, I did implement that for a reason... ;-). In particular, it
> suggests one should not have things like domains and all that (I never
> understood why `MaskedArray` did that). If one wants more, the class
> should provide a method that updates the mask (a sensible default might
> be `mask |= ~np.isfinite(result)` - here, the class being masked should
> logically support ufuncs and functions, so it can decide what "isfinite"
> means).
I agree it would be nice to remove domains. It would make life easier,
and I could remove a lot of twiddly code! I kept it in for now to
minimize the behavior changes from the old MaskedArray.

> In any case, I would think that a basic truth should be that everything
> has a mask with a shape consistent with the data, so
> 1. Each complex numbers has just one mask, and setting `a.imag` with a
> masked array should definitely propagate the mask.
> 2. For a masked array with structured dtype, I'd similarly say that the
> default is for a mask to have the same shape as the array. But that
> something like your collection makes sense for the case where one wants
> to mask items in a structure.

Agreed that we should have a single bool per complex or structured
element, and the mask shape is the same as the array shape. That's how I
implemented it. But there is still a problem with complex.imag assignment:

    >>> a = MaskedArray([1j, 2, X])
    >>> i = a.imag
    >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what
should the real part of a[2] be? Conversely, if we don't copy the mask,
what should the imag part of a[1] be? It seems like we might "want" the
masks to be OR'd instead, but then should i[2] be masked after we just
set it to 1?


> All the best,
>
> Marten
>
> p.s. I started trying to implement the above "Mixin" class; will try to
> clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released.
To avoid duplication of effort, I've attached a diff of what I tried. I
actually get a slight slowdown of about 10% by using where...

If you make progress with the mixin, a push is welcome. I imagine a
problem is going to be that np.isscalar doesn't work to detect duck scalars.

Cheers,
Allan




> On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi all,
>
>     Chuck suggested we think about a MaskedArray replacement for 1.18.
>
>     A few months ago I did some work on a MaskedArray replacement using
>     `__array_function__`, which I got mostly working. It seems like a good
>     time to bring it up for discussion now. See it at:
>
>     https://github.com/ahaldane/ndarray_ducktypes
>
>     It should be very usable, it has docs you can read, and it passes a
>     pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
>     What is missing? It needs even more tests for new functionality, and a
>     couple numpy-API functions are missing, in particular `np.median`,
>     `np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
>     also find many things to improve too.
>
>     Besides fixing many little annoyances from MaskedArray, and simplifying
>     the logic by always storing the mask in full, it also has new features.
>     For instance it allows the use of a "X" variable to mark masked
>     locations during array construction, and I solve the issue of how to
>     mask individual fields of a structured array differently.
>
>     At this point I would by happy to get some feedback on the design and
>     what seems good or bad. If it seems like a good start, I'd be happy to
>     move it into a numpy repo of some sort for further collaboration &
>     discussion, and maybe into 1.18. At the least I hope it can serve as a
>     design study of what we could do.
>
>
>
>
>
>     Let me also drop here two more interesting detailed issues:
>
>     First, the issue of what to do about .real and .imag of complex arrays,
>     and similarly about field-assignment of structured arrays. The problem
>     is that we have a single mask bool per element of a complex array, but
>     what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
>     mask of the original array change? Should we make .real and .imag
>     readonly?
>
>     Second, a more general issue of how to ducktype scalars when using
>     `__array_function__` which I think many ducktype implementors will have
>     to face. For MaskedArray, I created an associated "MaskedScalar" type.
>     However, MaskedScalar has to behave differently from normal numpy
>     scalars in a number of ways: It is not part of the numpy scalar
>     hierarchy, it fails checks `isinstance(var, np.floating)`, and
>     np.isscalar returns false. Numpy scalar types cannot be subclassed. We
>     have discussed before the need to have distinction between 0d-arrays and
>     scalars, so we shouldn't just use a 0d (in fact, this makes printing
>     very difficult). This leads me to think that in future dtype-overhaul
>     plans, we should consider creating a subclassable `np.scalar` base type
>     to wrap all numpy scalar variables, and code like `isinstance(var,
>     np.floating)` should be replaced by `isinstance(var.dtype.type,
>     np.floating)` or similar. That is, the numeric dtype of the scalar is no
>     longer encoded in `type(var)` but in `var.dtype`: The fact that the
>     variable is a numpy scalar is decoupled from its numeric dtype.
>
>     This is useful because there are many "associated" properties of scalars
>     in common with arrays which have nothing to do with the dtype, which
>     ducktype implementors want to touch. I imagine this will come up a lot:
>     In that repo I also have an "ArrayCollection" ducktype which required a
>     "CollectionScalar" scalar, and similarly I imagine people implementing
>     units want the units attached to the scalar, independently of the dtype.
>
>     Cheers,
>     Allan
>
>
>
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk


On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <[hidden email]> wrote:
<snip>
> This may be too much to ask from the initializer, but, if so, it still
> seems most useful if it is made as easy as possible to do, say, `class
> MaskedQuantity(Masked, Quantity): <very few overrides>`.

Currently MaskedArray does not accept ducktypes as underlying arrays,
but I think it shouldn't be too hard to modify it to do so. Good idea!

Looking back at my trial, I see that I also never got to duck arrays - only ndarray subclasses - though I tried to make the code as agnostic as possible.


I already partly navigated this mixin-issue in the
"MaskedArrayCollection" class, which essentially does
ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
boilerplate. That's the backwards encapsulation order from what you want
though.

Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, mask=[True, False, False])` does indeed not have a `.unit` attribute (and cannot represent itself...); I'm not at all sure that my method of just creating a mixed class is anything but a recipe for disaster, though!
 
> Even if this impossible, I think it is conceptually useful to think
> about what the masking class should do. My sense is that, e.g., it
> should not attempt to decide when an operation succeeds or not, but just
> "or together" input masks for regular, multiple-input functions, and let
> the underlying arrays skip elements for reductions by using `where`
> (hey, I did implement that for a reason... ;-). In particular, it
> suggests one should not have things like domains and all that (I never
> understood why `MaskedArray` did that). If one wants more, the class
> should provide a method that updates the mask (a sensible default might
> be `mask |= ~np.isfinite(result)` - here, the class being masked should
> logically support ufuncs and functions, so it can decide what "isfinite"
> means).

I agree it would be nice to remove domains. It would make life easier,
and I could remove a lot of twiddly code! I kept it in for now to
minimize the behavior changes from the old MaskedArray.

That makes sense. Could be separated out to a backwards-compatibility class later.


> In any case, I would think that a basic truth should be that everything
> has a mask with a shape consistent with the data, so
> 1. Each complex numbers has just one mask, and setting `a.imag` with a
> masked array should definitely propagate the mask.
> 2. For a masked array with structured dtype, I'd similarly say that the
> default is for a mask to have the same shape as the array. But that
> something like your collection makes sense for the case where one wants
> to mask items in a structure.

Agreed that we should have a single bool per complex or structured
element, and the mask shape is the same as the array shape. That's how I
implemented it. But there is still a problem with complex.imag assignment:

    >>> a = MaskedArray([1j, 2, X])
    >>> i = a.imag
    >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what
should the real part of a[2] be? Conversely, if we don't copy the mask,
what should the imag part of a[1] be? It seems like we might "want" the
masks to be OR'd instead, but then should i[2] be masked after we just
set it to 1?

Ah, I see the issue now... Easiest to implement and closest in analogy to a regular view would be to just let it unmask a[2] (with whatever is in real; user beware!).

Perhaps better would be to special-case such that `imag` returns a read-only view of the mask. Making `imag` itself read-only would prevent possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is no reason this should update the mask.

Still, neither is really satisfactory...
 

> p.s. I started trying to implement the above "Mixin" class; will try to
> clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released.
To avoid duplication of effort, I've attached a diff of what I tried. I
actually get a slight slowdown of about 10% by using where...

Your implementation is indeed quite similar to what I got in __array_ufunc__ (though one should "&" the where with ~mask).

I think the main benefit is not to presume that whatever is underneath understands 0 or 1, i.e., avoid filling.


If you make progress with the mixin, a push is welcome. I imagine a
problem is going to be that np.isscalar doesn't work to detect duck scalars.

I fear that in my attempts I've simply decided that only array scalars exist...

-- Marten

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Allan Haldane
On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:

>
>
> On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
> <snip>
>
>     > This may be too much to ask from the initializer, but, if so, it still
>     > seems most useful if it is made as easy as possible to do, say, `class
>     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>
>     Currently MaskedArray does not accept ducktypes as underlying arrays,
>     but I think it shouldn't be too hard to modify it to do so. Good idea!
>
>
> Looking back at my trial, I see that I also never got to duck arrays -
> only ndarray subclasses - though I tried to make the code as agnostic as
> possible.
>
> (Trial at
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>
>     I already partly navigated this mixin-issue in the
>     "MaskedArrayCollection" class, which essentially does
>     ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
>     boilerplate. That's the backwards encapsulation order from what you want
>     though.
>
>
> Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
> mask=[True, False, False])` does indeed not have a `.unit` attribute
> (and cannot represent itself...); I'm not at all sure that my method of
> just creating a mixed class is anything but a recipe for disaster, though!
Based on your suggestion I worked on this a little today, and now my
MaskedArray more easily encapsulates both ducktypes and ndarray
subclasses (pushed to repo). Here's an example I got working with masked
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.])
[6]: uarr + 1*m

MaskedArray([1.001, X    , 3.001])
[7]: uarr.filled()

unyt_array([1., 0., 3.], 'km')
[8]: np.concatenate([uarr, 2*uarr]).filled()
unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's
indexing rules, including distinguishing 0d arrays from scalars. For now
only I used unyt in the example above since it happens to be less strict
 about dimensionless operations than astropy.units which trips up my
repr code. (see below for example with astropy.units). Note in the last
line I lost the dimensions, but that is because unyt does not handle
np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the
MaskedArray, and it doesn't print the units in the repr. But you can
access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the
encapsulated array, that seems a bit dangerous/bug-prone at first
glance, so probably I want to require the user to make a MaskedArray
subclass to do so. I've just started playing with that (probably buggy),
and Ive attached subclass examples for astropy.unit and unyt, with some
example output below.

Cheers,
Allan



Example using the attached astropy unit subclass:

    >>> from astropy.units import m, km, s
    >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
    >>> uarr
    MaskedQ([1., X , 1.], units=km)
    >>> uarr.units
    km
    >>> uarr + (1*m)
    MaskedQ([1.001, X    , 1.001], units=km)
    >>> uarr/(1*s)
    MaskedQ([1., X , 1.], units=km / s)
    >>> (uarr*(1*m))[1:]
    MaskedQ([X , 1.], units=km m)
    >>> np.add.outer(uarr, uarr)
    MaskedQ([[2., X , 2.],
             [X , X , X ],
             [2., X , 2.]], units=km)
    >>> print(uarr)
    [1. X  1.] km m

Cheers,
Allan


>     > Even if this impossible, I think it is conceptually useful to think
>     > about what the masking class should do. My sense is that, e.g., it
>     > should not attempt to decide when an operation succeeds or not,
>     but just
>     > "or together" input masks for regular, multiple-input functions,
>     and let
>     > the underlying arrays skip elements for reductions by using `where`
>     > (hey, I did implement that for a reason... ;-). In particular, it
>     > suggests one should not have things like domains and all that (I never
>     > understood why `MaskedArray` did that). If one wants more, the class
>     > should provide a method that updates the mask (a sensible default
>     might
>     > be `mask |= ~np.isfinite(result)` - here, the class being masked
>     should
>     > logically support ufuncs and functions, so it can decide what
>     "isfinite"
>     > means).
>
>     I agree it would be nice to remove domains. It would make life easier,
>     and I could remove a lot of twiddly code! I kept it in for now to
>     minimize the behavior changes from the old MaskedArray.
>
>
> That makes sense. Could be separated out to a backwards-compatibility
> class later.
>
>
>     > In any case, I would think that a basic truth should be that
>     everything
>     > has a mask with a shape consistent with the data, so
>     > 1. Each complex numbers has just one mask, and setting `a.imag` with a
>     > masked array should definitely propagate the mask.
>     > 2. For a masked array with structured dtype, I'd similarly say
>     that the
>     > default is for a mask to have the same shape as the array. But that
>     > something like your collection makes sense for the case where one
>     wants
>     > to mask items in a structure.
>
>     Agreed that we should have a single bool per complex or structured
>     element, and the mask shape is the same as the array shape. That's how I
>     implemented it. But there is still a problem with complex.imag
>     assignment:
>
>         >>> a = MaskedArray([1j, 2, X])
>         >>> i = a.imag
>         >>> i[:] = MaskedArray([1, X, 1])
>
>     If we make the last line copy the mask to the original array, what
>     should the real part of a[2] be? Conversely, if we don't copy the mask,
>     what should the imag part of a[1] be? It seems like we might "want" the
>     masks to be OR'd instead, but then should i[2] be masked after we just
>     set it to 1?
>
> Ah, I see the issue now... Easiest to implement and closest in analogy
> to a regular view would be to just let it unmask a[2] (with whatever is
> in real; user beware!).
>
> Perhaps better would be to special-case such that `imag` returns a
> read-only view of the mask. Making `imag` itself read-only would prevent
> possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is
> no reason this should update the mask.
>
> Still, neither is really satisfactory...
>  
>
>
>     > p.s. I started trying to implement the above "Mixin" class; will
>     try to
>     > clean that up a bit so that at least it uses `where` and push it up.
>
>     I played with "where", but didn't include it since 1.17 is not released.
>     To avoid duplication of effort, I've attached a diff of what I tried. I
>     actually get a slight slowdown of about 10% by using where...
>
>
> Your implementation is indeed quite similar to what I got in
> __array_ufunc__ (though one should "&" the where with ~mask).
>
> I think the main benefit is not to presume that whatever is underneath
> understands 0 or 1, i.e., avoid filling.
>
>
>     If you make progress with the mixin, a push is welcome. I imagine a
>     problem is going to be that np.isscalar doesn't work to detect duck
>     scalars.
>
> I fear that in my attempts I've simply decided that only array scalars
> exist...
>
> -- Marten
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

test_astrounit.py (1K) Download Attachment
test_maskunyt.py (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk
Hi Allan,

This is very impressive! I could get the tests that I wrote for my class pass with yours using Quantity with what I would consider very minimal changes. I only could not find a good way to unmask data (I like the idea of setting the mask on some elements via `ma[item] = X`); is this on purpose?

Anyway, it would seem easily at the point where I should comment on your repository rather than in the mailing list!

All the best,

Marten


On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <[hidden email]> wrote:
On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>
>
> On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
> <snip>
>
>     > This may be too much to ask from the initializer, but, if so, it still
>     > seems most useful if it is made as easy as possible to do, say, `class
>     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>
>     Currently MaskedArray does not accept ducktypes as underlying arrays,
>     but I think it shouldn't be too hard to modify it to do so. Good idea!
>
>
> Looking back at my trial, I see that I also never got to duck arrays -
> only ndarray subclasses - though I tried to make the code as agnostic as
> possible.
>
> (Trial at
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>
>     I already partly navigated this mixin-issue in the
>     "MaskedArrayCollection" class, which essentially does
>     ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
>     boilerplate. That's the backwards encapsulation order from what you want
>     though.
>
>
> Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
> mask=[True, False, False])` does indeed not have a `.unit` attribute
> (and cannot represent itself...); I'm not at all sure that my method of
> just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my
MaskedArray more easily encapsulates both ducktypes and ndarray
subclasses (pushed to repo). Here's an example I got working with masked
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.])
[6]: uarr + 1*m

MaskedArray([1.001, X    , 3.001])
[7]: uarr.filled()

unyt_array([1., 0., 3.], 'km')
[8]: np.concatenate([uarr, 2*uarr]).filled()
unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's
indexing rules, including distinguishing 0d arrays from scalars. For now
only I used unyt in the example above since it happens to be less strict
 about dimensionless operations than astropy.units which trips up my
repr code. (see below for example with astropy.units). Note in the last
line I lost the dimensions, but that is because unyt does not handle
np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the
MaskedArray, and it doesn't print the units in the repr. But you can
access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the
encapsulated array, that seems a bit dangerous/bug-prone at first
glance, so probably I want to require the user to make a MaskedArray
subclass to do so. I've just started playing with that (probably buggy),
and Ive attached subclass examples for astropy.unit and unyt, with some
example output below.

Cheers,
Allan



Example using the attached astropy unit subclass:

    >>> from astropy.units import m, km, s
    >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
    >>> uarr
    MaskedQ([1., X , 1.], units=km)
    >>> uarr.units
    km
    >>> uarr + (1*m)
    MaskedQ([1.001, X    , 1.001], units=km)
    >>> uarr/(1*s)
    MaskedQ([1., X , 1.], units=km / s)
    >>> (uarr*(1*m))[1:]
    MaskedQ([X , 1.], units=km m)
    >>> np.add.outer(uarr, uarr)
    MaskedQ([[2., X , 2.],
             [X , X , X ],
             [2., X , 2.]], units=km)
    >>> print(uarr)
    [1. X  1.] km m

Cheers,
Allan


>     > Even if this impossible, I think it is conceptually useful to think
>     > about what the masking class should do. My sense is that, e.g., it
>     > should not attempt to decide when an operation succeeds or not,
>     but just
>     > "or together" input masks for regular, multiple-input functions,
>     and let
>     > the underlying arrays skip elements for reductions by using `where`
>     > (hey, I did implement that for a reason... ;-). In particular, it
>     > suggests one should not have things like domains and all that (I never
>     > understood why `MaskedArray` did that). If one wants more, the class
>     > should provide a method that updates the mask (a sensible default
>     might
>     > be `mask |= ~np.isfinite(result)` - here, the class being masked
>     should
>     > logically support ufuncs and functions, so it can decide what
>     "isfinite"
>     > means).
>
>     I agree it would be nice to remove domains. It would make life easier,
>     and I could remove a lot of twiddly code! I kept it in for now to
>     minimize the behavior changes from the old MaskedArray.
>
>
> That makes sense. Could be separated out to a backwards-compatibility
> class later.
>
>
>     > In any case, I would think that a basic truth should be that
>     everything
>     > has a mask with a shape consistent with the data, so
>     > 1. Each complex numbers has just one mask, and setting `a.imag` with a
>     > masked array should definitely propagate the mask.
>     > 2. For a masked array with structured dtype, I'd similarly say
>     that the
>     > default is for a mask to have the same shape as the array. But that
>     > something like your collection makes sense for the case where one
>     wants
>     > to mask items in a structure.
>
>     Agreed that we should have a single bool per complex or structured
>     element, and the mask shape is the same as the array shape. That's how I
>     implemented it. But there is still a problem with complex.imag
>     assignment:
>
>         >>> a = MaskedArray([1j, 2, X])
>         >>> i = a.imag
>         >>> i[:] = MaskedArray([1, X, 1])
>
>     If we make the last line copy the mask to the original array, what
>     should the real part of a[2] be? Conversely, if we don't copy the mask,
>     what should the imag part of a[1] be? It seems like we might "want" the
>     masks to be OR'd instead, but then should i[2] be masked after we just
>     set it to 1?
>
> Ah, I see the issue now... Easiest to implement and closest in analogy
> to a regular view would be to just let it unmask a[2] (with whatever is
> in real; user beware!).
>
> Perhaps better would be to special-case such that `imag` returns a
> read-only view of the mask. Making `imag` itself read-only would prevent
> possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is
> no reason this should update the mask.
>
> Still, neither is really satisfactory...
>  
>
>
>     > p.s. I started trying to implement the above "Mixin" class; will
>     try to
>     > clean that up a bit so that at least it uses `where` and push it up.
>
>     I played with "where", but didn't include it since 1.17 is not released.
>     To avoid duplication of effort, I've attached a diff of what I tried. I
>     actually get a slight slowdown of about 10% by using where...
>
>
> Your implementation is indeed quite similar to what I got in
> __array_ufunc__ (though one should "&" the where with ~mask).
>
> I think the main benefit is not to presume that whatever is underneath
> understands 0 or 1, i.e., avoid filling.
>
>
>     If you make progress with the mixin, a push is welcome. I imagine a
>     problem is going to be that np.isscalar doesn't work to detect duck
>     scalars.
>
> I fear that in my attempts I've simply decided that only array scalars
> exist...
>
> -- Marten
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Allan Haldane
On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
>
> This is very impressive! I could get the tests that I wrote for my class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.


> Anyway, it would seem easily at the point where I should comment on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

Best,
Allan


>
> All the best,
>
> Marten
>
>
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >
>     >
>     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     > <snip>
>     >
>     >     > This may be too much to ask from the initializer, but, if
>     so, it still
>     >     > seems most useful if it is made as easy as possible to do,
>     say, `class
>     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >
>     >     Currently MaskedArray does not accept ducktypes as underlying
>     arrays,
>     >     but I think it shouldn't be too hard to modify it to do so.
>     Good idea!
>     >
>     >
>     > Looking back at my trial, I see that I also never got to duck arrays -
>     > only ndarray subclasses - though I tried to make the code as
>     agnostic as
>     > possible.
>     >
>     > (Trial at
>     >
>     https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >
>     >     I already partly navigated this mixin-issue in the
>     >     "MaskedArrayCollection" class, which essentially does
>     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     lines of
>     >     boilerplate. That's the backwards encapsulation order from
>     what you want
>     >     though.
>     >
>     >
>     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
>     > mask=[True, False, False])` does indeed not have a `.unit` attribute
>     > (and cannot represent itself...); I'm not at all sure that my
>     method of
>     > just creating a mixed class is anything but a recipe for disaster,
>     though!
>
>     Based on your suggestion I worked on this a little today, and now my
>     MaskedArray more easily encapsulates both ducktypes and ndarray
>     subclasses (pushed to repo). Here's an example I got working with masked
>     units using unyt:
>
>     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>
>     [2]: from unyt import m, km
>
>     [3]: import numpy as np
>
>     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>
>     [5]: uarr
>
>     MaskedArray([1., X , 3.])
>     [6]: uarr + 1*m
>
>     MaskedArray([1.001, X    , 3.001])
>     [7]: uarr.filled()
>
>     unyt_array([1., 0., 3.], 'km')
>     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>
>     The catch is the ducktype/subclass has to rigorously follow numpy's
>     indexing rules, including distinguishing 0d arrays from scalars. For now
>     only I used unyt in the example above since it happens to be less strict
>      about dimensionless operations than astropy.units which trips up my
>     repr code. (see below for example with astropy.units). Note in the last
>     line I lost the dimensions, but that is because unyt does not handle
>     np.concatenate. To get that to work we need a true ducktype for units.
>
>     The example above doesn't expose the ".units" attribute outside the
>     MaskedArray, and it doesn't print the units in the repr. But you can
>     access them using "filled".
>
>     While I could make MaskedArray forward unknown attribute accesses to the
>     encapsulated array, that seems a bit dangerous/bug-prone at first
>     glance, so probably I want to require the user to make a MaskedArray
>     subclass to do so. I've just started playing with that (probably buggy),
>     and Ive attached subclass examples for astropy.unit and unyt, with some
>     example output below.
>
>     Cheers,
>     Allan
>
>
>
>     Example using the attached astropy unit subclass:
>
>         >>> from astropy.units import m, km, s
>         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>         >>> uarr
>         MaskedQ([1., X , 1.], units=km)
>         >>> uarr.units
>         km
>         >>> uarr + (1*m)
>         MaskedQ([1.001, X    , 1.001], units=km)
>         >>> uarr/(1*s)
>         MaskedQ([1., X , 1.], units=km / s)
>         >>> (uarr*(1*m))[1:]
>         MaskedQ([X , 1.], units=km m)
>         >>> np.add.outer(uarr, uarr)
>         MaskedQ([[2., X , 2.],
>                  [X , X , X ],
>                  [2., X , 2.]], units=km)
>         >>> print(uarr)
>         [1. X  1.] km m
>
>     Cheers,
>     Allan
>
>
>     >     > Even if this impossible, I think it is conceptually useful
>     to think
>     >     > about what the masking class should do. My sense is that,
>     e.g., it
>     >     > should not attempt to decide when an operation succeeds or not,
>     >     but just
>     >     > "or together" input masks for regular, multiple-input functions,
>     >     and let
>     >     > the underlying arrays skip elements for reductions by using
>     `where`
>     >     > (hey, I did implement that for a reason... ;-). In
>     particular, it
>     >     > suggests one should not have things like domains and all
>     that (I never
>     >     > understood why `MaskedArray` did that). If one wants more,
>     the class
>     >     > should provide a method that updates the mask (a sensible
>     default
>     >     might
>     >     > be `mask |= ~np.isfinite(result)` - here, the class being masked
>     >     should
>     >     > logically support ufuncs and functions, so it can decide what
>     >     "isfinite"
>     >     > means).
>     >
>     >     I agree it would be nice to remove domains. It would make life
>     easier,
>     >     and I could remove a lot of twiddly code! I kept it in for now to
>     >     minimize the behavior changes from the old MaskedArray.
>     >
>     >
>     > That makes sense. Could be separated out to a backwards-compatibility
>     > class later.
>     >
>     >
>     >     > In any case, I would think that a basic truth should be that
>     >     everything
>     >     > has a mask with a shape consistent with the data, so
>     >     > 1. Each complex numbers has just one mask, and setting
>     `a.imag` with a
>     >     > masked array should definitely propagate the mask.
>     >     > 2. For a masked array with structured dtype, I'd similarly say
>     >     that the
>     >     > default is for a mask to have the same shape as the array.
>     But that
>     >     > something like your collection makes sense for the case
>     where one
>     >     wants
>     >     > to mask items in a structure.
>     >
>     >     Agreed that we should have a single bool per complex or structured
>     >     element, and the mask shape is the same as the array shape.
>     That's how I
>     >     implemented it. But there is still a problem with complex.imag
>     >     assignment:
>     >
>     >         >>> a = MaskedArray([1j, 2, X])
>     >         >>> i = a.imag
>     >         >>> i[:] = MaskedArray([1, X, 1])
>     >
>     >     If we make the last line copy the mask to the original array, what
>     >     should the real part of a[2] be? Conversely, if we don't copy
>     the mask,
>     >     what should the imag part of a[1] be? It seems like we might
>     "want" the
>     >     masks to be OR'd instead, but then should i[2] be masked after
>     we just
>     >     set it to 1?
>     >
>     > Ah, I see the issue now... Easiest to implement and closest in analogy
>     > to a regular view would be to just let it unmask a[2] (with
>     whatever is
>     > in real; user beware!).
>     >
>     > Perhaps better would be to special-case such that `imag` returns a
>     > read-only view of the mask. Making `imag` itself read-only would
>     prevent
>     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     there is
>     > no reason this should update the mask.
>     >
>     > Still, neither is really satisfactory...
>     >  
>     >
>     >
>     >     > p.s. I started trying to implement the above "Mixin" class; will
>     >     try to
>     >     > clean that up a bit so that at least it uses `where` and
>     push it up.
>     >
>     >     I played with "where", but didn't include it since 1.17 is not
>     released.
>     >     To avoid duplication of effort, I've attached a diff of what I
>     tried. I
>     >     actually get a slight slowdown of about 10% by using where...
>     >
>     >
>     > Your implementation is indeed quite similar to what I got in
>     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >
>     > I think the main benefit is not to presume that whatever is underneath
>     > understands 0 or 1, i.e., avoid filling.
>     >
>     >
>     >     If you make progress with the mixin, a push is welcome. I
>     imagine a
>     >     problem is going to be that np.isscalar doesn't work to detect
>     duck
>     >     scalars.
>     >
>     > I fear that in my attempts I've simply decided that only array scalars
>     > exist...
>     >
>     > -- Marten
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Benjamin Root
Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers!
Ben Root


On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]> wrote:
On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
>
> This is very impressive! I could get the tests that I wrote for my class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.


> Anyway, it would seem easily at the point where I should comment on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

Best,
Allan


>
> All the best,
>
> Marten
>
>
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >
>     >
>     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     > <snip>
>     >
>     >     > This may be too much to ask from the initializer, but, if
>     so, it still
>     >     > seems most useful if it is made as easy as possible to do,
>     say, `class
>     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >
>     >     Currently MaskedArray does not accept ducktypes as underlying
>     arrays,
>     >     but I think it shouldn't be too hard to modify it to do so.
>     Good idea!
>     >
>     >
>     > Looking back at my trial, I see that I also never got to duck arrays -
>     > only ndarray subclasses - though I tried to make the code as
>     agnostic as
>     > possible.
>     >
>     > (Trial at
>     >
>     https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >
>     >     I already partly navigated this mixin-issue in the
>     >     "MaskedArrayCollection" class, which essentially does
>     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     lines of
>     >     boilerplate. That's the backwards encapsulation order from
>     what you want
>     >     though.
>     >
>     >
>     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
>     > mask=[True, False, False])` does indeed not have a `.unit` attribute
>     > (and cannot represent itself...); I'm not at all sure that my
>     method of
>     > just creating a mixed class is anything but a recipe for disaster,
>     though!
>
>     Based on your suggestion I worked on this a little today, and now my
>     MaskedArray more easily encapsulates both ducktypes and ndarray
>     subclasses (pushed to repo). Here's an example I got working with masked
>     units using unyt:
>
>     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>
>     [2]: from unyt import m, km
>
>     [3]: import numpy as np
>
>     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>
>     [5]: uarr
>
>     MaskedArray([1., X , 3.])
>     [6]: uarr + 1*m
>
>     MaskedArray([1.001, X    , 3.001])
>     [7]: uarr.filled()
>
>     unyt_array([1., 0., 3.], 'km')
>     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>
>     The catch is the ducktype/subclass has to rigorously follow numpy's
>     indexing rules, including distinguishing 0d arrays from scalars. For now
>     only I used unyt in the example above since it happens to be less strict
>      about dimensionless operations than astropy.units which trips up my
>     repr code. (see below for example with astropy.units). Note in the last
>     line I lost the dimensions, but that is because unyt does not handle
>     np.concatenate. To get that to work we need a true ducktype for units.
>
>     The example above doesn't expose the ".units" attribute outside the
>     MaskedArray, and it doesn't print the units in the repr. But you can
>     access them using "filled".
>
>     While I could make MaskedArray forward unknown attribute accesses to the
>     encapsulated array, that seems a bit dangerous/bug-prone at first
>     glance, so probably I want to require the user to make a MaskedArray
>     subclass to do so. I've just started playing with that (probably buggy),
>     and Ive attached subclass examples for astropy.unit and unyt, with some
>     example output below.
>
>     Cheers,
>     Allan
>
>
>
>     Example using the attached astropy unit subclass:
>
>         >>> from astropy.units import m, km, s
>         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>         >>> uarr
>         MaskedQ([1., X , 1.], units=km)
>         >>> uarr.units
>         km
>         >>> uarr + (1*m)
>         MaskedQ([1.001, X    , 1.001], units=km)
>         >>> uarr/(1*s)
>         MaskedQ([1., X , 1.], units=km / s)
>         >>> (uarr*(1*m))[1:]
>         MaskedQ([X , 1.], units=km m)
>         >>> np.add.outer(uarr, uarr)
>         MaskedQ([[2., X , 2.],
>                  [X , X , X ],
>                  [2., X , 2.]], units=km)
>         >>> print(uarr)
>         [1. X  1.] km m
>
>     Cheers,
>     Allan
>
>
>     >     > Even if this impossible, I think it is conceptually useful
>     to think
>     >     > about what the masking class should do. My sense is that,
>     e.g., it
>     >     > should not attempt to decide when an operation succeeds or not,
>     >     but just
>     >     > "or together" input masks for regular, multiple-input functions,
>     >     and let
>     >     > the underlying arrays skip elements for reductions by using
>     `where`
>     >     > (hey, I did implement that for a reason... ;-). In
>     particular, it
>     >     > suggests one should not have things like domains and all
>     that (I never
>     >     > understood why `MaskedArray` did that). If one wants more,
>     the class
>     >     > should provide a method that updates the mask (a sensible
>     default
>     >     might
>     >     > be `mask |= ~np.isfinite(result)` - here, the class being masked
>     >     should
>     >     > logically support ufuncs and functions, so it can decide what
>     >     "isfinite"
>     >     > means).
>     >
>     >     I agree it would be nice to remove domains. It would make life
>     easier,
>     >     and I could remove a lot of twiddly code! I kept it in for now to
>     >     minimize the behavior changes from the old MaskedArray.
>     >
>     >
>     > That makes sense. Could be separated out to a backwards-compatibility
>     > class later.
>     >
>     >
>     >     > In any case, I would think that a basic truth should be that
>     >     everything
>     >     > has a mask with a shape consistent with the data, so
>     >     > 1. Each complex numbers has just one mask, and setting
>     `a.imag` with a
>     >     > masked array should definitely propagate the mask.
>     >     > 2. For a masked array with structured dtype, I'd similarly say
>     >     that the
>     >     > default is for a mask to have the same shape as the array.
>     But that
>     >     > something like your collection makes sense for the case
>     where one
>     >     wants
>     >     > to mask items in a structure.
>     >
>     >     Agreed that we should have a single bool per complex or structured
>     >     element, and the mask shape is the same as the array shape.
>     That's how I
>     >     implemented it. But there is still a problem with complex.imag
>     >     assignment:
>     >
>     >         >>> a = MaskedArray([1j, 2, X])
>     >         >>> i = a.imag
>     >         >>> i[:] = MaskedArray([1, X, 1])
>     >
>     >     If we make the last line copy the mask to the original array, what
>     >     should the real part of a[2] be? Conversely, if we don't copy
>     the mask,
>     >     what should the imag part of a[1] be? It seems like we might
>     "want" the
>     >     masks to be OR'd instead, but then should i[2] be masked after
>     we just
>     >     set it to 1?
>     >
>     > Ah, I see the issue now... Easiest to implement and closest in analogy
>     > to a regular view would be to just let it unmask a[2] (with
>     whatever is
>     > in real; user beware!).
>     >
>     > Perhaps better would be to special-case such that `imag` returns a
>     > read-only view of the mask. Making `imag` itself read-only would
>     prevent
>     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     there is
>     > no reason this should update the mask.
>     >
>     > Still, neither is really satisfactory...
>     >  
>     >
>     >
>     >     > p.s. I started trying to implement the above "Mixin" class; will
>     >     try to
>     >     > clean that up a bit so that at least it uses `where` and
>     push it up.
>     >
>     >     I played with "where", but didn't include it since 1.17 is not
>     released.
>     >     To avoid duplication of effort, I've attached a diff of what I
>     tried. I
>     >     actually get a slight slowdown of about 10% by using where...
>     >
>     >
>     > Your implementation is indeed quite similar to what I got in
>     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >
>     > I think the main benefit is not to presume that whatever is underneath
>     > understands 0 or 1, i.e., avoid filling.
>     >
>     >
>     >     If you make progress with the mixin, a push is welcome. I
>     imagine a
>     >     problem is going to be that np.isscalar doesn't work to detect
>     duck
>     >     scalars.
>     >
>     > I fear that in my attempts I've simply decided that only array scalars
>     > exist...
>     >
>     > -- Marten
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Allan Haldane
On 6/21/19 2:37 PM, Benjamin Root wrote:
> Just to note, data that is masked isn't always garbage. There are plenty
> of use-cases where one may want to temporarily apply a mask for a set of
> computation, or possibly want to apply a series of different masks to
> the data. I haven't read through this discussion deeply enough, but is
> this new class going to destroy underlying masked data? and will it be
> possible to swap out masks?
>
> Cheers!
> Ben Root

Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few
thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do

     >>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

     >>> marr.mask = trial_mask
     >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.


In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.

Best,
Allan


> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>     > Hi Allan,
>     >
>     > This is very impressive! I could get the tests that I wrote for my
>     class
>     > pass with yours using Quantity with what I would consider very minimal
>     > changes. I only could not find a good way to unmask data (I like the
>     > idea of setting the mask on some elements via `ma[item] = X`); is this
>     > on purpose?
>
>     Yes, I want to make it difficult for the user to access the garbage
>     values under the mask, which are often clobbered values. The only way to
>     "remove" a masked value is by replacing it with a new non-masked value.
>
>
>     > Anyway, it would seem easily at the point where I should comment
>     on your
>     > repository rather than in the mailing list!
>
>     To make further progress on this encapsulation idea I need a more
>     complete ducktype to pass into MaskedArray to test, so that's what I'll
>     work on next, when I have time. I'll either try to finish my
>     ArrayCollection type, or try making a simple NDunit ducktype
>     piggybacking on astropy's Unit.
>
>     Best,
>     Allan
>
>
>     >
>     > All the best,
>     >
>     > Marten
>     >
>     >
>     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     >
>     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >     >
>     >     >
>     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     >     <[hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > <mailto:[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>>>
>     >     wrote:
>     >     > <snip>
>     >     >
>     >     >     > This may be too much to ask from the initializer, but, if
>     >     so, it still
>     >     >     > seems most useful if it is made as easy as possible to do,
>     >     say, `class
>     >     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >     >
>     >     >     Currently MaskedArray does not accept ducktypes as
>     underlying
>     >     arrays,
>     >     >     but I think it shouldn't be too hard to modify it to do so.
>     >     Good idea!
>     >     >
>     >     >
>     >     > Looking back at my trial, I see that I also never got to
>     duck arrays -
>     >     > only ndarray subclasses - though I tried to make the code as
>     >     agnostic as
>     >     > possible.
>     >     >
>     >     > (Trial at
>     >     >
>     >  
>      https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >     >
>     >     >     I already partly navigated this mixin-issue in the
>     >     >     "MaskedArrayCollection" class, which essentially does
>     >     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     >     lines of
>     >     >     boilerplate. That's the backwards encapsulation order from
>     >     what you want
>     >     >     though.
>     >     >
>     >     >
>     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
>     u.m,
>     >     > mask=[True, False, False])` does indeed not have a `.unit`
>     attribute
>     >     > (and cannot represent itself...); I'm not at all sure that my
>     >     method of
>     >     > just creating a mixed class is anything but a recipe for
>     disaster,
>     >     though!
>     >
>     >     Based on your suggestion I worked on this a little today, and
>     now my
>     >     MaskedArray more easily encapsulates both ducktypes and ndarray
>     >     subclasses (pushed to repo). Here's an example I got working
>     with masked
>     >     units using unyt:
>     >
>     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>     >
>     >     [2]: from unyt import m, km
>     >
>     >     [3]: import numpy as np
>     >
>     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>     >
>     >     [5]: uarr
>     >
>     >     MaskedArray([1., X , 3.])
>     >     [6]: uarr + 1*m
>     >
>     >     MaskedArray([1.001, X    , 3.001])
>     >     [7]: uarr.filled()
>     >
>     >     unyt_array([1., 0., 3.], 'km')
>     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>     >
>     >     The catch is the ducktype/subclass has to rigorously follow
>     numpy's
>     >     indexing rules, including distinguishing 0d arrays from
>     scalars. For now
>     >     only I used unyt in the example above since it happens to be
>     less strict
>     >      about dimensionless operations than astropy.units which trips
>     up my
>     >     repr code. (see below for example with astropy.units). Note in
>     the last
>     >     line I lost the dimensions, but that is because unyt does not
>     handle
>     >     np.concatenate. To get that to work we need a true ducktype
>     for units.
>     >
>     >     The example above doesn't expose the ".units" attribute
>     outside the
>     >     MaskedArray, and it doesn't print the units in the repr. But
>     you can
>     >     access them using "filled".
>     >
>     >     While I could make MaskedArray forward unknown attribute
>     accesses to the
>     >     encapsulated array, that seems a bit dangerous/bug-prone at first
>     >     glance, so probably I want to require the user to make a
>     MaskedArray
>     >     subclass to do so. I've just started playing with that
>     (probably buggy),
>     >     and Ive attached subclass examples for astropy.unit and unyt,
>     with some
>     >     example output below.
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >
>     >     Example using the attached astropy unit subclass:
>     >
>     >         >>> from astropy.units import m, km, s
>     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>     >         >>> uarr
>     >         MaskedQ([1., X , 1.], units=km)
>     >         >>> uarr.units
>     >         km
>     >         >>> uarr + (1*m)
>     >         MaskedQ([1.001, X    , 1.001], units=km)
>     >         >>> uarr/(1*s)
>     >         MaskedQ([1., X , 1.], units=km / s)
>     >         >>> (uarr*(1*m))[1:]
>     >         MaskedQ([X , 1.], units=km m)
>     >         >>> np.add.outer(uarr, uarr)
>     >         MaskedQ([[2., X , 2.],
>     >                  [X , X , X ],
>     >                  [2., X , 2.]], units=km)
>     >         >>> print(uarr)
>     >         [1. X  1.] km m
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >     >     > Even if this impossible, I think it is conceptually useful
>     >     to think
>     >     >     > about what the masking class should do. My sense is that,
>     >     e.g., it
>     >     >     > should not attempt to decide when an operation
>     succeeds or not,
>     >     >     but just
>     >     >     > "or together" input masks for regular, multiple-input
>     functions,
>     >     >     and let
>     >     >     > the underlying arrays skip elements for reductions by
>     using
>     >     `where`
>     >     >     > (hey, I did implement that for a reason... ;-). In
>     >     particular, it
>     >     >     > suggests one should not have things like domains and all
>     >     that (I never
>     >     >     > understood why `MaskedArray` did that). If one wants more,
>     >     the class
>     >     >     > should provide a method that updates the mask (a sensible
>     >     default
>     >     >     might
>     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>     being masked
>     >     >     should
>     >     >     > logically support ufuncs and functions, so it can
>     decide what
>     >     >     "isfinite"
>     >     >     > means).
>     >     >
>     >     >     I agree it would be nice to remove domains. It would
>     make life
>     >     easier,
>     >     >     and I could remove a lot of twiddly code! I kept it in
>     for now to
>     >     >     minimize the behavior changes from the old MaskedArray.
>     >     >
>     >     >
>     >     > That makes sense. Could be separated out to a
>     backwards-compatibility
>     >     > class later.
>     >     >
>     >     >
>     >     >     > In any case, I would think that a basic truth should
>     be that
>     >     >     everything
>     >     >     > has a mask with a shape consistent with the data, so
>     >     >     > 1. Each complex numbers has just one mask, and setting
>     >     `a.imag` with a
>     >     >     > masked array should definitely propagate the mask.
>     >     >     > 2. For a masked array with structured dtype, I'd
>     similarly say
>     >     >     that the
>     >     >     > default is for a mask to have the same shape as the array.
>     >     But that
>     >     >     > something like your collection makes sense for the case
>     >     where one
>     >     >     wants
>     >     >     > to mask items in a structure.
>     >     >
>     >     >     Agreed that we should have a single bool per complex or
>     structured
>     >     >     element, and the mask shape is the same as the array shape.
>     >     That's how I
>     >     >     implemented it. But there is still a problem with
>     complex.imag
>     >     >     assignment:
>     >     >
>     >     >         >>> a = MaskedArray([1j, 2, X])
>     >     >         >>> i = a.imag
>     >     >         >>> i[:] = MaskedArray([1, X, 1])
>     >     >
>     >     >     If we make the last line copy the mask to the original
>     array, what
>     >     >     should the real part of a[2] be? Conversely, if we don't
>     copy
>     >     the mask,
>     >     >     what should the imag part of a[1] be? It seems like we might
>     >     "want" the
>     >     >     masks to be OR'd instead, but then should i[2] be masked
>     after
>     >     we just
>     >     >     set it to 1?
>     >     >
>     >     > Ah, I see the issue now... Easiest to implement and closest
>     in analogy
>     >     > to a regular view would be to just let it unmask a[2] (with
>     >     whatever is
>     >     > in real; user beware!).
>     >     >
>     >     > Perhaps better would be to special-case such that `imag`
>     returns a
>     >     > read-only view of the mask. Making `imag` itself read-only would
>     >     prevent
>     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     >     there is
>     >     > no reason this should update the mask.
>     >     >
>     >     > Still, neither is really satisfactory...
>     >     >  
>     >     >
>     >     >
>     >     >     > p.s. I started trying to implement the above "Mixin"
>     class; will
>     >     >     try to
>     >     >     > clean that up a bit so that at least it uses `where` and
>     >     push it up.
>     >     >
>     >     >     I played with "where", but didn't include it since 1.17
>     is not
>     >     released.
>     >     >     To avoid duplication of effort, I've attached a diff of
>     what I
>     >     tried. I
>     >     >     actually get a slight slowdown of about 10% by using
>     where...
>     >     >
>     >     >
>     >     > Your implementation is indeed quite similar to what I got in
>     >     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >     >
>     >     > I think the main benefit is not to presume that whatever is
>     underneath
>     >     > understands 0 or 1, i.e., avoid filling.
>     >     >
>     >     >
>     >     >     If you make progress with the mixin, a push is welcome. I
>     >     imagine a
>     >     >     problem is going to be that np.isscalar doesn't work to
>     detect
>     >     duck
>     >     >     scalars.
>     >     >
>     >     > I fear that in my attempts I've simply decided that only
>     array scalars
>     >     > exist...
>     >     >
>     >     > -- Marten
>     >     >
>     >     > _______________________________________________
>     >     > NumPy-Discussion mailing list
>     >     > [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >     >
>     >
>     >     _______________________________________________
>     >     NumPy-Discussion mailing list
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk
Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <[hidden email]> wrote:
On 6/21/19 2:37 PM, Benjamin Root wrote:
> Just to note, data that is masked isn't always garbage. There are plenty
> of use-cases where one may want to temporarily apply a mask for a set of
> computation, or possibly want to apply a series of different masks to
> the data. I haven't read through this discussion deeply enough, but is
> this new class going to destroy underlying masked data? and will it be
> possible to swap out masks?
>
> Cheers!
> Ben Root

Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few
thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do

     >>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

     >>> marr.mask = trial_mask
     >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.


In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.

Best,
Allan


> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>     > Hi Allan,
>     >
>     > This is very impressive! I could get the tests that I wrote for my
>     class
>     > pass with yours using Quantity with what I would consider very minimal
>     > changes. I only could not find a good way to unmask data (I like the
>     > idea of setting the mask on some elements via `ma[item] = X`); is this
>     > on purpose?
>
>     Yes, I want to make it difficult for the user to access the garbage
>     values under the mask, which are often clobbered values. The only way to
>     "remove" a masked value is by replacing it with a new non-masked value.
>
>
>     > Anyway, it would seem easily at the point where I should comment
>     on your
>     > repository rather than in the mailing list!
>
>     To make further progress on this encapsulation idea I need a more
>     complete ducktype to pass into MaskedArray to test, so that's what I'll
>     work on next, when I have time. I'll either try to finish my
>     ArrayCollection type, or try making a simple NDunit ducktype
>     piggybacking on astropy's Unit.
>
>     Best,
>     Allan
>
>
>     >
>     > All the best,
>     >
>     > Marten
>     >
>     >
>     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     >
>     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >     >
>     >     >
>     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     >     <[hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > <mailto:[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>>>
>     >     wrote:
>     >     > <snip>
>     >     >
>     >     >     > This may be too much to ask from the initializer, but, if
>     >     so, it still
>     >     >     > seems most useful if it is made as easy as possible to do,
>     >     say, `class
>     >     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >     >
>     >     >     Currently MaskedArray does not accept ducktypes as
>     underlying
>     >     arrays,
>     >     >     but I think it shouldn't be too hard to modify it to do so.
>     >     Good idea!
>     >     >
>     >     >
>     >     > Looking back at my trial, I see that I also never got to
>     duck arrays -
>     >     > only ndarray subclasses - though I tried to make the code as
>     >     agnostic as
>     >     > possible.
>     >     >
>     >     > (Trial at
>     >     >
>     >   
>      https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >     >
>     >     >     I already partly navigated this mixin-issue in the
>     >     >     "MaskedArrayCollection" class, which essentially does
>     >     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     >     lines of
>     >     >     boilerplate. That's the backwards encapsulation order from
>     >     what you want
>     >     >     though.
>     >     >
>     >     >
>     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
>     u.m,
>     >     > mask=[True, False, False])` does indeed not have a `.unit`
>     attribute
>     >     > (and cannot represent itself...); I'm not at all sure that my
>     >     method of
>     >     > just creating a mixed class is anything but a recipe for
>     disaster,
>     >     though!
>     >
>     >     Based on your suggestion I worked on this a little today, and
>     now my
>     >     MaskedArray more easily encapsulates both ducktypes and ndarray
>     >     subclasses (pushed to repo). Here's an example I got working
>     with masked
>     >     units using unyt:
>     >
>     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>     >
>     >     [2]: from unyt import m, km
>     >
>     >     [3]: import numpy as np
>     >
>     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>     >
>     >     [5]: uarr
>     >
>     >     MaskedArray([1., X , 3.])
>     >     [6]: uarr + 1*m
>     >
>     >     MaskedArray([1.001, X    , 3.001])
>     >     [7]: uarr.filled()
>     >
>     >     unyt_array([1., 0., 3.], 'km')
>     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>     >
>     >     The catch is the ducktype/subclass has to rigorously follow
>     numpy's
>     >     indexing rules, including distinguishing 0d arrays from
>     scalars. For now
>     >     only I used unyt in the example above since it happens to be
>     less strict
>     >      about dimensionless operations than astropy.units which trips
>     up my
>     >     repr code. (see below for example with astropy.units). Note in
>     the last
>     >     line I lost the dimensions, but that is because unyt does not
>     handle
>     >     np.concatenate. To get that to work we need a true ducktype
>     for units.
>     >
>     >     The example above doesn't expose the ".units" attribute
>     outside the
>     >     MaskedArray, and it doesn't print the units in the repr. But
>     you can
>     >     access them using "filled".
>     >
>     >     While I could make MaskedArray forward unknown attribute
>     accesses to the
>     >     encapsulated array, that seems a bit dangerous/bug-prone at first
>     >     glance, so probably I want to require the user to make a
>     MaskedArray
>     >     subclass to do so. I've just started playing with that
>     (probably buggy),
>     >     and Ive attached subclass examples for astropy.unit and unyt,
>     with some
>     >     example output below.
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >
>     >     Example using the attached astropy unit subclass:
>     >
>     >         >>> from astropy.units import m, km, s
>     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>     >         >>> uarr
>     >         MaskedQ([1., X , 1.], units=km)
>     >         >>> uarr.units
>     >         km
>     >         >>> uarr + (1*m)
>     >         MaskedQ([1.001, X    , 1.001], units=km)
>     >         >>> uarr/(1*s)
>     >         MaskedQ([1., X , 1.], units=km / s)
>     >         >>> (uarr*(1*m))[1:]
>     >         MaskedQ([X , 1.], units=km m)
>     >         >>> np.add.outer(uarr, uarr)
>     >         MaskedQ([[2., X , 2.],
>     >                  [X , X , X ],
>     >                  [2., X , 2.]], units=km)
>     >         >>> print(uarr)
>     >         [1. X  1.] km m
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >     >     > Even if this impossible, I think it is conceptually useful
>     >     to think
>     >     >     > about what the masking class should do. My sense is that,
>     >     e.g., it
>     >     >     > should not attempt to decide when an operation
>     succeeds or not,
>     >     >     but just
>     >     >     > "or together" input masks for regular, multiple-input
>     functions,
>     >     >     and let
>     >     >     > the underlying arrays skip elements for reductions by
>     using
>     >     `where`
>     >     >     > (hey, I did implement that for a reason... ;-). In
>     >     particular, it
>     >     >     > suggests one should not have things like domains and all
>     >     that (I never
>     >     >     > understood why `MaskedArray` did that). If one wants more,
>     >     the class
>     >     >     > should provide a method that updates the mask (a sensible
>     >     default
>     >     >     might
>     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>     being masked
>     >     >     should
>     >     >     > logically support ufuncs and functions, so it can
>     decide what
>     >     >     "isfinite"
>     >     >     > means).
>     >     >
>     >     >     I agree it would be nice to remove domains. It would
>     make life
>     >     easier,
>     >     >     and I could remove a lot of twiddly code! I kept it in
>     for now to
>     >     >     minimize the behavior changes from the old MaskedArray.
>     >     >
>     >     >
>     >     > That makes sense. Could be separated out to a
>     backwards-compatibility
>     >     > class later.
>     >     >
>     >     >
>     >     >     > In any case, I would think that a basic truth should
>     be that
>     >     >     everything
>     >     >     > has a mask with a shape consistent with the data, so
>     >     >     > 1. Each complex numbers has just one mask, and setting
>     >     `a.imag` with a
>     >     >     > masked array should definitely propagate the mask.
>     >     >     > 2. For a masked array with structured dtype, I'd
>     similarly say
>     >     >     that the
>     >     >     > default is for a mask to have the same shape as the array.
>     >     But that
>     >     >     > something like your collection makes sense for the case
>     >     where one
>     >     >     wants
>     >     >     > to mask items in a structure.
>     >     >
>     >     >     Agreed that we should have a single bool per complex or
>     structured
>     >     >     element, and the mask shape is the same as the array shape.
>     >     That's how I
>     >     >     implemented it. But there is still a problem with
>     complex.imag
>     >     >     assignment:
>     >     >
>     >     >         >>> a = MaskedArray([1j, 2, X])
>     >     >         >>> i = a.imag
>     >     >         >>> i[:] = MaskedArray([1, X, 1])
>     >     >
>     >     >     If we make the last line copy the mask to the original
>     array, what
>     >     >     should the real part of a[2] be? Conversely, if we don't
>     copy
>     >     the mask,
>     >     >     what should the imag part of a[1] be? It seems like we might
>     >     "want" the
>     >     >     masks to be OR'd instead, but then should i[2] be masked
>     after
>     >     we just
>     >     >     set it to 1?
>     >     >
>     >     > Ah, I see the issue now... Easiest to implement and closest
>     in analogy
>     >     > to a regular view would be to just let it unmask a[2] (with
>     >     whatever is
>     >     > in real; user beware!).
>     >     >
>     >     > Perhaps better would be to special-case such that `imag`
>     returns a
>     >     > read-only view of the mask. Making `imag` itself read-only would
>     >     prevent
>     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     >     there is
>     >     > no reason this should update the mask.
>     >     >
>     >     > Still, neither is really satisfactory...
>     >     > 
>     >     >
>     >     >
>     >     >     > p.s. I started trying to implement the above "Mixin"
>     class; will
>     >     >     try to
>     >     >     > clean that up a bit so that at least it uses `where` and
>     >     push it up.
>     >     >
>     >     >     I played with "where", but didn't include it since 1.17
>     is not
>     >     released.
>     >     >     To avoid duplication of effort, I've attached a diff of
>     what I
>     >     tried. I
>     >     >     actually get a slight slowdown of about 10% by using
>     where...
>     >     >
>     >     >
>     >     > Your implementation is indeed quite similar to what I got in
>     >     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >     >
>     >     > I think the main benefit is not to presume that whatever is
>     underneath
>     >     > understands 0 or 1, i.e., avoid filling.
>     >     >
>     >     >
>     >     >     If you make progress with the mixin, a push is welcome. I
>     >     imagine a
>     >     >     problem is going to be that np.isscalar doesn't work to
>     detect
>     >     duck
>     >     >     scalars.
>     >     >
>     >     > I fear that in my attempts I've simply decided that only
>     array scalars
>     >     > exist...
>     >     >
>     >     > -- Marten
>     >     >
>     >     > _______________________________________________
>     >     > NumPy-Discussion mailing list
>     >     > [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >     >
>     >
>     >     _______________________________________________
>     >     NumPy-Discussion mailing list
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Benjamin Root
"""Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c"""

My use-cases don't involve changing the mask of "c". It would involve changing the mask of "a" or "b" after I have calculated "c", so that I could calculate "d". As a fairly simple example, I frequently work with satellite data. We have multiple masks, such as water, vegetation, sandy loam, bare rock, etc. The underlying satellite data in any of these places isn't bad, they just need to be dealt with differently.  I wouldn't want the act of applying a mask for a set of calculations on things that aren't bare rock to mess up my subsequent calculation on things that aren't water. Right now, I have to handle this explicitly with flattened sparse arrays, which makes visualization and conception difficult.

Ben Root

On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk <[hidden email]> wrote:
Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <[hidden email]> wrote:
On 6/21/19 2:37 PM, Benjamin Root wrote:
> Just to note, data that is masked isn't always garbage. There are plenty
> of use-cases where one may want to temporarily apply a mask for a set of
> computation, or possibly want to apply a series of different masks to
> the data. I haven't read through this discussion deeply enough, but is
> this new class going to destroy underlying masked data? and will it be
> possible to swap out masks?
>
> Cheers!
> Ben Root

Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few
thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do

     >>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

     >>> marr.mask = trial_mask
     >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.


In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.

Best,
Allan


> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>     > Hi Allan,
>     >
>     > This is very impressive! I could get the tests that I wrote for my
>     class
>     > pass with yours using Quantity with what I would consider very minimal
>     > changes. I only could not find a good way to unmask data (I like the
>     > idea of setting the mask on some elements via `ma[item] = X`); is this
>     > on purpose?
>
>     Yes, I want to make it difficult for the user to access the garbage
>     values under the mask, which are often clobbered values. The only way to
>     "remove" a masked value is by replacing it with a new non-masked value.
>
>
>     > Anyway, it would seem easily at the point where I should comment
>     on your
>     > repository rather than in the mailing list!
>
>     To make further progress on this encapsulation idea I need a more
>     complete ducktype to pass into MaskedArray to test, so that's what I'll
>     work on next, when I have time. I'll either try to finish my
>     ArrayCollection type, or try making a simple NDunit ducktype
>     piggybacking on astropy's Unit.
>
>     Best,
>     Allan
>
>
>     >
>     > All the best,
>     >
>     > Marten
>     >
>     >
>     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     >
>     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >     >
>     >     >
>     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     >     <[hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > <mailto:[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>>>
>     >     wrote:
>     >     > <snip>
>     >     >
>     >     >     > This may be too much to ask from the initializer, but, if
>     >     so, it still
>     >     >     > seems most useful if it is made as easy as possible to do,
>     >     say, `class
>     >     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >     >
>     >     >     Currently MaskedArray does not accept ducktypes as
>     underlying
>     >     arrays,
>     >     >     but I think it shouldn't be too hard to modify it to do so.
>     >     Good idea!
>     >     >
>     >     >
>     >     > Looking back at my trial, I see that I also never got to
>     duck arrays -
>     >     > only ndarray subclasses - though I tried to make the code as
>     >     agnostic as
>     >     > possible.
>     >     >
>     >     > (Trial at
>     >     >
>     >   
>      https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >     >
>     >     >     I already partly navigated this mixin-issue in the
>     >     >     "MaskedArrayCollection" class, which essentially does
>     >     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     >     lines of
>     >     >     boilerplate. That's the backwards encapsulation order from
>     >     what you want
>     >     >     though.
>     >     >
>     >     >
>     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
>     u.m,
>     >     > mask=[True, False, False])` does indeed not have a `.unit`
>     attribute
>     >     > (and cannot represent itself...); I'm not at all sure that my
>     >     method of
>     >     > just creating a mixed class is anything but a recipe for
>     disaster,
>     >     though!
>     >
>     >     Based on your suggestion I worked on this a little today, and
>     now my
>     >     MaskedArray more easily encapsulates both ducktypes and ndarray
>     >     subclasses (pushed to repo). Here's an example I got working
>     with masked
>     >     units using unyt:
>     >
>     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>     >
>     >     [2]: from unyt import m, km
>     >
>     >     [3]: import numpy as np
>     >
>     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>     >
>     >     [5]: uarr
>     >
>     >     MaskedArray([1., X , 3.])
>     >     [6]: uarr + 1*m
>     >
>     >     MaskedArray([1.001, X    , 3.001])
>     >     [7]: uarr.filled()
>     >
>     >     unyt_array([1., 0., 3.], 'km')
>     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>     >
>     >     The catch is the ducktype/subclass has to rigorously follow
>     numpy's
>     >     indexing rules, including distinguishing 0d arrays from
>     scalars. For now
>     >     only I used unyt in the example above since it happens to be
>     less strict
>     >      about dimensionless operations than astropy.units which trips
>     up my
>     >     repr code. (see below for example with astropy.units). Note in
>     the last
>     >     line I lost the dimensions, but that is because unyt does not
>     handle
>     >     np.concatenate. To get that to work we need a true ducktype
>     for units.
>     >
>     >     The example above doesn't expose the ".units" attribute
>     outside the
>     >     MaskedArray, and it doesn't print the units in the repr. But
>     you can
>     >     access them using "filled".
>     >
>     >     While I could make MaskedArray forward unknown attribute
>     accesses to the
>     >     encapsulated array, that seems a bit dangerous/bug-prone at first
>     >     glance, so probably I want to require the user to make a
>     MaskedArray
>     >     subclass to do so. I've just started playing with that
>     (probably buggy),
>     >     and Ive attached subclass examples for astropy.unit and unyt,
>     with some
>     >     example output below.
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >
>     >     Example using the attached astropy unit subclass:
>     >
>     >         >>> from astropy.units import m, km, s
>     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>     >         >>> uarr
>     >         MaskedQ([1., X , 1.], units=km)
>     >         >>> uarr.units
>     >         km
>     >         >>> uarr + (1*m)
>     >         MaskedQ([1.001, X    , 1.001], units=km)
>     >         >>> uarr/(1*s)
>     >         MaskedQ([1., X , 1.], units=km / s)
>     >         >>> (uarr*(1*m))[1:]
>     >         MaskedQ([X , 1.], units=km m)
>     >         >>> np.add.outer(uarr, uarr)
>     >         MaskedQ([[2., X , 2.],
>     >                  [X , X , X ],
>     >                  [2., X , 2.]], units=km)
>     >         >>> print(uarr)
>     >         [1. X  1.] km m
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >     >     > Even if this impossible, I think it is conceptually useful
>     >     to think
>     >     >     > about what the masking class should do. My sense is that,
>     >     e.g., it
>     >     >     > should not attempt to decide when an operation
>     succeeds or not,
>     >     >     but just
>     >     >     > "or together" input masks for regular, multiple-input
>     functions,
>     >     >     and let
>     >     >     > the underlying arrays skip elements for reductions by
>     using
>     >     `where`
>     >     >     > (hey, I did implement that for a reason... ;-). In
>     >     particular, it
>     >     >     > suggests one should not have things like domains and all
>     >     that (I never
>     >     >     > understood why `MaskedArray` did that). If one wants more,
>     >     the class
>     >     >     > should provide a method that updates the mask (a sensible
>     >     default
>     >     >     might
>     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>     being masked
>     >     >     should
>     >     >     > logically support ufuncs and functions, so it can
>     decide what
>     >     >     "isfinite"
>     >     >     > means).
>     >     >
>     >     >     I agree it would be nice to remove domains. It would
>     make life
>     >     easier,
>     >     >     and I could remove a lot of twiddly code! I kept it in
>     for now to
>     >     >     minimize the behavior changes from the old MaskedArray.
>     >     >
>     >     >
>     >     > That makes sense. Could be separated out to a
>     backwards-compatibility
>     >     > class later.
>     >     >
>     >     >
>     >     >     > In any case, I would think that a basic truth should
>     be that
>     >     >     everything
>     >     >     > has a mask with a shape consistent with the data, so
>     >     >     > 1. Each complex numbers has just one mask, and setting
>     >     `a.imag` with a
>     >     >     > masked array should definitely propagate the mask.
>     >     >     > 2. For a masked array with structured dtype, I'd
>     similarly say
>     >     >     that the
>     >     >     > default is for a mask to have the same shape as the array.
>     >     But that
>     >     >     > something like your collection makes sense for the case
>     >     where one
>     >     >     wants
>     >     >     > to mask items in a structure.
>     >     >
>     >     >     Agreed that we should have a single bool per complex or
>     structured
>     >     >     element, and the mask shape is the same as the array shape.
>     >     That's how I
>     >     >     implemented it. But there is still a problem with
>     complex.imag
>     >     >     assignment:
>     >     >
>     >     >         >>> a = MaskedArray([1j, 2, X])
>     >     >         >>> i = a.imag
>     >     >         >>> i[:] = MaskedArray([1, X, 1])
>     >     >
>     >     >     If we make the last line copy the mask to the original
>     array, what
>     >     >     should the real part of a[2] be? Conversely, if we don't
>     copy
>     >     the mask,
>     >     >     what should the imag part of a[1] be? It seems like we might
>     >     "want" the
>     >     >     masks to be OR'd instead, but then should i[2] be masked
>     after
>     >     we just
>     >     >     set it to 1?
>     >     >
>     >     > Ah, I see the issue now... Easiest to implement and closest
>     in analogy
>     >     > to a regular view would be to just let it unmask a[2] (with
>     >     whatever is
>     >     > in real; user beware!).
>     >     >
>     >     > Perhaps better would be to special-case such that `imag`
>     returns a
>     >     > read-only view of the mask. Making `imag` itself read-only would
>     >     prevent
>     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     >     there is
>     >     > no reason this should update the mask.
>     >     >
>     >     > Still, neither is really satisfactory...
>     >     > 
>     >     >
>     >     >
>     >     >     > p.s. I started trying to implement the above "Mixin"
>     class; will
>     >     >     try to
>     >     >     > clean that up a bit so that at least it uses `where` and
>     >     push it up.
>     >     >
>     >     >     I played with "where", but didn't include it since 1.17
>     is not
>     >     released.
>     >     >     To avoid duplication of effort, I've attached a diff of
>     what I
>     >     tried. I
>     >     >     actually get a slight slowdown of about 10% by using
>     where...
>     >     >
>     >     >
>     >     > Your implementation is indeed quite similar to what I got in
>     >     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >     >
>     >     > I think the main benefit is not to presume that whatever is
>     underneath
>     >     > understands 0 or 1, i.e., avoid filling.
>     >     >
>     >     >
>     >     >     If you make progress with the mixin, a push is welcome. I
>     >     imagine a
>     >     >     problem is going to be that np.isscalar doesn't work to
>     detect
>     >     duck
>     >     >     scalars.
>     >     >
>     >     > I fear that in my attempts I've simply decided that only
>     array scalars
>     >     > exist...
>     >     >
>     >     > -- Marten
>     >     >
>     >     > _______________________________________________
>     >     > NumPy-Discussion mailing list
>     >     > [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >     >
>     >
>     >     _______________________________________________
>     >     NumPy-Discussion mailing list
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Stephan Hoyer-2
In reply to this post by Allan Haldane
On Thu, Jun 20, 2019 at 7:44 PM Allan Haldane <[hidden email]> wrote:
On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
>
> This is very impressive! I could get the tests that I wrote for my class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.

I think we should make it possible to access (and even mutate) data under the mask directly, while noting the lack of any guarantees about what those values are.

MaskedArray has a minimal and transparent data model, consisting of data and mask arrays. There are plenty of use cases where it is convenient to access the underlying arrays directly, e.g., for efficient implementation of low-level MaskedArray algorithms.

NumPy itself does a similar thing on ndarray by exposing data/strides.  Advanced users who learn the details of the data model find them useful, and everyone else ignores them.



> Anyway, it would seem easily at the point where I should comment on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

dask.array would be another good example to try. I think it already should support __array_function__ (and if not it should be very easy to add).


Best,
Allan


>
> All the best,
>
> Marten
>
>
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >
>     >
>     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     > <snip>
>     >
>     >     > This may be too much to ask from the initializer, but, if
>     so, it still
>     >     > seems most useful if it is made as easy as possible to do,
>     say, `class
>     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >
>     >     Currently MaskedArray does not accept ducktypes as underlying
>     arrays,
>     >     but I think it shouldn't be too hard to modify it to do so.
>     Good idea!
>     >
>     >
>     > Looking back at my trial, I see that I also never got to duck arrays -
>     > only ndarray subclasses - though I tried to make the code as
>     agnostic as
>     > possible.
>     >
>     > (Trial at
>     >
>     https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >
>     >     I already partly navigated this mixin-issue in the
>     >     "MaskedArrayCollection" class, which essentially does
>     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     lines of
>     >     boilerplate. That's the backwards encapsulation order from
>     what you want
>     >     though.
>     >
>     >
>     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
>     > mask=[True, False, False])` does indeed not have a `.unit` attribute
>     > (and cannot represent itself...); I'm not at all sure that my
>     method of
>     > just creating a mixed class is anything but a recipe for disaster,
>     though!
>
>     Based on your suggestion I worked on this a little today, and now my
>     MaskedArray more easily encapsulates both ducktypes and ndarray
>     subclasses (pushed to repo). Here's an example I got working with masked
>     units using unyt:
>
>     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>
>     [2]: from unyt import m, km
>
>     [3]: import numpy as np
>
>     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>
>     [5]: uarr
>
>     MaskedArray([1., X , 3.])
>     [6]: uarr + 1*m
>
>     MaskedArray([1.001, X    , 3.001])
>     [7]: uarr.filled()
>
>     unyt_array([1., 0., 3.], 'km')
>     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>
>     The catch is the ducktype/subclass has to rigorously follow numpy's
>     indexing rules, including distinguishing 0d arrays from scalars. For now
>     only I used unyt in the example above since it happens to be less strict
>      about dimensionless operations than astropy.units which trips up my
>     repr code. (see below for example with astropy.units). Note in the last
>     line I lost the dimensions, but that is because unyt does not handle
>     np.concatenate. To get that to work we need a true ducktype for units.
>
>     The example above doesn't expose the ".units" attribute outside the
>     MaskedArray, and it doesn't print the units in the repr. But you can
>     access them using "filled".
>
>     While I could make MaskedArray forward unknown attribute accesses to the
>     encapsulated array, that seems a bit dangerous/bug-prone at first
>     glance, so probably I want to require the user to make a MaskedArray
>     subclass to do so. I've just started playing with that (probably buggy),
>     and Ive attached subclass examples for astropy.unit and unyt, with some
>     example output below.
>
>     Cheers,
>     Allan
>
>
>
>     Example using the attached astropy unit subclass:
>
>         >>> from astropy.units import m, km, s
>         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>         >>> uarr
>         MaskedQ([1., X , 1.], units=km)
>         >>> uarr.units
>         km
>         >>> uarr + (1*m)
>         MaskedQ([1.001, X    , 1.001], units=km)
>         >>> uarr/(1*s)
>         MaskedQ([1., X , 1.], units=km / s)
>         >>> (uarr*(1*m))[1:]
>         MaskedQ([X , 1.], units=km m)
>         >>> np.add.outer(uarr, uarr)
>         MaskedQ([[2., X , 2.],
>                  [X , X , X ],
>                  [2., X , 2.]], units=km)
>         >>> print(uarr)
>         [1. X  1.] km m
>
>     Cheers,
>     Allan
>
>
>     >     > Even if this impossible, I think it is conceptually useful
>     to think
>     >     > about what the masking class should do. My sense is that,
>     e.g., it
>     >     > should not attempt to decide when an operation succeeds or not,
>     >     but just
>     >     > "or together" input masks for regular, multiple-input functions,
>     >     and let
>     >     > the underlying arrays skip elements for reductions by using
>     `where`
>     >     > (hey, I did implement that for a reason... ;-). In
>     particular, it
>     >     > suggests one should not have things like domains and all
>     that (I never
>     >     > understood why `MaskedArray` did that). If one wants more,
>     the class
>     >     > should provide a method that updates the mask (a sensible
>     default
>     >     might
>     >     > be `mask |= ~np.isfinite(result)` - here, the class being masked
>     >     should
>     >     > logically support ufuncs and functions, so it can decide what
>     >     "isfinite"
>     >     > means).
>     >
>     >     I agree it would be nice to remove domains. It would make life
>     easier,
>     >     and I could remove a lot of twiddly code! I kept it in for now to
>     >     minimize the behavior changes from the old MaskedArray.
>     >
>     >
>     > That makes sense. Could be separated out to a backwards-compatibility
>     > class later.
>     >
>     >
>     >     > In any case, I would think that a basic truth should be that
>     >     everything
>     >     > has a mask with a shape consistent with the data, so
>     >     > 1. Each complex numbers has just one mask, and setting
>     `a.imag` with a
>     >     > masked array should definitely propagate the mask.
>     >     > 2. For a masked array with structured dtype, I'd similarly say
>     >     that the
>     >     > default is for a mask to have the same shape as the array.
>     But that
>     >     > something like your collection makes sense for the case
>     where one
>     >     wants
>     >     > to mask items in a structure.
>     >
>     >     Agreed that we should have a single bool per complex or structured
>     >     element, and the mask shape is the same as the array shape.
>     That's how I
>     >     implemented it. But there is still a problem with complex.imag
>     >     assignment:
>     >
>     >         >>> a = MaskedArray([1j, 2, X])
>     >         >>> i = a.imag
>     >         >>> i[:] = MaskedArray([1, X, 1])
>     >
>     >     If we make the last line copy the mask to the original array, what
>     >     should the real part of a[2] be? Conversely, if we don't copy
>     the mask,
>     >     what should the imag part of a[1] be? It seems like we might
>     "want" the
>     >     masks to be OR'd instead, but then should i[2] be masked after
>     we just
>     >     set it to 1?
>     >
>     > Ah, I see the issue now... Easiest to implement and closest in analogy
>     > to a regular view would be to just let it unmask a[2] (with
>     whatever is
>     > in real; user beware!).
>     >
>     > Perhaps better would be to special-case such that `imag` returns a
>     > read-only view of the mask. Making `imag` itself read-only would
>     prevent
>     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     there is
>     > no reason this should update the mask.
>     >
>     > Still, neither is really satisfactory...
>     >  
>     >
>     >
>     >     > p.s. I started trying to implement the above "Mixin" class; will
>     >     try to
>     >     > clean that up a bit so that at least it uses `where` and
>     push it up.
>     >
>     >     I played with "where", but didn't include it since 1.17 is not
>     released.
>     >     To avoid duplication of effort, I've attached a diff of what I
>     tried. I
>     >     actually get a slight slowdown of about 10% by using where...
>     >
>     >
>     > Your implementation is indeed quite similar to what I got in
>     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >
>     > I think the main benefit is not to presume that whatever is underneath
>     > understands 0 or 1, i.e., avoid filling.
>     >
>     >
>     >     If you make progress with the mixin, a push is welcome. I
>     imagine a
>     >     problem is going to be that np.isscalar doesn't work to detect
>     duck
>     >     scalars.
>     >
>     > I fear that in my attempts I've simply decided that only array scalars
>     > exist...
>     >
>     > -- Marten
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Stephan Hoyer-2
In reply to this post by Marten van Kerkwijk


On Sat, Jun 22, 2019 at 6:50 PM Marten van Kerkwijk <[hidden email]> wrote:
Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

+1, this sounds like the right model to me.

That said, I would still not guarantee values under the mask as part of NumPy's API. The result of computations under the mask should be considered an undefined implementation detail, sort of like integer overflow or dict iteration order pre-Python 3.7. The values may even be entirely arbitrary, e.g., in cases where the result is preallocated with empty().

I'm less confident about the right way to handle missing elements in reductions. For example:
- Should median() also skip missing elements, even though there is no identity element?
- If reductions/aggregations default to skipping missing elements, how is it be possible to express "NA propagating" versions, which are also useful, if slightly less common?

We may want to add a standard "skipna" argument on NumPy aggregations, solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <[hidden email]> wrote:
On 6/21/19 2:37 PM, Benjamin Root wrote:
> Just to note, data that is masked isn't always garbage. There are plenty
> of use-cases where one may want to temporarily apply a mask for a set of
> computation, or possibly want to apply a series of different masks to
> the data. I haven't read through this discussion deeply enough, but is
> this new class going to destroy underlying masked data? and will it be
> possible to swap out masks?
>
> Cheers!
> Ben Root

Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few
thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do

     >>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

     >>> marr.mask = trial_mask
     >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.


In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.

Best,
Allan


> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>     > Hi Allan,
>     >
>     > This is very impressive! I could get the tests that I wrote for my
>     class
>     > pass with yours using Quantity with what I would consider very minimal
>     > changes. I only could not find a good way to unmask data (I like the
>     > idea of setting the mask on some elements via `ma[item] = X`); is this
>     > on purpose?
>
>     Yes, I want to make it difficult for the user to access the garbage
>     values under the mask, which are often clobbered values. The only way to
>     "remove" a masked value is by replacing it with a new non-masked value.
>
>
>     > Anyway, it would seem easily at the point where I should comment
>     on your
>     > repository rather than in the mailing list!
>
>     To make further progress on this encapsulation idea I need a more
>     complete ducktype to pass into MaskedArray to test, so that's what I'll
>     work on next, when I have time. I'll either try to finish my
>     ArrayCollection type, or try making a simple NDunit ducktype
>     piggybacking on astropy's Unit.
>
>     Best,
>     Allan
>
>
>     >
>     > All the best,
>     >
>     > Marten
>     >
>     >
>     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     >
>     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >     >
>     >     >
>     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     >     <[hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > <mailto:[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>>>
>     >     wrote:
>     >     > <snip>
>     >     >
>     >     >     > This may be too much to ask from the initializer, but, if
>     >     so, it still
>     >     >     > seems most useful if it is made as easy as possible to do,
>     >     say, `class
>     >     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >     >
>     >     >     Currently MaskedArray does not accept ducktypes as
>     underlying
>     >     arrays,
>     >     >     but I think it shouldn't be too hard to modify it to do so.
>     >     Good idea!
>     >     >
>     >     >
>     >     > Looking back at my trial, I see that I also never got to
>     duck arrays -
>     >     > only ndarray subclasses - though I tried to make the code as
>     >     agnostic as
>     >     > possible.
>     >     >
>     >     > (Trial at
>     >     >
>     >   
>      https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >     >
>     >     >     I already partly navigated this mixin-issue in the
>     >     >     "MaskedArrayCollection" class, which essentially does
>     >     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     >     lines of
>     >     >     boilerplate. That's the backwards encapsulation order from
>     >     what you want
>     >     >     though.
>     >     >
>     >     >
>     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
>     u.m,
>     >     > mask=[True, False, False])` does indeed not have a `.unit`
>     attribute
>     >     > (and cannot represent itself...); I'm not at all sure that my
>     >     method of
>     >     > just creating a mixed class is anything but a recipe for
>     disaster,
>     >     though!
>     >
>     >     Based on your suggestion I worked on this a little today, and
>     now my
>     >     MaskedArray more easily encapsulates both ducktypes and ndarray
>     >     subclasses (pushed to repo). Here's an example I got working
>     with masked
>     >     units using unyt:
>     >
>     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>     >
>     >     [2]: from unyt import m, km
>     >
>     >     [3]: import numpy as np
>     >
>     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>     >
>     >     [5]: uarr
>     >
>     >     MaskedArray([1., X , 3.])
>     >     [6]: uarr + 1*m
>     >
>     >     MaskedArray([1.001, X    , 3.001])
>     >     [7]: uarr.filled()
>     >
>     >     unyt_array([1., 0., 3.], 'km')
>     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>     >
>     >     The catch is the ducktype/subclass has to rigorously follow
>     numpy's
>     >     indexing rules, including distinguishing 0d arrays from
>     scalars. For now
>     >     only I used unyt in the example above since it happens to be
>     less strict
>     >      about dimensionless operations than astropy.units which trips
>     up my
>     >     repr code. (see below for example with astropy.units). Note in
>     the last
>     >     line I lost the dimensions, but that is because unyt does not
>     handle
>     >     np.concatenate. To get that to work we need a true ducktype
>     for units.
>     >
>     >     The example above doesn't expose the ".units" attribute
>     outside the
>     >     MaskedArray, and it doesn't print the units in the repr. But
>     you can
>     >     access them using "filled".
>     >
>     >     While I could make MaskedArray forward unknown attribute
>     accesses to the
>     >     encapsulated array, that seems a bit dangerous/bug-prone at first
>     >     glance, so probably I want to require the user to make a
>     MaskedArray
>     >     subclass to do so. I've just started playing with that
>     (probably buggy),
>     >     and Ive attached subclass examples for astropy.unit and unyt,
>     with some
>     >     example output below.
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >
>     >     Example using the attached astropy unit subclass:
>     >
>     >         >>> from astropy.units import m, km, s
>     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>     >         >>> uarr
>     >         MaskedQ([1., X , 1.], units=km)
>     >         >>> uarr.units
>     >         km
>     >         >>> uarr + (1*m)
>     >         MaskedQ([1.001, X    , 1.001], units=km)
>     >         >>> uarr/(1*s)
>     >         MaskedQ([1., X , 1.], units=km / s)
>     >         >>> (uarr*(1*m))[1:]
>     >         MaskedQ([X , 1.], units=km m)
>     >         >>> np.add.outer(uarr, uarr)
>     >         MaskedQ([[2., X , 2.],
>     >                  [X , X , X ],
>     >                  [2., X , 2.]], units=km)
>     >         >>> print(uarr)
>     >         [1. X  1.] km m
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >     >     > Even if this impossible, I think it is conceptually useful
>     >     to think
>     >     >     > about what the masking class should do. My sense is that,
>     >     e.g., it
>     >     >     > should not attempt to decide when an operation
>     succeeds or not,
>     >     >     but just
>     >     >     > "or together" input masks for regular, multiple-input
>     functions,
>     >     >     and let
>     >     >     > the underlying arrays skip elements for reductions by
>     using
>     >     `where`
>     >     >     > (hey, I did implement that for a reason... ;-). In
>     >     particular, it
>     >     >     > suggests one should not have things like domains and all
>     >     that (I never
>     >     >     > understood why `MaskedArray` did that). If one wants more,
>     >     the class
>     >     >     > should provide a method that updates the mask (a sensible
>     >     default
>     >     >     might
>     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>     being masked
>     >     >     should
>     >     >     > logically support ufuncs and functions, so it can
>     decide what
>     >     >     "isfinite"
>     >     >     > means).
>     >     >
>     >     >     I agree it would be nice to remove domains. It would
>     make life
>     >     easier,
>     >     >     and I could remove a lot of twiddly code! I kept it in
>     for now to
>     >     >     minimize the behavior changes from the old MaskedArray.
>     >     >
>     >     >
>     >     > That makes sense. Could be separated out to a
>     backwards-compatibility
>     >     > class later.
>     >     >
>     >     >
>     >     >     > In any case, I would think that a basic truth should
>     be that
>     >     >     everything
>     >     >     > has a mask with a shape consistent with the data, so
>     >     >     > 1. Each complex numbers has just one mask, and setting
>     >     `a.imag` with a
>     >     >     > masked array should definitely propagate the mask.
>     >     >     > 2. For a masked array with structured dtype, I'd
>     similarly say
>     >     >     that the
>     >     >     > default is for a mask to have the same shape as the array.
>     >     But that
>     >     >     > something like your collection makes sense for the case
>     >     where one
>     >     >     wants
>     >     >     > to mask items in a structure.
>     >     >
>     >     >     Agreed that we should have a single bool per complex or
>     structured
>     >     >     element, and the mask shape is the same as the array shape.
>     >     That's how I
>     >     >     implemented it. But there is still a problem with
>     complex.imag
>     >     >     assignment:
>     >     >
>     >     >         >>> a = MaskedArray([1j, 2, X])
>     >     >         >>> i = a.imag
>     >     >         >>> i[:] = MaskedArray([1, X, 1])
>     >     >
>     >     >     If we make the last line copy the mask to the original
>     array, what
>     >     >     should the real part of a[2] be? Conversely, if we don't
>     copy
>     >     the mask,
>     >     >     what should the imag part of a[1] be? It seems like we might
>     >     "want" the
>     >     >     masks to be OR'd instead, but then should i[2] be masked
>     after
>     >     we just
>     >     >     set it to 1?
>     >     >
>     >     > Ah, I see the issue now... Easiest to implement and closest
>     in analogy
>     >     > to a regular view would be to just let it unmask a[2] (with
>     >     whatever is
>     >     > in real; user beware!).
>     >     >
>     >     > Perhaps better would be to special-case such that `imag`
>     returns a
>     >     > read-only view of the mask. Making `imag` itself read-only would
>     >     prevent
>     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     >     there is
>     >     > no reason this should update the mask.
>     >     >
>     >     > Still, neither is really satisfactory...
>     >     > 
>     >     >
>     >     >
>     >     >     > p.s. I started trying to implement the above "Mixin"
>     class; will
>     >     >     try to
>     >     >     > clean that up a bit so that at least it uses `where` and
>     >     push it up.
>     >     >
>     >     >     I played with "where", but didn't include it since 1.17
>     is not
>     >     released.
>     >     >     To avoid duplication of effort, I've attached a diff of
>     what I
>     >     tried. I
>     >     >     actually get a slight slowdown of about 10% by using
>     where...
>     >     >
>     >     >
>     >     > Your implementation is indeed quite similar to what I got in
>     >     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >     >
>     >     > I think the main benefit is not to presume that whatever is
>     underneath
>     >     > understands 0 or 1, i.e., avoid filling.
>     >     >
>     >     >
>     >     >     If you make progress with the mixin, a push is welcome. I
>     >     imagine a
>     >     >     problem is going to be that np.isscalar doesn't work to
>     detect
>     >     duck
>     >     >     scalars.
>     >     >
>     >     > I fear that in my attempts I've simply decided that only
>     array scalars
>     >     > exist...
>     >     >
>     >     > -- Marten
>     >     >
>     >     > _______________________________________________
>     >     > NumPy-Discussion mailing list
>     >     > [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >     >
>     >
>     >     _______________________________________________
>     >     NumPy-Discussion mailing list
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

+1, this sounds like the right model to me.

One small worry is about name clashes - ideally one wants the masked array to be somewhat of a drop-in for whatever class it is masking (independently of whether it is an actual subclass of it). In this respect, `.data` is a pretty terrible name (and, yes, the cause of lots of problems for astropy's MaskedColumn - not my fault, that one!). In my own trials, thinking that names that include "mask" are fair game, I've been considering a function `.unmask(fill_value=None)` which would replace both `.filled(fill_value)` and `.data` by having the default be not to fill anything (I don't see why a masked array should carry a fill value along; one might use specific strings such as 'minmax' for auto-generated cases). If wanted, one could then add `unmasked = property(unmask)`.
 
Aside: my sense is to, at first at least, feel as unbound as possible from the current MaskedArray - one could then use whatever it is to try to create something that comes close to reproducing it, but only for ease of transition.


That said, I would still not guarantee values under the mask as part of NumPy's API. The result of computations under the mask should be considered an undefined implementation detail, sort of like integer overflow or dict iteration order pre-Python 3.7. The values may even be entirely arbitrary, e.g., in cases where the result is preallocated with empty().

I think that is reasonable. The use cases Ben and I described both are ones where the array is being used as input for a set of computations which differ only in their mask. (Admittedly, in both our cases one could just reinitialize a masked array with the new mask; but I think we share the mental model of that if I don't operate on the masked array, the data doesn't change, so I should just be able to change the mask.)


I'm less confident about the right way to handle missing elements in reductions. For example:
- Should median() also skip missing elements, even though there is no identity element?

I think so. If for mean(), std(), etc., the number of unmasked elements comes into play, I don't see why it wouldn't for median().

- If reductions/aggregations default to skipping missing elements, how is it be possible to express "NA propagating" versions, which are also useful, if slightly less common?

I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`.

A larger issue is the accumulations. Personally, I think those are basically meaningless for masked arrays, as to me logically the result on the position of any masked item should be masked. But, if so, I do not see how the ones "beyond" it could not be masked as well. Since here the right answers seems at least unclear, my sense would be to refuse the temptation to guess (i.e., the user should just explicitly fill with ufunc.identity if this is the right thing to do in their case).
 
I should add that I'm slightly torn about a similar, somewhat related issue: what should `np.minimum(a, b)` do for the case where either a or b is masked? Currently, one just treats this as a bin-op, so the result is masked, but one could argue that this ufunc is a bit like a 2-element reduction, and thus that the unmasked item should "win by default". Possibly, the answer should be different between `np.minimum` and `np.fmin` (since the two differ in how they propagate `NaN` as well - note that you don't include `fmin` and `fmax` in your coverage).

We may want to add a standard "skipna" argument on NumPy aggregations, solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.

It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable, which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful).

All the best,

Marten


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Aldcroft, Thomas
In reply to this post by Marten van Kerkwijk


On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk <[hidden email]> wrote:
Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask,

I'm generally on-board with that mental picture, and agree that the use-case described by Ben (different layers of satellite imagery) is important.  Same thing happens in astronomy data, e.g. you have a CCD image of the sky and there are cosmic rays that contaminate the image.  Those are not garbage data, just pixels that one wants to ignore in some, but not all, contexts.

However, it's worth noting that one cannot blindly forward any operations to the data it holds since the operation may be illegal on that data.  The simplest example is dividing `a / b` where  `b` has data values of 0 but they are masked.  That operation should succeed with no exception, and here the resultant value under the mask is genuinely garbage.

The current MaskedArray seems a bit inconsistent in dealing with invalid calcuations.  Dividing by 0 (if masked) is no problem and returns the numerator.  Taking the log of a masked 0 gives the usual divide by zero RuntimeWarning and puts a 1.0 under the mask of the output.

Perhaps the expression should not even be evaluated on elements where the output mask is True, and all the masked output data values should be set to a predictable value (e.g. zero for numerical, zero-length string for string, or maybe a default fill value).  That at least provides consistent and predictable behavior that is simple to explain.  Otherwise the story is that the data under the mask *might* be OK, unless for a particular element the computation was invalid in which case it is filled with some arbitrary value.  I think that is actually an error-prone behavior that should be avoided.

- Tom
 
ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <[hidden email]> wrote:
On 6/21/19 2:37 PM, Benjamin Root wrote:
> Just to note, data that is masked isn't always garbage. There are plenty
> of use-cases where one may want to temporarily apply a mask for a set of
> computation, or possibly want to apply a series of different masks to
> the data. I haven't read through this discussion deeply enough, but is
> this new class going to destroy underlying masked data? and will it be
> possible to swap out masks?
>
> Cheers!
> Ben Root

Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few
thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do

     >>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

     >>> marr.mask = trial_mask
     >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.


In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.

Best,
Allan


> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>     > Hi Allan,
>     >
>     > This is very impressive! I could get the tests that I wrote for my
>     class
>     > pass with yours using Quantity with what I would consider very minimal
>     > changes. I only could not find a good way to unmask data (I like the
>     > idea of setting the mask on some elements via `ma[item] = X`); is this
>     > on purpose?
>
>     Yes, I want to make it difficult for the user to access the garbage
>     values under the mask, which are often clobbered values. The only way to
>     "remove" a masked value is by replacing it with a new non-masked value.
>
>
>     > Anyway, it would seem easily at the point where I should comment
>     on your
>     > repository rather than in the mailing list!
>
>     To make further progress on this encapsulation idea I need a more
>     complete ducktype to pass into MaskedArray to test, so that's what I'll
>     work on next, when I have time. I'll either try to finish my
>     ArrayCollection type, or try making a simple NDunit ducktype
>     piggybacking on astropy's Unit.
>
>     Best,
>     Allan
>
>
>     >
>     > All the best,
>     >
>     > Marten
>     >
>     >
>     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>     <[hidden email] <mailto:[hidden email]>
>     > <mailto:[hidden email] <mailto:[hidden email]>>>
>     wrote:
>     >
>     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >     >
>     >     >
>     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     >     <[hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > <mailto:[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>>>
>     >     wrote:
>     >     > <snip>
>     >     >
>     >     >     > This may be too much to ask from the initializer, but, if
>     >     so, it still
>     >     >     > seems most useful if it is made as easy as possible to do,
>     >     say, `class
>     >     >     > MaskedQuantity(Masked, Quantity): <very few overrides>`.
>     >     >
>     >     >     Currently MaskedArray does not accept ducktypes as
>     underlying
>     >     arrays,
>     >     >     but I think it shouldn't be too hard to modify it to do so.
>     >     Good idea!
>     >     >
>     >     >
>     >     > Looking back at my trial, I see that I also never got to
>     duck arrays -
>     >     > only ndarray subclasses - though I tried to make the code as
>     >     agnostic as
>     >     > possible.
>     >     >
>     >     > (Trial at
>     >     >
>     >   
>      https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>     >     >
>     >     >     I already partly navigated this mixin-issue in the
>     >     >     "MaskedArrayCollection" class, which essentially does
>     >     >     ArrayCollection(MaskedArray(array)), and only takes about 30
>     >     lines of
>     >     >     boilerplate. That's the backwards encapsulation order from
>     >     what you want
>     >     >     though.
>     >     >
>     >     >
>     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
>     u.m,
>     >     > mask=[True, False, False])` does indeed not have a `.unit`
>     attribute
>     >     > (and cannot represent itself...); I'm not at all sure that my
>     >     method of
>     >     > just creating a mixed class is anything but a recipe for
>     disaster,
>     >     though!
>     >
>     >     Based on your suggestion I worked on this a little today, and
>     now my
>     >     MaskedArray more easily encapsulates both ducktypes and ndarray
>     >     subclasses (pushed to repo). Here's an example I got working
>     with masked
>     >     units using unyt:
>     >
>     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>     >
>     >     [2]: from unyt import m, km
>     >
>     >     [3]: import numpy as np
>     >
>     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>     >
>     >     [5]: uarr
>     >
>     >     MaskedArray([1., X , 3.])
>     >     [6]: uarr + 1*m
>     >
>     >     MaskedArray([1.001, X    , 3.001])
>     >     [7]: uarr.filled()
>     >
>     >     unyt_array([1., 0., 3.], 'km')
>     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>     >
>     >     The catch is the ducktype/subclass has to rigorously follow
>     numpy's
>     >     indexing rules, including distinguishing 0d arrays from
>     scalars. For now
>     >     only I used unyt in the example above since it happens to be
>     less strict
>     >      about dimensionless operations than astropy.units which trips
>     up my
>     >     repr code. (see below for example with astropy.units). Note in
>     the last
>     >     line I lost the dimensions, but that is because unyt does not
>     handle
>     >     np.concatenate. To get that to work we need a true ducktype
>     for units.
>     >
>     >     The example above doesn't expose the ".units" attribute
>     outside the
>     >     MaskedArray, and it doesn't print the units in the repr. But
>     you can
>     >     access them using "filled".
>     >
>     >     While I could make MaskedArray forward unknown attribute
>     accesses to the
>     >     encapsulated array, that seems a bit dangerous/bug-prone at first
>     >     glance, so probably I want to require the user to make a
>     MaskedArray
>     >     subclass to do so. I've just started playing with that
>     (probably buggy),
>     >     and Ive attached subclass examples for astropy.unit and unyt,
>     with some
>     >     example output below.
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >
>     >     Example using the attached astropy unit subclass:
>     >
>     >         >>> from astropy.units import m, km, s
>     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>     >         >>> uarr
>     >         MaskedQ([1., X , 1.], units=km)
>     >         >>> uarr.units
>     >         km
>     >         >>> uarr + (1*m)
>     >         MaskedQ([1.001, X    , 1.001], units=km)
>     >         >>> uarr/(1*s)
>     >         MaskedQ([1., X , 1.], units=km / s)
>     >         >>> (uarr*(1*m))[1:]
>     >         MaskedQ([X , 1.], units=km m)
>     >         >>> np.add.outer(uarr, uarr)
>     >         MaskedQ([[2., X , 2.],
>     >                  [X , X , X ],
>     >                  [2., X , 2.]], units=km)
>     >         >>> print(uarr)
>     >         [1. X  1.] km m
>     >
>     >     Cheers,
>     >     Allan
>     >
>     >
>     >     >     > Even if this impossible, I think it is conceptually useful
>     >     to think
>     >     >     > about what the masking class should do. My sense is that,
>     >     e.g., it
>     >     >     > should not attempt to decide when an operation
>     succeeds or not,
>     >     >     but just
>     >     >     > "or together" input masks for regular, multiple-input
>     functions,
>     >     >     and let
>     >     >     > the underlying arrays skip elements for reductions by
>     using
>     >     `where`
>     >     >     > (hey, I did implement that for a reason... ;-). In
>     >     particular, it
>     >     >     > suggests one should not have things like domains and all
>     >     that (I never
>     >     >     > understood why `MaskedArray` did that). If one wants more,
>     >     the class
>     >     >     > should provide a method that updates the mask (a sensible
>     >     default
>     >     >     might
>     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>     being masked
>     >     >     should
>     >     >     > logically support ufuncs and functions, so it can
>     decide what
>     >     >     "isfinite"
>     >     >     > means).
>     >     >
>     >     >     I agree it would be nice to remove domains. It would
>     make life
>     >     easier,
>     >     >     and I could remove a lot of twiddly code! I kept it in
>     for now to
>     >     >     minimize the behavior changes from the old MaskedArray.
>     >     >
>     >     >
>     >     > That makes sense. Could be separated out to a
>     backwards-compatibility
>     >     > class later.
>     >     >
>     >     >
>     >     >     > In any case, I would think that a basic truth should
>     be that
>     >     >     everything
>     >     >     > has a mask with a shape consistent with the data, so
>     >     >     > 1. Each complex numbers has just one mask, and setting
>     >     `a.imag` with a
>     >     >     > masked array should definitely propagate the mask.
>     >     >     > 2. For a masked array with structured dtype, I'd
>     similarly say
>     >     >     that the
>     >     >     > default is for a mask to have the same shape as the array.
>     >     But that
>     >     >     > something like your collection makes sense for the case
>     >     where one
>     >     >     wants
>     >     >     > to mask items in a structure.
>     >     >
>     >     >     Agreed that we should have a single bool per complex or
>     structured
>     >     >     element, and the mask shape is the same as the array shape.
>     >     That's how I
>     >     >     implemented it. But there is still a problem with
>     complex.imag
>     >     >     assignment:
>     >     >
>     >     >         >>> a = MaskedArray([1j, 2, X])
>     >     >         >>> i = a.imag
>     >     >         >>> i[:] = MaskedArray([1, X, 1])
>     >     >
>     >     >     If we make the last line copy the mask to the original
>     array, what
>     >     >     should the real part of a[2] be? Conversely, if we don't
>     copy
>     >     the mask,
>     >     >     what should the imag part of a[1] be? It seems like we might
>     >     "want" the
>     >     >     masks to be OR'd instead, but then should i[2] be masked
>     after
>     >     we just
>     >     >     set it to 1?
>     >     >
>     >     > Ah, I see the issue now... Easiest to implement and closest
>     in analogy
>     >     > to a regular view would be to just let it unmask a[2] (with
>     >     whatever is
>     >     > in real; user beware!).
>     >     >
>     >     > Perhaps better would be to special-case such that `imag`
>     returns a
>     >     > read-only view of the mask. Making `imag` itself read-only would
>     >     prevent
>     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but
>     >     there is
>     >     > no reason this should update the mask.
>     >     >
>     >     > Still, neither is really satisfactory...
>     >     > 
>     >     >
>     >     >
>     >     >     > p.s. I started trying to implement the above "Mixin"
>     class; will
>     >     >     try to
>     >     >     > clean that up a bit so that at least it uses `where` and
>     >     push it up.
>     >     >
>     >     >     I played with "where", but didn't include it since 1.17
>     is not
>     >     released.
>     >     >     To avoid duplication of effort, I've attached a diff of
>     what I
>     >     tried. I
>     >     >     actually get a slight slowdown of about 10% by using
>     where...
>     >     >
>     >     >
>     >     > Your implementation is indeed quite similar to what I got in
>     >     > __array_ufunc__ (though one should "&" the where with ~mask).
>     >     >
>     >     > I think the main benefit is not to presume that whatever is
>     underneath
>     >     > understands 0 or 1, i.e., avoid filling.
>     >     >
>     >     >
>     >     >     If you make progress with the mixin, a push is welcome. I
>     >     imagine a
>     >     >     problem is going to be that np.isscalar doesn't work to
>     detect
>     >     duck
>     >     >     scalars.
>     >     >
>     >     > I fear that in my attempts I've simply decided that only
>     array scalars
>     >     > exist...
>     >     >
>     >     > -- Marten
>     >     >
>     >     > _______________________________________________
>     >     > NumPy-Discussion mailing list
>     >     > [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >     >
>     >
>     >     _______________________________________________
>     >     NumPy-Discussion mailing list
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>     >     https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>     >
>     > _______________________________________________
>     > NumPy-Discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://mail.python.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk
Hi Tom,


I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask,

I'm generally on-board with that mental picture, and agree that the use-case described by Ben (different layers of satellite imagery) is important.  Same thing happens in astronomy data, e.g. you have a CCD image of the sky and there are cosmic rays that contaminate the image.  Those are not garbage data, just pixels that one wants to ignore in some, but not all, contexts.

However, it's worth noting that one cannot blindly forward any operations to the data it holds since the operation may be illegal on that data.  The simplest example is dividing `a / b` where  `b` has data values of 0 but they are masked.  That operation should succeed with no exception, and here the resultant value under the mask is genuinely garbage.

Even in the present implementation, the operation is just forwarded, with numpy errstate set to ignore all errors. And then after the fact some half-hearted remediation is done.
 
The current MaskedArray seems a bit inconsistent in dealing with invalid calcuations.  Dividing by 0 (if masked) is no problem and returns the numerator.  Taking the log of a masked 0 gives the usual divide by zero RuntimeWarning and puts a 1.0 under the mask of the output.

Perhaps the expression should not even be evaluated on elements where the output mask is True, and all the masked output data values should be set to a predictable value (e.g. zero for numerical, zero-length string for string, or maybe a default fill value).  That at least provides consistent and predictable behavior that is simple to explain.  Otherwise the story is that the data under the mask *might* be OK, unless for a particular element the computation was invalid in which case it is filled with some arbitrary value.  I think that is actually an error-prone behavior that should be avoided.

I think I agree with Allan here, that after a computation, one generally simply cannot safely assume anything for masked elements.

But it is reasonable for subclasses to define what they want to do "post-operation"; e.g., for numerical arrays, it might make generally make sense to do
```
    notok = ~np.isfinite(result)
    mask |= notok
```
and one could then also do
```
    result[notok] = fill_value
```

But I think one might want to leave that to the user.

All the best,

Marten


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Stephan Hoyer-2
In reply to this post by Marten van Kerkwijk


On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk <[hidden email]> wrote:
- If reductions/aggregations default to skipping missing elements, how is it be possible to express "NA propagating" versions, which are also useful, if slightly less common?

I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`.

I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.
 
We may want to add a standard "skipna" argument on NumPy aggregations, solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.

It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable, which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful).

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.sum and (2) with np.nansum.
- With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Marten van Kerkwijk
Hi Stephan,

In slightly changed order:

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.sum and (2) with np.nansum.
- With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?

I think this is all correct.

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`.

I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.

Implicitly in the above, I agree with having the mask not necessarily be a plain ndarray, but something that can determine part of the action. Makes sense to generalize that to duck arrays for the reasons you give. Indeed, if we let the mask do the mask propagation as well, it might help make the implementation substantially easier (e.g., `logical_and.reduce` and `logical_or.reduce` can be super-fast on a bitmask!).
 
Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument).

It may be good to collect a few more test cases... E.g., I'd like to mask some of the astropy classes that are only very partial duck arrays, in that they cover only the shape aspect, and which do have some operators and for which it would be nice not to feel forced to use __array_ufunc__.

All the best,

Marten

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Stephan Hoyer-2
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <[hidden email]> wrote:
Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False).

I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types.

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument).

You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions.

For example, compare these two implementations of add:

def  add1(x, y):
    return MaskedArray(x.data + y.data,  x.mask | y.mask)

def  add2(x, y):
    return MaskedArray(x.data + y.data,  x.mask + y.mask)

The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects.

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: new MaskedArray class

Eric Wieser

I think we’d need to consider separately the operation on the mask and on the data. In my proposal, the data would always do np.sum(array, where=~mask), while how the mask would propagate might depend on the mask itself,

I quite like this idea, and I think Stephan’s strawman design is actually plausible, where MaskedArray.mask is either an InvalidMask or a IgnoreMask instance to pick between the different propagation types. Both classes could simply have an underlying ._array attribute pointing to a duck-array of some kind that backs their boolean data.

The second version requires that you also know how Mask classes work, and how they implement +

I remain unconvinced that Mask classes should behave differently on different ufuncs. I don’t think np.minimum(ignore_na, b) is any different to np.add(ignore_na, b) - either both should produce b, or both should produce ignore_na. I would lean towards produxing ignore_na, and propagation behavior differing between “ignore” and “invalid” only for reduce / accumulate operations, where the concept of skipping an application is well-defined.

Some possible follow-up questions that having two distinct masked types raise:

  • what if I want my data to support both invalid and skip fields at the same time? sum([invalid, skip, 1]) == invalid
  • is there a use case for more that these two types of mask? invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting things to track through a calculation, possibly a dictionary of named masks.

Eric


On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <[hidden email]> wrote:
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <[hidden email]> wrote:
Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False).

I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types.

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument).

You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions.

For example, compare these two implementations of add:

def  add1(x, y):
    return MaskedArray(x.data + y.data,  x.mask | y.mask)

def  add2(x, y):
    return MaskedArray(x.data + y.data,  x.mask + y.mask)

The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects.
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
12