What to do about structured string dtype and string regression?

Sebastian Berg
Hi all,

In https://github.com/numpy/numpy/issues/18407 it was reported that
there is a regression for `np.array()` and friends in NumPy 1.20 for
code such as:

    np.array(["1234"], dtype=("U1", 4))
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')


The Basics
----------

This happens when you ask for a rare "subarray" dtype.  Ways to create
it are:

    np.dtype(("U1", 4))
    np.dtype("(4)U1,")  # (does not have a field, only a subarray)

Both give the same subarray dtype: a "U1" dtype with shape (4,).
One thing to know about these dtypes is that they cannot be attached to
an array:

    np.zeros(3, dtype="(4)U1,").dtype == "U1"
    np.zeros(3, dtype="(4)U1,").shape == (3, 4)

I.e. the shape is moved/added into the array itself (instead of
remaining part of the dtype).
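
A minimal sketch of the point above, using only the tuple spelling of
the dtype.  The variable names `dt` and `arr` are just for illustration:

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,)
dt = np.dtype(("U1", 4))
print(dt.subdtype)        # the (base dtype, subarray shape) pair
print(dt.base, dt.shape)  # base dtype and subarray shape separately

# Creating an array with it moves the subarray shape into the array
# shape; the array itself only carries the base dtype.
arr = np.zeros(3, dtype=dt)
print(arr.dtype, arr.shape)
```

So `arr` ends up with shape (3, 4) and the plain "U1" dtype; the
subarray dtype itself never survives on the array.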

The Change
----------

Now what changed, and why?  When filling subarray dtypes, NumPy
normally fills every element with the same input.  For the example
above, NumPy in most cases gives the 1.20 result because it assigns
"1234" to every subarray element individually.  Maybe confusingly, this
truncates so that only the "1" is actually assigned.  We can prove it
with a structured dtype (same result in 1.19 and 1.20):

    >>> np.array(["1234"], dtype="(4)U1,i")
    array([(['1', '1', '1', '1'], 1234)],
          dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
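
The truncation can also be seen without any subarray dtype.  This is
only a sketch of the "assign the same string to every element" step
using a plain "U1" array; it is not the code path NumPy takes
internally:

```python
import numpy as np

# Assigning one string to every element of a "U1" array: the string is
# treated as a single scalar, broadcast, and cut to one character.
a = np.empty(4, dtype="U1")
a[...] = "1234"
print(a)  # ['1' '1' '1' '1']

# The same truncation happens when an explicit "U1" dtype is requested.
b = np.array(["1234"], dtype="U1")
print(b)  # ['1']
```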

Another, weirder case which changed (more obviously for the better) is:

    >>> np.array("1234", dtype="(4)U1,")
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')

And, to point it out, we can have subarrays that are not 1-D:

    >>> np.array(["12"], dtype="(2,2)U1,")
    array([[['1', '1'],
            ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 all are '1'
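
Both results for the 2-D case can be spelled out explicitly, without
relying on the subarray dtype at all.  This is a sketch that rebuilds
the two outcomes by hand (the names `like_119` and `like_120` are made
up for illustration):

```python
import numpy as np

# The subarray shape (2, 2) is appended to the array shape (1,).
shape = np.zeros(1, dtype=("U1", (2, 2))).shape
print(shape)  # (1, 2, 2)

# 1.19-style result: the string acts like list("12") along the first
# subarray axis and is repeated along the last one.
chars = np.array(list("12"), dtype="U1")
like_119 = np.broadcast_to(chars[:, None], (2, 2))[None]

# 1.20-style result: "12" is one scalar, assigned (and truncated to
# "1") for every element.
like_120 = np.full((1, 2, 2), "12", dtype="U1")

print(like_119.tolist())  # [[['1', '1'], ['2', '2']]]
print(like_120.tolist())  # [[['1', '1'], ['1', '1']]]
```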


The Cause
---------

The cause of the 1.19 behaviour is two-fold:

1. The "subarray" part of the dtype is moved into the array after the
dimension is found.  At this point strings are always considered
"scalars".  In most of the above examples, the new array shape is
(1,)+(4,).

2. When filling the new array with values, it now has an _additional_
dimension!  Because of this, the string is suddenly considered a
sequence, so it behaves as if `list("1234")` had been passed, even
though NumPy would normally never consider a string a sequence.
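
For anyone who relied on the 1.19 result, passing the characters as an
explicit sequence gives it without depending on this special case.  A
small sketch (variable names are illustrative only):

```python
import numpy as np

# Normally a string is a scalar: one element, no extra dimension.
scalar_like = np.array(["1234"])
print(scalar_like.shape, scalar_like.dtype)  # shape (1,), 4-char unicode

# Spelling out list("1234") makes the "string as sequence" intent
# explicit and does not need a subarray dtype.
explicit = np.array([list("1234")], dtype="U1")
print(explicit.shape)     # (1, 4)
print(explicit.tolist())  # [['1', '2', '3', '4']]
```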


The Solution?
-------------

I honestly don't have one.  We could consider strings as sequences in
this weird special case.  That would probably create other weird special
cases, but they would be even more hidden (I expect mainly odder things
throwing an error).

Should we try to document this better in the release notes or can we
think of some better (or at least louder) solution?


Cheers,

Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: What to do about structured string dtype and string regression?

Stephan Hoyer-2
On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <[hidden email]> wrote:
> Should we try to document this better in the release notes or can we
> think of some better (or at least louder) solution?

There are way too many unsafe assumptions in this example. It's an edge case of an edge case.

I don't think we should be beholden to continuing to support this behavior, which was obviously never anticipated. If there was a way to raise a warning or error in potentially ambiguous situations like this, I would support it.





Re: What to do about structured string dtype and string regression?

ralfgommers


On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <[hidden email]> wrote:
On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <[hidden email]> wrote:
> > Should we try to document this better in the release notes or can we
> > think of some better (or at least louder) solution?

I was honestly surprised there's even such a thing as a "subarray data type"; I've never seen it used in the wild. Looking at the release notes you already have, https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes, all I'm thinking is that no one should ever be writing code like that.


> There are way too many unsafe assumptions in this example. It's an
> edge case of an edge case.
>
> I don't think we should be beholden to continuing to support this
> behavior, which was obviously never anticipated. If there was a way to
> raise a warning or error in potentially ambiguous situations like
> this, I would support it.

+1

Ralf



Re: What to do about structured string dtype and string regression?

Sebastian Berg
On Wed, 2021-02-17 at 11:15 +0100, Ralf Gommers wrote:

> On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <[hidden email]>
> wrote:
>
> > On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <
> > [hidden email]>
> > wrote:
> >
> > > Should we try to document this better in the release notes or can
> > > we
> > > think of some better (or at least louder) solution?
> > >
> >
> I was honestly surprised there's even such a thing as a "subarray
> data
> type", I've never seen it used in the wild. Looking at the release
> notes
> you already have,
> https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes,
> all I'm thinking is that no one should ever be writing code like
> that.
>
Sure, if you look at the big picture it's arguably weird or even plain
wrong.  I guess the spelled-out question here should have been:

    Does anyone think there is enough usage of this in the wild to
    worry about it?

Based on the current responses, it seems (and I hope) not...

>
> > There are way too many unsafe assumptions in this example. It's an
> > edge
> > case of an edge case.
> >
> > I don't think we should be beholden to continuing to support this
> > behavior, which was obviously never anticipated. If there was a way
> > to
> > raise a warning or error in potentially ambiguous situations like
> > this, I
> > would support it.
> >
>
We can warn for all subarrays (including a deprecation), but that is
probably too noisy.
We can probably flag subarray+strings and warn in that case.  Just a
full undo seems tricky.  What I mean is a warning like:

    Oops, string+subarray can lead to weird things and unfortunately
    a fix in behaviour means 1.20 may have a different result compared
    to <=1.19.x. (you are seeing the new behaviour, see release notes)

If that sounds useful, I can do it, but it will lead to an unavoidable
warning.
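
As a rough user-side illustration of what "flag subarray+strings" could
look like, here is a sketch.  The helper `warn_if_string_subarray` is
hypothetical: it is not part of NumPy and not how NumPy itself would
implement the check.  It only inspects the requested dtype.

```python
import warnings

import numpy as np


def warn_if_string_subarray(dtype):
    """Hypothetical helper: warn for a subarray dtype over strings."""
    dt = np.dtype(dtype)
    # dt.subdtype is None for ordinary dtypes and a (base, shape) pair
    # for subarray dtypes; kind "U"/"S" means unicode/bytes strings.
    if dt.subdtype is not None and dt.base.kind in "US":
        warnings.warn(
            "string + subarray dtype: the result may differ between "
            "NumPy <=1.19.x and 1.20 (see the release notes)",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_string_subarray(("U1", 4))    # warns
    plain = warn_if_string_subarray("U1")           # no subarray
    numeric = warn_if_string_subarray(("i4", 4))    # subarray, no string
```

A check like this can only say that the combination is suspicious; it
cannot tell whether the caller expected the old or the new result, which
is why the warning would be unavoidable for affected code.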

Cheers,

Sebastian


> +1
>
> Ralf