Hi all,

In https://github.com/numpy/numpy/issues/18407 it was reported that
there is a regression for `np.array()` and friends in NumPy 1.20 for
code such as:

    np.array(["1234"], dtype=("U1", 4))
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')


The Basics
----------

This happens when you ask for a rare "subarray" dtype. Ways to create
one are:

    np.dtype(("U1", 4))
    np.dtype("(4)U1,")  # (does not have a field, only a subarray)

Both give the same subarray dtype: a "U1" dtype with shape 4.
One thing to know about these dtypes is that they cannot be attached
to an array:

    np.zeros(3, dtype="(4)U1,").dtype == "U1"
    np.zeros(3, dtype="(4)U1,").shape == (3, 4)

I.e. the shape is moved/added into the array itself (instead of
remaining part of the dtype).


The Change
----------

Now what changed, and why? When filling subarray dtypes, NumPy
normally fills every element with the same input. In cases like the
one above, NumPy will usually give the 1.20 result because it assigns
"1234" to every subarray element individually. Maybe confusingly, this
truncates so that only the "1" is actually assigned. We can prove it
with a structured dtype (same result in 1.19 and 1.20):

    >>> np.array(["1234"], dtype="(4)U1,i")
    array([(['1', '1', '1', '1'], 1234)],
          dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])

Another, weirder case which changed (more obviously for the better) is:

    >>> np.array("1234", dtype="(4)U1,")
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')

And, to point it out, we can have subarrays that are not 1-D:

    >>> np.array(["12"], dtype=("(2,2)U1,"))
    array([[['1', '1'],
            ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 all elements are '1'


The Cause
---------

The cause of the 1.19 behaviour is two-fold:

1. The "subarray" part of the dtype is moved into the array after the
   dimension is found. At this point strings are always considered
   "scalars".
   In most of the above examples, the new array shape is (1,)+(4,).

2. When filling the new array with values, it now has an _additional_
   dimension! Because of this, the string is now suddenly considered a
   sequence, so it behaves as if `list("1234")` had been passed, even
   though NumPy would normally never consider a string a sequence.


The Solution?
-------------

I honestly don't have one. We can consider strings as sequences in
this weird special case. That will probably create other weird special
cases, but they would be even more hidden (I expect mainly odder things
throwing an error).

Should we try to document this better in the release notes, or can we
think of some better (or at least louder) solution?

Cheers,

Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
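
For readers following along, here is a minimal sketch of the basics
described in the message above. It is not taken from the thread: the
variable names are illustrative, and it deliberately avoids the
ambiguous string + subarray call, whose result depends on the NumPy
version. It only shows that the subarray shape ends up in the array
shape, that a string is normally treated as a scalar (and truncated
when cast to "U1"), and one explicit way to get one character per
element.

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))
print(sub.base, sub.shape)

# The subarray shape is absorbed into the array's own shape,
# so the resulting array has the plain base dtype.
arr = np.zeros(3, dtype=sub)
print(arr.dtype, arr.shape)

# With a plain "U1" dtype the string is a scalar and is truncated
# to its first character.
truncated = np.array(["1234"], dtype="U1")
print(truncated)

# Splitting the string explicitly does not rely on how NumPy
# treats strings for subarray dtypes.
chars = np.array([list("1234")], dtype="U1")
print(chars)
```

Splitting with `list(...)` before calling `np.array` sidesteps the
question of whether the string counts as a sequence, which is the
assumption the reported code relied on.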
On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <[hidden email]> wrote:
> Hi all,

There are way too many unsafe assumptions in this example. It's an edge
case of an edge case.

I don't think we should be beholden to continuing to support this
behavior, which was obviously never anticipated. If there was a way to
raise a warning or error in potentially ambiguous situations like this,
I would support it.
On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <[hidden email]> wrote:
I was honestly surprised there's even such a thing as a "subarray data type", I've never seen it used in the wild. Looking at the release notes you already have, https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes, all I'm thinking is that no one should ever be writing code like that.
+1

Ralf
On Wed, 2021-02-17 at 11:15 +0100, Ralf Gommers wrote:
> On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <[hidden email]> wrote:
> > On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <[hidden email]> wrote:
> > > <snip>
>
> I was honestly surprised there's even such a thing as a "subarray
> data type", I've never seen it used in the wild. Looking at the
> release notes you already have,
> https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes,
> all I'm thinking is that no one should ever be writing code like
> that.

wrong. I guess the spelled-out question here should have been:

    Does anyone think there is enough usage of this in the wild to
    worry about it?

Based on the responses so far, it seems (and I hope) not...

> > There are way too many unsafe assumptions in this example. It's an
> > edge case of an edge case.
> >
> > I don't think we should be beholden to continuing to support this
> > behavior, which was obviously never anticipated. If there was a way
> > to raise a warning or error in potentially ambiguous situations
> > like this, I would support it.

probably too noisy/much. We can probably flag subarray+strings and warn
in that case. Just a full undo seems tricky.

What I mean is a warning like:

    Oops, string+subarray can lead to weird things and unfortunately a
    fix in behaviour means 1.20 may have a different result compared
    to <=1.19.x. (You are seeing the new behaviour, see release notes.)

If that sounds useful, I can do it, but it will lead to an unavoidable
warning.

Cheers,

Sebastian

> +1
>
> Ralf
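
As a rough illustration of the "flag subarray+strings and warn" idea
in the message above, the check could look something like the sketch
below. This is only an assumption about how such a check might be
written in user code: `warn_if_string_subarray` is a hypothetical
helper, not a NumPy function, and the warning text is a paraphrase.
NumPy itself would have to do this inside its array-creation code.

```python
import warnings

import numpy as np


def warn_if_string_subarray(dtype):
    """Hypothetical helper (not NumPy API).

    Warn and return True if `dtype` is a subarray dtype whose base
    dtype is a bytes ("S") or unicode ("U") string; otherwise
    return False.
    """
    dt = np.dtype(dtype)
    # `subdtype` is None unless the dtype describes a subarray.
    if dt.subdtype is not None and dt.base.kind in "SU":
        warnings.warn(
            "string + subarray dtype: NumPy >=1.20 may give a different "
            "result than <=1.19.x; see the 1.20 release notes",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


print(warn_if_string_subarray(("U1", 4)))  # string base: warns
print(warn_if_string_subarray(("i4", 4)))  # integer base: no warning
```

Keying the check on the dtype alone matches the "unavoidable warning"
caveat: it fires for every string + subarray request, whether or not
the input would actually have been filled differently in 1.19.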