# Moving forward with value based casting

24 messages

## Moving forward with value based casting

Hi all,

TL;DR: Value based promotion seems complex both for users and for ufunc dispatching/promotion logic. Is there any way we can move forward here, and if we do, could we just risk some possible (maybe non-existent) corner cases breaking early to get on the way?

-----------

Currently, when you write code such as:

    arr = np.array([1, 43, 23], dtype=np.uint16)
    res = arr + 1

NumPy uses fairly sophisticated logic to decide that `1` can be represented as a uint16, and thus for all unary functions (and most others as well), the output will have a `res.dtype` of uint16.

Similar logic also exists for floating point types, where a lower precision floating point can be used:

    arr = np.array([1, 43, 23], dtype=np.float32)
    (arr + np.float64(2.)).dtype  # will be float32

Currently, this value based logic is enforced by checking whether the cast is possible: "4" can be cast to int8 or uint8. So the first call above will at some point check whether "uint16 + uint16 -> uint16" is a valid operation, find that it is, and thus stop searching. (There is additional logic so that when both/all operands are scalars, the rule is not applied.)

Note that this is defined in terms of casting: "1" can safely be cast to uint8, even though 1 may be typed as int64. This logic thus affects all promotion rules as well (i.e. what the output dtype should be).

There are 2 main discussion points/issues about it:

1. Should value based casting/promotion logic exist at all?

Arguably an `np.int32(3)` has type information attached to it, so why should we ignore it? It can also be tricky for users, because a small change in values can change the result data type. Because 0-D arrays and scalars are too close inside NumPy (you will often not know which one you get), there is not much option but to handle them identically. However, it seems pretty odd that:

 * `np.array(3, dtype=np.int32) + np.arange(10, dtype=int8)`
 * `np.array([3], dtype=np.int32) + np.arange(10, dtype=int8)`

give a different result.

This is a bit different for python scalars, which do not have a type attached already.

2. Promotion and type resolution in ufuncs:

What is currently bothering me is that the decision what the output dtypes should be depends on the values in complicated ways. It would be nice if we could decide which type signature to use without actually looking at values (or at least only very early on).

One reason here is caching and simplicity. I would like to be able to cache which loop should be used for what input. Having value based casting in there bloats up the problem. Of course it currently works OK, but especially when user dtypes come into play, caching would seem like a nice optimization option.

Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot, it is not as simple as finding the "minimal" dtype once and working with that. Of course Eric and I discussed this a bit before, and you could create an internal "uint7" dtype whose only purpose is to flag that a cast to int8 is safe.

I suppose it is possible I am barking up the wrong tree here, and this caching/predictability is not vital (or can be solved with such an internal dtype easily, although I am not sure that seems elegant).

Possible options to move forward
--------------------------------

I still have to see a bit how tricky things are. But there are a few possible options. I would like to move the scalar logic to the beginning of ufunc calls:

  * The uint7 idea would be one solution
  * Simply implement something that works for numpy and all except
    strange external ufuncs (I can only think of numba as a plausible
    candidate for creating such).

My current plan is to see where the second thing leaves me.

We also should see if we cannot move the whole thing forward, in which case the main decision would be forward to where. My opinion is currently that when a value clearly has a dtype associated with it, we should always use that dtype in the future. This mostly means that numpy dtypes such as `np.int64` will always be treated like an int64, and never like a `uint8` just because they happen to be castable to that.

For values without a dtype attached (read: python integers, floats), I see three options, from more complex to simpler:

1. Keep the current logic in place as much as possible
2. Only support value based promotion for operators, e.g.:
   `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
   The upside is that it limits the complexity to a much simpler
   problem; the downside is that the ufunc call and operator match
   less clearly.
3. Just associate python float with float64 and python integers with
   long/int64 and force users to always type them explicitly if they
   need to.

The downside of 1. is that it doesn't help with simplifying the current situation all that much, because we still have the special casting around...

I have realized that this got much too long, so I hope it makes sense. I will continue to dabble along on these things a bit, so if nothing else maybe writing it helps me to get a bit clearer on things...

Best,

Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
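[Editor's note: the promotion behaviour described above can be checked directly; a minimal sketch using only queries whose results are stable across NumPy versions:]

```python
import numpy as np

# The Python scalar 1 carries no dtype of its own, so it adapts
# to the array's dtype:
arr = np.array([1, 43, 23], dtype=np.uint16)
print((arr + 1).dtype)         # uint16
print(np.result_type(arr, 1))  # uint16 -- the same rule, queried directly

# Two operands that both carry a dtype promote by dtype alone:
print(np.promote_types(np.int8, np.int32))  # int32
```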

## Re: Moving forward with value based casting

On Wed, Jun 5, 2019 at 1:43 PM Sebastian Berg <[hidden email]> wrote:

> Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot, it
> is not as simple as finding the "minimal" dtype once and working with
> that. Of course Eric and I discussed this a bit before, and you could
> create an internal "uint7" dtype whose only purpose is to flag that a
> cast to int8 is safe.

Does NumPy actually have any logic that does these sorts of checks currently? If so, it would be interesting to see what it is.

My experiments suggest that we currently have this logic of finding the "minimal" dtype that can hold the scalar value:

    >>> np.array([127], dtype=np.int8) + 127  # silent overflow!
    array([-2], dtype=int8)

    >>> np.array([127], dtype=np.int8) + 128  # correct result
    array([255], dtype=int16)

> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.

I think it would be fine to special case operators, but NEP-13 means that the ufuncs corresponding to operators really do need to work exactly the same way. So we should also special-case those ufuncs.

I don't think option (3) is viable. Too many users rely upon arithmetic like "x + 1" having a predictable dtype.

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
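[Editor's note: Stephan's 127 vs 128 experiment can be related to the "minimal dtype" search NumPy already exposes as `np.min_scalar_type`; the sketch below also shows the gap that the hypothetical "uint7" would fill:]

```python
import numpy as np

# NumPy's existing "minimal dtype" search for scalar values:
print(np.min_scalar_type(127))  # uint8
print(np.min_scalar_type(128))  # uint8
print(np.min_scalar_type(-1))   # int8

# The gap: 127 and 128 both minimize to uint8, and uint8 promotes with
# int8 to int16 -- even though 127 itself fits an int8. A plain
# minimal-dtype lookup therefore cannot reproduce the
# "int8_arr + 127 -> int8" vs "int8_arr + 128 -> int16" split.
print(np.promote_types(np.min_scalar_type(127), np.int8))  # int16
```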

## Re: Moving forward with value based casting

On Wed, 2019-06-05 at 14:14 -0700, Stephan Hoyer wrote:

> Does NumPy actually have any logic that does these sorts of checks
> currently? If so, it would be interesting to see what it is.
>
> My experiments suggest that we currently have this logic of finding
> the "minimal" dtype that can hold the scalar value:
>
>     >>> np.array([127], dtype=np.int8) + 127  # silent overflow!
>     array([-2], dtype=int8)
>
>     >>> np.array([127], dtype=np.int8) + 128  # correct result
>     array([255], dtype=int16)

The current checks all come down to `np.can_cast` (on the C side this is `PyArray_CanCastArray()`) answering True. The actual result value is not taken into account, of course. So 127 can be represented as an int8, and since the "int8,int8->int8" loop is checked first (and "can cast" correctly), it is used.

Alternatively, you can think of it as using `np.result_type()`, which will, for all practical purposes, give the same dtype (but result type may or may not actually be used, and there are some subtle differences in principle).

Effectively, in your example you could reduce it to a minimal dtype of uint7 for 127, since a uint7 can be cast safely to an int8 and also to a uint8. (If you would just say the minimal dtype is uint8, you could not distinguish the two examples.)

Does that answer the question?

Best,

Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
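[Editor's note: the first-fit loop search Sebastian describes can be sketched in a few lines; the casting table and loop order below are toy stand-ins, including the hypothetical "uint7", and not NumPy's real internals:]

```python
# Toy model of the first-fit ufunc loop search described above.
# SAFE_CASTS maps each dtype to the set of dtypes it can be cast to
# safely; ADD_LOOPS lists the available loops in the order they are
# checked. Both are illustrative, not NumPy's actual tables.
SAFE_CASTS = {
    "uint7":  {"uint7", "int8", "uint8", "int16", "uint16"},
    "int8":   {"int8", "int16"},
    "uint8":  {"uint8", "int16", "uint16"},
    "int16":  {"int16"},
    "uint16": {"uint16"},
}
ADD_LOOPS = ["int8", "uint8", "int16", "uint16"]  # checked in this order

def select_loop(operand_types):
    """Return the first loop dtype to which every operand casts safely."""
    for loop in ADD_LOOPS:
        if all(loop in SAFE_CASTS[t] for t in operand_types):
            return loop
    raise TypeError("no matching loop")

print(select_loop(["int8", "uint7"]))    # int8   (the 127 case)
print(select_loop(["int8", "uint8"]))    # int16  (the 128 case)
print(select_loop(["uint16", "uint7"]))  # uint16 (uint16_arr + 1)
```

Note how "uint7" is what lets the model pick the int8 loop for 127 while a plain uint8 minimal dtype would force int16 for 128.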

## Re: Moving forward with value based casting

In reply to this post by Sebastian Berg

Hi all,

Maybe to clarify this at least a little, here are some examples of what currently happens and what I could imagine we go to (all in terms of output dtype).

    float32_arr = np.ones(10, dtype=np.float32)
    int8_arr = np.ones(10, dtype=np.int8)
    uint8_arr = np.ones(10, dtype=np.uint8)

Current behaviour:
------------------

    float32_arr + 12.     # float32
    float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf)

    int8_arr + 127     # int8
    int8_arr + 128     # int16
    int8_arr + 2**20   # int32
    uint8_arr + -1     # uint16

    # But only for arrays that are not 0d:
    int8_arr + np.array(1, dtype=np.int32)    # int8
    int8_arr + np.array([1], dtype=np.int32)  # int32

    # When the actual typing is given, this does not change:
    float32_arr + np.float64(12.)                  # float32
    float32_arr + np.array(12., dtype=np.float64)  # float32

    # Except for inexact types, or complex:
    int8_arr + np.float16(3)  # float16  (same as array behaviour)

    # The exact same happens with all ufuncs:
    np.add(float32_arr, 1)                                # float32
    np.add(float32_arr, np.array(12., dtype=np.float64))  # float32

Keeping value based casting only for python types
-------------------------------------------------

In this case, most examples above stay unchanged, because they use plain python integers or floats, such as 2, 127, 12., 3, ... without any type information attached, such as `np.float64(12.)`. These change, for example:

    float32_arr + np.float64(12.)                         # float64
    float32_arr + np.array(12., dtype=np.float64)         # float64
    np.add(float32_arr, np.array(12., dtype=np.float64))  # float64

    # so if you use `np.int32`, it will be treated the same as, say,
    # np.uint64(10000):
    int8_arr + np.int32(1)      # int32
    int8_arr + np.int32(2**20)  # int32

Remove value based casting completely
-------------------------------------

We could simply abolish it completely; a python `1` would always behave the same as `np.int_(1)`. The downside of this is that:

    int8_arr + 1  # int64 (or int32)

suddenly uses much more memory. Or, we remove it from ufuncs, but not from operators:

    int8_arr + 1  # int8 dtype

but:

    np.add(int8_arr, 1)  # int64
    # same as:
    np.add(int8_arr, np.array(1))  # int16

The main reason why I was wondering about that is that for operators the logic seems fairly simple, but for general ufuncs it seems more complex.

Best,

Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
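[Editor's note: the memory downside of removing value based casting can be made concrete with a quick check; this uses only promotion queries and current-behaviour results that are stable, with `np.int64` standing in for the default integer type (int64 on most platforms):]

```python
import numpy as np

int8_arr = np.ones(10, dtype=np.int8)

# Today, the bare Python 1 adapts to the array's dtype:
res = int8_arr + 1
print(res.dtype, res.nbytes)  # int8 10

# If `1` were instead treated like np.int64(1), ordinary dtype
# promotion would apply, so 10 elements would take 80 bytes:
print(np.promote_types(np.int8, np.int64))  # int64
```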

## Re: Moving forward with value based casting

 A few thoughts:- We're not trying to achieve systematic guards against integer overflow / wrapping in ufunc inner loops, right? The performance tradeoffs for a "result-based" casting / exception handling addition would presumably be controversial? I know there was some discussion about having an "overflow detection mode"  (toggle) of some sort that could be activated for ufunc loops, but don't think that gained much traction/ priority. I think for floats we have an awkward way to propagate something back to the user if there's an issue.- It sounds like the objective is instead primarily to achieve pure dtype-based promotion, which is then effectively just a casting table, which is what I think you mean by "cache?"- Is it a safe assumption that for a cache (dtype-only casting table), the main tradeoff is that we'd likely tend towards conservative upcasting and using more memory in output types in many cases vs. NumPy at the moment? Stephan seems concerned about that, presumably because x + 1 suddenly changes output dtype in an overwhelming number of current code lines and future simple examples for end users.- If np.array + 1 absolutely has to stay the same output dtype moving forward, then "Keeping Value based casting only for python types" is the one that looks most promising to me initially, with a few further concerns:1) Would that give you enough refactoring "wiggle room" to achieve the simplifications you need? If value-based promotion still happens for a non-NumPy operand, can you abstract that logic cleanly from the "pure dtype cache / table" that is planned for NumPy operands?2) Is the "out" argument to ufuncs a satisfactory alternative to the "power users" who want to "override" default output casting type? We suggest that they pre-allocate an output array of the desired type if they want to save memory and if they overflow or wrap integers that is their problem. 
Can we reasonably ask people who currently depend on the memory-conservation they might get from value-based behavior to adjust in this way?3) Presumably "out" does / will circumvent the "cache / dtype casting table?"TylerOn Wed, 5 Jun 2019 at 15:37, Sebastian Berg <[hidden email]> wrote:Hi all, Maybe to clarify this at least a little, here are some examples for what currently happen and what I could imagine we can go to (all in terms of output dtype). float32_arr = np.ones(10, dtype=np.float32) int8_arr = np.ones(10, dtype=np.int8) uint8_arr = np.ones(10, dtype=np.uint8) Current behaviour: ------------------ float32_arr + 12.  # float32 float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf) int8_arr + 127     # int8 int8_arr + 128     # int16 int8_arr + 2**20   # int32 uint8_arr + -1     # uint16 # But only for arrays that are not 0d: int8_arr + np.array(1, dtype=np.int32)  # int8 int8_arr + np.array([1], dtype=np.int32)  # int32 # When the actual typing is given, this does not change: float32_arr + np.float64(12.)                  # float32 float32_arr + np.array(12., dtype=np.float64)  # float32 # Except for inexact types, or complex: int8_arr + np.float16(3)  # float16  (same as array behaviour) # The exact same happens with all ufuncs: np.add(float32_arr, 1)                               # float32 np.add(float32_arr, np.array(12., dtype=np.float64)  # float32 Keeping Value based casting only for python types ------------------------------------------------- In this case, most examples above stay unchanged, because they use plain python integers or floats, such as 2, 127, 12., 3, ... without any type information attached, such as `np.float64(12.)`. These change for example: float32_arr + np.float64(12.)                        
    # float64
    float32_arr + np.array(12., dtype=np.float64)         # float64
    np.add(float32_arr, np.array(12., dtype=np.float64))  # float64

    # so if you use `np.int32` the value no longer matters
    # (the same holds for np.uint64(10000)):
    int8_arr + np.int32(1)      # int32
    int8_arr + np.int32(2**20)  # int32

Remove value based casting completely
-------------------------------------

We could simply abolish it completely; a python `1` would always behave the same as `np.int_(1)`. The downside of this is that:

    int8_arr + 1  # int64 (or int32)

suddenly uses much more memory.

Or, we remove it from ufuncs, but not from operators:

    int8_arr + 1  # int8 dtype

but:

    np.add(int8_arr, 1)  # int64
    # same as:
    np.add(int8_arr, np.array(1))  # int64

The main reason why I was wondering about that is that for operators the logic seems fairly simple, but for general ufuncs it seems more complex.

Best,

Sebastian

On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all unary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> the additional logic, that when both/all operands are scalars, it is
> not applied).
>
> Note that while it is defined in terms of casting "1" to uint8 safely
> being possible even though 1 may be typed as int64. This logic thus
> affects all promotion rules as well (i.e. what should the output dtype
> be).
>
> There 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it. It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get). There is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
>  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
>
> give a different result.
>
> This is a bit different for python scalars, which do not have a type
> attached already.
>
> 2. Promotion and type resolution in Ufuncs:
>
> What is currently bothering me is that the decision what the output
> dtypes should be currently depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
>
> Because `uint8(127)` can also be a `int8`, but uint8(128) it is not as
> simple as finding the "minimal" dtype once and working with that.
> Of course Eric and I discussed this a bit before, and you could create
> an internal "uint7" dtype which has the only purpose of flagging that a
> cast to int8 is safe.
>
> I suppose it is possible I am barking up the wrong tree here, and this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
>
> Possible options to move forward
> --------------------------------
>
> I have to still see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and all except
>     strange external ufuncs (I can only think of numba as a plausible
>     candidate for creating such).
>
> My current plan is to see where the second thing leaves me.
>
> We also should see if we cannot move the whole thing forward, in which
> case the main decision would have to be forward to where. My opinion is
> currently that when a type has a dtype associated with it clearly, we
> should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` because they happen to be castable to that.
>
> For values without a dtype attached (read python integers, floats), I
> see three options, from more complex to simpler:
>
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem, the downside is that the ufunc call and operator match
>    less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.
>
> The downside of 1. is that it doesn't help with simplifying the current
> situation all that much, because we still have the special casting
> around...
>
> I have realized that this got much too long, so I hope it makes sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
>
> Best,
>
> Sebastian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
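For concreteness, two of the promotions discussed above can be written down as assertions. This sketch assumes only an installed NumPy and deliberately avoids the contested typed-scalar cases (`np.int32(1)`, `np.float64(2.)`), whose results are exactly what is under discussion:

```python
import numpy as np

arr = np.array([1, 43, 23], dtype=np.uint16)

# A bare Python int carries no dtype of its own, so it adapts to the array:
assert (arr + 1).dtype == np.uint16

# A 1-D array with an explicit dtype keeps its type information, so the
# usual dtype-based promotion applies (int32 wins over int8):
assert (np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype == np.int32
```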

## Re: Moving forward with value based casting

 Hi Sebastian,

Tricky! It seems a balance between unexpected memory blow-up and unexpected wrapping (the latter mostly for integers). Some comments specifically on your message first, then some more general related ones.

1. I'm very much against letting `a + b` do anything else than `np.add(a, b)`.

2. For python values, an argument for casting by value is that a python int can be arbitrarily long; the only reasonable course of action for those seems to be to make them float, and once you do that one might as well cast to whatever type can hold the value (at least approximately).

3. Not necessarily preferred, but for casting of scalars, one can get more consistent behaviour also by extending the casting by value to any array that has size=1.

Overall, just on the narrow question, I'd be quite happy with your suggestion of using type information if available, i.e., only cast python values to a minimal dtype. If one uses numpy types, those mostly will have come from previous calculations with the same arrays, so things will work as expected. And in most memory-limited applications, one would do calculations in-place anyway (or, as Tyler noted, for power users one can assume awareness of memory and thus the incentive to tell explicitly what dtype is wanted - just `np.add(a, b, dtype=...)`, no need to create `out`).

More generally, I guess what I don't like about the casting rules is that there is a presumption that if the value can be cast, the operation will generally succeed. For `np.add` and `np.subtract`, this perhaps is somewhat reasonable (though for unsigned a bit more dubious), but for `np.multiply` or `np.power` it is much less so. (Indeed, we had a long discussion about what to do with `int ** power` - now special-casing negative integer powers.) Changing this, however, probably really is a bridge too far!

Finally, somewhat related: I think the largest confusion actually results from the `uint64 + int64 -> float64` casting. Should this cast to int64 instead?

All the best,

Marten
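The "minimal dtype" idea Marten endorses here is essentially what `np.min_scalar_type` does for Python ints. A rough pure-Python sketch of that selection (illustrative only, not NumPy's actual implementation, and ignoring floats):

```python
# Pick the smallest fixed-width integer dtype that can hold a Python int,
# preferring unsigned for non-negative values (as np.min_scalar_type does).
def min_int_dtype(value):
    if value >= 0:
        for name, bits in [("uint8", 8), ("uint16", 16),
                           ("uint32", 32), ("uint64", 64)]:
            if value < 2 ** bits:
                return name
    else:
        for name, bits in [("int8", 8), ("int16", 16),
                           ("int32", 32), ("int64", 64)]:
            if value >= -(2 ** (bits - 1)):
                return name
    raise OverflowError("value does not fit any fixed-width integer dtype")

assert min_int_dtype(1) == "uint8"       # matches np.min_scalar_type(1)
assert min_int_dtype(-1) == "int8"
assert min_int_dtype(2**20) == "uint32"
```

An arbitrarily long Python int (Marten's point 2) falls through every branch and raises, which is exactly where the question of converting to float, or to `object`, comes in.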

## Re: Moving forward with value based casting

 In reply to this post by Sebastian Berg

On Wed, Jun 5, 2019 at 10:42 PM Sebastian Berg <[hidden email]> wrote:
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
> ...
>
> I have realized that this got much too long, so I hope it makes sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...

Your email was long but very clear. The part I'm missing is "why are things the way they are?". Before diving into casting rules and all other wishes people may have, can you please try to answer that? Because there's more to it than "(maybe not-existing) corner cases". Marten's first sentence ("a balance between unexpected memory blow-up and unexpected wrapping") is in the right direction. As is Stephan's "Too many users rely upon arithmetic like "x + 1" having a predictable dtype."

The problem is clear, however you need to figure out the constraints first, then decide within the wiggle room you have what the options are.

Cheers,
Ralf


## Re: Moving forward with value based casting

 In reply to this post by Marten van Kerkwijk

On Wed, 2019-06-05 at 21:35 -0400, Marten van Kerkwijk wrote:
> Hi Sebastian,
>
> Tricky! It seems a balance between unexpected memory blow-up and
> unexpected wrapping (the latter mostly for integers).
>
> Some comments specifically on your message first, then some more
> general related ones.
>
> 1. I'm very much against letting `a + b` do anything else than
> `np.add(a, b)`.

Well, I tend to agree. But just to put it out there:

    [1] + [2]         == [1, 2]
    np.add([1], [2])  == array([3])

So that is already far from true, since coercion has to occur. Of course it is true that:

    arr + something_else

will at some point force coercion of `something_else`, so that point is only half valid if either `a` or `b` is already a numpy array/scalar.

> 2. For python values, an argument for casting by value is that a
> python int can be arbitrarily long; the only reasonable course of
> action for those seems to be to make them float, and once you do that
> one might as well cast to whatever type can hold the value (at least
> approximately).

To be honest, the "arbitrarily long" thing is another issue, which is the silent conversion to "object" dtype. Something that is also on the not-done list: maybe we should deprecate it. In other words, we would freeze python int to one clear type; if you have an arbitrarily large int, you would need to use `object` dtype (or preferably a new `pyint/arbitrary_precision_int` dtype) explicitly.

> 3. Not necessarily preferred, but for casting of scalars, one can get
> more consistent behaviour also by extending the casting by value to
> any array that has size=1.

That sounds just as horrible as the current mismatch to me, to be honest.

> Overall, just on the narrow question, I'd be quite happy with your
> suggestion of using type information if available, i.e., only cast
> python values to a minimal dtype. If one uses numpy types, those
> mostly will have come from previous calculations with the same
> arrays, so things will work as expected. And in most memory-limited
> applications, one would do calculations in-place anyway (or, as Tyler
> noted, for power users one can assume awareness of memory and thus
> the incentive to tell explicitly what dtype is wanted - just
> `np.add(a, b, dtype=...)`, no need to create `out`).
>
> More generally, I guess what I don't like about the casting rules
> generally is that there is a presumption that if the value can be
> cast, the operation will generally succeed. For `np.add` and
> `np.subtract`, this perhaps is somewhat reasonable (though for
> unsigned a bit more dubious), but for `np.multiply` or `np.power` it
> is much less so. (Indeed, we had a long discussion about what to do
> with `int ** power` - now special-casing negative integer powers.)
> Changing this, however, probably really is a bridge too far!

Indeed that is right. But that is a different point. E.g. there is nothing wrong with `np.power` deciding that `int**power` should always _promote_ (not cast) `int` to some larger integer type if available. The only point where we seriously have such logic right now is for np.add.reduce (sum) and np.multiply.reduce (prod), which always use at least `long` precision (and actually upcast bool->int, although np.add(True, True) does not. Another difference to True + True...)

> Finally, somewhat related: I think the largest confusion actually
> results from the `uint64 + int64 -> float64` casting. Should this
> cast to int64 instead?

Not sure, but yes, it is the other quirk in our casting that should be discussed.

- Sebastian
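The sum/prod asymmetry Sebastian mentions at the end can be seen directly (a small sketch assuming only an installed NumPy):

```python
import numpy as np

bools = np.array([True, True, True])

# The binary ufunc stays in bool, where addition saturates:
assert np.add(True, True).dtype == np.bool_
assert bool(np.add(True, True)) is True

# ...but the reduction upcasts bool -> integer first, so it counts:
assert np.add.reduce(bools) == 3
assert np.add.reduce(bools).dtype != np.bool_
```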

## Re: Moving forward with value based casting

 In reply to this post by Sebastian Berg

I think dtype-based casting makes a lot of sense, the problem is backward compatibility.

Numpy casting is weird in a number of ways: The array + array casting is unexpected to many users (e.g., uint64 + int64 -> float64), and the casting of array + scalar is different from that, and value based. Personally I wouldn't want to try to change it unless we make a backward-incompatible release (numpy 2.0), based on my experience trying to change much more minor things. We already put "casting" on the list of desired backward-incompatible changes here:
https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release

Relatedly, I've previously dreamed about a different "C-style" way casting might behave:
https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76

The proposal there is that array + array casting, array + scalar, and array + python casting would all work in the same dtype-based way, which mimics the familiar "C" casting rules.

See also: https://github.com/numpy/numpy/issues/12525

Allan
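For readers unfamiliar with the "C" rules that Allan's gist mimics, here is a simplified pure-Python sketch of C's usual arithmetic conversions for two integer operands (omitting C's promotion of small ints to `int`; the names are just labels, not NumPy dtypes):

```python
# C's "usual arithmetic conversions", restricted to two integer operands:
# the higher-rank type wins; on a rank tie between signed and unsigned,
# the unsigned type wins.
RANK = {"int8": 1, "uint8": 1, "int16": 2, "uint16": 2,
        "int32": 3, "uint32": 3, "int64": 4, "uint64": 4}

def c_promote(a, b):
    if RANK[a] != RANK[b]:
        return a if RANK[a] > RANK[b] else b
    return a if a.startswith("u") else b

# C keeps uint64 + int64 in uint64; NumPy currently promotes to float64.
assert c_promote("uint64", "int64") == "uint64"
assert c_promote("uint32", "int64") == "int64"
assert c_promote("int8", "int32") == "int32"
```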

## Re: Moving forward with value based casting

 On Thu, 2019-06-06 at 11:57 -0400, Allan Haldane wrote:
> I think dtype-based casting makes a lot of sense, the problem is
> backward compatibility.
>
> Numpy casting is weird in a number of ways: The array + array casting
> is unexpected to many users (eg, uint64 + int64 -> float64), and the
> casting of array + scalar is different from that, and value based.
> Personally I wouldn't want to try change it unless we make a
> backward-incompatible release (numpy 2.0), based on my experience
> trying to change much more minor things. We already put "casting" on
> the list of desired backward-incompatible changes here:
> https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
>
> Relatedly, I've previously dreamed about a different "C-style" way
> casting might behave:
> https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76
>
> The proposal there is that array + array casting, array + scalar, and
> array + python casting would all work in the same dtype-based way,
> which mimics the familiar "C" casting rules.

If I read it right, you do propose that array + python would cast in a "minimal type" way for python. In your write-up, you describe that if you mix array + scalar, the scalar uses a minimal dtype compared to the array's dtype. What we instead have is that in principle you could have loops such as:

    "ifi->f"
    "idi->d"

and I think we should choose the first for a scalar, because it "fits" into f just fine (if the input is `ufunc(int_arr, 12., int_arr)`).

I do not mind keeping the "simple" two (or even more) operand "let's assume we have uniform types" logic around. For those it is easy to find a "minimum type" even before actual loop lookup. For the above example it would work in any case well, but it would get complicated if, for example, the last integer is an unsigned integer that happens to be small enough to also fit into a signed integer.

That might give some wiggle room, possibly also to attach warnings to it, or at least make things easier. But I would also like to figure out if we shouldn't try to move in any case. Sure, attach a major version to it, but hopefully not a "big step type".

One thing that I had not thought about is that if we create FutureWarnings, we will need to provide a way to opt in to the new/old behaviour. The old behaviour can be achieved by just using the python types (which probably is what most code that wants this behaviour does already), but the behaviour is tricky. Users can pass `dtype` explicitly, but that is a huge kludge... Will think about whether there is a solution to that, because if there is not, you are right: it has to be a "big step" kind of release. Although, even then it would be nice to have warnings that can be enabled to ease the transition!

- Sebastian
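The loop-selection question above can be sketched in a few lines of Python. Everything here (`resolve_loop`, the safe-cast table) is hypothetical, not NumPy's actual resolver; it only illustrates why "first registered loop that safely fits" would pick `"ifi->f"` for a float scalar between int arrays:

```python
# Hypothetical sketch of first-match ufunc loop resolution over NumPy-style
# one-character type codes (i=int, f=float32, d=float64).
SAFE = {
    "i": {"i", "f", "d"},  # simplification: treating int -> f/d as safe
    "f": {"f", "d"},
    "d": {"d"},
}

def resolve_loop(loops, arg_types):
    for sig in loops:
        ins = sig.split("->")[0]
        if all(t in SAFE[a] for a, t in zip(arg_types, ins)):
            return sig
    raise TypeError("no matching loop found")

# ufunc(int_arr, 12., int_arr): the float scalar "fits" into f, so the
# first loop wins; a float64 argument forces the second loop instead.
assert resolve_loop(["ifi->f", "idi->d"], ["i", "f", "i"]) == "ifi->f"
assert resolve_loop(["ifi->f", "idi->d"], ["i", "d", "i"]) == "idi->d"
```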


## Re: Moving forward with value based casting

 In reply to this post by Sebastian Berg I haven't read all the thread super carefully, so I might have missed something, but I think we might want to look at this together with the special rule for scalar casting. IIUC, the basic end-user problem that motivates all thi sis: when you have a simple Python constant whose exact dtype is unspecified, people don't want numpy to first automatically pick a dtype for it, and then use that automatically chosen dtype to override the explicit dtypes that the user specified. That's that "x + 1" problem. (This also comes up a ton for languages trying to figure out how to type manifest constants.) Numpy's original solution for this was the special casting rule for scalars. I don't understand the exact semantics, but it's something like: in any operation involving a mix of non-zero-dim arrays and zero-dim arrays, we throw out the exact dtype information for the scalar ("float64", "int32") and replace it with just the "kind" ("float", "int"). This has several surprising consequences: - The output dtype depends on not just the input dtypes, but also the input shapes: In [19]: (np.array([1, 2], dtype=np.int8) + 1).dtype Out[19]: dtype('int8') In [20]: (np.array([1, 2], dtype=np.int8) + [1]).dtype Out[20]: dtype('int64') - It doesn't just affect Python scalars with vague dtypes, but also scalars where the user has specifically set the dtype: In [21]: (np.array([1, 2], dtype=np.int8) + np.int64(1)).dtype Out[21]: dtype('int8') - I'm not sure the "kind" rule even does the right thing, especially for mixed-kind operations. float16-array + int8-scalar has to do the same thing as float16-array + int64-scalar, but that feels weird? I think this is why value-based casting got added (at around the same time as float16, in fact). (Kinds are kinda problematic in general... 
the SAME_KIND casting rule is very weird – casting int32->int64 is radically different from casting float64->float32, which is radically different from casting int64->int32, but SAME_KIND treats them all the same. And it's really unclear how to generalize the 'kind' concept to new dtypes.)

My intuition is that what users actually want is for *native Python types* to be treated as having 'underspecified' dtypes, e.g. int is happy to coerce to int8/int32/int64/whatever, float is happy to coerce to float32/float64/whatever, but once you have a fully-specified numpy dtype, it should stay.

Some cases to think about:

np.array([1, 2], dtype=int8) + [1, 1]
  -> maybe this should have dtype int8, because there's no type info on
     the right side to contradict that?

np.array([1, 2], dtype=int8) + 2**40
  -> maybe this should be an error, because you can't cast 2**40 to int8
     (under default casting safety rules)? That would introduce some
     value-dependence, but it would only affect whether you get an error
     or not, and there's precedent for that (e.g. division by zero).

In any case, it would probably be helpful to start by just writing down the whole set of rules we have now, because I'm not sure anyone understands all the details...

-n

--
Nathaniel J. Smith -- https://vorpus.org

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
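The behavior under discussion can be checked directly. A minimal sketch of the cases from the original post and from Nathaniel's examples; note that the result of the typed-scalar case depends on the NumPy version in use, since scalar promotion rules have since changed:

```python
import numpy as np

# The uint16 example from the original message: the Python int 1 does
# not bump the result up to a wider integer, because it fits in uint16.
arr = np.array([1, 43, 23], dtype=np.uint16)
res = arr + 1
print(res.dtype)  # uint16

# A typed NumPy scalar may or may not be demoted depending on the NumPy
# version: older (value-based) promotion keeps float32 here, newer
# versions give float64. No expected output asserted.
arr_f = np.array([1, 43, 23], dtype=np.float32)
print((arr_f + np.float64(2.0)).dtype)

# Nathaniel's list case: a plain Python list is coerced to an array
# first, so ordinary array-array promotion applies (no value inspection),
# and the result is the default integer type, not int8.
print((np.array([1, 2], dtype=np.int8) + [1, 1]).dtype)
```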

## Re: Moving forward with value based casting

 On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:

> My intuition is that what users actually want is for *native Python
> types* to be treated as having 'underspecified' dtypes, e.g. int is
> happy to coerce to int8/int32/int64/whatever, float is happy to coerce
> to float32/float64/whatever, but once you have a fully-specified numpy
> dtype, it should stay.

Thanks Nathaniel, I think this expresses a possible solution better than anything I've seen on this list before. An explicit "underspecified types" concept could make casting understandable.

> In any case, it would probably be helpful to start by just writing down
> the whole set of rules we have now, because I'm not sure anyone
> understands all the details...

+1

Ralf

## Re: Moving forward with value based casting

 On Fri, Jun 7, 2019 at 1:19 AM Ralf Gommers <[hidden email]> wrote:

> On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:
>> My intuition is that what users actually want is for *native Python
>> types* to be treated as having 'underspecified' dtypes, e.g. int is
>> happy to coerce to int8/int32/int64/whatever, float is happy to coerce
>> to float32/float64/whatever, but once you have a fully-specified numpy
>> dtype, it should stay.
>
> Thanks Nathaniel, I think this expresses a possible solution better than
> anything I've seen on this list before. An explicit "underspecified
> types" concept could make casting understandable.

I think the current model is that this holds for all scalars, but changing it to apply only to values that are not already explicitly typed makes sense.

As a mental picture, one could think in terms of coercion: imagine numpy having not just a `numpy.array` function but also a `numpy.scalar` function, which takes some input and tries to make a numpy scalar of it. For python int, float, complex, etc., it would use the minimal numpy type. Of course, this is slightly inconsistent with the `np.array` function, which converts things to `ndarray` using a default type for int, float, complex, etc., but my sense is that that is explainable: imagining both `np.scalar` and `np.array` to have dtype attributes, one could say that the default for one would be `'minimal'` and for the other `'64bit'` (well, that doesn't work for complex, but anyway).

-- Marten
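The `np.scalar` function above is hypothetical, but the "minimal" dtype it would attach already exists in NumPy as `np.min_scalar_type`, which returns the smallest dtype able to represent a given Python value. A quick sketch of that notion:

```python
import numpy as np

# np.min_scalar_type finds the smallest dtype that can represent a
# value without overflow -- essentially the "minimal" dtype the
# hypothetical np.scalar would use.
print(np.min_scalar_type(3))     # uint8 (non-negative ints prefer unsigned)
print(np.min_scalar_type(-3))    # int8
print(np.min_scalar_type(260))   # uint16
print(np.min_scalar_type(3.0))   # float16
```

Note this is exactly the value-based behavior being debated: the result depends on the value, not on any declared type.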

## Re: Moving forward with value based casting

 In reply to this post by Sebastian Berg

On Wed, Jun 5, 2019 at 4:42 PM Sebastian Berg <[hidden email]> wrote:

I think the best approach is that if the user gave unambiguous types as inputs to operators, then the output should have the same dtype, or the dtype corresponding to the common promotion type of the inputs. If the input type is not specified, I agree with the suggestion here:

> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.

Explicit is better than implicit

--
Those who don't understand recursion are doomed to repeat it
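Option 3 above essentially matches the defaults NumPy already uses when coercing bare Python scalars to arrays. A quick check (the default integer type is platform dependent, so it is printed rather than asserted):

```python
import numpy as np

# A Python float always coerces to float64 by default.
print(np.asarray(1.0).dtype)  # float64

# A Python complex coerces to complex128 by default.
print(np.asarray(1j).dtype)   # complex128

# A Python int coerces to NumPy's default integer, which is platform
# dependent (typically int64 on 64-bit Linux/macOS).
print(np.asarray(1).dtype)
```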

## Re: Moving forward with value based casting

 In reply to this post by ralfgommers

On Fri, 2019-06-07 at 07:18 +0200, Ralf Gommers wrote:
> On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:
>> My intuition is that what users actually want is for *native Python
>> types* to be treated as having 'underspecified' dtypes, e.g. int is
>> happy to coerce to int8/int32/int64/whatever, float is happy to coerce
>> to float32/float64/whatever, but once you have a fully-specified numpy
>> dtype, it should stay.
>
> Thanks Nathaniel, I think this expresses a possible solution better than
> anything I've seen on this list before. An explicit "underspecified
> types" concept could make casting understandable.

Yes, there is one small additional annoyance (but maybe it is just that): 127 has the 'underspecified' dtype `uint7` (it can be safely cast both to uint8 and int8).

>> In any case, it would probably be helpful to start by just writing down
>> the whole set of rules we have now, because I'm not sure anyone
>> understands all the details...
>
> +1

OK, let me try to sketch the details below:

0. "Scalars" means scalars or 0-D arrays here.

1. The logic below will only be used if we have a mix of arrays and scalars. If all are scalars, the logic is never used. (Plus one additional tricky case within ufuncs, which is more hypothetical [0])

2. Scalars will only be demoted within their category. The categories and casting rules within each category are as follows:

Boolean:
    Casts safely to all (nothing surprising).

Integers:
    Casting is possible if the output can hold the value.
    This includes uint8(127) casting to an int8.
    (Unsigned and signed integers are the same "category".)

Floats:
    Scalars can be demoted based on value; roughly, this
    avoids overflows:
        float16:     -65000 < value < 65000
        float32:    -3.4e38 < value < 3.4e38
        float64:   -1.7e308 < value < 1.7e308
        float128 (largest type, does not apply).

Complex:
    Same logic as floats (applied to .real and .imag).

Others:
    Anything else.

---

Ufuncs, as well as `result_type`, will use this liberally, which basically means finding the smallest type for each category and using that. Of course for floats we cannot do the actual cast until later, since initially we do not know if the cast will actually be performed.

This is only tricky for uint vs. int, because uint8(127) is a "small unsigned". I.e. with our current dtypes there is no strict type hierarchy: uint8(x) may or may not cast to int8.

---

We could think of doing:

    arr, min_dtype = np.asarray_and_min_dtype(pyobject)

which could even fix the list example Nathaniel had, if we had the dtype hierarchy.

This is where the `uint7` came from: a hypothetical `uint7` would fix the integer dtype hierarchy, by representing the numbers 0-127, which can be cast to both uint8 and int8.

Best,

Sebastian


[0] Amendment for point 1:

There is one detail (bug?) here in the logic though, that I missed before. If a ufunc (or result_type) sees a mix of scalars and arrays, it will try to decide whether or not to use value based logic. Value based logic will be skipped if the scalars are in a higher category (based on the ones above) than the highest array – for optimization, I assume.

Plausibly, this could cause incorrect logic when the dtype signature of a ufunc is mixed:
  float32, int8 -> float32
  float32, int64 -> float64

may choose the second loop unnecessarily. Or, for example, if we have a datetime64 in the inputs, there would be no way for value based casting to be used.
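The float demotion bounds sketched above can be expressed as a small helper that picks the narrowest float dtype whose range holds a given value. This is an illustrative reimplementation of the described rule only (the function name `smallest_float_dtype` is made up here; NumPy's actual logic lives in its promotion machinery):

```python
import numpy as np

def smallest_float_dtype(value):
    """Return the narrowest float dtype whose range can hold `value`.

    Sketch of the value-based demotion rule described above, using the
    exact finfo limits rather than the rounded bounds in the table.
    """
    for dt in (np.float16, np.float32, np.float64):
        if abs(value) < np.finfo(dt).max:
            return np.dtype(dt)
    # Largest float type: always applies, matching "does not apply" above.
    return np.dtype(np.longdouble)

print(smallest_float_dtype(1000.0))  # float16
print(smallest_float_dtype(1e10))    # float32
print(smallest_float_dtype(1e100))   # float64
```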


## Re: Moving forward with value based casting

 In reply to this post by Sebastian Berg

On Fri, 2019-06-07 at 13:19 -0500, Sebastian Berg wrote:
> 1. The logic below will only be used if we have a mix of arrays and
> scalars. If all are scalars, the logic is never used. (Plus one
> additional tricky case within ufuncs, which is more hypothetical [0])

And of course I just realized that, trying to be simple, I forgot an important point there: the logic in 2. is only used when there is a mix of scalars and arrays, and the arrays are in the same or higher category. As an example:

    np.array([1, 2, 3], dtype=np.uint8) + np.float64(12.)

will not demote the float64, because the scalar's "float" is a higher category than the array's "integer".

- Sebastian
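The example from the amendment can be checked directly: the float64 scalar is in a higher category than the uint8 array, so value-based demotion does not apply (this particular result happens to be the same under both the promotion rules described here and later NumPy versions):

```python
import numpy as np

# The scalar's category ("float") is higher than the array's
# ("integer"), so the float64 is not demoted and determines the result.
res = np.array([1, 2, 3], dtype=np.uint8) + np.float64(12.0)
print(res.dtype)  # float64
```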