Moving forward with value based casting

Moving forward with value based casting

Sebastian Berg
Hi all,

TL;DR:

Value based promotion seems complex both for users and for the ufunc-
dispatching/promotion logic. Is there any way we can move forward here,
and if we do, could we accept the risk that some possible (maybe
non-existing) corner cases break early on, to get on the way?

-----------

Currently when you write code such as:

arr = np.array([1, 43, 23], dtype=np.uint16)
res = arr + 1

Numpy uses fairly sophisticated logic to decide that `1` can be
represented as a uint16, and thus for binary functions like this (and
most others as well), the output will have a `res.dtype` of uint16.
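As a quick runnable check (Python-scalar cases only; a sketch, not the full rule set):

```python
import numpy as np

arr = np.array([1, 43, 23], dtype=np.uint16)
res = arr + 1            # 1 fits into uint16, so no upcast occurs

farr = np.array([1, 43, 23], dtype=np.float32)
fres = farr + 12.0       # a plain Python float does not force float64
```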

Similar logic also exists for floating point types, where a lower
precision floating point can be used:

arr = np.array([1, 43, 23], dtype=np.float32)
(arr + np.float64(2.)).dtype  # will be float32

Currently, this value based logic is enforced by checking whether the
cast is possible: "4" can be cast to int8 and to uint8. So the first
call above will at some point check whether "uint16 + uint16 -> uint16"
is a valid operation, find that it is, and thus stop searching. (There
is the additional logic that, when both/all operands are scalars, it is
not applied.)

Note that this is defined in terms of casting: casting "1" to uint8 is
safely possible even though 1 itself may be typed as int64. This logic
thus affects all promotion rules as well (i.e. what the output dtype
should be).


There are 2 main discussion points/issues about it:

1. Should value based casting/promotion logic exist at all?

Arguably an `np.int32(3)` has type information attached to it, so why
should we ignore it? On the other hand, this can be tricky for users,
because a small change in values can change the result data type.
Because 0-D arrays and scalars are too close inside numpy (you will
often not know which one you get), there is not much option but to
handle them identically. However, it seems pretty odd that:
 * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
 * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`

give a different result.
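The 1-d half of the pair can be verified from the dtypes alone; the 0-d result noted in the comment is the value-based behaviour described above:

```python
import numpy as np

# The 1-d case promotes purely by dtype: int32 wins.
res1 = np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)

# The 0-d case is treated like a scalar, so under the value-based
# rules described above the value 3 fits int8 and int8 is kept.
res0 = np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)
print(res0.dtype)
```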

This is a bit different for python scalars, which do not have a type
attached already.


2. Promotion and type resolution in Ufuncs:

What is currently bothering me is that the decision of what the output
dtypes should be depends on the values in complicated ways. It would be
nice if we could decide which type signature to use without actually
looking at values (or at least only very early on).

One reason here is caching and simplicity. I would like to be able to
cache which loop should be used for what input. Having value based
casting in there bloats up the problem.
Of course it currently works OK, but especially when user dtypes come
into play, caching would seem like a nice optimization option.
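As a hedged sketch of what such a cache could look like if loop selection depended only on dtypes (the names here are made up, and `np.result_type` on dtypes stands in for the real search over registered loops):

```python
import numpy as np

_loop_cache = {}  # (ufunc name, input dtypes) -> chosen "loop" dtype

def resolve_loop(name, *dtypes):
    """Hypothetical dtype-only loop resolution with caching."""
    key = (name, dtypes)
    if key not in _loop_cache:
        # Stand-in for the real search over registered loops:
        _loop_cache[key] = np.result_type(*dtypes)
    return _loop_cache[key]

loop = resolve_loop("add", np.dtype(np.int8), np.dtype(np.uint8))
```

With value-based casting in the mix, the cache key would also have to encode scalar values somehow, which is exactly what bloats the problem.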

Because `uint8(127)` can also be an `int8` while `uint8(128)` cannot,
it is not as simple as finding the "minimal" dtype once and working
with that.
Of course Eric and I discussed this a bit before, and you could create
an internal "uint7" dtype which has the only purpose of flagging that a
cast to int8 is safe.
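The distinction can be seen with `np.min_scalar_type`: 127 and 128 both report uint8 as their minimal dtype, even though only 127 is also representable as an int8, which is exactly the information a hypothetical internal "uint7" would record:

```python
import numpy as np

# Both values get uint8 as their "minimal" dtype...
m127 = np.min_scalar_type(127)
m128 = np.min_scalar_type(128)

# ...but at the dtype level uint8 does not cast safely to int8, so the
# minimal dtype alone cannot tell us that `int8_arr + 127` may stay int8:
safe = np.can_cast(np.uint8, np.int8)
```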

I suppose it is possible I am barking up the wrong tree here, and this
caching/predictability is not vital (or can be solved with such an
internal dtype easily, although I am not sure it seems elegant).


Possible options to move forward
--------------------------------

I still have to see a bit how tricky things are. But there are a few
possible options. I would like to move the scalar logic to the
beginning of ufunc calls:
  * The uint7 idea would be one solution
  * Simply implement something that works for numpy and all except
    strange external ufuncs (I can only think of numba as a plausible
    candidate for creating such).

My current plan is to see where the second thing leaves me.

We should also see whether we cannot move the whole thing forward, in
which case the main decision would be: forward to where? My opinion is
currently that when a type clearly has a dtype associated with it, we
should always use that dtype in the future. This mostly means that
numpy dtypes such as `np.int64` will always be treated like an int64,
and never like a `uint8` just because they happen to be castable to
that.

For values without a dtype attached (read: python integers and
floats), I see three options, from more complex to simpler:

1. Keep the current logic in place as much as possible
2. Only support value based promotion for operators, e.g.:
   `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
   The upside is that it limits the complexity to a much simpler
   problem, the downside is that the ufunc call and operator match
   less clearly.
3. Just associate python float with float64 and python integers with
   long/int64 and force users to always type them explicitly if they
   need to.

The downside of 1. is that it doesn't help with simplifying the current
situation all that much, because we still have the special casting
around...


I have realized that this got much too long, so I hope it makes sense.
I will continue to dabble along on these things a bit, so if nothing
else maybe writing it helps me to get a bit clearer on things...

Best,

Sebastian



_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: Moving forward with value based casting

Stephan Hoyer-2
On Wed, Jun 5, 2019 at 1:43 PM Sebastian Berg <[hidden email]> wrote:
<snip>
Because `uint8(127)` can also be a `int8`, but uint8(128) it is not as
simple as finding the "minimal" dtype once and working with that."
Of course Eric and I discussed this a bit before, and you could create
an internal "uint7" dtype which has the only purpose of flagging that a
cast to int8 is safe.

Does NumPy actually have any logic that does these sorts of checks currently? If so, it would be interesting to see what it is.

My experiments suggest that we currently have this logic of finding the "minimal" dtype that can hold the scalar value:

>>> np.array([127], dtype=np.int8) + 127 # silent overflow!
array([-2], dtype=int8)

>>> np.array([127], dtype=np.int8) + 128 # correct result
array([255], dtype=int16)


<snip>

For values without a dtype attached (read python integers, floats), I
see three options, from more complex to simpler:

1. Keep the current logic in place as much as possible
2. Only support value based promotion for operators, e.g.:
   `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
   The upside is that it limits the complexity to a much simpler
   problem, the downside is that the ufunc call and operator match
   less clearly.
3. Just associate python float with float64 and python integers with
   long/int64 and force users to always type them explicitly if they
   need to.

The downside of 1. is that it doesn't help with simplifying the current
situation all that much, because we still have the special casting
around...

I think it would be fine to special case operators, but NEP-13 means that the ufuncs corresponding to operators really do need to work exactly the same way. So we should also special-case those ufuncs.

I don't think Option (3) is viable. Too many users rely upon arithmetic like "x + 1" having a predictable dtype.
 
<snip>



Re: Moving forward with value based casting

Sebastian Berg
On Wed, 2019-06-05 at 14:14 -0700, Stephan Hoyer wrote:
> On Wed, Jun 5, 2019 at 1:43 PM Sebastian Berg <
> [hidden email]> wrote:
> > Hi all,
> >

<snip>

> >
> > Because `uint8(127)` can also be a `int8`, but uint8(128) it is not
> > as
> > simple as finding the "minimal" dtype once and working with that."
> > Of course Eric and I discussed this a bit before, and you could
> > create
> > an internal "uint7" dtype which has the only purpose of flagging
> > that a
> > cast to int8 is safe.
>
> Does NumPy actually have an logic that does these sort of checks
> currently? If so, it would be interesting to see what it is.
>
> My experiments suggest that we currently have this logic of finding
> the "minimal" dtype that can hold the scalar value:
>
> >>> np.array([127], dtype=np.int8) + 127  # silent overflow!
> array([-2], dtype=int8)
>
> >>> np.array([127], dtype=np.int8) + 128  # correct result
> array([255], dtype=int16)
>
The current checks all come down to `np.can_cast` (on the C side this
is `PyArray_CanCastArray()`) answering True. The actual result value is
not taken into account, of course. So 127 can be represented as an
int8, and since the "int8,int8->int8" loop is checked first (and "can
cast" succeeds), it is used.
Alternatively, you can think of it as using `np.result_type()`, which
will, for all practical purposes, give the same dtype (though
result_type may or may not actually be used, and there are some subtle
differences in principle).

Effectively, in your example you could reduce it to a minimal dtype of
uint7 for 127, since a uint7 can be cast safely to an int8 and also to
a uint8. (If you just said the minimal dtype is uint8, you could not
distinguish the two examples.)
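The dtype-level part of this can be illustrated directly (values play no role in these two checks):

```python
import numpy as np

# The "int8,int8->int8" loop is accepted because the inputs can be
# cast safely to the loop's input dtype:
loop_ok = np.can_cast(np.int8, np.int8)

# A scalar typed as int64, however, fails the dtype-level check, which
# is why the value itself has to be inspected somewhere:
typed_ok = np.can_cast(np.int64, np.int8)
```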

Does that answer the question?

Best,

Sebastian

> <snip>


Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Sebastian Berg
Hi all,

Maybe to clarify this at least a little, here are some examples of
what currently happens and what I could imagine we move to (all in
terms of the output dtype).

float32_arr = np.ones(10, dtype=np.float32)
int8_arr = np.ones(10, dtype=np.int8)
uint8_arr = np.ones(10, dtype=np.uint8)


Current behaviour:
------------------

float32_arr + 12.  # float32
float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf)

int8_arr + 127     # int8
int8_arr + 128     # int16
int8_arr + 2**20   # int32
uint8_arr + -1     # int16

# But only for arrays that are not 0d (0-d arrays are treated like scalars):
int8_arr + np.array(1, dtype=np.int32)  # int8
int8_arr + np.array([1], dtype=np.int32)  # int32

# When the actual typing is given, this does not change:

float32_arr + np.float64(12.)                  # float32
float32_arr + np.array(12., dtype=np.float64)  # float32

# Except for inexact types, or complex:
int8_arr + np.float16(3)  # float16  (same as array behaviour)

# The exact same happens with all ufuncs:
np.add(float32_arr, 1)                                # float32
np.add(float32_arr, np.array(12., dtype=np.float64))  # float32


Keeping Value based casting only for python types
-------------------------------------------------

In this case, most examples above stay unchanged, because they use
plain python integers or floats such as 2, 127, 12., 3, ... without
any type information attached (unlike `np.float64(12.)`).

These change for example:

float32_arr + np.float64(12.)                         # float64
float32_arr + np.array(12., dtype=np.float64)         # float64
np.add(float32_arr, np.array(12., dtype=np.float64))  # float64

# A typed scalar is then always treated by its dtype (np.uint64(10000)
# would likewise stay uint64):

int8_arr + np.int32(1)      # int32
int8_arr + np.int32(2**20)  # int32


Remove Value based casting completely
-------------------------------------

We could simply abolish it completely, a python `1` would always behave
the same as `np.int_(1)`. The downside of this is that:

int8_arr + 1  # int64 (or int32)

uses much more memory suddenly. Or, we remove it from ufuncs, but not
from operators:

int8_arr + 1  # int8 dtype

but:

np.add(int8_arr, 1)  # int64
# same as:
np.add(int8_arr, np.array(1))  # int16

The main reason why I was wondering about that is that for operators
the logic seems fairly simple, but for general ufuncs it seems more
complex.

Best,

Sebastian



On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:

> <snip>


Re: Moving forward with value based casting

Tyler Reddy
A few thoughts:

- We're not trying to achieve systematic guards against integer overflow / wrapping in ufunc inner loops, right? The performance tradeoffs of "result-based" casting / exception handling would presumably be controversial. I know there was some discussion about an "overflow detection mode" (toggle) of some sort that could be activated for ufunc loops, but I don't think that gained much traction / priority. For floats we at least have an awkward way to propagate something back to the user if there's an issue.
- It sounds like the objective is instead primarily to achieve pure dtype-based promotion, which is then effectively just a casting table; is that what you mean by "cache"?
- Is it a safe assumption that for a cache (dtype-only casting table), the main tradeoff is that we'd likely tend towards conservative upcasting and using more memory in output types in many cases vs. NumPy at the moment? Stephan seems concerned about that, presumably because x + 1 suddenly changes output dtype in an overwhelming number of current code lines and future simple examples for end users.
- If np.array + 1 absolutely has to stay the same output dtype moving forward, then "Keeping Value based casting only for python types" is the one that looks most promising to me initially, with a few further concerns:

1) Would that give you enough refactoring "wiggle room" to achieve the simplifications you need? If value-based promotion still happens for a non-NumPy operand, can you abstract that logic cleanly from the "pure dtype cache / table" that is planned for NumPy operands?
2) Is the "out" argument to ufuncs a satisfactory alternative to the "power users" who want to "override" default output casting type? We suggest that they pre-allocate an output array of the desired type if they want to save memory and if they overflow or wrap integers that is their problem. Can we reasonably ask people who currently depend on the memory-conservation they might get from value-based behavior to adjust in this way?
3) Presumably "out" does / will circumvent the "cache / dtype casting table"?
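A sketch of the suggestion in 2): preallocating `out` pins the output dtype regardless of promotion, and any wrapping becomes the caller's responsibility:

```python
import numpy as np

a = np.array([126, 127], dtype=np.int8)
out = np.empty_like(a)        # caller pins the output to int8

np.add(a, 1, out=out)         # 127 + 1 wraps around in int8
wrapped = int(out[1])         # silent wrap, by the caller's choice
```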

Tyler

On Wed, 5 Jun 2019 at 15:37, Sebastian Berg <[hidden email]> wrote:
Hi all,

Maybe to clarify this at least a little, here are some examples for
what currently happen and what I could imagine we can go to (all in
terms of output dtype).

float32_arr = np.ones(10, dtype=np.float32)
int8_arr = np.ones(10, dtype=np.int8)
uint8_arr = np.ones(10, dtype=np.uint8)


Current behaviour:
------------------

float32_arr + 12.  # float32
float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf)

int8_arr + 127     # int8
int8_arr + 128     # int16
int8_arr + 2**20   # int32
uint8_arr + -1     # uint16

# But only for arrays that are not 0d:
int8_arr + np.array(1, dtype=np.int32)  # int8
int8_arr + np.array([1], dtype=np.int32)  # int32

# When the actual typing is given, this does not change:

float32_arr + np.float64(12.)                  # float32
float32_arr + np.array(12., dtype=np.float64)  # float32

# Except for inexact types, or complex:
int8_arr + np.float16(3)  # float16  (same as array behaviour)

# The exact same happens with all ufuncs:
np.add(float32_arr, 1)                               # float32
np.add(float32_arr, np.array(12., dtype=np.float64)  # float32


Keeping Value based casting only for python types
-------------------------------------------------

In this case, most examples above stay unchanged, because they use
plain python integers or floats, such as 2, 127, 12., 3, ... without
any type information attached, such as `np.float64(12.)`.

These change for example:

float32_arr + np.float64(12.)                        # float64
float32_arr + np.array(12., dtype=np.float64)        # float64
np.add(float32_arr, np.array(12., dtype=np.float64)  # float64

# so if you use `np.int32` it will be the same as np.uint64(10000)

int8_arr + np.int32(1)      # int32
int8_arr + np.int32(2**20)  # int32


Remove Value based casting completely
-------------------------------------

We could simply abolish it completely, a python `1` would always behave
the same as `np.int_(1)`. The downside of this is that:

int8_arr + 1  # int64 (or int32)

uses much more memory suddenly. Or, we remove it from ufuncs, but not
from operators:

int8_arr + 1  # int8 dtype

but:

np.add(int8_arr, 1)  # int64
# same as:
np.add(int8_arr, np.array(1))  # int16

The main reason why I was wondering about that is that for operators
the logic seems fairly simple, but for general ufuncs it seems more
complex.

Best,

Sebastian



On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward
> here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all unary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> the additional logic, that when both/all operands are scalars, it is
> not applied).
>
> Note that while it is defined in terms of casting "1" to uint8 safely
> being possible even though 1 may be typed as int64. This logic thus
> affects all promotion rules as well (i.e. what should the output
> dtype
> be).
>
>
> There 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it. It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get). There is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
>  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
>
> give a different result.
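The asymmetry is easy to check directly (a sketch; the 0-d result depends on the value-based rules under discussion, so only the unambiguous 1-d case is asserted):

```python
import numpy as np

# 0-d arrays take part in value-based promotion, 1-d arrays do not:
a0 = np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)
a1 = np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)
print(a0.dtype, a1.dtype)  # int8 int32 under the value-based rules described here
assert a1.dtype == np.int32
# the 0-d result is int8 under value-based rules, int32 under dtype-based ones:
assert a0.dtype in (np.dtype(np.int8), np.dtype(np.int32))
```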
>
> This is a bit different for python scalars, which do not have a type
> attached already.
>
>
> 2. Promotion and type resolution in Ufuncs:
>
> What is currently bothering me is that the decision what the output
> dtypes should be currently depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
>
> Because `uint8(127)` can also be an `int8` but `uint8(128)` cannot, it
> is not as simple as finding the "minimal" dtype once and working with
> that.
> Of course Eric and I discussed this a bit before, and you could
> create
> an internal "uint7" dtype which has the only purpose of flagging that
> a
> cast to int8 is safe.
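For reference, the "minimal dtype" search exists as `np.min_scalar_type`, and its inability to express "fits both uint8 and int8" is exactly the gap the hypothetical internal uint7 would fill:

```python
import numpy as np

# np.min_scalar_type performs the value-based "minimal dtype" search:
assert np.min_scalar_type(127) == np.dtype(np.uint8)  # 127 also fits int8...
assert np.min_scalar_type(128) == np.dtype(np.uint8)  # ...but 128 does not
assert np.min_scalar_type(-1) == np.dtype(np.int8)
# uint8 alone cannot encode "also safely castable to int8", which is the
# role the internal "uint7" dtype sketched above would play.
```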
>
> I suppose it is possible I am barking up the wrong tree here, and
> this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
>
>
> Possible options to move forward
> --------------------------------
>
> I still have to see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and all except
>     strange external ufuncs (I can only think of numba as a plausible
>     candidate for creating such).
>
> My current plan is to see where the second thing leaves me.
>
> We should also see whether we can move the whole thing forward, in
> which case the main decision would be: forward to where. My opinion
> is
> currently that when a type has a dtype associated with it clearly, we
> should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` because they happen to be castable to that.
>
> For values without a dtype attached (read python integers, floats), I
> see three options, from more complex to simpler:
>
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem, the downside is that the ufunc call and operator match
>    less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.
>
> The downside of 1. is that it doesn't help with simplifying the
> current
> situation all that much, because we still have the special casting
> around...
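Option 3 can be sketched in a few lines (an assumption-laden illustration: `promoted_dtype_option3` is a hypothetical helper, not a NumPy API, and Python int is mapped to int64 even though "long" is platform-dependent):

```python
import numpy as np

def promoted_dtype_option3(arr_dtype, py_scalar):
    # Hypothetical option-3 rule: ignore the scalar's value entirely and
    # use the default dtype for its Python type before promoting.
    default = np.dtype(np.int64) if isinstance(py_scalar, int) else np.dtype(np.float64)
    return np.promote_types(arr_dtype, default)

assert promoted_dtype_option3(np.dtype(np.int8), 1) == np.dtype(np.int64)
assert promoted_dtype_option3(np.dtype(np.float32), 2.0) == np.dtype(np.float64)
```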
>
>
> I have realized that this got much too long, so I hope it makes
> sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
>
> Best,
>
> Sebastian
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion

Re: Moving forward with value based casting

Marten van Kerkwijk
Hi Sebastian,

Tricky! It seems a balance between unexpected memory blow-up and unexpected wrapping (the latter mostly for integers).

Some comments specifically on your message first, then some more general related ones.

1. I'm very much against letting `a + b` do anything other than `np.add(a, b)`.
2. For python values, an argument for casting by value is that a python int can be arbitrarily long; the only reasonable course of action for those seems to make them float, and once you do that one might as well cast to whatever type can hold the value (at least approximately).
3. Not necessarily preferred, but for casting of scalars, one can get more consistent behaviour also by extending the casting by value to any array that has size=1.

Overall, just on the narrow question, I'd be quite happy with your suggestion of using type information if available, i.e., only cast python values to a minimal dtype. If one uses numpy types, those mostly will have come from previous calculations with the same arrays, so things will work as expected. And in most memory-limited applications, one would do calculations in-place anyway (or, as Tyler noted, for power users one can assume awareness of memory and thus the incentive to tell explicitly what dtype is wanted - just `np.add(a, b, dtype=...)`, no need to create `out`).

More generally, I guess what I don't like about the casting rules generally is that there is a presumption that if the value can be cast, the operation will generally succeed. For `np.add` and `np.subtract`, this perhaps is somewhat reasonable (though for unsigned a bit more dubious), but for `np.multiply` or `np.power` it is much less so. (Indeed, we had a long discussion about what to do with `int ** power` - now special-casing negative integer powers.) Changing this, however, probably really is a bridge too far!

Finally, somewhat related: I think the largest confusion actually results from the `uint64+int64 -> float64` casting.  Should this cast to int64 instead?
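The promotion Marten refers to is easy to verify: no integer dtype covers both the uint64 and int64 ranges, so NumPy falls back to float64 (which is lossy for integers above 2**53):

```python
import numpy as np

assert np.promote_types(np.uint64, np.int64) == np.dtype(np.float64)
res = np.array([2**63], dtype=np.uint64) + np.array([0], dtype=np.int64)
assert res.dtype == np.float64   # integral inputs, floating-point result
assert int(res[0]) == 2**63      # exact here (a power of two), not in general
```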

All the best,

Marten



Re: Moving forward with value based casting

ralfgommers
In reply to this post by Sebastian Berg


On Wed, Jun 5, 2019 at 10:42 PM Sebastian Berg <[hidden email]> wrote:
Hi all,

TL;DR:

Value based promotion seems complex both for users and ufunc-
dispatching/promotion logic. Is there any way we can move forward here,
and if we do, could we just risk some possible (maybe not-existing)
corner cases to break early to get on the way?
...
I have realized that this got much too long, so I hope it makes sense.
I will continue to dabble along on these things a bit, so if nothing
else maybe writing it helps me to get a bit clearer on things...

Your email was long but very clear. The part I'm missing is "why are things the way they are?". Before diving into casting rules and all other wishes people may have, can you please try to answer that? Because there's more to it than "(maybe not-existing) corner cases".

Marten's first sentence ("a balance between unexpected memory blow-up and unexpected wrapping") is in the right direction. As is Stephan's "Too many users rely upon arithmetic like "x + 1" having a predictable dtype."

The problem is clear, however you need to figure out the constraints first, then decide within the wiggle room you have what the options are.

Cheers,
Ralf




Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Tyler Reddy
On Wed, 2019-06-05 at 17:14 -0700, Tyler Reddy wrote:

> A few thoughts:
>
> - We're not trying to achieve systematic guards against integer
> overflow / wrapping in ufunc inner loops, right? The performance
> tradeoffs for a "result-based" casting / exception handling addition
> would presumably be controversial? I know there was some discussion
> about having an "overflow detection mode"  (toggle) of some sort that
> could be activated for ufunc loops, but don't think that gained much
> traction/ priority. I think for floats we have an awkward way to
> propagate something back to the user if there's an issue.
No, that is indeed a different issue. It would be nice to provide the
option of integer overflow warnings/errors, but it is different since
it should not affect the dtypes in use (i.e. we would never upcast to
avoid the error).

> - It sounds like the objective is instead primarily to achieve pure
> dtype-based promotion, which is then effectively just a casting
> table, which is what I think you mean by "cache?"

Yes, "cache" was a bad word; I used it thinking of user types, where a
large table would probably not be created on the fly.

> - Is it a safe assumption that for a cache (dtype-only casting
> table), the main tradeoff is that we'd likely tend towards
> conservative upcasting and using more memory in output types in many
> cases vs. NumPy at the moment? Stephan seems concerned about that,
> presumably because x + 1 suddenly changes output dtype in an
> overwhelming number of current code lines and future simple examples
> for end users.

Yes. That is at least what we currently have. For x + 1 there is a good
point with sudden memory blow up. Maybe an even nicer example is
`float32_arr + 1`, which would have to go to float64 if 1 is
interpreted as `int32(1)`.

> - If np.array + 1 absolutely has to stay the same output dtype moving
> forward, then "Keeping Value based casting only for python types" is
> the one that looks most promising to me initially, with a few further
> concerns:

Well, while it is annoying me, I think we should base that decision
only on what we want the user API to be. And because of that, it seems
like the most likely option.
At least my gut feeling is, if it is typed, we should honor the type
(also for scalars), but code like x + 1 suddenly blowing up memory is
not a good idea.
I just realized that one common (anti?)-pattern:

arr + 0.  # make sure it's "inexact/float"

is exactly an example of where you do not want to upcast unnecessarily.
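The idiom is quick to check: under value-based promotion the Python float `0.` does not force an upcast, whereas a typed `np.float64(0.)` would under the dtype-based rules proposed above:

```python
import numpy as np

arr = np.ones(3, dtype=np.float32)
assert (arr + 0.).dtype == np.float32   # python float does not upcast float32
assert (arr + 0).dtype == np.float32    # nor does a python int
# what a scalar honored as float64 would promote to instead:
assert np.promote_types(np.float32, np.float64) == np.dtype(np.float64)
```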


> 1) Would that give you enough refactoring "wiggle room" to achieve
> the simplifications you need? If value-based promotion still happens
> for a non-NumPy operand, can you abstract that logic cleanly from the
> "pure dtype cache / table" that is planned for NumPy operands?

It is tricky. There is always the slightly strange solution of making
dtypes such as uint7, which "fixes" the type hierarchy as a minimal
dtype for promotion purposes, but would never be exposed to users.
(You probably need more strange dtypes for float and int combinations.)

To give me some wiggle room, what I was now doing is to simply decide
on the correct dtype before lookup. I am pretty sure that works for
all, except possibly one ufunc within numpy. The reason that this works
is that almost all of our ufuncs are typed as "ii->i" (identical
types).
Maybe that is OK to start working, and the strange dtype hierarchy can
be thought of later.


> 2) Is the "out" argument to ufuncs a satisfactory alternative to the
> "power users" who want to "override" default output casting type? We
> suggest that they pre-allocate an output array of the desired type if
> they want to save memory and if they overflow or wrap integers that
> is their problem. Can we reasonably ask people who currently depend
> on the memory-conservation they might get from value-based behavior
> to adjust in this way?

They can also use `dtype=...` (or at least we can fix that part to be
reliable). Or they can cast the input. Especially if we want to
use it only for python integers/floats, adding the `np.int8(3)` is not
much effort.

> 3) Presumably "out" does / will circumvent the "cache / dtype casting
> table?"

Well, out fixes one of the types, if we look at the general machinery,
it would be possible to have:

ff->d
df->d
dd->d

loops. So if such loops are defined we cannot quite circumvent the
whole lookup. If we know that all loops are of the same-dtype `ff->f`
kind (which is true for almost all functions inside numpy),
lookup could be simplified.
For those loops with all the same dtype, the issue is fairly
straightforward anyway, because I can just decide how to handle the
scalar beforehand.

Best,

Sebastian


>
> Tyler
>
> On Wed, 5 Jun 2019 at 15:37, Sebastian Berg <
> [hidden email]> wrote:
> > Hi all,
> >
> > Maybe to clarify this at least a little, here are some examples for
> > what currently happen and what I could imagine we can go to (all in
> > terms of output dtype).
> >
> > float32_arr = np.ones(10, dtype=np.float32)
> > int8_arr = np.ones(10, dtype=np.int8)
> > uint8_arr = np.ones(10, dtype=np.uint8)
> >
> >
> > Current behaviour:
> > ------------------
> >
> > float32_arr + 12.  # float32
> > float32_arr + 2**200  # float64 (because np.float32(2**200) ==
> > np.inf)
> >
> > int8_arr + 127     # int8
> > int8_arr + 128     # int16
> > int8_arr + 2**20   # int32
> > uint8_arr + -1     # uint16
> >
> > # But only for arrays that are not 0d:
> > int8_arr + np.array(1, dtype=np.int32)  # int8
> > int8_arr + np.array([1], dtype=np.int32)  # int32
> >
> > # When the actual typing is given, this does not change:
> >
> > float32_arr + np.float64(12.)                  # float32
> > float32_arr + np.array(12., dtype=np.float64)  # float32
> >
> > # Except for inexact types, or complex:
> > int8_arr + np.float16(3)  # float16  (same as array behaviour)
> >
> > # The exact same happens with all ufuncs:
> > np.add(float32_arr, 1)                               # float32
> > np.add(float32_arr, np.array(12., dtype=np.float64)  # float32
> >
> >
> > Keeping Value based casting only for python types
> > -------------------------------------------------
> >
> > In this case, most examples above stay unchanged, because they use
> > plain python integers or floats, such as 2, 127, 12., 3, ...
> > without
> > any type information attached, such as `np.float64(12.)`.
> >
> > These change for example:
> >
> > float32_arr + np.float64(12.)                        # float64
> > float32_arr + np.array(12., dtype=np.float64)        # float64
> > np.add(float32_arr, np.array(12., dtype=np.float64)  # float64
> >
> > # so if you use `np.int32` it will be the same as np.uint64(10000)
> >
> > int8_arr + np.int32(1)      # int32
> > int8_arr + np.int32(2**20)  # int32
> >
> >
> > Remove Value based casting completely
> > -------------------------------------
> >
> > We could simply abolish it completely, a python `1` would always
> > behave
> > the same as `np.int_(1)`. The downside of this is that:
> >
> > int8_arr + 1  # int64 (or int32)
> >
> > uses much more memory suddenly. Or, we remove it from ufuncs, but
> > not
> > from operators:
> >
> > int8_arr + 1  # int8 dtype
> >
> > but:
> >
> > np.add(int8_arr, 1)  # int64
> > # same as:
> > np.add(int8_arr, np.array(1))  # int16
> >
> > The main reason why I was wondering about that is that for
> > operators
> > the logic seems fairly simple, but for general ufuncs it seems more
> > complex.
> >
> > Best,
> >
> > Sebastian
> >
> >
> >


Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Marten van Kerkwijk
On Wed, 2019-06-05 at 21:35 -0400, Marten van Kerkwijk wrote:

> Hi Sebastian,
>
> Tricky! It seems a balance between unexpected memory blow-up and
> unexpected wrapping (the latter mostly for integers).
>
> Some comments specifically on your message first, then some more
> general related ones.
>
> 1. I'm very much against letting `a + b` do anything else than
> `np.add(a, b)`.
Well, I tend to agree. But just to put it out there:

[1] + [2]  == [1, 2]
np.add([1], [2]) == [3]

So that is already far from true, since coercion has to occur. Of
course it is true that:

arr + something_else

will at some point force coercion of `something_else`, so that point is
only half valid if either `a` or `b` is already a numpy array/scalar.


> 2. For python values, an argument for casting by value is that a
> python int can be arbitrarily long; the only reasonable course of
> action for those seems to make them float, and once you do that one
> might as well cast to whatever type can hold the value (at least
> approximately).

To be honest, the "arbitrarily long" thing is another issue, namely the
silent conversion to "object" dtype. That is also on the list of things
not yet done: maybe we should deprecate it.

In other words, we would freeze python int to one clear type, if you
have an arbitrarily large int, you would need to use `object` dtype (or
preferably a new `pyint/arbitrary_precision_int` dtype) explicitly.

> 3. Not necessarily preferred, but for casting of scalars, one can get
> more consistent behaviour also by extending the casting by value to
> any array that has size=1.
>

That sounds just as horrible as the current mismatch to me, to be
honest.

> Overall, just on the narrow question, I'd be quite happy with your
> suggestion of using type information if available, i.e., only cast
> python values to a minimal dtype.If one uses numpy types, those
> mostly will have come from previous calculations with the same
> arrays, so things will work as expected. And in most memory-limited
> applications, one would do calculations in-place anyway (or, as Tyler
> noted, for power users one can assume awareness of memory and thus
> the incentive to tell explicitly what dtype is wanted - just
> `np.add(a, b, dtype=...)`, no need to create `out`).
>
> More generally, I guess what I don't like about the casting rules
> generally is that there is a presumption that if the value can be
> cast, the operation will generally succeed. For `np.add` and
> `np.subtract`, this perhaps is somewhat reasonable (though for
> unsigned a bit more dubious), but for `np.multiply` or `np.power` it
> is much less so. (Indeed, we had a long discussion about what to do
> with `int ** power` - now special-casing negative integer powers.)
> Changing this, however, probably really is a bridge too far!
Indeed that is right. But that is a different point. E.g. there is
nothing wrong with `np.power` deciding that
`int**power` should always _promote_ (not cast) `int` to some larger
integer type if available.
The only point where we seriously have such logic right now is for
np.add.reduce (sum) and np.multiply.reduce (prod), which always use at
least `long` precision (and actually upcast bool->int, although
np.add(True, True) does not; another difference from True + True...)
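That difference is quick to demonstrate:

```python
import numpy as np

# elementwise add on bools stays boolean (effectively logical or):
assert np.add(True, True) == np.True_
assert np.add(True, True).dtype == np.bool_
# the reduction upcasts bool -> integer, so it counts instead:
assert np.add.reduce(np.array([True, True])) == 2
# and plain Python bools simply behave as ints:
assert True + True == 2
```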

>
> Finally, somewhat related: I think the largest confusion actually
> results from the `uint64+int64 -> float64` casting.  Should this cast
> to int64 instead?

Not sure, but yes, it is the other quirk in our casting that should be
discussed….

- Sebastian

>
> All the best,
>
> Marten
>

Re: Moving forward with value based casting

Allan Haldane
In reply to this post by Sebastian Berg

I think dtype-based casting makes a lot of sense, the problem is
backward compatibility.

Numpy casting is weird in a number of ways: The array + array casting is
unexpected to many users (eg, uint64 + int64 -> float64), and the
casting of array + scalar is different from that, and value based.
Personally I wouldn't want to try to change it unless we make a
backward-incompatible release (numpy 2.0), based on my experience trying
to change much more minor things. We already put "casting" on the list
of desired backward-incompatible changes here:
https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release

Relatedly, I've previously dreamed about a different "C-style" way
casting might behave:
https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76

The proposal there is that array + array casting, array + scalar, and
array + python casting would all work in the same dtype-based way, which
mimics the familiar "C" casting rules.
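For comparison, the C rule keeps uint64 + int64 integral (the signed operand converts to unsigned, with two's-complement wrap-around), which can be emulated with an explicit cast (a sketch; it assumes wrap-around is acceptable):

```python
import numpy as np

a = np.array([2], dtype=np.uint64)
b = np.array([-1], dtype=np.int64)
assert (a + b).dtype == np.float64   # NumPy's current promotion
c_style = a + b.astype(np.uint64)    # C-style: convert int64 -> uint64 first
assert c_style.dtype == np.uint64
assert c_style[0] == 1               # 2 + (2**64 - 1) wraps around to 1
```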

See also:
https://github.com/numpy/numpy/issues/12525

Allan


On 6/5/19 4:41 PM, Sebastian Berg wrote:

> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all unary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> the additional logic, that when both/all operands are scalars, it is
> not applied).
>
> Note that while it is defined in terms of casting "1" to uint8 safely
> being possible even though 1 may be typed as int64. This logic thus
> affects all promotion rules as well (i.e. what should the output dtype
> be).
>
>
> There 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it. It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get). There is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
>  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
>
> give a different result.
>
> This is a bit different for python scalars, which do not have a type
> attached already.
>
>
> 2. Promotion and type resolution in Ufuncs:
>
> What is currently bothering me is that the decision what the output
> dtypes should be currently depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
>
> Because `uint8(127)` can also be a `int8`, but uint8(128) it is not as
> simple as finding the "minimal" dtype once and working with that."
> Of course Eric and I discussed this a bit before, and you could create
> an internal "uint7" dtype which has the only purpose of flagging that a
> cast to int8 is safe.
>
> I suppose it is possible I am barking up the wrong tree here, and this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
>
>
> Possible options to move forward
> --------------------------------
>
> I have to still see a bit how trick things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and all except
>     strange external ufuncs (I can only think of numba as a plausible
>     candidate for creating such).
>
> My current plan is to see where the second thing leaves me.
>
> We also should see if we cannot move the whole thing forward, in which
> case the main decision would have to be forward to where. My opinion is
> currently that when a type has a dtype associated with it clearly, we
> should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` because they happen to be castable to that.
>
> For values without a dtype attached (read python integers, floats), I
> see three options, from more complex to simpler:
>
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem, the downside is that the ufunc call and operator match
>    less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.
>
> The downside of 1. is that it doesn't help with simplifying the current
> situation all that much, because we still have the special casting
> around...
>
>
> I have realized that this got much too long, so I hope it makes sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
>
> Best,
>
> Sebastian
>
>

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: Moving forward with value based casting

Sebastian Berg
On Thu, 2019-06-06 at 11:57 -0400, Allan Haldane wrote:

> I think dtype-based casting makes a lot of sense, the problem is
> backward compatibility.
>
> Numpy casting is weird in a number of ways: The array + array casting is
> unexpected to many users (e.g., uint64 + int64 -> float64), and the
> casting of array + scalar is different from that, and value based.
> Personally I wouldn't want to try to change it unless we make a
> backward-incompatible release (numpy 2.0), based on my experience trying
> to change much more minor things. We already put "casting" on the list
> of desired backward-incompatible changes here:
> https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
>
> Relatedly, I've previously dreamed about a different "C-style" way
> casting might behave:
> https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76
>
> The proposal there is that array + array casting, array + scalar, and
> array + python casting would all work in the same dtype-based way,
> which
> mimics the familiar "C" casting rules.
If I read it right, you do propose that array + python would cast in a
"minimal type" way for python.

In your write up, you describe that if you mix array + scalar, the
scalar uses a minimal dtype compared to the array's dtype. What we
instead have is that in principle you could have loops such as:

"ifi->f"
"idi->d"

and I think we should choose the first for a scalar, because it "fits"
into f just fine (if the input is `ufunc(int_arr, 12., int_arr)`).

I do not mind keeping the "simple" two (or even more) operand "let's
assume we have uniform types" logic around. For those it is easy to
find a "minimum type" even before the actual loop lookup.
For the above example it would work well in any case, but it would get
complicated if, for example, the last integer is an unsigned integer
that happens to be small enough to also fit into a signed integer.
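
That unsigned wrinkle can be probed with `np.min_scalar_type`, which is
essentially numpy's current "minimal dtype" machinery (just an
illustration of current behaviour, not part of any proposal here):

```python
import numpy as np

# A nonnegative value reports an unsigned minimal dtype, even when it
# would also fit in the signed type -- so 127 and 128 look the same here,
# although only the former is safely castable to int8:
assert np.min_scalar_type(127) == np.dtype(np.uint8)  # also fits int8
assert np.min_scalar_type(128) == np.dtype(np.uint8)  # does NOT fit int8
assert np.min_scalar_type(-1) == np.dtype(np.int8)
```

This is exactly the gap that the internal "uint7" idea is meant to fill.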

That might give some wiggle room, possibly also to attach warnings to
it, or at least make things easier. But I would also like to figure out
whether we shouldn't try to move in any case. Sure, attach a major
version to it, but hopefully not a "big step" type.

One thing that I had not thought about is that if we create
FutureWarnings, we will need to provide a way to opt in to the new/old
behaviour.
The old behaviour can be achieved by just using the python types (which
probably is what most code that wants this behaviour does already), but
the behaviour is tricky. Users can pass `dtype` explicitly, but that is
a huge kludge...
Will think about if there is a solution to that, because if there is
not, you are right. It has to be a "big step" kind of release.
Although, even then it would be nice to have warnings that can be
enabled to ease the transition!

- Sebastian


>
> See also:
> https://github.com/numpy/numpy/issues/12525
>
> Allan


Re: Moving forward with value based casting

Allan Haldane
On 6/6/19 12:46 PM, Sebastian Berg wrote:

> If I read it right, you do propose that array + python would cast in a
> "minimal type" way for python.

I'm a little unclear what you mean by "minimal type" way. By "minimal
type", I thought you and others are talking about the rule numpy
currently uses that "the output dtype is the minimal dtype capable of
representing the value of both input dtypes", right? But in that gist I
am instead proposing that output-dtype is determined by C-like rules.

For array+py_scalar I was less certain what to do than for array+array
and array+npy_scalar. But I proposed the three "ranks" of 1. bool, 2.
int, and 3. float/complex. My rule for array+py_scalar is that if the
python scalar's rank is less than the numpy operand dtype's rank, use
the numpy dtype. If the python-scalar's rank is greater, use the
"default" types of bool_, int64, float64 respectively. Eg:

np.bool_(1) + 1        -> int64   (default int wins)
np.int8(1) + 1         -> int8    (numpy wins)
np.uint8(1) + (-1)     -> uint8   (numpy wins)
np.int64(1) + 1        -> int64   (numpy wins)
np.int64(1) + 1.0      -> float64 (default float wins)
np.float32(1.0) + 1.0  -> float32 (numpy wins)

Note it does not depend on the numerical value of the scalar, only its type.
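
If I understand my own rule right, it could be sketched roughly like
this (`c_promote`, the rank table, and the defaults are illustrative
names of mine, not anything in numpy; the complex case is omitted for
brevity):

```python
import numpy as np

# Ranks: 1. bool, 2. int (signed or unsigned), 3. float/complex.
RANK = {'b': 1, 'i': 2, 'u': 2, 'f': 3, 'c': 3}
# "Default" numpy type per rank, used when the python scalar outranks
# the numpy operand.
DEFAULT = {1: np.dtype(np.bool_), 2: np.dtype(np.int64),
           3: np.dtype(np.float64)}

def c_promote(np_dtype, py_scalar):
    """Promote a numpy dtype with a bare python scalar by rank only."""
    py_rank = {bool: 1, int: 2, float: 3}[type(py_scalar)]
    if py_rank <= RANK[np_dtype.kind]:
        return np_dtype          # numpy operand wins
    return DEFAULT[py_rank]      # default type for the scalar's rank

# reproduces rows of the table above:
assert c_promote(np.dtype(np.bool_), 1) == np.dtype(np.int64)
assert c_promote(np.dtype(np.int8), 1) == np.dtype(np.int8)
assert c_promote(np.dtype(np.uint8), -1) == np.dtype(np.uint8)
assert c_promote(np.dtype(np.int64), 1.0) == np.dtype(np.float64)
assert c_promote(np.dtype(np.float32), 1.0) == np.dtype(np.float32)
```

Note that the function never looks at the scalar's value, only its type.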

> In your write up, you describe that if you mix array + scalar, the
> scalar uses a minimal dtype compared to the array's dtype.

Sorry if I'm nitpicking/misunderstanding, but in my rules np.uint64(1) +
1 -> uint64, while in numpy's "minimal dtype" rules it is -> float64. So
I don't think I am using the minimal rule.

> What we
> instead have is that in principle you could have loops such as:
>
> "ifi->f"
> "idi->d"
>
> and I think we should chose the first for a scalar, because it "fits"
> into f just fine. (if the input is) `ufunc(int_arr, 12., int_arr)`.

I feel I'm not understanding you, but the casting rules in my gist
follow those two rules if i, f are the numpy types int32 and float32.

If instead you mean (np.int64, py_float, np.int64) my rules would cast
to float64, since py_float has the highest rank and so is converted to
the default numpy-type for that rank, float64.

I would also add that unlike current numpy, my C-casting rules are
associative (if all operands are numpy types, see note below), so it
does not matter in which order you promote the types: (if)i  and i(fi)
give the same result. In current numpy this is not always the case:

    p = np.promote_types
    p(p('u2',   'i1'), 'f4')    # ->  f8
    p(  'u2', p('i1',  'f4'))   # ->  f4

(However, my casting rules are not associative if you include python
scalars, e.g. np.float32(1) + 1.0 + np.int64(1). Maybe I should try to
fix that...)
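
For what it's worth, the non-associativity of the current rules can be
checked by brute force (just a probe of current `np.promote_types`
behaviour, not a proposal):

```python
import numpy as np
from itertools import product

# Look for dtype triples where the current promotion rules are not
# associative, over a handful of numeric dtypes.
dtypes = [np.dtype(t) for t in ('u2', 'i1', 'i4', 'f4', 'f8')]
p = np.promote_types
failures = [(a, b, c) for a, b, c in product(dtypes, repeat=3)
            if p(p(a, b), c) != p(a, p(b, c))]

# the u2/i1/f4 example from above shows up among the failures:
assert (np.dtype('u2'), np.dtype('i1'), np.dtype('f4')) in failures
```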

Best,
Allan


Re: Moving forward with value based casting

Nathaniel Smith
In reply to this post by Sebastian Berg
I haven't read all the thread super carefully, so I might have missed
something, but I think we might want to look at this together with the
special rule for scalar casting.

IIUC, the basic end-user problem that motivates all this is: when you
have a simple Python constant whose exact dtype is unspecified, people
don't want numpy to first automatically pick a dtype for it, and then
use that automatically chosen dtype to override the explicit dtypes
that the user specified. That's the "x + 1" problem. (This also comes
up a ton for languages trying to figure out how to type manifest
constants.)

Numpy's original solution for this was the special casting rule for
scalars. I don't understand the exact semantics, but it's something
like: in any operation involving a mix of non-zero-dim arrays and
zero-dim arrays, we throw out the exact dtype information for the
scalar ("float64", "int32") and replace it with just the "kind"
("float", "int").

This has several surprising consequences:

- The output dtype depends on not just the input dtypes, but also the
input shapes:

In [19]: (np.array([1, 2], dtype=np.int8) + 1).dtype
Out[19]: dtype('int8')

In [20]: (np.array([1, 2], dtype=np.int8) + [1]).dtype
Out[20]: dtype('int64')

- It doesn't just affect Python scalars with vague dtypes, but also
scalars where the user has specifically set the dtype:

In [21]: (np.array([1, 2], dtype=np.int8) + np.int64(1)).dtype
Out[21]: dtype('int8')

- I'm not sure the "kind" rule even does the right thing, especially
for mixed-kind operations. float16-array + int8-scalar has to do the
same thing as float16-array + int64-scalar, but that feels weird? I
think this is why value-based casting got added (at around the same
time as float16, in fact).

(Kinds are kinda problematic in general... the SAME_KIND casting rule
is very weird – casting int32->int64 is radically different from
casting float64->float32, which is radically different than casting
int64->int32, but SAME_KIND treats them all the same. And it's really
unclear how to generalize the 'kind' concept to new dtypes.)
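
For instance, SAME_KIND currently accepts all three of those very
different conversions without distinction:

```python
import numpy as np

# All three pass the 'same_kind' check, despite being radically
# different in terms of safety:
assert np.can_cast(np.int32, np.int64, casting='same_kind')      # lossless widening
assert np.can_cast(np.float64, np.float32, casting='same_kind')  # loses precision
assert np.can_cast(np.int64, np.int32, casting='same_kind')      # can overflow
```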

My intuition is that what users actually want is for *native Python
types* to be treated as having 'underspecified' dtypes, e.g. int is
happy to coerce to int8/int32/int64/whatever, float is happy to coerce
to float32/float64/whatever, but once you have a fully-specified numpy
dtype, it should stay.

Some cases to think about:

np.array([1, 2], dtype=int8) + [1, 1]
 -> maybe this should have dtype int8, because there's no type info on
the right side to contradict that?

np.array([1, 2], dtype=int8) + 2**40
 -> maybe this should be an error, because you can't cast 2**40 to
int8 (under default casting safety rules)? That would introduce some
value-dependence, but it would only affect whether you get an error or
not, and there's precedent for that (e.g. division by zero).
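
The second case could be sketched like this (`add_weak` is a made-up
helper to illustrate the proposed semantics, not a numpy API):

```python
import numpy as np

def add_weak(arr, pyint):
    """Add a bare python int to an integer array: keep the array's
    dtype if the value fits, raise otherwise (proposed semantics)."""
    info = np.iinfo(arr.dtype)
    if not (info.min <= pyint <= info.max):
        raise OverflowError(f"{pyint} does not fit in {arr.dtype}")
    return arr + np.array(pyint, dtype=arr.dtype)

a = np.array([1, 2], dtype=np.int8)
assert add_weak(a, 1).dtype == np.dtype(np.int8)  # dtype preserved
try:
    add_weak(a, 2**40)        # value-dependent, but only error vs. not
except OverflowError:
    pass
```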

In any case, it would probably be helpful to start by just writing
down the whole set of rules we have now, because I'm not sure anyone
understands all the details...

-n



--
Nathaniel J. Smith -- https://vorpus.org

Re: Moving forward with value based casting

ralfgommers


On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:

> My intuition is that what users actually want is for *native Python
> types* to be treated as having 'underspecified' dtypes, e.g. int is
> happy to coerce to int8/int32/int64/whatever, float is happy to coerce
> to float32/float64/whatever, but once you have a fully-specified numpy
> dtype, it should stay.

Thanks Nathaniel, I think this expresses a possible solution better than anything I've seen on this list before. An explicit "underspecified types" concept could make casting understandable.


> In any case, it would probably be helpful to start by just writing
> down the whole set of rules we have now, because I'm not sure anyone
> understands all the details...

+1

Ralf



Re: Moving forward with value based casting

Marten van Kerkwijk


On Fri, Jun 7, 2019 at 1:19 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:

My intuition is that what users actually want is for *native Python
types* to be treated as having 'underspecified' dtypes, e.g. int is
happy to coerce to int8/int32/int64/whatever, float is happy to coerce
to float32/float64/whatever, but once you have a fully-specified numpy
dtype, it should stay.

Thanks Nathaniel, I think this expresses a possible solution better than anything I've seen on this list before. An explicit "underspecified types" concept could make casting understandable.

I think the current model is that this holds for all scalars, but changing it to apply only to values that are not already explicitly typed makes sense.

In the context of a mental picture, one could think in terms of coercion: numpy would have not just a `numpy.array` function but also a `numpy.scalar` function, which takes some input and tries to make a numpy scalar of it.  For python int, float, complex, etc., it would use the minimal numpy type.

Of course, this is slightly inconsistent with the `np.array` function, which converts things to `ndarray` using a default type for int, float, complex, etc. But my sense is that that is explainable: imagining both `np.scalar` and `np.array` to have dtype attributes, one could say that the default for one would be `'minimal'` and for the other `'64bit'` (well, that doesn't work for complex, but anyway).

-- Marten
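Marten's hypothetical `np.scalar` could look roughly like this (no such function exists; this sketch just reuses `np.min_scalar_type` for the `'minimal'` default):

```python
import numpy as np

def scalar(value, dtype='minimal'):
    # Hypothetical np.scalar: the default dtype is the minimal type
    # that can hold the value, unlike np.array's 64-bit defaults.
    if dtype == 'minimal':
        dtype = np.min_scalar_type(value)
    return np.array(value, dtype=dtype)[()]  # [()] extracts a numpy scalar

print(scalar(3).dtype)   # uint8 (minimal)
print(scalar(-3).dtype)  # int8
```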


Re: Moving forward with value based casting

Neal Becker
In reply to this post by Sebastian Berg
On Wed, Jun 5, 2019 at 4:42 PM Sebastian Berg
<[hidden email]> wrote:

I think the best approach is that if the user gave unambiguous types
as inputs to operators, then the output should have the same dtype, or
the dtype corresponding to the common promotion type of the inputs.

If the input type is not specified, I agree with the suggestion here:

> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.

Explicit is better than implicit

--
Those who don't understand recursion are doomed to repeat it
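Option 3 above is simple enough to state as code. A sketch of the proposed mapping (the function name `dtype_of` is invented; this describes the proposal, not NumPy's behavior at the time of writing):

```python
import numpy as np

def dtype_of(x):
    # Option 3 sketch: python types map to fixed defaults,
    # anything already numpy-typed keeps its dtype.
    if isinstance(x, bool):  # bool before int: bool subclasses int
        return np.dtype(np.bool_)
    if isinstance(x, int):
        return np.dtype(np.int64)
    if isinstance(x, float):
        return np.dtype(np.float64)
    return np.asarray(x).dtype

print(dtype_of(1))              # int64
print(dtype_of(np.float32(1)))  # float32
```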

Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by ralfgommers
On Fri, 2019-06-07 at 07:18 +0200, Ralf Gommers wrote:

>
>
> On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]> wrote:
> > My intuition is that what users actually want is for *native Python
> > types* to be treated as having 'underspecified' dtypes, e.g. int is
> > happy to coerce to int8/int32/int64/whatever, float is happy to
> > coerce
> > to float32/float64/whatever, but once you have a fully-specified
> > numpy
> > dtype, it should stay.
>
> Thanks Nathaniel, I think this expresses a possible solution better
> than anything I've seen on this list before. An explicit
> "underspecified types" concept could make casting understandable.
Yes, there is one small additional annoyance (but maybe it is just
that): 127 effectively has the 'underspecified' dtype `uint7` (it can
be safely cast both to uint8 and int8).

>
> > In any case, it would probably be helpful to start by just writing
> > down the whole set of rules we have now, because I'm not sure
> > anyone
> > understands all the details...
>
> +1

OK, let me try to sketch the details below:

0. "Scalars" means scalars or 0-D arrays here.

1. The logic below will only be used if we have a mix of arrays and
scalars. If all are scalars, the logic is never used. (Plus one
additional tricky case within ufuncs, which is more hypothetical [0])

2. Scalars will only be demoted within their category. The categories
and casting rules within the category are as follows:

Boolean:
    Casts safely to all (nothing surprising).

Integers:
    Casting is possible if output can hold the value.
    This includes uint8(127) casting to an int8.
    (unsigned and signed integers are the same "category")

Floats:
    Scalars can be demoted based on value, roughly this
    avoids overflows:
        float16:     -65000 < value < 65000
        float32:    -3.4e38 < value < 3.4e38
        float64:   -1.7e308 < value < 1.7e308
        float128 (largest type, does not apply).

Complex: Same logic as floats (applied to .real and .imag).

Others: Anything else.
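The category ordering and float-demotion bounds above can be sketched as follows (the mail rounds the bounds; this sketch uses the exact `np.finfo` maxima, and the names `CATEGORY`, `category`, and `minimal_float` are invented for illustration):

```python
import numpy as np

# Category ordering used in the rules: bool < int < float < complex.
CATEGORY = {'b': 0, 'u': 1, 'i': 1, 'f': 2, 'c': 3}

def category(dtype):
    return CATEGORY[np.dtype(dtype).kind]

def minimal_float(value):
    # Demote a float scalar to the smallest float dtype that avoids
    # overflow, per the bounds sketched above.
    for dt in (np.float16, np.float32, np.float64):
        if abs(value) < np.finfo(dt).max:
            return np.dtype(dt)
    return np.dtype(np.longdouble)  # largest type, no bound applies

print(minimal_float(60000.0))  # float16
print(minimal_float(1e30))     # float32
print(minimal_float(1e300))    # float64
```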

---

Ufuncs, as well as `result_type`, will use this liberally, which
basically means finding the smallest type for each category and using
that. Of course for floats we cannot do the actual cast until later,
since initially we do not know if the cast will actually be performed.

This is only tricky for uint vs. int, because uint8(127) is a "small
unsigned". I.e. with our current dtypes there is no strict type
hierarchy: uint8(x) may or may not safely cast to int8.

---

We could think of doing:

arr, min_dtype = np.asarray_and_min_dtype(pyobject)

which could even fix the list example Nathaniel had, provided the
dtype hierarchy were made strict.

This is where the `uint7` came from: a hypothetical `uint7` would fix
the integer dtype hierarchy by representing the numbers `0-127`, which
can be cast safely to both uint8 and int8.
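Concretely, the hypothetical `uint7` names the overlap of the int8 and uint8 ranges (illustrative only; no such dtype exists, and `minimal_int_kinds` is invented here):

```python
import numpy as np

def minimal_int_kinds(value):
    # Which 8-bit dtypes can hold the value?  The 0..127 overlap is
    # exactly the range a hypothetical "uint7" would represent.
    kinds = []
    if np.iinfo(np.int8).min <= value <= np.iinfo(np.int8).max:
        kinds.append(np.dtype(np.int8))
    if np.iinfo(np.uint8).min <= value <= np.iinfo(np.uint8).max:
        kinds.append(np.dtype(np.uint8))
    return kinds

print(minimal_int_kinds(127))  # both int8 and uint8 -> the "uint7" range
print(minimal_int_kinds(128))  # uint8 only
```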

Best,

Sebastian


[0] Amendment for point 1:

There is one detail (bug?) in the logic here, though, that I missed
before. If a ufunc (or result_type) sees a mix of scalars and arrays,
it will try to decide whether or not to use value based logic. Value
based logic will be skipped if the scalars are in a higher category
(based on the ones above) than the highest array – presumably for
optimization.
Plausibly, this could cause incorrect logic when the dtype signature of
a ufunc is mixed:
  float32, int8 -> float32
  float32, int64 -> float64

which may choose the second loop unnecessarily. Or, for example, if we
have a datetime64 in the inputs, there would be no way for value based
casting to be used.
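The mixed-signature worry can be made concrete with a toy loop table (the table and `pick_loop` are hypothetical; real ufunc type resolution is considerably more involved):

```python
import numpy as np

# Hypothetical dtype signatures of a two-input ufunc.
LOOPS = [
    (np.dtype(np.float32), np.dtype(np.int8)),
    (np.dtype(np.float32), np.dtype(np.int64)),
]

def fits(value, int_dtype):
    # Can this integer dtype hold the scalar's value?
    info = np.iinfo(int_dtype)
    return info.min <= value <= info.max

def pick_loop(arr_dtype, int_scalar):
    # Value-based sketch: pick the first loop whose integer slot can
    # hold the scalar's value.
    for sig in LOOPS:
        if arr_dtype == sig[0] and fits(int_scalar, sig[1]):
            return sig

print(pick_loop(np.dtype(np.float32), 3))      # int8 loop
print(pick_loop(np.dtype(np.float32), 10**6))  # int64 loop
```

Skipping this value-based step (because the scalar's category check fails) is exactly what could push a call onto the second loop unnecessarily.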



>
> Ralf
>

Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Allan Haldane
On Thu, 2019-06-06 at 19:34 -0400, Allan Haldane wrote:

> On 6/6/19 12:46 PM, Sebastian Berg wrote:
> > On Thu, 2019-06-06 at 11:57 -0400, Allan Haldane wrote:
> > > I think dtype-based casting makes a lot of sense, the problem is
> > > backward compatibility.
> > >
> > > Numpy casting is weird in a number of ways: The array + array
> > > casting
> > > is
> > > unexpected to many users (eg, uint64 + int64 -> float64), and the
> > > casting of array + scalar is different from that, and value
> > > based.
> > > Personally I wouldn't want to try change it unless we make a
> > > backward-incompatible release (numpy 2.0), based on my experience
> > > trying
> > > to change much more minor things. We already put "casting" on the
> > > list
> > > of desired backward-incompatible changes on the list here:
> > > https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
> > >
> > > Relatedly, I've previously dreamed about a different "C-style"
> > > way
> > > casting might behave:
> > > https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76
> > >
> > > The proposal there is that array + array casting, array + scalar,
> > > and
> > > array + python casting would all work in the same dtype-based
> > > way,
> > > which
> > > mimics the familiar "C" casting rules.
> >
> > If I read it right, you do propose that array + python would cast
> > in a
> > "minimal type" way for python.
>
> I'm a little unclear what you mean by "minimal type" way. By "minimal
> type", I thought you and others are talking about the rule numpy
> currently uses that "the output dtype is the minimal dtype capable of
> representing the value of both input dtypes", right? But in that gist
> I
> am instead proposing that output-dtype is determined by C-like rules.
>
> For array+py_scalar I was less certain what to do than for
> array+array
> and array+npy_scalar. But I proposed the three "ranks" of 1. bool, 2.
> int, and 3. float/complex. My rule for array+py_scalar is that if the
> python scalar's rank is less than the numpy operand dtype's rank, use
> the numpy dtype. If the python-scalar's rank is greater, use the
> "default" types of bool_, int64, float64 respectively. Eg:
>
> np.bool_(1) + 1        -> int64   (default int wins)
> np.int8(1) + 1         -> int8    (numpy wins)
> np.uint8(1) + (-1)     -> uint8   (numpy wins)
> np.int64(1) + 1        -> int64   (numpy wins)
> np.int64(1) + 1.0      -> float64 (default float wins)
> np.float32(1.0) + 1.0  -> float32 (numpy wins)
>
> Note it does not depend on the numerical value of the scalar, only
> its type.
>
> > In your write up, you describe that if you mix array + scalar, the
> > scalar uses a minimal dtype compared to the array's dtype.
>
> Sorry if I'm nitpicking/misunderstanding, but in my rules
> np.uint64(1) +
> 1 -> uint64 but in numpy's "minimal dtype" rules it is  -> float64.
> So I
> don't think I am using the minimal rule.
>
> > What we
> > instead have is that in principle you could have loops such as:
> >
> > "ifi->f"
> > "idi->d"
> >
> > and I think we should choose the first for a scalar, because it
> > "fits"
> > into f just fine. (if the input is) `ufunc(int_arr, 12., int_arr)`.
>
> I feel I'm not understanding you, but the casting rules in my gist
> follow those two rules if i, f are the numpy types int32 and float32.
>
> If instead you mean (np.int64, py_float, np.int64) my rules would
> cast
> to float64, since py_float has the highest rank and so is converted
> to
> the default numpy-type for that rank, float64.
Yes, you are right. I should look at them a bit more carefully in any
case. Actually, numpy would also choose the second one, because the
python float has the higher "category". The example should rather have
been:

int8, float32 -> float32
int64, float32 -> float64

With `python_int(12) + np.array([1., 2.], dtype=float64)`. Numpy would
currently choose the int8 loop here, because the scalar is of a lower
or equal "category" and thus it is OK to demote it even further.

This is fairly irrelevant for most users, but for ufunc dispatching, I
think it is where it gets ugly, at least for non-uniform ufunc dtype
signatures. And no, I doubt that this is very relevant in practice or
that numpy is even very consistent here.

I have a branch now which basically moves the "ResultType" logic before
choosing the loop (it thus is unable to capture some of the stranger,
probably non-existing corner cases).


On a different note: the ranking you are suggesting for python types
seems very much the same as what we have, with the exception that it
would not look at the value (I suppose we would instead simply raise a
casting error):

int8_arr + 87345  # output should always be int8, so crash on cast?

That may be a viable approach, although signed/unsigned may be tricky:

uint8_arr + py_int  # do we look at the py_int's sign?


- Sebastian
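For reference, Allan's rank proposal quoted above can be written down and checked against his examples (`RANK`, `DEFAULTS`, `py_rank`, and `promote` are names invented for this sketch):

```python
import numpy as np

RANK = {'b': 1, 'u': 2, 'i': 2, 'f': 3, 'c': 3}
DEFAULTS = {1: np.dtype(np.bool_), 2: np.dtype(np.int64), 3: np.dtype(np.float64)}

def py_rank(x):
    # Rank of a python scalar: 1. bool, 2. int, 3. float/complex.
    if isinstance(x, bool):
        return 1
    return 2 if isinstance(x, int) else 3

def promote(np_dtype, py_scalar):
    # Allan's rule: the numpy dtype wins unless the python scalar has a
    # strictly higher rank; then the default dtype for that rank wins.
    np_dtype = np.dtype(np_dtype)
    if py_rank(py_scalar) <= RANK[np_dtype.kind]:
        return np_dtype
    return DEFAULTS[py_rank(py_scalar)]

assert promote(np.bool_, 1) == np.dtype(np.int64)      # default int wins
assert promote(np.int8, 1) == np.dtype(np.int8)        # numpy wins
assert promote(np.uint8, -1) == np.dtype(np.uint8)     # numpy wins
assert promote(np.int64, 1) == np.dtype(np.int64)      # numpy wins
assert promote(np.int64, 1.0) == np.dtype(np.float64)  # default float wins
assert promote(np.float32, 1.0) == np.dtype(np.float32)
```

Note that, as Allan says, the result depends only on the scalar's python type, never on its value.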


>
> I would also add that unlike current numpy, my C-casting rules are
> associative (if all operands are numpy types, see note below), so it
> does not matter in which order you promote the types: (if)i  and
> i(fi)
> give the same result. In current numpy this is not always the case:
>
>     p = np.promote_types
>     p(p('u2',   'i1'), 'f4')    # ->  f8
>     p(  'u2', p('i1',  'f4'))   # ->  f4
>
> (However, my casting rules are not associative if you include python
> scalars.. eg  np.float32(1) + 1.0 + np.int64(1) . Maybe I should try
> to
> fix that...)
>
> Best,
> Allan
>
> > I do not mind keeping the "simple" two (or even more) operand "lets
> > assume we have uniform types" logic around. For those it is easy to
> > find a "minimum type" even before actual loop lookup.
> > For the above example it would work in any case well, but it would
> > get
> > complicating, if for example the last integer is an unsigned
> > integer,
> > that happens to be small enough to fit also into an integer.
> >
> > That might give some wiggle room, possibly also to attach warnings
> > to
> > it, or at least make things easier. But I would also like to figure
> > out
> > as well if we shouldn't try to move in any case. Sure, attach a
> > major
> > version to it, but hopefully not a "big step type".
> >
> > One thing that I had not thought about is, that if we create
> > FutureWarnings, we will need to provide a way to opt-in to the
> > new/old
> > behaviour.
> > The old behaviour can be achieved by just using the python types
> > (which
> > probably is what most code that wants this behaviour does already),
> > but
> > the behaviour is tricky. Users can pass `dtype` explicitly, but
> > that is
> > a huge kludge...
> > Will think about if there is a solution to that, because if there
> > is
> > not, you are right. It has to be a "big step" kind of release.
> > Although, even then it would be nice to have warnings that can be
> > enabled to ease the transition!
> >
> > - Sebastian
> >
> >
> > > See also:
> > > https://github.com/numpy/numpy/issues/12525
> > >
> > > Allan
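Allan's non-associativity example above is easy to verify directly; `np.promote_types` is purely dtype-based, so the result does not depend on value-based logic:

```python
import numpy as np

p = np.promote_types
# (uint16, int8) -> int32, then (int32, float32) -> float64:
print(p(p('u2', 'i1'), 'f4'))  # float64
# (int8, float32) -> float32, then (uint16, float32) -> float32:
print(p('u2', p('i1', 'f4')))  # float32
```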

Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Sebastian Berg
On Fri, 2019-06-07 at 13:19 -0500, Sebastian Berg wrote:

> On Fri, 2019-06-07 at 07:18 +0200, Ralf Gommers wrote:
> >
> > On Fri, Jun 7, 2019 at 1:37 AM Nathaniel Smith <[hidden email]>
> > wrote:
> > > My intuition is that what users actually want is for *native
> > > Python
> > > types* to be treated as having 'underspecified' dtypes, e.g. int
> > > is
> > > happy to coerce to int8/int32/int64/whatever, float is happy to
> > > coerce
> > > to float32/float64/whatever, but once you have a fully-specified
> > > numpy
> > > dtype, it should stay.
> >
> > Thanks Nathaniel, I think this expresses a possible solution better
> > than anything I've seen on this list before. An explicit
> > "underspecified types" concept could make casting understandable.
>
> Yes, there is one small additional annoyance (but maybe it is just
> that). In that 127 is the 'underspecified' dtype `uint7` (it can be
> safely cast both to uint8 and int8).
>
> > > In any case, it would probably be helpful to start by just
> > > writing
> > > down the whole set of rules we have now, because I'm not sure
> > > anyone
> > > understands all the details...
> >
> > +1
>
> OK, let me try to sketch the details below:
>
> 0. "Scalars" means scalars or 0-D arrays here.
>
> 1. The logic below will only be used if we have a mix of arrays and
> scalars. If all are scalars, the logic is never used. (Plus one
> additional tricky case within ufuncs, which is more hypothetical [0])
>
And of course I just realized that, trying to be simple, I forgot an
important point there:

The logic in 2. is only used when there is a mix of scalars and arrays,
and the arrays are in the same or higher category. As an example:

np.array([1, 2, 3], dtype=np.uint8) + np.float64(12.)

will not demote the float64, because the scalar's "float" is a higher
category than the array's "integer".


- Sebastian
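This particular example happens to give the same float64 result under plain dtype-based promotion as well, so it can be checked on any NumPy version:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.uint8)
res = arr + np.float64(12.)
# The float64 scalar is in a higher category ("float") than the uint8
# array ("integer"), so it is not demoted.
print(res.dtype)  # float64
```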



Re: Moving forward with value based casting

Sebastian Berg
In reply to this post by Sebastian Berg
On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:

> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward
> here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
Hi all,

just to note: I think I will go forward trying to fill the hole in the
hierarchy with a non-existing uint7 dtype. That seemed like it might be
ugly, but if it does not escalate too much, it is probably fairly
straightforward. And it would allow simplifying dispatching without
any logic change at all. After that we could still decide to change the
logic.

Best,

Sebastian

