Thanks Warren! Worked like a charm 😊 Will mention you in the book...

This is an unfortunate discrepancy in the API: `var` uses the default `ddof=0`, while `cov` uses, in effect, `ddof=1` by default.

You can get the consistent behavior you want by using `ddof=1` in both functions. E.g.

Using `ddof=1` in `np.cov` is redundant, but in this context, it is probably useful to make explicit to the reader of the code that both functions are using the same convention.
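
A minimal sketch of that, assuming a small made-up sample (the array `x` is only for illustration and is not from the original question):

```python
import numpy as np

# Made-up sample data, used only to show the two conventions.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_default = np.var(x)          # ddof=0: divides by N
var_sample = np.var(x, ddof=1)   # ddof=1: divides by N - 1
cov_sample = np.cov(x, ddof=1)   # explicit, although it matches cov's default

print(var_default, var_sample, cov_sample)
```

With `ddof=1` passed to both, `np.var` and `np.cov` agree for a 1-D input; only the default `np.var(x)` differs.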

compatibility. That would require a long and potentially painful

deprecation process.

> Gunter Meissner, PhD
> University of Hawaii
> Adjunct Professor of MathFinance at Columbia University and NYU
> President of Derivatives Software, www.dersoft.com
> CEO, Cassandra Capital Management, www.cassandracm.com
> CV: www.dersoft.com/cv.pdf
> Email: [hidden email]
> Tel: USA (808) 779 3660
>
> From: NumPy-Discussion <numpy-discussion-bounces+meissner=[hidden email]> On Behalf Of Ralf Gommers
> Sent: Wednesday, March 18, 2020 5:16 AM
> To: Discussion of Numerical Python <[hidden email]>
> Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
>
> On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg <[hidden email]> wrote:

>

> Hi all,

>

> in the spirit of trying to keep this moving, can I assume that the

> main reason for little discussion is that the actual changes proposed

> are not very far reaching as of now? Or is the reason that this is a

> fairly complex topic that you need more time to think about it?

>

>

>

> Probably (a) it's a long NEP on a complex topic, (b) the past week has

> been a very weird week for everyone (in the extra-news-reading-time I

> could easily have re-reviewed the NEP), and (c) the amount of feedback

> one expects to get on a NEP is roughly inversely proportional to the

> scope and complexity of the NEP contents.

>

>

>

> Today I re-read the parts I commented on before. This version is a big

> improvement over the previous ones. Thanks in particular for adding

> clear examples and the diagram, it helps a lot.

>

>

>

> If it is the latter, is there some way I can help with it? I tried to

> minimize how much is part of this initial NEP.

>

> If there is not much need for discussion, I would like to officially

> accept the NEP very soon, sending out an official one week notice in

> the next days.

>

>

>

> I agree. I think I would like to keep the option open though to come

> back to the NEP later to improve the clarity of the text about

> motivation/plan/examples/scope, given that this will be the reference

> for a major amount of work for a long time to come.

>

>

>

> To summarize one more time, the main point is that:

>

>

>

> This point seems fine, and I'm +1 for going ahead with the described

> parts of the technical design.

>

>

>

> Cheers,

>

> Ralf

>

>

>

>

> type(np.dtype(np.float64))

>

> will be `np.dtype[float64]`, a subclass of dtype, so that:

>

> issubclass(np.dtype[float64], np.dtype)

>

> is true. This means that we will have one class for every current type

> number: `dtype.num`. The implementation of these subclasses will be a

> C-written (extension) MetaClass, all details of this class are

> supposed to remain experimental and in flux at this time.

>

> Cheers

>

> Sebastian

>

>

> On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:

>> Hi all,

>>

>> I am pleased to propose NEP 41: First step towards a new Datatype
>> System
>> https://numpy.org/neps/nep-0041-improved-dtype-support.html
>>

>> This NEP motivates the larger restructure of the datatype machinery

>> in NumPy and defines a few fundamental design aspects. The long-term
>> user impact will be to allow user-defined datatypes that are easier
>> to write and richer in features.

>>

>> As this is a large restructure, the NEP represents only the first

>> steps with some additional information in further NEPs being drafted

>> [1] (this may be helpful to look at depending on the level of detail

>> you are interested in).

>> The NEP itself does not propose to add significant new public API.

>> Instead it proposes to move forward with an incremental internal

>> refactor and lays the foundation for this process.

>>

>> The main user-facing change at this time is that datatypes will
>> become classes (e.g. ``type(np.dtype("float64"))`` will be a
>> float64-specific class).

>> For most users, the main impact should be many new datatypes in the

>> long run (see the user impact section). However, for those interested

>> in API design within NumPy or with respect to implementing new

>> datatypes, this and the following NEPs are important decisions in the

>> future roadmap for NumPy.

>>

>> The current full text is reproduced below, although the above link is

>> probably a better way to read it.

>>

>> Cheers

>>

>> Sebastian

>>

>>

>> [1] NEP 40 gives some background information about the current
>> system and issues with it:
>> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
>> and NEP 42 is a first draft of what the new API may look like:
>> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
>> (links to current rendered versions, check
>> https://github.com/numpy/numpy/pull/15505 and
>> https://github.com/numpy/numpy/pull/15507 for updates)

>>

>>

>> ----------------------------------------------------------------------

>>

>>

>> =================================================

>> NEP 41 — First step towards a new Datatype System

>> =================================================

>>

>> :title: Improved Datatype Support

>> :Author: Sebastian Berg

>> :Author: Stéfan van der Walt

>> :Author: Matti Picus

>> :Status: Draft

>> :Type: Standard Track

>> :Created: 2020-02-03

>>

>>

>> .. note::
>>
>>     This NEP is part of a series of NEPs encompassing first information
>>     about the previous dtype implementation and issues with it in NEP 40.
>>     NEP 41 (this document) then provides an overview and generic design
>>     choices for the refactor.
>>     Further NEPs 42 and 43 go into the technical details of the datatype
>>     and universal function related internal and external API changes.
>>     In some cases it may be necessary to consult the other NEPs for a
>>     full picture of the desired changes and why these changes are
>>     necessary.

>>

>>

>> Abstract

>> --------

>>

>> `Datatypes <data-type-objects-dtype>` in NumPy describe how to

>> interpret each element in arrays. NumPy provides ``int``, ``float``,

>> and ``complex`` numerical types, as well as string, datetime, and

>> structured datatype capabilities.

>> The growing Python community, however, has need for more diverse

>> datatypes.

>> Examples are datatypes with unit information attached (such as

>> meters) or

>> categorical datatypes (fixed set of possible values).

>> However, the current NumPy datatype API is too limited to allow the

>> creation of these.

>>

>> This NEP is the first step to enable such growth; it will lead to a

>> simpler development path for new datatypes.

>> In the long run the new datatype system will also support the

>> creation of datatypes directly from Python rather than C.

>> Refactoring the datatype API will improve maintainability and

>> facilitate development of both user-defined external datatypes, as

>> well as new features for existing datatypes internal to NumPy.

>>

>>

>> Motivation and Scope

>> --------------------

>>

>> .. seealso::
>>
>>     The user impact section includes examples of what kind of new
>>     datatypes will be enabled by the proposed changes in the long run.
>>     It may thus help to read these sections out of order.

>>

>> Motivation

>> ^^^^^^^^^^

>>

>> One of the main issues with the current API is the definition of

>> typical functions such as addition and multiplication for parametric

>> datatypes (see also NEP 40) which require additional steps to

>> determine the output type.

>> For example when adding two strings of length 4, the result is a

>> string of length 8, which is different from the input.

>> Similarly, a datatype which embeds a physical unit must calculate the

>> new unit

>> information: dividing a distance by a time results in a speed.

>> A related difficulty is that the :ref:`current casting rules
>> <_ufuncs.casting>` -- the conversion between different datatypes --
>> cannot describe casting for such parametric datatypes implemented
>> outside of NumPy.

>>
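
As an illustration of the string case above (a sketch using the existing `np.char.add` helper, not part of the NEP text; the sample values are made up), the output dtype cannot be taken from the inputs but has to be computed from them:

```python
import numpy as np

# Two arrays of length-4 strings (dtype "<U4" on little-endian machines).
a = np.array(["abcd", "wxyz"])
b = np.array(["efgh", "1234"])

# Element-wise concatenation: the result needs room for 8 characters,
# so its dtype differs from both inputs.
joined = np.char.add(a, b)

print(a.dtype, b.dtype, joined.dtype)
```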

>> This additional functionality for supporting parametric datatypes

>> introduces increased complexity within NumPy itself, and furthermore

>> is not available to external user-defined datatypes.

>> In general the concerns of different datatypes are not well
>> encapsulated.

>> This burden is exacerbated by the exposure of internal C structures,

>> limiting the addition of new fields (for example to support new

>> sorting methods [new_sort]_).

>>

>> Currently there are many factors which limit the creation of new

>> user-defined

>> datatypes:

>>

>> * Creating casting rules for parametric user-defined dtypes is either
>>   impossible or so complex that it has never been attempted.
>> * Type promotion, e.g. the operation deciding that adding float and
>>   integer values should return a float value, is very valuable for
>>   numeric datatypes but is limited in scope for user-defined and
>>   especially parametric datatypes.
>> * Much of the logic (e.g. promotion) is written in single functions
>>   instead of being split as methods on the datatype itself.
>> * In the current design datatypes cannot have methods that do not
>>   generalize to other datatypes. For example a unit datatype cannot
>>   have a ``.to_si()`` method to easily find the datatype which would
>>   represent the same values in SI units.

>>

>> The large need to solve these issues has driven the scientific

>> community to create work-arounds in multiple projects implementing

>> physical units as an array-like class instead of a datatype, which

>> would generalize better across multiple array-likes (Dask, pandas,

>> etc.).

>> Already, Pandas has made a push in the same direction with its

>> extension arrays [pandas_extension_arrays]_ and undoubtedly the

>> community would be best served if such new features could be common

>> between NumPy, Pandas, and other projects.

>>

>> Scope

>> ^^^^^

>>

>> The proposed refactoring of the datatype system is a large

>> undertaking and thus is proposed to be split into various phases,

>> roughly:

>>

>> * Phase I: Restructure and extend the datatype infrastructure (This

>> NEP 41)

>> * Phase II: Incrementally define or rework API (Detailed largely in

>> NEPs 42/43)

>> * Phase III: Growth of NumPy and Scientific Python Ecosystem

>> capabilities.

>>

>> For a more detailed accounting of the various phases, see "Plan to

>> Approach the Full Refactor" in the Implementation section below.

>> This NEP proposes to move ahead with the necessary creation of new

>> dtype subclasses (Phase I), and start working on implementing current

>> functionality.

>> Within the context of this NEP all development will be fully private

>> API or use preliminary underscored names which must be changed in the

>> future.

>> Most of the internal and public API choices are part of a second

>> Phase and will be discussed in more detail in the following NEPs 42

>> and 43.

>> The initial implementation of this NEP will have little or no effect

>> on users, but provides the necessary ground work for incrementally

>> addressing the full rework.

>>

>> The implementation of this NEP and the following, implied large

>> rework of how datatypes are defined in NumPy is expected to create

>> small incompatibilities (see backward compatibility section).

>> However, a transition requiring large code adaptation is not
>> anticipated and not within scope.

>>

>> Specifically, this NEP makes the following design choices, which are
>> discussed in more detail in the Detailed Description section:
>>
>> 1. Each datatype will be an instance of a subclass of ``np.dtype``,
>>    with most of the datatype-specific logic being implemented as
>>    special methods on the class. In the C-API, these correspond to
>>    specific slots. In short, for ``f = np.dtype("f8")``,
>>    ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will
>>    be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself.
>>    The ``PyArray_ArrFuncs`` which are currently stored as a pointer on
>>    the instance (as ``PyArray_Descr->f``), should instead be stored on
>>    the class as typically done in Python.
>>    In the future these may correspond to Python-side dunder methods.
>>    Storage information such as itemsize and byteorder can differ
>>    between different dtype instances (e.g. "S3" vs. "S8") and will
>>    remain part of the instance.
>>    This means that in the long run the current low-level access to
>>    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>>
>> 2. The current NumPy scalars will *not* change; they will not be
>>    instances of datatypes. This will also be true for new datatypes:
>>    scalars will not be instances of a dtype (although
>>    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
>>    appropriate).

>>

>> Detailed technical decisions to follow in NEP 42.

>>

>> Further, the public API will be designed in a way that is extensible

>> in the future:

>>

>> 3. All new C-API functions provided to the user will hide
>>    implementation details as much as possible. The public API should
>>    be an identical, but limited, version of the C-API used for the
>>    internal NumPy datatypes.

>>

>> The changes to the datatype system in Phase II must include a large

>> refactor of the UFunc machinery, which will be further defined in NEP

>> 43:

>>

>> 4. To enable all of the desired functionality for new user-defined
>>    datatypes, the UFunc machinery will be changed to replace the
>>    current dispatching and type resolution system.
>>    The old system should be *mostly* supported as a legacy version
>>    for some time.

>>

>> Additionally, as a general design principle, the addition of new

>> user-defined datatypes will *not* change the behaviour of programs.

>> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or

>> ``b`` know that ``c`` exists.

>>

>>

>> User Impact

>> -----------

>>

>> The current ecosystem has very few user-defined datatypes using

>> NumPy, the two most prominent being: ``rational`` and ``quaternion``.

>> These represent fairly simple datatypes which are not strongly

>> impacted by the current limitations.

>> However, we have identified a need for datatypes such as:

>>

>> * bfloat16, used in deep learning

>> * categorical types

>> * physical units (such as meters)

>> * datatypes for tracing/automatic differentiation

>> * high, fixed precision math

>> * specialized integer types such as int2, int24

>> * new, better datetime representations

>> * extending e.g. integer dtypes to have a sentinel NA value

>> * geometrical objects [pygeos]_

>>

>> Some of these are partially solved; for example unit capability is

>> provided in ``astropy.units``, ``unyt``, or ``pint``, as

>> `numpy.ndarray` subclasses.

>> Most of these datatypes, however, simply cannot be reasonably defined

>> right now.

>> An advantage of having such datatypes in NumPy is that they should

>> integrate seamlessly with other array or array-like packages such as

>> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.

>>

>> The long term user impact of implementing this NEP will be to allow

>> both the growth of the whole ecosystem by having such new datatypes,

>> as well as consolidating implementation of such datatypes within

>> NumPy to achieve better interoperability.

>>

>>

>> Examples

>> ^^^^^^^^

>>

>> The following examples represent future user-defined datatypes we

>> wish to enable.

>> These datatypes are not part of the NEP and choices (e.g. choice of

>> casting rules) are possibilities we wish to enable and do not

>> represent recommendations.

>>

>> Simple Numerical Types

>> """"""""""""""""""""""

>>

>> Mainly used where memory is a consideration, lower-precision numeric
>> types such as `bfloat16
>> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
>> are common in other computational frameworks.

>> For these types the definitions of things such as ``np.common_type``

>> and ``np.can_cast`` are some of the most important interfaces. Once

>> they support ``np.common_type``, it is (for the most part) possible

>> to find the correct ufunc loop to call, since most ufuncs -- such as

>> add -- effectively only require ``np.result_type``::

>>

>> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)

>>

>> and `~numpy.result_type` is largely identical to

>> `~numpy.common_type`.
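
For comparison, this is how the relationship looks today for built-in numeric dtypes (a sketch using only the existing `np.result_type` and `np.can_cast`; the ``np.common_type`` interface discussed in the NEP is a proposal and is deliberately not called here, and the array names are just examples):

```python
import numpy as np

arr1 = np.arange(3, dtype=np.int32)
arr2 = np.ones(3, dtype=np.float32)

# The dtype of the ufunc result matches the promoted type of the inputs.
promoted = np.result_type(arr1, arr2)
summed = np.add(arr1, arr2)
print(promoted, summed.dtype)

# Casting rules between built-in float types.
safe_up = np.can_cast(np.float32, np.float64, casting="safe")
safe_down = np.can_cast(np.float64, np.float32, casting="safe")
same_kind_down = np.can_cast(np.float64, np.float32, casting="same_kind")
print(safe_up, safe_down, same_kind_down)
```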

>>

>>

>> Fixed, high precision math

>> """"""""""""""""""""""""""

>>

>> Allowing arbitrary precision or higher precision math is important in

>> simulations. For instance ``mpmath`` defines a precision::

>>

>> >>> import mpmath as mp

>> >>> print(mp.dps) # the current (default) precision

>> 15

>>

>> NumPy should be able to construct a native, memory-efficient array

>> from a list of ``mpmath.mpf`` floating point objects::

>>

>> >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
>> >>> print(arr_15_dps)  # Must find the correct precision from the objects:
>> array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])

>>

>> We should also be able to specify the desired precision when creating

>> the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find

>> the DType class (the notation is not part of this NEP), which is then

>> instantiated with the desired parameter.

>> This could also be written as ``MpfDType`` class::

>>

>> >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
>> >>> print(arr_15_dps + arr_100_dps)
>> array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])

>>

>> The ``mpf`` datatype can decide that the result of the operation

>> should be the higher precision one of the two, so uses a precision of

>> 100.

>> Furthermore, we should be able to define casting, for example as in::

>>

>> >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
>> True
>> >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
>> False  # loses precision
>> >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
>> True

>>

>> Casting from float is probably always at least a ``same_kind``
>> cast, but in general, it is not safe::
>>
>> >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
>> False
>>
>> since a float64 has a higher precision than the ``mpf`` datatype with
>> ``dps=4``.

>>

>> Alternatively, we can say that::

>>

>> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
>> np.dtype[mp.mpf](dps=10)
>>
>> And possibly even::
>>
>> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
>> np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)

>>

>> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``

>> safely.

>>

>>

>> Categoricals

>> """"""""""""

>>

>> Categoricals are interesting in that they can have fixed, predefined

>> values, or can be dynamic with the ability to modify categories when

>> necessary.

>> Fixed categories (defined ahead of time) are the most
>> straightforward categorical definition.

>> Categoricals are *hard*, since there are many strategies to implement

>> them, suggesting NumPy should only provide the scaffolding for

>> user-defined categorical types. For instance::

>>

>> >>> cat = Categorical(["eggs", "spam", "toast"])

>> >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

>>

>> could store the array very efficiently, since it knows that there are

>> only 3 categories.

>> Since a categorical in this sense knows almost nothing about the data
>> stored in it, few operations make sense, although equality does::
>>
>> >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
>> >>> breakfast == breakfast2
>> array([True, False, True, False])

>>

>> The categorical datatype could work like a dictionary: no two item
>> names can be equal (checked on dtype creation), so that the equality

>> operation above can be performed very efficiently.

>> If the values define an order, the category labels (internally

>> integers) could

>> be ordered the same way to allow efficient sorting and comparison.
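
A rough way to picture the integer-label idea with today's NumPy (a sketch only: this ``Categorical`` class and its ``encode`` method are hypothetical helpers written for this example, not the datatype the NEP wants to enable and not an existing NumPy API):

```python
import numpy as np


class Categorical:
    """Hypothetical helper: maps unique category names to integer codes."""

    def __init__(self, categories):
        # Like dictionary keys, category names must be unique.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = list(categories)
        self._codes = {name: i for i, name in enumerate(self.categories)}

    def encode(self, values):
        # Store one small integer per element instead of a full string.
        return np.array([self._codes[v] for v in values], dtype=np.uint8)


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

# Equality reduces to a cheap integer comparison on the codes.
equal = breakfast == breakfast2
print(equal)
```

Because the codes follow the order in which the categories were given, sorting the code array also sorts by category order, which is the efficiency argument made above.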

>>

>> Whether or not casting is defined from one categorical with fewer
>> values to one with strictly more values is something that the

>> Categorical datatype would need to decide. Both options should be

>> available.

>>

>>

>> Unit on the Datatype

>> """"""""""""""""""""

>>

>> There are different ways to define Units, depending on how the
>> internal machinery would be organized. One way is to have a single
>> Unit datatype for every existing numerical type.
>> This will be written as ``Unit[float64]``. The unit itself is part of
>> the DType instance: ``Unit[float64]("m")`` is a ``float64`` with
>> meters attached::
>>
>> >>> from astropy import units
>> >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
>> >>> print(meters)
>> array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

>>

>> Note that units are a bit tricky. It is debatable, whether::

>>

>> >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

>>

>> should be valid syntax (coercing the float scalars without a unit to

>> meters).

>> Once the array is created, math will work without any issue::

>>

>> >>> meters / (2 * units.s)

>> array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))

>>

>> Casting is not valid from one unit to the other, but can be valid

>> between different scales of the same dimensionality (although this

>> may be

>> "unsafe")::

>>

>> >>> meters.astype(Unit[float64]("s"))

>> TypeError: Cannot cast meters to seconds.

>> >>> meters.astype(Unit[float64]("km"))

>> >>> # Convert to centimeter-gram-second (cgs) units:

>> >>> meters.astype(meters.dtype.to_cgs())

>>

>> The above notation is somewhat clumsy. Functions could be used

>> instead to convert between units.

>> There may be ways to make these more convenient, but those must be

>> left for future discussions::

>>

>> >>> units.convert(meters, "km")

>> >>> units.to_cgs(meters)

>>

>> There are some open questions. For example, whether additional

>> methods on the array object could exist to simplify some of the

>> notions, and how these would percolate from the datatype to the

>> ``ndarray``.

>>

>> The interaction with other scalars would likely be defined through::

>>

>> >>> np.common_type(np.float64, Unit)

>> Unit[np.float64](dimensionless)

>>

>> Ufunc output datatype determination can be more involved than for

>> simple numerical dtypes since there is no "universal" output type::

>>

>> >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

>>

>> In fact ``np.result_type(meters, seconds)`` must error without

>> context of the operation being done.

>> This example highlights how the specific ufunc loop (loop with known,
>> specific DTypes as inputs) has to be able to make certain
>> decisions before the actual calculation can start.

>>

>>

>>

>> Implementation

>> --------------

>>

>> Plan to Approach the Full Refactor

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> To address these issues in NumPy and enable new datatypes,

>> multiple development stages are required:

>>

>> * Phase I: Restructure and extend the datatype infrastructure (This

>> NEP)

>>

>> * Organize Datatypes like normal Python classes [`PR 15508`]_

>>

>> * Phase II: Incrementally define or rework API

>>

>> * Create a new and easily extensible API for defining new datatypes

>> and related functionality. (NEP 42)

>>

>> * Incrementally define all necessary functionality through the new

>> API (NEP 42):

>>

>> * Defining operations such as ``np.common_type``.

>> * Allowing to define casting between datatypes.

>> * Add functionality necessary to create a numpy array from Python

>> scalars

>> (i.e. ``np.array(...)``).

>> * …

>>

>> * Restructure how universal functions work (NEP 43), in order to:

>>

>> * make it possible to allow a `~numpy.ufunc` such as ``np.add``

>> to be

>> extended by user-defined datatypes such as Units.

>>

>> * allow efficient lookup for the correct implementation for user-

>> defined

>> datatypes.

>>

>> * enable reuse of existing code. Units should be able to use the

>> normal math loops and add additional logic to determine output

>> type.

>>

>> * Phase III: Growth of NumPy and Scientific Python Ecosystem

>> capabilities:

>>

>> * Cleanup of legacy behaviour where it is considered buggy or

>> undesirable.

>> * Provide a path to define new datatypes from Python.

>> * Assist the community in creating types such as Units or

>> Categoricals

>> * Allow strings to be used in functions such as ``np.equal`` or

>> ``np.add``.

>> * Remove legacy code paths within NumPy to improve long term

>> maintainability

>>

>> This document serves as a basis for phase I and provides the vision

>> and

>> motivation for the full project.

>> Phase I does not introduce any new user-facing features,

>> but is concerned with the necessary conceptual cleanup of the current

>> datatype system.

>> It provides a more "pythonic" datatype Python type object, with a

>> clear class hierarchy.

>>

>> The second phase is the incremental creation of all APIs necessary to

>> define

>> fully featured datatypes and reorganization of the NumPy datatype

>> system.

>> This phase will thus be primarily concerned with defining an,

>> initially preliminary, stable public API.

>>

>> Some of the benefits of a large refactor may only become evident

>> after the full

>> deprecation of the current legacy implementation (i.e. larger code

>> removals).

>> However, these steps are necessary for improvements to many parts of

>> the

>> core NumPy API, and are expected to make the implementation generally

>> easier to understand.

>>

>> The following figure illustrates the proposed design at a high level,

>> and roughly delineates the components of the overall design.

>> Note that this NEP only covers Phase I (shaded area).
>> The rest encompasses Phase II, whose design choices are still up for
>> discussion. The figure nevertheless highlights that the DType datatype
>> class is the central, necessary concept:

>>

>> .. image:: _static/nep-0041-mindmap.svg

>>

>>

>> First steps directly related to this NEP

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> The changes required to NumPy are large and touch many areas of the
>> code base, but many of these changes can be addressed incrementally.

>>

>> To enable an incremental approach we will start by creating a C

>> defined

>> ``PyArray_DTypeMeta`` class with its instances being the ``DType``

>> classes,

>> subclasses of ``np.dtype``.

>> This is necessary to add the ability of storing custom slots on the

>> DType in C.

>> This ``DTypeMeta`` will be implemented first to then enable

>> incremental

>> restructuring of current code.
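
The metaclass arrangement described above can be pictured in pure Python (a sketch only: the real ``PyArray_DTypeMeta`` is to be written in C, and every name below, including the ``registry`` dictionary and the ``type_num`` attribute, is a hypothetical stand-in chosen for this example):

```python
class DTypeMeta(type):
    """Hypothetical Python stand-in for the C-level ``PyArray_DTypeMeta``."""

    registry = {}  # illustrative: maps a type number to its DType class

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        type_num = namespace.get("type_num")
        if type_num is not None:
            # Class-level information lives on the DType class itself.
            DTypeMeta.registry[type_num] = cls


class DType(metaclass=DTypeMeta):
    """Stand-in for ``np.dtype``; instances carry storage details only."""

    type_num = None

    def __init__(self, itemsize):
        self.itemsize = itemsize


class Float64(DType):
    type_num = 12  # illustrative value


class Bytes(DType):
    type_num = 18  # illustrative value


# Two instances of one DType class that differ only in storage size,
# analogous to "S3" vs. "S8".
s3, s8 = Bytes(3), Bytes(8)
print(type(Float64), type(s3), s3.itemsize, s8.itemsize)
```

The point of the sketch is the split of responsibilities: shared behaviour sits on the class (created by the metaclass), while per-instance storage information such as the item size stays on the instance.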

>>

>> The addition of ``DType`` will then enable addressing other changes
>> incrementally, some of which may begin before settling the full
>> internal API:

>>

>> 1. New machinery for array coercion, with the goal of enabling user

>> DTypes

>> with appropriate class methods.

>> 2. The replacement or wrapping of the current casting machinery.

>> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots

>> into

>> DType method slots.

>>

>> At this point, no or only very limited new public API will be added

>> and

>> the internal API is considered to be in flux.

>> Any new public API may be set up to give warnings and will have
>> leading underscores to indicate that it is not finalized and can be
>> changed without warning.

>>

>>

>> Backward compatibility

>> ----------------------

>>

>> While the actual backward compatibility impact of implementing Phase
>> I and II is not yet fully clear, we anticipate and accept the
>> following changes:

>>

>> * **Python API**:

>>

>> * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,

>> while right

>> now ``type(np.dtype("f8")) is np.dtype``.

>> Code should use ``isinstance`` checks, and in very rare cases may

>> have to

>> be adapted to use it.

>>

>> * **C-API**:

>>

>> * In old versions of NumPy ``PyArray_DescrCheck`` is a macro

>> which uses

>> ``type(dtype) is np.dtype``. When compiling against an old

>> NumPy version,

>> the macro may have to be replaced with the corresponding

>> ``PyObject_IsInstance`` call. (If this is a problem, we could

>> backport

>> fixing the macro)

>>

>> * The UFunc machinery changes will break *limited* parts of the

>> current

>> implementation. Replacing e.g. the default ``TypeResolver`` is

>> expected

>> to remain supported for a time, although optimized masked inner

>> loop iteration

>> (which is not even used *within* NumPy) will no longer be

>> supported.

>>

>> * All functions currently defined on the dtypes, such as

>> ``PyArray_Descr->f->nonzero``, will be defined and accessed

>> differently.

>> This means that in the long run low-level access code will
>> have to be changed to use the new API. Such changes are expected
>> to be necessary in very few projects.

>>

>> * **dtype implementors (C-API)**:

>>

>> * The array which is currently provided to some functions (such as
>>   cast functions) will no longer be provided.
>>   For example ``PyArray_Descr->f->nonzero`` or
>>   ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>>   object with only some fields (mainly the dtype) being valid.
>>   At least in some code paths, a similar mechanism is already used.

>>

>> * The ``scalarkind`` slot and registration of scalar casting will

>> be

>> removed/ignored without replacement.

>> It currently allows partial value-based casting.

>> The ``PyArray_ScalarKind`` function will continue to work for

>> builtin types,

>> but will not be used internally and be deprecated.

>>

>> * Currently user dtypes are defined as instances of ``np.dtype``.

>> The creation works by the user providing a prototype instance.

>> NumPy will need to modify at least the type during registration.

>> This has no effect for either ``rational`` or ``quaternion`` and

>> mutation

>> of the structure seems unlikely after registration.

>>

>> Since there is a fairly large API surface concerning datatypes,
>> further changes, or the limitation of certain functions to currently
>> existing datatypes, are likely to occur.

>> For example functions which use the type number as input

>> should be replaced with functions taking DType classes instead.

>> Although public, large parts of this C-API seem to be used rarely,

>> possibly never, by downstream projects.

>>

>>

>>

>> Detailed Description

>> --------------------

>>

>> This section details the design decisions covered by this NEP.

>> The subsections correspond to the list of design choices presented

>> in the Scope section.

>>

>> Datatypes as Python Classes (1)

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> The current NumPy datatypes are not full-scale Python classes.

>> They are instead (prototype) instances of a single ``np.dtype``

>> class.

>> Changing this means that any special handling, e.g. for ``datetime``,

>> can be moved to the Datetime DType class instead, away from

>> monolithic general

>> code (e.g. current ``PyArray_AdjustFlexibleDType``).

>>

>> The main consequence of this change with respect to the API is that

>> special methods move from the dtype instances to methods on the new

>> DType class.

>> This is the typical design pattern used in Python.

>> Organizing these methods and information in a more Pythonic way

>> provides a

>> solid foundation for refining and extending the API in the future.

>> The current API cannot be extended due to how it is exposed

>> publicly.

>> This means for example that the methods currently stored in

>> ``PyArray_ArrFuncs``

>> on each datatype (see NEP 40) will be defined differently in the

>> future and

>> deprecated in the long run.

>>

>> The most prominent visible side effect of this will be that

>> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.

>> Instead it will be a subclass of ``np.dtype`` meaning that

>> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.

>> This will also add the ability to use

>> ``isinstance(dtype, np.dtype[float64])``,

>> thus removing the need to use ``dtype.kind``, ``dtype.char``, or

>> ``dtype.type``

>> to do this check.
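
The subclass relationship described above can be sketched with a toy model in plain Python. Every name here (``DTypeMeta``, ``dtype``, ``Float64DType``) is a hypothetical stand-in, not NumPy's actual implementation; the point is only that the concrete type changes while ``isinstance`` checks against the base class keep passing.

```python
# Toy model only: these classes are hypothetical stand-ins, not NumPy's.


class DTypeMeta(type):
    """Metaclass, so that every DType is itself a full Python class."""


class dtype(metaclass=DTypeMeta):
    """Stand-in for the common base class (plays the role of np.dtype)."""

    def __init__(self, byteorder="="):
        # Storage details such as byte order live on the instance.
        self.byteorder = byteorder


class Float64DType(dtype):
    """Stand-in for the class the NEP spells np.dtype[float64]."""

    itemsize = 8


descr = Float64DType()

print(type(descr) is dtype)             # the concrete type is now a subclass
print(isinstance(descr, dtype))         # checks against the base class still pass
print(isinstance(descr, Float64DType))  # class-based check, no kind/char needed
```

Code that tests ``type(x) is np.dtype`` would be affected by this change, while code that uses ``isinstance`` would not.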

>>

>> With the design decision of DTypes as full-scale Python classes,

>> the question of subclassing arises.

>> Inheritance, however, appears problematic and a complexity best

>> avoided

>> (at least initially) for container datatypes.

>> Further, subclasses may be more interesting for interoperability, for

>> example with GPU backends (CuPy) storing additional methods related

>> to the GPU, rather than as a mechanism to define new datatypes.

>> A class hierarchy does provide value, however, and this may be achieved by

>> allowing the creation of *abstract* datatypes.

>> An example for an abstract datatype would be the datatype equivalent

>> of

>> ``np.floating``, representing any floating point number.

>> These can serve the same purpose as Python's abstract base classes.
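
A minimal sketch of the abstract-datatype idea, using Python's own ``abc`` module as the analogy the text draws. The class names (``DType``, ``Floating``, ``Float32``, ``Float64``) are assumptions made for illustration and do not describe how NumPy would spell or implement them.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical root of the datatype hierarchy."""

    @property
    @abstractmethod
    def itemsize(self):
        """Size of one element in bytes; concrete DTypes must define it."""


class Floating(DType):
    """Abstract: stands for any floating point number, has no storage."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


# The abstract class is usable for checks but cannot be instantiated.
try:
    Floating()
    instantiated = True
except TypeError:
    instantiated = False

print(instantiated)
print(issubclass(Float64, Floating))
print(isinstance(Float32(), Floating))
```

The abstract class groups the concrete datatypes for ``issubclass`` and ``isinstance`` checks without ever needing an instance of its own.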

>>

>>

>> Scalars should not be instances of the datatypes (2)

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> For simple datatypes such as ``float64`` (see also below), it seems

>> tempting to let the scalar be an instance of ``np.dtype("float64")``.

>> This idea may be even more appealing due to the fact that scalars,

>> rather than datatypes, currently define a useful type hierarchy.

>>

>> However, we have specifically decided against this for a number of

>> reasons.

>> First, the new datatypes described herein would be instances of DType

>> classes.

>> Making these instances themselves classes, while possible, adds

>> additional

>> complexity that users need to understand.

>> It would also mean that scalars must have storage information (such

>> as byteorder)

>> which is generally unnecessary and currently is not used.

>> Second, while the simple NumPy scalars such as ``float64`` may be

>> such instances,

>> it should be possible to create datatypes for Python objects without

>> enforcing

>> NumPy as a dependency.

>> However, Python objects that do not depend on NumPy cannot be

>> instances of a NumPy DType.

>> Third, there is a mismatch between the methods and attributes which

>> are useful

>> for scalars and datatypes. For instance ``to_float()`` makes sense

>> for a scalar

>> but not for a datatype and ``newbyteorder`` is not useful on a scalar

>> (or has

>> a different meaning).

>>

>> Overall, it seems that rather than reducing the complexity, i.e. by merging

>> the two distinct type hierarchies, making scalars instances of DTypes

>> would

>> increase the complexity of both the design and implementation.

>>

>> A possible future path may be to instead simplify the current NumPy

>> scalars to

>> be much simpler objects which largely derive their behaviour from the

>> datatypes.

>>

>> C-API for creating new Datatypes (3)

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> The current C-API with which users can create new datatypes

>> is limited in scope, and requires use of "private" structures. This

>> means

>> the API is not extensible: no new members can be added to the

>> structure

>> without losing binary compatibility.

>> This has already limited the inclusion of new sorting methods into

>> NumPy [new_sort]_.

>>

>> The new version shall thus replace the current ``PyArray_ArrFuncs``

>> structure used

>> to define new datatypes.

>> Datatypes that currently exist and are defined using these slots will

>> be

>> supported during a deprecation period.

>>

>> The most likely solution to hide the implementation from the user

>> and thus make

>> it extensible in the future is to model the API after Python's stable

>> API [PEP-384]_:

>>

>> .. code-block:: C

>>

>>     static struct PyArrayMethodDef slots[] = {
>>         {NPY_dt_method, method_implementation},
>>         ...,
>>         {0, NULL}
>>     };
>>
>>     typedef struct {
>>         PyTypeObject *typeobj;  /* type of Python scalar */
>>         ...;
>>         PyType_Slot *slots;
>>     } PyArrayDTypeMeta_Spec;
>>
>>     PyObject* PyArray_InitDTypeMetaFromSpec(
>>             PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);

>>

>> The C-side slots should be designed to mirror Python side methods

>> such as ``dtype.__dtype_method__``, although the exposure to Python

>> is

>> a later step in the implementation to reduce the complexity of the

>> initial

>> implementation.
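
To illustrate why an id-keyed slot list stays extensible, here is a rough Python model of the pattern. It is not the proposed C-API: the slot ids (``DT_GETITEM``, ``DT_SETITEM``) and the function ``init_dtype_from_spec`` are invented for this sketch. The idea shown is that a spec is a sentinel-terminated list of ``(id, function)`` pairs, so a new slot id can be added later without changing the layout of anything a user has already compiled against.

```python
# Hypothetical slot ids; real ids would be defined by NumPy, not here.
DT_GETITEM = 1
DT_SETITEM = 2
KNOWN_SLOTS = {DT_GETITEM: "getitem", DT_SETITEM: "setitem"}


def init_dtype_from_spec(name, scalar_type, slots):
    """Build a class from a sentinel-terminated list of (slot_id, func)."""
    namespace = {"scalar_type": scalar_type}
    for slot_id, func in slots:
        if slot_id == 0:  # sentinel entry, mirrors {0, NULL} in the C spec
            break
        if slot_id not in KNOWN_SLOTS:
            raise ValueError(f"unknown slot id: {slot_id}")
        namespace[KNOWN_SLOTS[slot_id]] = staticmethod(func)
    return type(name, (), namespace)


def getitem(raw):
    """Convert a raw stored value into the Python scalar."""
    return float(raw)


MyDType = init_dtype_from_spec("MyDType", float, [(DT_GETITEM, getitem), (0, None)])

print(MyDType.getitem("1.5"))
print(hasattr(MyDType, "setitem"))
```

Slots that the user does not fill are simply absent, which is what lets the library add further slot ids later without breaking existing specs.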

>>

>>

>> C-API Changes to the UFunc Machinery (4)

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>>

>> Proposed changes to the UFunc machinery will be part of NEP 43.

>> However, the following changes will be necessary (see NEP 40 for a

>> detailed

>> description of the current implementation and its issues):

>>

>> * The current UFunc type resolution must be adapted to allow better

>> control

>> for user-defined dtypes as well as resolve current inconsistencies.

>> * The inner-loop used in UFuncs must be expanded to include a return

>> value.

>> Further, error reporting must be improved, and passing in

>> dtype-specific

>> information enabled.

>> This requires the modification of the inner-loop function signature

>> and

>> addition of new hooks called before and after the inner-loop is

>> used.

>>

>> An important goal for any changes to the universal functions will be

>> to

>> allow the reuse of existing loops.

>> It should be easy for a new units datatype to fall back to existing

>> math

>> functions after handling the unit related computations.
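
The two goals above (an inner loop that reports errors through a return value, and a units datatype that reuses an existing loop) can be sketched in plain Python. Nothing here is NumPy API: ``float_multiply_loop`` and ``unit_multiply`` are hypothetical names, and units are modelled as bare strings purely for illustration.

```python
def float_multiply_loop(a, b, out):
    """Stand-in for an existing inner loop over plain floats.

    Returns 0 on success and -1 on error, modelling the proposed
    return value of the inner-loop function.
    """
    if not (len(a) == len(b) == len(out)):
        return -1
    for i in range(len(a)):
        out[i] = a[i] * b[i]
    return 0


def unit_multiply(values_a, unit_a, values_b, unit_b):
    """Multiply two unit-carrying sequences."""
    # Step 1: the unit-specific part, handled by the units datatype itself.
    out_unit = f"{unit_a}*{unit_b}"
    # Step 2: fall back to the existing float loop for the numeric work.
    out = [0.0] * len(values_a)
    if float_multiply_loop(values_a, values_b, out) != 0:
        raise ValueError("inner loop reported an error")
    return out, out_unit


result, unit = unit_multiply([1.0, 2.0], "m", [3.0, 4.0], "s")
print(result, unit)
```

The units code never reimplements the multiplication; it only decides the output unit and then delegates, which is the kind of reuse the text asks for.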

>>

>>

>> Discussion

>> ----------

>>

>> See NEP 40 for a list of previous meetings and discussions.

>>

>>

>> References

>> ----------

>>

>> .. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

>> .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262

>> .. [pygeos] https://github.com/caspervdw/pygeos

>> .. [new_sort] https://github.com/numpy/numpy/pull/12945

>> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/

>> .. [PR 15508] https://github.com/numpy/numpy/pull/15508

>>

>> Copyright

>> ---------

>>

>> This document has been placed in the public domain.

>>

>>

>> Acknowledgments

>> ---------------

>>

>> The effort to create new datatypes for NumPy has been discussed for

>> several

>> years in many different contexts and settings, making it impossible

>> to list everyone involved.

>> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and

>> Eric Wieser

>> for repeated in-depth discussion about datatype design.

>> We are very grateful for the community input in reviewing and

>> revising this

>> NEP and would like to thank especially Ross Barnowski and Ralf

>> Gommers.

>>

>> _______________________________________________

>> NumPy-Discussion mailing list

>>

[hidden email] <mailto:

[hidden email]>

>>

https://mail.python.org/mailman/listinfo/numpy-discussion>


>