Changing the size of PyArrayObject_fields (the ndarray c-struct)

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Changing the size of PyArrayObject_fields (the ndarray c-struct)

mattip
Administrator
PyArrayObject_fields is the c-struct that underlies ndarray. It is
defined in ndarraytypes.h [0]. Since version 1.7, we have been trying to
hide it from the public C-API so that we can freely modify it, the
structure has the comment:


  * It has been recommended to use the inline functions defined below
  * (PyArray_DATA and friends) to access fields here for a number of
  * releases. Direct access to the members themselves is deprecated.
  * To ensure that your code does not use deprecated access,
  * #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
  * (or NPY_1_8_API_VERSION or higher as required).



In order to clean up buffer exports, Sebastian suggested (and I pushed
for supporting) PR 16938 [1] which would add a new field to the struct.
As Eric pointed out on the pull request, this would change the size of
the struct, meaning users of the struct (i.e., subclassing it in C)
would have to be very careful interacting with NumPy-generated objects
which may have changed sizes.


Or we should give up and declare that we cannot change the size of the
struct until we release a NumPy 2.0?


Are there real-world cases that changing the size of the struct would
break? I admit I have an agenda to further modify the struct in upcoming
versions to better support things like alternative data memory allocator
strategies, so in my opinion it would be a shame if we are stuck forever
with the current struct.


Matti


[0]
https://github.com/numpy/numpy/blob/v1.19.4/numpy/core/include/numpy/ndarraytypes.h#L659

[1] https://github.com/numpy/numpy/pull/16938

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Changing the size of PyArrayObject_fields (the ndarray c-struct)

Charles R Harris


On Sat, Nov 21, 2020 at 12:28 PM Matti Picus <[hidden email]> wrote:
PyArrayObject_fields is the c-struct that underlies ndarray. It is
defined in ndarraytypes.h [0]. Since version 1.7, we have been trying to
hide it from the public C-API so that we can freely modify it, the
structure has the comment:


  * It has been recommended to use the inline functions defined below
  * (PyArray_DATA and friends) to access fields here for a number of
  * releases. Direct access to the members themselves is deprecated.
  * To ensure that your code does not use deprecated access,
  * #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
  * (or NPY_1_8_API_VERSION or higher as required).



In order to clean up buffer exports, Sebastian suggested (and I pushed
for supporting) PR 16938 [1] which would add a new field to the struct.
As Eric pointed out on the pull request, this would change the size of
the struct, meaning users of the struct (i.e., subclassing it in C)
would have to be very careful interacting with NumPy-generated objects
which may have changed sizes.


Or we should give up and declare that we cannot change the size of the
struct until we release a NumPy 2.0?


Are there real-world cases that changing the size of the struct would
break? I admit I have an agenda to further modify the struct in upcoming
versions to better support things like alternative data memory allocator
strategies, so in my opinion it would be a shame if we are stuck forever
with the current struct.


Matti


I think the risk is small and this will probably work, the potential problem is if people have extended the essentially private structure in C code. That said, it might be better to put it in after 1.20.x is branched so that it is out there for others to test against, and perhaps implement fixes if needed. We do need to make this move at some point and can't stay fixed on an old structure forever, so the proposed change is something of a warning shot.

Chuck 

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Changing the size of PyArrayObject_fields (the ndarray c-struct)

Sebastian Berg
On Sat, 2020-11-21 at 14:42 -0700, Charles R Harris wrote:

> On Sat, Nov 21, 2020 at 12:28 PM Matti Picus <[hidden email]>
> wrote:
>
> > PyArrayObject_fields is the c-struct that underlies ndarray. It is
> > defined in ndarraytypes.h [0]. Since version 1.7, we have been
> > trying to
> > hide it from the public C-API so that we can freely modify it, the
> > structure has the comment:
> >
> >
> >   * It has been recommended to use the inline functions defined
> > below
> >   * (PyArray_DATA and friends) to access fields here for a number
> > of
> >   * releases. Direct access to the members themselves is
> > deprecated.
> >   * To ensure that your code does not use deprecated access,
> >   * #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
> >   * (or NPY_1_8_API_VERSION or higher as required).
> >
> >
> >
> > In order to clean up buffer exports, Sebastian suggested (and I
> > pushed
> > for supporting) PR 16938 [1] which would add a new field to the
> > struct.

TL;DR:  Unless we find real examples of affected users I currently
think we should just do it. I don't see much gain in pushing it off
(but I don't care).
I simply expect exceedingly few affected users compared to small gains
for everyone else (including our flexibility for future improvements).


> > As Eric pointed out on the pull request, this would change the size
> > of
> > the struct, meaning users of the struct (i.e., subclassing it in C)
> > would have to be very careful interacting with NumPy-generated
> > objects
> > which may have changed sizes.
> >

Access to the struct remains ABI compatible. Only those relying on the
size will have issues.  Using the struct size is mainly relevant when
subclassing in C.  Just to note: Cython appears not relevant [0].


Adapting to this change may not be pretty, but it should not be very
tricky to work around it [1].
Without fixes/recompilation you hopefully get an error. But arbitrarily
weird things could happen. (Most likely a crash.)

I currently hope that so few enough users will run into this, that we
can just help every one who reports an issue...


> >
> >
> I think the risk is small and this will probably work, the potential
> problem is if people have extended the essentially private structure
> in C
> code. That said, it might be better to put it in after 1.20.x is
> branched
> so that it is out there for others to test against, and perhaps
> implement


I don't mind waiting. Although, I also don't really see us learning
much since only the bigger projects tend to test against master.

Currently, I am aware of only one tiny project that will run into this
and the author of it seemed fine with having to adapt [2].

If we find more projects that are broken by this, I may change my mind.
Until then, I think we should go ahead.


For those who got this far... Aside from code cleanup, the PR also
reduces some overheads. Some quick, approximate timings:

* `memoryview(arr)` is 20% faste. This also helps typed memoryviews
   in cython e.g.:

     cdef myfunc(double[::1] data):
         pass

  has the same speed-up (about 15% faster without any function body).
* `arr[...]` is about 20+% faster. This is because the PR speeds up
  deletion of every NumPy array object by a small bit.

I do not know whether this actually matters for real world code, likely
not significantly... (It would be cool to have a pool of real world
benchmarks to test this type of thing).

Cheers,

Sebastian


[0] My attempt at this using:

   cdef class myarr(np.ndarray):
       pass

failed. It achieved nothing but crashes with current NumPy.



[1] There are three approaches:

1. Just recompile (using the "deprecated api"):

     struct MyArraySubclass {
         char[sizeof(PyArrayObject_fields)];
         int my_field;
     }

    And in the module init function, you should add:

     if (sizeof(PyArrayObject_fields) < PyArrayObject_Type.tp_basicsize) {
         PyErr_SetString(PyExc_RuntimeError,
             "Module not binary compatible with NumPy, please recompile.")
         return -1;
     }

2. Do the same as 1., but add a small constant:

     sizeof(PyArrayObject_fields) + constant

   That way you can compile with an old NumPy version, but ensure
   compatibility with newer versions. (You still need the check while
   loading the module).

3. You go to lengths to achieve 100% binary compatibility no matter
   what and avoid `sizeof(PyArrayObject_fields)` entirely:
   
     size_t offset_of_myfields = PyArrayObject_Type.tp_basicsize + alignment_padding;
     MyArraySubclass->tp_basicsize = offset_of_myfields + sizeof(my_fields);

   And to access `my_fields` you have to use a small macro such as:
   
     #define MY_FIELDS(obj) ((my_fields *)((char *)obj + offset_of_myfields))
   

My personal guess is that solution 2. is sufficient for most small
projects, as it allows you to compile in a forward compatible way, but
the last is the perfectly clean solution of course.


[2] https://github.com/patmiller/bignumpy Also note that it may be that
the project can be easily replaced with a simpler solution that does
not require C-subclassing at all.


> fixes if needed. We do need to make this move at some point and can't
> stay
> fixed on an old structure forever, so the proposed change is
> something of a
> warning shot.
>
> Chuck
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

signature.asc (849 bytes) Download Attachment