String accessor methods

String accessor methods

Todd
Currently, working with strings in numpy is not very convenient. You have to use a separate set of functions in a separate namespace, and those functions are relatively limited and poorly documented.

A solution that several other projects, including pandas [0] and xarray [1], have found is string accessor methods. These are a set of methods attached to a `str` attribute of the class. They have the advantage that they are always available and have a well-defined object they operate on. On non-str dtypes, accessing the attribute would raise an exception.

This would also provide a standardized set of methods and behaviors that are part of the numpy api that other classes could depend on. 

An example would be something like this:

>>> mystr = np.array(["test first", "test second", "test third"])
>>> mystr.str.title()
array(['Test First', 'Test Second', 'Test Third'], dtype='<U11')
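
For illustration only, here is a minimal sketch of how such an accessor might behave. It is not an existing numpy API: `StrAccessor` and `StrArray` are hypothetical names, the methods are simply forwarded to the existing `np.char` functions, and raising `TypeError` for non-string dtypes is an assumption. A subclass is used because attributes cannot be added to `ndarray` itself from Python.

```python
import numpy as np


class StrAccessor:
    """Hypothetical accessor that forwards method calls to np.char."""

    def __init__(self, array):
        # Only unicode ('U') and bytes ('S') dtypes are accepted.
        if array.dtype.kind not in ("U", "S"):
            raise TypeError("the .str accessor requires a string dtype")
        self._array = array

    def __getattr__(self, name):
        # Called only for names not found normally; skip private names.
        if name.startswith("_"):
            raise AttributeError(name)
        func = getattr(np.char, name)
        # Bind the wrapped array as the first argument of the np.char function.
        return lambda *args, **kwargs: func(self._array, *args, **kwargs)


class StrArray(np.ndarray):
    """Hypothetical ndarray subclass that exposes a `str` attribute."""

    @property
    def str(self):
        return StrAccessor(np.asarray(self))


mystr = np.array(["test first", "test second", "test third"]).view(StrArray)
print(mystr.str.title())
print(mystr.str.upper())
```

With this sketch, `np.arange(3).view(StrArray).str` raises `TypeError`, which mirrors the "raise an exception on non-str dtypes" behavior described above.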


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Re: String accessor methods

dan_patterson
They are in np.char

mystr = np.array(["test first", "test second", "test third"])

np.char.title(mystr)
array(['Test First', 'Test Second', 'Test Third'], dtype='<U11')
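
A few more of the existing `np.char` functions, as a short sketch of what the function-based namespace offers today (the variable names are only for illustration):

```python
import numpy as np

names = np.array(["test first", "test second", "test third"])

upper = np.char.upper(names)            # element-wise str.upper
joined = np.char.add(names, "!")        # element-wise concatenation
repeated = np.char.multiply(names, 2)   # element-wise repetition
has_second = np.char.find(names, "second") >= 0  # boolean mask of matches

print(upper)
print(joined)
print(repeated)
print(has_second)
```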



Re: String accessor methods

Todd
On Sat, Mar 6, 2021 at 12:57 PM dan_patterson <[hidden email]> wrote:
> They are in np.char
>
> mystr = np.array(["test first", "test second", "test third"])
>
> np.char.title(mystr)
> array(['Test First', 'Test Second', 'Test Third'], dtype='<U11')

I mentioned those in my email, but they are far less convenient to use than class methods, and they do not relate well to how built-in strings are used in Python. That is why other projects have started using accessor methods and why Python removed all the separate string functions in Python 3. The functions in np.char are also limited in their capabilities, and fairly poorly documented in my opinion. Some of those limitations are impossible to overcome. For example, as plain functions they can never support operators such as addition or multiplication, or slicing, the way Python strings can, while an accessor could.

However, putting them as top-level methods on ndarray would pollute the method namespace too much. That is why I am suggesting numpy do the same thing that pandas, xarray, etc. are doing and put those methods under a 'str' attribute of ndarrays rather than in separate functions.
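
To make the point about operators and slicing concrete, here is a rough sketch of what an accessor object could support. `StrOps` is a hypothetical class, not part of numpy; addition and multiplication are delegated to the existing `np.char.add` and `np.char.multiply`, and slicing is done with a plain Python loop for clarity rather than speed. It assumes a unicode (not bytes) array.

```python
import numpy as np


class StrOps:
    """Hypothetical accessor adding operators and slicing to a string array."""

    def __init__(self, array):
        self._array = np.asarray(array)

    def __add__(self, other):
        # Element-wise concatenation, delegated to np.char.add.
        return np.char.add(self._array, other)

    def __mul__(self, count):
        # Element-wise repetition, delegated to np.char.multiply.
        return np.char.multiply(self._array, count)

    def __getitem__(self, key):
        # Index or slice every element the way a Python str would be.
        # A slice is never longer than its source, so the dtype still fits.
        flat = [s[key] for s in self._array.ravel().tolist()]
        return np.array(flat, dtype=self._array.dtype).reshape(self._array.shape)


words = StrOps(np.array(["test first", "test second", "test third"]))
print(words + "!")
print(words * 2)
print(words[:4])
print(words[::-1])
```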


Re: String accessor methods

bashtage
I think that any string functions that are exposed from an ndarray would have to be guaranteed to work in-place. Requiring casting to objects to use the methods feels more like syntactic sugar than an essential case. I think most of the ones mentioned are low performance and can't take advantage of the storage as a blob of int8 (ascii) or int32 (utf32) that underlies NumPy string arrays.
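
As a sketch of the storage layout referred to here: a fixed-width unicode array can be viewed as 32-bit code points and a bytes array as 8-bit values, with shorter strings padded by zeros. The ASCII upper-casing below is only an illustration of working on that buffer directly; it uses unsigned integer views and assumes the default native byte order.

```python
import numpy as np

text = np.array(["ab", "c"])    # unicode array, 4 bytes per character
codes = text.view(np.uint32).reshape(text.shape + (-1,))

raw = np.array([b"ab", b"c"])   # bytes array, 1 byte per character
octets = raw.view(np.uint8).reshape(raw.shape + (-1,))

# Upper-case ASCII letters with integer arithmetic on the code points,
# without creating any Python str objects along the way.
is_lower = (codes >= ord("a")) & (codes <= ord("z"))
upper = (codes - np.where(is_lower, 32, 0)).astype(np.uint32)
result = upper.view(text.dtype).reshape(text.shape)

print(codes)
print(octets)
print(result)
```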

I also think the existence of these in pandas reduces the case for them being in Numpy. 


Re: String accessor methods

Sebastian Berg
On Sun, 2021-03-07 at 09:34 +0000, Kevin Sheppard wrote:

> I think that any string functions that are exposed from an ndarray
> would have to be guaranteed to work in-place. Requiring casting to
> objects to use the methods feels more like syntactic sugar than an
> essential case. I think most of the ones mentioned are low
> performance and can't take advantage of the storage as a blob of
> int8 (ascii) or int32 (utf32) that underlies NumPy string arrays.
>
> I also think the existence of these in pandas reduces the case for
> them being in NumPy.

I agree with this; the need seems much lower in NumPy. And, at least
for me, NumPy's currently somewhat weird strings make it even less
appealing to expose more string utilities of any kind at this time.

In general, there is probably something to be said for such an
"accessor", in the sense of having a place to put methods which are
specific to the array's dtype.  Other examples are datetime/timedelta
or Units and probably many potential DTypes [1]. It is one advantage
that the `astropy.units.Quantity` subclass has over a DType based
solution: `methods` can be added very transparently.

Basically: The current `np.char` functions are a bit weird and I would
need quite a bit more convincing to expose them at this time.
But I would be delighted if we can think of a solution that goes
beyond `str` [2].  That probably means not adding `ndarray.str` at
all, or only if the array has a string DType.
But do it in a way that generalizes!  That could be a DType specific
mixin class, or I had previously played with the thought of a "generic"
accessor:
    `ndarray.elementwise.<ufuncs-provided-by-DType>`

But those go beyond the original string request and need some smart
idea/thoughts!
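
Purely as a thought experiment, and not a proposal for how NumPy would implement it, a "generic" accessor could be sketched as a lookup from the array's dtype to a namespace of element-wise functions. Everything here is hypothetical: the `_ELEMENTWISE` registry, the `elementwise` helper, and the choice to key on `dtype.kind`.

```python
from functools import partial
from types import SimpleNamespace

import numpy as np

# Hypothetical registry: dtype kind -> {method name: element-wise function}.
_ELEMENTWISE = {
    "U": {"upper": np.char.upper, "title": np.char.title},
    "c": {"conjugate": np.conjugate},
}


def elementwise(array):
    """Return a namespace of the functions registered for the array's dtype."""
    array = np.asarray(array)
    funcs = _ELEMENTWISE.get(array.dtype.kind, {})
    # Bind the array as the first argument of every registered function.
    return SimpleNamespace(**{name: partial(f, array) for name, f in funcs.items()})


print(elementwise(np.array(["ab", "cd"])).upper())
print(elementwise(np.array([1 + 2j])).conjugate())
print(hasattr(elementwise(np.arange(3)), "upper"))
```

Arrays whose dtype has nothing registered simply get an empty namespace, which is one possible answer to the "only if the array has a string DType" question.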

An interesting aside is that `arr.imag` and `arr.real` fall into the
same category. But they are narrow enough that we can just have a
specific solution for them.
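
A small sketch of that existing dtype-dependent behavior: for complex arrays `.real` and `.imag` are writable views into the array, while for real arrays `.imag` is a read-only array of zeros.

```python
import numpy as np

z = np.array([1 + 2j, 3 - 4j])
shares = np.shares_memory(z.real, z)   # .real is a view of the complex data
z.imag[:] = 0                          # writes through to z

r = np.array([1.0, 2.0])
r_imag = r.imag                        # zeros, and not writeable

print(shares)
print(z)
print(r_imag, r_imag.flags.writeable)
```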

Cheers,

Sebastian



[1] Datetimes/timedelta might have some use of basic timezone handling
(not sure if relevant to NumPy's naive datetimes).

`astropy.units.Quantity` has a few extra methods/properties:

* `.cgs`, `.si`, `.decompose()`, `.to()`: cast to different unit.
* `.unit`
* `.value`: get a value array view without any unit.
* `.to_value()` method that returns a copy, not a view.

Of course we can spell those using DTypes, but I think it might be
long: `arr.astype(arr.dtype.cgs)`, or `arr.view(arr.dtype.unitless)`.
Utility functions similar to `np.char` also can simplify all of this,
but methods do have merit.
Other user DTypes could very well have more compelling use-cases.


[2] But it probably won't reach my serious thinking cycles for a while.
For starters, dedicated utility functions seem decent enough...

