Array and string interoperability

Array and string interoperability

Mikhail V

Just sharing my thoughts and a few ideas about simplifying the casting of
strings to arrays.
In the examples, assume NumPy is in the namespace (from numpy import *).

Initializing an array from a string currently looks like this:

s= "012 abc"
A= fromstring(s,"u1")
print A ->
[48 49 50 32 97 98 99]

Perfect.
Now, writing values does not work as (IMO) it should;
namely, consider this example:

B= zeros(7,"u1")
B[0]=s[1]
print B ->
[1 0 0 0 0 0 0]

Ugh? It tries to parse the s[1] character "1" as an integer and writes 1 to B[0].
The first thing I would expect is a ValueError, and I'd never expect it to do
such high-level manipulations with parsing.
IMO it should ideally do the following instead:

B[0]=s[1]
print B ->
[49  0  0  0  0  0  0]

So it should just write ord(s[1]) to B.
Sounds logical? To me, very much so.
Further, one could write like this:

B[:] = s
print B->
[48 49 50 32 97 98 99]

Namely, cast the string into a byte array. IMO this would be
the logical, expected behavior.
Currently it just throws a ValueError when it meets non-digits in the string,
so IMO the current casting can hardly be of practical use.
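Under current NumPy the intended result can already be obtained with explicit calls, e.g. (a sketch; frombuffer avoids the parsing path entirely):

```python
import numpy as np

s = "012 abc"
B = np.zeros(7, "u1")

# write the code point explicitly instead of relying on string parsing
B[0] = ord(s[1])                 # stores 49

# fill the whole array with the string's byte values
B[:] = np.frombuffer(s.encode("ascii"), dtype="u1")
print(B)                         # [48 49 50 32 97 98 99]
```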

Furthermore, I think this code:

A= array(s,"u1")

Could act exactly the same as:

A= fromstring(s,"u1")

But this is just a side idea for spelling simplicity/generality.
Not really necessary.

Further thoughts:
If trying to create a "u1" array from a Python 3 string, the question is
whether it should throw an error. I think yes, and in this case
the "u4" type should be explicitly specified at initialisation, I suppose.
And e.g. translation from unicode to extended ascii (Latin-1) or whatever
should be done on the Python side or with an explicit translation.

Python 3 assumes 4-byte strings, but in reality most of the time
we deal with 1-byte strings, so there is a huge waste of resources
when dealing with 4 bytes. For many serious projects it is just not needed.

Furthermore, I think some of the methods from the "chararray" submodule
should be usable directly on normal integer arrays without
conversions to other array types.
So I personally don't really get the need for an additional chararray type.
It's all numbers anyway, and it's up to the programmer to
decide what size of translation tables/value ranges he wants to use.

There could be some convenience methods for ascii operations,
like e.g. char.toupper(), but currently they don't seem to work with integer
arrays, so why not make those potentially useful methods usable
and have them work on normal integer arrays?
Or even migrate them to the root namespace to e.g. introduce
names with prefixes:

A=ascii_toupper(A)
A=ascii_tolower(A)
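As a sketch of what such a function could look like on plain uint8 arrays (ascii_toupper here is a hypothetical name, not an existing NumPy function):

```python
import numpy as np

def ascii_toupper(arr):
    """Hypothetical helper: uppercase the ASCII letters a-z in an integer array."""
    is_lower = (arr >= ord("a")) & (arr <= ord("z"))
    # subtract 32 only where the element is a lowercase ASCII letter
    return np.where(is_lower, arr - 32, arr).astype(arr.dtype)

a = np.frombuffer(b"abc 012", dtype="u1")
print(ascii_toupper(a))   # [65 66 67 32 48 49 50]
```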

Many things can be achieved with general numeric methods,
e.g. translating/reducing the array. Here I obviously mean not dynamic
arrays, just fixed-size arrays. How to deal with dynamically
changing array sizes is another problem, and it depends on how the
software is designed in the first place and what it does with the data.

For my own text-editing software project I consider only fixed-allocated 1D
and 2D "uint8" arrays. And specifically, I experiment with my own encodings,
so just as a side note, I don't think much about encoding should be assumed
when creating new array types; it is up to the programmer
to decide what 'meanings' the bytes have.


Kind regards,
Mikhail
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: Array and string interoperability

Thomas Jollans
On 04/06/17 20:04, Mikhail V wrote:

> Initialize array from a string currently looks like:
>
> s= "012 abc"
> A= fromstring(s,"u1")
> print A ->
> [48 49 50 32 97 98 99]
>
> Perfect.
> Now when writing values it will not work
> as IMO it should; namely, consider this example:
>
> B= zeros(7,"u1")
> B[0]=s[1]
> print B ->
> [1 0 0 0 0 0 0]
>
> Ugh? It tries to parse the s[1] character "1" as integer and writes 1 to B[0].
> First thing I would expect is a value error and I'd never expect it does
> that high-level manipulations with parsing.
> IMO ideally it would do the following instead:
>
> B[0]=s[1]
> print B ->
> [49  0  0  0  0  0  0]
>
> So it should just write ord(s[1]) to B.
> Sounds logical? For me very much.
> Further, one could write like this:
>
> B[:] = s
> print B->
> [48 49 50 32 97 98 99]
>
> Namely cast the string into byte array. IMO this would be
> the logical expected  behavior.
I disagree. If numpy treated bytestrings as sequences of uint8s (which
would, granted, be perfectly reasonable, at least in py3), you wouldn't
have needed the fromstring function in the first place. Personally, I
think I would prefer this, actually. However, numpy normally treats
strings as objects that can sometimes be cast to numbers, so this
behaviour is perfectly logical.

For what it's worth, in Python 3 (which you probably should want to be
using), everything behaves as you'd expect:

>>> import numpy as np
>>> s = b'012 abc'
>>> a = np.fromstring(s, 'u1')
>>> a
array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>> b = np.zeros(7, 'u1')
>>> b[0] = s[1]
>>> b
array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)
>>>

> Currently it just throws the value error if met non-digits in a string,
> so IMO current casting hardly can be of practical use.
>
> Furthermore, I think this code:
>
> A= array(s,"u1")
>
> Could act exactly same as:
>
> A= fromstring(s,"u1")
>
> But this is just a side-idea for spelling simplicity/generality.
> Not really necessary.

There is also something to be said for the current behaviour:

>>> np.array('100', 'u1')
array(100, dtype=uint8)

However, the fact that this works for bytestrings on Python 3 is, in my
humble opinion, ridiculous:

>>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
array(100, dtype=uint8)

This is of course consistent with the fact that you can cast a
bytestring to builtin python int or float (but not complex).
Interestingly enough, numpy complex behaves differently from python complex:

>>> complex(b'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: complex() argument must be a string or a number, not 'bytes'
>>> complex('1')
(1+0j)
>>> np.complex128('1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a float is required
>>>


> Further thoughts:
> If trying to create "u1" array from a Python 3 string, question is,
> whether it should throw an error, I think yes, and in this case
> "u4" type should be explicitly specified by initialisation, I suppose.
> And e.g. translation from unicode to extended ascii (Latin1) or whatever
> should be done on Python side  or with explicit translation.

If you ask me, passing a unicode string to fromstring with sep='' (i.e.
to parse binary data) should ALWAYS raise an error: the semantics only
make sense for strings of bytes.

Currently, there appears to be some UTF-8 conversion going on, which
creates potentially unexpected results:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u1')
>>> a
array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>>> assert len(a) * a.dtype.itemsize  == len(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>>

This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
how the internals of Python deal with unicode strings in C code, and not
due to anything numpy is doing.

Speaking of unexpected results, I'm not sure you realize what fromstring
does when you give it a multi-byte dtype:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u4')
>>> a
array([2999890382, 3033445326], dtype=uint32)
>>>

Give fromstring() a numpy unicode string, and all is right with the world:

>>> s = np.array('αβγδ')
>>> s
array('αβγδ',
      dtype='<U4')
>>> np.fromstring(s, 'u4')
array([945, 946, 947, 948], dtype=uint32)
>>>


IMHO calling fromstring(..., sep='') with a unicode string should be
deprecated and perhaps eventually forbidden. (Or fixed, but that would
break backwards compatibility)

> Python3 assumes 4-byte strings but in reality most of the time
> we deal with 1-byte strings, so there is huge waste of resources
> when dealing with 4-bytes. For many serious projects it is just not needed.

That's quite enough anglo-centrism, thank you. For when you need byte
strings, Python 3 has a type for that. For when your strings contain
text, bytes with no information on encoding are not enough.

> Furthermore I think some of the methods from "chararray" submodule
> should be possible to use directly on normal integer arrays without
> conversions to other array types.
> So I personally don't really get why the need of additional chararray type,
> Its all numbers anyway and it's up to the programmer to
> decide what size of translation tables/value ranges he wants to use.

chararray is deprecated.

> There can be some convenience methods for ascii operations,
> like eg char.toupper(), but currently they don't seem to work with integer
> arrays so why not make those potentially useful methods usable
> and make them work on normal integer arrays?
I don't know what you're doing, but I don't think numpy is normally the
right tool for text manipulation...

> [snip]
>
> as a side-note, I don't think that encoding should be assumed much for
> creating new array types, it is up to the programmer
> to decide what 'meanings' the bytes have.

Agreed!



-- Thomas


Re: Array and string interoperability

Chris Barker - NOAA Federal
Just a few notes:

However, the fact that this works for bytestrings on Python 3 is, in my
humble opinion, ridiculous:

>>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
array(100, dtype=uint8)

Yes, that is a mis-feature -- I think due to bytes and str being the same type in py2 -- so on py3, numpy continues to treat a bytes object as also a 1-byte-per-char string, depending on context. And users want to be able to write numpy code that runs the same on py2 and py3, so we kinda need this kind of thing.

Makes me think that an optional "pure-py-3" mode for numpy might be a good idea. If that flag is set, your code will only run on py3 (or at least might run differently).
 
> Further thoughts:
> If trying to create "u1" array from a Python 3 string, question is,
> whether it should throw an error, I think yes,

well, you can pass numbers > 255 into a u1 already:

In [96]: np.array(456, dtype='u1')

Out[96]: array(200, dtype=uint8)

and it does the wrap-around overflow thing... so why not?
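For reference, the same wrap-around can be reproduced explicitly with astype, which performs a C-style cast (a sketch):

```python
import numpy as np

# astype casts C-style, so values wrap modulo 256 for uint8
a = np.array([456]).astype("u1")
print(a)   # [200], since 456 % 256 == 200
```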
 
and in this case
> "u4" type should be explicitly specified by initialisation, I suppose.
> And e.g. translation from unicode to extended ascii (Latin1) or whatever
> should be done on Python side  or with explicit translation.

absolutely!

If you ask me, passing a unicode string to fromstring with sep='' (i.e.
to parse binary data) should ALWAYS raise an error: the semantics only
make sense for strings of bytes.

exactly -- we really should have a "frombytes()" alias for fromstring(), and it should only work for actual bytes objects (strings on py2, naturally).

and overloading fromstring() to mean both "binary dump of data" and "parse the text" due to whether the sep argument is set was always a bad idea :-(

.. and fromstring(s, sep=a_sep_char)
 
has been semi broken (or at least not robust) forever anyway.
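For the text-parsing use, a plain split fed to np.array is a robust substitute for fromstring(s, sep=...) (a sketch):

```python
import numpy as np

s = "1 2 3 4"
# split in Python, convert in numpy -- no ambiguity about binary vs text
a = np.array(s.split(), dtype=np.int64)
print(a)   # [1 2 3 4]
```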

Currently, there appears to be some UTF-8 conversion going on, which
creates potentially unexpected results:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u1')
>>> a
array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>>> assert len(a) * a.dtype.itemsize  == len(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>>

This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
how the internals of Python deal with unicode strings in C code, and not
due to anything numpy is doing.

exactly -- py3 strings are a pretty nifty implementation of unicode text -- they have nothing to do with storing binary data, and should not be used that way. There is essentially no reason you would ever want to pass their actual binary representation to any other code.

fromstring should be re-named frombytes, and it should raise an exception if you pass something other than a bytes object (or maybe a memoryview or other binary container?)

we might want to keep fromstring() for parsing strings, but only if it were fixed...
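Note that np.frombuffer already has roughly the proposed semantics: it accepts bytes-like objects and rejects a Python 3 str, which doesn't expose the buffer protocol (a sketch):

```python
import numpy as np

print(np.frombuffer(b"012", dtype="u1"))   # [48 49 50]

# a Python 3 str has no buffer interface, so frombuffer rejects it
try:
    np.frombuffer("012", dtype="u1")
except TypeError:
    print("str rejected")
```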

IMHO calling fromstring(..., sep='') with a unicode string should be
deprecated and perhaps eventually forbidden. (Or fixed, but that would
break backwards compatibility)

agreed.

> Python3 assumes 4-byte strings but in reality most of the time
> we deal with 1-byte strings, so there is huge waste of resources
> when dealing with 4-bytes. For many serious projects it is just not needed.

That's quite enough anglo-centrism, thank you. For when you need byte
strings, Python 3 has a type for that. For when your strings contain
text, bytes with no information on encoding are not enough.

There was a big thread about this recently -- it seems to have not quite come to a conclusion. But anglo-centrism aside, there is substantial demand for a "smaller" way to store mostly-ascii text.

I _think_ the conversation was steering toward an encoding-specified string dtype, so us anglo-centric folks could use latin-1 or utf-8.

But someone would need to write the code.

-CHB

> There can be some convenience methods for ascii operations,
> like eg char.toupper(), but currently they don't seem to work with integer
> arrays so why not make those potentially useful methods usable
> and make them work on normal integer arrays?
I don't know what you're doing, but I don't think numpy is normally the
right tool for text manipulation...

I agree here. But if one were to add such a thing (vectorized string operations) -- I'd think the thing to do would be to wrap (or port) the python string methods. But it should only work for actual string dtypes, of course.

note that another part of the discussion previously suggested that we have a dtype that wraps a native python string object -- then you'd get it all for free. This is essentially an object array with strings in it, which you can do now.

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: Array and string interoperability

Thomas Jollans
On 05/06/17 19:40, Chris Barker wrote:

>
>     If you ask me, passing a unicode string to fromstring with sep=''
>     (i.e.
>     to parse binary data) should ALWAYS raise an error: the semantics only
>     make sense for strings of bytes.
>
>
> exactly -- we really should have a "frombytes()" alias for
> fromstring() and it should only work for actual bytes objects (strings
> on py2, naturally).
>
> and overloading fromstring() to mean both "binary dump of data" and
> "parse the text" due to whether the sep argument is set was always a
> bad idea :-(
>
> .. and fromstring(s, sep=a_sep_char)

As it happens, this is pretty much what stdlib bytearray does since 3.2
(http://bugs.python.org/issue8990)


>  
> has been semi broken (or at least not robust) forever anyway.
>
>     Currently, there appears to be some UTF-8 conversion going on, which
>     creates potentially unexpected results:
>
>     >>> s = 'αβγδ'
>     >>> a = np.fromstring(s, 'u1')
>     >>> a
>     array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>     >>> assert len(a) * a.dtype.itemsize  == len(s)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     AssertionError
>     >>>
>
>     This is, apparently (https://github.com/numpy/numpy/issues/2152
>     <https://github.com/numpy/numpy/issues/2152>), due to
>     how the internals of Python deal with unicode strings in C code,
>     and not
>     due to anything numpy is doing.
>
>
> exactly -- py3 strings are pretty nifty implementation of unicode text
> -- they have nothing to do with storing binary data, and should not be
> used that way. There is essentially no reason you would ever want to
> pass the actual binary representation to any other code.
>
> fromstring should be re-named frombytes, and it should raise an
> exception if you pass something other than a bytes object (or maybe a
> memoryview or other binary container?)
>
> we might want to keep fromstring() for parsing strings, but only if it
> were fixed...
>
>     IMHO calling fromstring(..., sep='') with a unicode string should be
>     deprecated and perhaps eventually forbidden. (Or fixed, but that would
>     break backwards compatibility)
>
>
> agreed.
>
>     > Python3 assumes 4-byte strings but in reality most of the time
>     > we deal with 1-byte strings, so there is huge waste of resources
>     > when dealing with 4-bytes. For many serious projects it is just
>     not needed.
>
>     That's quite enough anglo-centrism, thank you. For when you need byte
>     strings, Python 3 has a type for that. For when your strings contain
>     text, bytes with no information on encoding are not enough.
>
>
> There was a big thread about this recently -- it seems to have not
> quite come to a conclusion. But anglo-centrism aside, there is
> substantial demand for a "smaller" way to store mostly-ascii text.
>
> I _think_ the conversation was steering toward an encoding-specified
> string dtype, so us anglo-centric folks could use latin-1 or utf-8.
>
> But someone would need to write the code.
>
> -CHB
>
>     > There can be some convenience methods for ascii operations,
>     > like eg char.toupper(), but currently they don't seem to work
>     with integer
>     > arrays so why not make those potentially useful methods usable
>     > and make them work on normal integer arrays?
>     I don't know what you're doing, but I don't think numpy is
>     normally the
>     right tool for text manipulation...
>
>
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port)
> the python string methods. But it should only work for actual string
> dtypes, of course.
>
> note that another part of the discussion previously suggested that we
> have a dtype that wraps a native python string object -- then you'd
> get all for free. This is essentially an object array with strings in
> it, which you can do now.
>
> -CHB
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> [hidden email] <mailto:[hidden email]>
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion


--
Thomas Jollans

m ☎ +31 6 42630259
e ✉ [hidden email]


Re: Array and string interoperability

Chris Barker - NOAA Federal
On Mon, Jun 5, 2017 at 1:51 PM, Thomas Jollans <[hidden email]> wrote:
> and overloading fromstring() to mean both "binary dump of data" and
> "parse the text" due to whether the sep argument is set was always a
> bad idea :-(
>
> .. and fromstring(s, sep=a_sep_char)

As it happens, this is pretty much what stdlib bytearray does since 3.2
(http://bugs.python.org/issue8990)

I'm not sure that the array.array.fromstring() ever parsed the data string as text, did it?

Anyway, this is what array.array now has:

array.frombytes(s)
    Appends items from the string, interpreting the string as an array of machine values (as if it had been read from a file using the fromfile() method).
    New in version 3.2: fromstring() is renamed to frombytes() for clarity.

array.fromfile(f, n)
    Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won't do.

array.fromstring()
    Deprecated alias for frombytes().

I think numpy should do the same.
And frombytes() should drop the "sep" parameter. If someone wants to write a fast, efficient, simple text parser, then it should get a new name: fromtext() maybe???
And the fromfile() sep argument should be deprecated as well, for the same reasons.
array also has:

array.fromunicode(s)
    Extends this array with data from the given unicode string. The array must be a type 'u' array; otherwise a ValueError is raised. Use array.frombytes(unicodestring.encode(enc)) to append Unicode data to an array of some other type.

which I think would be better supported by:

np.frombytes(str.encode('UCS-4'), dtype=uint32)
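np.frombytes is a proposed, not an existing, name; with today's API the equivalent is frombuffer on explicitly encoded bytes (a sketch):

```python
import numpy as np

s = "αβγδ"
# explicit little-endian UTF-32: fixed 4 bytes per character, no BOM
a = np.frombuffer(s.encode("utf-32-le"), dtype="<u4")
print(a)   # the code points: 945 946 947 948
```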
-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: Array and string interoperability

Mikhail V
In reply to this post by Thomas Jollans
On 4 June 2017 at 23:59, Thomas Jollans <[hidden email]> wrote:

>
>
> For what it's worth, in Python 3 (which you probably should want to be
> using), everything behaves as you'd expect:
>
>>>> import numpy as np
>>>> s = b'012 abc'
>>>> a = np.fromstring(s, 'u1')
>>>> a
> array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>>> b = np.zeros(7, 'u1')
>>>> b[0] = s[1]
>>>> b
> array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)
>>>>


OK, examples work best.
I think we have to separate the cases though.
So I will do the examples in recent Python 3 now to avoid confusion.
Case divisions:

-- classify by "forward/backward" conversion:
    For now, consider only forward, i.e. I copy data from a string
to a numpy array

-- classify by " bytes  vs  ordinals ":

a)  bytes:  If I need raw bytes - in this case e.g.

  B = bytes(s.encode())

will do it; then I can copy the data to an array. So currently there are methods
covering this. If I understand correctly, the extracted data corresponds
to a utf-?? byte feed, i.e. a non-constant byte length per char (1 up to
4 bytes per char for
the 'wide' unicode; correct me if I am wrong).

b):  I need *ordinals*
  Yes, I need ordinals, so for the bytes() method: if a Python 3
string contains only
  basic ascii, I can one way or another convert to bytes and then to an
integer array, and the length will
  be the same, 1 byte for each char.
  Syntactically, and with slicing, this will look e.g. like:

s= "012 abc"
B = bytes(s.encode())  # convert to bytes
k  = len(s)
arr = np.zeros(k,"u1")   # init empty array length k
arr[0:2] = list(B[0:2])
print ("my array: ", arr)
->
my array:  [48 49  0  0  0  0  0]

The result seems correct. Note that I also need to use list(B), otherwise
the slicing does not work (it fills both values with 1; no idea where the 1
comes from).
Or I can write e.g.:
arr[0:2] = np.fromstring(B[0:2], "u1")

But it indeed looks like a 'hack' and is not so simple.
Considering your other examples there is another (better?) way, see below.
Note, I personally don't know the best practices and many technical nuances
here, so I repeat it from your words.


-- classify by "what is the maximal ordinal value in the string"
Well, say I don't know what the maximal ordinal is; e.g. here I take
3 Cyrillic letters instead of 'abc':

s= "012 АБВ"
k  = len(s)
arr = np.zeros(k,"u4")   # init empty 32 bit array length k
arr[:] = np.fromstring(np.array(s),"u4")
->
[  48   49   50   32 1040 1041 1042]


This gives correct results indeed. So I get my ordinals as expected.
So this is better/preferred way, right?

Ok...
Just some further thoughts on the topic:
I would want to do the above things in simpler syntax.
For example, if there were methods taking Python strings:

arr = np.ordinals(s)
arr[0:2] = np.ordinals(s[0:2])  # with slicing

or, e.g. in such format:

arr = np.copystr(s)
arr[0:2] = np.copystr(s[0:2])

Which would give me same result as your proposed :

arr = np.fromstring(np.array(s),"u4")
arr[0:2] = np.fromstring(np.array(s[0:2]),"u4")

IOW, omitting the "u4" parameter seems to be OK. E.g.
if a "u1" array is on the left side of the assignment, the values would be
silently wrapped(?) according to Numpy rules (as Chris pointed out).
And in similar way backward conversion to Python string.

Though for Python 2 this could raise the question of why casting to "u4" is needed.

It would be cool to just use = without any methods, as I originally supposed,
but as I understand now, that behaviour is already occupied, and touching it
would cause backward-compatibility issues.
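The proposed np.ordinals could be sketched in a few lines on top of the current API (the name and behaviour here are my assumption, mirroring the proposal above):

```python
import numpy as np

def ordinals(s):
    """Hypothetical helper: code points of a str as a writable uint32 array."""
    # copy() because frombuffer returns a read-only view of the bytes
    return np.frombuffer(s.encode("utf-32-le"), dtype="<u4").copy()

arr = ordinals("012 АБВ")
print(arr)                  # code points: 48 49 50 32 1040 1041 1042
arr[0:2] = ordinals("ab")   # slice assignment, as in the proposal
```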


So, approximately, those are my ideas.
For me this would cover many application cases.


Mikhail

Re: Array and string interoperability

Mikhail V
In reply to this post by Chris Barker - NOAA Federal
On 5 June 2017 at 19:40, Chris Barker <[hidden email]> wrote:

>
>
>> > Python3 assumes 4-byte strings but in reality most of the time
>> > we deal with 1-byte strings, so there is huge waste of resources
>> > when dealing with 4-bytes. For many serious projects it is just not
>> > needed.
>>
>> That's quite enough anglo-centrism, thank you. For when you need byte
>> strings, Python 3 has a type for that. For when your strings contain
>> text, bytes with no information on encoding are not enough.
>
> There was a big thread about this recently -- it seems to have not quite
> come to a conclusion.

I have started to read that thread, though I got lost in the idea transitions.
Likely it was about some new string array type...

> But anglo-centrism aside, there is substantial demand
> for a "smaller" way to store mostly-ascii text.
>

Obviously there is demand. The terror of unicode touches many aspects
of a programmer's life. It is not Numpy's problem though.
A realistic scenario for satisfying this demand is a hard and
wide problem.
Foremost, it comes down to the question of defining this "optimal
8-bit character table".
And "Latin-1", exactly as it is, is not that optimal table, at least
because of the huge number of
accented letters. But, granted, if one defines most accented letters as
"optional", i.e. deletes them,
then it is quite a reasonable basic char table to start with.
Then comes the question of popularizing the new table (which doesn't
even exist yet).


>> > There can be some convenience methods for ascii operations,
>> > like eg char.toupper(), but currently they don't seem to work with
>> > integer
>> > arrays so why not make those potentially useful methods usable
>> > and make them work on normal integer arrays?
>> I don't know what you're doing, but I don't think numpy is normally the
>> right tool for text manipulation...
>
>
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port) the
> python string methods. But it shoudl only work for actual string dtypes, of
> course.
>
> note that another part of the discussion previously suggested that we have a
> dtype that wraps a native python string object -- then you'd get all for
> free. This is essentially an object array with strings in it, which you can
> do now.
>

Well, here I must admit I don't quite understand the whole idea of
a "numpy array of string type". How is it used? What is the main benefit/feature...?

Example integer array usage in context of textual data in my case:
- holding data in a text editor (mutability+indexing/slicing)
- filtering, transformations (e.g. table translations, cryptography, etc.)


A string-type array? Is this the string array you describe:

s= "012 abc"
arr = np.array(s)
print ("type ", arr.dtype)
print ("shape ", arr.shape)
print ("my array: ", arr)
arr = np.roll(arr[0],2)
print ("my array: ", arr)
->
type  <U7
shape  ()
my array:  012 abc
my array:  012 abc


So what does it do? What's up with the shape?
E.g. here I wanted to 'roll' the string.
How would I replace chars? Or delete them?
What is the general idea behind it?
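The shape () in the output above is the key: np.array(s) makes a zero-dimensional array holding the whole string as a single '<U7' scalar, so roll has nothing to move. A per-character array behaves as one would expect (a sketch):

```python
import numpy as np

s = "012 abc"
arr = np.array(list(s))      # shape (7,), dtype '<U1': one element per char
rolled = np.roll(arr, 2)     # now roll really rotates the characters
print("".join(rolled))       # bc012 a
```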



Mikhail

Re: Array and string interoperability

Chris Barker - NOAA Federal
In reply to this post by Mikhail V
On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <[hidden email]> wrote:
-- classify by "forward/backward" conversion:
    For this time consider only forward, i.e. I copy data from string
to numpy array

-- classify by " bytes  vs  ordinals ":

a)  bytes:  If I need raw bytes - in this case e.g.

  B = bytes(s.encode())

no need to call "bytes" -- encode() returns a bytes object:

In [1]: s = "this is a simple ascii-only string"

In [2]: b = s.encode()

In [3]: type(b)

Out[3]: bytes

In [4]: b

Out[4]: b'this is a simple ascii-only string'

 
will do it. then I can copy data to array. So currently there are methods
coverings this. If I understand correctly the data extracted corresponds
to utf-??  byte feed, i.e. non-constant byte-length of chars (1 up to
4 bytes per char for
the 'wide' unicode, correct me if I am wrong).

In [5]: s.encode?
Docstring:
S.encode(encoding='utf-8', errors='strict') -> bytes

So the default is utf-8, but you can set any encoding you want (that python supports)

 In [6]: s.encode('utf-16')

Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00i\x00m\x00p\x00l\x00e\x00 \x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'

 
b):  I need *ordinals*
  Yes, I need ordinals, so for the bytes() method, if a Python 3
string contains only
  basic ascii, I can so or so convert to bytes then to integer array
and the length will
  be the same 1byte for each char.
  Although syntactically seen, and with slicing, this will look e.g. like:

s= "012 abc"
B = bytes(s.encode())  # convert to bytes
k  = len(s)
arr = np.zeros(k,"u1")   # init empty array length k
arr[0:2] = list(B[0:2])
print ("my array: ", arr)
->
my array:  [48 49  0  0  0  0  0]

This can be done more cleanly:

In [15]: s= "012 abc"

In [16]: b = s.encode('ascii')

# you want to use the ascii encoding so you don't get utf-8 cruft if there are non-ascii characters
#  you could use latin-1 too (or any other one-byte-per-char encoding)

In [17]: arr = np.fromstring(b, np.uint8)
# this is using fromstring() in its old py definition -- treat the contents as bytes
# -- it really should be called "frombytes()"
# you could also use:

In [22]: np.frombuffer(b, dtype=np.uint8)
Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)

In [19]: print(arr)
[48 49 50 32 97 98 99]

# you got the ordinals

In [20]: "".join([chr(i) for i in arr])
Out[20]: '012 abc'

# yes, they are the right ones...
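Worth noting as an aside (this is my addition, not part of the original exchange): in later NumPy releases np.fromstring with binary data is deprecated, and np.frombuffer is the supported spelling. A minimal sketch of the same ordinal extraction:

```python
import numpy as np

s = "012 abc"
b = s.encode("ascii")                    # raises UnicodeEncodeError on non-ascii
arr = np.frombuffer(b, dtype=np.uint8)   # read-only view over the bytes
print(arr)                               # [48 49 50 32 97 98 99]

# round-trip back to the original string
assert "".join(chr(i) for i in arr) == s
```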

 
Result seems correct. Note that I also need to use list(B), otherwise
the slicing does not work (it fills both values with 1, no idea where
the 1 comes from).

that is odd -- I can't explain it right now either...
 
Or I can write e.g.:
arr[0:2] = np.fromstring(B[0:2], "u1")

But looks indeed like a 'hack' and not so simple.

is the above OK?
 
-- classify "what is maximal ordinal value in the string"
Well, say, I don't know what is maximal ordinal, e.g. here I take
3 Cyrillic letters instead of 'abc':

s= "012 АБВ"
k  = len(s)
arr = np.zeros(k,"u4")   # init empty 32 bit array length k
arr[:] = np.fromstring(np.array(s),"u4")
->
[  48   49   50   32 1040 1041 1042]

so this is making a numpy string, which is UCS-4 encoded unicode -- i.e. 4 bytes per character. Then you are converting that to a 4-byte unsigned int. But there is no need to do it with fromstring:

In [52]: s
Out[52]: '012 АБВ'

In [53]: s_arr.reshape((1,)).view(np.uint32)
Out[53]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

we need the reshape() because .view does not work with array scalars -- not sure why not? 
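Spelled out as a self-contained sketch (same reshape-then-view trick as above; the UCS-4 code units of a native-endian numpy string are exactly the ordinals):

```python
import numpy as np

s = "012 АБВ"
s_arr = np.array(s)                       # 0-d array, dtype '<U7' (UCS-4 storage)
ords = s_arr.reshape(1).view(np.uint32)   # reinterpret the UCS-4 buffer
print(ords)                               # [  48   49   50   32 1040 1041 1042]
assert list(ords) == [ord(c) for c in s]
```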

This gives correct results indeed. So I get my ordinals as expected.
So this is better/preferred way, right?

I would maybe do it more "directly" -- i.e. use python's string to do the encoding:

In [64]: s
Out[64]: '012 АБВ'

In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
Out[67]: array([65279,    48,    49,    50,    32,  1040,  1041,  1042], dtype=uint32)

that first value is the byte-order mark (I think...), you  can strip it off with:

In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
Out[68]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

or, probably better, simply specify the byte order in the encoding:

In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)
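This could be wrapped in a small helper -- str_to_ordinals is a hypothetical name for illustration, not an existing numpy function -- using frombuffer and an explicit little-endian dtype so it matches the 'UTF-32LE' byte order on any host:

```python
import numpy as np

def str_to_ordinals(s):
    """Return the Unicode code points of s as a uint32 array.

    'UTF-32LE' emits no byte-order mark, so the buffer is exactly
    one little-endian 4-byte code point per character.
    """
    return np.frombuffer(s.encode("UTF-32LE"), dtype="<u4")

arr = str_to_ordinals("012 АБВ")
assert list(arr) == [48, 49, 50, 32, 1040, 1041, 1042]
```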

arr = np.ordinals(s)
arr[0:2] = np.ordinals(s[0:2])  # with slicing

or, e.g. in such format:

arr = np.copystr(s)
arr[0:2] = np.copystr(s[0:2])

I don't think any of this is necessary -- the UCS4 (Or UTF-32) "encoding" is pretty much the ordinals anyway.

As you noticed, if you make a numpy unicode string array and change the dtype to unsigned int32, you get what you want.

You really don't want to mess with any of this unless you understand unicode and encodings anyway....

Though it is a bit awkward -- what is your actual use-case for working with ordinals?

BTW, you can use regular python to get the ordinals first:

In [71]: np.array([ord(c) for c in s])
Out[71]: array([  48,   49,   50,   32, 1040, 1041, 1042])

Though for Python 2 this could raise the question of why casting to "u4" is needed.

this would all work the same with python 2 if you used unicode objects instead of strings. Maybe good to put:

from __future__ import unicode_literals

in your source....
 
Those are roughly my ideas.
For me this would cover many application cases.

I'm still curious as to your use-cases -- when do you have a bunch of ordinal values??

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Re: Array and string interoperability

Chris Barker - NOAA Federal
In reply to this post by Mikhail V
On Mon, Jun 5, 2017 at 4:06 PM, Mikhail V <[hidden email]> wrote:
Likely it was about some new string array type...

yes, it was.
 
> Obviously there is demand. Terror of unicode touches many aspects
of programmers life.

I don't know that I'd call it Terror, but frankly, the fact that you need up to 4 bytes for a single character is really not the big issue. Given that computer memory has grown by literally orders of magnitude since Unicode was introduced, I don't know why there is such a hang-up about it.

But we're scientific programmers -- we like to be efficient!


Foremost, it comes down to the question of defining this "optimal
8-bit character table".
And "Latin-1", (exactly as it is)  is not that optimal table,

there is no such thing as a single "optimal" set of characters when you are limited to 255 of them...

latin-1 is pretty darn good for the, well, latin-based languages....
 
But, granted, if you define most accented letters as
"optional", i.e. delete them,
then it is quite a reasonable basic char table to start with.

Then you are down to ASCII, no?
 
but anyway, I don't think a new encoding is really the topic at hand here....

>> I don't know what you're doing, but I don't think numpy is normally the
>> right tool for text manipulation...
>
>
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port) the
> python string methods. But it shoudl only work for actual string dtypes, of
> course.
>
> note that another part of the discussion previously suggested that we have a
> dtype that wraps a native python string object -- then you'd get all for
> free. This is essentially an object array with strings in it, which you can
> do now.
>

Well here I must admit I don't quite understand the whole idea of a
"numpy array of string type". How is it used? What is the main benefit/feature...?

here you go -- you can do this now:

In [74]: s_arr = np.array([s, "another string"], dtype=np.object)
In [75]:

In [75]: s_arr
Out[75]: array(['012 АБВ', 'another string'], dtype=object)

In [76]: s_arr.shape
Out[76]: (2,)

You now have an array with python string object in it -- thus access to all the string functionality:

In [81]: s_arr[1] = s_arr[1].upper()
In [82]: s_arr
Out[82]: array(['012 АБВ', 'ANOTHER STRING'], dtype=object)

and the ability to have each string be a different length.

If numpy were to know that those were string objects, rather than arbitrary python objects, it could do vectorized operations on them, etc.

You can do that now with numpy.vectorize, but it's pretty klunky.

In [87]: np_upper = np.vectorize(str.upper)
In [88]: np_upper(s_arr)

Out[88]:
array(['012 АБВ', 'ANOTHER STRING'],
      dtype='<U14')
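For what it's worth, many of the common str methods are also mirrored in the np.char module, which applies them element-wise to string-dtype arrays -- somewhat less klunky than np.vectorize for these cases:

```python
import numpy as np

s_arr = np.array(["012 abc", "another string"])   # fixed-width dtype '<U14'
upper = np.char.upper(s_arr)                      # element-wise str.upper
assert list(upper) == ["012 ABC", "ANOTHER STRING"]

# other str methods are mirrored too, e.g.:
assert list(np.char.startswith(s_arr, "0")) == [True, False]
```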
 

Example integer array usage in context of textual data in my case:
- holding data in a text editor (mutability+indexing/slicing)

you really want to use regular old python data structures for that...
 
- filtering, transformations (e.g. table translations, cryptography, etc.)

that may be something to do with ordinals and numpy -- but then you need to work with ascii or latin-1 and uint8 dtypes, or full Unicode and uint32 dtype -- that's that.

String type array? Will this be a string array you describe:

s= "012 abc"
arr = np.array(s)
print ("type ", arr.dtype)
print ("shape ", arr.shape)
print ("my array: ", arr)
arr = np.roll(arr[0],2)
print ("my array: ", arr)
->
type  <U7
shape  ()
my array:  012 abc
my array:  012 abc


So what does it do? What's up with shape?

shape is an empty tuple, meaning this is a numpy scalar, containing a single string

type '<U7' means little endian, unicode, 7 characters
 
e.g. here I wanted to 'roll' the string.
How would I replace chars? or delete?
What is the general idea behind?

the numpy string type (unicode type) works with fixed length strings -- not characters, but you can reshape it and make a view:

In [89]: s= "012 abc"

In [90]: arr.shape = (1,)

In [91]: arr.shape
Out[91]: (1,)

In [93]: c_arr = arr.view(dtype = '<U1')

In [97]: np.roll(c_arr, 3)
Out[97]:
array(['a', 'b', 'c', '0', '1', '2', ' '],
      dtype='<U1')

You could also create it as a character array in the first place by unpacking it into a list first:

In [98]: c_arr = np.array(list(s))

In [99]: c_arr
Out[99]:
array(['0', '1', '2', ' ', 'a', 'b', 'c'],
      dtype='<U1')
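Building on that: with a '<U1' character array, replacing and "deleting" characters become ordinary array operations, and "".join gets you back to a Python string. A sketch (my addition, not from the thread):

```python
import numpy as np

s = "012 abc"
c_arr = np.array(list(s))          # one char per element, dtype '<U1'

c_arr[3] = "_"                     # replace a character in place
assert "".join(c_arr) == "012_abc"

rolled = np.roll(c_arr, 3)         # rotate by 3 positions
assert "".join(rolled) == "abc012_"

# "delete" via a boolean mask (produces a new, shorter array)
kept = c_arr[c_arr != "_"]
assert "".join(kept) == "012abc"
```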


-CHB



Re: Array and string interoperability

Mikhail V
In reply to this post by Chris Barker - NOAA Federal
On 7 June 2017 at 00:05, Chris Barker <[hidden email]> wrote:
> On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <[hidden email]> wrote:

>> s= "012 abc"
>> B = bytes(s.encode())  # convert to bytes
>> k  = len(s)
>> arr = np.zeros(k,"u1")   # init empty array length k
>> arr[0:2] = list(B[0:2])
>> print ("my array: ", arr)
>> ->
>> my array:  [48 49  0  0  0  0  0]
>
>
> This can be done more cleanly:
>
> In [15]: s= "012 abc"
>
> In [16]: b = s.encode('ascii')
>
> # you want to use the ascii encoding so you don't get utf-8 cruft if there
> are non-ascii characters
> #  you could use latin-1 too (Or any other one-byte per char encoding

Thanks for clarifying, that makes sense.
Also it's a good way to validate the string.


>
> or, probably better simply specify the byte order in the encoding:
>
> In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
> Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)


Ok, this gives what I want too.
So now for unicode I have two possible options (apart from the possible
"fromstring" spelling):
with indexing (if I want to copy into already existing array on the fly):

arr[0:3] = np.fromstring(np.array(s[0:3]),"u4")
arr[0:3] = np.fromstring(s[0:3].encode('UTF-32LE'),"u4")
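Both options, written out runnably (a sketch; it uses frombuffer rather than the later-deprecated fromstring for the second one):

```python
import numpy as np

s = "012 АБВ"
arr = np.zeros(len(s), dtype=np.uint32)

# option 1: numpy unicode scalar, then reinterpret its UCS-4 buffer
arr[0:3] = np.array(s[0:3]).reshape(1).view(np.uint32)

# option 2: encode to BOM-free fixed-width UTF-32LE and read the bytes back
arr[3:] = np.frombuffer(s[3:].encode("UTF-32LE"), dtype="<u4")

assert list(arr) == [ord(c) for c in s]
```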


>
>> arr = np.ordinals(s)
>> arr[0:2] = np.ordinals(s[0:2])  # with slicing
>
>
> I don't think any of this is necessary -- the UCS4 (Or UTF-32) "encoding" is
> pretty much the ordinals anyway.
>
> As you notices, if you make a numpy unicode string array, and change the
> dtype to unsigned int32, you get what you want.

No I am not implying anything is necessary, just seems to be sort of a pattern.
And from Python 3 perspective where string indexing is by wide characters ...
well I don't know.


>> Example integer array usage in context of textual data in my case:
>> - holding data in a text editor (mutability+indexing/slicing)
>>

>you really want to use regular old python data structures for that...
>[...]
>the numpy string type (unicode type) works with fixed length strings -- not
>characters, but you can reshape it and make a view:
>[...]

I am intentionally choosing fixed size array for holding data and
writing values using indexes.
But wait a moment, characters *are* integers, identities, [put some
other name here].

> In [93]: c_arr = arr.view(dtype = '<U1')
> In [97]: np.roll(c_arr, 3)
> Out[97]:
> array(['a', 'b', 'c', '0', '1', '2', ' '],
>     dtype='<U1')

So here it prints  ['a', 'b', 'c', '0', '1', '2', ' '] which
is the same data, it is just a matter of printing.

If we talk about methods available already in particular libs, then
well, yes they are set up to work on specific object types only.
But generally speaking, if I want to select e.g. specific character values,
or I am selecting specific values in some discrete sets...

But I have no experience with numpy string types
and cannot yet feel their real purpose.



-------
(Off topic here)


>> Foremost, it comes down to the question of defining this "optimal
>> 8-bit character table".
>> And "Latin-1", (exactly as it is)  is not that optimal table,
>
>there is no such thing as a single "optimal" set of characters when you are
>limited to 255 of them...

Yeah, depends much on criteria of 'optimality' and many other things ;)

>> But, granted, if define most accented letters as
>> "optional", i.e . delete them
>> then it is quite reasonable basic char table to start with.
>
>Then you are down to ASCII, no?

No, then I am down to ASCII plus few vital characters, e.g.:

- Dashes (which could solve the painful and old as world problem of
"hyphen" vs "minus")
- Multiplication sign, degree
- Em dash, quotation marks, spaces (non-breaking, half)   --  all
vital for typesetting
...

If you think about it,  255 units is more than enough to define
perfect communication standards.

>but anyway, I don't think a new encoding is really the topic at hand
>here....

Yes, I think this is off-topic on this list. But interesting indeed --
where would it be on-topic?
It seems like those encodings come from some "mysterious castle in
the clouds".


Mikhail