# Array and string interoperability

10 messages

## Array and string interoperability

Just sharing my thoughts and a few ideas about simplifying the casting of strings to arrays. In the examples, assume NumPy is in the namespace (`from numpy import *`).

Initializing an array from a string currently looks like:

```python
s = "012 abc"
A = fromstring(s, "u1")
print A
-> [48 49 50 32 97 98 99]
```

Perfect. But writing values will not work as (IMO) it should; consider this example:

```python
B = zeros(7, "u1")
B[0] = s[1]
print B
-> [1 0 0 0 0 0 0]
```

Ugh? It tries to parse the s[1] character "1" as an integer and writes 1 to B[0]. The first thing I would expect is a ValueError, and I'd never expect it to do such high-level manipulation with parsing. IMO it should ideally do the following instead:

```python
B[0] = s[1]
print B
-> [49  0  0  0  0  0  0]
```

So it should just write ord(s[1]) to B. Sounds logical? For me, very much so. Further, one could write like this:

```python
B[:] = s
print B
-> [48 49 50 32 97 98 99]
```

Namely, cast the string into a byte array. IMO this would be the logical, expected behavior. Currently it just throws a ValueError when it meets non-digits in a string, so IMO the current casting can hardly be of practical use.

Furthermore, I think this code:

```python
A = array(s, "u1")
```

could act exactly the same as:

```python
A = fromstring(s, "u1")
```

But this is just a side idea for spelling simplicity/generality; not really necessary.

Further thoughts: if one tries to create a "u1" array from a Python 3 string, the question is whether it should throw an error. I think yes, and in this case the "u4" type should be explicitly specified at initialisation, I suppose. And e.g. translation from Unicode to extended ASCII (Latin-1) or whatever should be done on the Python side, or with an explicit translation. Python 3 assumes 4-byte strings, but in reality most of the time we deal with 1-byte strings, so there is a huge waste of resources when dealing with 4 bytes. For many serious projects it is just not needed.
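[Editor's note: the assignment behaviour proposed above is already available today if one starts from a bytes object; a minimal sketch (not part of the original post), assuming Python 3, where indexing bytes yields integers:

```python
import numpy as np

s = b"012 abc"                        # bytes: in Python 3, s[1] is already the int 49
B = np.zeros(7, "u1")
B[0] = s[1]                           # writes 49 directly -- no string parsing
B[:] = np.frombuffer(s, dtype="u1")   # bulk copy of all byte values
print(B)                              # [48 49 50 32 97 98 99]
```

This only sidesteps the issue for bytes; the thread below discusses what happens with (unicode) str objects.]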
Furthermore, I think some of the methods from the "chararray" submodule should be usable directly on normal integer arrays, without conversion to other array types. I personally don't really get the need for an additional chararray type: it's all numbers anyway, and it's up to the programmer to decide what size of translation tables/value ranges to use. There could be some convenience methods for ASCII operations, like e.g. char.toupper(), but currently they don't seem to work with integer arrays. So why not make those potentially useful methods work on normal integer arrays? Or even migrate them to the root namespace and introduce names with prefixes:

```python
A = ascii_toupper(A)
A = ascii_tolower(A)
```

Many things can be achieved with general numeric methods, e.g. translating/reducing the array. Obviously I mean fixed-size arrays here, not dynamic ones. How to deal with dynamically changing array sizes is a separate problem, and it depends on how the software is designed in the first place and what it does with the data. For my own text-editing software project I consider only fixed-allocated 1D and 2D "uint8" arrays, and I specifically experiment with my own encodings. So, just as a side note, I don't think much encoding should be assumed when creating new array types; it is up to the programmer to decide what 'meanings' the bytes have.

Kind regards,
Mikhail

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
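[Editor's note: the `ascii_toupper` idea in the message above is not an existing NumPy function; here is a sketch of how such a helper could be written for plain integer code-point arrays, using only standard NumPy operations:

```python
import numpy as np

def ascii_toupper(a):
    """Uppercase ASCII letters a-z in an integer code-point array.

    Hypothetical helper: the name comes from the proposal above, not from NumPy.
    """
    a = np.asarray(a)
    # ASCII upper/lower case differ by exactly 32 (0x20)
    is_lower = (a >= ord('a')) & (a <= ord('z'))
    return np.where(is_lower, a - 32, a)

codes = np.frombuffer(b"012 abc", dtype="u1")
print(ascii_toupper(codes))   # [48 49 50 32 65 66 67]
```

]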

## Re: Array and string interoperability

On 04/06/17 20:04, Mikhail V wrote:

> Initialize array from a string currently looks like:
>
>     s = "012 abc"
>     A = fromstring(s, "u1")
>     print A
>     -> [48 49 50 32 97 98 99]
>
> Perfect. Now when writing values it will not work as IMO it should; consider this example:
>
>     B = zeros(7, "u1")
>     B[0] = s[1]
>     print B
>     -> [1 0 0 0 0 0 0]
>
> Ugh? It tries to parse the s[1] character "1" as an integer and writes 1 to B[0]. The first thing I would expect is a ValueError, and I'd never expect such high-level parsing. IMO ideally it would do the following instead:
>
>     B[0] = s[1]
>     print B
>     -> [49  0  0  0  0  0  0]
>
> So it should just write ord(s[1]) to B. Sounds logical? For me, very much. Further, one could write like this:
>
>     B[:] = s
>     print B
>     -> [48 49 50 32 97 98 99]
>
> Namely, cast the string into a byte array. IMO this would be the logical, expected behavior.

I disagree. If numpy treated bytestrings as sequences of uint8s (which would, granted, be perfectly reasonable, at least in py3), you wouldn't have needed the fromstring function in the first place. Personally, I think I would prefer this, actually. However, numpy normally treats strings as objects that can sometimes be cast to numbers, so this behaviour is perfectly logical.

For what it's worth, in Python 3 (which you probably should want to be using), everything behaves as you'd expect:

```python
>>> import numpy as np
>>> s = b'012 abc'
>>> a = np.fromstring(s, 'u1')
>>> a
array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>> b = np.zeros(7, 'u1')
>>> b[0] = s[1]
>>> b
array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)
```

> Currently it just throws the value error if it meets non-digits in a string, so IMO the current casting can hardly be of practical use.
>
> Furthermore, I think this code:
>
>     A = array(s, "u1")
>
> could act exactly the same as:
>
>     A = fromstring(s, "u1")
>
> But this is just a side idea for spelling simplicity/generality. Not really necessary.
There is also something to be said for the current behaviour:

```python
>>> np.array('100', 'u1')
array(100, dtype=uint8)
```

However, the fact that this works for bytestrings on Python 3 is, in my humble opinion, ridiculous:

```python
>>> np.array(b'100', 'u1')  # b'100' IS NOT TEXT
array(100, dtype=uint8)
```

This is of course consistent with the fact that you can cast a bytestring to builtin python int or float (but not complex). Interestingly enough, numpy complex behaves differently from python complex:

```python
>>> complex(b'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: complex() argument must be a string or a number, not 'bytes'
>>> complex('1')
(1+0j)
>>> np.complex128('1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a float is required
```

> Further thoughts: If trying to create a "u1" array from a Python 3 string, the question is whether it should throw an error. I think yes, and in this case the "u4" type should be explicitly specified at initialisation, I suppose. And e.g. translation from unicode to extended ascii (Latin1) or whatever should be done on the Python side or with explicit translation.

If you ask me, passing a unicode string to fromstring with sep='' (i.e. to parse binary data) should ALWAYS raise an error: the semantics only make sense for strings of bytes. Currently, there appears to be some UTF-8 conversion going on, which creates potentially unexpected results:

```python
>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u1')
>>> a
array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>>> assert len(a) * a.dtype.itemsize == len(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
```

This is, apparently (https://github.com/numpy/numpy/issues/2152), due to how the internals of Python deal with unicode strings in C code, and not due to anything numpy is doing.
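[Editor's note: the "do the translation explicitly on the Python side" approach advocated here can be sketched as follows; this snippet is an illustration added to the thread, using a fixed-width codec so the byte layout is unambiguous:

```python
import numpy as np

s = 'αβγδ'
# 'utf-32-le' is fixed-width (4 bytes per character) and emits no byte-order
# mark, so the encoded buffer reinterprets cleanly as one uint32 code point
# per character -- no implicit UTF-8 conversion involved.
codes = np.frombuffer(s.encode('utf-32-le'), dtype='<u4')
print(codes)                  # [945 946 947 948]
assert len(codes) == len(s)   # one element per character, unlike the UTF-8 case
```

]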
Speaking of unexpected results, I'm not sure you realize what fromstring does when you give it a multi-byte dtype:

```python
>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u4')
>>> a
array([2999890382, 3033445326], dtype=uint32)
```

Give fromstring() a numpy unicode string, and all is right with the world:

```python
>>> s = np.array('αβγδ')
>>> s
array('αβγδ', dtype='<U4')
>>> np.fromstring(s, 'u4')
array([945, 946, 947, 948], dtype=uint32)
```

IMHO calling fromstring(..., sep='') with a unicode string should be deprecated and perhaps eventually forbidden. (Or fixed, but that would break backwards compatibility.)

> Python3 assumes 4-byte strings but in reality most of the time we deal with 1-byte strings, so there is huge waste of resources when dealing with 4 bytes. For many serious projects it is just not needed.

That's quite enough anglo-centrism, thank you. For when you need byte strings, Python 3 has a type for that. For when your strings contain text, bytes with no information on encoding are not enough.

> Furthermore I think some of the methods from the "chararray" submodule should be possible to use directly on normal integer arrays without conversion to other array types. So I personally don't really get the need for an additional chararray type; it's all numbers anyway and it's up to the programmer to decide what size of translation tables/value ranges he wants to use.

chararray is deprecated.

> There can be some convenience methods for ascii operations, like e.g. char.toupper(), but currently they don't seem to work with integer arrays, so why not make those potentially useful methods usable and make them work on normal integer arrays?

I don't know what you're doing, but I don't think numpy is normally the right tool for text manipulation...

> [snip]
>
> As a side-note, I don't think that encoding should be assumed much for creating new array types; it is up to the programmer to decide what 'meanings' the bytes have.

Agreed!
--
Thomas
## Re: Array and string interoperability

On Mon, Jun 5, 2017 at 1:51 PM, Thomas Jollans wrote:

> and overloading fromstring() to mean both "binary dump of data" and "parse the text" due to whether the sep argument is set was always a bad idea :-(
>
> .. and fromstring(s, sep=a_sep_char)

As it happens, this is pretty much what stdlib bytearray does since 3.2 (http://bugs.python.org/issue8990). I'm not sure that array.array.fromstring() ever parsed the data string as text, did it? Anyway, this is what array.array now has:

array.frombytes(s)
    Appends items from the string, interpreting the string as an array of machine values (as if it had been read from a file using the fromfile() method). New in version 3.2: fromstring() is renamed to frombytes() for clarity.

array.fromfile(f, n)
    Read n items (as machine values) from the file object f and append them to the end of the array. If fewer than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won't do.

array.fromstring()
    Deprecated alias for frombytes().

I think numpy should do the same. And frombytes() should remove the "sep" parameter. If someone wants to write a fast, efficient, simple text parser, then it should get a new name: fromtext() maybe??? And the fromfile() sep argument should be deprecated as well, for the same reasons.

array also has:

array.fromunicode(s)
    Extends this array with data from the given unicode string. The array must be a type 'u' array; otherwise a ValueError is raised.
Use array.frombytes(unicodestring.encode(enc)) to append Unicode data to an array of some other type. Which I think would be better supported by:

    np.frombytes(str.encode('UCS-4'), dtype=uint32)

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R
7600 Sand Point Way NE, Seattle, WA 98115
(206) 526-6959 voice
[hidden email]
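[Editor's note: the stdlib `array` API quoted above behaves as described; a small demonstration using only the standard library (`array.array('u')` is the wide-character typecode mentioned in the docs, though it has since been deprecated):

```python
import array

a = array.array('B')        # 'B': unsigned 8-bit integers
a.frombytes(b'012 abc')     # raw binary copy -- no text parsing involved
print(list(a))              # [48, 49, 50, 32, 97, 98, 99]

u = array.array('u')        # fromunicode() requires a type 'u' array
u.fromunicode('abc')
print(u.tounicode())        # abc
```

]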

## Re: Array and string interoperability

In reply to this post by Thomas Jollans

On 4 June 2017 at 23:59, Thomas Jollans <[hidden email]> wrote:

> For what it's worth, in Python 3 (which you probably should want to be using), everything behaves as you'd expect:
>
>     >>> import numpy as np
>     >>> s = b'012 abc'
>     >>> a = np.fromstring(s, 'u1')
>     >>> a
>     array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>     >>> b = np.zeros(7, 'u1')
>     >>> b[0] = s[1]
>     >>> b
>     array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)

Ok, examples do best. I think we have to separate cases though, so I will do examples in recent Python 3 now to avoid confusion.

Case divisions:

-- classify by "forward/backward" conversion: for this time consider only forward, i.e. I copy data from a string to a numpy array.

-- classify by "bytes vs ordinals":

a) bytes: If I need raw bytes, then e.g. B = bytes(s.encode()) will do it; then I can copy the data to an array. So currently there are methods covering this. If I understand correctly, the data extracted corresponds to a utf-?? byte feed, i.e. a non-constant byte length for characters (1 up to 4 bytes per char for 'wide' unicode; correct me if I am wrong).

b) ordinals: Yes, I need ordinals. So for the bytes() method, if a Python 3 string contains only basic ASCII, I can convert to bytes and then to an integer array, and the length will be the same, 1 byte for each char. Although syntactically, with slicing, this will look e.g. like:

```python
s = "012 abc"
B = bytes(s.encode())     # convert to bytes
k = len(s)
arr = np.zeros(k, "u1")   # init empty array of length k
arr[0:2] = list(B[0:2])
print("my array: ", arr)
-> my array:  [48 49  0  0  0  0  0]
```

The result seems correct. Note that I also need to use list(B), otherwise the slicing does not work (it fills both values with 1; no idea where the 1 comes from). Or I can write e.g.:

```python
arr[0:2] = np.fromstring(B[0:2], "u1")
```

But that indeed looks like a 'hack' and is not so simple. Considering your other examples there is another (better?) way, see below.
Note, I personally don't know the best practices and many technical nuances here, so I repeat it from your words.

-- classify by "what is the maximal ordinal value in the string": Well, say I don't know what the maximal ordinal is; e.g. here I take 3 Cyrillic letters instead of 'abc':

```python
s = "012 АБВ"
k = len(s)
arr = np.zeros(k, "u4")   # init empty 32-bit array of length k
arr[:] = np.fromstring(np.array(s), "u4")
-> [  48   49   50   32 1040 1041 1042]
```

This gives correct results indeed, so I get my ordinals as expected. So this is the better/preferred way, right? Ok...

Just some further thoughts on the topic: I would want to do the above things in a simpler syntax. For example, if there were methods taking Python strings:

```python
arr = np.ordinals(s)
arr[0:2] = np.ordinals(s[0:2])   # with slicing
```

or, e.g. in such a format:

```python
arr = np.copystr(s)
arr[0:2] = np.copystr(s[0:2])
```

which would give me the same result as your proposed:

```python
arr = np.fromstring(np.array(s), "u4")
arr[0:2] = np.fromstring(np.array(s[0:2]), "u4")
```

IOW, omitting the "u4" parameter seems to be OK. E.g. if on the left side of the assignment there is a "u1" array, the values would be silently wrapped(?) according to Numpy rules (as Chris pointed out). And in a similar way, backward conversion to a Python string. Though for Python 2 this could raise questions about why casting to "u4" is needed. It would be cool to just use = without any methods, as I originally supposed, but as I understand now this behaviour is already occupied, and touching it would cause backward-compatibility issues.

So, approximately, those are my ideas. For me it would cover many application cases.

Mikhail
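[Editor's note: `np.ordinals` and `np.copystr` above are proposed, not existing, NumPy functions. A sketch of what such a helper could look like in plain Python and NumPy, keeping the name from the proposal:

```python
import numpy as np

def ordinals(s, dtype="u4"):
    # Hypothetical helper -- not a NumPy API; the name is taken from the
    # proposal above. Returns the Unicode code points of s as an integer array.
    return np.fromiter((ord(c) for c in s), dtype=dtype, count=len(s))

arr = np.zeros(7, "u4")
arr[:] = ordinals("012 АБВ")
print(arr)                             # [  48   49   50   32 1040 1041 1042]

arr2 = np.zeros(7, "u4")
arr2[0:2] = ordinals("012 АБВ"[0:2])   # slicing works the same way
```

]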
## Re: Array and string interoperability

In reply to this post by Mikhail V

On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V wrote:

> -- classify by "forward/backward" conversion: For this time consider only forward, i.e. I copy data from string to numpy array
>
> -- classify by "bytes vs ordinals":
>
> a) bytes: If I need raw bytes - in this case e.g. B = bytes(s.encode())

no need to call "bytes" -- encode() returns a bytes object:

```python
In [1]: s = "this is a simple ascii-only string"
In [2]: b = s.encode()
In [3]: type(b)
Out[3]: bytes
In [4]: b
Out[4]: b'this is a simple ascii-only string'
```

> will do it. then I can copy data to array. So currently there are methods covering this. If I understand correctly the data extracted corresponds to a utf-?? byte feed, i.e. non-constant byte-length of chars (1 up to 4 bytes per char for the 'wide' unicode, correct me if I am wrong).

```python
In [5]: s.encode?
Docstring: S.encode(encoding='utf-8', errors='strict') -> bytes
```

So the default is utf-8, but you can set any encoding you want (that python supports):

```python
In [6]: s.encode('utf-16')
Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00i\x00m\x00p\x00l\x00e\x00 \x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
```

> b): I need *ordinals*. Yes, I need ordinals, so for the bytes() method, if a Python 3 string contains only basic ascii, I can so or so convert to bytes then to integer array and the length will be the same 1 byte for each char. Although syntactically seen, and with slicing, this will look e.g.
> like:
>
>     s = "012 abc"
>     B = bytes(s.encode())  # convert to bytes
>     k = len(s)
>     arr = np.zeros(k, "u1")   # init empty array length k
>     arr[0:2] = list(B[0:2])
>     print("my array: ", arr)
>     -> my array:  [48 49  0  0  0  0  0]

This can be done more cleanly:

```python
In [15]: s = "012 abc"
In [16]: b = s.encode('ascii')
# you want to use the ascii encoding so you don't get utf-8 cruft
# if there are non-ascii characters
# you could use latin-1 too (or any other one-byte-per-char encoding)
In [17]: arr = np.fromstring(b, np.uint8)
# this is using fromstring() in its old py definition - treat the contents
# as bytes -- it really should be called "frombytes()"
# you could also use:
In [22]: np.frombuffer(b, dtype=np.uint8)
Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
In [19]: print(arr)
[48 49 50 32 97 98 99]
# you got the ordinals
In [20]: "".join([chr(i) for i in arr])
Out[20]: '012 abc'
# yes, they are the right ones...
```

> Result seems correct. Note that I also need to use list(B), otherwise the slicing does not work (fills both values with 1, no idea where 1 comes from).

that is odd -- I can't explain it right now either...

> Or I can write e.g.:
>
>     arr[0:2] = np.fromstring(B[0:2], "u1")
>
> But looks indeed like a 'hack' and not so simple.

is the above OK?

> -- classify "what is maximal ordinal value in the string". Well, say, I don't know what is maximal ordinal, e.g. here I take 3 Cyrillic letters instead of 'abc':
>
>     s = "012 АБВ"
>     k = len(s)
>     arr = np.zeros(k, "u4")   # init empty 32 bit array length k
>     arr[:] = np.fromstring(np.array(s), "u4")
>     -> [  48   49   50   32 1040 1041 1042]

so this is making a numpy string, which is UCS-4 encoded unicode -- i.e. 4 bytes per character. Then you are converting that to a 4-byte unsigned int.
but no need to do it with fromstring:

```python
In [52]: s
Out[52]: '012 АБВ'
In [53]: s_arr.reshape((1,)).view(np.uint32)
Out[53]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)
```

we need the reshape() because .view does not work with array scalars -- not sure why not?

> This gives correct results indeed. So I get my ordinals as expected. So this is the better/preferred way, right?

I would maybe do it more "directly" -- i.e. use python's string to do the encoding:

```python
In [64]: s
Out[64]: '012 АБВ'
In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
Out[67]: array([65279,    48,    49,    50,    32,  1040,  1041,  1042], dtype=uint32)
```

that first value is the byte-order mark (I think...); you can strip it off with:

```python
In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
Out[68]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)
```

or, probably better, simply specify the byte order in the encoding:

```python
In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)
```

> arr = np.ordinals(s)
> arr[0:2] = np.ordinals(s[0:2])  # with slicing
>
> or, e.g. in such format:
>
> arr = np.copystr(s)
> arr[0:2] = np.copystr(s[0:2])

I don't think any of this is necessary -- the UCS4 (or UTF-32) "encoding" is pretty much the ordinals anyway. As you noticed, if you make a numpy unicode string array and change the dtype to unsigned int32, you get what you want. You really don't want to mess with any of this unless you understand unicode and encodings anyway....

Though it is a bit awkward -- what is your actual use-case for working with ordinals???

BTW, you can use regular python to get the ordinals first:

```python
In [71]: np.array([ord(c) for c in s])
Out[71]: array([  48,   49,   50,   32, 1040, 1041, 1042])
```

> Though for Python 2 this could raise questions why casting to "u4" is needed.

this would all work the same with python 2 if you used unicode objects instead of strings. Maybe good to put:

    from __future__ import unicode_literals

in your source....
> So approximately are my ideas. For me it would cover many application cases.

I'm still curious as to your use-cases -- when do you have a bunch of ordinal values??

-CHB
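[Editor's note: the reshape/view trick from Out[53] above depends on `s_arr` defined earlier in that session; here is a self-contained version of it. The numpy unicode dtype stores UCS-4 in native byte order, so viewing it as a native uint32 yields the code points directly:

```python
import numpy as np

s_arr = np.array('012 АБВ')   # 0-d array, dtype '<U7' (UCS-4 storage)
# .view() does not work on 0-d array scalars, hence the reshape to 1-d first
codes = s_arr.reshape((1,)).view(np.uint32)
print(codes)                  # [  48   49   50   32 1040 1041 1042]
```

]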