proposal: smaller representation of string arrays

proposal: smaller representation of string arrays

Julian Taylor-3
Hello,
As you probably know, numpy does not deal well with strings in Python 3.
The np.string type is actually zero-terminated bytes, not a string.
In Python 2 this happened to work out, as Python 2 treats bytes and
strings the same way. But in Python 3 this type is hard to work with:
each time you get an item from a numpy bytes array it needs decoding to
obtain a string.
The only string type available in Python 3 is np.unicode, which uses the
4-byte utf-32 encoding and is deemed to use too much memory to actually
see much use.
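Both problems are easy to see with current NumPy (a small illustration of existing behaviour, not new API):

```python
import numpy as np

# 'S' (np.bytes_) arrays hand back bytes objects on Python 3,
# so every item access needs an explicit .decode() to get a str.
a = np.array(["name"], dtype="S10")
item = a[0]                   # bytes, not str
text = item.decode("latin1")  # decoding step required on every access

# 'U' (np.unicode_) stores UTF-32: four bytes per character.
u = np.array(["name"], dtype="U10")
print(type(item), u.dtype.itemsize)  # 10 chars cost 40 bytes per element
```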

What people apparently want is a string type for Python 3 that uses less
memory for the common science use case, which rarely needs more than
latin1 encoding.
As we have been told, we cannot change the np.string type to actually be
strings, as existing programs interpret its content as bytes despite
this being very broken due to its null-terminating property (it ignores
all trailing nulls).
Also, 8 years of third parties working around numpy's poor Python 3
support decisions probably make the 'return bytes' behaviour impossible
to change now.

So we need a new dtype that can represent strings in numpy arrays which
is smaller than the existing 4 byte utf-32.

To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' means 10 characters in latin1 encoding

Encodings we should support are:
- latin1 (1 byte):
it is compatible with ascii and adds extra characters used in the
western world.
- utf-32 (4 bytes):
can represent every character, equivalent to np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation
- utf-8 (variable, up to 4 bytes):
variable-length encoding with a minimum size of 1 byte, but we would need
to assume the worst case of 4 bytes, so it would not save anything
compared to utf-32; it may, however, allow third parties to replace an
encoding step with trailing-null trimming on serialization.
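The per-character cost of these candidates can be checked with Python's own codecs (a small illustration; the explicit-endian utf-16/utf-32 codecs are used to avoid counting BOM bytes):

```python
# Bytes needed to store one character under each candidate encoding.
for ch in ("A", "é", "€", "𝕊"):
    sizes = {}
    for enc in ("latin-1", "utf-16-le", "utf-8", "utf-32-le"):
        try:
            sizes[enc] = len(ch.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None  # character not representable in this encoding
    print(ch, sizes)
```

Note that '€' is not in latin-1 at all (it is in latin-9), and '𝕊' needs a surrogate pair in utf-16.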


To actually do this we have two options, both of which break our ABI
unless we resort to ugly hacks.

- Add a new dtype, e.g. npy.realstring
By not modifying an existing type, the only breakage is in programs
using NPY_CHAR. The most notable case of this is f2py.
It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete, and it is more busywork to implement.

- Modify np.unicode to have encoding metadata
This allows us to reuse all the type boilerplate, so it is more
convenient to implement, and by extending an existing type instead of
making one obsolete it results in a much nicer API.
The big drawback is that it will explicitly break any third party that
receives an array with a new encoding and assumes that the buffer of an
array of type np.unicode has a character itemsize of 4 bytes.
To ease this problem we would need to add APIs to numpy now for getting
the itemsize and encoding, so third parties can error out cleanly.

The implementation is not that big a deal; I have already created a
prototype that adds latin1 metadata to np.unicode, and it works quite
well. It is imo realistic to get this into 1.14, should we be able to
make a decision on which way to implement it.

Do you have comments on how to go forward, in particular in regards to
new dtype vs modify np.unicode?

cheers,
Julian


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: proposal: smaller representation of string arrays

Anne Archibald
On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor <[hidden email]> wrote:
To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' means 10 characters in latin1 encoding

Encodings we should support are:
- latin1 (1 byte):
it is compatible with ascii and adds extra characters used in the
western world.
- utf-32 (4 bytes):
can represent every character, equivalent with np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation
- utf-8 (variable, up to 4 bytes):
variable-length encoding with a minimum size of 1 byte, but we would need
to assume the worst case of 4 bytes, so it would not save anything
compared to utf-32; it may, however, allow third parties to replace an
encoding step with trailing-null trimming on serialization.

I should say first that I've never used even non-Unicode string arrays, but is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand. 

Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.) 

Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. 
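The "truncate so that the encoding fits" rule described above can be sketched in a few lines of plain Python (an illustration of the idea, not an actual NumPy API; the helper name is made up):

```python
def truncate_to_encoding(s: str, nbytes: int, encoding: str = "utf-8") -> bytes:
    """Encode s, keeping as many whole characters as fit in nbytes."""
    out = bytearray()
    for ch in s:
        b = ch.encode(encoding)
        if len(out) + len(b) > nbytes:
            break  # next character would overflow the fixed-size field
        out += b
    return bytes(out)

# "héllo" is 6 bytes in UTF-8 (é takes 2); a 4-byte field keeps "hél".
print(truncate_to_encoding("héllo", 4))
```

A real implementation would do this per array element at assignment time, so the stored buffer is always a valid, possibly shortened, string.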

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)

Anne


Re: proposal: smaller representation of string arrays

Chris Barker - NOAA Federal
In reply to this post by Julian Taylor-3
Thanks so much for reviving this conversation -- we really do need to address this.

My thoughts:

What people apparently want is a string type for Python3 which uses less
memory for the common science use case which rarely needs more than
latin1 encoding.

Yes -- I think there is a real demand for that.





To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' are 10 characters in latin1 encoding

I wonder if we really need that -- as you say, there is real demand for a compact string type, but for many use cases 1 byte per character is enough. So to keep things really simple, I think a single 1-byte-per-char encoding would meet most people's needs.

What should that encoding be?

latin-1 is obvious (and has the very nice property of being able to round-trip arbitrary bytes -- at least with Python's implementation) and scientific data sets tend to use the latin alphabet (with its ascii roots and all).

But there is now latin-9. Maybe a better option?

Encodings we should support are:
- latin1 (1 byte):
it is compatible with ascii and adds extra characters used in the
western world.
- utf-32 (4 bytes):
can represent every character, equivalent with np.unicode

IIUC, datetime64 is, well, always 64 bits. So it may be better to have a given dtype always be the same bitwidth.

So the utf-32 dtype would be a different dtype. which also keeps it really simple, we have a latin-* dtype and a full-on unicode dtype -- that's it.

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation

I think UTF-16 is quite simply the worst of both worlds. If we want a two-byte character set, then it should be UCS-2 -- i.e. explicitly rejecting any code point that takes more than two bytes to represent (maybe that's what you mean by explicitly disallowing surrogate pairs). In any case, it should certainly give you an encoding error if you try to pass in a unicode character that cannot fit into two bytes.

So: is there actually a demand for this? If so, then I think it should be a separate 2-byte string type, with the encoding always the same.
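The "reject anything beyond two bytes" rule amounts to restricting strings to the Basic Multilingual Plane. A minimal sketch of such a check (the helper name is hypothetical):

```python
def check_ucs2(s: str) -> str:
    """Reject code points that do not fit in a single 2-byte UTF-16 unit."""
    for i, ch in enumerate(s):
        if ord(ch) > 0xFFFF:
            raise UnicodeEncodeError(
                "ucs-2", s, i, i + 1,
                "code point outside the Basic Multilingual Plane",
            )
    return s

print(check_ucs2("Grüße"))   # fine: all characters are in the BMP
# check_ucs2("𝕊") would raise, since U+1D54A needs a surrogate pair
```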
 
- utf-8 (variable, up to 4 bytes):
variable-length encoding with a minimum size of 1 byte, but we would need
to assume the worst case of 4 bytes, so it would not save anything
compared to utf-32; it may, however, allow third parties to replace an
encoding step with trailing-null trimming on serialization.

yeach -- utf-8 is great for interchange and streaming data, but not for internal storage, particularly with numpy's every-item-has-the-same-number-of-bytes requirement. So if someone wants to work with utf-8 they can store it in a byte array, and encode and decode as they pass it to/from python. That's going to have to happen anyway, even if under the hood. And it's risky business -- if you truncate a utf-8 bytestring, you may get invalid data -- it really does not belong in numpy.
 
- Add a new dtype, e.g. npy.realstring

I think that's the way to go. Backwards compatibility is really key. Though could we make the existing string dtype a latin-1-always type without breaking too much? Or maybe deprecate it and get there in the future?

It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete and is more busywork to implement.

I think the np.unicode type should remain as the 4-bytes per char encoding. But that only makes sense if you follow my idea that we don't have a variable number of bytes per char dtype.

So my proposal is:

 - Create a new one-byte-per-char dtype that is always latin-9 encoded.
    - in python3 it would map to a string (i.e. unicode)
 - Keep the 4-byte per char unicode string type

Optionally (if there is really demand)
 - Create a new two-byte per char dtype that is always UCS-2 encoded.


Is there any way to leverage Python3's nifty string type? I'm thinking not. At least not for numpy arrays that can play well with C code, etc.

All that being said, an encoding-specified string dtype would be nice too -- I just think it's more complex than it needs to be. Numpy is not the tool for text processing...

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: proposal: smaller representation of string arrays

Stephan Hoyer-2
In reply to this post by Anne Archibald
Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing.

On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <[hidden email]> wrote:
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. 

I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.



Re: proposal: smaller representation of string arrays

Chris Barker - NOAA Federal
In reply to this post by Anne Archibald
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <[hidden email]> wrote:
Is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand. 

I think it should support all fixed-length encodings, but not the non-fixed length ones -- they just don't fit well into the numpy data model.
 
Also, if latin1 is to going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.) 

latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for the European languages and some symbols that are nice to have (I use the degree symbol a lot...). And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-*

Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte per character encoding? 
 
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type.

we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing.

memory is cheap, compressing is fast -- we really shouldn't get hung up on this!

Note: if you are storing a LOT of text (though I have no idea why you would use numpy for that anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too.

I expect most text storage in numpy arrays is things like names of datasets, ids, etc. -- not massive amounts of text -- so storage space really isn't critical. But having an id or something unexpectedly truncated could be bad.

I think practical experience has shown us that people do not handle "mostly fixed length but once in a while not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). But still -- why invite hard-to-test-for errors?

Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array.

-CHB

 it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. 

ouch! that perception is the route to way too many errors! it is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake!

This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later.

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)

yup -- not sure we'll get much guidance here though -- netCDF does not solve this problem well, either.

But if you are pulling, say, a utf-8 encoded string out of a netcdf file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than pasting the bytes straight to a numpy array and hope that the encoding and truncation are correct.

-CHB



Re: proposal: smaller representation of string arrays

Neal Becker
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included?


Re: proposal: smaller representation of string arrays

Chris Barker - NOAA Federal
In reply to this post by Stephan Hoyer-2
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <[hidden email]> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units") and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).
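The "truncated correctly, so as not to create an invalid string" step can be sketched at the byte level: chop the buffer, then drop any trailing partial multi-byte sequence (a sketch; errors="ignore" would also drop other invalid bytes elsewhere in the buffer, so this is not production-grade validation):

```python
def utf8_truncate(data: bytes, nbytes: int) -> bytes:
    """Cut a UTF-8 buffer to at most nbytes without splitting a character."""
    # Decoding with errors="ignore" silently discards the trailing
    # incomplete multi-byte sequence left behind by the raw slice.
    return data[:nbytes].decode("utf-8", errors="ignore").encode("utf-8")

raw = "héllo".encode("utf-8")   # 6 bytes: é takes two
print(utf8_truncate(raw, 2))    # b'h' -- the split é is dropped entirely
```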

But how much to over-allocate? For English text, with an occasional scientific symbol, only a little. For, say, Japanese text, you'd need a factor of 2 maybe? 

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

- you need fixed-length storage
- you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
 
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.

I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)

On the other hand,  if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific uses tend to be latin-only and space constrained.

All that being said, if the truncation code were carefully written, it would mostly "just work"

-CHB



Re: proposal: smaller representation of string arrays

Chris Barker - NOAA Federal
In reply to this post by Neal Becker
On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker <[hidden email]> wrote:
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included?

sure -- it's just a bit fiddly -- and you need to make sure that everything gets passed through the proper mechanism. numpy is all about folks using other code to mess with the bytes in a numpy array, so we can't expect that all numpy string arrays will have been created with numpy code.

Does python's string have a truncated encode option? i.e. you don't want to encode to utf-8 and then just chop it off.

-CHB



Re: proposal: smaller representation of string arrays

Eric Wieser
In reply to this post by Chris Barker - NOAA Federal
> if you truncate a utf-8 bytestring, you may get invalid data

Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8.

Also, is silent truncation a thing that we want to allow to happen anyway? That sounds like something the user ought to be alerted to with an exception.

> if you wanted to specify that a numpy element would be able to hold, say, N characters
> ...
> It simply is not the right way to handle text if [...] you need fixed-length storage

It seems to me that counting code points is pretty futile in unicode, due to combining characters. The only two meaningful things to count are:
* Graphemes, as that's what the user sees visually. These can span multiple code points.
* Bytes of encoded data, as that's the space needed to store them.

So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy.

Counting graphemes is hard, so that leaves the only sensible option as a byte count.
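The combining-character point is easy to demonstrate: len() counts code points, not the graphemes a user sees:

```python
import unicodedata

# Two spellings of the same visible character "é":
composed = "\u00e9"      # one code point: U+00E9
decomposed = "e\u0301"   # two code points: 'e' + combining acute accent

print(len(composed), len(decomposed))  # 1 2 -- same grapheme, different counts

# NFC normalisation maps one spelling onto the other, but arbitrary text
# can still contain grapheme clusters spanning several code points.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```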

I don't foresee variable-length encodings being a problem implementation-wise -- they only become one if numpy were to acquire a vectorized substring function that is intended to return a view.

I think I'd be in favor of supporting all encodings, and falling back on python to handle encoding/decoding them.


On Thu, 20 Apr 2017 at 18:44 Chris Barker <[hidden email]> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <[hidden email]> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"). and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).

But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe? 

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

you need fixed-length storage
you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
 
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.

I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)

On the other hand,  if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.

I think a 1-byte-per char latin-* encoded string is a good idea though -- scientific use tend to be latin only and space constrained.

All that being said, if the truncation code were carefully written, it would mostly "just work"

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: proposal: smaller representation of string arrays

Julian Taylor-3
In reply to this post by Julian Taylor-3
I probably should have formulated my goal with the proposal a bit better: I am
not very interested in a repetition of the which-encoding-to-use debate.
In the end what will be done allows any encoding via a dtype with
metadata like datetime.
This allows any codec (including truncated utf8) to be added easily (if
python supports it) and allows sidestepping the debate.

My main concern is whether it should be a new dtype or modifying the
unicode dtype. Though the backward compatibility argument is strongly in
favour of adding a new dtype that makes the np.unicode type redundant.

On 20.04.2017 15:15, Julian Taylor wrote:

> Hello,
> As you probably know numpy does not deal well with strings in Python3.
> The np.string type is actually zero terminated bytes and not a string.
> In Python2 this happened to work out as it treats bytes and strings the
> same way. But in Python3 this type is pretty hard to work with as each
> time you get an item from a numpy bytes array it needs decoding to
> receive a string.
> The only string type available in Python3 is np.unicode which uses
> 4-byte utf-32 encoding which is deemed to use too much memory to
> actually see much use.
>
> What people apparently want is a string type for Python3 which uses less
> memory for the common science use case which rarely needs more than
> latin1 encoding.
> As we have been told we cannot change the np.string type to actually be
> strings as existing programs do interpret its content as bytes despite
> this being very broken due to its null terminating property (it will
> ignore all trailing nulls).
> Also 8 years of working around numpy's poor python3 support decisions in
> third parties probably make the 'return bytes' behaviour impossible to
> change now.
>
> So we need a new dtype that can represent strings in numpy arrays which
> is smaller than the existing 4 byte utf-32.
>
> To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datetime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
>
> Encodings we should support are:
> - latin1 (1 byte):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent to np.unicode
>
> Encodings we should maybe support:
> - utf-16 with explicitly disallowing surrogate pairs (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
> - utf-8 (4 bytes):
> variable length encoding with a minimum size of 1 byte, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32, but may allow third parties to replace an encoding step
> with trailing null trimming on serialization.
>
>
> To actually do this we have two options both of which break our ABI when
> doing so without ugly hacks.
>
> - Add a new dtype, e.g. npy.realstring
> By not modifying an existing type, we would only break programs using
> NPY_CHAR. The most notable case of this is f2py.
> It has the cosmetic disadvantage that it makes the np.unicode dtype
> obsolete and is more busywork to implement.
>
> - Modify np.unicode to have encoding metadata
> This allows us to reuse all the type boilerplate, so it is more
> convenient to implement, and by extending an existing type instead of
> making one obsolete it results in a much nicer API.
> The big drawback is that it will explicitly break any third party that
> receives an array with a new encoding and assumes that the buffer of an
> array of type np.unicode will have a character itemsize of 4 bytes.
> To ease this problem we would need to add APIs to get the itemsize and
> encoding to numpy now so third parties can error out cleanly.
>
> The implementation of it is not that big a deal, I have already created
> a prototype for adding latin1 metadata to np.unicode which works quite
> well. It is imo realistic to get this into 1.14 should we be able to
> make a decision on which way to implement it.
>
> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?
>
> cheers,
> Julian
>



Re: proposal: smaller representation of string arrays

Stephan Hoyer-2
In reply to this post by Chris Barker - NOAA Federal
On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker <[hidden email]> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <[hidden email]> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"). and an unknown number of characters.

Apologies for confusing the terminology! Yes, this would mean a fixed number of bytes and an unknown number of characters. 
 
As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.

It's already unsafe to try to insert arbitrary length strings into a numpy string_ or unicode_ array. When determining the dtype automatically (e.g., with np.array(list_of_strings)), the difference is that numpy would need to check the maximum encoded length instead of the character length (i.e., len(x.encode()) instead of len(x)).
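A sketch of the sizing check this implies; the helper is hypothetical, shown only to make the len(x.encode()) point concrete:

```python
def utf8_itemsize(strings):
    # A UTF-8-backed dtype would size elements by the maximum *encoded*
    # byte length, not the maximum character count.
    return max(len(s.encode("utf-8")) for s in strings)

words = ["cat", "Schrödinger"]
max(len(s) for s in words)   # 11 characters
utf8_itemsize(words)         # 12 bytes -- 'ö' encodes to 2 bytes in UTF-8
```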

I certainly would not over-allocate. If users want more space, they can explicitly choose an appropriate size. (This is a hazard of not having variable-length dtypes.)

If users really want to be able to fit an arbitrary number of unicode characters and aren't concerned about memory usage, they can still use np.unicode_ -- that won't be going away.
 
So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).

NumPy already does this sort of silent truncation with longer strings inserted into shorter string dtypes. The difference here would indeed be the need to check the number of bytes represented by the string instead of the number of characters.

But I don't think this is useful behavior to bring over to a new dtype. We should error instead of silently truncating. This is certainly easier than trying to figure out when we would be splitting a character.
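The truncation being discussed already happens silently with today's fixed-width dtypes:

```python
import numpy as np

a = np.zeros(1, dtype="S3")
a[0] = b"hello"          # too long for 3 bytes
print(a[0])              # b'hel' -- silently truncated, no warning or error

u = np.zeros(1, dtype="U3")
u[0] = "hello"
print(u[0])              # 'hel' -- same silent truncation for unicode
```

With UTF-8 storage the same rule could cut a multi-byte character in half, hence the argument for raising an error instead.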
 
But how much to over-allocate? For English text, with an occasional scientific symbol, only a little. For, say, Japanese text, you'd need a factor of 2, maybe?

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

you need fixed-length storage
you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.

As Yu Feng points out in this GitHub comment, non-latin language speakers are already aware of the difference between string length and bytes length:

Making an API based on code units instead of code points really seems like the saner way to handle unicode strings. I agree with this section of the DyND design docs for its string type, which notes precedent from Julia and Go:

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific use tends to be latin-only and space constrained.

I think scientific users tend to be ASCII only, so UTF-8 would also work transparently :).
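For pure-ASCII data the two choices really are interchangeable at the byte level, which is the transparency point (quick check):

```python
s = "T_eff = 5778 K"                  # ASCII-only scientific text
assert s.encode("latin-1") == s.encode("utf-8") == s.encode("ascii")

# The encodings only diverge once non-ASCII characters appear:
t = "T\u00e9l\u00e9scope"             # 'Téléscope'
print(t.encode("latin-1"))           # b'T\xe9l\xe9scope' (1 byte per char)
print(t.encode("utf-8"))             # b'T\xc3\xa9l\xc3\xa9scope' ('é' = 2 bytes)
```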



Re: proposal: smaller representation of string arrays

Antoine Pitrou-2
In reply to this post by Stephan Hoyer-2
On Thu, 20 Apr 2017 10:26:13 -0700
Stephan Hoyer <[hidden email]> wrote:

>
> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed sized
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
>
> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>
> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.  

I think you want at least: ascii, utf8, ucs2 (aka utf16 without
surrogates), utf32.  That is, 3 common fixed width encodings and one
variable width encoding.
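The storage cost per code unit of those four choices can be checked directly with Python's codecs:

```python
# Bytes per code unit for the encodings listed above.
for name in ["ascii", "utf-8", "utf-16-le", "utf-32-le"]:
    print(name, len("A".encode(name)))
# ascii      1 (fixed)
# utf-8      1 here, 1-4 in general (variable width)
# utf-16-le  2 (fixed width if surrogate pairs are disallowed, i.e. UCS-2)
# utf-32-le  4 (fixed)
```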

Regards

Antoine.

Re: proposal: smaller representation of string arrays

Robert Kern-2
In reply to this post by Julian Taylor-3
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <[hidden email]> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.

For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.

I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).
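A rough sketch of what such utility functions could look like; names and signatures are purely illustrative, not a proposed API:

```python
import numpy as np

def encode_array(obj_arr, encoding="utf-8"):
    # Pack an object array of Python str into a uniform-length bytes array,
    # sized by the longest encoded element.
    encoded = [s.encode(encoding) for s in obj_arr]
    itemsize = max(len(b) for b in encoded)
    return np.array(encoded, dtype="S%d" % itemsize)

def decode_array(byte_arr, encoding="utf-8"):
    # Unpack back to Python str objects (the 'S' dtype strips trailing
    # NULs on access, so shorter elements round-trip cleanly).
    return np.array([b.decode(encoding) for b in byte_arr], dtype=object)

strings = np.array(["alpha", "β"], dtype=object)
packed = encode_array(strings)      # dtype 'S5': 'alpha' needs 5 bytes
roundtrip = decode_array(packed)    # back to str: ['alpha', 'β']
```

One caveat of the NUL stripping: a string whose encoding legitimately ends in b'\x00' would not round-trip, which is exactly the kind of corner case such utilities would need to document.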

--
Robert Kern


Re: proposal: smaller representation of string arrays

Anne Archibald
In reply to this post by Julian Taylor-3
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <[hidden email]> wrote:
I probably should have formulated my goal with the proposal a bit better: I am
not very interested in a repetition of the which-encoding-to-use debate.
In the end what will be done allows any encoding via a dtype with
metadata like datetime.
This allows any codec (including truncated utf8) to be added easily (if
python supports it) and allows sidestepping the debate.

My main concern is whether it should be a new dtype or modifying the
unicode dtype. Though the backward compatibility argument is strongly in
favour of adding a new dtype that makes the np.unicode type redundant.

Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively  maintain/deprecate the old one.) 

Anne


Re: proposal: smaller representation of string arrays

Stephan Hoyer-2
In reply to this post by Robert Kern-2
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <[hidden email]> wrote:
I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.

HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions:

"Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters.


Re: proposal: smaller representation of string arrays

Eric Wieser
In reply to this post by Anne Archibald
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary?

Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code carelessly dropping it? Is this a problem in both C and python, or just C?

If that's the case, can we end up with a compromise where being careless just causes old code to promote to ucs32?

On Thu, 20 Apr 2017 at 20:09 Anne Archibald <[hidden email]> wrote:
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <[hidden email]> wrote:
I probably should have formulated my goal with the proposal a bit better: I am
not very interested in a repetition of the which-encoding-to-use debate.
In the end what will be done allows any encoding via a dtype with
metadata like datetime.
This allows any codec (including truncated utf8) to be added easily (if
python supports it) and allows sidestepping the debate.

My main concern is whether it should be a new dtype or modifying the
unicode dtype. Though the backward compatibility argument is strongly in
favour of adding a new dtype that makes the np.unicode type redundant.

Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively  maintain/deprecate the old one.) 

Anne

Re: proposal: smaller representation of string arrays

Anne Archibald
In reply to this post by Robert Kern-2

On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <[hidden email]> wrote:
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <[hidden email]> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.

+1 
 
FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.

For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.

Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.

[...]
If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.

How would this differ from a numpy array of bytes with one more dimension?  
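At the buffer level, very little -- `.view()` converts between the two layouts without copying, as a quick demonstration with the existing 'S' dtype shows:

```python
import numpy as np

a = np.array([b"spam!", b"eggs."], dtype="S5")
# Reinterpret the same buffer as a uint8 array with one extra dimension.
raw = a.view(np.uint8).reshape(len(a), a.itemsize)
print(raw.shape)        # (2, 5)
print(bytes(raw[0]))    # b'spam!'
```

The practical differences are scalar access (bytes objects vs. rows of uint8) and the trailing-NUL stripping of the 'S' dtype.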
 
* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).

I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps?  Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest.

Anne


Re: proposal: smaller representation of string arrays

Robert Kern-2
In reply to this post by Stephan Hoyer-2
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer <[hidden email]> wrote:

>
> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <[hidden email]> wrote:
>>
>> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.
>
>
> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions:
> https://github.com/PyTables/PyTables/issues/499
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters.

Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics.

--
Robert Kern


Re: proposal: smaller representation of string arrays

Charles R Harris
In reply to this post by Robert Kern-2


On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern <[hidden email]> wrote:
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <[hidden email]> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.

For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.

I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).


A little history: IIRC, storing null-terminated strings in fixed byte lengths was done in Fortran, where strings were usually stored in integers/integer arrays.

If memory mapping of arbitrary types is not important, I'd settle for ascii or latin-1, utf-8 fixed byte length, and arrays of fixed python object type. Using one-byte encodings and utf-8 avoids needing to deal with endianness.
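The endianness point in concrete terms: one-byte encodings and UTF-8 define the byte stream completely, while UTF-16/UTF-32 come in little- and big-endian flavors that must be tracked (or carry a BOM), just like multi-byte numeric dtypes:

```python
utf8 = "a".encode("utf-8")      # b'a' -- no byte-order question
le = "a".encode("utf-32-le")    # b'a\x00\x00\x00'
be = "a".encode("utf-32-be")    # b'\x00\x00\x00a'
print(utf8, le, be)
```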

Chuck 


Re: proposal: smaller representation of string arrays

Julian Taylor-3
In reply to this post by Robert Kern-2
On 20.04.2017 20:53, Robert Kern wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> <[hidden email] <mailto:[hidden email]>>
> wrote:
>
>> Do you have comments on how to go forward, in particular in regards to
>> new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with
> the current sub-optimal situation because we never really laid out the
> use cases. We just felt like we needed bytestring and unicode dtypes,
> more out of completionism than anything, and we made a bunch of
> assumptions just to get each one done. I think there may be broad
> agreement that many of those assumptions are "wrong", but it would be
> good to reference that against concretely-stated use cases.
We ended up in this situation because we did not take the opportunity to
break compatibility when python3 support was added.
We should have made the string dtype an encoded byte type (ascii or
latin1) in python3 instead of null-terminated unencoded bytes, which do
not make much practical sense.

So the use case is very simple: Give users of the string dtype a
migration path that does not involve converting to full utf32 unicode.
The latin1 encoded bytes dtype would allow that.

As we already have the infrastructure, this same dtype can allow more
than just latin1 with minimal effort: for the fixed-size encodings Python
supports, it is literally adding an enum entry, two new switch clauses,
and a little bit of dtype string parsing and test cases.


Having some form of variable string handling would be nice. But this is
another topic altogether.
Having builtin support for variable-length strings only seems like
overkill, as the string dtype is not that important and object arrays
should work reasonably well for this use case already.
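The object-array workaround referred to here, for reference:

```python
import numpy as np

# dtype=object arrays hold ordinary Python str objects, so elements can
# have any length and the full str API is available per element.
a = np.array(["x", "a much longer string"], dtype=object)
print(a[1].upper())                      # no fixed width involved
a[0] = "replaced with something longer"  # no truncation either
print(a[0])
```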

