call for testing: unicode loadtxt/genfromtxt

Julian Taylor-3
hi,
It is very long overdue, but we finally have an attempt at making our
text IO functions actually use text IO instead of bytes IO.
This means genfromtxt, loadtxt, fromregex and savetxt should support
unicode input files in any Python-supported encoding, as well as
universal newlines.
This is the first stepping stone to finally making numpy python3 compatible.

The code is available in:
https://github.com/numpy/numpy/pull/4208

Great effort has been spent to keep it backward compatible, but our only
reference is our testsuite, which certainly does not cover all of the
workarounds employed for this issue over the last 8 years.
So we need people to dig out their ugliest hacks and test whether they
still work with this changeset.
Functions that need testing are:
loadtxt
genfromtxt
fromregex
savetxt

Test on any input that worked in older versions of numpy (including gzip
compressed files) and on inputs that did not work because they were
encoded in something other than latin1 or had issues with linebreaks.
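
A test along these lines could look like the sketch below. It assumes a
numpy build that includes this changeset and accepts the encoding keyword
argument described in this mail; the file names, the sample data and the
choice of UTF-8 are made up for illustration.

```python
import gzip
import os
import tempfile

import numpy as np

# Made-up sample data: a non-latin1 comment line followed by numbers.
text = "# температура давление\n1.0 2.0\n3.0 4.0\n"

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "sample.txt")
gz_path = os.path.join(tmpdir, "sample.txt.gz")

# Write the same UTF-8 text once uncompressed and once gzip compressed.
with open(plain_path, "w", encoding="utf-8") as f:
    f.write(text)
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write(text)

# Assumption: loadtxt accepts the encoding keyword from this PR.
plain = np.loadtxt(plain_path, encoding="utf-8")
zipped = np.loadtxt(gz_path, encoding="utf-8")

print(plain)
print(np.array_equal(plain, zipped))
```

Swapping in your own files and the encoding they were written in is the
interesting part; the snippet only shows the shape of such a check.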

The PR adds an encoding keyword argument to all functions dealing with
text input and output. All streams opened by these functions have been
changed from byte streams to text streams.
As previously only latin1 encoded byte streams were supported, all input
byte streams are still decoded as latin1.
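
As background on why latin1 is a workable fallback for arbitrary byte
streams: it maps each of the 256 byte values to exactly one code point,
so decoding never fails and can be reversed. The following is a small
standard-library illustration of that property, not code from the PR.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# latin1 decoding never raises: each byte becomes the code point
# with the same numeric value.
as_text = all_bytes.decode("latin1")
assert [ord(ch) for ch in as_text] == list(range(256))

# Encoding back gives the original bytes, so nothing is lost.
round_tripped = as_text.encode("latin1")
print(round_tripped == all_bytes)

# The same bytes are not valid UTF-8, so they cannot be decoded
# blindly with that codec.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```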

Converters added by the user may have been relying on their input being
bytes. To deal with that, the default value of the encoding argument is
'bytes'. It uses the same encoding as None (the default encoding) but
additionally converts values to latin1 encoded bytes before passing them
to user converters.
If you want to use converters that operate on strings, you now have to
explicitly set encoding to something else (e.g. None).
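
To make the converter case concrete, here is a minimal sketch of a
converter that copes with both kinds of input. It assumes a numpy build
with this changeset; decimal_comma and the sample data are hypothetical
names made up for this example.

```python
import io

import numpy as np


def decimal_comma(value):
    """Parse a number written with a decimal comma, e.g. '1,5'."""
    # With encoding='bytes' the converter is expected to get bytes,
    # with any other encoding it is expected to get str.
    if isinstance(value, bytes):
        value = value.decode("latin1")
    return float(value.replace(",", "."))


data = "1,5 2,5\n3,5 4,5\n"

# Assumption: encoding=None makes loadtxt pass str to the converters.
result = np.loadtxt(
    io.StringIO(data),
    converters={0: decimal_comma, 1: decimal_comma},
    encoding=None,
)
print(result)
```

Old converters that call bytes-only methods on their argument are the
ones most likely to break, so those are the ones worth testing with and
without an explicit encoding.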

Currently the functions do not support the newline keyword argument that
Python's text IO streams support. This will probably still get added.

Related issues and discussions:

https://github.com/numpy/numpy/issues/4600
https://github.com/numpy/numpy/issues/3184
https://github.com/numpy/numpy/issues/4939
https://github.com/numpy/numpy/issues/4543
http://numpy-discussion.10968.n7.nabble.com/using-loadtxt-to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003
http://numpy-discussion.10968.n7.nabble.com/genfromtxt-universal-newline-support-td37816.html
https://github.com/dhomeier/numpy/commit/995ec93

cheers,
Julian


_______________________________________________
NumPy-Discussion mailing list
https://mail.python.org/mailman/listinfo/numpy-discussion