Feature request: Alternative representation for arrays with many dimensions

Fang Zhang
By default, the __repr__ and __str__ functions of NumPy arrays summarize long arrays (i.e. omit all items but a few at the beginning and end of each dimension). This is a good thing: when debugging, programmers can call print() on arrays with millions of elements without clogging the output or taking up too much CPU/memory (unsurprisingly, the string representation of an array item usually takes more bytes than its binary representation).

However, this mechanism does not help when an array has a lot of short dimensions, e.g. np.arange(2 ** 20).reshape((2,) * 20). I often encounter such arrays in my work, and every once in a while I try to print such an array without flattening it first (usually because I don't know what shape or even what type the variable I am trying to print is). This has caused incidents ranging from losing everything in my scrollback buffer to crashing my computer by using too much memory.
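To make the problem concrete, here is a rough sketch that estimates how many items survive summarization for a given shape. It assumes an axis is abbreviated only when it is longer than 2 * edgeitems (edgeitems defaults to 3); the helper name summarized_item_count is hypothetical and is not part of NumPy.

```python
import math


def summarized_item_count(shape, edgeitems=3):
    """Estimate how many items are printed after per-axis summarization.

    Assumption: an axis is abbreviated only when its length exceeds
    2 * edgeitems; shorter axes are printed in full.
    """
    return math.prod(min(n, 2 * edgeitems) for n in shape)


print(summarized_item_count((1000, 1000)))  # 36
print(summarized_item_count((2,) * 20))     # 1048576
```

Under that assumption, a 1000 x 1000 array shrinks to 36 printed items, while the 20-dimensional array keeps all 2 ** 20 of them.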

I think it may be a good idea to change the way NumPy pretty-prints arrays with such shapes to avoid this situation. Something like "array([ 0, 1, 2, ..., 1048573, 1048574, 1048575]).reshape(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)" would be good enough for me. The condition that triggers such a representation could be either a fixed number of dimensions, or the case where, even after summarizing, the pretty printer would still print more items than the threshold (1000 by default). Since the outputs of __repr__ and __str__ are meant for human eyes rather than computers, I think this should not cause too much of a compatibility problem.
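The proposed fallback could be prototyped outside NumPy roughly as follows. This is only a sketch of the second trigger condition (more items than the threshold remain after summarizing); compact_repr is a hypothetical helper, and the item-count estimate assumes an axis is abbreviated only when it is longer than 2 * edgeitems.

```python
import math

import numpy as np


def compact_repr(arr, threshold=1000, edgeitems=3):
    """Return repr(arr), or a flattened repr plus .reshape(...) if too long."""
    arr = np.asarray(arr)
    # Estimated number of items still printed after summarization.
    shown = math.prod(min(n, 2 * edgeitems) for n in arr.shape)
    if arr.ndim <= 1 or shown <= threshold:
        return repr(arr)
    # Summarize the flattened array instead, then record the real shape.
    # Note: ravel() copies the data if the array is not contiguous.
    with np.printoptions(threshold=threshold, edgeitems=edgeitems):
        flat = repr(arr.ravel())
    dims = ", ".join(str(n) for n in arr.shape)
    return f"{flat}.reshape({dims})"


print(compact_repr(np.arange(2 ** 20).reshape((2,) * 20)))
```

For the example array this prints a single short line ending in .reshape(2, 2, ..., 2) with twenty 2s; arrays that already summarize well are passed through to the ordinary repr unchanged.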

What do you all think?

Sincerely,
Fang Zhang

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: Feature request: Alternative representation for arrays with many dimensions

Stephan Hoyer-2
On Wed, Dec 9, 2020 at 2:24 PM Fang Zhang <[hidden email]> wrote:
> By default, the __repr__ and __str__ functions of NumPy arrays summarize long arrays (i.e. omit all items but a few at the beginning and end of each dimension). This is a good thing: when debugging, programmers can call print() on arrays with millions of elements without clogging the output or taking up too much CPU/memory (unsurprisingly, the string representation of an array item usually takes more bytes than its binary representation).
>
> However, this mechanism does not help when an array has a lot of short dimensions, e.g. np.arange(2 ** 20).reshape((2,) * 20). I often encounter such arrays in my work, and every once in a while I try to print such an array without flattening it first (usually because I don't know what shape or even what type the variable I am trying to print is). This has caused incidents ranging from losing everything in my scrollback buffer to crashing my computer by using too much memory.
>
> I think it may be a good idea to change the way NumPy pretty-prints arrays with such shapes to avoid this situation. Something like "array([ 0, 1, 2, ..., 1048573, 1048574, 1048575]).reshape(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)" would be good enough for me. The condition that triggers such a representation could be either a fixed number of dimensions, or the case where, even after summarizing, the pretty printer would still print more items than the threshold (1000 by default). Since the outputs of __repr__ and __str__ are meant for human eyes rather than computers, I think this should not cause too much of a compatibility problem.

+1, this could use improvement. For high-dimensional arrays, the way NumPy prints is far too verbose.
 
In xarray, we automatically decrease "edgeitems" when printing NumPy arrays, to 2 for ndim=3 and 1 for ndim>3.
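A minimal sketch of that idea using NumPy's print options is shown below. This is not xarray's actual code: short_repr is a hypothetical name, and the lowered threshold of 200 is an assumption made for the example.

```python
import numpy as np


def short_repr(arr, threshold=200):
    """repr() with fewer edge items as the number of dimensions grows."""
    arr = np.asarray(arr)
    if arr.ndim < 3:
        edgeitems = 3  # NumPy's default
    elif arr.ndim == 3:
        edgeitems = 2
    else:
        edgeitems = 1
    # printoptions restores the previous settings when the block exits.
    with np.printoptions(edgeitems=edgeitems, threshold=threshold):
        return repr(arr)


a = np.zeros((10, 10, 10, 10))
print(len(repr(a)), len(short_repr(a)))
```

This shortens the output considerably when the axes are long. As far as I can tell it does not help with the (2,) * 20 case, though, since an axis of length 2 is not abbreviated even with edgeitems=1.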

As a last resort, we could consider automatically limiting the maximum number of displayed lines, adding "..." for clipped lines. It is unlikely, for example, that anyone ever wants to print more than ~100 lines of text to the screen, which can easily happen for very high-dimensional arrays.
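The line-limiting idea can be sketched as a post-processing step on the finished string. The helper clip_lines and its default of 100 lines are hypothetical and only illustrate the suggestion; a real implementation would presumably avoid building the full string in the first place.

```python
def clip_lines(text, max_lines=100):
    """Keep at most max_lines lines, replacing the middle with '...'."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = max_lines // 2
    tail = max_lines - head - 1  # one line is reserved for the "..." marker
    kept = lines[:head] + ["..."] + (lines[-tail:] if tail > 0 else [])
    return "\n".join(kept)


long_text = "\n".join(str(i) for i in range(1000))
print(clip_lines(long_text, max_lines=10))
```

With max_lines=10 this keeps the first five lines, a "..." marker, and the last four lines.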

