NEP 21: Simplified and explicit advanced indexing


NEP 21: Simplified and explicit advanced indexing

Stephan Hoyer-2
Sebastian and I have revised a Numpy Enhancement Proposal that he started three years ago for overhauling NumPy's advanced indexing. We'd now like to present it for official consideration.

Minor inline comments (e.g., typos) can be added to the latest pull request (https://github.com/numpy/numpy/pull/11414/files), but otherwise let's keep discussion on the mailing list. The NumPy website should update shortly with a rendered version (http://www.numpy.org/neps/nep-0021-advanced-indexing.html), but until then please see the full text below.

Cheers,
Stephan

=========================================
Simplified and explicit advanced indexing
=========================================

:Author: Sebastian Berg
:Author: Stephan Hoyer <[hidden email]>
:Status: Draft
:Type: Standards Track
:Created: 2015-08-27


Abstract
--------

NumPy's "advanced" indexing support for indexing arrays with other arrays is
one of its most powerful and popular features. Unfortunately, the existing
rules for advanced indexing with multiple array indices are typically confusing
to both new, and in many cases even old, users of NumPy. Here we propose an
overhaul and simplification of advanced indexing, including two new "indexer"
attributes ``oindex`` and ``vindex`` to facilitate explicit indexing.

Background
----------

Existing indexing operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NumPy arrays currently support a flexible range of indexing operations:

- "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis
  (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element
  from the 0th axis, the first three elements from the 1st axis and inserting a
  new axis of size 1 at the end. Basic indexing always returns a view of the
  indexed array's data.
- "Advanced" indexing, also called "fancy" indexing, includes all cases where
  arrays are indexed by other arrays. Advanced indexing always makes a copy:

  - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for
    selecting positive elements.
  - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]``
    for selecting the first two elements along the first axis. With multiple
    arrays, vectorized indexing uses broadcasting rules to combine indices along
    multiple dimensions. This allows for producing a result of arbitrary shape
    with arbitrary elements from the original arrays.
  - "Mixed" indexing involving any combinations of the other advancing types.
    This is no more powerful than vectorized indexing, but is sometimes more
    convenient.

For clarity, we will refer to these existing rules as "legacy indexing".
This is only a high-level summary; for more details, see NumPy's documentation
and `Examples` below.

Outer indexing
~~~~~~~~~~~~~~

One broadly useful class of indexing operations is not supported:

- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to
  slices for determining output shapes. The rule for outer indexing is that the
  result should be equivalent to independently indexing along each dimension
  with integer or boolean arrays as if both the indexed and indexing arrays
  were one-dimensional. This form of indexing is familiar to many users of other
  programming languages such as MATLAB, Fortran and R.

The reason why NumPy omits support for outer indexing is that the rules for
outer and vectorized indexing conflict. Consider indexing a 2D array by two 1D integer
arrays, e.g., ``x[[0, 1], [0, 1]]``:

- Outer indexing is equivalent to combining multiple integer indices with
  ``itertools.product()``. The result in this case is another 2D array with
  all combinations of indexed elements, e.g.,
  ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])``
- Vectorized indexing is equivalent to combining multiple integer indices with
  ``zip()``. The result in this case is a 1D array containing the diagonal
  elements, e.g., ``np.array([x[0, 0], x[1, 1]])``.

This difference is a frequent stumbling block for new NumPy users. The outer
indexing model is easier to understand, and is a natural generalization of
slicing rules. But NumPy instead chose to support vectorized indexing, because
it is strictly more powerful.

It is always possible to emulate outer indexing by vectorized indexing with
the right indices. To make this easier, NumPy includes utility objects and
functions such as ``np.ogrid`` and ``np.ix_``, e.g.,
``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating
fully general/mixed outer indexing, which could unambiguously allow for slices,
integers, and 1D boolean and integer arrays.
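
For example, with current NumPy the two models can be compared directly
(a small sketch; ``np.ix_`` converts the 1D indices into broadcastable
arrays that emulate outer indexing)::

    >>> import numpy as np
    >>> x = np.arange(4).reshape(2, 2)
    >>> x[[0, 1], [0, 1]]          # vectorized: zip-like, picks the diagonal
    array([0, 3])
    >>> x[np.ix_([0, 1], [0, 1])]  # outer: product-like, all combinations
    array([[0, 1],
           [2, 3]])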

Mixed indexing
~~~~~~~~~~~~~~

NumPy's existing rules for combining multiple types of indexing in the same
operation are quite complex, involving a number of edge cases.

One reason why mixed indexing is particularly confusing is that at first glance
the result works deceptively like outer indexing. Returning to our example of a
2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with
axes in the same order as the original array.

However, as soon as two or more non-slice objects (including integers) are
introduced, vectorized indexing rules apply. The axes introduced by the array
indices are at the front, unless all array indices are consecutive, in which
case NumPy deduces where the user "expects" them to be. Consider indexing a 3D
array ``arr`` with shape ``(X, Y, Z)``:

1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``.
2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``.
3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``!

These first two cases are intuitive and consistent with outer indexing, but
this last case is quite surprising, even to many highly experienced NumPy users.
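
These shapes can be verified directly with current NumPy; a small
illustration using an arbitrary concrete shape ``(3, 4, 5)``::

    >>> import numpy as np
    >>> arr = np.empty((3, 4, 5))    # shape (X, Y, Z)
    >>> arr[:, [0, 1], 0].shape      # case 1
    (3, 2)
    >>> arr[[0, 1], 0, :].shape      # case 2
    (2, 5)
    >>> arr[0, :, [0, 1]].shape      # case 3: the array axis jumps to the front
    (2, 4)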

Mixed cases involving multiple array indices are also surprising, and only
less problematic because the current behavior is so useless that it is rarely
encountered in practice. When a boolean array index is mixed with another boolean or
integer array, the boolean array is converted to integer array indices (equivalent
to ``np.nonzero()``) and then broadcast. For example, indexing a 2D array of
size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector
with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``.
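
A small sketch of this behaviour with current NumPy (``np.ix_`` is shown for
comparison, since it produces the sub-matrix one might expect)::

    >>> import numpy as np
    >>> x = np.arange(4).reshape(2, 2)
    >>> mask = np.array([True, False])
    >>> x[mask, mask]            # nonzero() + broadcast: selects only x[0, 0]
    array([0])
    >>> x[np.ix_(mask, mask)]    # outer-style: the (1, 1) sub-matrix
    array([[0]])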

Mixed indexing seems so tricky that it is tempting to say that it never should
be used. However, it is not easy to avoid, because NumPy implicitly adds full
slices if there are fewer indices than the full dimensionality of the indexed
array. This means that indexing a 2D array like ``x[[0, 1]]`` is equivalent to
``x[[0, 1], :]``. These cases are not surprising, but they constrain the
behavior of mixed indexing.

Indexing in other Python array libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Indexing is a useful and widely recognized mechanism for accessing
multi-dimensional array data, so it is no surprise that many other libraries in
the scientific Python ecosystem also support array indexing.

Unfortunately, the full complexity of NumPy's indexing rules means that it is
both challenging and undesirable for other libraries to copy its behavior in all
of its nuance. The only full implementation of NumPy-style indexing is NumPy
itself. This includes projects like dask.array and h5py, which support *most*
types of array indexing in some form, and otherwise attempt to copy NumPy's API
exactly.

Vectorized indexing in particular can be challenging to implement with array
storage backends not based on NumPy. In contrast, indexing by 1D arrays along
at least one dimension in the style of outer indexing is much more achievable.
This has led many libraries (including dask and h5py) to attempt to define a
safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g.,
by only allowing indexing with an array along at most one dimension. However,
this is quite challenging to do correctly in a general enough way to be useful.
For example, the current versions of dask and h5py both handle mixed indexing
in case 3 above inconsistently with NumPy. This is quite likely to lead to
bugs.

These inconsistencies, in addition to the broader challenge of implementing
every type of indexing logic, make it challenging to write high-level array
libraries like xarray or dask.array that can interchangeably index many types of
array storage. In contrast, explicit APIs for outer and vectorized indexing in
NumPy would provide a model that external libraries could reliably emulate, even
if they don't support every type of indexing.

High level changes
------------------

Inspired by multiple "indexer" attributes for controlling different types
of indexing behavior in pandas, we propose to:

1. Introduce ``arr.oindex[indices]`` which allows array indices, but
   uses outer indexing logic.
2. Introduce ``arr.vindex[indices]`` which uses the current
   "vectorized"/broadcasted logic but with two differences from
   legacy indexing:
       
   * Boolean indices are not supported. All indices must be integers,
     integer arrays or slices.
   * The integer index result dimensions are always the first axes
     of the result array. No transpose is done, even for a single
     integer array index.

3. Plain indexing on arrays will start to give warnings and eventually
   errors in cases where one of the explicit indexers should be preferred:

   * First, in all cases where legacy and outer indexing would give
     different results.
   * Later, potentially in all cases involving an integer array.

These constraints are sufficient for making indexing generally consistent
with expectations and providing a less surprising learning curve with
``oindex``.
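
Neither attribute exists in current NumPy, but for the simple case of two
integer-array indices the proposed behaviour can be sketched in terms of
existing operations (``np.ix_`` for the outer case, plain broadcasting for
the vectorized case)::

    >>> import numpy as np
    >>> arr = np.arange(24).reshape(2, 3, 4)
    >>> i, j = np.array([0, 1]), np.array([1, 2])
    >>> arr[np.ix_(i, j)].shape    # roughly what arr.oindex[i, j, :] would give
    (2, 2, 4)
    >>> arr[i, j].shape            # roughly what arr.vindex[i, j, :] would give
    (2, 4)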

Note that all things mentioned here apply both for assignment as well as
subscription.

Understanding these details is *not* easy. The `Examples` section in the
discussion gives code examples, and the hopefully easier `Motivational
Example` provides some motivating use-cases for the general ideas; it is
likely a good start for anyone not intimately familiar with advanced
indexing.


Detailed Description
--------------------

Proposed rules
~~~~~~~~~~~~~~

From the three problems noted above some expectations for NumPy can
be deduced:

1. There should be a prominent outer/orthogonal indexing method such as
   ``arr.oindex[indices]``.

2. Considering how confusing vectorized/fancy indexing can be, it should
   be possible to make it more explicit (e.g. ``arr.vindex[indices]``).

3. A new ``arr.vindex[indices]`` method would not be tied to the
   confusing transpose rules of fancy indexing, which are for example
   needed for the simple case of a single advanced index. Thus,
   no transposing should be done. The axes created by the integer array
   indices are always inserted at the front, even for a single index.

4. Boolean indexing is conceptually outer indexing. Broadcasting
   together with other advanced indices in the manner of legacy
   indexing is generally not helpful or well defined.
   A user who wants the "``nonzero``" plus broadcast behaviour can thus
   be expected to do this manually. Thus, ``vindex`` does not need to
   support boolean index arrays.

5. An ``arr.legacy_index`` attribute should be implemented to support
   legacy indexing. This gives a simple way to update existing codebases
   using legacy indexing, which will make the deprecation of plain indexing
   behavior easier. The longer name ``legacy_index`` is intentionally chosen
   to be explicit and discourage its use in new code.

6. Plain indexing ``arr[...]`` should return an error for ambiguous cases.
   To begin with, this probably means that cases where ``arr[ind]`` and
   ``arr.oindex[ind]`` return different results will give deprecation warnings.
   This includes every use of vectorized indexing with multiple integer arrays.
   Due to the transposing behaviour, this means that ``arr[0, :, index_arr]``
   will be deprecated, but ``arr[:, 0, index_arr]`` will not, for the time being.

7. To ensure that existing subclasses of `ndarray` that override indexing
   do not inadvertently revert to default behavior for indexing attributes,
   these attributes should have explicit checks that disable them if
   ``__getitem__`` or ``__setitem__`` has been overridden.

Because the new indexing attributes, unlike plain indexing, are explicitly aimed
at higher dimensional indexing, several additional changes should be implemented:

* The indexing attributes will enforce exact dimension and indexing match.
  This means that no implicit ellipsis (``...``) will be added. Unless
  an ellipsis is present the indexing expression will thus only work for
  an array with a specific number of dimensions.
  This makes the expression more explicit and safeguards against wrong
  dimensionality of arrays.
  There should be no implications for "duck typing" compatibility with
  builtin Python sequences, because Python sequences only support a limited
  form of "basic indexing" with integers and slices.

* The current plain indexing allows for the use of non-tuples for
  multi-dimensional indexing such as ``arr[[slice(None), 2]]``.
  This creates some inconsistencies and thus the indexing attributes
  should only allow plain Python tuples for this purpose.
  (Whether or not this should be the case for plain indexing is a
  different issue.)

* The new attributes should not use ``__getitem__`` to implement
  ``__setitem__``, since doing so is a kludge and not useful for vectorized
  indexing. (Not implemented yet.)


Open Questions
~~~~~~~~~~~~~~

* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at
  the time of writing this; another name NumPy has used for something like
  ``oindex`` is ``np.ix_``. See also below.

* ``oindex`` and ``vindex`` could always return copies, even when no array
  operation occurs. One argument for allowing a view return is that this way
  ``oindex`` can be used as a general index replacement.
  However, there is one argument for returning copies: it is possible for
  ``arr.vindex[array_scalar, ...]``, where ``array_scalar`` should be
  a 0-D array but is not, since 0-D arrays tend to be converted to scalars.
  Copying always "fixes" this possible inconsistency.

* The final state into which plain indexing will morph is not fixed in this NEP.
  It is for example possible that ``arr[index]`` will be equivalent to
  ``arr.oindex[index]`` at some point in the future.
  Since such a change will take years, it seems unnecessary to make
  specific decisions at this time.

* The proposed changes to plain indexing could be postponed indefinitely or
  not taken in order to not break or force major fixes to existing code bases.


Alternative Names
~~~~~~~~~~~~~~~~~

Possible names that have been suggested (more suggestions may be added):

==============  ============ ========
**Orthogonal**  oindex       oix
**Vectorized**  vindex       vix
**Legacy**      legacy_index l/findex
==============  ============ ========


Subclasses
~~~~~~~~~~

Subclasses are a bit problematic in the light of these changes. There are
some possible solutions for this. For most subclasses (those which do not
provide ``__getitem__`` or ``__setitem__``) the special attributes should
just work. Subclasses that *do* provide them must be updated accordingly
and should preferably not subclass working versions of these attributes.

All subclasses will inherit the attributes; however, the implementation
of ``__getitem__`` on these attributes should test
``subclass.__getitem__ is ndarray.__getitem__``. If not, the
subclass has special handling for indexing and ``NotImplementedError``
should be raised, requiring that the indexing attributes are also explicitly
overridden. Likewise, implementations of ``__setitem__`` should check
whether ``__setitem__`` has been overridden.
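
A minimal sketch of how such a check might look; the helper class and its
name are purely illustrative, and a real implementation would live in C::

    import numpy as np

    class _OuterIndexer(object):
        """Illustrative helper backing a hypothetical ``oindex`` attribute."""

        def __init__(self, array):
            self.array = array

        def __getitem__(self, indices):
            # Refuse the default machinery if the subclass customizes indexing,
            # since silently bypassing the override could give wrong results.
            if type(self.array).__getitem__ is not np.ndarray.__getitem__:
                raise NotImplementedError(
                    "oindex is not supported for subclasses overriding "
                    "__getitem__; it must be implemented explicitly")
            # A real implementation would translate ``indices`` to an
            # equivalent vectorized index here; as a placeholder, the
            # indices are passed through unchanged.
            return self.array[indices]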

A further question is how to facilitate implementing the special attributes.
Also there is the weird functionality where ``__setitem__`` calls
``__getitem__`` for non-advanced indices. It might be good to avoid it for
the new attributes, but on the other hand, that may make it even more
confusing.

To facilitate implementations we could provide functions similar to
``operator.itemgetter`` and ``operator.setitem`` for the attributes.
Possibly a mixin could be provided to help implementation. These improvements
are not essential to the initial implementation, so they are saved for
future work.

Implementation
--------------

Implementation would start with writing special indexing objects available
through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these
indexing operations. Also, we would need to start to deprecate those plain index
operations which are ambiguous.
Furthermore, the NumPy code base will need to use the new attributes and
tests will have to be adapted.


Backward compatibility
----------------------

As a new feature, no backward compatibility issues with the new ``vindex``
and ``oindex`` attributes would arise. To facilitate backwards compatibility
as much as possible, we expect a long deprecation cycle for legacy indexing
behavior and propose the new ``legacy_index`` attribute.
Some forward compatibility issues with subclasses that do not specifically
implement the new methods may arise.


Alternatives
------------

NumPy may not choose to offer these different types of indexing methods, or
choose to only offer them through specific functions instead of the proposed
notation above.

We don't think that new functions are a good alternative, because indexing
notation ``[]`` offers some syntactic advantages in Python (i.e., direct
creation of slice objects) compared to functions.

A more reasonable alternative would be to write new wrapper objects for alternative
indexing with functions rather than methods (e.g., ``np.oindex(arr)[indices]``
instead of ``arr.oindex[indices]``). Functionally, this would be equivalent,
but indexing is such a common operation that we think it is important to
minimize syntax and worth implementing it directly on `ndarray` objects
themselves. Indexing attributes also define a clear interface that is easier
for alternative array implementations to copy, notwithstanding ongoing
efforts to make it easier to override NumPy functions [2]_.

Discussion
----------

The original discussion about vectorized vs outer/orthogonal indexing arose
on the NumPy mailing list:


Some discussion can be found on the original pull request for this NEP:


Python implementations of the indexing operations can be found at:



Examples
~~~~~~~~

Since the various kinds of indexing are hard to grasp in many cases, these
examples hopefully give some more insight. Note that they are all stated in
terms of shape.
In the examples, all original dimensions have 5 or more elements, while
advanced indexing inserts smaller dimensions.
These examples may be hard to grasp without working knowledge of advanced
indexing as of NumPy 1.9.

Example array::

    >>> arr = np.ones((5, 6, 7, 8))


Legacy fancy indexing
---------------------

Note that the same results can be achieved with ``arr.legacy_index``, which
will continue to work even in the cases marked "future error" below.

A single index is transposed (this is the same for all indexing types)::

    >>> arr[[0], ...].shape
    (1, 6, 7, 8)
    >>> arr[:, [0], ...].shape
    (5, 1, 7, 8)


Multiple indices are transposed *if* consecutive::

    >>> arr[:, [0], [0], :].shape  # future error
    (5, 1, 8)
    >>> arr[:, [0], :, [0]].shape  # future error
    (1, 5, 7)


It is important to note that a scalar *is* an integer array index in this
sense (and gets broadcast with the other advanced index)::

    >>> arr[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
    (1, 5, 7)


A single boolean index can act on multiple dimensions (or even the whole
array). It has to match the dimensions (as of 1.10 a mismatch only gives a
deprecation warning).
The boolean index is otherwise identical to (multiple consecutive) integer
array indices::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr[:, 0, bindx].shape
    (5, 1)
    >>> arr[0, :, bindx].shape
    (1, 6)


The combination with anything that is not a scalar is confusing, e.g.::

    >>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
    (1, 6)
    >>> arr[:, [0, 1], bindx].shape  # IndexError


Outer indexing
--------------

Multiple indices are "orthogonal" and their result axes are inserted 
at the same place (they are not broadcasted)::

    >>> arr.oindex[:, [0], [0, 1], :].shape
    (5, 1, 2, 8)
    >>> arr.oindex[:, [0], :, [0, 1]].shape
    (5, 1, 7, 2)
    >>> arr.oindex[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr.oindex[:, [0], :, 0].shape
    (5, 1, 7)


Boolean index results are always inserted where the index is::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.oindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.oindex[0, :, bindx].shape
    (6, 1)


Nothing changes in the presence of other advanced indices::

    >>> arr.oindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.oindex[:, [0, 1], bindx].shape
    (5, 2, 1)


Vectorized/inner indexing
-------------------------

Multiple indices are broadcast and iterated as one, like fancy indexing,
but the new axes are always inserted at the front::

    >>> arr.vindex[:, [0], [0, 1], :].shape
    (2, 5, 8)
    >>> arr.vindex[:, [0], :, [0, 1]].shape
    (2, 5, 7)
    >>> arr.vindex[:, [0], 0, :].shape
    (1, 5, 8)
    >>> arr.vindex[:, [0], :, 0].shape
    (1, 5, 7)


Boolean index results are always inserted where the index is, exactly
as in ``oindex``, given how specific they are to the axes they operate on::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.vindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.vindex[0, :, bindx].shape
    (6, 1)


But other advanced indices are again transposed to the front::

    >>> arr.vindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.vindex[:, [0, 1], bindx].shape
    (2, 5, 1)


Motivational Example
~~~~~~~~~~~~~~~~~~~~

Imagine a data acquisition program that stores ``D`` channels and ``N``
data points along the time axis in an ``(N, D)`` shaped array. During data
analysis, we need to fetch a pool of channels, for example to calculate a
mean over them.

This data can be faked using::

    >>> arr = np.random.random((100, 10))

Now one may remember indexing with an integer array and find the correct code::

    >>> group = arr[:, [2, 5]]
    >>> mean_value = group.mean()

However, assume that there were some specific time points (first dimension
of the data) that need to be specially considered. These time points are
already known and given by::

    >>> interesting_times = np.array([1, 5, 8, 10], dtype=np.intp)

Now to fetch them, we may try to modify the previous code::

    >>> group_at_it = arr[interesting_times, [2, 5]]
    IndexError: Ambiguous index, use `.oindex` or `.vindex`

An error such as this will point the user to the indexing documentation,
which should make it clear that ``oindex`` behaves more like slicing, so
out of the different methods it is the obvious choice.
(For now, this raises a shape mismatch error, but that error could possibly
also mention ``oindex``.)::

    >>> group_at_it = arr.oindex[interesting_times, [2, 5]]

Now of course one could also have used ``vindex``, but it is much less
obvious how to achieve the same result::

    >>> reshaped_times = interesting_times[:, np.newaxis]
    >>> group_at_it = arr.vindex[reshaped_times, [2, 5]]


One may find that, for example, the data is corrupt in some places. So we
need to replace these values by zero (or anything else) for those times.
The first column may, for example, give the necessary information, so that
changing the values is easy using boolean indexing::

    >>> bad_data = arr[:, 0] > 0.5
    >>> arr[bad_data, :] = 0  # (corrupts further examples)

Again, however, the columns may need to be handled more individually (but in
groups), and the ``oindex`` attribute works well::

    >>> arr.oindex[bad_data, [2, 5]] = 0

Note that it would be very hard to do this using legacy fancy indexing.
The only way would be to create an integer array first::

    >>> bad_data_indx = np.nonzero(bad_data)[0]
    >>> bad_data_indx_reshaped = bad_data_indx[:, np.newaxis]
    >>> arr[bad_data_indx_reshaped, [2, 5]] = 0

In any case, we can use ``oindex`` alone to do all of this without getting
into any trouble or getting confused by the whole complexity of advanced indexing.

But now some new features are added to the data acquisition: different
sensors have to be used depending on the time. Let us assume we have already
created an array of indices::

    >>> correct_sensors = np.random.randint(10, size=(100, 2))

This lists, for each time point, the two correct sensors in an ``(N, 2)`` array.

A first try to achieve this may be ``arr[:, correct_sensors]``, which does
not work. It should quickly become clear that slicing cannot achieve the
desired result, but hopefully users will remember that ``vindex`` exists as a
more powerful and flexible approach to advanced indexing.
One may, if trying ``vindex`` blindly, be confused by::

    >>> new_arr = arr.vindex[:, correct_sensors]

which is neither the same nor the correct result (see the transposing
rules), because slicing still works the same way in ``vindex``. However,
reading the documentation and examples, one can hopefully quickly find the
desired solution::

    >>> rows = np.arange(len(arr))
    >>> rows = rows[:, np.newaxis]  # make shape fit with correct_sensors
    >>> new_arr = arr.vindex[rows, correct_sensors]
    
At this point we have left the straightforward world of ``oindex``, but can
do arbitrary picking of any element from the array. Note that in the last
example a method such as the one mentioned in the ``Related Questions``
section could be more straightforward. But this approach is even more
flexible, since ``rows`` does not have to be a simple ``arange``, but could
be ``interesting_times``::

    >>> interesting_times = np.array([0, 4, 8, 9, 10])
    >>> correct_sensors_at_it = correct_sensors[interesting_times, :]
    >>> interesting_times_reshaped = interesting_times[:, np.newaxis]
    >>> new_arr_it = arr[interesting_times_reshaped, correct_sensors_at_it]

A truly complex situation would arise if, for example, one pooled ``L``
experiments into an array shaped ``(L, N, D)``. But for ``oindex`` this
should not result in surprises. ``vindex``, being more powerful, will quite
certainly create some confusion in this case, but will also cover pretty
much all eventualities.


Copyright
---------

This document is placed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication [1]_.


References and Footnotes
------------------------

.. [1] To the extent possible under law, the person who associated CC0 
   with this work has waived all copyright and related or neighboring
   rights to this work. The CC0 license may be found at
.. [2] e.g., see NEP 18,



Re: NEP 21: Simplified and explicit advanced indexing

Eric Wieser

Generally +1 on this, but I don’t think we need

To ensure that existing subclasses of ndarray that override indexing
do not inadvertently revert to default behavior for indexing attributes,
these attribute should have explicit checks that disable them if
__getitem__ or __setitem__ has been overriden.

Repeating my proposal from github, I think we should introduce some internal indexing objects - something simple like:

# np.core.*
class Indexer(object):  # importantly not iterable
    def __init__(self, value):
        self.value = value
class OrthogonalIndexer(Indexer): pass
class VectorizedIndexer(Indexer): pass

Keeping the proposed syntax, we’d implement:

  • arr.oindex[ind] as arr[np.core.OrthogonalIndexer(ind)]
  • arr.vindex[ind] as arr[np.core.VectorizedIndexer(ind)]

This means that subclasses like the following

class LoggingIndexer(np.ndarray):
    def __getitem__(self, ind):
        ret = super().__getitem__(ind)
        print("Got an index")
        return ret

will continue to work without issues. This includes np.ma.MaskedArray and np.memmap, so this already has value internally.

For classes like np.matrix which inspect the index object itself, an error will still be raised from __getitem__, since it looks nothing like the values normally passed - most likely of the form

TypeError: 'numpy.core.VectorizedIndexer' object does not support indexing
TypeError: 'numpy.core.VectorizedIndexer' object is not iterable

This could potentially be caught in oindex.__getitem__ and converted into a more useful error message.
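
For illustration, the forwarding attribute might then look roughly like this (a sketch only - OrthogonalIndexer is the proposed class above, not an existing NumPy object):

class OuterIndexerAttribute(object):
    """Sketch: makes arr.oindex[ind] forward to arr[OrthogonalIndexer(ind)]."""
    def __init__(self, array):
        self.array = array
    def __getitem__(self, ind):
        try:
            return self.array[OrthogonalIndexer(ind)]
        except TypeError as e:
            # Subclasses that inspect the raw index (e.g. np.matrix) will choke
            # on the wrapper; re-raise with a clearer message (chaining is py3-only).
            raise TypeError("this array type does not support .oindex") from e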

So to summarize the benefits of the above tweaks:

  • Pass-through subclasses get the new behavior for free
  • No additional descriptor helpers are needed to let non-passthrough subclasses implement the new indexable attributes - only a change to __getitem__ is needed

And the costs:

  • A less clear error message when new indexing is used on old types (can chain with a more useful exception on python 3)
  • Class construction overhead for indexing via the attributes (skippable for base ndarray if significant)

Eric


On Mon, 25 Jun 2018 at 14:30 Stephan Hoyer <[hidden email]> wrote:
Sebastian and I have revised a Numpy Enhancement Proposal that he started three years ago for overhauling NumPy's advanced indexing. We'd now like to present it for official consideration.

Minor inline comments (e.g., typos) can be added to the latest pull request (https://github.com/numpy/numpy/pull/11414/files), but otherwise let's keep discussion on the mailing list. The NumPy website should update shortly with a rendered version (http://www.numpy.org/neps/nep-0021-advanced-indexing.html), but until then please see the full text below.

Cheers,
Stephan

=========================================
Simplified and explicit advanced indexing
=========================================

:Author: Sebastian Berg
:Author: Stephan Hoyer <[hidden email]>
:Status: Draft
:Type: Standards Track
:Created: 2015-08-27


Abstract
--------

NumPy's "advanced" indexing support for indexing arrays with other arrays is
one of its most powerful and popular features. Unfortunately, the existing
rules for advanced indexing with multiple array indices are typically confusing
to both new, and in many cases even old, users of NumPy. Here we propose an
overhaul and simplification of advanced indexing, including two new "indexer"
attributes ``oindex`` and ``vindex`` to facilitate explicit indexing.

Background
----------

Existing indexing operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NumPy arrays currently support a flexible range of indexing operations:

- "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis
  (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element
  from the 0th axis, the first three elements from the 1st axis and inserting a
  new axis of size 1 at the end. Basic indexing always return a view of the
  indexed array's data.
- "Advanced" indexing, also called "fancy" indexing, includes all cases where
  arrays are indexed by other arrays. Advanced indexing always makes a copy:

  - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for
    selecting positive elements.
  - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]``
    for selecting the first two elements along the first axis. With multiple
    arrays, vectorized indexing uses broadcasting rules to combine indices along
    multiple dimensions. This allows for producing a result of arbitrary shape
    with arbitrary elements from the original arrays.
  - "Mixed" indexing involving any combinations of the other advancing types.
    This is no more powerful than vectorized indexing, but is sometimes more
    convenient.

For clarity, we will refer to these existing rules as "legacy indexing".
This is only a high-level summary; for more details, see NumPy's documentation
and and `Examples` below.

Outer indexing
~~~~~~~~~~~~~~

One broadly useful class of indexing operations is not supported:

- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to
  slices for determining output shapes. The rule for outer indexing is that the
  result should be equivalent to independently indexing along each dimension
  with integer or boolean arrays as if both the indexed and indexing arrays
  were one-dimensional. This form of indexing is familiar to many users of other
  programming languages such as MATLAB, Fortran and R.

The reason why NumPy omits support for outer indexing is that the rules for
outer and vectorized conflict. Consider indexing a 2D array by two 1D integer
arrays, e.g., ``x[[0, 1], [0, 1]]``:

- Outer indexing is equivalent to combining multiple integer indices with
  ``itertools.product()``. The result in this case is another 2D array with
  all combinations of indexed elements, e.g.,
  ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])``
- Vectorized indexing is equivalent to combining multiple integer indices with
  ``zip()``. The result in this case is a 1D array containing the diagonal
  elements, e.g., ``np.array([x[0, 0], x[1, 1]])``.

This difference is a frequent stumbling block for new NumPy users. The outer
indexing model is easier to understand, and is a natural generalization of
slicing rules. But NumPy instead chose to support vectorized indexing, because
it is strictly more powerful.

It is always possible to emulate outer indexing by vectorized indexing with
the right indices. To make this easier, NumPy includes utility objects and
functions such as ``np.ogrid`` and ``np.ix_``, e.g.,
``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating
fully general/mixed outer indexing, which could unambiguously allow for slices,
integers, and 1D boolean and integer arrays.

Mixed indexing
~~~~~~~~~~~~~~

NumPy's existing rules for combining multiple types of indexing in the same
operation are quite complex, involving a number of edge cases.

One reason why mixed indexing is particularly confusing is that at first glance
the result works deceptively like outer indexing. Returning to our example of a
2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with
axes in the same order as the original array.

However, as soon as two or more non-slice objects (including integers) are
introduced, vectorized indexing rules apply. The axes introduced by the array
indices are at the front, unless all array indices are consecutive, in which
case NumPy deduces where the user "expects" them to be. Consider indexing a 3D
array ``arr`` with shape ``(X, Y, Z)``:

1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``.
2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``.
3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``!

These first two cases are intuitive and consistent with outer indexing, but
this last case is quite surprising, even to many higly experienced NumPy users.

Mixed cases involving multiple array indices are also surprising, and only
less problematic because the current behavior is so useless that it is rarely
encountered in practice. When a boolean array index is mixed with another boolean or
integer array, boolean array is converted to integer array indices (equivalent
to ``np.nonzero()``) and then broadcast. For example, indexing a 2D array of
size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector
with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``.

Mixed indexing seems so tricky that it is tempting to say that it never should
be used. However, it is not easy to avoid, because NumPy implicitly adds full
slices if there are fewer indices than the full dimensionality of the indexed
array. This means that indexing a 2D array like `x[[0, 1]]`` is equivalent to
``x[[0, 1], :]``. These cases are not surprising, but they constrain the
behavior of mixed indexing.

Indexing in other Python array libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Indexing is a useful and widely recognized mechanism for accessing
multi-dimensional array data, so it is no surprise that many other libraries in
the scientific Python ecosystem also support array indexing.

Unfortunately, the full complexity of NumPy's indexing rules mean that it is
both challenging and undesirable for other libraries to copy its behavior in all
of its nuance. The only full implementation of NumPy-style indexing is NumPy
itself. This includes projects like dask.array and h5py, which support *most*
types of array indexing in some form, and otherwise attempt to copy NumPy's API
exactly.

Vectorized indexing in particular can be challenging to implement with array
storage backends not based on NumPy. In contrast, indexing by 1D arrays along
at least one dimension in the style of outer indexing is much more acheivable.
This has led many libraries (including dask and h5py) to attempt to define a
safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g.,
by only allowing indexing with an array along at most one dimension. However,
this is quite challenging to do correctly in a general enough way to be useful.
For example, the current versions of dask and h5py both handle mixed indexing
in case 3 above inconsistently with NumPy. This is quite likely to lead to
bugs.

These inconsistencies, in addition to the broader challenge of implementing
every type of indexing logic, make it challenging to write high-level array
libraries like xarray or dask.array that can interchangeably index many types of
array storage. In contrast, explicit APIs for outer and vectorized indexing in
NumPy would provide a model that external libraries could reliably emulate, even
if they don't support every type of indexing.

High level changes
------------------

Inspired by multiple "indexer" attributes for controlling different types
of indexing behavior in pandas, we propose to:

1. Introduce ``arr.oindex[indices]`` which allows array indices, but
   uses outer indexing logic.
2. Introduce ``arr.vindex[indices]`` which use the current
   "vectorized"/broadcasted logic but with two differences from
   legacy indexing:
       
   * Boolean indices are not supported. All indices must be integers,
     integer arrays or slices.
   * The integer index result dimensions are always the first axes
     of the result array. No transpose is done, even for a single
     integer array index.

3. Plain indexing on arrays will start to give warnings and eventually
   errors in cases where one of the explicit indexers should be preferred:

   * First, in all cases where legacy and outer indexing would give
     different results.
   * Later, potentially in all cases involving an integer array.

These constraints are sufficient for making indexing generally consistent
with expectations and providing a less surprising learning curve with
``oindex``.

Note that all things mentioned here apply both for assignment as well as
subscription.

Understanding these details is *not* easy. The `Examples` section in the
discussion gives code examples.
And the hopefully easier `Motivational Example` provides some
motivational use-cases for the general ideas and is likely a good start for
anyone not intimately familiar with advanced indexing.


Detailed Description
--------------------

Proposed rules
~~~~~~~~~~~~~~

From the three problems noted above some expectations for NumPy can
be deduced:

1. There should be a prominent outer/orthogonal indexing method such as
   ``arr.oindex[indices]``.

2. Considering how confusing vectorized/fancy indexing can be, it should
   be possible to be made more explicitly (e.g. ``arr.vindex[indices]``).

3. A new ``arr.vindex[indices]`` method, would not be tied to the
   confusing transpose rules of fancy indexing, which is for example
   needed for the simple case of a single advanced index. Thus,
   no transposing should be done. The axes created by the integer array
   indices are always inserted at the front, even for a single index.

4. Boolean indexing is conceptionally outer indexing. Broadcasting
   together with other advanced indices in the manner of legacy
   indexing is generally not helpful or well defined.
   A user who wishes the "``nonzero``" plus broadcast behaviour can thus
   be expected to do this manually. Thus, ``vindex`` does not need to
   support boolean index arrays.

5. An ``arr.legacy_index`` attribute should be implemented to support
   legacy indexing. This gives a simple way to update existing codebases
   using legacy indexing, which will make the deprecation of plain indexing
   behavior easier. The longer name ``legacy_index`` is intentionally chosen
   to be explicit and discourage its use in new code.

6. Plain indexing ``arr[...]`` should return an error for ambiguous cases.
   For the beginning, this probably means cases where ``arr[ind]`` and
   ``arr.oindex[ind]`` return different results give deprecation warnings.
   This includes every use of vectorized indexing with multiple integer arrays.
   Due to the transposing behaviour, this means that``arr[0, :, index_arr]``
   will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time being.

7. To ensure that existing subclasses of `ndarray` that override indexing
   do not inadvertently revert to default behavior for indexing attributes,
   these attribute should have explicit checks that disable them if
   ``__getitem__`` or ``__setitem__`` has been overriden.

Unlike plain indexing, the new indexing attributes are explicitly aimed
at higher dimensional indexing, several additional changes should be implemented:

* The indexing attributes will enforce exact dimension and indexing match.
  This means that no implicit ellipsis (``...``) will be added. Unless
  an ellipsis is present the indexing expression will thus only work for
  an array with a specific number of dimensions.
  This makes the expression more explicit and safeguards against wrong
  dimensionality of arrays.
  There should be no implications for "duck typing" compatibility with
  builtin Python sequences, because Python sequences only support a limited
  form of "basic indexing" with integers and slices.

* The current plain indexing allows for the use of non-tuples for
  multi-dimensional indexing such as ``arr[[slice(None), 2]]``.
  This creates some inconsistencies and thus the indexing attributes
  should only allow plain python tuples for this purpose.
  (Whether or not this should be the case for plain indexing is a
  different issue.)

* The new attributes should not use getitem to implement setitem,
  since it is a cludge and not useful for vectorized
  indexing. (not implemented yet)


Open Questions
~~~~~~~~~~~~~~

* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at
  the time of writing this, another name NumPy has used for something like
  ``oindex`` is ``np.ix_``. See also below.

* ``oindex`` and ``vindex`` could always return copies, even when no array
  operation occurs. One argument for allowing a view return is that this way
  ``oindex`` can be used as a general index replacement.
  However, there is one argument for returning copies. It is possible for
  ``arr.vindex[array_scalar, ...]``, where ``array_scalar`` should be
  a 0-D array but is not, since 0-D arrays tend to be converted.
  Copying always "fixes" this possible inconsistency.

* The final state to morph plain indexing in is not fixed in this PEP.
  It is for example possible that `arr[index]`` will be equivalent to
  ``arr.oindex`` at some point in the future.
  Since such a change will take years, it seems unnecessary to make
  specific decisions at this time.

* The proposed changes to plain indexing could be postponed indefinitely or
  not taken in order to not break or force major fixes to existing code bases.


Alternative Names
~~~~~~~~~~~~~~~~~

Possible names suggested (more suggestions will be added).

==============  ============ ========
**Orthogonal**  oindex       oix
**Vectorized**  vindex       vix
**Legacy**      legacy_index l/findex
==============  ============ ========


Subclasses
~~~~~~~~~~

Subclasses are a bit problematic in the light of these changes. There are
some possible solutions for this. For most subclasses (those which do not
provide ``__getitem__`` or ``__setitem__``) the special attributes should
just work. Subclasses that *do* provide it must be updated accordingly
and should preferably not subclass working versions of these attributes.

All subclasses will inherit the attributes, however, the implementation
of ``__getitem__`` on these attributes should test
``subclass.__getitem__ is ndarray.__getitem__``. If not, the
subclass has special handling for indexing and ``NotImplementedError``
should be raised, requiring that the indexing attributes is also explicitly
overwritten. Likewise, implementations of ``__setitem__`` should check to see
if ``__setitem__`` is overriden.

A further question is how to facilitate implementing the special attributes.
Also there is the weird functionality where ``__setitem__`` calls
``__getitem__`` for non-advanced indices. It might be good to avoid it for
the new attributes, but on the other hand, that may make it even more
confusing.

To facilitate implementations we could provide functions similar to
``operator.itemgetter`` and ``operator.setitem`` for the attributes.
Possibly a mixin could be provided to help implementation. These improvements
are not essential to the initial implementation, so they are saved for
future work.

Implementation
--------------

Implementation would start with writing special indexing objects available
through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these
indexing operations. Also, we would need to start to deprecate those plain index
operations which are not ambiguous.
Furthermore, the NumPy code base will need to use the new attributes and
tests will have to be adapted.


Backward compatibility
----------------------

As a new feature, no backward compatibility issues with the new ``vindex``
and ``oindex`` attributes would arise. To facilitate backwards compatibility
as much as possible, we expect a long deprecation cycle for legacy indexing
behavior and propose the new ``legacy_index`` attribute.
Some forward compatibility issues with subclasses that do not specifically
implement the new methods may arise.


Alternatives
------------

NumPy may not choose to offer these different type of indexing methods, or
choose to only offer them through specific functions instead of the proposed
notation above.

We don't think that new functions are a good alternative, because indexing
notation ``[]`` offer some syntactic advantages in Python (i.e., direct
creation of slice objects) compared to functions.

A more reasonable alternative would be write new wrapper objects for alternative
indexing with functions rather than methods (e.g., ``np.oindex(arr)[indices]``
instead of ``arr.oindex[indices]``). Functionally, this would be equivalent,
but indexing is such a common operation that we think it is important to
minimize syntax and worth implementing it directly on `ndarray` objects
themselves. Indexing attributes also define a clear interface that is easier
for alternative array implementations to copy, nonwithstanding ongoing
efforts to make it easier to override NumPy functions [2]_.

Discussion
----------

The original discussion about vectorized vs outer/orthogonal indexing arose
on the NumPy mailing list:


Some discussion can be found on the original pull request for this NEP:


Python implementations of the indexing operations can be found at:



Examples
~~~~~~~~

Since the various kinds of indexing is hard to grasp in many cases, these
examples hopefully give some more insights. Note that they are all in terms
of shape.
In the examples, all original dimensions have 5 or more elements,
advanced indexing inserts smaller dimensions.
These examples may be hard to grasp without working knowledge of advanced
indexing as of NumPy 1.9.

Example array::

    >>> arr = np.ones((5, 6, 7, 8))


Legacy fancy indexing
---------------------

Note that the same result can be achieved with ``arr.legacy_index``, but the
"future error" will still work in this case.

Single index is transposed (this is the same for all indexing types)::

    >>> arr[[0], ...].shape
    (1, 6, 7, 8)
    >>> arr[:, [0], ...].shape
    (5, 1, 7, 8)


Multiple indices are transposed *if* consecutive::

    >>> arr[:, [0], [0], :].shape  # future error
    (5, 1, 8)
    >>> arr[:, [0], :, [0]].shape  # future error
    (1, 5, 7)


It is important to note that a scalar *is* integer array index in this sense
(and gets broadcasted with the other advanced index)::

    >>> arr[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
    (1, 5, 7)


Single boolean index can act on multiple dimensions (especially the whole
array). It has to match (as of 1.10. a deprecation warning) the dimensions.
The boolean index is otherwise identical to (multiple consecutive) integer
array indices::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr[:, 0, bindx].shape
    (5, 1)
    >>> arr[0, :, bindx].shape
    (1, 6)


The combination with anything that is not a scalar is confusing, e.g.::

    >>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
    (1, 6)
    >>> arr[:, [0, 1], bindx].shape  # IndexError


Outer indexing
--------------

Multiple indices are "orthogonal" and their result axes are inserted 
at the same place (they are not broadcasted)::

    >>> arr.oindex[:, [0], [0, 1], :].shape
    (5, 1, 2, 8)
    >>> arr.oindex[:, [0], :, [0, 1]].shape
    (5, 1, 7, 2)
    >>> arr.oindex[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr.oindex[:, [0], :, 0].shape
    (5, 1, 7)


Boolean indices results are always inserted where the index is::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.oindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.oindex[0, :, bindx].shape
    (6, 1)


Nothing changed in the presence of other advanced indices since::

    >>> arr.oindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.oindex[:, [0, 1], bindx].shape
    (5, 2, 1)


Vectorized/inner indexing
-------------------------

Multiple indices are broadcasted and iterated as one like fancy indexing,
but the new axes area always inserted at the front::

    >>> arr.vindex[:, [0], [0, 1], :].shape
    (2, 5, 8)
    >>> arr.vindex[:, [0], :, [0, 1]].shape
    (2, 5, 7)
    >>> arr.vindex[:, [0], 0, :].shape
    (1, 5, 8)
    >>> arr.vindex[:, [0], :, 0].shape
    (1, 5, 7)


Boolean indices results are always inserted where the index is, exactly
as in ``oindex`` given how specific they are to the axes they operate on::

    >>> # Create boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.vindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.vindex[0, :, bindx].shape
    (6, 1)


But other advanced indices are again transposed to the front::

    >>> arr.vindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.vindex[:, [0, 1], bindx].shape
    (2, 5, 1)


Motivational Example
~~~~~~~~~~~~~~~~~~~~

Imagine having a data acquisition software storing ``D`` channels and
``N`` datapoints along the time. She stores this into an ``(N, D)`` shaped
array. During data analysis, we needs to fetch a pool of channels, for example
to calculate a mean over them.

This data can be faked using::

    >>> arr = np.random.random((100, 10))

Now one may remember indexing with an integer array and find the correct code::

    >>> group = arr[:, [2, 5]]
    >>> mean_value = arr.mean()

However, assume that there were some specific time points (first dimension
of the data) that need to be specially considered. These time points are
already known and given by::

    >>> interesting_times = np.array([1, 5, 8, 10], dtype=np.intp)

Now to fetch them, we may try to modify the previous code::

    >>> group_at_it = arr[interesting_times, [2, 5]]
    IndexError: Ambiguous index, use `.oindex` or `.vindex`

An error such as this will point to read up the indexing documentation.
This should make it clear, that ``oindex`` behaves more like slicing.
So, out of the different methods it is the obvious choice
(for now, this is a shape mismatch, but that could possibly also mention
``oindex``)::

    >>> group_at_it = arr.oindex[interesting_times, [2, 5]]

Now of course one could also have used ``vindex``, but it is much less
obvious how to achieve the right thing!::

    >>> reshaped_times = interesting_times[:, np.newaxis]
    >>> group_at_it = arr.vindex[reshaped_times, [2, 5]]


One may find, that for example our data is corrupt in some places.
So, we need to replace these values by zero (or anything else) for these
times. The first column may for example give the necessary information,
so that changing the values becomes easy remembering boolean indexing::

    >>> bad_data = arr[:, 0] > 0.5
    >>> arr[bad_data, :] = 0  # (corrupts further examples)

Again, however, the columns may need to be handled more individually (but in
groups), and the ``oindex`` attribute works well::

    >>> arr.oindex[bad_data, [2, 5]] = 0

Note that it would be very hard to do this using legacy fancy indexing.
The only way would be to create an integer array first::

    >>> bad_data_indx = np.nonzero(bad_data)[0]
    >>> bad_data_indx_reshaped = bad_data_indx[:, np.newaxis]
    >>> arr[bad_data_indx_reshaped, [2, 5]]

In any case we can use only ``oindex`` to do all of this without getting
into any trouble or confused by the whole complexity of advanced indexing.

But, some new features are added to the data acquisition. Different sensors
have to be used depending on the times. Let us assume we already have
created an array of indices::

    >>> correct_sensors = np.random.randint(10, size=(100, 2))

Which lists for each time the two correct sensors in an ``(N, 2)`` array.

A first try to achieve this may be ``arr[:, correct_sensors]`` and this does
not work. It should be clear quickly that slicing cannot achieve the desired
thing. But hopefully users will remember that there is ``vindex`` as a more
powerful and flexible approach to advanced indexing.
One may, if trying ``vindex`` randomly, be confused about::

    >>> new_arr = arr.vindex[:, correct_sensors]

which is neither the same, nor the correct result (see transposing rules)!
This is because slicing works still the same in ``vindex``. However, reading
the documentation and examples, one can hopefully quickly find the desired
solution::

    >>> rows = np.arange(len(arr))
    >>> rows = rows[:, np.newaxis]  # make shape fit with correct_sensors
    >>> new_arr = arr.vindex[rows, correct_sensors]
    
At this point we have left the straight forward world of ``oindex`` but can
do random picking of any element from the array. Note that in the last example
a method such as mentioned in the ``Related Questions`` section could be more
straight forward. But this approach is even more flexible, since ``rows``
does not have to be a simple ``arange``, but could be ``intersting_times``::

    >>> interesting_times = np.array([0, 4, 8, 9, 10])
    >>> correct_sensors_at_it = correct_sensors[interesting_times, :]
    >>> interesting_times_reshaped = interesting_times[:, np.newaxis]
    >>> new_arr_it = arr[interesting_times_reshaped, correct_sensors_at_it]

Truly complex situation would arise now if you would for example pool ``L``
experiments into an array shaped ``(L, N, D)``. But for ``oindex`` this should
not result into surprises. ``vindex``, being more powerful, will quite
certainly create some confusion in this case but also cover pretty much all
eventualities.


Copyright
---------

This document is placed under the CC0 1.0 Universell (CC0 1.0) Public Domain Dedication [1]_.


References and Footnotes
------------------------

.. [1] To the extent possible under law, the person who associated CC0 
   with this work has waived all copyright and related or neighboring
   rights to this work. The CC0 license may be found at
   https://creativecommons.org/publicdomain/zero/1.0/
.. [2] e.g., see NEP 18,
   http://www.numpy.org/neps/nep-0018-array-function-protocol.html


Re: NEP 21: Simplified and explicit advanced indexing

Juan Nunez-Iglesias
> Plain indexing arr[...] should return an error for ambiguous cases. [...] This includes every use of vectorized indexing with multiple integer arrays.

This line concerns me. In scikit-image, we often do:

rr, cc = coords.T  # coords is an (n, 2) array of integer coordinates
values = image[rr, cc]

Are you saying that this use is deprecated? Because we love it at scikit-image. I would be very very very sad to lose this syntax.
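
For concreteness, a minimal runnable version of that pattern with made-up data (the coordinates here are hypothetical) looks like:

import numpy as np
image = np.arange(25).reshape(5, 5)          # toy "image"
coords = np.array([[0, 1], [2, 3], [4, 4]])  # (n, 2) integer coordinates
rr, cc = coords.T
values = image[rr, cc]                       # image[0, 1], image[2, 3], image[4, 4]
print(values)                                # [ 1 13 24]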

> The current plain indexing allows for the use of non-tuples for multi-dimensional indexing.

I believe this paragraph is itself deprecated? Didn't non-tuple indexing just get deprecated with 1.15?


Other general comments:
- oindex in general seems very intuitive and I'm :+1:
- I would much prefer some extremely compact notation such as arr.ox[] and arr.vx[].
- Depending on the above concern I am either -1 or (-1/0) on the deprecation. Deprecating (all) old vindex behaviour doesn't seem to bring many benefits while potentially causing a lot of pain to downstream libraries.

Juan.


Re: NEP 21: Simplified and explicit advanced indexing

Andrew Nelson-6
On Tue, 26 Jun 2018 at 16:24, Juan Nunez-Iglesias <[hidden email]> wrote:
> Plain indexing arr[...] should return an error for ambiguous cases. [...] This includes every use of vectorized indexing with multiple integer arrays.

This line concerns me. In scikit-image, we often do:

rr, cc = coords.T  # coords is an (n, 2) array of integer coordinates
values = image[rr, cc]

Are you saying that this use is deprecated? Because we love it at scikit-image. I would be very very very sad to lose this syntax.

 I second Juan's sentiments wholeheartedly here.


Re: NEP 21: Simplified and explicit advanced indexing

Robert Kern-2
On Mon, Jun 25, 2018 at 11:29 PM Andrew Nelson <[hidden email]> wrote:
On Tue, 26 Jun 2018 at 16:24, Juan Nunez-Iglesias <[hidden email]> wrote:
> Plain indexing arr[...] should return an error for ambiguous cases. [...] This includes every use of vectorized indexing with multiple integer arrays.

This line concerns me. In scikit-image, we often do:

rr, cc = coords.T  # coords is an (n, 2) array of integer coordinates
values = image[rr, cc]

Are you saying that this use is deprecated? Because we love it at scikit-image. I would be very very very sad to lose this syntax.

 I second Juan's sentiments wholeheartedly here.

And thirded. This should not be considered deprecated or discouraged. As I mentioned in the previous iteration of this discussion, this is the behavior I want more often than the orthogonal indexing. It's a really common way to work with images and other kinds of raster data, so I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index`. It should not issue warnings or (eventual) errors. I would reserve warnings for the cases where the current behavior is something no one really wants, like mixing slices and integer arrays.
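
As a concrete sketch of that trap (with made-up shapes):

import numpy as np
arr = np.zeros((5, 6, 7))
# One might expect (6, 2) here, but because the two integer indices are
# separated by a slice, current NumPy moves the broadcast dimension to
# the front:
print(arr[0, :, [0, 1]].shape)   # (2, 6)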

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: NEP 21: Simplified and explicit advanced indexing

Eric Wieser
I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index`

The way I read it, the new spelling of that would be the explicit but not discouraged `image.vindex[rr, cc]`.

I would reserve warnings for the cases where the current behavior is something no one really wants, like mixing slices and integer arrays. 

These are the cases that would only be available under `legacy_index`.

Eric


Re: NEP 21: Simplified and explicit advanced indexing

Andrew Nelson-6
On Tue, 26 Jun 2018 at 17:12, Eric Wieser <[hidden email]> wrote:
I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index`

The way I read it, the new spelling of that would be the explicit but not discouraged `image.vindex[rr, cc]`.

If I'm understanding correctly, what can be achieved now by `arr[rr, cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is a very large change in behaviour. I suspect that there are a lot of situations out there which use `arr[idxs]` where `idxs` can mean one of a range of things depending on the code path followed. If any of those change, or a mix of nomenclatures are required to access the different cases, then havoc will probably ensue.



Re: NEP 21: Simplified and explicit advanced indexing

Robert Kern-2
In reply to this post by Eric Wieser
On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser <[hidden email]> wrote:
I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index`

The way I read it, the new spelling of that would be the explicit but not discouraged `image.vindex[rr, cc]`.

Okay, I missed that the first time through. I think having more self-contained descriptions of the semantics of each of these would be a good idea. The current description of `.vindex` spends more time talking about what it doesn't do, compared to the other methods, than what it does.

Some more typical, less-exotic examples would be a good idea.

I would reserve warnings for the cases where the current behavior is something no one really wants, like mixing slices and integer arrays. 

These are the cases that would only be available under `legacy_index`.

I'm still leaning towards not warning on current, unproblematic common uses. It's unnecessary churn for currently working, understandable code. I would still reserve warnings and deprecation for the cases where the current behavior gives us something that no one wants. Those are the real traps that people need to be warned away from.

If someone is mixing slices and integer indices, that's a really good sign that they thought indexing behaved in a different way (e.g. orthogonal indexing).

If someone is just using multiple index arrays that would currently not give an error, that's actually a really good sign that they are using it correctly and are getting the semantics that they desired. If they wanted orthogonal indexing, it is *really* likely that their index arrays would *not* broadcast together. And even if they did, the wrong shape of the result is one of the more easily noticed things. These are not silent errors that would motivate adding a new warning.

--
Robert Kern


Re: NEP 21: Simplified and explicit advanced indexing

Robert Kern-2

Of course, I would definitely support adding more information to the various IndexError messages to point people to `.oindex` and `.vindex`. I think that would guide more people to correct their code than adding a new warning to code that currently executes (which is likely not erroneous), and it would cause no churn.
 
--
Robert Kern


Re: NEP 21: Simplified and explicit advanced indexing

Sebastian Berg
In reply to this post by Andrew Nelson-6
On Tue, 2018-06-26 at 17:30 +1000, Andrew Nelson wrote:

> On Tue, 26 Jun 2018 at 17:12, Eric Wieser <[hidden email]
> m> wrote:
> > > I don't think it should be relegated to the "officially
> > discouraged" ghetto of `.legacy_index`
> >
> > The way I read it, the new spelling lof that would be the explicit
> > but not discouraged `image.vindex[rr, cc]`.
> >
>
> If I'm understanding correctly what can be achieved now by `arr[rr,
> cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is
> a very large change in behaviour. I suspect that there a lot of
> situations out there which use `arr[idxs]` where `idxs` can mean one
> of a range of things depending on the code path followed. If any of
> those change, or a mix of nomenclatures are required to access the
> different cases, then havoc will probably ensue.
Yes, that is true, but I doubt you will find a lot of code paths that
need the current indexing as opposed to vindex here, and the idea was
to have a method to get the old behaviour indefinitely. You will need
to add the `.vindex`, but that should be the only code change needed,
and it would be easy to find where with errors/warnings.
I see a possible problem with code that has to work on different numpy
versions, but only in the sense that we may need to delay deprecations.

The only case I could imagine where this might happen is if you
forward someone else's indexing objects and different users are used to
different results.
Otherwise, there is mostly one case which would get annoying, and that
is `arr[:, rr, cc]` since `arr.vindex[:, rr, cc]` would not be exactly
the same. Because, yes, in some cases the current logic is convenient,
just incredibly surprising as well.
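
A sketch of that difference (assuming the proposed rule that `vindex`
always puts the broadcast index dimensions first):

import numpy as np
arr = np.zeros((4, 5, 6))
rr, cc = np.array([0, 1, 2]), np.array([3, 4, 5])
arr[:, rr, cc].shape           # (4, 3) today: adjacent index arrays keep the sliced axis in front
# arr.vindex[:, rr, cc].shape  # would be (3, 4) under the proposal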

- Sebastian


Re: NEP 21: Simplified and explicit advanced indexing

einstein.edison
In reply to this post by Robert Kern-2
I second this design. If we were to consider the general case of a tuple `idx`, then we’d not be moving forward at all. Design changes would be impossible. I’d argue that this newer model would be easier for library maintainers overall (who are the kind of people using this), reducing maintenance cost in the long run because it’d lead to simpler code.

I would also say that the “internal” classes expressing outer as vectorised indexing etc. should be exposed, for maintainers of duck arrays to use. God knows how many utility functions I’ve had to write to avoid relying on undocumented NumPy internals for pydata/sparse, fearing that I’d have to rewrite/modify them when behaviour changes or I find other corner cases.

Best Regards,
Hameer Abbasi
Sent from Astro for Mac


Re: NEP 21: Simplified and explicit advanced indexing

Robert Kern-2
In reply to this post by Sebastian Berg
On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg <[hidden email]> wrote:
On Tue, 2018-06-26 at 17:30 +1000, Andrew Nelson wrote:
> On Tue, 26 Jun 2018 at 17:12, Eric Wieser <[hidden email]
> m> wrote:
> > > I don't think it should be relegated to the "officially
> > discouraged" ghetto of `.legacy_index`
> >
> > The way I read it, the new spelling lof that would be the explicit
> > but not discouraged `image.vindex[rr, cc]`.
> >
>
> If I'm understanding correctly what can be achieved now by `arr[rr,
> cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is
> a very large change in behaviour. I suspect that there a lot of
> situations out there which use `arr[idxs]` where `idxs` can mean one
> of a range of things depending on the code path followed. If any of
> those change, or a mix of nomenclatures are required to access the
> different cases, then havoc will probably ensue.

Yes, that is true, but I doubt you will find a lot of code path that
need the current indexing as opposed to vindex here,

That's probably true! But I think it's beside the point. I'd wager that most code paths that will use .vindex would work perfectly well with current indexing, too. Most of the time, people aren't getting into the hairy corners of advanced indexing.

Adding to the toolbox is great, but I don't see a good reason to take out the ones that are commonly used quite safely.
 
and the idea was
to have a method to get the old behaviour indefinitely. You will need
to add the `.vindex`, but that should be the only code change needed,
and it would be easy to find where with errors/warnings.

It's not necessarily hard; it's just churn for no benefit to the downstream code. They didn't get a new feature; they just have to run faster to stay in the same place.

--
Robert Kern


Re: NEP 21: Simplified and explicit advanced indexing

einstein.edison
In reply to this post by Stephan Hoyer-2
> Boolean indices are not supported. All indices must be integers, integer arrays or slices.

I would hope that there’s at least some way to do boolean indexing. I often find myself needing it. I realise that `arr.vindex[np.nonzero(boolean_idx)]` works, but it is slightly too verbose for my liking. Maybe we can have `arr.bindex[boolean_index]` as an alias to exactly that?
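
For reference, the boolean/nonzero equivalence this relies on is already how plain indexing behaves today:

import numpy as np
arr = np.arange(12).reshape(3, 4)
boolean_idx = arr % 2 == 0
np.array_equal(arr[boolean_idx], arr[np.nonzero(boolean_idx)])   # True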

Or is boolean indexing preserved as-is in the newest proposal? If so, great!

Another thing I’d say is `arr.?index` should be replaced with `arr.?idx`. I personally prefer `arr.?x` for my fingers but I realise that for someone not super into NumPy indexing, this is kind of opaque to read, so I propose this less verbose but hopefully equally clear version, for my (and others’) brains.

Best Regards,
Hameer Abbasi
Sent from Astro for Mac


Re: NEP 21: Simplified and explicit advanced indexing

teoliphant
In reply to this post by einstein.edison
I like the proposal generally.  NumPy could use a good orthogonal indexing method and a vectorized-indexing method is fine too. 

Robert Kern is spot on with his concerns as well.  Please do not change what arr[idx] does except to provide warnings and perhaps point people to new .oix and .vix methods.  What indexing does is documented (if hard to understand and surprising in a particular sub-case). 

There is one specific place in the code where I would make a change to raise an error rather than change the order of the axes of the output to provide a consistent subspace.  Even then, it should be done as a deprecation warning and then raise the error. 

Otherwise, just add the new methods and don't make any other changes until a major release.  

-Travis

 


Re: NEP 21: Simplified and explicit advanced indexing

Eric Wieser
In reply to this post by einstein.edison
Another thing I’d say is arr.?index should be replaced with arr.?idx.

Or perhaps arr.o_[] and arr.v_[], to match the style of our existing
np.r_, np.c_, np.s_, etc?
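
For anyone who hasn't met those helpers, a quick reminder of the existing
style being referenced:

import numpy as np
np.r_[1:4, 7]           # array([1, 2, 3, 7]) -- concatenate along the first axis
np.s_[1:4]              # slice(1, 4, None)   -- a reusable slice object
np.c_[[1, 2], [3, 4]]   # array([[1, 3], [2, 4]]) -- stack as columns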

Re: NEP 21: Simplified and explicit advanced indexing

Robert Kern-2
In reply to this post by teoliphant
On Tue, Jun 26, 2018 at 1:26 AM Travis Oliphant <[hidden email]> wrote:
I like the proposal generally.  NumPy could use a good orthogonal indexing method and a vectorized-indexing method is fine too. 

Robert Kern is spot on with his concerns as well.  Please do not change what arr[idx] does except to provide warnings and perhaps point people to new .oix and .vix methods.  What indexing does is documented (if hard to understand and surprising in a particular sub-case). 

There is one specific place in the code where I would make a change to raise an error rather than change the order of the axes of the output to provide a consistent subspace.  Even then, it should be done as a deprecation warning and then raise the error. 

Otherwise, just add the new methods and don't make any other changes until a major release.  

I'd suggest that the NEP explicitly disclaim deprecating current behavior. Let the NEP just be about putting the new features out there. Once we have some experience with them for a year or three, then let's talk about deprecating parts of the current behavior and make a new NEP then if we want to go that route. We're only contemplating *long* deprecation cycles anyways; we're not in a race. The success of these new features doesn't really rely on the deprecation of current indexing, so let's separate those issues.
 
--
Robert Kern


Re: NEP 21: Simplified and explicit advanced indexing

einstein.edison
In reply to this post by Eric Wieser
I actually had to think a lot, read docs, use SO and so on to realise what those meant the first time around; I didn’t understand them on sight.

And I had to keep coming back to the docs from time to time as I wasn’t exactly using them too much (for exactly this reason, when some problems could be solved more simply by doing just that).

I’d prefer something that sticks in your head and “underscore” for “indexing” didn't do that for me.

Of course, this was my experience as a first-timer. I’d prefer not to up the learning curve for others in the same situation.

An experienced user might disagree. :-)

Best Regards,
Hameer Abbasi
Sent from Astro for Mac


Re: NEP 21: Simplified and explicit advanced indexing

Sebastian Berg
In reply to this post by Robert Kern-2
On Tue, 2018-06-26 at 01:21 -0700, Robert Kern wrote:
> On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg
> <[hidden email]> wrote:

<snip>

> >
> > Yes, that is true, but I doubt you will find a lot of code path
> > that
> > need the current indexing as opposed to vindex here,
>
> That's probably true! But I think it's besides the point. I'd wager
> that most code paths that will use .vindex would work perfectly well
> with current indexing, too. Most of the time, people aren't getting
> into the hairy corners of advanced indexing.
>
Right, the proposal was to have DeprecationWarnings when they differ;
I also thought DeprecationWarnings on two advanced indexes in
general would be good, because it is good for new users.
I have to agree with your argument that most of the confused users should be
running into broadcast errors (if they expect oindex vs. fancy). So I
see this as a point where we should likely limit ourselves, at least
for now, to the cases where, for example, sudden transposing is going on.

However, I would like to point out that the reason for the broader
warnings is that they could allow changing normal indexing at some point.
They also decrease the traps with array-likes that behave differently.


> Adding to the toolbox is great, but I don't see a good reason to take
> out the ones that are commonly used quite safely.
>  
> > and the idea was
> > to have a method to get the old behaviour indefinitely. You will
> > need
> > to add the `.vindex`, but that should be the only code change
> > needed,
> > and it would be easy to find where with errors/warnings.
>
> It's not necessarily hard; it's just churn for no benefit to the
> downstream code. They didn't get a new feature; they just have to run
> faster to stay in the same place.
>
So, yes, it is annoying for quite a few projects that correctly use
fancy indexing, but if we choose not to annoy you a little, we will
have far fewer long-term options, which also affects such projects'
compatibility with new/current array-likes.
So basically one point is: if we annoy scikit-image now, their code
will hopefully work better for dask arrays in the future.

- Sebastian



Re: NEP 21: Simplified and explicit advanced indexing

Sebastian Berg
In reply to this post by einstein.edison
On Tue, 2018-06-26 at 04:23 -0400, Hameer Abbasi wrote:
> > Boolean indices are not supported. All indices must be integers,
> integer arrays or slices.
>
> I would hope that there’s at least some way to do boolean indexing. I
> often find myself needing it. I realise that
> `arr.vindex[np.nonzero(boolean_idx)]` works, but it is slightly too
> verbose for my liking. Maybe we can have `arr.bindex[boolean_index]`
> as an alias to exactly that?
>

That part is limited to `vindex` only. A single boolean index would
always work in plain indexing and you can mix it all up inside of
`oindex`. But with fancy indexing, mixing boolean + integer seems
currently pretty much useless (and thus the same is true for `vindex`;
in `oindex` things make sense).
Now you could invent some new logic for such a mixing case in `vindex`,
but it seems easier to just ignore it for the moment.

- Sebastian



Re: NEP 21: Simplified and explicit advanced indexing

einstein.edison
In reply to this post by Robert Kern-2
I would disagree here. For libraries like Dask, XArray, pydata/sparse, XND, etc., it would be bad for them if there was continued use of “weird” indexing behaviour (no warnings means more code written that’s… well… not exactly the best design). Of course, we could just choose to not support it. But that means a lot of code won’t support us, or support us later than we desire.

I agree with your design of “let’s limit the number of warnings/deprecations to cases that make very little sense” but there should be warnings.

Specifically, I recommend warnings for mixed slices and fancy indexes, and warnings followed by errors for cases where the transposing behaviour occurs.

Best Regards,
Hameer Abbasi
Sent from Astro for Mac
