Numpy Enhancement Proposal: group_by functionality

Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn


Hi all,

Please critique my draft exploring the possibilities of adding group_by support to numpy:

In nearly every project I work on, I require group_by functionality of some sort. There are other libraries that provide this kind of functionality, such as pandas, but I will try to make the case here that numpy ought to have a solid core of group_by functionality. Primarily, one may argue that the concept of grouping values by a key is far more general than a pandas dataframe. In particular, one often needs a simple one-line transient association between some keys and values, and wrangling the problem into the more permanent and specialized data structure of a dataframe is simply not called for.

As a simple compact example:
key1 = list('abaabb')
key2 = np.random.randint(0,2,(6,2))
values = np.random.rand(6,3)
print group_by((key1, key2)).median(values)
Points of note: we can group by arbitrary combinations of keys, and subarrays can also act as keys. group_by has a rich set of reduction functionality, which performs efficient per-group reductions, as well as various ways to split your values per group.
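To make "per-group reduction" concrete, here is a minimal sketch using only existing numpy calls. It handles a single 1-D key and a mean rather than the compound keys and median of the example above, and the name group_mean is hypothetical; it is not the draft's implementation.

```python
import numpy as np

def group_mean(keys, values):
    """Hypothetical helper: per-group mean for a single 1-D key array."""
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # inverse[i] is the position of keys[i] within unique_keys
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    sums = np.zeros((len(unique_keys),) + values.shape[1:])
    # unbuffered scatter-add: each row of values is added to its group's row
    np.add.at(sums, inverse, values)
    counts = np.bincount(inverse, minlength=len(unique_keys))
    # reshape counts so the division broadcasts over trailing value axes
    return unique_keys, sums / counts.reshape((-1,) + (1,) * (values.ndim - 1))

keys = np.array(list('abaabb'))
values = np.arange(18, dtype=float).reshape(6, 3)
unique_keys, means = group_mean(keys, values)
print(unique_keys)
print(means)
```

The np.unique call does the "indexing" of the keys; everything after it is a cheap scatter and divide.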
 
Also, the code here has a lot of overlap with np.unique and the related arraysetops. Functions like np.unique are easily reimplemented using the groundwork laid out here, and may also be extended to benefit from the generalizations made, allowing a wider variety of objects to have their unique values taken. Note the axis keyword in the example below: what is unique here are the images found along the first axis, not the elements of shuffled.
# create a stack of images
images = np.random.rand(4,64,64)
# shuffle the images; this is a giant mess now; how to find all the original ones?
shuffled = images[np.random.randint(0,4,200)]
# there you go
print unique(shuffled, axis=0)
Some more examples and unit tests can be found at the end of the module.
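A runnable sketch of the same idea, assuming a NumPy version whose np.unique accepts the axis keyword (1.13 or later). The smaller image size and the seeded generator are choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# a small stack of distinct "images" (8x8 instead of 64x64, to keep it light)
images = rng.random((4, 8, 8))
# draw 200 images with replacement, so each original appears many times
shuffled = images[rng.integers(0, 4, 200)]
# uniqueness is taken over whole slices along axis 0, not over scalar elements
recovered = np.unique(shuffled, axis=0)
print(recovered.shape)
```

Note that the recovered images come back in sorted order, not in the order of the original stack.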
 
I'd love to hear your feedback on this. Specifically:
  • Do you agree numpy would benefit from group_by functionality?
  • Do you have suggestions for further generalizations/extensions?
  • Any commentary on design decisions / implementation? 
Regards,
Eelco Hoogendoorn

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: Numpy Enhancement Proposal: group_by functionality

Stéfan van der Walt
Hi Eelco

On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote:
> key1 = list('abaabb')
> key2 = np.random.randint(0,2,(6,2))
> values = np.random.rand(6,3)
> print group_by((key1, key2)).median(values)

I agree that group_by functionality could be handy in numpy.
In the above example, what would the output of

``group_by((key1, key2))``

be?

Stéfan

Re: Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn
An object of type GroupBy.
 
So a call to group_by does not return any consumable output directly. If you want, for instance, the unique keys (or groups, if you will), you can call GroupBy.unique. In this case, for a tuple of input keys, you'd get a tuple of unique keys back. If you want to compute several reductions over the same set of keys, you can hang on to the GroupBy object and the precomputations it encapsulates.
 
To expand on that example: reduction operations also return the unique keys which the reduced elements belong to:
 
(unique1, unique2), median = group_by((key1, key2)).median(values)
print unique1
print unique2
print median
 
yields something like:
  
['a' 'a' 'b' 'b' 'a']
[[0 0]
 [0 1]
 [0 1]
 [1 0]
 [1 1]]
[[ 0.34041782  0.78579254  0.91494441]
 [ 0.59422888  0.67915262  0.04327812]
 [ 0.45045529  0.45049761  0.49633574]
 [ 0.71623235  0.95760152  0.85137696]
 [ 0.96299801  0.27639574  0.70519413]]

Note that the elements of unique1 and unique2 are not themselves unique, but rather their elements zipped together are unique.
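The "zipped together" behavior can be reproduced with plain numpy. The sketch below is one possible way to do it, not the draft's implementation: it assumes a NumPy whose np.unique accepts axis, uses fixed keys instead of random ones, and falls back to a Python loop for the median.

```python
import numpy as np

key1 = np.array(list('abaabb'))
key2 = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [0, 1], [0, 1]])
values = np.arange(18, dtype=float).reshape(6, 3)

# label each key separately, then fuse the labels into one integer per row
_, inv1 = np.unique(key1, return_inverse=True)
uniq2, inv2 = np.unique(key2, axis=0, return_inverse=True)
combined = inv1.ravel() * len(uniq2) + inv2.ravel()

# one representative row index per distinct (key1, key2) pair
_, first, group = np.unique(combined, return_index=True, return_inverse=True)
unique1, unique2 = key1[first], key2[first]

# per-group median over the rows that share a compound key
median = np.array([np.median(values[group == g], axis=0)
                   for g in range(len(first))])
print(unique1)
print(unique2)
print(median)
```

Here unique1 contains 'a' twice, because 'a' occurs with two different key2 rows; only the pairs are distinct.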



Re: Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn
In reply to this post by Stéfan van der Walt
To follow up with an example as to why it is useful that a temporary object is created, consider the following (taken from the radial reduction example):

    g = group_by(np.round(radius, 5).flatten())
    pp.errorbar(
        g.unique,
        g.mean(sample.flatten())[1],
        g.std(sample.flatten())[1] / np.sqrt(g.count))

Creating the GroupBy object encapsulates the expense of 'indexing' the keys, which is the most expensive part of these operations. We would have to redo that four times here, if we didn't have access to the GroupBy object.

From looking at the numpy source, I get the impression that it is considered good practice not to overuse OOP. And I agree, but I think it is called for here.
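As an illustration of what such an object might cache, here is a hypothetical minimal GroupBy for non-empty 1-D keys and 1-D values. The attribute names mirror the snippet above, but the internals are assumptions made for this sketch and not the draft's code.

```python
import numpy as np

class GroupBy:
    """Hypothetical minimal stand-in; not the draft's implementation."""

    def __init__(self, keys):
        keys = np.asarray(keys)
        # the expensive part, done once: sort the keys and find group starts
        self.sorter = np.argsort(keys, kind='stable')
        sorted_keys = keys[self.sorter]
        is_start = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
        self.start = np.flatnonzero(is_start)
        self.unique = sorted_keys[self.start]
        self.count = np.diff(np.append(self.start, len(keys)))

    def _sum(self, values):
        # reuse the cached ordering; reduceat sums each contiguous group
        ordered = np.asarray(values, dtype=float)[self.sorter]
        return np.add.reduceat(ordered, self.start)

    def mean(self, values):
        return self.unique, self._sum(values) / self.count

    def std(self, values):
        # population standard deviation via E[x^2] - E[x]^2
        values = np.asarray(values, dtype=float)
        mean = self._sum(values) / self.count
        variance = self._sum(values ** 2) / self.count - mean ** 2
        return self.unique, np.sqrt(np.maximum(variance, 0.0))

radius = np.array([0.5, 1.0, 0.5, 1.0, 0.5])
sample = np.array([1.0, 10.0, 2.0, 20.0, 3.0])
g = GroupBy(radius)
means = g.mean(sample)[1]
errors = g.std(sample)[1] / np.sqrt(g.count)
print(g.unique, means, errors)
```

The sort in the constructor runs once; mean, std and count all reuse it.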


Re: Numpy Enhancement Proposal: group_by functionality

Alan Isaac
In reply to this post by Stéfan van der Walt
On 1/26/2014 12:02 PM, Stéfan van der Walt wrote:
>   what would the output of
>
> ``group_by((key1, key2))``


I'd expect something named "groupby" to behave as below.
Alan

def groupby(seq, key):
    from collections import defaultdict
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return groups

print groupby(range(20), lambda x: x%2)

Re: Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn
Alan:

The equivalent of that in my current draft would be group_by(keys, values), which is shorthand for group_by(keys).group(values). An optional values argument to the GroupBy constructor is bound directly, returning an iterable over the grouped values. But we often want to bind different value objects, with different operations, to the same set of keys, so it is convenient to be able to delay the binding of the values argument. Also, the third argument to group_by is an optional reduction function.


Re: Numpy Enhancement Proposal: group_by functionality

Alan Isaac
My comment is just on the name.
I'd expect something named `groupby`
to behave essentially like Mathematica's `GatherBy` command.
http://reference.wolfram.com/mathematica/ref/GatherBy.html

I think you are after something more like Matlab's grpstats:
http://www.mathworks.com/help/stats/grpstats.html

Perhaps the implicit reference to SQL justifies the name...

Sorry if this seems off topic,
Alan Isaac


Re: Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn
Not off topic at all; there are several matters of naming that I am not at all settled on yet, and I don't think they are unimportant.

Indeed, those are closely related functions, and I wasn't aware of them yet, so that's some welcome additional perspective. The Mathematica function differs in that the keys are always a function of the values, as in your example. My proposed interface does not have that constraint, but that behavior is of course easily obtained by something like group_by(mapping(values), values).
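A sketch of that keys-from-values pattern with plain numpy, reusing the range(20) and x % 2 example from earlier in the thread. The name gather_by is hypothetical, and groups come back ordered by sorted key, which is an artifact of this sketch rather than a claim about GatherBy or the draft.

```python
import numpy as np

def gather_by(values, mapping):
    """Hypothetical helper: split values into groups keyed by mapping(values)."""
    values = np.asarray(values)
    keys = mapping(values)
    # a stable sort keeps the original order of values within each group
    order = np.argsort(keys, kind='stable')
    sorted_keys = keys[order]
    # positions where the sorted key changes mark the start of a new group
    boundaries = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
    return np.split(values[order], boundaries)

groups = gather_by(np.arange(20), lambda x: x % 2)
for g in groups:
    print(g)
```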

Indeed, grpstats also has a lot of overlap, though it does not have the same generality as my proposal.

It's interesting to wonder where one gets one's ideas as to how to call what. I've never worked with SQL much; I suppose I picked up this naming by working with LINQ. I rather like group_by; it is more suitable to the generality of the operations supported by the group_by object than something like grpstats. The majority of my applications for grouping have nothing whatsoever to do with statistics.

