# Numpy Enhancement Proposal: group_by functionality

8 messages

## Numpy Enhancement Proposal: group_by functionality

Hi all,

Please critique my draft exploring the possibilities of adding group_by support to numpy: http://pastebin.com/c5WLWPbp

In nearly every project I work on, I require group_by functionality of some sort. There are other libraries that provide this kind of functionality, such as pandas, but I will try to make the case here that numpy ought to have a solid core of group_by functionality. Primarily, one may argue that the concept of grouping values by a key is far more general than a pandas dataframe. In particular, one often needs a simple one-line transient association between some keys and values, and wrangling your problem into the more permanent and specialized data structure that a dataframe is, is simply not called for.

As a simple compact example:

    key1 = list('abaabb')
    key2 = np.random.randint(0,2,(6,2))
    values = np.random.rand(6,3)
    print(group_by((key1, key2)).median(values))

Points of note: we can group by arbitrary combinations of keys, and subarrays can also act as keys. group_by has a rich set of reduction functionality, which performs efficient per-group reductions, as well as various ways to split your values per group.

Also, the code here has a lot of overlap with np.unique and related arraysetops. Functions like np.unique are easily reimplemented using the groundwork laid out here, and may also be extended to benefit from the generalizations made, allowing a wider variety of objects to have their unique values taken. In the example that follows, note the axis keyword: what is unique here are the images found along the first axis, not the elements of shuffled.

    # create a stack of images
    images = np.random.rand(4,64,64)
    # shuffle the images; this is a giant mess now; how to find all the original ones?
    shuffled = images[np.random.randint(0,4,200)]
    # there you go
    print(unique(shuffled, axis=0))

Some more examples and unit tests can be found at the end of the module. I'd love to hear your feedback on this. Specifically:

- Do you agree numpy would benefit from group_by functionality?
- Do you have suggestions for further generalizations/extensions?
- Any commentary on design decisions / implementation?

Regards,
Eelco Hoogendoorn

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
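For readers who want to try the image example from this message without the pastebin draft: later NumPy releases (1.13 and up, to my knowledge) give np.unique an axis keyword of its own. The sketch below uses that built-in rather than the draft's unique function, and the seeded generator is my own addition so the run is repeatable. Note that the recovered images come back in sorted order, not in their original stacking order.

```python
import numpy as np

# Seeded generator: an assumption added here for repeatability,
# the original message used the global np.random functions.
rng = np.random.default_rng(0)

# Create a stack of four distinct images.
images = rng.random((4, 64, 64))

# Shuffle with repetition: 200 draws from the four originals.
shuffled = images[rng.integers(0, 4, 200)]

# Treat every slice along axis 0 as one item and keep the distinct ones.
# return_counts also reports how often each distinct image occurred.
recovered, counts = np.unique(shuffled, axis=0, return_counts=True)

print(recovered.shape)
print(counts)
```

Each entry of recovered is an exact copy of one of the original images, since indexing copies the data without changing it.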

## Re: Numpy Enhancement Proposal: group_by functionality

Hi Eelco

On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote:
> key1 = list('abaabb')
> key2 = np.random.randint(0,2,(6,2))
> values = np.random.rand(6,3)
> print(group_by((key1, key2)).median(values))

I agree that group_by functionality could be handy in numpy. In the above example, what would the output of ``group_by((key1, key2))`` be?

Stéfan

## Re: Numpy Enhancement Proposal: group_by functionality

An object of type GroupBy. So a call to group_by does not return any consumable output directly. If you want, for instance, the unique keys (or groups, if you will), you can call GroupBy.unique. In this case, for a tuple of input keys, you'd get a tuple of unique keys back. If you want to compute several reductions over the same set of keys, you can hang on to the GroupBy object and the precomputations it encapsulates.

To expand on that example: reduction operations also return the unique keys which the reduced elements belong to:

    (unique1, unique2), median = group_by((key1, key2)).median(values)
    print(unique1)
    print(unique2)
    print(median)

yields something like

    ['a' 'a' 'b' 'b' 'a']
    [[0 0]
     [0 1]
     [0 1]
     [1 0]
     [1 1]]
    [[ 0.34041782  0.78579254  0.91494441]
     [ 0.59422888  0.67915262  0.04327812]
     [ 0.45045529  0.45049761  0.49633574]
     [ 0.71623235  0.95760152  0.85137696]
     [ 0.96299801  0.27639574  0.70519413]]

Note that the elements of unique1 and unique2 are not themselves unique, but rather their elements zipped together are unique.

On Sun, Jan 26, 2014 at 6:02 PM, Stéfan van der Walt wrote:
> Hi Eelco
>
> On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote:
>> key1 = list('abaabb')
>> key2 = np.random.randint(0,2,(6,2))
>> values = np.random.rand(6,3)
>> print(group_by((key1, key2)).median(values))
>
> I agree that group_by functionality could be handy in numpy. In the above example, what would the output of ``group_by((key1, key2))`` be?
>
> Stéfan
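The return convention described in this message, a tuple of unique keys plus one reduced row per compound key, can be approximated with plain NumPy. This is only a rough sketch and not the draft's implementation: grouped_median is a hypothetical helper, it assumes key2 holds integers, and it loops over groups in Python instead of using the vectorized machinery the draft describes. The input data is fixed by hand so the result is easy to check.

```python
import numpy as np

def grouped_median(key1, key2, values):
    """Hypothetical helper: per-group median for the compound key (key1, key2)."""
    key1 = np.asarray(key1)
    key2 = np.asarray(key2)
    values = np.asarray(values)

    # Encode the string key as integer codes so both keys fit in one int array.
    labels, codes = np.unique(key1, return_inverse=True)
    compound = np.column_stack([codes.ravel(), key2])

    # Each distinct row of `compound` is one group; `inverse` maps rows to groups.
    unique_rows, inverse = np.unique(compound, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against NumPy versions returning a 2-D inverse

    medians = np.array([np.median(values[inverse == g], axis=0)
                        for g in range(len(unique_rows))])
    return (labels[unique_rows[:, 0]], unique_rows[:, 1:]), medians

key1 = list('abaabb')
key2 = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [0, 1], [1, 0]])
values = np.arange(18, dtype=float).reshape(6, 3)

(unique1, unique2), median = grouped_median(key1, key2, values)
print(unique1)
print(unique2)
print(median)
```

As in the message, unique1 on its own contains repeats; only the zipped pairs (unique1[i], unique2[i]) are distinct.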

## Re: Numpy Enhancement Proposal: group_by functionality

In reply to this post by Stéfan van der Walt

To follow up with an example as to why it is useful that a temporary object is created, consider the following (taken from the radial reduction example):

    g = group_by(np.round(radius, 5).flatten())
    pp.errorbar(
        g.unique,
        g.mean(sample.flatten())[1],
        g.std(sample.flatten())[1] / np.sqrt(g.count))

Creating the GroupBy object encapsulates the expense of 'indexing' the keys, which is the most expensive part of these operations. We would have to redo that four times here if we didn't have access to the GroupBy object. From looking at the numpy source, I get the impression that it is considered good practice not to overuse OOP. And I agree, but I think it is called for here.
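To make the caching argument concrete, here is a minimal sketch of an object that sorts the keys once and reuses that index for several reductions. KeyIndex is a hypothetical stand-in, not the draft's GroupBy: it handles only non-empty 1-D keys and 1-D values, and it mirrors the (unique, result) return convention only as far as the snippet in this message reveals it.

```python
import numpy as np

class KeyIndex:
    """Hypothetical stand-in for the draft's GroupBy (non-empty 1-D keys only)."""

    def __init__(self, keys):
        keys = np.asarray(keys)
        # The expensive part: one stable sort of the keys, done once.
        self.sorter = np.argsort(keys, kind="stable")
        sorted_keys = keys[self.sorter]
        # A group starts wherever the sorted key differs from its predecessor.
        is_start = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
        self.start = np.flatnonzero(is_start)
        self.unique = sorted_keys[self.start]
        self.count = np.diff(np.append(self.start, len(keys)))

    def sum(self, values):
        # Reorder the values with the cached sort, then sum each contiguous run.
        ordered = np.asarray(values)[self.sorter]
        return self.unique, np.add.reduceat(ordered, self.start)

    def mean(self, values):
        unique, total = self.sum(values)
        return unique, total / self.count

    def std(self, values):
        # Population standard deviation via E[x^2] - E[x]^2, clipped at zero.
        values = np.asarray(values, dtype=float)
        _, mean = self.mean(values)
        _, mean_sq = self.mean(values ** 2)
        return self.unique, np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

radius = np.array([0.5, 1.0, 0.5, 1.0, 1.0])
sample = np.array([1.0, 2.0, 3.0, 4.0, 6.0])

g = KeyIndex(radius)          # keys are indexed once here
_, mean = g.mean(sample)      # every reduction below reuses that index
_, std = g.std(sample)
print(g.unique, g.count, mean, std / np.sqrt(g.count))
```

The sort happens only in the constructor; unique, count, mean and std all read from the stored sorter and start arrays, which is the saving the message argues for.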

## Re: Numpy Enhancement Proposal: group_by functionality

In reply to this post by Stéfan van der Walt

On 1/26/2014 12:02 PM, Stéfan van der Walt wrote:
> what would the output of
>
> ``group_by((key1, key2))``

I'd expect something named "groupby" to behave as below.

Alan

    def groupby(seq, key):
        from collections import defaultdict
        # collect the items of seq into lists, keyed by key(item)
        groups = defaultdict(list)
        for item in seq:
            groups[key(item)].append(item)
        return groups

    print(groupby(range(20), lambda x: x%2))

## Re: Numpy Enhancement Proposal: group_by functionality

Alan:

The equivalent of that in my current draft would be group_by(keys, values), which is shorthand for group_by(keys).group(values). An optional values argument to the constructor of GroupBy is bound directly, returning an iterable over the grouped values. But we often want to bind different value objects, with different operations, to the same set of keys, so it is convenient to be able to delay the binding of the values argument. Also, the third argument to group_by is an optional reduction function.

On Sun, Jan 26, 2014 at 6:57 PM, Alan G Isaac wrote:
> On 1/26/2014 12:02 PM, Stéfan van der Walt wrote:
>> what would the output of
>>
>> ``group_by((key1, key2))``
>
> I'd expect something named "groupby" to behave as below.
>
> Alan
>
> def groupby(seq, key):
>     from collections import defaultdict
>     groups = defaultdict(list)
>     for item in seq:
>         groups[key(item)].append(item)
>     return groups
>
> print(groupby(range(20), lambda x: x%2))
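The call forms described in this message (keys only, keys plus values, keys plus values plus a reduction) could be wired together roughly as follows. This is a guess at the shape of the interface, not the draft's code: Grouper and its group method are hypothetical names, only 1-D keys are handled, and returning a list of (key, reduced value) pairs for the three-argument form is my assumption, since the message does not say what that form returns.

```python
import numpy as np

class Grouper:
    """Hypothetical stand-in for the draft's GroupBy (1-D keys only)."""

    def __init__(self, keys):
        keys = np.asarray(keys)
        self.unique, inverse = np.unique(keys, return_inverse=True)
        # A stable sort on the group index keeps the original order within a group.
        self.sorter = np.argsort(inverse, kind="stable")
        # Positions at which the sorted values are cut into groups.
        self.splits = np.cumsum(np.bincount(inverse))[:-1]

    def group(self, values):
        """Bind values late: return one array per unique key."""
        return np.split(np.asarray(values)[self.sorter], self.splits)

def group_by(keys, values=None, reduction=None):
    g = Grouper(keys)
    if values is None:
        return g                      # delayed binding: hand back the grouper
    groups = g.group(values)
    if reduction is None:
        return groups                 # shorthand for group_by(keys).group(values)
    # Assumed return shape: (key, reduced value) pairs.
    return [(key, reduction(grp)) for key, grp in zip(g.unique, groups)]

x = np.arange(20)
evens, odds = group_by(x % 2, x)          # the grouping from Alan's example
reduced = group_by(x % 2, x, np.sum)      # same keys, with a reduction
print(evens, odds, reduced)
```

Calling group_by(x % 2) alone returns the Grouper, so several value arrays can be bound to the same keys afterwards without indexing the keys again.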