

Hi all,
Please critique my draft exploring the possibilities of adding group_by support to numpy:
In nearly every project I work on, I require group_by functionality of some sort. There are other libraries that provide this kind of functionality, such as pandas, but I will try to make the case here that numpy ought to have a solid core of group_by functionality. Primarily, one may argue that the concept of grouping values by a key is far more general than a pandas dataframe. In particular, one often needs a simple one-line transient association between some keys and values, and wrangling your problem into a dataframe, a more permanent and specialized data structure, is simply not called for.
As a simple compact example:

key1 = list('abaabb')
key2 = np.random.randint(0, 2, (6, 2))
values = np.random.rand(6, 3)
print group_by((key1, key2)).median(values)
Points of note: we can group by arbitrary combinations of keys, and subarrays can also act as keys. group_by has a rich set of reduction functionality, which performs efficient per-group reductions, as well as various ways to split your values per group.
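To make the "efficient per-group reductions" point concrete, here is a minimal sketch of one such reduction built from existing NumPy primitives (np.unique, argsort, np.add.reduceat). The name group_sum is hypothetical and the draft itself is far more general; this only illustrates that a per-group reduction need not loop in Python:

```python
import numpy as np

def group_sum(keys, values):
    """Sum `values` per unique key, using only existing NumPy primitives.

    Hypothetical helper; illustrates the technique, not the draft's code."""
    # factorize the keys: unique keys plus an inverse index per element
    unique_keys, inverse = np.unique(np.asarray(keys), return_inverse=True)
    inverse = inverse.ravel()
    # stable sort so the values of each group become contiguous
    order = np.argsort(inverse, kind='stable')
    # start offset of each group in the sorted array
    starts = np.searchsorted(inverse[order], np.arange(len(unique_keys)))
    # one vectorized segmented reduction over all groups at once
    sums = np.add.reduceat(np.asarray(values)[order], starts, axis=0)
    return unique_keys, sums

keys = list('abaabb')
values = np.arange(6.0)
print(group_sum(keys, values))
```

The same sort-then-reduceat layout works for any ufunc reduction, which is what makes computing several reductions over one factorization cheap.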
Also, the code here has a lot of overlap with np.unique and the related arraysetops. Functions like np.unique are easily reimplemented using the groundwork laid out here, and may also be extended to benefit from the generalizations made, allowing a wider variety of objects to have their unique values taken. Note the axis keyword in the example below: what is unique here are the images found along the first axis, not the elements of shuffled.
# create a stack of images
images = np.random.rand(4, 64, 64)
# shuffle the images; this is a giant mess now; how to find all the original ones?
shuffled = images[np.random.randint(0, 4, 200)]
# there you go
print unique(shuffled, axis=0)
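For reference, this behavior can be sketched with np.unique's own axis keyword (available in NumPy 1.13 and later); the unique function proposed above generalizes further, but the core idea runs today:

```python
import numpy as np

rng = np.random.default_rng(0)
# a stack of 4 distinct 64x64 images
images = rng.random((4, 64, 64))
# duplicate and shuffle them into a stack of 200
shuffled = images[rng.integers(0, 4, 200)]
# recover the distinct images: unique along the first axis,
# i.e. whole images are the "elements" being compared
recovered = np.unique(shuffled, axis=0)
print(recovered.shape)  # (4, 64, 64)
```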
Some more examples and unit tests can be found at the end of the module. I'd love to hear your feedback on this. Specifically:
- Do you agree numpy would benefit from group_by functionality?
- Do you have suggestions for further generalizations/extensions?
- Any commentary on design decisions / implementation?
Regards,
Eelco Hoogendoorn
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Hi Eelco
On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote:
> key1 = list('abaabb')
> key2 = np.random.randint(0,2,(6,2))
> values = np.random.rand(6,3)
> print group_by((key1, key2)).median(values)
I agree that group_by functionality could be handy in numpy.
In the above example, what would the output of
``group_by((key1, key2))``
be?
Stéfan


An object of type GroupBy. So a call to group_by does not return any consumable output directly. If you want, for instance, the unique keys, or groups if you will, you can call GroupBy.unique. In this case, for a tuple of input keys, you'd get a tuple of unique keys back. If you want to compute several reductions over the same set of keys, you can hang on to the GroupBy object and the precomputations it encapsulates.
To expand on that example: reduction operations also return the unique keys to which the reduced elements belong:

(unique1, unique2), median = group_by((key1, key2)).median(values)
print unique1
print unique2
print median

yields something like

['a' 'a' 'b' 'b' 'a']
[[0 0]
 [0 1]
 [0 1]
 [1 0]
 [1 1]]
[[ 0.34041782  0.78579254  0.91494441]
 [ 0.59422888  0.67915262  0.04327812]
 [ 0.45045529  0.45049761  0.49633574]
 [ 0.71623235  0.95760152  0.85137696]
 [ 0.96299801  0.27639574  0.70519413]]
Note that the elements of unique1 and unique2 are not themselves unique, but rather their elements zipped together are unique.
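A minimal sketch of this zipped-unique reduction using only existing primitives. The group_median helper and the composite-index trick below are illustrative, not the draft's actual implementation; they just show how a tuple of keys can be factorized jointly:

```python
import numpy as np

def group_median(key1, key2, values):
    """Median of `values` per unique (key1, key2) combination.

    Hypothetical helper, for illustration only."""
    values = np.asarray(values)
    # factorize each key separately; inverse indices label each row's group
    u1, i1 = np.unique(np.asarray(key1), return_inverse=True)
    # each row of key2 acts as a single key: uniquify along the first axis
    u2, i2 = np.unique(np.asarray(key2), axis=0, return_inverse=True)
    i1, i2 = i1.ravel(), i2.ravel()
    # composite index that is unique per (key1, key2) combination
    composite = i1 * len(u2) + i2
    groups, inverse = np.unique(composite, return_inverse=True)
    inverse = inverse.ravel()
    medians = np.array([np.median(values[inverse == g], axis=0)
                        for g in range(len(groups))])
    # map composite groups back to their constituent unique keys,
    # so the returned key arrays are unique only when zipped together
    return u1[groups // len(u2)], u2[groups % len(u2)], medians

key1 = list('abaabb')
key2 = np.array([[0, 0], [0, 1], [0, 0], [1, 0], [0, 1], [1, 1]])
values = np.arange(18.0).reshape(6, 3)
uk1, uk2, med = group_median(key1, key2, values)
print(uk1)
print(uk2)
print(med)
```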


To follow up with an example as to why it is useful that a temporary object is created, consider the following (taken from the radial reduction example):
g = group_by(np.round(radius, 5).flatten())
pp.errorbar(
    g.unique,
    g.mean(sample.flatten())[1],
    g.std(sample.flatten())[1] / np.sqrt(g.count))
Creating the GroupBy object encapsulates the expense of 'indexing' the keys, which is the most expensive part of these operations. We would have to redo that four times here, if we didn't have access to the GroupBy object.
From looking at the numpy source, I get the impression that it is considered good practice not to overuse OOP. And I agree, but I think it is called for here.
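A minimal sketch of such caching, assuming 1-D keys and values (the draft's GroupBy is more general, and these method bodies are illustrative): the expensive factorization happens once in __init__, and each subsequent reduction reuses it.

```python
import numpy as np

class GroupBy:
    """Illustrative sketch: cache the key factorization once,
    then reuse it for several cheap per-group reductions."""

    def __init__(self, keys):
        # the expensive 'indexing' of the keys: done exactly once
        self.unique, self.inverse = np.unique(np.asarray(keys),
                                              return_inverse=True)
        self.count = np.bincount(self.inverse)

    def sum(self, values):
        # grouped sum via bincount's weights; no re-indexing of keys
        return np.bincount(self.inverse, weights=np.asarray(values))

    def mean(self, values):
        return self.sum(values) / self.count

    def std(self, values):
        values = np.asarray(values)
        m = self.mean(values)
        return np.sqrt(self.mean(values ** 2) - m ** 2)

g = GroupBy(np.array([0.1, 0.2, 0.1, 0.2]))
print(g.unique, g.count, g.mean(np.array([1.0, 2.0, 3.0, 4.0])))
```

Here g.unique, g.count, g.mean and g.std all draw on the one cached inverse index, which is exactly the economy the errorbar example above relies on.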


On 1/26/2014 12:02 PM, Stéfan van der Walt wrote:
> what would the output of
>
> ``group_by((key1, key2))``
I'd expect something named "groupby" to behave as below.
Alan
from collections import defaultdict

def groupby(seq, key):
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return groups

print groupby(range(20), lambda x: x % 2)


Alan:
The equivalent of that in my current draft would be group_by(keys, values), which is shorthand for group_by(keys).group(values). An optional values argument to the constructor of GroupBy is directly bound to return an iterable over the grouped values; but we often want to bind different value objects, with different operations, to the same set of keys, so it is convenient to be able to delay the binding of the values argument. Also, the third argument to group_by is an optional reduction function.
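The delayed binding can be sketched as follows; the class and function here are illustrative stand-ins mirroring the draft's described API, not its actual implementation:

```python
import numpy as np

class GroupBy:
    """Illustrative sketch of delayed value binding."""

    def __init__(self, keys):
        # factorize once; binding of values is delayed until .group()
        self.unique, self.inverse = np.unique(np.asarray(keys),
                                              return_inverse=True)

    def group(self, values):
        # split values per group, yielding (key, subarray) pairs
        values = np.asarray(values)
        return [(k, values[self.inverse == i])
                for i, k in enumerate(self.unique)]

def group_by(keys, values=None):
    g = GroupBy(keys)
    # the two-argument form is shorthand for group_by(keys).group(values)
    return g.group(values) if values is not None else g

for key, vals in group_by(list('abab'), [1, 2, 3, 4]):
    print(key, vals)
```

The one-argument form returns the GroupBy object itself, so different value arrays can later be grouped against the same cached keys.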


Not off-topic at all; there are several matters of naming that I am not at all settled on yet, and I don't think they are unimportant.
Indeed, those are closely related functions, and I wasn't aware of them yet, so that's some welcome additional perspective. The Mathematica function differs in that the keys are always a function of the values, as per your example as well. My proposed interface does not have that constraint, but that behavior is of course easily obtained by something like group_by(mapping(values), values).
Indeed, grpstats also has a lot of overlap, though it does not have the same generality as my proposal.
It's interesting to wonder where one gets one's ideas as to how to call what. I've never worked with SQL much; I suppose I picked up this naming from working with LINQ. I rather like group_by; it is more suited to the generality of the operations supported by the GroupBy object than something like grpstats. The majority of my applications for grouping have nothing whatsoever to do with statistics.

