Fancier indexing

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Fancier indexing

Kevin Jacobs <jacobs@bioinformed.com>
After poking around for a bit, I was wondering if there was a faster method for the following:

# Array of index values 0..n
items = numpy.array([0,3,2,1,4,2],dtype=int)

# Count the number of occurrences of each index
counts = numpy.zeros(5, dtype=int)
for i in items:
  counts[i] += 1

In my real code, 'items' contain up to a million values and this loop will be in a performance critical area of code.  If there is no simple solution, I can trivially code this using the C-API.

Thanks,
-Kevin


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Keith Goodman
On Thu, May 22, 2008 at 8:59 AM, Kevin Jacobs <[hidden email]>
<[hidden email]> wrote:

> After poking around for a bit, I was wondering if there was a faster method
> for the following:
>
> # Array of index values 0..n
> items = numpy.array([0,3,2,1,4,2],dtype=int)
>
> # Count the number of occurrences of each index
> counts = numpy.zeros(5, dtype=int)
> for i in items:
>   counts[i] += 1
>
> In my real code, 'items' contain up to a million values and this loop will
> be in a performance critical area of code.  If there is no simple solution,
> I can trivially code this using the C-API.

How big is n? If it is much smaller than a million then loop over that instead.
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Kevin Jacobs <jacobs@bioinformed.com>
On Thu, May 22, 2008 at 12:08 PM, Keith Goodman <[hidden email]> wrote:
How big is n? If it is much smaller than a million then loop over that instead.

n is always relatively small, but I'd rather not do:

for i in range(n):
  counts[i] = (items==i).sum()

If that was the best alternative, I'd just bite the bullet and code this in C.

Thanks,
-Kevin

_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Keith Goodman
In reply to this post by Keith Goodman
On Thu, May 22, 2008 at 9:08 AM, Keith Goodman <[hidden email]> wrote:

> On Thu, May 22, 2008 at 8:59 AM, Kevin Jacobs <[hidden email]>
> <[hidden email]> wrote:
>> After poking around for a bit, I was wondering if there was a faster method
>> for the following:
>>
>> # Array of index values 0..n
>> items = numpy.array([0,3,2,1,4,2],dtype=int)
>>
>> # Count the number of occurrences of each index
>> counts = numpy.zeros(5, dtype=int)
>> for i in items:
>>   counts[i] += 1
>>
>> In my real code, 'items' contain up to a million values and this loop will
>> be in a performance critical area of code.  If there is no simple solution,
>> I can trivially code this using the C-API.
>
> How big is n? If it is much smaller than a million then loop over that instead.

Or how about using a list instead:

>> items = [0,3,2,1,4,2]
>> uitems = frozenset(items)
>> count = [items.count(i) for i in uitems]
>> count
   [1, 1, 2, 1, 1]
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Mark Miller-11
In reply to this post by Keith Goodman
You're just trying to do this...correct?

>>> import numpy
>>> items = numpy.array([0,3,2,1,4,2],dtype=int)
>>> unique = numpy.unique(items)
>>> unique
array([0, 1, 2, 3, 4])
>>> counts=numpy.histogram(items,unique)
>>> counts
(array([1, 1, 2, 1, 1]), array([0, 1, 2, 3, 4]))
>>> counts[0]
array([1, 1, 2, 1, 1])
>>>

On Thu, May 22, 2008 at 9:08 AM, Keith Goodman <[hidden email]> wrote:
On Thu, May 22, 2008 at 8:59 AM, Kevin Jacobs <[hidden email]>
<[hidden email]> wrote:
> After poking around for a bit, I was wondering if there was a faster method
> for the following:
>
> # Array of index values 0..n
> items = numpy.array([0,3,2,1,4,2],dtype=int)
>
> # Count the number of occurrences of each index
> counts = numpy.zeros(5, dtype=int)
> for i in items:
>   counts[i] += 1
>
> In my real code, 'items' contain up to a million values and this loop will
> be in a performance critical area of code.  If there is no simple solution,
> I can trivially code this using the C-API.

How big is n? If it is much smaller than a million then loop over that instead.
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Keith Goodman
In reply to this post by Keith Goodman
On Thu, May 22, 2008 at 9:15 AM, Keith Goodman <[hidden email]> wrote:

> On Thu, May 22, 2008 at 9:08 AM, Keith Goodman <[hidden email]> wrote:
>> On Thu, May 22, 2008 at 8:59 AM, Kevin Jacobs <[hidden email]>
>> <[hidden email]> wrote:
>>> After poking around for a bit, I was wondering if there was a faster method
>>> for the following:
>>>
>>> # Array of index values 0..n
>>> items = numpy.array([0,3,2,1,4,2],dtype=int)
>>>
>>> # Count the number of occurrences of each index
>>> counts = numpy.zeros(5, dtype=int)
>>> for i in items:
>>>   counts[i] += 1
>>>
>>> In my real code, 'items' contain up to a million values and this loop will
>>> be in a performance critical area of code.  If there is no simple solution,
>>> I can trivially code this using the C-API.
>>
>> How big is n? If it is much smaller than a million then loop over that instead.
>
> Or how about using a list instead:
>
>>> items = [0,3,2,1,4,2]
>>> uitems = frozenset(items)
>>> count = [items.count(i) for i in uitems]
>>> count
>   [1, 1, 2, 1, 1]

Oh, I see, so uitems should be range(n)
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Robin-62
In reply to this post by Kevin Jacobs <jacobs@bioinformed.com>
On Thu, May 22, 2008 at 4:59 PM, Kevin Jacobs <[hidden email]>
<[hidden email]> wrote:

> After poking around for a bit, I was wondering if there was a faster method
> for the following:
>
> # Array of index values 0..n
> items = numpy.array([0,3,2,1,4,2],dtype=int)
>
> # Count the number of occurrences of each index
> counts = numpy.zeros(5, dtype=int)
> for i in items:
>   counts[i] += 1
>
> In my real code, 'items' contain up to a million values and this loop will
> be in a performance critical area of code.  If there is no simple solution,
> I can trivially code this using the C-API.

I would use bincount:
count = bincount(items)
should be all you need:


In [192]: items = [0,3,2,1,4,2]

In [193]: bincount(items)
Out[193]: array([1, 1, 2, 1, 1])

In [194]: bincount?
Type:           builtin_function_or_method
Base Class:     <type 'builtin_function_or_method'>
String Form:    <built-in function bincount>
Namespace:      Interactive
Docstring:
    bincount(x,weights=None)

    Return the number of occurrences of each value in x.

    x must be a list of non-negative integers.  The output, b[i],
    represents the number of times that i is found in x.  If weights
    is specified, every occurrence of i at a position p contributes
    weights[p] instead of 1.

    See also: histogram, digitize, unique.

Robin
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Fancier indexing

Keith Goodman
On Thu, May 22, 2008 at 9:22 AM, Robin <[hidden email]> wrote:

> On Thu, May 22, 2008 at 4:59 PM, Kevin Jacobs <[hidden email]>
> <[hidden email]> wrote:
>> After poking around for a bit, I was wondering if there was a faster method
>> for the following:
>>
>> # Array of index values 0..n
>> items = numpy.array([0,3,2,1,4,2],dtype=int)
>>
>> # Count the number of occurrences of each index
>> counts = numpy.zeros(5, dtype=int)
>> for i in items:
>>   counts[i] += 1
>>
>> In my real code, 'items' contain up to a million values and this loop will
>> be in a performance critical area of code.  If there is no simple solution,
>> I can trivially code this using the C-API.
>
> I would use bincount:
> count = bincount(items)
> should be all you need:

I guess bincount is *little* faster:

>> items = mp.random.randint(0, 100, (1000000,))
>> timeit mp.bincount(items)
100 loops, best of 3: 4.05 ms per loop
>> items = items.tolist()
>> timeit [items.count(i) for i in range(100)]
10 loops, best of 3: 2.91 s per loop
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion