Faster

Keith Goodman
How can I make this function faster? It removes the i-th row and
column from an array.

def cut(x, i):
    idx = range(x.shape[0])
    idx.remove(i)
    y = x[idx,:]
    y = y[:,idx]
    return y

>> import numpy as np
>> x = np.random.rand(500,500)
>> timeit cut(x, 100)
100 loops, best of 3: 16.8 ms per loop
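
For context, numpy's own np.delete expresses the same operation; a minimal sketch (np.delete also copies, so it is a baseline rather than an obvious speedup, and cut_delete is a name made up here):

import numpy as np

def cut_delete(x, i):
    # remove row i, then column i; each np.delete call copies the array
    y = np.delete(x, i, axis=0)
    return np.delete(y, i, axis=1)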

Re: Faster

Charles R Harris


On Fri, May 2, 2008 at 6:24 PM, Keith Goodman <[hidden email]> wrote:
How can I make this function faster? It removes the i-th row and
column from an array.

Why do you want to do that?

Chuck




Re: Faster

Keith Goodman
On Fri, May 2, 2008 at 5:38 PM, Charles R Harris
<[hidden email]> wrote:
> On Fri, May 2, 2008 at 6:24 PM, Keith Goodman <[hidden email]> wrote:
> > How can I make this function faster? It removes the i-th row and
> > column from an array.
> >
>
> Why do you want to do that?

Single linkage clustering; x is the distance matrix.

Re: Faster

Keith Goodman
On Fri, May 2, 2008 at 5:47 PM, Keith Goodman <[hidden email]> wrote:

>
> On Fri, May 2, 2008 at 5:38 PM, Charles R Harris
>  <[hidden email]> wrote:
>  > On Fri, May 2, 2008 at 6:24 PM, Keith Goodman <[hidden email]> wrote:
>  > > How can I make this function faster? It removes the i-th row and
>  > > column from an array.
>  > >
>  >
>  > Why do you want to do that?
>
>  Single linkage clustering; x is the distance matrix.

Here's the full code if you are interested. I haven't used it yet
other than running test() and test2(), so it may be full of bugs.

import time

import numpy as np

class Cluster:
    "Single linkage hierarchical clustering"

    def __init__(self, dist, label=None, debug=False):
        """
        dist   Distance matrix, NxN numpy array
        label  Labels of each row of the distance matrix, list of N items,
               default is range(N)
        debug  If True, print the clusters and the distance matrix after
               each join
        """
        assert dist.shape[0] == dist.shape[1], 'dist must be square (nxn)'
        assert (np.abs(dist - dist.T) < 1e-8).all(), 'dist must be symmetric'
        if label is None:
            label = range(dist.shape[0])
        assert dist.shape[0] == len(label), 'dist and label must match in size'
        self.c = [[[z] for z in label]]
        self.label = label
        self.dist = dist
        self.debug = debug

    def run(self):
        for level in xrange(len(self.label) - 1):
            i, j = self.min_dist()
            self.join(i, j)

    def join(self, i, j):

        assert i != j, 'Cannot combine a cluster with itself'

        # Join labels
        new = list(self.c[-1])
        new[i] = new[i] + new[j]
        new.pop(j)
        self.c.append(new)

        # Join distance matrix
        self.dist[:,i] = self.dist[:,[i,j]].min(1)
        self.dist[i,:] = self.dist[:,i]
        idx = range(self.dist.shape[1])
        idx.remove(j)
        self.dist = self.dist[:,idx]
        self.dist = self.dist[idx,:]

        # Debug output
        if self.debug:
            print
            print len(self.c) - 1
            print 'Clusters'
            print self.c[-1]
            print 'Distance'
            print self.dist

    def min_dist(self):
        dist = self.dist + 1e10 * np.eye(self.dist.shape[0])
        i, j = np.where(dist == dist.min())
        return i[0], j[0]


def test():
    # Example from
    # home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
    label =          ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
    dist = np.array([[0,    662,  877,  255,  412,  996],
                     [662,  0,    295,  468,  268,  400],
                     [877,  295,  0,    754,  564,  138],
                     [255,  468,  754,  0,    219,  869],
                     [412,  268,  564,  219,  0,    669],
                     [996,  400,  138,  869,  669,  0  ]])
    clust = Cluster(dist, label, debug=True)
    clust.run()

def test2(n):
    x = np.random.rand(n,n)
    x = x + x.T
    c = Cluster(x)
    t1 = time.time()
    c.run()
    t2 = time.time()
    print 'n = %d took %0.2f seconds' % (n, t2-t1)

Re: Faster

Robert Kern-2
In reply to this post by Keith Goodman
On Fri, May 2, 2008 at 7:24 PM, Keith Goodman <[hidden email]> wrote:

> How can I make this function faster? It removes the i-th row and
>  column from an array.
>
>  def cut(x, i):
>     idx = range(x.shape[0])
>     idx.remove(i)
>     y = x[idx,:]
>     y = y[:,idx]
>     return y
>
>  >> import numpy as np
>  >> x = np.random.rand(500,500)
>  >> timeit cut(x, 100)
>  100 loops, best of 3: 16.8 ms per loop

I can get a ~20% improvement with the following:

In [8]: %timeit cut(x, 100)
10 loops, best of 3: 21.6 ms per loop

In [9]: def mycut(x, i):
   ...:     A = x[:i,:i]
   ...:     B = x[:i,i+1:]
   ...:     C = x[i+1:,:i]
   ...:     D = x[i+1:,i+1:]
   ...:     return hstack([vstack([A,C]),vstack([B,D])])
   ...:

In [10]: %timeit mycut(x, 100)
10 loops, best of 3: 17.3 ms per loop

The hstack(vstack, vstack) seems to be somewhat better than
vstack(hstack, hstack), at least for these sizes.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
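
A sketch of the comparison Robert describes, with both stacking orders spelled out; mycut_vh is the hypothetical alternative, not code from the thread:

import numpy as np

def mycut_hv(x, i):
    # hstack of vstacks, as posted above
    A = x[:i, :i]
    B = x[:i, i+1:]
    C = x[i+1:, :i]
    D = x[i+1:, i+1:]
    return np.hstack([np.vstack([A, C]), np.vstack([B, D])])

def mycut_vh(x, i):
    # vstack of hstacks, the alternative order
    A = x[:i, :i]
    B = x[:i, i+1:]
    C = x[i+1:, :i]
    D = x[i+1:, i+1:]
    return np.vstack([np.hstack([A, B]), np.hstack([C, D])])

Both build the same (N-1)x(N-1) array; per the timings above, the hstack-of-vstacks order is somewhat faster at this size.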

Re: Faster

Keith Goodman
On Fri, May 2, 2008 at 6:05 PM, Robert Kern <[hidden email]> wrote:

> [...]
>
>  I can get a ~20% improvement with the following:
>
>  In [8]: %timeit cut(x, 100)
>  10 loops, best of 3: 21.6 ms per loop
>
>  In [9]: def mycut(x, i):
>    ...:     A = x[:i,:i]
>    ...:     B = x[:i,i+1:]
>    ...:     C = x[i+1:,:i]
>    ...:     D = x[i+1:,i+1:]
>    ...:     return hstack([vstack([A,C]),vstack([B,D])])
>    ...:
>
>  In [10]: %timeit mycut(x, 100)
>  10 loops, best of 3: 17.3 ms per loop
>
>  The hstack(vstack, vstack) seems to be somewhat better than
>  vstack(hstack, hstack), at least for these sizes.

Wow. I never would have come up with that. And I probably never will.

Original code:
>> cluster.test2(500)
n = 500 took 5.28 seconds

Your improvement:
>> cluster.test2(500)
n = 500 took 3.52 seconds

Much more than a 20% improvement when used in the larger program. Thank you.

Re: Faster

Charles R Harris


On Fri, May 2, 2008 at 7:16 PM, Keith Goodman <[hidden email]> wrote:
> [...]
>
> Much more than a 20% improvement when used in the larger program. Thank you.

Isn't the lengthy part finding the distance between clusters? I can think of several ways to do that, but I think you will get a real speedup by doing it in C or C++. I have a module written with Boost.Python that holds clusters and returns a list of lists containing their elements. Clusters are joined by joining any two elements, one from each. It wouldn't take much to add a distance function, but you could use the list of indices in each cluster to pull a subset out of the distance matrix and then find the minimum in that. This also reminds me of Huffman codes.

Chuck
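
A minimal sketch of the subset idea in numpy terms (the function name and the use of np.ix_ are illustrative assumptions, not code from Chuck's Boost.Python module): given the element indices of two clusters, fancy indexing pulls out the block of pairwise distances, and its minimum is the single-linkage distance.

import numpy as np

def cluster_dist(dist, rows, cols):
    # rows, cols: lists of element indices for the two clusters;
    # np.ix_ builds an open mesh, so this selects the
    # len(rows) x len(cols) block of pairwise distances
    return dist[np.ix_(rows, cols)].min()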



Re: Faster

Keith Goodman
On Fri, May 2, 2008 at 6:29 PM, Charles R Harris
<[hidden email]> wrote:
> Isn't the lengthy part finding the distance between clusters?  I can think
> of several ways to do that, but I think you will get a real speedup by doing
> that in c or c++. I have a module made in boost python that holds clusters
> and returns a list of lists containing their elements. Clusters are joined
> by joining any two elements, one from each. It wouldn't take much to add a
> distance function, but you could use the list of indices in each cluster to
> pull a subset out of the distance matrix and then find the minimum function
> in that. This also reminds me of Huffman codes.

You're right. Finding the distance is slow. Is there any way to speed
up the function below? It returns the row and column indices of the
min value of the NxN array x.

def dist(x):
    x = x + 1e10 * np.eye(x.shape[0])
    i, j = np.where(x == x.min())
    return i[0], j[0]

>> x = np.random.rand(500,500)
>> timeit dist(x)
100 loops, best of 3: 14.1 ms per loop

If the clustering gives me useful results, I'll ask you about your
boost code. I'll also take a look at Damian Eads's scipy-cluster.

Re: Faster

Charles R Harris


On Fri, May 2, 2008 at 8:02 PM, Keith Goodman <[hidden email]> wrote:
> On Fri, May 2, 2008 at 6:29 PM, Charles R Harris
> <[hidden email]> wrote:
> > [...]
>
> You're right. Finding the distance is slow. Is there any way to speed
> up the function below? It returns the row and column indices of the
> min value of the NxN array x.
>
> def dist(x):
>     x = x + 1e10 * np.eye(x.shape[0])

x += diag(ones(x.shape[0]) * 1e10)

would be faster.

>     i, j = np.where(x == x.min())
>     return i[0], j[0]

i = x.argmin()
j = i % x.shape[0]
i = i / x.shape[0]

But I wouldn't worry about speed yet if you are just trying things out.

Chuck
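
numpy can also do the flat-index-to-(row, column) conversion directly; a small equivalent of the mod/div above:

import numpy as np

x = np.random.rand(500, 500)
i, j = np.unravel_index(x.argmin(), x.shape)  # row and column of the minimum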



Re: Faster

Robert Kern-2
In reply to this post by Keith Goodman
On Fri, May 2, 2008 at 9:02 PM, Keith Goodman <[hidden email]> wrote:

> [...]
>
>  You're right. Finding the distance is slow. Is there any way to speed
>  up the function below? It returns the row and column indices of the
>  min value of the NxN array x.
>
>  def dist(x):
>     x = x + 1e10 * np.eye(x.shape[0])
>     i, j = np.where(x == x.min())
>
>     return i[0], j[0]

Assuming x is contiguous and you can modify x in-place:


In [1]: from numpy import *

In [2]: def dist(x):
   ...:    x = x + 1e10 * eye(x.shape[0])
   ...:    i, j = where(x == x.min())
   ...:    return i[0], j[0]
   ...:

In [3]: def symmdist(N):
   ...:     x = random.rand(N, N)
   ...:     x = x + x.T
   ...:     x.flat[::N+1] = 0
   ...:     return x
   ...:

In [4]: symmdist(5)
Out[4]:
array([[ 0.        ,  0.87508654,  1.11691704,  0.80366071,  0.57966808],
       [ 0.87508654,  0.        ,  1.5521685 ,  1.74010886,  0.52156877],
       [ 1.11691704,  1.5521685 ,  0.        ,  1.22725396,  1.04101992],
       [ 0.80366071,  1.74010886,  1.22725396,  0.        ,  1.94577965],
       [ 0.57966808,  0.52156877,  1.04101992,  1.94577965,  0.        ]])

In [5]: def kerndist(x):
   ...:     N = x.shape[0]
   ...:     x.flat[::N+1] = x.max()
   ...:     ij = argmin(x.flat)
   ...:     i, j = divmod(ij, N)
   ...:     return i, j
   ...:

In [10]: x = symmdist(500)

In [15]: %timeit dist(x)
10 loops, best of 3: 19.9 ms per loop

In [16]: %timeit kerndist(x)
100 loops, best of 3: 4.38 ms per loop

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
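
One caveat: kerndist overwrites the diagonal in place. If the zero diagonal must survive, a hedged variant saves and restores it (kerndist_keepdiag is a name made up here):

import numpy as np

def kerndist_keepdiag(x):
    # same trick as kerndist, but restores the diagonal afterwards
    N = x.shape[0]
    d = x.flat[::N+1].copy()    # save the diagonal
    x.flat[::N+1] = x.max()     # mask it out of the argmin
    i, j = divmod(int(x.argmin()), N)
    x.flat[::N+1] = d           # restore the diagonal
    return i, j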

Re: Faster

Charles R Harris
In reply to this post by Keith Goodman


On Fri, May 2, 2008 at 8:02 PM, Keith Goodman <[hidden email]> wrote:
> [...]
>
> If the clustering gives me useful results, I'll ask you about your
> boost code. I'll also take a look at Damian Eads's scipy-cluster.

That package looks nice. I think your time would be better spent learning how to use it than in rolling your own routines.

Chuck



Re: Faster

Keith Goodman
In reply to this post by Robert Kern-2
On Fri, May 2, 2008 at 7:25 PM, Robert Kern <[hidden email]> wrote:

> [...]
>
>  In [5]: def kerndist(x):
>    ...:     N = x.shape[0]
>    ...:     x.flat[::N+1] = x.max()
>    ...:     ij = argmin(x.flat)
>    ...:     i, j = divmod(ij, N)
>    ...:     return i, j
>
>  In [15]: %timeit dist(x)
>  10 loops, best of 3: 19.9 ms per loop
>
>  In [16]: %timeit kerndist(x)
>  100 loops, best of 3: 4.38 ms per loop

This change and your previous one cut the run time from 5.28 to 2.23
seconds. Thank you.

Re: Faster

Keith Goodman
In reply to this post by Charles R Harris
On Fri, May 2, 2008 at 7:36 PM, Charles R Harris
<[hidden email]> wrote:
> On Fri, May 2, 2008 at 8:02 PM, Keith Goodman <[hidden email]> wrote:
> > If the clustering gives me useful results, I'll ask you about your
> > boost code. I'll also take a look at Damian Eads's scipy-cluster.
>
> That package looks nice. I think your time would be better spent learning
> how to use it than in rolling your own routines.

Yeah, but this is more fun.

Re: Faster

Keith Goodman
In reply to this post by Robert Kern-2
On Fri, May 2, 2008 at 7:25 PM, Robert Kern <[hidden email]> wrote:
>  In [5]: def kerndist(x):
>    ...:     N = x.shape[0]
>    ...:     x.flat[::N+1] = x.max()
>    ...:     ij = argmin(x.flat)
>    ...:     i, j = divmod(ij, N)
>    ...:     return i, j

I replaced

ij = argmin(x.flat)

with

x.argmin()

(they're the same in this context, right?) for a slight speed up. Now
I'm down to 1.9 seconds.

>> timeit argmin(x.flat)
10000 loops, best of 3: 24.8 µs per loop
>> timeit argmin(x)
100000 loops, best of 3: 9.19 µs per loop
>> timeit x.argmin()
100000 loops, best of 3: 8.06 µs per loop

OK, enough way-too-early optimization. But I'll definitely add i, j =
divmod(x.argmin(), x.shape[0]) to my toolbelt.

Re: Faster

Robert Kern-2
On Fri, May 2, 2008 at 10:48 PM, Keith Goodman <[hidden email]> wrote:

> On Fri, May 2, 2008 at 7:25 PM, Robert Kern <[hidden email]> wrote:
>  >  In [5]: def kerndist(x):
>  >    ...:     N = x.shape[0]
>  >    ...:     x.flat[::N+1] = x.max()
>  >    ...:     ij = argmin(x.flat)
>  >    ...:     i, j = divmod(ij, N)
>  >    ...:     return i, j
>
>  I replaced
>
>  ij = argmin(x.flat)
>
>  with
>
>  x.argmin()
>
>  (they're the same in this context, right?) for a slight speed up. Now
>  I'm down to 1.9 seconds.

Yes, they're the same. I forgot that .argmin() flattened by default.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco

Re: Faster

Hoyt Koepke
In reply to this post by Keith Goodman
You know, for linkage clustering and BHC, I've found it a lot easier
to work with an intermediate 1d map of indices and never resize the
distance matrix. I then just remove one element from this map at each
iteration, which is a LOT faster than removing a column and a row from
a matrix. If idxmap were the map, you would be working with
X[idxmap[i], idxmap[j]] instead of X[i, j]. Also, you could even use a
Python list in this case, as lists are a lot better for deletions than
an array.

--Hoyt

On Fri, May 2, 2008 at 5:47 PM, Keith Goodman <[hidden email]> wrote:

> [...]
>
>  Single linkage clustering; x is the distance matrix.



--
+++++++++++++++++++++++++++++++++++
Hoyt Koepke
UBC Department of Computer Science
http://www.cs.ubc.ca/~hoytak/
[hidden email]
+++++++++++++++++++++++++++++++++++
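
A minimal sketch of Hoyt's index-map idea, with the matrix never resized; the function names are made up for illustration, and the merge rule assumed is single linkage as in Keith's class:

import numpy as np

def min_dist_mapped(X, idxmap):
    # X: full NxN distance matrix, never resized
    # idxmap: Python list of indices of the clusters still alive
    sub = X[np.ix_(idxmap, idxmap)]      # live submatrix (a copy)
    n = sub.shape[0]
    sub.flat[::n+1] = sub.max() + 1.0    # keep the diagonal out of the argmin
    return divmod(int(sub.argmin()), n)  # positions within idxmap, not X

def join_mapped(X, idxmap, i, j):
    # single linkage: the merged cluster keeps the smaller distance
    a, b = idxmap[i], idxmap[j]
    X[:, a] = np.minimum(X[:, a], X[:, b])
    X[a, :] = X[:, a]
    del idxmap[j]    # O(n) list deletion instead of an O(n^2) matrix copy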

Re: Faster

Keith Goodman
On Fri, May 2, 2008 at 11:51 PM, Hoyt Koepke <[hidden email]> wrote:
> You know, for linkage clustering and BHC, I've found it a lot easier
>  to work with an intermediate 1d map of indices and never resize the
>  distance matrix.  I then just remove one element from this map at each
>  iteration, which is a LOT faster than removing a column and a row from
>  a matrix.  if idxmap were the map, you would be working with
>  X[idxmap[i], idxmap[j] ] instead of X[i, j].  Also, you could even use
>  a python list in this case, as they are a lot better for deletions
>  than an array.

I thought of replacing the row and column with 1e10 instead of
deleting them. But your idea is better. If I use lists then I bet Psyco
will speed things up. Thanks for the tip.
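
The masking variant Keith mentions, for comparison with the index-map approach (a sketch; BIG and join_masked are illustrative names, and the merge rule is again single linkage):

import numpy as np

BIG = 1e10

def join_masked(dist, i, j):
    # fold cluster j into cluster i, then mask j's row and column
    # instead of removing them; the matrix stays square
    dist[:, i] = np.minimum(dist[:, i], dist[:, j])
    dist[i, :] = dist[:, i]
    dist[j, :] = BIG
    dist[:, j] = BIG

The diagonal still has to be masked when searching for the minimum, just as in min_dist.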

Re: Faster

Chris Barker - NOAA Federal
In reply to this post by Robert Kern-2
Robert Kern wrote:
> I can get a ~20% improvement with the following:

> In [9]: def mycut(x, i):
>    ...:     A = x[:i,:i]
>    ...:     B = x[:i,i+1:]
>    ...:     C = x[i+1:,:i]
>    ...:     D = x[i+1:,i+1:]
>    ...:     return hstack([vstack([A,C]),vstack([B,D])])

Might it be a touch faster to build the final array first, then fill it:

def mycut(x, i):
     r,c = x.shape
     out = np.empty((r-1, c-1), dtype=x.dtype)
     out[:i,:i] = x[:i,:i]
     out[:i,i:] = x[:i,i+1:]
     out[i:,:i] = x[i+1:,:i]
     out[i:,i+1:] = x[i+1:,i+1:]
     return out

totally untested.

That should save the creation of two temporaries.

-Chris


--
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Re: Faster

Robert Kern-2
On Sat, May 3, 2008 at 7:05 PM, Christopher Barker
<[hidden email]> wrote:

> [...]
>
>  Might it be a touch faster to build the final array first, then fill it:
>
>  def mycut(x, i):
>      r,c = x.shape
>      out = np.empty((r-1, c-1), dtype=x.dtype)
>      out[:i,:i] = x[:i,:i]
>      out[:i,i:] = x[:i,i+1:]
>      out[i:,:i] = x[i+1:,:i]
>      out[i:,i+1:] = x[i+1:,i+1:]
>      return out
>
>  totally untested.
>
>  That should save the creation of two temporaries.

After fixing the last statement to "out[i:,i:] = ...", yes, much
faster. About a factor of 5 with N=500.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
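
For completeness, Chris's function with that one fix applied; this is the version the factor-of-5 figure refers to:

import numpy as np

def mycut(x, i):
    # drop row i and column i by copying the four surrounding
    # blocks into a preallocated output array
    r, c = x.shape
    out = np.empty((r-1, c-1), dtype=x.dtype)
    out[:i, :i] = x[:i, :i]
    out[:i, i:] = x[:i, i+1:]
    out[i:, :i] = x[i+1:, :i]
    out[i:, i:] = x[i+1:, i+1:]   # the corrected last statement
    return out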

Re: Faster

Keith Goodman
In reply to this post by Chris Barker - NOAA Federal
On Sat, May 3, 2008 at 5:05 PM, Christopher Barker
<[hidden email]> wrote:

> [...]
>
>  Might it be a touch faster to build the final array first, then fill it:
>
>  def mycut(x, i):
>      r,c = x.shape
>      out = np.empty((r-1, c-1), dtype=x.dtype)
>      out[:i,:i] = x[:i,:i]
>      out[:i,i:] = x[:i,i+1:]
>      out[i:,:i] = x[i+1:,:i]
>      out[i:,i+1:] = x[i+1:,i+1:]
>      return out
>
>  totally untested.
>
>  That should save the creation of two temporaries.

Initializing the array makes sense. And it is super fast:

>> timeit mycut(x, 6)
100 loops, best of 3: 7.48 ms per loop
>> timeit mycut2(x, 6)
1000 loops, best of 3: 1.5 ms per loop

The time it takes to cluster went from about 1.9 seconds to 0.7
seconds! Thank you.

When I run the single linkage clustering on my data I get one big
cluster and a bunch of tiny clusters. So I need to try a different
linkage method. Average linkage sounds good, but it sounds hard to
code.
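
As a pointer on that last point: Damian Eads's scipy-cluster (whose code later became scipy.cluster.hierarchy) already implements average linkage. A hedged sketch using today's scipy import path; the cluster count of 10 is an arbitrary example:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

n = 500
x = np.random.rand(n, n)
dist = x + x.T
dist[np.arange(n), np.arange(n)] = 0.0   # squareform wants a zero diagonal

y = squareform(dist)                     # condensed distance vector
Z = linkage(y, method='average')         # average-linkage merge tree
labels = fcluster(Z, t=10, criterion='maxclust')  # cut into 10 flat clusters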