distance_matrix: how to speed up?

distance_matrix: how to speed up?

Dear all,

I need to speed up this function (a small example follows):

------
import numpy as N

def distance_matrix(data1, data2, weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows, columns))
    for i in range(rows):
        for j in range(columns):
            dm[i, j] = ((data1[i, :] - data2[j, :])**2 * weights).sum()
    return dm

size1 = 4
size2 = 3
dimensions = 2
data1 = N.random.rand(size1, dimensions)
data2 = N.random.rand(size2, dimensions)
weights = N.random.rand(dimensions)
dm = distance_matrix(data1, data2, weights)
print(dm)
------

The distance_matrix function computes the weighted (squared) Euclidean distances between each pair of vectors from two sets (data1, data2). This naive algorithm is extremely slow for my standard use, i.e., when size1 and size2 are on the order of 1000 or more. It can be improved using N.subtract.outer:

------
def distance_matrix_faster(data1, data2, weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows, columns))
    # accumulate the weighted squared differences one dimension at a time
    for i in range(data1.shape[1]):
        dm += N.subtract.outer(data1[:, i], data2[:, i])**2 * weights[i]
    return dm
------

This algorithm becomes slow when the number of dimensions (i.e., data1.shape[1]) is large (i.e., >1000), due to the Python loop. To speed it up, I guess that N.subtract.outer could be applied to the full matrices instead of one column at a time, but then there is a memory issue: 'outer' would allocate far too much memory, since it stores all pairwise differences along every dimension at once. This is clearly unnecessary.

Is there a NumPy way to avoid all Python loops without wasting too much memory? As a comparison, I coded the same algorithm in C through weave (inline): it is _much_ faster and requires just the memory to store the result. But I'd prefer not to use C or weave if possible.

Thanks in advance for any help,

Emanuele
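A minimal timing sketch, assuming the two functions above are defined, to compare the versions (the sizes here are arbitrary, not from the original post):

------
import time
import numpy as N

size1 = size2 = 500
dimensions = 10
data1 = N.random.rand(size1, dimensions)
data2 = N.random.rand(size2, dimensions)
weights = N.random.rand(dimensions)

# time each implementation on the same inputs
for f in (distance_matrix, distance_matrix_faster):
    t0 = time.time()
    dm = f(data1, data2, weights)
    print('%s: %.3f s' % (f.__name__, time.time() - t0))
------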

Re: distance_matrix: how to speed up?

Hi,

Bill Baxter proposed a version of this problem some months ago on this ML. I use it regularly and it is fast enough for me.

Matthieu

2008/5/21 Emanuele Olivetti <[hidden email]>:
> [original message quoted in full; snipped]

--
French PhD student
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
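Bill Baxter's version (reconstructed here as a sketch; the exact code he posted may differ) replaces the pairwise loop with a single matrix product, using the expansion (x-y)^2 = x*x - 2*x*y + y*y:

------
import numpy as N

def dist2(x, y):
    # squared euclidean distances between the rows of x and the rows of y
    d = (x*x).sum(1)[:, None] - 2*N.dot(x, y.T) + (y*y).sum(1)
    d[d < 0] = 0  # clamp tiny negatives caused by floating-point cancellation
    return d
------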

Re: distance_matrix: how to speed up?

In reply to this post by Emanuele Olivetti-3

I think you want something like this:

------
x1 = x1 * weights[np.newaxis, :]
x2 = x2 * weights[np.newaxis, :]

x1 = x1[np.newaxis, :, :]
x2 = x2[:, np.newaxis, :]
distance = np.sqrt(((x1 - x2)**2).sum(axis=-1))
------

x1 and x2 are arrays with shape (npoints, ndimensions), and npoints can be different for each array. I'm not sure I did your weights right, but that part shouldn't be so difficult.

On May 21, 2008, at 2:39 PM, Emanuele Olivetti wrote:
> [original message quoted in full; snipped]

--
Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331
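One caveat about the weights in the snippet above: scaling the data by weights before squaring multiplies each term by weights**2, not weights. To get sum(weights*(x-y)**2), the data should be scaled by sqrt(weights). A corrected sketch of the broadcasting approach (function name and shapes chosen for illustration, not from the thread):

------
import numpy as np

def distance_matrix_broadcast(data1, data2, weights):
    # scale by sqrt(weights) so (x - y)**2 carries a factor of weights
    w = np.sqrt(weights)
    x1 = (data1 * w)[np.newaxis, :, :]  # shape (1, size1, dims)
    x2 = (data2 * w)[:, np.newaxis, :]  # shape (size2, 1, dims)
    # note: the intermediate holds size1*size2*dims floats, which is
    # exactly the memory issue raised in the first message
    return ((x1 - x2)**2).sum(axis=-1)  # shape (size2, size1)
------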

Re: distance_matrix: how to speed up?

In reply to this post by Matthieu Brucher-2

Matthieu Brucher wrote:
> Hi,
>
> Bill Baxter proposed a version of this problem some months ago on this
> ML. I use it regularly and it is fast enough for me.

Excellent. Exactly what I was looking for.

Thanks,

Emanuele

Re: distance_matrix: how to speed up?

In reply to this post by Rob Hetland

Rob Hetland wrote:
> I think you want something like this:
>
> x1 = x1 * weights[np.newaxis, :]
> x2 = x2 * weights[np.newaxis, :]
>
> x1 = x1[np.newaxis, :, :]
> x2 = x2[:, np.newaxis, :]
> distance = np.sqrt(((x1 - x2)**2).sum(axis=-1))
>
> x1 and x2 are arrays with shape (npoints, ndimensions), and npoints
> can be different for each array. I'm not sure I did your weights
> right, but that part shouldn't be so difficult.

The weights seem not quite right, but anyway here is the solution, adapted from Bill Baxter's:

------
def distance_matrix_final(data1, data2, weights):
    data1w = data1 * weights
    dm = ((data1w * data1).sum(1)[:, None]
          - 2 * N.dot(data1w, data2.T)
          + (data2 * data2 * weights).sum(1))
    dm[dm < 0] = 0  # clamp negatives from floating-point cancellation
    return dm
------

This solution is super-fast, stable, and uses little memory. It is based on the identity:

(x-y)^2*w = x*x*w - 2*x*y*w + y*y*w

For size1 = size2 = dimensions = 1000 it requires ~0.6 sec to compute on my dual-core machine. That is two orders of magnitude faster than my previous solution, but one to two orders of magnitude slower than using C with weave.inline.

Definitely good enough for me.

Emanuele
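A quick correctness check (a sketch, assuming the naive distance_matrix from the first message is also defined):

------
import numpy as N

data1 = N.random.rand(50, 7)
data2 = N.random.rand(40, 7)
weights = N.random.rand(7)

# the fast version should agree with the naive double loop
assert N.allclose(distance_matrix(data1, data2, weights),
                  distance_matrix_final(data1, data2, weights))
------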

Re: distance_matrix: how to speed up?

Emanuele Olivetti wrote:
> This solution is super-fast, stable, and uses little memory.
> It is based on the identity:
>
> (x-y)^2*w = x*x*w - 2*x*y*w + y*y*w
>
> For size1 = size2 = dimensions = 1000 it requires ~0.6 sec to compute
> on my dual-core machine. That is two orders of magnitude faster than my
> previous solution, but one to two orders of magnitude slower than using
> C with weave.inline.
>
> Definitely good enough for me.
>
> Emanuele

Reading this thread, I remembered having tried scipy's sandbox.rbf (radial basis function) to interpolate a pretty large, multidimensional dataset, filling in the missing data points. That attempt soon failed with out-of-memory errors which, if I remember correctly, came from the fairly straightforward distance calculation between data points used in that package. Being no math wonder, I assumed there was simply no way to calculate distances without using much memory, and ended my rbf experiments.

To make a long story short: correct me if I am wrong, but might it be an idea to use the above solution in scipy.sandbox.rbf?

Vincent.
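If memory is still the bottleneck even with the dot-product identity, a common remedy is to compute the matrix in row blocks. A sketch under that assumption (distance_matrix_chunked and the block size are illustrative, not from the thread):

------
import numpy as N

def distance_matrix_chunked(data1, data2, weights, block=256):
    # same identity as above, applied one block of rows at a time
    # so peak temporary memory stays around block*size2 floats
    rows = data1.shape[0]
    dm = N.zeros((rows, data2.shape[0]))
    y_term = (data2 * data2 * weights).sum(1)
    for start in range(0, rows, block):
        stop = min(start + block, rows)
        d1 = data1[start:stop]
        d1w = d1 * weights
        dm[start:stop] = ((d1w * d1).sum(1)[:, None]
                          - 2 * N.dot(d1w, data2.T) + y_term)
    dm[dm < 0] = 0  # clamp negatives from floating-point cancellation
    return dm
------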