distance_matrix: how to speed up?


distance_matrix: how to speed up?

Emanuele Olivetti-3
Dear all,

I need to speed up this function (a little example follows):
------
import numpy as N
def distance_matrix(data1,data2,weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows,columns))
    for i in range(rows):
        for j in range(columns):
            dm[i,j] = ((data1[i,:]-data2[j,:])**2*weights).sum()
            pass
        pass
    return dm

size1 = 4
size2 = 3
dimensions = 2
data1 = N.random.rand(size1,dimensions)
data2 = N.random.rand(size2,dimensions)
weights = N.random.rand(dimensions)
dm = distance_matrix(data1,data2,weights)
print dm
------------------
The distance_matrix function computes the weighted (squared) Euclidean
distances between each pair of vectors from two sets (data1, data2).
This naive algorithm is extremely slow for my typical use, i.e., when
size1 and size2 are on the order of 1000 or more. It can be improved
using N.subtract.outer:

def distance_matrix_faster(data1,data2,weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows,columns))
    for i in range(data1.shape[1]):
        dm += N.subtract.outer(data1[:,i],data2[:,i])**2*weights[i]
        pass
    return dm

This algorithm becomes slow when the number of dimensions (i.e.,
data1.shape[1]) is large (e.g., >1000), because of the Python loop. To
speed it up, I guess N.subtract.outer could be applied to the full
matrices instead of one column at a time. But then there is a memory
issue: 'outer' would allocate far too much memory, since it stores the
differences for all possible combinations along all dimensions. This is
clearly unnecessary.
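
For illustration, a minimal sketch of a fully vectorized variant (written
with broadcasting; the function name is just illustrative): it avoids the
Python loops, but materializes an intermediate array of shape
(rows, columns, dimensions), which is what makes it memory-hungry for
large inputs.

import numpy as N

def distance_matrix_broadcast(data1, data2, weights):
    # diff has shape (rows, columns, dimensions): the large intermediate
    # array that makes this approach memory-hungry.
    diff = data1[:, N.newaxis, :] - data2[N.newaxis, :, :]
    return (diff**2 * weights).sum(axis=-1)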

Is there a NumPy way to avoid all Python loops without wasting too much
memory? For comparison I coded the same algorithm in C through weave
(inline): it is _much_ faster and requires only the memory to store the
result. But I'd prefer not to use C or weave if possible.

Thanks in advance for any help,


Emanuele


Re: distance_matrix: how to speed up?

Matthieu Brucher-2
Hi,

Bill Baxter posted a solution to this problem some months ago on this ML. I use it regularly and it is fast enough for me.

Matthieu

2008/5/21 Emanuele Olivetti <[hidden email]>:
<snip>



--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher

Re: distance_matrix: how to speed up?

Rob Hetland
In reply to this post by Emanuele Olivetti-3

I think you want something like this:

import numpy as np

x1 = x1 * weights[np.newaxis, :]
x2 = x2 * weights[np.newaxis, :]

x1 = x1[np.newaxis, :, :]
x2 = x2[:, np.newaxis, :]
distance = np.sqrt(((x1 - x2)**2).sum(axis=-1))

x1 and x2 are arrays of shape (npoints, ndimensions), and npoints can be
different for each array. I'm not sure I got your weights right, but that
part shouldn't be so difficult.
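
A note on the weights: multiplying x1 and x2 by the weights up front
weights the squared differences by weights**2 rather than weights. A
minimal check of an alternative that matches the weighting in the
original post, pre-scaling by sqrt(weights) instead (the sizes and values
below are just illustrative):

import numpy as np

w = np.random.rand(5)
a, b = np.random.rand(5), np.random.rand(5)
direct = ((a - b)**2 * w).sum()                       # weighting from the original post
prescale = ((a*np.sqrt(w) - b*np.sqrt(w))**2).sum()   # pre-scaled by sqrt(w)
assert np.allclose(direct, prescale)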


On May 21, 2008, at 2:39 PM, Emanuele Olivetti wrote:

<snip>

----
Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331




Re: distance_matrix: how to speed up?

Emanuele Olivetti-3
In reply to this post by Matthieu Brucher-2
Matthieu Brucher wrote:
> Hi,
>
> Bill Baxter posted a solution to this problem some months ago on this
> ML. I use it regularly and it is fast enough for me.
>

Excellent. Exactly what I was looking for.

Thanks,

Emanuele


Re: distance_matrix: how to speed up?

Emanuele Olivetti-3
In reply to this post by Rob Hetland
Rob Hetland wrote:

<snip>

The weights seem not quite right, but anyway here is the solution adapted
from Bill Baxter's:

def distance_matrix_final(data1, data2, weights):
    data1w = data1*weights
    dm = (data1w*data1).sum(1)[:,None] - 2*N.dot(data1w, data2.T) \
         + (data2*data2*weights).sum(1)
    dm[dm<0] = 0
    return dm
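
For reference, a quick sanity check (reusing the naive distance_matrix
from the first message; the sizes here are just illustrative) that the
fast version agrees with the naive one:

import numpy as N

size1, size2, dimensions = 40, 30, 5
data1 = N.random.rand(size1, dimensions)
data2 = N.random.rand(size2, dimensions)
weights = N.random.rand(dimensions)

# Both implementations should give the same weighted squared distances,
# up to floating-point round-off.
assert N.allclose(distance_matrix(data1, data2, weights),
                  distance_matrix_final(data1, data2, weights))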

This solution is super-fast, stable, and uses little memory.
It is based on the fact that:
(x-y)^2*w = x*x*w - 2*x*y*w + y*y*w

For size1=size2=dimensions=1000 it requires ~0.6 sec to compute
on my dual core duo. It is two orders of magnitude faster than my
previous solution, but 1-2 orders of magnitude slower than using
C with weave.inline.

Definitely good enough for me.


Emanuele


Re: distance_matrix: how to speed up?

Vincent Schut-2
Emanuele Olivetti wrote:
<snip>


Reading this thread, I remembered having tried scipy's sandbox.rbf
(radial basis function) to interpolate a pretty large, multidimensional
dataset, to fill in the missing data points. This, however, soon failed
with out-of-memory errors which, if I remember correctly, came from the
pretty straightforward distance calculation between the different data
points that is used in this package. Being no math wonder, I assumed
there simply was no way to calculate the distances without using a lot
of memory, and ended my rbf experiments.

To make a long story short: correct me if I am wrong, but might it be an
idea to use the above solution in scipy.sandbox.rbf?

Vincent.


Re: distance_matrix: how to speed up?

Rob Hetland

On May 22, 2008, at 9:45 AM, Vincent Schut wrote:

<snip>


Yes, this would be a very good substitution. Not only does it use less
memory, but in my quick tests it is about as fast or faster. Really,
though, both are pretty quick. There will still be memory limitations,
but you only need to store a matrix of shape (N, M) instead of
(NDIM, N, M), so for many dimensions there will be a big memory
improvement; for 3 dimensions or fewer the improvement will probably be
small.

I'm not sure where rbf lives anymore -- it's not in scikits. I have my
own version (parts of which got folded into the old scipy.sandbox
version), which I would be willing to share if there is interest.

Really, though, the rbf toolbox will not be limited by the memory of the
distance matrix. Later on, you need to do a large linear algebra 'solve',
like this:


r = norm(x, x)   # the distances between all of the ND points to each other
A = psi(r)       # psi is some divergent function, often the multiquadric:
                 #   sqrt((self.epsilon*r)**2 + 1)
coefs = linalg.solve(A, data)   # data has the length of x, one data
                                # point for each spatial point

# to find the interpolated data points at xi
ri = norm(xi, x)
Ai = psi(ri)
di = dot(Ai, coefs)


All in all, it is the 'linalg.solve' that kills you.
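
For readers following along, a minimal self-contained sketch of these
steps; it reuses the x*x - 2*x*y + y*y trick from earlier in the thread
for the distances and the multiquadric form above for psi. The function
names and the epsilon default are illustrative assumptions, not the
actual sandbox.rbf API.

import numpy as np
from numpy import linalg

def pairwise_dist(x, y):
    # Squared distances via the x*x - 2*x*y + y*y identity, then a sqrt;
    # clip tiny negative values that can appear from round-off.
    d2 = (x*x).sum(1)[:, None] - 2*np.dot(x, y.T) + (y*y).sum(1)
    return np.sqrt(np.maximum(d2, 0))

def rbf_interpolate(x, data, xi, epsilon=1.0):
    # Fit the RBF coefficients on the known points ...
    A = np.sqrt((epsilon*pairwise_dist(x, x))**2 + 1)   # multiquadric psi
    coefs = linalg.solve(A, data)
    # ... then evaluate the interpolant at the new points xi.
    Ai = np.sqrt((epsilon*pairwise_dist(xi, x))**2 + 1)
    return np.dot(Ai, coefs)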

-Rob

----
Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331


Re: distance_matrix: how to speed up?

Vincent Schut-2
Rob Hetland wrote:
> On May 22, 2008, at 9:45 AM, Vincent Schut wrote:
>
<snip>

>
> Really, though, the rbf toolbox will not be limited by the memory of  
> the distance matrix.  Later on, you need to do a large linear algebra  
> 'solve', like this:
>
> <snip>
>
> All in all, it is the 'linalg.solve' that kills you.

Ah, indeed, my memory was faulty, I'm afraid. It was in this phase that
it halted, not in the distance calculations.

Vincent.

