

Dear all,
I need to speed up this function (a little example follows):

import numpy as N

def distance_matrix(data1, data2, weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows, columns))
    for i in range(rows):
        for j in range(columns):
            dm[i, j] = ((data1[i, :] - data2[j, :])**2 * weights).sum()
            pass
        pass
    return dm

size1 = 4
size2 = 3
dimensions = 2
data1 = N.random.rand(size1, dimensions)
data2 = N.random.rand(size2, dimensions)
weights = N.random.rand(dimensions)
dm = distance_matrix(data1, data2, weights)
print dm

The distance_matrix function computes the weighted (squared) euclidean
distances between each pair of vectors from two sets (data1, data2).
The previous naive algorithm is extremely slow for my standard use,
i.e., when size1 and size2 are in the order of 1000 or more. It can be
improved using N.subtract.outer:
def distance_matrix_faster(data1, data2, weights):
    rows = data1.shape[0]
    columns = data2.shape[0]
    dm = N.zeros((rows, columns))
    for i in range(data1.shape[1]):
        dm += N.subtract.outer(data1[:, i], data2[:, i])**2 * weights[i]
        pass
    return dm
This algorithm becomes slow when dimensions (i.e., data1.shape[1]) is
big (i.e., >1000), due to the Python loop. In order to speed it up, I guess
that N.subtract.outer could be used on the full matrices instead of one
column at a time. But then there is a memory issue: 'outer' allocates
too much memory since it stores all possible combinations along all
dimensions. This is clearly unnecessary.
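For reference, a minimal sketch (reusing the small arrays from the example above) of why 'outer' on the full matrices allocates so much: a ufunc's outer produces an array whose shape is the concatenation of both input shapes, so only the single-column calls yield exactly what is needed.

```python
import numpy as N

data1 = N.random.rand(4, 2)
data2 = N.random.rand(3, 2)

# outer on the full 2-D arrays: combinations along all axes
full = N.subtract.outer(data1, data2)
print(full.shape)    # (4, 2, 3, 2) -- mostly unneeded intermediates

# outer on a single column, as in the loop version above
col = N.subtract.outer(data1[:, 0], data2[:, 0])
print(col.shape)     # (4, 3) -- just the result size
```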
Is there a NumPy way to avoid all Python loops without wasting
too much memory? As a comparison I coded the same algorithm in
C through weave (inline): it is _much_ faster and requires just
the memory to store the result. But I'd prefer not using C or weave
if possible.
Thanks in advance for any help,
Emanuele
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


Hi,

Bill Baxter proposed a version of this problem some months ago on this
ML. I use it regularly and it is fast enough for me.

Matthieu

2008/5/21 Emanuele Olivetti <[hidden email]>:
<snip>
--
French PhD student
Website : http://matthieubrucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher


I think you want something like this:
x1 = x1 * weights[np.newaxis, :]
x2 = x2 * weights[np.newaxis, :]

x1 = x1[np.newaxis, :, :]
x2 = x2[:, np.newaxis, :]
distance = np.sqrt(((x1 - x2)**2).sum(axis=-1))
x1 and x2 are arrays with shape (npoints, ndimensions), and npoints
can be different for each array. I'm not sure I did your weights
right, but that part shouldn't be so difficult.
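To make the shapes explicit, here is a self-contained sketch of this broadcasting approach, with one assumed change: the weights are applied to the squared differences (applying them to the points beforehand would weight by weights**2), and the sum runs over the last, dimensions axis.

```python
import numpy as np

rng = np.random.RandomState(0)
x1 = rng.rand(4, 2)            # (npoints1, ndimensions)
x2 = rng.rand(3, 2)            # (npoints2, ndimensions)
weights = rng.rand(2)

# broadcast (1, npoints1, ndim) against (npoints2, 1, ndim)
diff = x1[np.newaxis, :, :] - x2[:, np.newaxis, :]
distance = np.sqrt(((diff**2) * weights).sum(axis=-1))   # (npoints2, npoints1)

# the same thing with an explicit double loop, for checking
naive = np.array([[np.sqrt(((x1[i] - x2[j])**2 * weights).sum())
                   for i in range(4)] for j in range(3)])
```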
On May 21, 2008, at 2:39 PM, Emanuele Olivetti wrote:
> <snip>
--
Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331


Matthieu Brucher wrote:
> Hi,
>
> Bill Baxter proposed a version of this problem some months ago on this
> ML. I use it regularly and it is fast enough for me.
>
Excellent. Exactly what I was looking for.
Thanks,
Emanuele


Rob Hetland wrote:
> I think you want something like this:
>
> x1 = x1 * weights[np.newaxis, :]
> x2 = x2 * weights[np.newaxis, :]
>
> x1 = x1[np.newaxis, :, :]
> x2 = x2[:, np.newaxis, :]
> distance = np.sqrt(((x1 - x2)**2).sum(axis=-1))
>
> x1 and x2 are arrays with size of (npoints, ndimensions), and npoints
> can be different for each array. I'm not sure I did your weights
> right, but that part shouldn't be so difficult.
>
>
Weights seem not right, but anyway here is the solution adapted from
Bill Baxter's:

def distance_matrix_final(data1, data2, weights):
    data1w = data1 * weights
    dm = (data1w * data1).sum(1)[:, None] \
         - 2 * N.dot(data1w, data2.T) \
         + (data2 * data2 * weights).sum(1)
    dm[dm < 0] = 0
    return dm
This solution is super-fast, stable and uses little memory.
It is based on the fact that:

  (x-y)^2*w = x*x*w - 2*x*y*w + y*y*w

For size1=size2=dimensions=1000 it requires ~0.6 sec. to compute
on my dual core duo. It is 2 orders of magnitude faster than my
previous solution, but 1-2 orders of magnitude slower than using
C with weave.inline.

Definitely good enough for me.
Emanuele
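As a sanity check, the expansion-based version can be compared against the naive double loop from the start of the thread (a minimal sketch, reusing the function names from this message; both return squared distances, no sqrt):

```python
import numpy as N

def distance_matrix(data1, data2, weights):
    # naive double loop, used here only for checking
    dm = N.zeros((data1.shape[0], data2.shape[0]))
    for i in range(data1.shape[0]):
        for j in range(data2.shape[0]):
            dm[i, j] = ((data1[i, :] - data2[j, :])**2 * weights).sum()
    return dm

def distance_matrix_final(data1, data2, weights):
    # (x-y)^2*w = x*x*w - 2*x*y*w + y*y*w, term by term
    data1w = data1 * weights
    dm = ((data1w * data1).sum(1)[:, None]
          - 2 * N.dot(data1w, data2.T)
          + (data2 * data2 * weights).sum(1))
    dm[dm < 0] = 0  # clip tiny negatives due to rounding
    return dm

data1 = N.random.rand(5, 3)
data2 = N.random.rand(4, 3)
weights = N.random.rand(3)
assert N.allclose(distance_matrix(data1, data2, weights),
                  distance_matrix_final(data1, data2, weights))
```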


Emanuele Olivetti wrote:
<snip>
>
> This solution is super-fast, stable and uses little memory.
> It is based on the fact that:
> (x-y)^2*w = x*x*w - 2*x*y*w + y*y*w
>
> For size1=size2=dimensions=1000 it requires ~0.6 sec. to compute
> on my dual core duo. It is 2 orders of magnitude faster than my
> previous solution, but 1-2 orders of magnitude slower than using
> C with weave.inline.
>
> Definitely good enough for me.
>
>
> Emanuele
Reading this thread, I remembered having tried scipy's sandbox.rbf
(radial basis function) to interpolate a pretty large, multidimensional
dataset, to fill in the missing data points. This however failed soon
with out-of-memory errors, which, if I remember correctly, came from the
pretty straightforward distance calculation between the different data
points that is used in this package. Being no math wonder, I assumed
that there simply was no simple way to calculate distances without using
much memory, and ended my rbf experiments.

To make a long story short: correct me if I am wrong, but might it be an
idea to use the above solution in scipy.sandbox.rbf?
Vincent.


On May 22, 2008, at 9:45 AM, Vincent Schut wrote:
> Reading this thread, I remembered having tried scipy's sandbox.rbf
> (radial basis function) to interpolate a pretty large, multidimensional
> dataset, to fill in the missing data points.
>
> <snip>
>
> To make a long story short: correct me if I am wrong, but might it be an
> idea to use the above solution in scipy.sandbox.rbf?
Yes, this would be a very good substitution. Not only does it use
less memory, but in my quick tests it is about as fast or faster.
Really, though, both are pretty quick. There will still be memory
limitations, but you only need to store a matrix of (N, M), instead of
(NDIM, N, M), so for many dimensions there will be big memory
improvements. Probably only small improvements for 3 dimensions or
less.
I'm not sure where rbf lives anymore -- it's not in scikits. I have
my own version (parts of which got folded into the old scipy.sandbox
version), that I would be willing to share if there is interest.
Really, though, the rbf toolbox will not be limited by the memory of
the distance matrix. Later on, you need to do a large linear algebra
'solve', like this:
r = norm(x, x)  # the distances between all of the N-D points to each other
A = psi(r)      # where psi is some divergent function, often the
                # multiquadric function: sqrt((self.epsilon*r)**2 + 1)
coefs = linalg.solve(A, data)  # where data is the length of x, one
                               # data point for each spatial point

# to find the interpolated data points at xi
ri = norm(xi, x)
Ai = psi(ri)
di = dot(Ai, coefs)
All in all, it is the 'linalg.solve' that kills you.
Rob



Rob Hetland wrote:
> On May 22, 2008, at 9:45 AM, Vincent Schut wrote:
>
<snip>
>
> All in all, it is the 'linalg.solve' that kills you.
Ah, indeed, my memory was faulty, I'm afraid. It was in this phase that
it halted, not in the distance calculations.
Vincent.
>
> Rob

