

Hey,
I'm new to numpy but not new to python or programming in general. I
was wondering if there's a way of using numpy to do the following or
whether I've got what I've got and that's as good as it's going to
get.
I have two 2d arrays and I want to create another 2d array that
contains the values from the 2nd column of the first two arrays where
the values in the 1st column match. To elaborate with an example  if
I had an array a:
array([[2834, 1], [3282, 3], [6850, 2], [9458, 2]])
and an array b:
array([[2834, 3], [3282, 5], [4444, 5], [9458, 3], [9999, 4], [11111,
5], [12345, 1]])
then I'd want the result to be
array([[1, 3], # from 2834
[3, 5], # from 3282
[2, 3]]) # from 9458
This is what I have at the moment:
results = []
while aind < amax and bind < bmax:
if a[aind, 0] < b[bind, 0]:
aind += 1
elif a[aind, 0] > b[bind, 0]:
bind += 1
else:
results.append([a[aind, 1], b[bind, 1]])
aind += 1
bind += 1
results = array(results)
Where aind = bind = 0, amax = a.shape[0] and bmax = b.shape[0].
Any tips/pointers/speedups?
Cheers,
Matt
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 2:04 AM, Matt Crane < [hidden email]> wrote:
> Hey,
>
> I'm new to numpy but not new to python or programming in general. I
> was wondering if there's a way of using numpy to do the following or
> whether I've got what I've got and that's as good as it's going to
> get.
>
> I have two 2d arrays and I want to create another 2d array that
> contains the values from the 2nd column of the first two arrays where
> the values in the 1st column match. To elaborate with an example  if
> I had an array a:
>
> array([[2834, 1], [3282, 3], [6850, 2], [9458, 2]])
> and an array b:
>
> array([[2834, 3], [3282, 5], [4444, 5], [9458, 3], [9999, 4], [11111,
> 5], [12345, 1]])
>
> then I'd want the result to be
>
> array([[1, 3], # from 2834
> [3, 5], # from 3282
> [2, 3]]) # from 9458
Are the matching rows always going to be the same row in each? I.e.
you want rows i such that a[i,0]==b[i,0] rather than trying to find
all i,j such that a[i,0]==b[j,0]?
If so, then I would do the following:
In [1]: from numpy import *
In [2]: a = array([[2834, 1], [3282, 3], [6850, 2], [9458, 2]])
In [3]: b = array([[2834, 3], [3282, 5], [4444, 5], [9458, 3], [9999,
4], [11111,
...: 5], [12345, 1]])
In [4]: minlength = min(a.shape[0], b.shape[0])
In [5]: matching = nonzero(a[:minlength,0] == b[:minlength,0])[0]
In [6]: matching
Out[6]: array([0, 1, 3])
In [7]: column_stack([a[matching,1], b[matching,1]])
Out[7]:
array([[1, 3],
[3, 5],
[2, 3]])

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 7:19 PM, Robert Kern < [hidden email]> wrote:
> Are the matching rows always going to be the same row in each? I.e.
> you want rows i such that a[i,0]==b[i,0] rather than trying to find
> all i,j such that a[i,0]==b[j,0]?
>
> If so, then I would do the following:
>
>
> In [1]: from numpy import *
>
> In [2]: a = array([[2834, 1], [3282, 3], [6850, 2], [9458, 2]])
>
> In [3]: b = array([[2834, 3], [3282, 5], [4444, 5], [9458, 3], [9999,
> 4], [11111,
> ...: 5], [12345, 1]])
>
> In [4]: minlength = min(a.shape[0], b.shape[0])
>
> In [5]: matching = nonzero(a[:minlength,0] == b[:minlength,0])[0]
>
> In [6]: matching
> Out[6]: array([0, 1, 3])
>
> In [7]: column_stack([a[matching,1], b[matching,1]])
> Out[7]:
> array([[1, 3],
> [3, 5],
> [2, 3]])
>
> 
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  Umberto Eco
> _______________________________________________
> Numpydiscussion mailing list
> [hidden email]
> http://projects.scipy.org/mailman/listinfo/numpydiscussion>
Sorry, I should have mentioned that no, the matching rows won't always
be in the same position.
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 2:59 AM, Matt Crane < [hidden email]> wrote:
> Sorry, I should have mentioned that no, the matching rows won't always
> be in the same position.
Okay, then it's just a little bit more complicated.
In [18]: from numpy import *
In [19]: a = array([[1, 10], [2, 20], [3, 30], [3, 40], [4, 50]])
In [20]: b = array([[2, 60], [1, 70], [5, 80], [6, 90], [7, 100], [3, 110]])
In [21]: m = a[:,0] == b[:,0][:,newaxis]
In [22]: m
Out[22]:
array([[False, True, False, False, False],
[ True, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, True, True, False]], dtype=bool)
In [23]: i, j = nonzero(m)
In [24]: i
Out[24]: array([0, 1, 5, 5])
In [25]: j
Out[25]: array([1, 0, 2, 3])
In [26]: column_stack([a[j,1], b[i,1]])
Out[26]:
array([[ 20, 60],
[ 10, 70],
[ 30, 110],
[ 40, 110]])

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 8:08 PM, Robert Kern < [hidden email]> wrote:
> Okay, then it's just a little bit more complicated.
Thanks, and that's going to be faster  the method that I posted is
linear in terms of the length of the two lists? Given that the values
in the first column are monotonically increasing (again something I
should have mentioned  I blame a lack of caffeine)  could we make
it faster?
Thanks, for everything up to this point though.
Matt
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 3:29 AM, Matt Crane < [hidden email]> wrote:
> On Sun, May 18, 2008 at 8:08 PM, Robert Kern < [hidden email]> wrote:
>> Okay, then it's just a little bit more complicated.
>
> Thanks, and that's going to be faster  the method that I posted is
> linear in terms of the length of the two lists?
It depends on the sizes.
> Given that the values
> in the first column are monotonically increasing (again something I
> should have mentioned  I blame a lack of caffeine)  could we make
> it faster?
Are there repeats?

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 8:52 PM, Robert Kern < [hidden email]> wrote:
> It depends on the sizes.
The sizes could range from 3 to 240000 with an average of around 5500.
> Are there repeats?
No, no repeats in the first column.
I'm going to go get a cup of coffee before I forget to leave out any
potentially vital information again. It's going to be a long day.
Thanks,
Matt
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 4:02 AM, Matt Crane < [hidden email]> wrote:
> On Sun, May 18, 2008 at 8:52 PM, Robert Kern < [hidden email]> wrote:
>> It depends on the sizes.
> The sizes could range from 3 to 240000 with an average of around 5500.
A 240000x240000 boolean matrix will probably be too slow.
>> Are there repeats?
> No, no repeats in the first column.
Great! So let's use searchsorted() to find potential indices where the
two first columns are equal. We pull out the values at those indices
and actually do the comparison to get a boolean mask where there is an
equality. Do both a.searchsorted(b) and b.searchsorted(a) to get the
appropriate masks on b and a respectively. The number of True elements
will be the same for both. Now just apply the masks to the second
columns.
In [20]: a = array([[2, 10], [4, 20], [6, 30], [8, 40], [10, 50]])
In [21]: b = array([[2, 60], [3, 70], [4, 80], [5, 90], [8, 100], [10, 110]])
In [22]: a[b[b[:,0].searchsorted(a[:,0]),0] == a[:,0], 1]
Out[22]: array([10, 20, 40, 50])
In [23]: b[a[a[:,0].searchsorted(b[:,0]),0] == b[:,0], 1]
Out[23]: array([ 60, 80, 100, 110])
In [24]: column_stack([Out[22], Out[23]])
Out[24]:
array([[ 10, 60],
[ 20, 80],
[ 40, 100],
[ 50, 110]])

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


2008/5/18 Matt Crane < [hidden email]>:
> On Sun, May 18, 2008 at 8:52 PM, Robert Kern < [hidden email]> wrote:
>> Are there repeats?
> No, no repeats in the first column.
>
> I'm going to go get a cup of coffee before I forget to leave out any
> potentially vital information again. It's going to be a long day.
It can be done, though I had to be kind of devious. My solution might
not even be O(n log n), depending on how mergesort is implemented:
def find_matches(A,B):
"""Find positions A_idx and B_idx so that A[A_idx]==B[B_idx]
A and B should be sorted arrays with no repeats; A_idx and
B_idx will be all the places they agree.
>>> import numpy as np
>>> A = np.array([1,2,4,5,7])
>>> B = np.array([1,3,5,9])
>>> find_matches(A,B)
array([0,3]), array([0,2])
"""
AB = np.concatenate((A,B))
idx = np.argsort(np.concatenate((A,B)),kind="mergesort")
sorted = AB[idx]
pairs = np.where(np.diff(sorted)==0)[0]
A_pairs = idx[pairs]
B_pairs = idx[pairs+1]len(A)
if np.any(A_pairs>=len(A)):
raise ValueError, "first array contains a repeated element"
if np.any(B_pairs<0):
raise ValueError, "second array contains a repeated element"
return A_pairs, B_pairs
The idea is that diff is not a bad way to find repeats in a sequence;
so concatenate the two sequences and sort them (if you're lucky
mergesort will be fast because they're individually sorted, but in any
case we need stability). Of course, you have to take the pairs in the
sorted sequence and figure out where they came from, but fortunately
you don't even need to invert the permutation returned by argsort (do
we have a tool to invert permutations, or should one just use argsort
again?).
Anne
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Sun, May 18, 2008 at 9:14 PM, Anne Archibald
< [hidden email]> wrote:
> 2008/5/18 Matt Crane < [hidden email]>:
>> On Sun, May 18, 2008 at 8:52 PM, Robert Kern < [hidden email]> wrote:
>>> Are there repeats?
>> No, no repeats in the first column.
>>
>> I'm going to go get a cup of coffee before I forget to leave out any
>> potentially vital information again. It's going to be a long day.
>
> It can be done, though I had to be kind of devious. My solution might
> not even be O(n log n), depending on how mergesort is implemented:
>
Although this O(n log n) solution is written in numpy  and let's
assume that the mergesort is implemented to give that.
That's O(n log n) where n is the combined sizes of the arrays 
although given that they are already sorted it might be straight
linear because it only has to merge them? I know for sure that the
solution that I originally posted was linear in terms of the combined
sizes. While it might be slow because it's written in straight python
 it perhaps might be quicker if one were to use scipy.weave
blitz/inline?
I think I might leave the discussion here until I can get some
benchmarking on the different alternatives  thanks all.
Matt
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion

