vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

Nico Schlömer
Hi everyone,

When trying to speed up my code, I noticed that simply by reordering my data I could get more than twice as fast for the simplest operations:
```
import numpy
a = numpy.random.rand(50, 50, 50)

%timeit a[0] + a[1]
1000000 loops, best of 3: 1.7 µs per loop

%timeit a[:, 0] + a[:, 1]
100000 loops, best of 3: 4.42 µs per loop

%timeit a[..., 0] + a[..., 1]
100000 loops, best of 3: 5.99 µs per loop
```
This makes sense considering the fact that, by default, NumPy features C-style memory allocation: the last index runs fastest. The blocks that are added with `a[0] + a[1]` are contiguous in memory, so cache is nicely made use of. As opposed to that, the blocks that are added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the correct explanation?

If yes, I'm wondering why most numpy.linalg methods, when vectorized, put the vector index up front. E.g., to mass-compute determinants, one has to do
```
a = numpy.random.rand(777, 3, 3)
numpy.linalg.det(a)
```
This way, all 3x3 matrices form a memory block, so if you do `det` block by block, that'll be fine. However, vectorized operations (like `+` above) will be slower than necessary.
Any background on this? 

(I came across this when having to rearrange my data (swapaxes, rollaxis) from shape (3, 3, 777) (which allows for fast vectorized operations in the rest of the code) to (777, 3, 3) for using numpy's svd.)

Cheers,
Nico

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

Sebastian Berg
On Thu, 2017-03-02 at 10:27 +0000, Nico Schlömer wrote:

> Hi everyone,
>
> When trying to speed up my code, I noticed that simply by reordering
> my data I could get more than twice as fast for the simplest
> operations:
> ```
> import numpy
> a = numpy.random.rand(50, 50, 50)
>
> %timeit a[0] + a[1]
> 1000000 loops, best of 3: 1.7 µs per loop
>
> %timeit a[:, 0] + a[:, 1]
> 100000 loops, best of 3: 4.42 µs per loop
>
> %timeit a[..., 0] + a[..., 1]
> 100000 loops, best of 3: 5.99 µs per loop
> ```
> This makes sense considering the fact that, by default, NumPy
> features C-style memory allocation: the last index runs fastest. The
> blocks that are added with `a[0] + a[1]` are contiguous in memory, so
> cache is nicely made use of. As opposed to that, the blocks that are
> added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with
> `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the
> correct explanation?
>
> If yes, I'm wondering why most numpy.linalg methods, when vectorized,
> put the vector index up front. E.g., to mass-compute determinants,
> one has to do
> ```
> a = numpy.random.rand(777, 3, 3)
> numpy.linalg.det(a)
> ```
> This way, all 3x3 matrices form a memory block, so if you do `det`
> block by block, that'll be fine. However, vectorized operations (like
> `+` above) will be slower than necessary.
> Any background on this? 
>
I am honestly not sure where you are going at. This order seems the
more natural order for most operations. Also numpy does not create
copies normally even for transposed data (though some functions may
internally, numpy just _defaults_ to C-order). So of course what is
faster depends on your use-case, but if you have an operation on many
3x3 arrays the way numpy does it is the more natural way. If you do
other things that are faster the other way around, you will have to
decide which operation is the more important one overall.

- Sebastian

> (I came across this when having to rearrange my data (swapaxes,
> rollaxis) from shape (3, 3, 777) (which allows for fast vectorized
> operations in the rest of the code) to (777, 3, 3) for using numpy's
> svd.)
>
> Cheers,
> Nico
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion

signature.asc (817 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

Nico Schlömer
I am honestly not sure where you are going at. This order seems the more natural order for most operations. 

Not sure what you mean by "natural". The most basic operations like `a[0] + a[1]` are several times faster than `a[...,0] + a[..., 1]`. Perhaps one can come up with code examples that don't use those operations at all, but I would guess that'll be a rare case. My point: If your code uses vector operations (as `+`) _and_ numpy.linalg functions, your vector operations will be several times slower than necessary. One way around this would perhaps be to initialize all your data in Fortran-style data layout, where `a[...,0] + a[..., 1]` is indeed faster than `a[0] + a[1]`.

What I'll end up doing is probably to rewrite whatever numpy.linalg function I need for C-style ordering. For dets of 2x2 or 3x3 matrices, this shouldn't be to hard (`a[0][0] +a[1][1] - a[0][1] - a[1][0]`). :)

Cheers,
Nico


On Sun, Mar 5, 2017 at 3:53 PM Sebastian Berg <[hidden email]> wrote:
On Thu, 2017-03-02 at 10:27 +0000, Nico Schlömer wrote:
> Hi everyone,
>
> When trying to speed up my code, I noticed that simply by reordering
> my data I could get more than twice as fast for the simplest
> operations:
> ```
> import numpy
> a = numpy.random.rand(50, 50, 50)
>
> %timeit a[0] + a[1]
> 1000000 loops, best of 3: 1.7 µs per loop
>
> %timeit a[:, 0] + a[:, 1]
> 100000 loops, best of 3: 4.42 µs per loop
>
> %timeit a[..., 0] + a[..., 1]
> 100000 loops, best of 3: 5.99 µs per loop
> ```
> This makes sense considering the fact that, by default, NumPy
> features C-style memory allocation: the last index runs fastest. The
> blocks that are added with `a[0] + a[1]` are contiguous in memory, so
> cache is nicely made use of. As opposed to that, the blocks that are
> added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with
> `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the
> correct explanation?
>
> If yes, I'm wondering why most numpy.linalg methods, when vectorized,
> put the vector index up front. E.g., to mass-compute determinants,
> one has to do
> ```
> a = numpy.random.rand(777, 3, 3)
> numpy.linalg.det(a)
> ```
> This way, all 3x3 matrices form a memory block, so if you do `det`
> block by block, that'll be fine. However, vectorized operations (like
> `+` above) will be slower than necessary.
> Any background on this? 
>

I am honestly not sure where you are going at. This order seems the
more natural order for most operations. Also numpy does not create
copies normally even for transposed data (though some functions may
internally, numpy just _defaults_ to C-order). So of course what is
faster depends on your use-case, but if you have an operation on many
3x3 arrays the way numpy does it is the more natural way. If you do
other things that are faster the other way around, you will have to
decide which operation is the more important one overall.

- Sebastian

> (I came across this when having to rearrange my data (swapaxes,
> rollaxis) from shape (3, 3, 777) (which allows for fast vectorized
> operations in the rest of the code) to (777, 3, 3) for using numpy's
> svd.)
>
> Cheers,
> Nico
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.scipy.org/mailman/listinfo/numpy-discussion_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

Sebastian Berg
On Sun, 2017-03-05 at 15:46 +0000, Nico Schlömer wrote:

> > I am honestly not sure where you are going at. This order seems
> the more natural order for most operations. 
>
> Not sure what you mean by "natural". The most basic operations
> like `a[0] + a[1]` are several times faster than `a[...,0] + a[...,
> 1]`. Perhaps one can come up with code examples that don't use those
> operations at all, but I would guess that'll be a rare case. My
> point: If your code uses vector operations (as `+`) _and_
> numpy.linalg functions, your vector operations will be several times
> slower than necessary. One way around this would perhaps be to
> initialize all your data in Fortran-style data layout, where
> `a[...,0] + a[..., 1]` is indeed faster than `a[0] + a[1]`.
>
> What I'll end up doing is probably to rewrite whatever numpy.linalg
> function I need for C-style ordering. For dets of 2x2 or 3x3
> matrices, this shouldn't be to hard (`a[0][0] +a[1][1] - a[0][1] -
> a[1][0]`). :)
Well, Linalg functions are (stacked) low-level calls into lapack.

Now for some functions (not simple math), it is possible that if you
have chosen a non C-order memory layout, numpy has to jump through
hoops to do your operations or iterates in C-order anyway.

And yes, this is also the case for most of linalg, since most of them
call into highly optimized code and algorithms defined for a single
array.

In that case manual solutions like you suggested may be faster.
Although, I would suggest to be very careful with them, since there are
may be subtleties one may miss, and it probably does make the code less
easy to maintain and more error prone.

In general numpy is smart about memory/iteration order for most cases,
if you work on stacked arrays (and not on a single slices of them as in
your examples) the memory order often has little to no effect on the
operation speed, e.g. while  `a[0] + a[1]` it is faster in your
examples, whether you do `a.sum(0)` or `a.sum(-1)` makes typically
little difference.

Of course there are some cases were you can out-smart numpy, and of
course in general choosing the right memory order can have a huge
impact on your performance. But again, trying to out-smart also comes
at a cost and I would be very reluctant to do it, and typically there are probably lower hanging fruits to get first.

- Sebastian


>
> Cheers,
> Nico
>
>
> On Sun, Mar 5, 2017 at 3:53 PM Sebastian Berg <sebastian@sipsolutions
> .net> wrote:
> > On Thu, 2017-03-02 at 10:27 +0000, Nico Schlömer wrote:
> > > Hi everyone,
> > >
> > > When trying to speed up my code, I noticed that simply by
> > reordering
> > > my data I could get more than twice as fast for the simplest
> > > operations:
> > > ```
> > > import numpy
> > > a = numpy.random.rand(50, 50, 50)
> > >
> > > %timeit a[0] + a[1]
> > > 1000000 loops, best of 3: 1.7 µs per loop
> > >
> > > %timeit a[:, 0] + a[:, 1]
> > > 100000 loops, best of 3: 4.42 µs per loop
> > >
> > > %timeit a[..., 0] + a[..., 1]
> > > 100000 loops, best of 3: 5.99 µs per loop
> > > ```
> > > This makes sense considering the fact that, by default, NumPy
> > > features C-style memory allocation: the last index runs fastest.
> > The
> > > blocks that are added with `a[0] + a[1]` are contiguous in
> > memory, so
> > > cache is nicely made use of. As opposed to that, the blocks that
> > are
> > > added with `a[:, 0] + a[:, 1]` are not contiguous, even more so
> > with
> > > `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the
> > > correct explanation?
> > >
> > > If yes, I'm wondering why most numpy.linalg methods, when
> > vectorized,
> > > put the vector index up front. E.g., to mass-compute
> > determinants,
> > > one has to do
> > > ```
> > > a = numpy.random.rand(777, 3, 3)
> > > numpy.linalg.det(a)
> > > ```
> > > This way, all 3x3 matrices form a memory block, so if you do
> > `det`
> > > block by block, that'll be fine. However, vectorized operations
> > (like
> > > `+` above) will be slower than necessary.
> > > Any background on this? 
> > >
> >
> > I am honestly not sure where you are going at. This order seems the
> > more natural order for most operations. Also numpy does not create
> > copies normally even for transposed data (though some functions may
> > internally, numpy just _defaults_ to C-order). So of course what is
> > faster depends on your use-case, but if you have an operation on
> > many
> > 3x3 arrays the way numpy does it is the more natural way. If you do
> > other things that are faster the other way around, you will have to
> > decide which operation is the more important one overall.
> >
> > - Sebastian
> >
> > > (I came across this when having to rearrange my data (swapaxes,
> > > rollaxis) from shape (3, 3, 777) (which allows for fast
> > vectorized
> > > operations in the rest of the code) to (777, 3, 3) for using
> > numpy's
> > > svd.)
> > >
> > > Cheers,
> > > Nico
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > [hidden email]
> > > https://mail.scipy.org/mailman/listinfo/numpy-discussion_________
> > ______________________________________
> > NumPy-Discussion mailing list
> > [hidden email]
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion

signature.asc (817 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

Nathaniel Smith
In reply to this post by Nico Schlömer
On Sun, Mar 5, 2017 at 7:46 AM, Nico Schlömer <[hidden email]> wrote:
>> I am honestly not sure where you are going at. This order seems the more
>> natural order for most operations.
>
> Not sure what you mean by "natural". The most basic operations like `a[0] +
> a[1]` are several times faster than `a[...,0] + a[..., 1]`. Perhaps one can
> come up with code examples that don't use those operations at all, but I
> would guess that'll be a rare case. My point: If your code uses vector
> operations (as `+`) _and_ numpy.linalg functions, your vector operations
> will be several times slower than necessary.

I guess it seems odd that you're thinking of "vector operations" as
always meaning "operations on two slices of the same array"? I feel
like that's a vanishingly small percentage of vector operations.
Surely a + b is a lot more common than any of those?

In any case, the reason linalg works that way is to be consistent with
the general numpy broadcasting rule where you match indices from the
right, which is certainly not possible to change. The reason
broadcasting works that way is that in the common case of C memory
layout, it makes vectorized operations like a + b faster :-).

In theory it should also make the linalg functions faster, because it
means that each call to the underlying 'det' routine is working on a
contiguous block. If they worked from the left then we'd almost always
have to copy the whole matrix into a temporary before we could
actually do any linear algebra. (Though since linear algebra routines
usually have super-linear complexity this might not matter much in
practice.)

-n

--
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Loading...