# vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

5 messages
Open this post in threaded view
|

## vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

 Hi everyone,When trying to speed up my code, I noticed that simply by reordering my data I could get more than twice as fast for the simplest operations:```import numpya = numpy.random.rand(50, 50, 50)%timeit a[0] + a[1]1000000 loops, best of 3: 1.7 µs per loop%timeit a[:, 0] + a[:, 1]100000 loops, best of 3: 4.42 µs per loop%timeit a[..., 0] + a[..., 1]100000 loops, best of 3: 5.99 µs per loop```This makes sense considering the fact that, by default, NumPy features C-style memory allocation: the last index runs fastest. The blocks that are added with `a[0] + a[1]` are contiguous in memory, so cache is nicely made use of. As opposed to that, the blocks that are added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the correct explanation?If yes, I'm wondering why most numpy.linalg methods, when vectorized, put the vector index up front. E.g., to mass-compute determinants, one has to do```a = numpy.random.rand(777, 3, 3)numpy.linalg.det(a)```This way, all 3x3 matrices form a memory block, so if you do `det` block by block, that'll be fine. However, vectorized operations (like `+` above) will be slower than necessary.Any background on this? (I came across this when having to rearrange my data (swapaxes, rollaxis) from shape (3, 3, 777) (which allows for fast vectorized operations in the rest of the code) to (777, 3, 3) for using numpy's svd.)Cheers,Nico _______________________________________________ NumPy-Discussion mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/numpy-discussion
Open this post in threaded view
|

## Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

 On Thu, 2017-03-02 at 10:27 +0000, Nico Schlömer wrote: > Hi everyone, > > When trying to speed up my code, I noticed that simply by reordering > my data I could get more than twice as fast for the simplest > operations: > ``` > import numpy > a = numpy.random.rand(50, 50, 50) > > %timeit a[0] + a[1] > 1000000 loops, best of 3: 1.7 µs per loop > > %timeit a[:, 0] + a[:, 1] > 100000 loops, best of 3: 4.42 µs per loop > > %timeit a[..., 0] + a[..., 1] > 100000 loops, best of 3: 5.99 µs per loop > ``` > This makes sense considering the fact that, by default, NumPy > features C-style memory allocation: the last index runs fastest. The > blocks that are added with `a[0] + a[1]` are contiguous in memory, so > cache is nicely made use of. As opposed to that, the blocks that are > added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with > `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the > correct explanation? > > If yes, I'm wondering why most numpy.linalg methods, when vectorized, > put the vector index up front. E.g., to mass-compute determinants, > one has to do > ``` > a = numpy.random.rand(777, 3, 3) > numpy.linalg.det(a) > ``` > This way, all 3x3 matrices form a memory block, so if you do `det` > block by block, that'll be fine. However, vectorized operations (like > `+` above) will be slower than necessary. > Any background on this?  > I am honestly not sure where you are going at. This order seems the more natural order for most operations. Also numpy does not create copies normally even for transposed data (though some functions may internally, numpy just _defaults_ to C-order). So of course what is faster depends on your use-case, but if you have an operation on many 3x3 arrays the way numpy does it is the more natural way. If you do other things that are faster the other way around, you will have to decide which operation is the more important one overall. - Sebastian > (I came across this when having to rearrange my data (swapaxes, > rollaxis) from shape (3, 3, 777) (which allows for fast vectorized > operations in the rest of the code) to (777, 3, 3) for using numpy's > svd.) > > Cheers, > Nico > _______________________________________________ > NumPy-Discussion mailing list > [hidden email] > https://mail.scipy.org/mailman/listinfo/numpy-discussion_______________________________________________ NumPy-Discussion mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/numpy-discussion signature.asc (817 bytes) Download Attachment
Open this post in threaded view
|

## Re: vectorization vs. numpy.linalg (shape (3, 3, 777) vs shape (777, 3, 3))

 > I am honestly not sure where you are going at. This order seems the more natural order for most operations. Not sure what you mean by "natural". The most basic operations like `a[0] + a[1]` are several times faster than `a[...,0] + a[..., 1]`. Perhaps one can come up with code examples that don't use those operations at all, but I would guess that'll be a rare case. My point: If your code uses vector operations (as `+`) _and_ numpy.linalg functions, your vector operations will be several times slower than necessary. One way around this would perhaps be to initialize all your data in Fortran-style data layout, where `a[...,0] + a[..., 1]` is indeed faster than `a[0] + a[1]`.What I'll end up doing is probably to rewrite whatever numpy.linalg function I need for C-style ordering. For dets of 2x2 or 3x3 matrices, this shouldn't be to hard (`a[0][0] +a[1][1] - a[0][1] - a[1][0]`). :)Cheers,NicoOn Sun, Mar 5, 2017 at 3:53 PM Sebastian Berg <[hidden email]> wrote:On Thu, 2017-03-02 at 10:27 +0000, Nico Schlömer wrote: > Hi everyone, > > When trying to speed up my code, I noticed that simply by reordering > my data I could get more than twice as fast for the simplest > operations: > ``` > import numpy > a = numpy.random.rand(50, 50, 50) > > %timeit a[0] + a[1] > 1000000 loops, best of 3: 1.7 µs per loop > > %timeit a[:, 0] + a[:, 1] > 100000 loops, best of 3: 4.42 µs per loop > > %timeit a[..., 0] + a[..., 1] > 100000 loops, best of 3: 5.99 µs per loop > ``` > This makes sense considering the fact that, by default, NumPy > features C-style memory allocation: the last index runs fastest. The > blocks that are added with `a[0] + a[1]` are contiguous in memory, so > cache is nicely made use of. As opposed to that, the blocks that are > added with `a[:, 0] + a[:, 1]` are not contiguous, even more so with > `a[..., 0] + a[..., 1]`; hence the slowdown. Would that be the > correct explanation? > > If yes, I'm wondering why most numpy.linalg methods, when vectorized, > put the vector index up front. E.g., to mass-compute determinants, > one has to do > ``` > a = numpy.random.rand(777, 3, 3) > numpy.linalg.det(a) > ``` > This way, all 3x3 matrices form a memory block, so if you do `det` > block by block, that'll be fine. However, vectorized operations (like > `+` above) will be slower than necessary. > Any background on this?  > I am honestly not sure where you are going at. This order seems the more natural order for most operations. Also numpy does not create copies normally even for transposed data (though some functions may internally, numpy just _defaults_ to C-order). So of course what is faster depends on your use-case, but if you have an operation on many 3x3 arrays the way numpy does it is the more natural way. If you do other things that are faster the other way around, you will have to decide which operation is the more important one overall. - Sebastian > (I came across this when having to rearrange my data (swapaxes, > rollaxis) from shape (3, 3, 777) (which allows for fast vectorized > operations in the rest of the code) to (777, 3, 3) for using numpy's > svd.) > > Cheers, > Nico > _______________________________________________ > NumPy-Discussion mailing list > [hidden email] > https://mail.scipy.org/mailman/listinfo/numpy-discussion_______________________________________________ NumPy-Discussion mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/numpy-discussion