hi,
Temporary arrays generated in expressions are expensive as they imply extra memory bandwidth, which is the bottleneck in most numpy operations. For example:

r = a + b + c

creates the b + c temporary and then adds a to it. This can be rewritten to be more efficient using inplace operations:

r = b + c
r += a

This saves some memory bandwidth and can speed up the operation by 50% for very large arrays, or even more if the inplace operation allows it to be completed entirely in the CPU cache.

The problem is that inplace operations are a lot less readable, so they are often only used in well-optimized code. But due to Python's refcounting semantics we can actually do some inplace conversions transparently. If an operand in Python has a reference count of one it must be a temporary, so we can use it as the destination array. CPython itself does this optimization for string concatenations.

In numpy we have the issue that we can be called from the C-API directly, where the reference count may be one for other reasons. To solve this we can check the backtrace up to the Python frame evaluation function. If there are only numpy and Python functions between that and our entry point, we should be able to elide the temporary.

This PR implements this:
https://github.com/numpy/numpy/pull/7997

It currently only supports Linux with glibc (which has reliable backtraces via unwinding) and maybe MacOS depending on how good their backtrace is. On Windows the backtrace APIs are different and I don't know them, but in theory it could also be done there.

A problem is that checking the backtrace is quite expensive, so it should only be enabled when the involved arrays are large enough for it to be worthwhile. In my testing this seems to be around 180-300KiB sized arrays, basically where they start spilling out of the CPU L2 cache.
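For illustration, the manual rewrite above looks like this in full (array names and sizes are just for the example):

```python
import numpy as np

n = 1_000_000  # large enough that memory bandwidth dominates
a = np.ones(n)
b = np.ones(n)
c = np.ones(n)

# naive version: allocates a temporary for b + c, then another array for the sum
r1 = a + b + c

# in-place version: one allocation total, reused as the destination
r2 = b + c
r2 += a

assert np.array_equal(r1, r2)
```

The elision in the PR aims to make the first spelling perform like the second without the readability cost.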
I made a little crappy benchmark script to test this cutoff in this branch:
https://github.com/juliantaylor/numpy/tree/elide-bench

If you are interested you can run it with:
python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy

At the end it will plot the ratio between elided and non-elided runtime. It should get larger than one around 180KiB on most CPUs.

If no one points out some flaw in the approach, I'm hoping to get this into the next numpy version.

cheers,
Julian

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/numpy-discussion
On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
<[hidden email]> wrote:
> hi,
> Temporary arrays generated in expressions are expensive as they imply
> extra memory bandwidth which is the bottleneck in most numpy operations.
> For example:
>
> r = a + b + c
>
> creates the b + c temporary and then adds a to it.
> This can be rewritten to be more efficient using inplace operations:
>
> r = b + c
> r += a

general question (I wouldn't understand the details even if I looked):

how is this affected by broadcasting and type promotion?

Some of the main reasons that I don't like to use inplace operations in general is that I'm often not sure when type promotion occurs and when arrays expand during broadcasting.

For example, b + c is 1-D, a is 2-D, and r has the broadcasted shape. Another case when I switch away from inplace operations is when b + c is int or bool and a is float. Thankfully, we get error messages for casting now.

> This saves some memory bandwidth and can speedup the operation by 50%
> for very large arrays or even more if the inplace operation allows it to
> be completed completely in the cpu cache.

I didn't realize the difference can be so large. That would make streamlining some code worth the effort.

Josef
On 30.09.2016 23:09, [hidden email] wrote:
> On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
> <[hidden email]> wrote:
>
> general question (I wouldn't understand the details even if I looked.)
>
> how is this affected by broadcasting and type promotion?
>
> Some of the main reasons that I don't like to use inplace operation in
> general is that I'm often not sure when type promotion occurs and when
> arrays expand during broadcasting.
>
> for example b + c is 1-D, a is 2-D, and r has the broadcasted shape.
> another case when I switch away from broadcasting is when b + c is int
> or bool and a is float. Thankfully, we get error messages for casting
> now.

the temporary is only avoided when the casting follows the safe rule, so it should be the same as what you get without inplace operations. E.g. float32-temporary + float64 will not be converted to the unsafe float32 += float64 which a normal inplace operation would allow. But float64-temp + float32 is transformed.

Currently the only broadcasting that will be transformed is temporary + scalar value; otherwise it will only work on matching array sizes. Though there is not really anything that prevents full broadcasting, it's just not implemented yet in the PR.
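To spell out the casting rule Julian describes (a sketch; variable names are just for the example):

```python
import numpy as np

t64 = np.ones(3, dtype=np.float64)   # stands in for a float64 temporary
x32 = np.ones(3, dtype=np.float32)

# float64-temp + float32: the result dtype is float64 either way, so the
# temporary's buffer can safely be reused as the destination
r = t64 + x32
assert r.dtype == np.float64

# the reverse direction would mean writing float64 data into a float32
# buffer -- a lossy cast. Explicit in-place allows it under the
# 'same_kind' rule, but the elision never introduces it, since it only
# transforms expressions where the cast follows the 'safe' rule:
x32 += t64
assert x32.dtype == np.float32   # explicit in-place silently truncates
```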
Julian,

This is really, really cool!

I have been wanting something like this for years (over a decade? wow!), but always thought it would require hacking the interpreter to intercept operations. This is a really inspired idea, and could buy numpy a lot of performance.

I'm afraid I can't say much about the implementation details -- but great work!

-Chris

On Fri, Sep 30, 2016 at 2:50 PM, Julian Taylor <[hidden email]> wrote:
On 30.09.2016 23:09, [hidden email] wrote:

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R   (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115   (206) 526-6317 main reception

[hidden email]
Just thinking aloud, an idea I had recently takes a different approach. The problem with temporaries isn't so much that they exist, but rather that they keep being malloc'ed and freed. What if numpy kept a small LRU cache of weakref'ed temporaries? Whenever a new numpy array is requested, numpy could see if there is already one in its cache of matching size and use it. If you think about it, expressions that result in many temporaries would quite likely have many of them being the same size in memory.

Don't know how feasible it would be to implement though.

Cheers!
Ben Root

On Sat, Oct 1, 2016 at 2:38 PM, Chris Barker <[hidden email]> wrote:
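A rough pure-Python sketch of the idea (class and method names are made up for illustration; the real thing would live in numpy's C allocator and use weakrefs rather than an explicit release call):

```python
from collections import OrderedDict
import numpy as np

class BufferCache:
    """Toy LRU cache of reusable raw buffers keyed by byte size."""

    def __init__(self, maxitems=8):
        self.maxitems = maxitems
        self._free = OrderedDict()   # nbytes -> bytearray, in LRU order
        self._live = {}              # id(array) -> bytearray backing it

    def empty(self, shape, dtype=np.float64):
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        buf = self._free.pop(nbytes, None)
        if buf is None:
            buf = bytearray(nbytes)  # cache miss: fresh allocation
        arr = np.frombuffer(buf, dtype=dtype).reshape(shape)
        self._live[id(arr)] = buf
        return arr

    def release(self, arr):
        # hand the backing buffer back to the cache, evicting LRU entries
        buf = self._live.pop(id(arr))
        self._free[arr.nbytes] = buf
        while len(self._free) > self.maxitems:
            self._free.popitem(last=False)

cache = BufferCache()
t = cache.empty((1000,))
t[:] = 1.0
cache.release(t)
u = cache.empty((1000,))     # same size: reuses t's buffer, no new allocation
```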
The same idea was published two years ago:
http://hiperfit.dk/pdf/Doubling.pdf
the problem with this approach is that we don't really want numpy
hogging on to hundreds of megabytes of memory by default, so it would need to be a user option. A context manager could work too, but it would probably lead to premature optimization.

Very new Linux versions (4.6+) now finally support MADV_FREE, which gives memory back to the system but does not require refaulting it if nothing else needed it. So this might be an option now. But libc implementations will probably use that at some point too, and then numpy doesn't need to do this.

On 02.10.2016 15:10, Benjamin Root wrote:
> Just thinking aloud, an idea I had recently takes a different approach.
> The problem with temporaries isn't so much that they exist, but rather
> that they keep being malloc'ed and freed. What if numpy kept a small LRU
> cache of weakref'ed temporaries? [...]
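For reference, MADV_FREE is reachable from Python too on new enough kernels and interpreters (the sizes here are arbitrary; this only shows the call, not how an allocator would use it):

```python
import mmap

# allocate a page-aligned scratch region the way an allocator might
buf = mmap.mmap(-1, 4 * 1024 * 1024)   # anonymous 4 MiB mapping
buf[:] = b"\x01" * len(buf)            # fault the pages in by writing

# MADV_FREE (newer Linux kernels; exposed in Python 3.8+): tell the
# kernel the contents are disposable. Pages are reclaimed lazily, only
# under memory pressure, so reusing the buffer need not refault them.
if hasattr(mmap, "MADV_FREE"):
    buf.madvise(mmap.MADV_FREE)
```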
On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor <[hidden email]> wrote:

> the problem with this approach is that we don't really want numpy
> hogging on to hundreds of megabytes of memory by default so it would
> need to be a user option.

indeed -- but one could set an LRU cache to be very small (few items, not small memory), and then it would get used within expressions, but not hold on to much outside of expressions.

However, is the allocation the only (or even biggest) source of the performance hit? If you generate a temporary as a result of an operation, rather than doing it in-place, that temporary needs to be allocated, but it also means that an additional array needs to be pushed through the processor -- and that can make a big performance difference too.

I'm not entirely sure how to profile this correctly, but this seems to indicate that the allocation is cheap compared to the operations (for a million-element array):

* Regular old temporary creation

In [24]: def f1(arr1, arr2):
    ...:     result = arr1 + arr2
    ...:     return result

In [26]: %timeit f1(arr1, arr2)
1000 loops, best of 3: 1.13 ms per loop

* Completely in-place, no allocation of an extra array

In [27]: def f2(arr1, arr2):
    ...:     arr1 += arr2
    ...:     return arr1

In [28]: %timeit f2(arr1, arr2)
1000 loops, best of 3: 755 µs per loop

So that's about 30% faster.

* Allocate a temporary that isn't used -- but should catch the creation cost

In [29]: def f3(arr1, arr2):
    ...:     result = np.empty_like(arr1)
    ...:     arr1 += arr2
    ...:     return arr1

In [30]: %timeit f3(arr1, arr2)
1000 loops, best of 3: 756 µs per loop

Only a µs slower!

Profiling is hard, and I'm not good at it, but this seems to indicate that the allocation is cheap.

-CHB
On 03.10.2016 20:23, Chris Barker wrote:
> On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor
> <[hidden email]> wrote:
>
>     the problem with this approach is that we don't really want numpy
>     hogging on to hundreds of megabytes of memory by default so it would
>     need to be a user option.
>
> indeed -- but one could set an LRU cache to be very small (few items,
> not small memory), and then it get used within expressions, but not hold
> on to much outside of expressions.

numpy doesn't see the whole expression, so we can't really do much. (Technically we could in 3.6 by using PEP 523, but that would be a larger undertaking.)

> However, is the allocation the only (or even biggest) source of the
> performance hit?

On large arrays the allocation is insignificant. What does cost some time is faulting the memory into the process, which implies writing zeros into the pages (a page at a time as they are being used). By storing memory blocks in numpy we would save this portion. This is really the job of the libc, but these are usually tuned for general-purpose workloads and thus tend to give memory back to the system much earlier than numerical workloads would like.

Note that numpy already has a small memory block cache, but it's only used for very small arrays where the allocation cost itself is significant; it is limited to a couple megabytes at most.
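Julian's faulting point also explains Chris's f3 result above: np.empty returns almost immediately because the pages are not yet touched; the cost only shows up when they are first written. A rough way to see this from Python (array size is arbitrary):

```python
import timeit
import numpy as np

n = 10_000_000  # ~80 MB of float64

# allocation only: the kernel hands back virtual pages, nothing is faulted
alloc = timeit.timeit(lambda: np.empty(n), number=10) / 10

# allocation plus touching every page: this is where the zeroing/faulting
# cost that an f3-style benchmark misses actually lands
alloc_and_touch = timeit.timeit(lambda: np.empty(n).fill(0.0), number=10) / 10

print(f"empty only:   {alloc * 1e3:.3f} ms")
print(f"empty + fill: {alloc_and_touch * 1e3:.3f} ms")
```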
With regards to arguments about holding onto large arrays, I would like to emphasize that my original suggestion mentioned weakref'ed numpy arrays. Essentially, the idea is to claw back only the raw memory blocks during that limbo period between discarding the numpy array Python object and when Python garbage-collects it.

Ben Root

On Mon, Oct 3, 2016 at 2:43 PM, Julian Taylor <[hidden email]> wrote:
On 03.10.2016 20:23, Chris Barker wrote:
Mon, 03 Oct 2016 15:07:28 -0400, Benjamin Root wrote:
> With regards to arguments about holding onto large arrays, I would like
> to emphasize that my original suggestion mentioned weakref'ed numpy
> arrays.
> Essentially, the idea is to claw back only the raw memory blocks during
> that limbo period between discarding the numpy array python object and
> when python garbage-collects it.

CPython afaik deallocates immediately when the refcount hits zero. It's relatively rare that you have arrays hanging around waiting for cycle breaking by gc. If you have them hanging around, I don't think it's possible to distinguish these from other arrays without running the gc.

Note also that an "is an array in use" check probably always requires Julian's stack-based hack, since you cannot rely on the refcount.

Pauli
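Pauli's point can be checked directly: for an acyclic object like a plain ndarray there is no limbo period between the last reference dying and deallocation, so a weakref goes dead immediately:

```python
import weakref
import numpy as np

a = np.arange(10)
r = weakref.ref(a)     # ndarray supports weak references

del a                  # refcount hits zero: CPython frees it right here,
                       # no garbage-collector pass needed
assert r() is None     # the weakref is already dead -- no limbo window
```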
Note that numpy does store some larger arrays already, in the fft
module. (In fact, this was a cache of unlimited size until #7686.) It might not be bad if the same cache were used more generally.

That said, if newer versions of python are offering ways of doing this better, maybe that is the best way forward.

-- Marten
Good discussion, but I was surprised by the absence of numexpr from it, given how relevant numexpr is to the topic. Is the goal to fold the numexpr functionality (and beyond) into numpy?

On Fri, Sep 30, 2016 at 7:08 PM, Julian Taylor <[hidden email]> wrote:
2016-10-05 8:45 GMT+02:00 srean <[hidden email]>:
Yes, the question of merging numexpr into numpy is something that periodically shows up on this list. I think mostly everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this. Also, the fact that numexpr relies on grouping an expression by using a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way numpy evaluates expressions, so something should be suggested to cope with this too.
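For readers who haven't used it, the string-based API Francesc refers to looks like this (a sketch; numexpr is a separate package and may not be installed, hence the guard):

```python
import numpy as np

try:
    import numexpr as ne
    have_ne = True
except ImportError:
    have_ne = False

x = np.linspace(0.0, 1.0, 1_000_000)

if have_ne:
    # the whole expression is handed over as one string, so numexpr can
    # compile it into a single blocked, multithreaded loop with no
    # full-size temporaries
    y = ne.evaluate("x**3 + tanh(x**2) + 4")
    assert np.allclose(y, x**3 + np.tanh(x**2) + 4)
```

The elision approach instead keeps the normal numpy expression syntax and only removes the temporaries it can prove are safe to reuse.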
-- Francesc Alted
All,
On Wed, Oct 5, 2016 at 11:46 AM, Francesc Alted <[hidden email]> wrote:
As Francesc said, numexpr gets most of its power from grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cache before returning the block to system memory. If it were just used as a back-end for NumPy, it would only gain from the multi-threading portion inside each function call. I'm not sure how one would go about grouping successive numpy expressions without modifying the Python interpreter.

I put a bit of effort into extending numexpr to use 4-byte word opcodes instead of 1-byte. Progress has been very slow, however, due to time constraints, but I have most of the numpy data types (u[1-4], i[1-4], f[4,8], c[8,16], S[1-4], U[1-4]). On Tuesday I finished writing a Python generator script that writes all the C-side opcode macros for opcodes.hpp. Now I have about 900 opcodes, and this could easily grow into thousands if more functions are added, so I also built a reverse lookup tree (based on collections.defaultdict) for the Python side of numexpr.

Robert

Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Thanks Francesc and Robert for giving me a broader picture of where this fits in. I believe numexpr does not handle slicing, so that might be another thing to look at.

On Wed, Oct 5, 2016 at 4:26 PM, Robert McLeod <[hidden email]> wrote:
Is that so? I thought numexpr also cuts down on the number of temporary buffers that get filled (in other words, copy operations) if the same expression was written as a series of operations. My understanding may be wrong, and I would appreciate correction.

The 'out' parameter in ufuncs can eliminate extra temporaries, but it's not composable. Right now I have to manually carry along the array where the in-place operations take place. I think the goal here is to eliminate that.
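The out=-threading pattern srean describes, spelled out (array names are just for the example):

```python
import numpy as np

n = 1_000_000
a = np.ones(n)
b = np.full(n, 2.0)
c = np.full(n, 3.0)

# readable version: allocates two full-size temporaries
r1 = a + b + c

# out=-threaded version: one scratch array manually carried through
# every step -- this is the bookkeeping the elision would make unnecessary
scratch = np.empty_like(a)
np.add(b, c, out=scratch)
np.add(scratch, a, out=scratch)

assert np.array_equal(r1, scratch)
```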
On Wed, Oct 5, 2016 at 1:11 PM, srean <[hidden email]> wrote:
Dereferencing would be relatively simple to add into numexpr, as it would just be some getattr() calls. Personally I will add that at some point because it will clean up my code. Slicing, maybe only for contiguous blocks in memory? I.e. imageStack[0,:,:] would be possible, but imageStack[:, ::2, ::2] would not be trivial (I think...). I seem to remember someone asked David Cooke about slicing and he said something along the lines of, "that's what Numba is for." Perhaps NumPy back-ended by Numba is more what you are looking for, as it hooks into the bytecode compiler?

The main advantage of numexpr is that a series of numpy functions in <expression> can be enclosed in ne.evaluate("<expression>") and it provides a big acceleration for little programmer effort, but it's not nearly as sophisticated as Numba or PyPy.
The numexpr virtual machine does create temporaries where needed when it parses the abstract syntax tree for all the operations it has to do. I believe the main advantage is that the temporaries are created in the CPU cache, not out in system memory. It's certainly true that numexpr doesn't create a lot of OP_COPY operations; rather, it's optimized to minimize them, so it's probably fewer ops than naive successive calls to numpy within Python, but I'm unsure if there's any difference in operation count between hand-optimized numpy with out= set and numexpr. Numexpr just does it for you.

This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:

Robert

Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
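The blocking idea Robert describes can be sketched in pure Python: operate on cache-sized chunks so every intermediate stays small enough to live in L1/L2, instead of streaming full-size temporaries through system memory (the expression and block size are just for illustration):

```python
import numpy as np

def blocked_eval(x, block=4096):
    """Evaluate x**3 + tanh(x**2) + 4 one cache-sized block at a time,
    numexpr-style: no intermediate ever exceeds `block` elements."""
    out = np.empty_like(x)
    scratch = np.empty(block)
    for start in range(0, len(x), block):
        b = x[start:start + block]
        s = scratch[:len(b)]
        np.multiply(b, b, out=s)      # x**2, into the scratch block
        np.tanh(s, out=s)             # tanh(x**2), in place
        o = out[start:start + block]
        np.power(b, 3, out=o)         # x**3, straight into the output
        o += s
        o += 4.0
    return out

x = np.linspace(0.0, 1.0, 100_000)
assert np.allclose(blocked_eval(x), x**3 + np.tanh(x**2) + 4)
```

The real numexpr VM does this in compiled code with multiple threads, but the memory-traffic argument is the same.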
On Wed, Oct 5, 2016 at 5:36 PM, Robert McLeod <[hidden email]> wrote:
That was my understanding as well. If it automatically does what one could achieve by carrying the state along in the 'out' parameter, that's as good as it can get in terms of removing unnecessary ops. There are other speedup opportunities of course, but that's a separate matter.
Hadn't come across that one before. Great link, thanks. Using caches and vector registers well trumps threading, unless one has a lot of data and it helps to disable hyper-threading.
hi,
This PR has now been merged. As it has the potential to break stuff, please test your applications, in particular ones that use Cython and C-extensions, against master.

It will only do something when working on large arrays (> 256 KiB) on platforms providing the backtrace() function. So Linux and Mac, but probably not Windows (though Windows could probably be added if someone volunteers to write the equivalent WinAPI code).

If you compile from source and want to see for sure if it is doing anything, you can set the NPY_ELIDE_DEBUG define in ./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out when it is eliding something.

cheers,
Julian

On 30.09.2016 15:38, Julian Taylor wrote:
> hi,
> Temporary arrays generated in expressions are expensive as they imply
> extra memory bandwidth which is the bottleneck in most numpy operations.
> [...]
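Besides the NPY_ELIDE_DEBUG route, a quick black-box check is to time the same expression below and above the 256 KiB cutoff (sizes here are illustrative; absolute timings vary by machine, and the elision only fires on master on supported platforms):

```python
import timeit
import numpy as np

small = np.ones(1024)         # ~8 KiB: well below the elision cutoff
large = np.ones(1_000_000)    # ~8 MiB: large enough for elision to fire

for name, arr in (("small", small), ("large", large)):
    t = timeit.timeit(lambda: arr + arr + arr, number=50) / 50
    print(f"{name}: {t * 1e6:.1f} µs per a + a + a")

# results are identical with or without elision; only the temporary
# allocation behind the scenes differs
r = large + large + large
assert np.all(r == 3.0)
```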
What's the timeline for the next release? I have the perfect use case for this (a haversine calculation on large arrays that takes up ~33% of one of my processing scripts). However, to test it out, I have a huge dependency mess to wade through first, and there are no resources devoted to that project for at least a few weeks. I want to make sure I get feedback to y'all.

Cheers!
Ben Root

On Sat, Feb 25, 2017 at 5:17 AM, Julian Taylor <[hidden email]> wrote:
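A haversine like Ben describes is exactly the shape of expression the elision targets: one long chain of ufunc calls, each of which used to allocate a full-size temporary (the formula is the standard one; the function name and Earth radius default are illustrative):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r_earth=6371.0):
    """Great-circle distance between coordinate arrays, in km.
    Every binary op below produces a temporary; with the merged elision,
    most of them can reuse the previous temporary's buffer on large inputs."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * r_earth * np.arcsin(np.sqrt(a))

# quarter of a great circle along the equator: 0°E to 90°E
d = haversine_km(np.array([0.0]), np.array([0.0]),
                 np.array([0.0]), np.array([90.0]))
assert abs(d[0] - 6371.0 * np.pi / 2) < 1e-5
```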