automatically avoiding temporary arrays


automatically avoiding temporary arrays

Julian Taylor-3
hi,
Temporary arrays generated in expressions are expensive, as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

r = a + b + c

creates the b + c temporary and then adds a to it.
This can be rewritten more efficiently using in-place operations:

r = b + c
r += a

This saves memory bandwidth and can speed up the operation by 50% for
very large arrays, or by even more if the in-place operation allows
everything to complete within the cpu cache.
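
As a rough illustration (my own sketch, with arbitrarily chosen sizes),
the effect is easy to measure with timeit:

import numpy as np
from timeit import timeit

a, b, c = (np.ones(5_000_000) for _ in range(3))  # well out of cache

def with_temporary():
    return a + b + c   # the intermediate sum allocates a full-size temporary

def in_place():
    r = b + c
    r += a             # reuses r; no second temporary
    return r

print(timeit(with_temporary, number=20))
print(timeit(in_place, number=20))  # typically noticeably faster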

The problem is that in-place operations are much less readable, so they
are often only used in well-optimized code. But due to python's
refcounting semantics we can actually do some in-place conversions
transparently.
If an operand in python has a reference count of one, it must be a
temporary, so we can use it as the destination array. CPython itself
does this optimization for string concatenation.
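
The refcount observation itself is easy to see from python (the check
in the PR happens at the C level, but this sketch shows the principle):

import sys
import numpy as np

a = np.ones(10)
b = np.ones(10)

# A named array: the variable binding plus getrefcount's own argument
# reference, so the count is at least 2.
print(sys.getrefcount(a))      # typically 2

# The result of `a + b` is an unnamed temporary; the argument slot is
# its only reference, so the count is 1.
print(sys.getrefcount(a + b))  # typically 1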

In numpy we have the issue that we can be called from the C-API
directly, where the reference count may be one for other reasons.
To solve this we can check the backtrace up to the python frame
evaluation function. If there are only numpy and python functions in
between that and our entry point, we should be able to elide the
temporary.

This PR implements this:
https://github.com/numpy/numpy/pull/7997

It currently only supports Linux with glibc (which has reliable
backtraces via unwinding) and maybe macOS, depending on how good its
backtrace is. On Windows the backtrace APIs are different and I don't
know them, but in theory it could be done there as well.

A problem is that checking the backtrace is quite expensive, so it
should only be enabled when the involved arrays are large enough for it
to be worthwhile. In my testing the break-even point is around
180-300 KiB sized arrays, basically where they start spilling out of
the CPU L2 cache.

I made a little crappy benchmark script to test this cutoff in this branch:
https://github.com/juliantaylor/numpy/tree/elide-bench

If you are interested you can run it with:
python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy

At the end it will plot the ratio between elided and non-elided
runtime. It should get larger than one at around 180 KiB on most CPUs.

If no one points out a flaw in the approach, I'm hoping to get this
into the next numpy version.

cheers,
Julian



Re: automatically avoiding temporary arrays

josef.pktd
On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
<[hidden email]> wrote:

> hi,
> Temporary arrays generated in expressions are expensive, as they imply
> extra memory bandwidth, which is the bottleneck in most numpy operations.
> For example:
>
> r = a + b + c
>
> creates the b + c temporary and then adds a to it.
> This can be rewritten more efficiently using in-place operations:
>
> r = b + c
> r += a

general question (I wouldn't understand the details even if I looked.)

how is this affected by broadcasting and type promotion?

One of the main reasons that I don't like to use in-place operations in
general is that I'm often not sure when type promotion occurs and when
arrays expand during broadcasting.

For example, b + c is 1-D, a is 2-D, and r has the broadcast shape.
Another case where I switch away from in-place operations is when b + c
is int or bool and a is float. Thankfully, we get error messages for
casting now.
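
A small sketch of the first case (shapes made up for illustration):

import numpy as np

a = np.ones((3, 4))   # 2-D
b = np.ones(4)        # 1-D
c = np.ones(4)

r = b + c             # 1-D temporary
r = r + a             # fine: broadcasts up to shape (3, 4)
# r += a              # ValueError: the 1-D output cannot grow to (3, 4)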


>
> This saves memory bandwidth and can speed up the operation by 50% for
> very large arrays, or by even more if the in-place operation allows
> everything to complete within the cpu cache.

I didn't realize the difference can be so large. That would make
streamlining some code worth the effort.

Josef



Re: automatically avoiding temporary arrays

Julian Taylor-3
On 30.09.2016 23:09, [hidden email] wrote:

> general question (I wouldn't understand the details even if I looked.)
>
> how is this affected by broadcasting and type promotion?
>
> One of the main reasons that I don't like to use in-place operations in
> general is that I'm often not sure when type promotion occurs and when
> arrays expand during broadcasting.
>
> For example, b + c is 1-D, a is 2-D, and r has the broadcast shape.
> Another case where I switch away from in-place operations is when b + c
> is int or bool and a is float. Thankfully, we get error messages for
> casting now.
the temporary is only avoided when the casting follows the safe rule,
so the result should be the same as what you get without in-place
operations. E.g. a float32 temporary + float64 will not be converted to
the unsafe float32 += float64 that a normal in-place operation would
allow, but float64-temp + float32 is transformed.

Currently the only broadcasting that will be transformed is temporary +
scalar value; otherwise it only works on matching array shapes. There
is not really anything that prevents full broadcasting, but it is not
implemented yet in the PR.
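
The direction of the safe-casting rule can be checked with np.can_cast;
a small sketch:

import numpy as np

# float32 casts safely up to float64, so a float64 temporary can
# absorb `float64_temp + float32` in place:
print(np.can_cast(np.float32, np.float64, casting='safe'))  # True

# Writing a float64 result into a float32 temporary would be an unsafe
# downcast, so `float32_temp + float64` is left untouched:
print(np.can_cast(np.float64, np.float32, casting='safe'))  # False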


Re: automatically avoiding temporary arrays

Chris Barker - NOAA Federal
Julian,

This is really, really cool!

I have been wanting something like this for years (over a decade? wow!), but always thought it would require hacking the interpreter to intercept operations. This is a really inspired idea, and could buy numpy a lot of performance.

I'm afraid I can't say much about the implementation details -- but great work!

-Chris




--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: automatically avoiding temporary arrays

Benjamin Root
Just thinking aloud, an idea I had recently takes a different approach. The problem with temporaries isn't so much that they exist, but rather that they keep being malloc'ed and cleared. What if numpy kept a small LRU cache of weakref'ed temporaries? Whenever a new numpy array is requested, numpy could see if there is already one of matching size in its cache and use it. If you think about it, expressions that result in many temporaries would quite likely have many of them of the same size in memory.

Don't know how feasible it would be to implement though.
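
As a very rough sketch of the idea (hypothetical names; keyed only on
allocation size, and leaving out the weakref machinery):

from collections import OrderedDict

class TemporaryCache:
    # Tiny LRU cache of raw buffers keyed by size in bytes, so
    # same-sized temporaries get recycled instead of re-malloc'ed.
    def __init__(self, maxitems=8):
        self._cache = OrderedDict()
        self._maxitems = maxitems

    def get(self, nbytes):
        if nbytes in self._cache:
            return self._cache.pop(nbytes)   # reuse a cached buffer
        return bytearray(nbytes)             # stand-in for a fresh malloc

    def put(self, nbytes, buf):
        self._cache[nbytes] = buf
        if len(self._cache) > self._maxitems:
            self._cache.popitem(last=False)  # evict least recently used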

Cheers!
Ben Root



Re: automatically avoiding temporary arrays

Hush Hush
In reply to this post by Julian Taylor-3
The same idea was published two years ago:

http://hiperfit.dk/pdf/Doubling.pdf



Re: automatically avoiding temporary arrays

Julian Taylor-3
In reply to this post by Benjamin Root
the problem with this approach is that we don't really want numpy
holding on to hundreds of megabytes of memory by default, so it would
need to be a user option.
A context manager could work too, but it would probably lead to
premature optimization.

Very new Linux versions (4.6+) now finally support MADV_FREE, which
gives memory back to the system but does not require faulting it back
in if nothing else needed it. So this might be an option now.
But libc implementations will probably use that at some point too, and
then numpy wouldn't need to do this itself.
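
For reference, a hedged sketch of what issuing that hint looks like
from python via ctypes (Linux only; MADV_FREE is 8 in the kernel
headers, and needs a kernel that supports it):

import ctypes
import mmap

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_FREE = 8  # from <sys/mman.h>

buf = mmap.mmap(-1, 1 << 20)  # 1 MiB anonymous, page-aligned mapping
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
# Tell the kernel it may reclaim these pages lazily; they are only
# zero-filled again if something actually reclaims them.
ret = libc.madvise(ctypes.c_void_p(addr), len(buf), MADV_FREE)
print(ret)  # 0 on success, -1 on kernels without MADV_FREE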



Re: automatically avoiding temporary arrays

Chris Barker - NOAA Federal


On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor <[hidden email]> wrote:
the problem with this approach is that we don't really want numpy
holding on to hundreds of megabytes of memory by default, so it would
need to be a user option.

Indeed -- but one could set an LRU cache to be very small (a few items, not small memory), and then it would get used within expressions but not hold on to much outside of expressions.

However, is the allocation the only (or even the biggest) source of the performance hit?

If you generate a temporary as the result of an operation, rather than doing it in place, that temporary needs to be allocated, but it also means that an additional array needs to be pushed through the processor -- and that can make a big performance difference too.

I'm not entirely sure how to profile this correctly, but this seems to indicate that the allocation is cheap compared to the operations (for a million-element array):

* Regular old temporary creation

In [24]: def f1(arr1, arr2):
    ...:     result = arr1 + arr2
    ...:     return result

In [26]: %timeit f1(arr1, arr2)
1000 loops, best of 3: 1.13 ms per loop

* Completely in-place, no allocation of an extra array

In [27]: def f2(arr1, arr2):
    ...:     arr1 += arr2
    ...:     return arr1

In [28]: %timeit f2(arr1, arr2)
1000 loops, best of 3: 755 µs per loop

So that's about 30% faster

* allocate a temporary that isn't used -- but should catch the creation cost

In [29]: def f3(arr1, arr2):
    ...:     result = np.empty_like(arr1)
    ...:     arr1 += arr2
    ...:     return arr1

In [30]: %timeit f3(arr1, arr2)
1000 loops, best of 3: 756 µs per loop

only a µs slower!

Profiling is hard, and I'm not good at it, but this seems to indicate that the allocation is cheap.

-CHB



Re: automatically avoiding temporary arrays

Julian Taylor-3
On 03.10.2016 20:23, Chris Barker wrote:

> Indeed -- but one could set an LRU cache to be very small (a few items,
> not small memory), and then it would get used within expressions but
> not hold on to much outside of expressions.

numpy doesn't see the whole expression, so we can't really do much.
(Technically we could in 3.5 by using PEP 523, but that would be a
larger undertaking.)

>
> However, is the allocation the only (or even the biggest) source of
> the performance hit?
On large arrays the allocation is insignificant. What does cost some
time is faulting the memory into the process, which implies writing
zeros into the pages (a page at a time, as they are being used).
By storing memory blocks in numpy we would save this portion. This is
really the job of the libc, but these are usually tuned for
general-purpose workloads and thus tend to give memory back to the
system much earlier than numerical workloads would like.

Note that numpy already has a small memory-block cache, but it's only
used for very small arrays where the allocation cost itself is
significant, and it is limited to a couple of megabytes at most.
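
The fault cost is easy to see with a rough timing sketch (my example;
numbers will vary by system):

import numpy as np
from timeit import timeit

n = 10_000_000  # ~80 MB of float64, large enough to go through mmap

# Fresh allocation each time: every write faults in new zeroed pages.
t_fresh = timeit(lambda: np.empty(n).fill(1.0), number=20)

# One reused buffer: pages are already resident, only the writes remain.
buf = np.empty(n)
t_reuse = timeit(lambda: buf.fill(1.0), number=20)

print(t_fresh, t_reuse)  # fresh is noticeably slower on Linux/glibc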

Re: automatically avoiding temporary arrays

Benjamin Root
With regard to the arguments about holding onto large arrays, I would like to emphasize that my original suggestion mentioned weakref'ed numpy arrays. Essentially, the idea is to claw back only the raw memory blocks during that limbo period between discarding the numpy array python object and when python garbage-collects it.

Ben Root


Re: automatically avoiding temporary arrays

Pauli Virtanen-3
Mon, 03 Oct 2016 15:07:28 -0400, Benjamin Root wrote:
> With regard to the arguments about holding onto large arrays, I would
> like to emphasize that my original suggestion mentioned weakref'ed
> numpy arrays.
> Essentially, the idea is to claw back only the raw memory blocks during
> that limbo period between discarding the numpy array python object and
> when python garbage-collects it.

CPython afaik deallocates immediately when the refcount hits zero. It's
relatively rare that you have arrays hanging around waiting for cycle
breaking by the gc, and if you do have them hanging around, I don't
think it's possible to distinguish them from other arrays without
running the gc.

Note also that an "is this array in use" check probably always requires
Julian's stack-based hack, since you cannot rely on the refcount.

        Pauli


Re: automatically avoiding temporary arrays

Marten van Kerkwijk
Note that numpy already stores some larger arrays, in the fft module.
(In fact, this was a cache of unlimited size until #7686.) It might not
be bad if the same cache were used more generally.

That said, if newer versions of python offer ways of doing this better,
maybe that is the best way forward.

-- Marten

Re: automatically avoiding temporary arrays

srean
In reply to this post by Julian Taylor-3
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic.

Is the goal to fold the numexpr functionality (and beyond) into numpy?


Re: automatically avoiding temporary arrays

Francesc Alted-3
2016-10-05 8:45 GMT+02:00 srean <[hidden email]>:
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic.

Is the goal to fold the numexpr functionality (and beyond) into numpy?

Yes, the question about merging numexpr into numpy is something that periodically shows up on this list. I think mostly everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this. Also, the fact that numexpr relies on grouping an expression by using a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way numpy evaluates expressions, so something would need to be suggested to cope with this too.
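
For illustration, the two styles side by side (requires the numexpr
package):

import numpy as np
import numexpr as ne

x = np.linspace(0, 1, 1_000_000)

# Plain numpy: each sub-expression materializes a full-size temporary.
y1 = x**3 + np.tanh(x**2) + 4

# numexpr: the string is compiled once and evaluated block by block,
# so intermediates stay in cache and no full-size temporaries appear.
y2 = ne.evaluate("x**3 + tanh(x**2) + 4")

assert np.allclose(y1, y2)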
 

--
Francesc Alted


Re: automatically avoiding temporary arrays

Robert McLeod
All,

On Wed, Oct 5, 2016 at 11:46 AM, Francesc Alted <[hidden email]> wrote:
> [srean's question and Francesc's answer on merging numexpr, quoted above, snipped]

As Francesc said, numexpr gets most of its power by grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cached block before returning it to system memory.  If it were just used as a back-end for NumPy, it would only gain from the multi-threading inside each function call.  I'm also not sure how one would go about grouping successive numpy expressions without modifying the Python interpreter.
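
To make the blocking idea concrete, here is a toy sketch (not numexpr's
actual implementation) of evaluating a fused expression block by block,
assuming a 1-D input and an arbitrary block size:

import numpy as np

def fused_blockwise(x, block=4096):
    # Toy numexpr-style evaluation of x**3 + tanh(x**2) + 4.
    # Temporaries are still created, but only block-sized ones,
    # which fit in the CPU cache instead of spilling to system memory.
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        b = x[i:i + block]
        out[i:i + block] = b**3 + np.tanh(b**2) + 4
    return out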

I put a bit of effort into extending numexpr to use 4-byte word opcodes instead of 1-byte.  Progress has been very slow, however, due to time constraints, but I have most of the numpy data types (u[1-4], i[1-4], f[4,8], c[8,16], S[1-4], U[1-4]).  On Tuesday I finished writing a Python generator script that writes all the C-side opcode macros for opcodes.hpp.  Now I have about 900 opcodes, and this could easily grow into thousands if more functions are added, so I also built a reverse lookup tree (based on collections.defaultdict) for the Python-side of numexpr.
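
The lookup structure itself is not shown above; a minimal sketch of what a
defaultdict-based reverse lookup tree could look like, with made-up opcode
values and a hypothetical find_opcode() helper:

from collections import defaultdict

def tree():
    return defaultdict(tree)

# hypothetical mapping: op name -> operand dtype chars -> opcode number
op_lookup = tree()
op_lookup['add']['d']['d'] = 0x0123  # made-up opcodes
op_lookup['add']['f']['f'] = 0x0124

def find_opcode(name, dtype_chars):
    node = op_lookup[name]
    for ch in dtype_chars:
        node = node[ch]
    return node

print(hex(find_opcode('add', 'dd')))  # 0x123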

Robert

--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
[hidden email]


Re: automatically avoiding temporary arrays

srean
Thanks, Francesc and Robert, for giving me a broader picture of where this fits in. I believe numexpr does not handle slicing, so that might be another thing to look at.


On Wed, Oct 5, 2016 at 4:26 PM, Robert McLeod <[hidden email]> wrote:

As Francesc said, numexpr gets most of its power by grouping a series of operations so it can send blocks to the CPU cache and run the entire series of operations on the cached block before returning it to system memory.  If it were just used as a back-end for NumPy, it would only gain from the multi-threading inside each function call.

Is that so?

I thought numexpr also cuts down on the number of temporary buffers that get filled (in other words, copy operations) when the same expression is written as a series of operations. My understanding may be wrong, and I would appreciate correction.

The 'out' parameter in ufuncs can eliminate extra temporaries, but it's not composable. Right now I have to manually carry along the array where the in-place operations take place. I think the goal here is to eliminate that.
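
A sketch of the manual bookkeeping this implies, with the out= buffer
threaded through each step by hand:

import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)
c = np.random.rand(1000000)

# readable, but allocates a temporary for b + c
r = a + b + c

# no extra temporaries, but the destination buffer has to be
# carried along explicitly through every call
r = np.empty_like(a)
np.add(b, c, out=r)
np.add(r, a, out=r)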




Re: automatically avoiding temporary arrays

Robert McLeod
On Wed, Oct 5, 2016 at 1:11 PM, srean <[hidden email]> wrote:
Thanks, Francesc and Robert, for giving me a broader picture of where this fits in. I believe numexpr does not handle slicing, so that might be another thing to look at.

Dereferencing would be relatively simple to add to numexpr, as it would just be some getattr() calls.  Personally, I will add that at some point because it will clean up my code.

Slicing, maybe only for contiguous blocks in memory?

I.e.

imageStack[0,:,:]

would be possible, but

imageStack[:, ::2, ::2] 

would not be trivial (I think...).  I seem to remember someone asked David Cooke about slicing and he said something along the lines of, "that's what Numba is for."  Perhaps NumPy backed by Numba is closer to what you are looking for, as it hooks into the Python bytecode. The main advantage of numexpr is that a series of numpy operations in <expression> can be enclosed in ne.evaluate("<expression>") for a big acceleration with little programmer effort, but it's not nearly as sophisticated as Numba or PyPy.
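
Whether a slice is one contiguous block is visible on the array flags; a
quick illustration (the shape here is arbitrary):

import numpy as np

imageStack = np.zeros((10, 512, 512))

print(imageStack[0, :, :].flags['C_CONTIGUOUS'])      # True: one solid block
print(imageStack[:, ::2, ::2].flags['C_CONTIGUOUS'])  # False: strided view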

 
> [srean's reply, quoted in full above, snipped]

The numexpr virtual machine does create temporaries where needed when it parses the abstract syntax tree of the operations it has to do.  I believe the main advantage is that those temporaries are created in the CPU cache, not in system memory. It's certainly true that numexpr doesn't create a lot of OP_COPY operations; rather, it's optimized to minimize them, so it probably issues fewer ops than naive successive calls to numpy within Python. I'm unsure whether there is any difference in operation count between hand-optimized numpy with out= set and numexpr; numexpr just does it for you.

This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:



Robert

--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
[hidden email]


Re: automatically avoiding temporary arrays

srean


On Wed, Oct 5, 2016 at 5:36 PM, Robert McLeod <[hidden email]> wrote:

It's certainly true that numexpr doesn't create a lot of OP_COPY operations; rather, it's optimized to minimize them, so it probably issues fewer ops than naive successive calls to numpy within Python. I'm unsure whether there is any difference in operation count between hand-optimized numpy with out= set and numexpr; numexpr just does it for you.

That was my understanding as well. If it automatically does what one could achieve by carrying the state along in the 'out' parameter, that's as good as it can get in terms of removing unnecessary ops. There are other speedup opportunities, of course, but that's a separate matter.
 
This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:


Hadn't come across that one before. Great link, thanks. Using caches and vector registers well trumps threading unless one has a lot of data, and it can even help to disable hyper-threading.
 


Re: automatically avoiding temporary arrays

Julian Taylor-3
In reply to this post by Julian Taylor-3
hi,
This PR has now been merged. As it has the potential to break things,
please test your applications against master, in particular ones that use
Cython or C-extensions.

It will only do something when working on large arrays (> 256 KiB) on
platforms providing the backtrace() function, so Linux and Mac but
probably not Windows (though Windows could be added if someone volunteers
to write the equivalent WinAPI code).

If you compile from source and want to see for sure whether it is doing
anything, you can set the NPY_ELIDE_DEBUG define in
./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out
when it is eliding something.
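
Without a debug build, a rough way to look for the effect from Python is to
time an elidable expression against the explicit in-place version on arrays
well above the 256 KiB cutoff (a sketch only; timings are machine-dependent
and merely suggest, rather than prove, that elision fired):

import timeit

setup = ("import numpy as np;"
         "a = np.ones(1000000); b = np.ones(1000000); c = np.ones(1000000)")

t_expr = timeit.timeit("r = a + b + c", setup=setup, number=100)
t_inplace = timeit.timeit("r = b + c; r += a", setup=setup, number=100)

# with elision working, the two timings should be close
print("expression: %.3fs  explicit in-place: %.3fs" % (t_expr, t_inplace))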

cheers,
Julian


On 30.09.2016 15:38, Julian Taylor wrote:

> [original proposal quoted in full; snipped]


Re: automatically avoiding temporary arrays

Benjamin Root
What's the timeline for the next release? I have the perfect use case for this (a Haversine calculation on large arrays that takes up ~33% of one of my processing scripts). However, to test it out, I have a huge dependency mess to wade through first, and there are no resources devoted to that project for at least a few weeks. I want to make sure I get feedback to y'all.
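
For context, a generic vectorized Haversine (a sketch, not the actual
script; the radius and argument conventions are illustrative) shows why
such code benefits: nearly every arithmetic step allocates a full-size
temporary array that elision can potentially reuse:

import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius=6371.0):
    # great-circle distance in km; inputs are arrays of radians
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return 2 * radius * np.arcsin(np.sqrt(h))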

Cheers!
Ben Root


On Sat, Feb 25, 2017 at 5:17 AM, Julian Taylor <[hidden email]> wrote:
> [Julian's announcement, quoted in full above, snipped]

