NEP 38 - Universal SIMD intrinsics


mattip
Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
version of NEP 38 [0] up for discussion. As per NEP 0, this is the next
step in the community accepting the approach laid out in the NEP. The
NEP PR [1] has already garnered a fair amount of discussion about the
viability of Universal SIMD Intrinsics, so I will try to capture some of
that here as well.


Abstract

While compilers are getting better at using hardware-specific routines
to optimize code, they sometimes do not produce optimal results. Also,
we would like to be able to copy binary optimized C-extension modules
from one machine to another with the same base architecture (x86, ARM,
PowerPC) but with different capabilities without recompiling.

We have a mechanism in the ufunc machinery to build alternative loops
indexed by CPU feature name. At import (in InitOperators), the loop
function that matches the run-time CPU info is chosen from the
candidates. This NEP proposes a mechanism to build on that for many more
features and architectures. The steps proposed are to:

     Establish a set of well-defined, architecture-agnostic, universal
intrinsics which capture features available across architectures.

     Capture these universal intrinsics in a set of C macros, and use the
macros to build code paths for sets of features from the baseline up to
the maximum set of features available on that architecture (see the
sketch after this list). Offer these as a limited number of compiled
alternative code paths.

     At runtime, discover which CPU features are available, and choose
from among the possible code paths accordingly.
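
To make steps 1 and 2 concrete: a universal intrinsic is essentially a
platform-neutral name that compile-time macros map onto the native
intrinsic for each target. The sketch below is illustrative only; the
npyv_* and NPY__HAVE_* names are hypothetical placeholders, not a
spelling the NEP mandates.

/* Hypothetical overlay macros; all names are placeholders for
 * illustration, not the NEP's actual API. */
#if defined(NPY__HAVE_AVX2)             /* x86 with AVX2 */
    #include <immintrin.h>
    typedef __m256 npyv_f32;            /* 8 x float32 per vector */
    #define npyv_load_f32(p)     _mm256_loadu_ps(p)
    #define npyv_add_f32(a, b)   _mm256_add_ps(a, b)
    #define npyv_store_f32(p, v) _mm256_storeu_ps(p, v)
#elif defined(NPY__HAVE_NEON)           /* ARM with NEON */
    #include <arm_neon.h>
    typedef float32x4_t npyv_f32;       /* 4 x float32 per vector */
    #define npyv_load_f32(p)     vld1q_f32(p)
    #define npyv_add_f32(a, b)   vaddq_f32(a, b)
    #define npyv_store_f32(p, v) vst1q_f32(p, v)
#endif

A loop written once against the npyv_* names can then be compiled
several times, once per feature set, yielding the alternative code
paths of step 2.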

Motivation and Scope

Traditionally, NumPy has counted on compilers to generate optimal code
specifically for the target architecture. However, few users today
compile NumPy locally for their machines. Most use binary packages,
which must provide run-time support for the lowest-common-denominator
CPU architecture. Thus NumPy cannot take advantage of the more advanced
features of modern CPUs, since they may not be available on all users'
systems. The ufunc machinery already has a loop-selection protocol based
on dtypes, so it is natural to extend it to also select an optimal loop
based on the CPU features available at runtime.

Traditionally, these features have been exposed through intrinsics which
are compiler-specific instructions that map directly to assembly
instructions. Recently there were discussions about the effectiveness of
adding more intrinsics (e.g., `gh-11113`_ for AVX optimizations for
floats). In the past, architecture-specific code was added to NumPy for
fast avx512 routines in various ufuncs, using the mechanism described
above to choose the best loop for the architecture. However, the code is
not generic and does not generalize to other architectures.

Recently, OpenCV moved to using universal intrinsics in its Hardware
Abstraction Layer (HAL), which provides a nice abstraction for common
shared Single Instruction Multiple Data (SIMD) constructs. This NEP
proposes a similar mechanism for NumPy. There are three stages to using
the mechanism (a sketch of how they fit together follows this list):


- Infrastructure is provided in the code for abstract intrinsics. The
ufunc machinery will be extended using sets of these abstract
intrinsics, so that a single ufunc will be expressed as a set of loops,
going from a minimal to a maximal set of possibly available intrinsics.


- At compile time, compiler macros and CPU detection are used to turn
the abstract intrinsics into concrete intrinsic calls. Any intrinsic
that is not available on the platform, either because the CPU does not
support it (and so it cannot be tested) or because the abstract
intrinsic has no parallel concrete intrinsic on the platform, will not
cause an error; instead, the corresponding loop is simply not produced
and not added to the set of possibilities.


- At runtime, the CPU detection code will further limit the set of loops
available, and the optimal one will be chosen for the ufunc.
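
A rough sketch of how the three stages could fit together for one loop,
under invented names (NPY__BUILD_*, npy_cpu_have and the loop functions
are all hypothetical; the real dispatch machinery will differ in
detail):

#include <stddef.h>

typedef void (*loop_t)(char **args, size_t n);

/* Hypothetical runtime CPU query and candidate loops; each SIMD
 * candidate exists only in builds whose toolchain can target it. */
extern int  npy_cpu_have(const char *feature);
extern void add_float_baseline(char **args, size_t n);
#ifdef NPY__BUILD_AVX512F
extern void add_float_avx512f(char **args, size_t n);
#endif
#ifdef NPY__BUILD_AVX2
extern void add_float_avx2(char **args, size_t n);
#endif

/* Called once at import: walk the candidates from most to least
 * capable and return the first one the running CPU supports. */
static loop_t
select_add_float_loop(void)
{
#ifdef NPY__BUILD_AVX512F
    if (npy_cpu_have("AVX512F")) {
        return add_float_avx512f;
    }
#endif
#ifdef NPY__BUILD_AVX2
    if (npy_cpu_have("AVX2")) {
        return add_float_avx2;
    }
#endif
    return add_float_baseline;   /* plain C loop, always compiled */
}

Walking from most to least capable means newer CPUs get the widest
vectors while older ones still find a working loop.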

The current NEP proposes only to use the runtime feature detection and
optimal loop selection mechanism for ufuncs. Future NEPs may propose
other uses for the proposed solution.


Usage and Impact

The end user will be able to get a list of intrinsics available for
their platform and compiler. Optionally, the user may be able to specify
which of the loops available at runtime will be used, perhaps via an
environment variable to enable benchmarking the impact of the different
loops. There should be no direct impact on naive end users: the results
of all the loops should be identical to within a small number (1-3?) of
ULPs. On the other hand, users with more powerful machines should notice
a significant performance boost.
Binary releases - wheels on PyPI and conda packages

The binaries released by this process will be larger since they include
all possible loops for the architecture. Some packagers may prefer to
limit the number of loops in order to limit the size of the binaries; we
would hope they would still support a wide range of architecture
families. Note that this problem already exists in the Intel MKL
offering, where the binary package includes an extensive set of
alternative shared objects (DLLs) for various CPU alternatives.


Source builds

See “Detailed Description” below. A source build where the packager
knows details of the target machine could theoretically produce a
smaller binary by choosing to compile only the loops needed by the
target via command line arguments.
How to run benchmarks to assess performance benefits

Adding more code that uses intrinsics will make the code harder to
maintain. Therefore, such code should only be added if it yields a
significant performance benefit. Assessing this performance benefit can
be nontrivial. To aid with this, the implementation for this NEP will
add a way to select which instruction sets can be used at runtime via
environment variables (name TBD). This ability is critical for CI code
verification.
Diagnostics

A new dictionary __cpu_features__ will be available to Python. The keys
are the feature names; the values are booleans indicating whether each
feature is available. Various new private C functions will be used
internally to query available features. These might be exposed via
specific C-extension modules for testing.
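
As a rough illustration of the C side (all names here are invented for
the sketch, not the actual private API): the detection code could fill
a table once at import, and both the private query helpers and the
Python-level __cpu_features__ dict would read from it.

#include <string.h>

/* Illustrative only: a table filled in once at import by the CPU
 * detection code and queried by name afterwards. */
typedef struct { const char *name; int available; } cpu_feature_t;

static cpu_feature_t features[] = {
    {"SSE42", 0}, {"AVX2", 0}, {"AVX512F", 0},   /* x86 */
    {"NEON", 0},                                 /* ARM */
    {"VSX", 0},                                  /* PowerPC */
};

/* Hypothetical private helper backing __cpu_features__. */
static int
npy_cpu_have(const char *name)
{
    for (size_t i = 0; i < sizeof(features) / sizeof(features[0]); i++) {
        if (strcmp(features[i].name, name) == 0) {
            return features[i].available;
        }
    }
    return 0;   /* unknown features are reported as unavailable */
}
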
Workflow for adding a new CPU architecture-specific optimization

NumPy will always have a baseline C implementation for any code that may
be a candidate for SIMD vectorization. If a contributor wants to add
SIMD support for some architecture (typically the one of most interest
to them), this is the proposed workflow:

TODO (see
https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs
to be worked out more)
Reuse by other projects

It would be nice if the universal intrinsics would be available to other
libraries like SciPy or Astropy that also build ufuncs, but that is not
an explicit goal of the first implementation of this NEP.

-----------------------------------------------------------------------------------

My biased summary of select comments from the PR:

(Raghuveer): A very similar SIMD library has been proposed for C++. Here
is the link to the details:

 1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
 2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf

There is good discussion on the minimal/common set of instructions
across architectures (which narrows down to loads, stores, arithmetic,
compare, bitwise and shuffle instructions). Based on my developer
experience so far, these instructions aren't by themselves enough to
implement and optimize NumPy ufuncs. As I pointed out earlier, I think I
would find it useful to learn the workflow of how to use instructions
that don't fit in the Universal Intrinsic framework.


(Raghuveer) gave a well-laid-out table of the currently proposed
universal intrinsics by use: load/store, reorder, operators,
conversions, arithmetic and misc [2], which led to a long response from Sayed [3] with
some sample code, demonstrating how more complex operations can be built
up from the primitives.
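
For flavor, here is a hedged sketch in the same spirit as those
examples (not Sayed's actual code from [3]): even with only load/store
and bitwise primitives, float32 absolute value can be composed by
clearing the IEEE-754 sign bit. The npyv_* names are hypothetical
placeholders, shown here over NEON.

#include <arm_neon.h>

/* Hypothetical overlay, NEON flavor; npyv_* names are placeholders. */
typedef float32x4_t npyv_f32;
#define npyv_load_f32(p)     vld1q_f32(p)
#define npyv_store_f32(p, v) vst1q_f32(p, v)
#define npyv_and_f32(a, b) \
    vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(a), \
                                    vreinterpretq_u32_f32(b)))

/* abs(x) built purely from load, bitwise-and and store: clearing
 * the sign bit. Assumes n is a multiple of the vector width (4). */
static void
abs_f32(float *dst, const float *src, int n)
{
    const npyv_f32 mask = vreinterpretq_f32_u32(vdupq_n_u32(0x7fffffffU));
    for (int i = 0; i < n; i += 4) {
        npyv_f32 v = npyv_load_f32(src + i);
        npyv_store_f32(dst + i, npyv_and_f32(v, mask));
    }
}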


(catree) mentioned the Simd Library [4] and Halide [5] and asked about
maintainability.


(Ralf) responded [6] with concerns about competent developer bandwidth
for code review. He also mentioned that our CI system currently supports
all the architectures we are targeting (x86, aarch64, s390x, ppc64le)
although some of these machines may not have the most advanced hardware
to support the latest intrinsics.


I apologize if my summary is not accurate, please correct any mistakes or
misconceptions.

----------------------------------------------------------------------------------------


Barring complete rejection of the idea here, we will be pushing forward
with PRs to implement this. Comments either on the mailing list or in
those PRs are welcome.

Matti


[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html

[1] https://github.com/numpy/numpy/pull/15228

[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336

[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718

[4] https://github.com/ermig1979/Simd

[5] https://halide-lang.org

[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991


Re: NEP 38 - Universal SIMD intrinsics

Daniele Nicolodi
On 04-02-2020 08:08, Matti Picus wrote:
> Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
> version of NEP 38 [0] up for discussion. As per NEP 0, this is the next
> step in the community accepting the approach laid out in the NEP. The
> NEP PR [1] has already garnered a fair amount of discussion about the
> viability of Universal SIMD Intrinsics, so I will try to capture some of
> that here as well.

Hello,

More interesting prior art may be found in VOLK (https://www.libvolk.org).
VOLK is developed mainly for use in GNU Radio, and this is reflected in
the available kernels and the supported data types; I think the
approach used there may be of interest.

Cheers,
Dan

Re: NEP 38 - Universal SIMD intrinsics

Devulapalli, Raghuveer
Hi everyone,

I know I had raised these questions in the PR, but I wanted to post them to the mailing list as well.

1) Once NumPy adds the framework and an initial set of universal intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc leads to improving performance on one architecture and worsens performance on another?

Thanks,
Raghuveer



Re: NEP 38 - Universal SIMD intrinsics

Hameer Abbasi
—snip—

> 1) Once NumPy adds the framework and an initial set of universal intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

In my opinion, if the instructions are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.

> 2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc leads to improving performance on one architecture and worsens performance on another?

I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test the PR with a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased the code by 20% for only a net 5% gain in performance, and the PR would have to be reverted.

—snip—


Re: NEP 38 - Universal SIMD intrinsics

ralfgommers


On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <[hidden email]> wrote:
—snip—

> 1) Once NumPy adds the framework and an initial set of universal intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

In my opinion, if the instructions are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.

I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework.

This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?


> 2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc leads to improving performance on one architecture and worsens performance on another?

This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged.

Only once there are existing universal intrinsics that are then tweaked will we have to be much more careful, I'd think.

Cheers,
Ralf

 


—snip—

Re: NEP 38 - Universal SIMD intrinsics

mattip

On 11/2/20 7:16 am, Ralf Gommers wrote:

>
>
> On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     —snip—
>
>     > 1) Once NumPy adds the framework and initial set of Universal Intrinsic, if
>     contributors want to leverage a new architecture specific SIMD
>     instruction, will they be expected to add software implementation
>     of this instruction for all other architectures too?
>
>     In my opinion, if the instructions are lower, then yes. For
>     example, one cannot add AVX-512 without adding, for example adding
>     AVX-256 and AVX-128 and SSE*.  However, I would not expect one
>     person or team to be an expert in all assemblies, so intrinsics
>     for one architecture can be developed independently of another.
>
>
> I think this doesn't quite answer the question. If I understand
> correctly, it's about a single instruction (e.g. one needs "VEXP2PD"
> and it's missing from the supported AVX512 instructions in master). I
> think the answer is yes, it needs to be added for other architectures
> as well. Otherwise, if universal intrinsics are added ad-hoc and
> there's no guarantee that a universal instruction is available for all
> main supported platforms, then over time there won't be much that's
> "universal" about the framework.
>
> This is a different question though from adding a new ufunc
> implementation. I would expect accelerating ufuncs via intrinsics that
> are already supported to be much more common than having to add new
> intrinsics. Does that sound right?
Yes. Universal intrinsics are cross-platform. However, the NEP is open
to the possibility that certain architectures may have SIMD intrinsics
that cannot be expressed in terms of intrinsics for other platforms, and
so there may be a use case for architecture-specific loops. This is
explicitly stated in the latest PR to the NEP: "If the regression is
not minimal, we may choose to keep the X86-specific code for that
platform and use the universal intrinsic code for other platforms."

>
>     > 2) On whom does the burden lie to ensure that new
>     implementations are benchmarked and show benefits on every
>     architecture? What happens if optimizing a ufunc leads to
>     improving performance on one architecture and worsens performance
>     on another?
>
>
> This is slightly hard to provide a recipe for. I suspect it may take a
> while before this becomes an issue, since we don't have much SIMD code
> to begin with. So adding new code with benchmarks will likely show
> improvements on all architectures (we should ensure benchmarks can be
> run via CI, otherwise it's too onerous). And if not and it's not
> easily fixable, the problematic platform could be skipped so
> performance there is unchanged.


On HEAD, out of the 89 ufuncs in
numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific
simd loops:


>>> [x for x in defdict.keys()
...     if any([td.simd for td in defdict[x].type_descriptions])]
['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal',
'absolute', 'negative', 'greater', 'greater_equal', 'less',
'less_equal', 'equal', 'not_equal', 'logical_and', 'logical_not',
'logical_or', 'maximum', 'minimum', 'bitwise_and', 'bitwise_or',
'bitwise_xor', 'invert', 'left_shift', 'right_shift', 'cos', 'sin',
'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']


They would be the first targets for universal intrinsics. Of these, I
estimate that the ones with more than one loop for at least one dtype
signature would be the most difficult, since they have different
optimizations for avx2, fma, and/or avx512f:


['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt',
'ceil', 'trunc', 'floor', 'rint']


The other 55 ufuncs, for completeness, are


['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power',
'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip',
'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees',
'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin',
'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2',
'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2',
'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite',
'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp',
'gcd', 'lcm', 'matmul']


As for testing accuracy: we recently added a framework for testing ULP
variation of ufuncs against "golden results" in
numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp,
log, cos, sin. Others may be tested elsewhere by specific tests; for
instance, numpy/core/tests/test_half.py has test_half_ufuncs.


It is difficult to do benchmarking on CI: the machines that run CI vary
too much. We would need to set aside a machine for this and carefully
set it up to keep CPU speed and temperature constant. We do have
benchmarks for ufuncs (they could always be improved). I think Pauli
runs the benchmarks carefully on X86, and may even make the results
public, but that resource is not really on PR reviewers' radar. We could
run benchmarks on the gcc build farm machines for other architectures.
Those machines are shared but not heavily utilized.


> Only once there's existing universal intrinsics and then they're
> tweaked will we have to be much more careful I'd think.
>
>
>
>     I would look at this from a maintainability point of view. If we
>     are increasing the code size by 20% for a certain ufunc, there
>     must be a demonstrable 20% increase in performance on any CPU.
>     That is to say, micro-optimisation will be unwelcome, and code
>     readability will be preferable. Usually we ask the submitter of
>     the PR to test the PR with a machine they have on hand, and I
>     would be inclined to keep this trend of self-reporting. Of course,
>     if someone else came along and reported a performance regression
>     of, say, 10%, then we have increased code by 20%, with only a net
>     5% gain in performance, and the PR will have to be reverted.
>
>     —snip—
>

I think we should be careful not to increase the reviewer burden, and
try to automate as much as possible. It would be nice if we could at
some point set up a set of bots that can be triggered to run benchmarks
for us and report the results in the PR.


Matti


Re: NEP 38 - Universal SIMD intrinsics

Devulapalli, Raghuveer

>> I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well.

 

That adds a lot of overhead to writing SIMD-based optimizations, which can discourage contributors. It's also unreasonable to expect a developer to be familiar with the SIMD instructions of all architectures. On top of that, the performance implications aren't clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.

 


Re: NEP 38 - Universal SIMD intrinsics

ralfgommers


On Tue, Feb 11, 2020 at 12:03 PM Devulapalli, Raghuveer <[hidden email]> wrote:

>> I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well.

 

That adds a lot of overhead to writing SIMD-based optimizations, which can discourage contributors.


Keep in mind that a new universal intrinsic is just a bunch of defines. That is way less work than writing a ufunc that uses that instruction. We can also ping a platform expert in case it's not obvious what the corresponding arch-specific instruction is - that's a bit of a chicken-and-egg problem; once we get going we will hopefully attract more interested people who can help each other out.
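
For example, a hypothetical universal square root could mostly amount
to one define per platform; the exact names below are made up for
illustration, and the per-platform mappings are precisely the part a
platform expert would confirm:

/* Hypothetical new universal intrinsic: one define per platform. */
#if defined(NPY__HAVE_AVX512F)
    #define npyv_sqrt_f32(a)  _mm512_sqrt_ps(a)
#elif defined(NPY__HAVE_NEON)           /* AArch64 NEON */
    #define npyv_sqrt_f32(a)  vsqrtq_f32(a)
#elif defined(NPY__HAVE_VSX)            /* PowerPC VSX */
    #define npyv_sqrt_f32(a)  vec_sqrt(a)
#endif
/* No #else clause: on a platform with no mapping, any loop using
 * npyv_sqrt_f32 is simply not built. */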
 

It's also unreasonable to expect a developer to be familiar with the SIMD instructions of all architectures. On top of that, the performance implications aren't clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.


I think you are worrying about writing ufuncs here, not about adding an instruction. If the same result is not produced, we have CI that should fail - and if it does, we can deal with that by (if it's not easy to figure out) making that platform fall back to the generic non-SIMD version of the ufunc.

Cheers,
Ralf

 

 


Re: NEP 38 - Universal SIMD intrinsics

mattip
On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
>
> On top of that, the performance implications aren't clear. Software
> implementations of hardware instructions might perform worse and might
> not even produce the same result.
>

The proposal for universal intrinsics does not enable replacing an
intrinsic on one platform with a software emulation on another: the
intrinsics are meant to be compile-time defines that overlay the
universal intrinsic with a platform-specific one. In order to use a new
intrinsic, it must have parallel intrinsics on the other platforms, or
it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return
false, so the compiler will not even build a loop for that platform. I
will try to clarify that intention in the NEP.
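
A hedged sketch of that gating, assuming (with invented names) a
per-feature build flag and a runtime query; when the toolchain cannot
target the feature, the guard is a constant zero and the SIMD path is
never compiled at all:

#include <math.h>

/* Sketch only; the real spellings in the NEP may differ. */
#ifdef NPY__BUILD_AVX512F
    int  npy_cpu_have_avx512f(void);                  /* runtime query */
    void exp_avx512f(double *, const double *, int);  /* SIMD kernel   */
    #define NPY_CPU_HAVE(f)  npy_cpu_have_##f()
#else
    #define NPY_CPU_HAVE(f)  0   /* always false: the branch is dead */
#endif

void exp_loop(double *dst, const double *src, int n)
{
#ifdef NPY__BUILD_AVX512F
    if (NPY_CPU_HAVE(avx512f)) {   /* is the feature on this CPU? */
        exp_avx512f(dst, src, n);
        return;
    }
#endif
    for (int i = 0; i < n; i++) {
        dst[i] = exp(src[i]);      /* baseline C path */
    }
}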


I hope there will not be a demand to use many non-universal intrinsics
in ufuncs; we will need to work this out on a case-by-case basis in each
ufunc. Does that sound reasonable? Are there intrinsics you have already
used that have no parallel on other platforms?


Matti


Re: NEP 38 - Universal SIMD intrinsics

Charles R Harris


On Wed, Feb 12, 2020 at 12:19 AM Matti Picus <[hidden email]> wrote:
—snip—


Intrinsics are not an irreversible change; they are, after all, private. The question is whether they are sufficiently useful to justify the time spent on them. I don't think we will know that until we attempt actual implementations. There will probably be some changes as a result of experience, but that is normal.

Chuck


Re: NEP 38 - Universal SIMD intrinsics

Devulapalli, Raghuveer
>> I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?

I think that is reasonable. It's hard to anticipate the future need for and benefit of specialized intrinsics, but I tried to make a list of some of the specialized intrinsics that are currently in use in NumPy that I don't believe exist on other platforms (most of these actually don't exist on AVX2 either). I am not an expert in ARM or VSX architecture, so please correct me if I am wrong.

a. _mm512_mask_i32gather_ps
b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
d. _mm512_getexp_ps
e. _mm512_getmant_ps
f. _mm512_scalef_ps
g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
i. _mm512_permute_ps/_mm512_permute_pd
j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little google search I did, it seems like power ISA doesn’t have a vectorized sqrt instruction)

Software implementations of these instructions are definitely possible, but some of them are not trivial to implement and are surely not going to be one-line macros either. I am also unsure of the performance implications, but we will hopefully find out once we convert these to universal intrinsics and benchmark.
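
To make the "not a one-line macro" point concrete, here is a scalar
stand-in for roughly what _mm512_scalef_ps computes per lane,
x * 2**floor(y), ignoring the instruction's special-case rules for NaN,
infinities and denormals, which a faithful emulation would also have to
match:

#include <math.h>

/* Approximate per-lane meaning of VSCALEFPS; special cases omitted.
 * A real vector emulation on NEON or VSX would need this, the
 * special-case handling, and a lane-wise loop or shuffle sequence. */
static float
scalef_scalar(float x, float y)
{
    return ldexpf(x, (int)floorf(y));
}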

Raghuveer


Re: NEP 38 - Universal SIMD intrinsics

Jerome Kieffer
On Wed, 12 Feb 2020 19:36:10 +0000
"Devulapalli, Raghuveer" <[hidden email]> wrote:


> j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little google search I did, it seems like power ISA doesn’t have a vectorized sqrt instruction)

Hi,
starting with Power7 (we are at Power9), sqrt is available in both single and double precision:

https://www.ibm.com/support/knowledgecenter/SSGH2K_12.1.0/com.ibm.xlc121.aix.doc/compiler_ref/vec_sqrt.html

Cheers,

--
Jérôme Kieffer
tel +33 476 882 445

Re: NEP 38 - Universal SIMD intrinsics

ralfgommers


On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer <[hidden email]> wrote:
—snip—

Software implementations of these instructions are definitely possible, but some of them are not trivial to implement and are surely not going to be one-line macros either. I am also unsure of the performance implications, but we will hopefully find out once we convert these to universal intrinsics and benchmark.

For these it seems like we don't want software implementations of the universal intrinsics - if there's no equivalent on PPC/ARM and there's enough value (performance gain given additional code complexity) in the additional AVX instructions, then we should still simply use AVX instructions directly.

Ralf

