

Greetings,

I have a PR that warrants discussion according to @seberg. See https://github.com/numpy/numpy/pull/14278.

It is an enhancement that fixes a bug. The original bug is that when using the 'fd' estimator on a dataset with a small interquartile range and large outliers, the current codebase produces more bins than memory allows. There are several related bug reports (see #11879, #10297, #8203).

In terms of scope, I restricted my changes to conditions where np.histogram(bins='auto') defaults to 'fd'. For the actual fix, I enhanced the API, using a suggestion from @eric-wieser to merge empty histogram bins. In practice this solves the outsized bin-count issue. However, @seberg is concerned that extending the API in this way may not be the way to go: if you use "auto" once and then reuse the bins, the uneven bins may not be what you want.

Furthermore, @eric-wieser is concerned that there may be a floating-point devil in the details. He advocates using the hypothesis testing package to increase our confidence that the implementation adequately handles corner cases.
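To see how the 'fd' rule blows up, here is a self-contained sketch (my own illustration, not code from the PR; the dataset is made up) that computes the bin count the rule implies without ever allocating it:

```python
import numpy as np

# A dataset with a tiny interquartile range plus one huge outlier.
data = np.r_[np.linspace(1.0, 1.0 + 1e-9, 1000), 1e9]

# The Freedman-Diaconis rule sets the bin width to 2 * IQR / n**(1/3);
# a tiny IQR combined with a huge range implies an astronomical bin count.
iqr = np.subtract(*np.percentile(data, [75, 25]))
width = 2 * iqr / len(data) ** (1 / 3)

# Number of bins the rule asks for over the full data range --
# far more than np.histogram(data, bins='fd') could ever allocate.
n_bins = (data.max() - data.min()) / width
print(f"bins requested by 'fd': {n_bins:.3g}")
```

Calling np.histogram(data, bins='fd') on such data is what triggers the memory errors in the linked issues.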
I would like to do my part in improving the code base. I don't have strong opinions, but I have to admit that I would like to eventually land a PR that resolves these bugs. This has been a PR half a year in the making, after all.
Thoughts?
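For concreteness, here is a minimal sketch of the bin-merging idea mentioned above (my own illustration, not the code in the PR): starting from regular edges, drop every interior edge that separates two empty bins, so runs of empty bins collapse into single wide bins.

```python
import numpy as np

def merge_empty_bins(counts, edges):
    """Drop interior edges that separate two empty bins (illustration only)."""
    keep = [0]
    for i in range(1, len(edges) - 1):
        # An interior edge is redundant when the bins on both of its
        # sides are empty; dropping it merges the empty run into one bin.
        if counts[i - 1] > 0 or counts[i] > 0:
            keep.append(i)
    keep.append(len(edges) - 1)
    return edges[keep]

data = np.array([0.0, 0.1, 0.2, 9.8, 10.0])
counts, edges = np.histogram(data, bins=10)   # 8 empty middle bins
new_edges = merge_empty_bins(counts, edges)   # [0., 1., 9., 10.]
```

The result is necessarily an irregular set of edges, which is exactly the API question under discussion.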
areeves87
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion


1. Though the PR is limited to the 'auto' kwarg, the issue of potential memory problems for the automated binning methods is a more general one (e.g. #15332).
2. The main concern that jumps out to me is downstream users who rely on the implicit assumption of regular binning. This is of course bad practice, and it makes even less sense when using one of the bin estimators, so I'm not sure how big a concern it is. However, there is likely downstream user code that relies on the regular-binning assumption, especially since, as far as I know, NumPy has never implemented binning techniques that return irregular bins.
3. The astropy project has at least one estimator that returns irregular bins. I checked for issues related to irregular binning: though they have many of the same problems with the automatic bin estimators (i.e. memory problems for inputs with outliers), I didn't see anything specifically related to irregular binning.

I just wanted to add my two cents. The binning-data-with-outliers problem is very common in high-resolution spectroscopy, and I have seen practitioners rely on the assumption of regular binning (e.g. divide the `range` by the number of bins) to specify bin centers, even though this is not the right way to do things.
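To make that concrete, here is a small illustration (my own, with hypothetical edges) of why bin centers must be derived from the edges themselves once bins are irregular:

```python
import numpy as np

edges = np.array([0.0, 1.0, 9.0, 10.0])       # irregular bin edges

# Correct: derive centers from the actual edges.
centers = 0.5 * (edges[:-1] + edges[1:])      # [0.5, 5.0, 9.5]

# Wrong for irregular bins: the "divide the range by the number of
# bins" shortcut silently yields different centers for the outer bins.
lo, hi, n = edges[0], edges[-1], len(edges) - 1
shortcut = lo + (np.arange(n) + 0.5) * (hi - lo) / n
```

The shortcut only agrees with the true centers when the bins are uniform, which is precisely the assumption irregular-bin estimators would break.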
Thanks for taking the time to write up your work!


Send NumPy-Discussion mailing list submissions to
[hidden email]
To subscribe or unsubscribe via the World Wide Web, visit
https://mail.python.org/mailman/listinfo/numpy-discussion
or, via email, send a message with subject or body 'help' to
[hidden email]
You can reach the person managing the list at
[hidden email]
When replying, please edit your Subject line so it is more specific
than "Re: Contents of NumPyDiscussion digest..."
Today's Topics:

   1. Proposal - extend histograms api to allow uneven bins
      (Alexander Reeves)
   2. Re: NEP 38 - Universal SIMD intrinsics (Ralf Gommers)
   3. Re: NEP 38 - Universal SIMD intrinsics (Matti Picus)

Message: 1
Date: Mon, 10 Feb 2020 18:07:40 -0800
From: Alexander Reeves <[hidden email]>
To: [hidden email]
Subject: [Numpy-discussion] Proposal - extend histograms api to allow
        uneven bins
Message-ID:

Message: 2
Date: Mon, 10 Feb 2020 23:16:44 -0600
From: Ralf Gommers <[hidden email]>
To: Discussion of Numerical Python <[hidden email]>
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
Message-ID:
        <CABL7CQjzqoZ5Zmo=+[hidden email]>
Content-Type: text/plain; charset="utf-8"
On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <[hidden email]>
wrote:
> [snip]
>
> > 1) Once NumPy adds the framework and initial set of Universal Intrinsics,
> if contributors want to leverage a new architecture-specific SIMD
> instruction, will they be expected to add software implementations of this
> instruction for all other architectures too?
>
> In my opinion, if the instructions are lower, then yes. For example, one
> cannot add AVX512 without, for example, adding AVX256 and AVX128
> and SSE*. However, I would not expect one person or team to be an expert
> in all assemblies, so intrinsics for one architecture can be developed
> independently of another.
>
>
I think this doesn't quite answer the question. If I understand correctly,
it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing
from the supported AVX512 instructions in master). I think the answer is
yes, it needs to be added for other architectures as well. Otherwise, if
universal intrinsics are added ad-hoc and there's no guarantee that a
universal instruction is available for all main supported platforms, then
over time there won't be much that's "universal" about the framework.
This is a different question though from adding a new ufunc implementation.
I would expect accelerating ufuncs via intrinsics that are already
supported to be much more common than having to add new intrinsics. Does
that sound right?
> > 2) On whom does the burden lie to ensure that new implementations are
> benchmarked and show benefits on every architecture? What happens if
> optimizing a ufunc leads to improving performance on one architecture and
> worsens performance on another?
>
This is slightly hard to provide a recipe for. I suspect it may take a
while before this becomes an issue, since we don't have much SIMD code to
begin with. So adding new code with benchmarks will likely show
improvements on all architectures (we should ensure benchmarks can be run
via CI, otherwise it's too onerous). And if not and it's not easily
fixable, the problematic platform could be skipped so performance there is
unchanged.
Only once there are existing universal intrinsics and they are being
tweaked will we have to be much more careful, I'd think.
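For what it's worth, the kind of self-reported timing discussed in this thread might look like the following minimal sketch (illustrative only; a real comparison would use the asv benchmark suite on a quiet machine):

```python
import timeit

import numpy as np

# Rough micro-benchmark of a single ufunc, the kind a PR author
# might self-report before and after a SIMD change.
x = np.random.default_rng(0).random(1_000_000, dtype=np.float32)

# Best-of-several timing reduces noise from other processes.
per_call = min(timeit.repeat(lambda: np.exp(x), number=20, repeat=5)) / 20
print(f"np.exp on 1e6 float32 elements: {per_call * 1e3:.3f} ms")
```

Running the same snippet on the PR branch and on master gives the before/after numbers a reviewer would want to see.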
Cheers,
Ralf
>
> I would look at this from a maintainability point of view. If we are
> increasing the code size by 20% for a certain ufunc, there must be a
> demonstrable 20% increase in performance on any CPU. That is to say,
> micro-optimisation will be unwelcome, and code readability will be
> preferable. Usually we ask the submitter of the PR to test the PR with a
> machine they have on hand, and I would be inclined to keep this trend of
> self-reporting. Of course, if someone else came along and reported a
> performance regression of, say, 10%, then we have increased code by 20%,
> with only a net 5% gain in performance, and the PR will have to be reverted.
>
> [snip]
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

Message: 3
Date: Tue, 11 Feb 2020 08:53:18 +0200
From: Matti Picus <[hidden email]>
To: [hidden email]
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=utf-8; format=flowed
On 11/2/20 7:16 am, Ralf Gommers wrote:
>
>
> On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi
> <[hidden email] <mailto:[hidden email]>> wrote:
>
> [snip]
>
> > 1) Once NumPy adds the framework and initial set of Universal Intrinsics, if
> contributors want to leverage a new architecture-specific SIMD
> instruction, will they be expected to add software implementations
> of this instruction for all other architectures too?
>
> In my opinion, if the instructions are lower, then yes. For
> example, one cannot add AVX512 without, for example, adding
> AVX256 and AVX128 and SSE*. However, I would not expect one
> person or team to be an expert in all assemblies, so intrinsics
> for one architecture can be developed independently of another.
>
>
> I think this doesn't quite answer the question. If I understand
> correctly, it's about a single instruction (e.g. one needs
> "VEXP2PD" and it's missing from the supported AVX512 instructions in
> master). I think the answer is yes, it needs to be added for other
> architectures as well. Otherwise, if universal intrinsics are added
> ad-hoc and there's no guarantee that a universal instruction is
> available for all main supported platforms, then over time there won't
> be much that's "universal" about the framework.
> 
> 
> This is a different question though from adding a new ufunc
> implementation. I would expect accelerating ufuncs via intrinsics that
> are already supported to be much more common than having to add new
> intrinsics. Does that sound right?
> 
Yes. Universal intrinsics are cross-platform. However, the NEP is open
to the possibility that certain architectures may have SIMD intrinsics
that cannot be expressed in terms of intrinsics for other platforms, and
so there may be a use case for architecture-specific loops. This is
explicitly stated in the latest PR to the NEP: "If the regression is
not minimal, we may choose to keep the X86-specific code for that
platform and use the universal intrinsic code for other platforms."
> 
> 
>
>
> > 2) On whom does the burden lie to ensure that new
> implementations are benchmarked and show benefits on every
> architecture? What happens if optimizing a ufunc leads to
> improving performance on one architecture and worsens performance
> on another?
>
>
> This is slightly hard to provide a recipe for. I suspect it may take a
> while before this becomes an issue, since we don't have much SIMD code
> to begin with. So adding new code with benchmarks will likely show
> improvements on all architectures (we should ensure benchmarks can be
> run via CI, otherwise it's too onerous). And if not and it's not
> easily fixable, the problematic platform could be skipped so
> performance there is unchanged.
On HEAD, out of the 89 ufuncs in
numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific
SIMD loops:

>>> [x for x in defdict.keys()
...  if any([td.simd for td in defdict[x].type_descriptions])]
['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal',
'absolute', 'negative', 'greater', 'greater_equal', 'less',
'less_equal', 'equal', 'not_equal', 'logical_and', 'logical_not',
'logical_or', 'maximum', 'minimum', 'bitwise_and', 'bitwise_or',
'bitwise_xor', 'invert', 'left_shift', 'right_shift', 'cos', 'sin',
'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']
They would be the first targets for universal intrinsics. Of them I
estimate that the ones with more than one loop for at least one dtype
signature would be the most difficult, since these have different
optimizations for avx2, fma, and/or avx512f:
['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt',
'ceil', 'trunc', 'floor', 'rint']
The other 55 ufuncs, for completeness, are
['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power',
'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip',
'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees',
'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin',
'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2',
'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2',
'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite',
'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp',
'gcd', 'lcm', 'matmul']
As for testing accuracy: we recently added a framework for testing ULP
variation of ufuncs against "golden results" in
numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp,
log, cos, and sin. Others may be tested elsewhere by specific tests; for
instance, numpy/core/tests/test_half.py has test_half_ufuncs.
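For readers unfamiliar with the idea, here is a rough sketch of a ULP comparison against a float64 "golden" reference (my own illustration, much simpler than the framework above):

```python
import numpy as np

# Compare float32 np.exp against a float64 reference and report the
# worst error in units of last place (ULP) of the float32 format.
x = np.linspace(-10.0, 10.0, 10001)
got = np.exp(x.astype(np.float32)).astype(np.float64)
ref = np.exp(x)  # float64 "golden" reference

# np.spacing gives the size of one float32 ULP at each reference value.
ulp = np.spacing(ref.astype(np.float32)).astype(np.float64)
max_ulp_err = np.max(np.abs(got - ref) / ulp)
print(f"max ULP error of float32 exp: {max_ulp_err:.2f}")
```

A SIMD loop that trades a little accuracy for speed would show up here as a larger maximum ULP error than the scalar libm path.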
It is difficult to do benchmarking on CI: the machines that run CI vary
too much. We would need to set aside a machine for this and carefully
set it up to keep CPU speed and temperature constant. We do have
benchmarks for ufuncs (they could always be improved). I think Pauli
runs the benchmarks carefully on X86, and may even make the results
public, but that resource is not really on PR reviewers' radar. We could
run benchmarks on the GCC build farm machines for other architectures.
Those machines are shared but not heavily utilized.
> Only once there are existing universal intrinsics and they are being
> tweaked will we have to be much more careful, I'd think.
>
>
>
> I would look at this from a maintainability point of view. If we
> are increasing the code size by 20% for a certain ufunc, there
> must be a demonstrable 20% increase in performance on any CPU.
> That is to say, micro-optimisation will be unwelcome, and code
> readability will be preferable. Usually we ask the submitter of
> the PR to test the PR with a machine they have on hand, and I
> would be inclined to keep this trend of self-reporting. Of course,
> if someone else came along and reported a performance regression
> of, say, 10%, then we have increased code by 20%, with only a net
> 5% gain in performance, and the PR will have to be reverted.
>
> [snip]
>
I think we should be careful not to increase the reviewer burden, and
try to automate as much as possible. It would be nice if we could at
some point set up a set of bots that can be triggered to run benchmarks
for us and report in the PR the results.
Matti


End of NumPy-Discussion Digest, Vol 161, Issue 10
*************************************************

