ticket #605

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

ticket #605

Jarrod Millman
Hello,

I just turned this one into a blocker for now.  There has been a very
long and good discussion about this ticket:
http://projects.scipy.org/scipy/numpy/ticket/605

Could someone (David?, Bruce?) briefly summarize the problem and the
current proposed solution for us again?  Let's agree on the problem
and the solution.  I want to have something similiar to what is
written about median for this release:
http://projects.scipy.org/scipy/numpy/milestone/1.0.5

I agree with David's sentiment:  "This issue has been raised a number
of times since I follow this ML. It's not the first time I've proposed
patches, and I've already documented the weird behavior only to see
the comments disappear after a while. I hope this time some kind of
agreement will be reached."

If you give me the short summary I will make sure Travis or Eric
respond (and I will put it in the release notes).

Thanks,

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: ticket #605

Bruce Southey
Jarrod Millman wrote:

> Hello,
>
> I just turned this one into a blocker for now.  There has been a very
> long and good discussion about this ticket:
> http://projects.scipy.org/scipy/numpy/ticket/605
>
> Could someone (David?, Bruce?) briefly summarize the problem and the
> current proposed solution for us again?  Let's agree on the problem
> and the solution.  I want to have something similiar to what is
> written about median for this release:
> http://projects.scipy.org/scipy/numpy/milestone/1.0.5
>
> I agree with David's sentiment:  "This issue has been raised a number
> of times since I follow this ML. It's not the first time I've proposed
> patches, and I've already documented the weird behavior only to see
> the comments disappear after a while. I hope this time some kind of
> agreement will be reached."
>
> If you give me the short summary I will make sure Travis or Eric
> respond (and I will put it in the release notes).
>
> Thanks,
>
>  
Hi,
Simply put, there are actually multiple problems with the histogram
function for certain cases.

1) The initial problem was that points below the first bin are ignored:
 From Tommy Grav's email:

bin1 -> 1 to 2.99999...
bin2 -> 3 to 4.99999...
bin3 -> 5 to inf

This means there is no bin for -inf to 1 and, thus, the cause of the
initial bug report.

2) The second problem is to address how to account for any 'outliers'.
Based on the responses, David included the keyword 'discard' to handle
these.

3) The 'norm' option may be wrong but I do not have any current
understanding of this one.

Solution:
David has provided a new version of the histogram function that was
provided to the list. It also had some enhancements like an axis
keyword. However, there is a potential bug associated with the use of
the numpy.r_ function. Once that is overcome, I think that his code is
an excellent replacement for the current version. But I can understand
if this is applied to the next release.

Regards
Bruce

 
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: ticket #605

David Huard
In reply to this post by Jarrod Millman
Hello Jarrod and co.,

here is my personal version of the histogram saga.

The current version of histogram puts in the rightmost bin all values larger than range, but does not put in the leftmost bin all values smaller than bin, eg.

In [6]: histogram([1,2,3,4,5,6], bins=3, range=[2,5])
Out[6]: (array([1, 1, 3]), array([ 2.,  3.,  4.]))

It discards 1, but puts 2 in the first bin, 3 in the second bin, and 4,5,6 in the third bin.  Also, the docstring  says that outliers are put in the closest bin, which is false. Another point to consider is normalization. Currently, the normalization factor is db=bin[1]-bin[0]. Of course, if the bins are not equally spaced, this will yield a spurious density. Also, I'd argue that since the rightmost bin covers the space from bin[-1] to infinity, it's density should always be zero.

Now if someone wants to explain all that in the docstring, that's fine by me. I fully understand the need to avoid breaking people's code. I simply hope that in the next big release, this behavior can be changed to something that is simpler: bins are the bin edges (instead of the left edges), and everything outside the edges is ignored. This would be a nice occasion to add an axis keyword and possibly weights, and would make histogram consistent with histogramdd. I'm willing to implement those changes, but I don't know how to do so without breaking histogram's behavior.

I just got Bruce reply, so sorry for the overlap.

David

2008/4/9, Jarrod Millman <[hidden email]>:
Hello,

I just turned this one into a blocker for now.  There has been a very
long and good discussion about this ticket:
<a href="http://projects.scipy.org/scipy/numpy/ticket/605" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://projects.scipy.org/scipy/numpy/ticket/605

Could someone (David?, Bruce?) briefly summarize the problem and the
current proposed solution for us again?  Let's agree on the problem
and the solution.  I want to have something similiar to what is
written about median for this release:
<a href="http://projects.scipy.org/scipy/numpy/milestone/1.0.5" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://projects.scipy.org/scipy/numpy/milestone/1.0.5

I agree with David's sentiment:  "This issue has been raised a number
of times since I follow this ML. It's not the first time I've proposed
patches, and I've already documented the weird behavior only to see
the comments disappear after a while. I hope this time some kind of
agreement will be reached."

If you give me the short summary I will make sure Travis or Eric
respond (and I will put it in the release notes).

Thanks,


--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
<a href="http://cirl.berkeley.edu/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://cirl.berkeley.edu/


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: ticket #605

Tim Hochberg


On Wed, Apr 9, 2008 at 7:01 AM, David Huard <[hidden email]> wrote:
Hello Jarrod and co.,

here is my personal version of the histogram saga.

The current version of histogram puts in the rightmost bin all values larger than range, but does not put in the leftmost bin all values smaller than bin, eg.

In [6]: histogram([1,2,3,4,5,6], bins=3, range=[2,5])
Out[6]: (array([1, 1, 3]), array([ 2.,  3.,  4.]))

It discards 1, but puts 2 in the first bin, 3 in the second bin, and 4,5,6 in the third bin.  Also, the docstring  says that outliers are put in the closest bin, which is false. Another point to consider is normalization. Currently, the normalization factor is db=bin[1]-bin[0]. Of course, if the bins are not equally spaced, this will yield a spurious density. Also, I'd argue that since the rightmost bin covers the space from bin[-1] to infinity, it's density should always be zero.

Now if someone wants to explain all that in the docstring, that's fine by me. I fully understand the need to avoid breaking people's code. I simply hope that in the next big release, this behavior can be changed to something that is simpler: bins are the bin edges (instead of the left edges), and everything outside the edges is ignored. This would be a nice occasion to add an axis keyword and possibly weights, and would make histogram consistent with histogramdd. I'm willing to implement those changes, but I don't know how to do so without breaking histogram's behavior.

Here's one way which is more or less what they tend to do in the core Python to avoid breaking things.
  1. Choose a new name for histogram with the desired behavior. 'histogram1D' for example.
  2. Add the function with the new behavior to major release X and modify the old 'histogram' to produce a PendingDeprecationWarning (which by default does nothing, you need to change the warning filter to see anything).
  3. In major release X+1, change the PendingDeprecationWarning to a DeprecationWarning. Now people will start to see warnings when they use histogram.
  4. In major release X+2, rip out histogram.
So, if you got the new version into 1.1, in 1.2 it would start complaining when you used histogram and in 1.3 histogram would be gone, but the new version would be in it's place. In this way, there's no point where the behavior of histogram just changes subtly; since it disappears one is forced to figure out where it went and implement appropriate changes in ones code.


 


I just got Bruce reply, so sorry for the overlap.

David

2008/4/9, Jarrod Millman <[hidden email]>:
Hello,

I just turned this one into a blocker for now.  There has been a very
long and good discussion about this ticket:
http://projects.scipy.org/scipy/numpy/ticket/605

Could someone (David?, Bruce?) briefly summarize the problem and the
current proposed solution for us again?  Let's agree on the problem
and the solution.  I want to have something similiar to what is
written about median for this release:
http://projects.scipy.org/scipy/numpy/milestone/1.0.5

I agree with David's sentiment:  "This issue has been raised a number
of times since I follow this ML. It's not the first time I've proposed
patches, and I've already documented the weird behavior only to see
the comments disappear after a while. I hope this time some kind of
agreement will be reached."

If you give me the short summary I will make sure Travis or Eric
respond (and I will put it in the release notes).

Thanks,


--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion




--
. __
. |-\
.
. [hidden email]
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion