Ticket #605 Incorrect behavior of numpy.histogram

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey
Hi,
I have been investigating Ticket #605 'Incorrect behavior of
numpy.histogram' (http://scipy.org/scipy/numpy/ticket/605 ).

The fix for this ticket really depends on what the expectations are
for the bin limits and different applications have different behavior.
Consequently, I think that feedback from the community is important.

I have attached a modified histogram function where I use a very
simple and obvious example:
r= numpy.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin=[2,3,4]

The current (Default) behavior provides the counts as array([2, 3,
9]). Here the values less than 2 are ignored and the last bin contains
all values greater than or equal to 4.

1) Should the first bin contain all values less than or equal to the
value of the first limit and the last bin contain all values greater
than the value of the last limit?
This produced the counts as: array([3, 3, 9]) (I termed this
'Accumulate' in the output).

2) Should any values outside than the range of the bins be excluded?
That is remove any value that is smaller than the lowest value of the
bin and higher than the highest value of the bin.
This produced the counts as: array([2, 3, 4]) (I termed this 'Exclude'
in the output)

3) Should there be extra bins for these values?
While I did not implement this option, it would provide the counts as:
array([1,2,3,4,5])

4) Is there some other expectation?

Thanks for any input,
Bruce

_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

histo.py (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

James Philbin
The matlab behaviour is to extend the first bin to include all data
down to -inf and extend the last bin to handle all data to inf. This
is probably the behaviour with least suprise.

Therefor, I would vote +1 for behaviour #1 by default, +1 for keeping
the old behaviour #2 around as an option and -1 for any extending of
the bins (#3).

James
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Anne Archibald
In reply to this post by Bruce Southey
On 05/04/2008, Bruce Southey <[hidden email]> wrote:

>  1) Should the first bin contain all values less than or equal to the
>  value of the first limit and the last bin contain all values greater
>  than the value of the last limit?
>  This produced the counts as: array([3, 3, 9]) (I termed this
>  'Accumulate' in the output).
>
>  2) Should any values outside than the range of the bins be excluded?
>  That is remove any value that is smaller than the lowest value of the
>  bin and higher than the highest value of the bin.
>  This produced the counts as: array([2, 3, 4]) (I termed this 'Exclude'
>  in the output)
>
>  3) Should there be extra bins for these values?
>  While I did not implement this option, it would provide the counts as:
>  array([1,2,3,4,5])

There's also a fourth option - raise an exception if any points are
outside the range.

I hope this is a question about defaults - really what I would most
want is to have the choice, as a keyword option. For the default, I
would be tempted to go with option 4, raising an exception. This seems
pretty much guaranteed not to produce surprising results: if there's
any question about what the results should be it produces an error,
and the user can run it again with specific instructions. Plus in many
contexts having any points that don't belong in any bin is the result
of a programming error.

Anne
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Tommy Grav
In reply to this post by Bruce Southey

On Apr 5, 2008, at 2:01 PM, Bruce Southey wrote:

> Hi,
> I have been investigating Ticket #605 'Incorrect behavior of
> numpy.histogram' (http://scipy.org/scipy/numpy/ticket/605 ).

I think that my preference depends on the definition of what
the bin number means. If the bin numbers are the lower bounds
of the bins (numpy default behavior) then it would make little
sense to exclude anything above the largest bin.

I don't have access to numpy on my laptop at the moment,
so I can't remember wether numpy has a keyword for what
the bins array is defining? Having this as a keyword of
(lower,middle,upper) of the bin would be very helpful.

Cheers
   Tommy
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Hans Meine-4
In reply to this post by Anne Archibald
Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
> There's also a fourth option - raise an exception if any points are
> outside the range.

+1

I think this should be the default.  Otherwise, I tend towards "exclude", in
order to have comparable bin sizes (when plotting, I always find peaks at the
ends annoying); this could also be called "clip" BTW.

But really, an exception would follow the Zen: "In the face of ambiguity,
refuse the temptation to guess."  And with a kwarg: "Explicit is better than
implicit."

histogram(a, arange(10), outliers = "clip")
histogram(a, arange(10), outliers = "include")
# better names? "include"->"accumulate"/"map to border"/"map"/"boundary"

--
Ciao, /  /
     /--/
    /  / ANS
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

David Huard
+1 for an outlier keyword. Note, that this implies that when bins are passed explicitly, the edges are given (nbins+1), not simply the left edges (nbins).

While we are refactoring histogram, I'd suggest adding an axis keyword. This is pretty straightforward to implement using the np.apply_along_axis function.

Also, I noticed that current normalization is buggy for non-uniform bin sizes.
    if normed:
        db = bins[1] - bins[0]
        return 1.0/(a.size*db) * n, bins

Finally, whatever option is chosen in the end, we should make sure it is consistent across all histogram functions. This may mean that we will also break the behavior of histogramdd and histogram2d.

Bruce: I did some work over the weekend on the histogram function, including tests. If you want, I'll send that to you in the evening.

David




2008/4/7, Hans Meine <[hidden email]>:
Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:

> There's also a fourth option - raise an exception if any points are
> outside the range.


+1

I think this should be the default.  Otherwise, I tend towards "exclude", in
order to have comparable bin sizes (when plotting, I always find peaks at the
ends annoying); this could also be called "clip" BTW.

But really, an exception would follow the Zen: "In the face of ambiguity,
refuse the temptation to guess."  And with a kwarg: "Explicit is better than
implicit."

histogram(a, arange(10), outliers = "clip")
histogram(a, arange(10), outliers = "include")
# better names? "include"->"accumulate"/"map to border"/"map"/"boundary"


--
Ciao, /  /
     /--/
    /  / ANS

_______________________________________________
Numpy-discussion mailing list
[hidden email]
<a href="http://projects.scipy.org/mailman/listinfo/numpy-discussion" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://projects.scipy.org/mailman/listinfo/numpy-discussion


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey
Hi,
Thanks David for pointing the piece of information I forgot to add in
my original email.

-1 for 'raise an exception' because, as Dan points out, the problem
stems from user providing bins.

+1 for the outliers keyword. Should 'exclude' distinguish points that
are too low and those that are too high?

+1 for axis.

Really I was only looking at seeing what it would take to close this
bug, but I am willing to test any code.

Thanks
Bruce



On Mon, Apr 7, 2008 at 8:55 AM, David Huard <[hidden email]> wrote:

> +1 for an outlier keyword. Note, that this implies that when bins are passed
> explicitly, the edges are given (nbins+1), not simply the left edges
> (nbins).
>
> While we are refactoring histogram, I'd suggest adding an axis keyword. This
> is pretty straightforward to implement using the np.apply_along_axis
> function.
>
> Also, I noticed that current normalization is buggy for non-uniform bin
> sizes.
>     if normed:
>         db = bins[1] - bins[0]
>         return 1.0/(a.size*db) * n, bins
>
> Finally, whatever option is chosen in the end, we should make sure it is
> consistent across all histogram functions. This may mean that we will also
> break the behavior of histogramdd and histogram2d.
>
> Bruce: I did some work over the weekend on the histogram function, including
> tests. If you want, I'll send that to you in the evening.
>
> David
>
>
>
>
> 2008/4/7, Hans Meine <[hidden email]>:
>
> > Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
> >
> > > There's also a fourth option - raise an exception if any points are
> > > outside the range.
> >
> >
> > +1
> >
> > I think this should be the default.  Otherwise, I tend towards "exclude",
> in
> > order to have comparable bin sizes (when plotting, I always find peaks at
> the
> > ends annoying); this could also be called "clip" BTW.
> >
> > But really, an exception would follow the Zen: "In the face of ambiguity,
> > refuse the temptation to guess."  And with a kwarg: "Explicit is better
> than
> > implicit."
> >
> > histogram(a, arange(10), outliers = "clip")
> > histogram(a, arange(10), outliers = "include")
> > # better names? "include"->"accumulate"/"map to border"/"map"/"boundary"
> >
> >
> > --
> > Ciao, /  /
> >      /--/
> >     /  / ANS
> >
> > _______________________________________________
> > Numpy-discussion mailing list
> > [hidden email]
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
>
>
> _______________________________________________
>  Numpy-discussion mailing list
>  [hidden email]
>  http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Loïc BERTHE
+1 for axis and +1 for a keyword to define what to do with values
outside the range.

For the keyword, ather than 'outliers', I would propose 'discard' or
'exclude', because it could be used to describe the four
possibilities :
  - discard='low'      => values lower than the range are discarded,
values higher are added to the last bin
   - discard='up'       => values higher than the range are discarded,
values lower are added to the first bin
   - discard='out'      => values out of the range are discarded
   - discard=None    => values outside of this range are allocated to
the closest bin

For the default behavior, most of the case, the sum of the bins 's
population should be equal to the size of the original one for me, so
I would prefer discard=None. But I'm also okay with discard='low' in
order not to break older code, if this is clearly stated.

--
LB
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Tommy Grav
On Apr 7, 2008, at 4:14 PM, LB wrote:

> +1 for axis and +1 for a keyword to define what to do with values
> outside the range.
>
> For the keyword, ather than 'outliers', I would propose 'discard' or
> 'exclude', because it could be used to describe the four
> possibilities :
>  - discard='low'      => values lower than the range are discarded,
> values higher are added to the last bin
>   - discard='up'       => values higher than the range are discarded,
> values lower are added to the first bin
>   - discard='out'      => values out of the range are discarded
>   - discard=None    => values outside of this range are allocated to
> the closest bin
>
> For the default behavior, most of the case, the sum of the bins 's
> population should be equal to the size of the original one for me, so
> I would prefer discard=None. But I'm also okay with discard='low' in
> order not to break older code, if this is clearly stated.

It seems that people in this discussion are forgetting that the bins
are actually defined by the lower boundaries supplied, such that

bins = [1,3,5]

actually currently means

bin1 -> 1 to 2.99999...
bin2 -> 3 to 4.99999...
bin3 -> 5 to inf

(of course in version 1.0.1 the documentation is inconsistent with the
behavior as described by the original poster). This definition of bins
makes it hard to exclude values as it forces the user to give an extra
value in the bin definition, i.e. the bins statement above only give two
bins, while supplying three values. That seems confusing to me.

I am not sure what the right approach is, but currently using range will
clip the values outside the number the user wants.

Cheers
   Tommy




_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

David Huard

On Apr 7, 2008, at 4:14 PM, LB wrote:
> +1 for axis and +1 for a keyword to define what to do with values
> outside the range.
>
> For the keyword, ather than 'outliers', I would propose 'discard' or
> 'exclude', because it could be used to describe the four
> possibilities :
>  - discard='low'      => values lower than the range are discarded,
> values higher are added to the last bin
>   - discard='up'       => values higher than the range are discarded,
> values lower are added to the first bin
>   - discard='out'      => values out of the range are discarded
>   - discard=None    => values outside of this range are allocated to
> the closest bin
>


Suppose you set bins=5, range=[0,10], discard=None, should the returned bins be [0,2,4,6,810] or [-inf, 2, 4, 6, 8, inf] ?
Now suppose normed=True, what should be the density for the first and last bin ? It seems to me it should be zero since we are assuming that the bins extend to -infinity and infinity, but then, taking the outliers into account seems pretty useless.

Overall, I think "discard" is a confusing option with little added value. Getting the outliers is simply a matter of defining the bin edges explictly, ie [-inf, x0, x1, ..., xn, inf].

In any case, attached is a version of histogram implementing the axis and discard keywords. I'd really prefer though if we dumped the discard option.

David




_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

histo2.py (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Hans Meine-4
In reply to this post by Hans Meine-4
Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:

> Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
> > There's also a fourth option - raise an exception if any points are
> > outside the range.
>
> +1
>
> I think this should be the default.  Otherwise, I tend towards "exclude",
> in order to have comparable bin sizes (when plotting, I always find peaks
> at the ends annoying); this could also be called "clip" BTW.
>
> But really, an exception would follow the Zen: "In the face of ambiguity,
> refuse the temptation to guess."  And with a kwarg: "Explicit is better
> than implicit."

When posting this, I did indeed not think this through fully; as David (and
Tommy) pointed out, this API does not fit well with the existing `bins`
option, especially when a sequence of bin bounds is given.  (I guess I was
mostly thinking about the special case of discrete values and 1:1 bins, as
typical for uint8 data.)

Thus, I would like to withdraw my above opinion from and instead state that I
find the current API as clear as it gets.  If you want to exclude values,
simply pass an additional right bound, and for including outliers,
passing -inf as additional left bound seems to do the trick.  This could be
possibly added to the documentation though.

The only critical aspect I see is the `normed` arg.  As it is now, the
rightmost bin has always infinite size, but it is not treated like that:

In [1]: from numpy import *

In [2]: histogram(arange(10), [2,3,4], normed = True)
Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))

Even worse, if you try to add an infinite bin to the left, this pulls all
values to zero (technically, I understand that, but it looks really
undesirable to me):

In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))

--
Ciao, /  /
     /--/
    /  / ANS
_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

David Huard
Hans,

Note that the current histogram is buggy, in the sense that it assumes that all bins have the same width and computes db = bins[1]-bin[0]. This is why you get zeros everywhere.

The current behavior has been heavily criticized and I think we should change it. My proposal is to have for histogram the same behavior as for histogramdd and histogram2d: bins are the bin edges, including the rightmost bin, and values outside of the bins are not tallied. The problem with this is that it breaks code, and I'm not sure it's such a good idea to do this in a point release.

My short term proposal would be to fix the normalization bug and document the current behavior of histogram for the 1.0.5 release. Once it's done, we can modify histogram and maybe print a warning the first time it's used to notice users of the change.

I'd like to hear the voice of experienced devs on this. This issue has been raised a number of times since I follow this ML. It's not the first time I've proposed patches, and I've already documented the weird behavior only to see the comments disappear after a while. I hope this time some kind of agreement will be reached.

Regards,

David




2008/4/8, Hans Meine <[hidden email]>:
Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:

> Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
> > There's also a fourth option - raise an exception if any points are
> > outside the range.
>
> +1
>
> I think this should be the default.  Otherwise, I tend towards "exclude",
> in order to have comparable bin sizes (when plotting, I always find peaks
> at the ends annoying); this could also be called "clip" BTW.
>
> But really, an exception would follow the Zen: "In the face of ambiguity,
> refuse the temptation to guess."  And with a kwarg: "Explicit is better
> than implicit."


When posting this, I did indeed not think this through fully; as David (and
Tommy) pointed out, this API does not fit well with the existing `bins`
option, especially when a sequence of bin bounds is given.  (I guess I was
mostly thinking about the special case of discrete values and 1:1 bins, as
typical for uint8 data.)

Thus, I would like to withdraw my above opinion from and instead state that I
find the current API as clear as it gets.  If you want to exclude values,
simply pass an additional right bound, and for including outliers,
passing -inf as additional left bound seems to do the trick.  This could be
possibly added to the documentation though.

The only critical aspect I see is the `normed` arg.  As it is now, the
rightmost bin has always infinite size, but it is not treated like that:

In [1]: from numpy import *

In [2]: histogram(arange(10), [2,3,4], normed = True)
Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))

Even worse, if you try to add an infinite bin to the left, this pulls all
values to zero (technically, I understand that, but it looks really
undesirable to me):

In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))


--
Ciao, /  /
     /--/
    /  / ANS
_______________________________________________
Numpy-discussion mailing list
[hidden email]
<a href="http://projects.scipy.org/mailman/listinfo/numpy-discussion" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://projects.scipy.org/mailman/listinfo/numpy-discussion


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey
Hi,
I agree that the current histogram should be changed. However, I am not
sure 1.0.5 is the correct release for that.

David, this doesn't work for your code:
r= np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin=[2,3,4]
rc, rb=histogram(r, bins=dbin, discard=None)

Returns:
rc=[3 3] # Really should be [3, 3, 9]
rb=[-9223372036854775808                    3 -9223372036854775808]

But I have not had time to find the error.

Regards
Bruce


David Huard wrote:

> Hans,
>
> Note that the current histogram is buggy, in the sense that it assumes
> that all bins have the same width and computes db = bins[1]-bin[0].
> This is why you get zeros everywhere.
>
> The current behavior has been heavily criticized and I think we should
> change it. My proposal is to have for histogram the same behavior as
> for histogramdd and histogram2d: bins are the bin edges, including the
> rightmost bin, and values outside of the bins are not tallied. The
> problem with this is that it breaks code, and I'm not sure it's such a
> good idea to do this in a point release.
>
> My short term proposal would be to fix the normalization bug and
> document the current behavior of histogram for the 1.0.5 release. Once
> it's done, we can modify histogram and maybe print a warning the first
> time it's used to notice users of the change.
>
> I'd like to hear the voice of experienced devs on this. This issue has
> been raised a number of times since I follow this ML. It's not the
> first time I've proposed patches, and I've already documented the
> weird behavior only to see the comments disappear after a while. I
> hope this time some kind of agreement will be reached.
>
> Regards,
>
> David
>
>
>
>
> 2008/4/8, Hans Meine <[hidden email]
> <mailto:[hidden email]>>:
>
>     Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:
>
>     > Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
>     > > There's also a fourth option - raise an exception if any
>     points are
>     > > outside the range.
>     >
>     > +1
>     >
>     > I think this should be the default.  Otherwise, I tend towards
>     "exclude",
>     > in order to have comparable bin sizes (when plotting, I always
>     find peaks
>     > at the ends annoying); this could also be called "clip" BTW.
>     >
>     > But really, an exception would follow the Zen: "In the face of
>     ambiguity,
>     > refuse the temptation to guess."  And with a kwarg: "Explicit is
>     better
>     > than implicit."
>
>
>     When posting this, I did indeed not think this through fully; as
>     David (and
>     Tommy) pointed out, this API does not fit well with the existing
>     `bins`
>     option, especially when a sequence of bin bounds is given.  (I
>     guess I was
>     mostly thinking about the special case of discrete values and 1:1
>     bins, as
>     typical for uint8 data.)
>
>     Thus, I would like to withdraw my above opinion from and instead
>     state that I
>     find the current API as clear as it gets.  If you want to exclude
>     values,
>     simply pass an additional right bound, and for including outliers,
>     passing -inf as additional left bound seems to do the trick.  This
>     could be
>     possibly added to the documentation though.
>
>     The only critical aspect I see is the `normed` arg.  As it is now, the
>     rightmost bin has always infinite size, but it is not treated like
>     that:
>
>     In [1]: from numpy import *
>
>     In [2]: histogram(arange(10), [2,3,4], normed = True)
>     Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
>
>     Even worse, if you try to add an infinite bin to the left, this
>     pulls all
>     values to zero (technically, I understand that, but it looks really
>     undesirable to me):
>
>     In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
>     Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))
>
>
>     --
>     Ciao, /  /
>          /--/
>         /  / ANS
>     _______________________________________________
>     Numpy-discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> [hidden email]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>  

_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

David Huard


2008/4/8, Bruce Southey <[hidden email]>:
Hi,
I agree that the current histogram should be changed. However, I am not
sure 1.0.5 is the correct release for that.

We both agree. 

David, this doesn't work for your code:
r= np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
dbin=[2,3,4]
rc, rb=histogram(r, bins=dbin, discard=None)
Returns:
rc=[3 3] # Really should be [3, 3, 9]
rb=[-9223372036854775808                    3 -9223372036854775808]

I used the convention that bins are the bin edges, including the right most edge, this is why len(rc) =2 and len(rb)=3.

Now there clearly is a bug, and I traced it to the use of np.r_. Check this out:

In [26]: dbin = [1,2,3]

In [27]: np.r_[-np.inf, dbin, np.inf]
Out[27]: array([-Inf,   1.,   2.,   3.,  Inf])

In [28]: np.r_[-np.inf, asarray(dbin), np.inf]
Out[28]:
array([-9223372036854775808,                    1,                    2,                          3, -9223372036854775808])

In [29]: np.r_[-np.inf, asarray(dbin).astype(float), np.inf]
Out[29]: array([-Inf,   1.,   2.,   3.,  Inf])

Is this a misuse of r_ or a bug ?


David






 

But I have not had time to find the error.

Regards
Bruce



David Huard wrote:
> Hans,
>
> Note that the current histogram is buggy, in the sense that it assumes
> that all bins have the same width and computes db = bins[1]-bin[0].
> This is why you get zeros everywhere.
>
> The current behavior has been heavily criticized and I think we should
> change it. My proposal is to have for histogram the same behavior as
> for histogramdd and histogram2d: bins are the bin edges, including the
> rightmost bin, and values outside of the bins are not tallied. The
> problem with this is that it breaks code, and I'm not sure it's such a
> good idea to do this in a point release.
>
> My short term proposal would be to fix the normalization bug and
> document the current behavior of histogram for the 1.0.5 release. Once
> it's done, we can modify histogram and maybe print a warning the first
> time it's used to notice users of the change.
>
> I'd like to hear the voice of experienced devs on this. This issue has
> been raised a number of times since I follow this ML. It's not the
> first time I've proposed patches, and I've already documented the
> weird behavior only to see the comments disappear after a while. I
> hope this time some kind of agreement will be reached.
>
> Regards,
>
> David
>
>
>
>
> 2008/4/8, Hans Meine <[hidden email]

> <mailto:[hidden email]>>:

>
>     Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:
>
>     > Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
>     > > There's also a fourth option - raise an exception if any
>     points are
>     > > outside the range.
>     >
>     > +1
>     >
>     > I think this should be the default.  Otherwise, I tend towards
>     "exclude",
>     > in order to have comparable bin sizes (when plotting, I always
>     find peaks
>     > at the ends annoying); this could also be called "clip" BTW.
>     >
>     > But really, an exception would follow the Zen: "In the face of
>     ambiguity,
>     > refuse the temptation to guess."  And with a kwarg: "Explicit is
>     better
>     > than implicit."
>
>
>     When posting this, I did indeed not think this through fully; as
>     David (and
>     Tommy) pointed out, this API does not fit well with the existing
>     `bins`
>     option, especially when a sequence of bin bounds is given.  (I
>     guess I was
>     mostly thinking about the special case of discrete values and 1:1
>     bins, as
>     typical for uint8 data.)
>
>     Thus, I would like to withdraw my above opinion from and instead
>     state that I
>     find the current API as clear as it gets.  If you want to exclude
>     values,
>     simply pass an additional right bound, and for including outliers,
>     passing -inf as additional left bound seems to do the trick.  This
>     could be
>     possibly added to the documentation though.
>
>     The only critical aspect I see is the `normed` arg.  As it is now, the
>     rightmost bin has always infinite size, but it is not treated like
>     that:
>
>     In [1]: from numpy import *
>
>     In [2]: histogram(arange(10), [2,3,4], normed = True)
>     Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
>
>     Even worse, if you try to add an infinite bin to the left, this
>     pulls all
>     values to zero (technically, I understand that, but it looks really
>     undesirable to me):
>
>     In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
>     Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))
>
>
>     --
>     Ciao, /  /
>          /--/
>         /  / ANS
>     _______________________________________________
>     Numpy-discussion mailing list

>     [hidden email] <mailto:[hidden email]>

>     http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>

> ------------------------------------------------------------------------

>
> _______________________________________________
> Numpy-discussion mailing list
> [hidden email]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion


_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey
Hi,
I should have asked first (I hope that you don't mind), but I created a
ticket Ticket #728 (http://scipy.org/scipy/numpy/ticket/728 ) for
numpy.r_ because this incorrectly casts based on the array types.

The bug is that -inf and inf are numpy floats but dbin is an array of
ints. Unfortunately, numpy.r_ returns the type of the array with the
highest precision (well at least 'float' wins over 'int') and thus there
is lost precision. The fix is as you indicated below.

Regards
Bruce




David Huard wrote:

>
>
> 2008/4/8, Bruce Southey <[hidden email] <mailto:[hidden email]>>:
>
>     Hi,
>     I agree that the current histogram should be changed. However, I
>     am not
>     sure 1.0.5 is the correct release for that.
>
>
> We both agree.
>
>     David, this doesn't work for your code:
>     r= np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
>     dbin=[2,3,4]
>     rc, rb=histogram(r, bins=dbin, discard=None)
>
>     Returns:
>     rc=[3 3] # Really should be [3, 3, 9]
>     rb=[-9223372036854775808                    3 -9223372036854775808]
>
>
> I used the convention that bins are the bin edges, including the right
> most edge, this is why len(rc) =2 and len(rb)=3.
>
> Now there clearly is a bug, and I traced it to the use of np.r_. Check
> this out:
>
> In [26]: dbin = [1,2,3]
>
> In [27]: np.r_[-np.inf, dbin, np.inf]
> Out[27]: array([-Inf,   1.,   2.,   3.,  Inf])
>
> In [28]: np.r_[-np.inf, asarray(dbin), np.inf]
> Out[28]:
> array([-9223372036854775808,                    1,                    
> 2,                          3, -9223372036854775808])
>
> In [29]: np.r_[-np.inf, asarray(dbin).astype(float), np.inf]
> Out[29]: array([-Inf,   1.,   2.,   3.,  Inf])
>
> Is this a misuse of r_ or a bug ?
>
>
> David
>
>
>
>
>
>
>  
>
>     But I have not had time to find the error.
>
>     Regards
>     Bruce
>
>
>
>     David Huard wrote:
>     > Hans,
>     >
>     > Note that the current histogram is buggy, in the sense that it
>     assumes
>     > that all bins have the same width and computes db = bins[1]-bin[0].
>     > This is why you get zeros everywhere.
>     >
>     > The current behavior has been heavily criticized and I think we
>     should
>     > change it. My proposal is to have for histogram the same behavior as
>     > for histogramdd and histogram2d: bins are the bin edges,
>     including the
>     > rightmost bin, and values outside of the bins are not tallied. The
>     > problem with this is that it breaks code, and I'm not sure it's
>     such a
>     > good idea to do this in a point release.
>     >
>     > My short term proposal would be to fix the normalization bug and
>     > document the current behavior of histogram for the 1.0.5
>     release. Once
>     > it's done, we can modify histogram and maybe print a warning the
>     first
>     > time it's used to notice users of the change.
>     >
>     > I'd like to hear the voice of experienced devs on this. This
>     issue has
>     > been raised a number of times since I follow this ML. It's not the
>     > first time I've proposed patches, and I've already documented the
>     > weird behavior only to see the comments disappear after a while. I
>     > hope this time some kind of agreement will be reached.
>     >
>     > Regards,
>     >
>     > David
>     >
>     >
>     >
>     >
>     > 2008/4/8, Hans Meine <[hidden email]
>     <mailto:[hidden email]>
>
>     > <mailto:[hidden email]
>     <mailto:[hidden email]>>>:
>
>     >
>     >     Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:
>     >
>     >     > Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
>     >     > > There's also a fourth option - raise an exception if any
>     >     points are
>     >     > > outside the range.
>     >     >
>     >     > +1
>     >     >
>     >     > I think this should be the default.  Otherwise, I tend towards
>     >     "exclude",
>     >     > in order to have comparable bin sizes (when plotting, I always
>     >     find peaks
>     >     > at the ends annoying); this could also be called "clip" BTW.
>     >     >
>     >     > But really, an exception would follow the Zen: "In the face of
>     >     ambiguity,
>     >     > refuse the temptation to guess."  And with a kwarg:
>     "Explicit is
>     >     better
>     >     > than implicit."
>     >
>     >
>     >     When posting this, I did indeed not think this through fully; as
>     >     David (and
>     >     Tommy) pointed out, this API does not fit well with the existing
>     >     `bins`
>     >     option, especially when a sequence of bin bounds is given.  (I
>     >     guess I was
>     >     mostly thinking about the special case of discrete values
>     and 1:1
>     >     bins, as
>     >     typical for uint8 data.)
>     >
>     >     Thus, I would like to withdraw my above opinion from and instead
>     >     state that I
>     >     find the current API as clear as it gets.  If you want to
>     exclude
>     >     values,
>     >     simply pass an additional right bound, and for including
>     outliers,
>     >     passing -inf as additional left bound seems to do the
>     trick.  This
>     >     could be
>     >     possibly added to the documentation though.
>     >
>     >     The only critical aspect I see is the `normed` arg.  As it
>     is now, the
>     >     rightmost bin has always infinite size, but it is not
>     treated like
>     >     that:
>     >
>     >     In [1]: from numpy import *
>     >
>     >     In [2]: histogram(arange(10), [2,3,4], normed = True)
>     >     Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
>     >
>     >     Even worse, if you try to add an infinite bin to the left, this
>     >     pulls all
>     >     values to zero (technically, I understand that, but it looks
>     really
>     >     undesirable to me):
>     >
>     >     In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
>     >     Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,  
>     3.,   4.]))
>     >
>     >
>     >     --
>     >     Ciao, /  /
>     >          /--/
>     >         /  / ANS
>     >     _______________________________________________
>     >     Numpy-discussion mailing list
>
>     >     [hidden email]
>     <mailto:[hidden email]>
>     <mailto:[hidden email]
>     <mailto:[hidden email]>>
>
>     >     http://projects.scipy.org/mailman/listinfo/numpy-discussion
>     >
>     >
>
>     >
>     ------------------------------------------------------------------------
>
>     >
>     > _______________________________________________
>     > Numpy-discussion mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > http://projects.scipy.org/mailman/listinfo/numpy-discussion
>     >
>
>     _______________________________________________
>     Numpy-discussion mailing list
>     [hidden email] <mailto:[hidden email]>
>     http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> [hidden email]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>  

_______________________________________________
Numpy-discussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpy-discussion