help creating a reversed cumulative histogram

help creating a reversed cumulative histogram

Timmie
Administrator
Hello fellow numpy users,
I posted some questions on histograms recently [1, 2] but still couldn't
find a solution.

I am trying to create an inverse cumulative histogram [3] which shall
look like [4] but with the higher values at the left.

The classification shall follow this exemplary rule:

class 1: 0
all values > 0

class 2: 10
all values > 10

class 3: 15
all values > 15

class 4: 20
all values > 20

class 5: 25
all values > 25

[...]

I could get this easily in a spreadsheet by creating a matrix with
conditional statements (if VALUES_COL > CLASS_BOUNDARY; VALUES_COL; '-').
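
The spreadsheet rule above can be sketched in NumPy with a boolean matrix. This is only an illustration under assumed inputs: the sample `values`, the `boundaries` array and the names `exceeds`, `counts_above` and `sums_above` are hypothetical and do not come from the original data.

```python
import numpy as np

# Hypothetical sample data and class boundaries, for illustration only.
values = np.array([0.0, 5.0, 12.0, 18.0, 22.0, 30.0])
boundaries = np.array([0.0, 10.0, 15.0, 20.0, 25.0])

# Row i of `exceeds` marks which values are strictly greater than boundaries[i],
# mirroring the spreadsheet test VALUES_COL > CLASS_BOUNDARY.
exceeds = values[np.newaxis, :] > boundaries[:, np.newaxis]

counts_above = exceeds.sum(axis=1)            # number of values above each boundary
sums_above = (exceeds * values).sum(axis=1)   # total of the values above each boundary
```

Each class then holds the count and the sum of everything above its boundary, which is the shape of curve described here (largest at the lowest boundary, shrinking towards the highest).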

With python (numpy or pylab) I was not successful. The plotted histogram
envelope turned out to be just the inverted curve as the one created
with the spreadsheet app.
       
I have briefly visualised the issue here [5]. I hope that this makes it
more understandable.

Later I would like to sum and count all values in each bin as discussed
in [2].

Could someone give me a pointer or hint on how to improve my code below to
achieve the desired histogram?



Thanks a lot in advance,
Timmie

[1]: http://www.nabble.com/np.hist-with-masked-values-to25243905.html
[2]: http://www.nabble.com/histogram%3A-sum-up-values-in-each-bin-to25171265.html
[3]: http://en.wikipedia.org/wiki/Histogram#Cumulative_histogram
[4]: http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=126
[5]: http://www.scribd.com/doc/19371606/Distribution-Histogram

##### CODE #####
import numpy as np
import pylab

normed = False
# `values` holds the loaded data as an array
bins = 10


### sum
## taken from
## http://www.nabble.com/Scipy-and-statistics%3A-probability-density-function-to24683007.html#a24683304
sums = np.histogram(values, weights=values,
                    normed=normed,
                    bins=bins)
ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
ecdf_inv_sums = ecdf_sums[::-1]


pylab.plot(sums[1], ecdf_inv_sums)
pylab.show()

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: help creating a reversed cumulative histogram

Robert Kern-2
On Wed, Sep 2, 2009 at 18:15, Tim Michelsen<[hidden email]> wrote:
> Hello fellow numpy users,
> I posted some questions on histograms recently [1, 2] but still couldn't
> find a solution.
>
> I am trying to create an inverse cumulative histogram [3] which shall
> look like [4] but with the higher values at the left.

Okay. That is completely different from what you've asked before.

> With python (numpy or pylab) I was not successful. The plotted histogram
> envelope turned out to be just the inverted curve as the one created
> with the spreadsheet app.

> sums = np.histogram(values, weights=values,
>                                     normed=normed,
>                                     bins=bins)
> ecdf_sums = np.hstack([0.0, sums[0].cumsum() ])
> ecdf_inv_sums = ecdf_sums[::-1]

This is not the kind of "inversion" that you are looking for. You want

ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums
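
To see the difference between the two operations, here is a small sketch. The array `bin_sums` is a made-up stand-in for `sums[0]`, and `reversed_curve` and `remaining` are illustrative names only.

```python
import numpy as np

# Hypothetical per-bin sums, standing in for sums[0] from np.histogram.
bin_sums = np.array([1.0, 2.0, 3.0, 4.0])

ecdf_sums = np.hstack([0.0, bin_sums.cumsum()])   # [0, 1, 3, 6, 10]

# Reversing only mirrors the cumulative curve from left to right.
reversed_curve = ecdf_sums[::-1]                  # [10, 6, 3, 1, 0]

# Subtracting from the total gives what is still left above each bin edge.
remaining = ecdf_sums[-1] - ecdf_sums             # [10, 9, 7, 4, 0]
```

Both curves start at the total and end at zero, but only the second one drops by the content of each bin in the original bin order.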

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

Re: help creating a reversed cumulative histogram

josef.pktd
On Wed, Sep 2, 2009 at 7:26 PM, Robert Kern<[hidden email]> wrote:

> This is not the kind of "inversion" that you are looking for. You want
>
> ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums

and you can plot the histogram with bar

# empirical inverse survival function of weights
eisf_sums = ecdf_sums[-1] - ecdf_sums
width = sums[1][1] - sums[1][0]
rects1 = plt.bar(sums[1], eisf_sums, width, color='b')

Are you sure you want cumulative weights in the histogram?

Josef


Re: help creating a reversed cumulative histogram

Timmie
Administrator
Hello Robert and Josef,
thanks for the quick answers! I really appreciate this.

>>> I am trying to create an inverse cumulative histogram [3] which shall
>>> look like [4] but with the higher values at the left.
>> Okay. That is completely different from what you've asked before.
You are right.
But it's sometimes hard to describe a desired and expected output in
Python terms and pseudocode.
I still have to learn more numpy vocabulary...

I will evaluate your answers and give feedback.

Regards,
Timmie


Re: help creating a reversed cumulative histogram

Robert Kern-2
On Wed, Sep 2, 2009 at 19:11, Tim Michelsen<[hidden email]> wrote:

> Hello Robert and Josef,
> thanks for the quick answers! I really appreciate this.
>
>>>> I am trying to create an inverse cumulative histogram [3] which shall
>>>> look like [4] but with the higher values at the left.
>>> Okay. That is completely different from what you've asked before.
> You are right.
> But it's sometimes hard to describe a desired and expected output in
> Python terms and pseudocode.
> I still have to learn more numpy vocabulary...

Actually, I apologize. I meant to delete that line before sending the
message. It was unnecessary and abusive.

--
Robert Kern

Re: help creating a reversed cumulative histogram

Timmie
Administrator
> >>> Okay. That is completely different from what you've asked before.
> > You are right.
> > But it's sometimes hard to describe a desired and expected output in
> > Python terms and pseudocode.
> > I still have to learn more numpy vocabulary...
>
> Actually, I apologize. I meant to delete that line before sending the
> message. It was unnecessary and abusive.
Don't worry. I took it the way you meant it initially. No offence taken.

Coding and math problems become clearer once you make the effort to explain and
visualise them for others. You spend quite a lot of time responding here. I
appreciate that.

Best regards,
Timmie





Re: help creating a reversed cumulative histogram

Timmie
Administrator
In reply to this post by josef.pktd
Hello,
I have checked the snippets you proposed.
They do what I wanted to achieve.
Obviously, I had to subtract the values as Robert
demonstrated. This could also have been seen from
the figure I posted.

I still have to see how I can optimise the code
(cf. below) or make it less complicated.
It seemed so simple in the spreadsheet...

> eisf_sums = ecdf_sums[-1] - ecdf_sums
> # empirical inverse survival function of weights
Can you recommend me a (literature) source where
I can look up this term?
I learned statistics in my mother tongue and seem
to need a refresher on distributions...
I would like to come up with the right terms
next time.

> Are you sure you want cumulative weights in
> the histogram?
You mean it doesn't make sense at all?

I need:
1) the count of occurrences sorted in each bin
    counts = np.histogram(values,
                          normed=normed,
                          bins=bins)
    => here I now obtain the same as in the
    spreadsheet

2) the sum of all values sorted in each bin
    sums = np.histogram(values, weights=values,
                        normed=normed,
                        bins=bins)

    => here I still obtain a different value for the first
    histogram value (eisf_sums[0]):
    Numpy: eisf_sums
    335.50026738, 319.21363636, 266.07724942,
    198.10258741, 126.69270396, 67.98125874,
    38.47335664, 24.75062937, 13.42121212,
    2.48636364, 0.

    Spreadsheet:
    335.2351159, 319.2136364, 266.0772494,
    198.1025874, 126.692704, 67.98125874,
    38.47335664, 24.75062937, 13.42121212,
    2.486363636, 0

Additionally, I would like to see these implemented
as convenience functions in numpy or scipy.
There should be out-of-the-box functions for all kinds
of distributions.
Where is the best place to contribute a final version?
scipy.stats?

Thanks again for your input,
Timmie

##### below the distilled code #####
import numpy as np
import matplotlib.pyplot as plt

## histogram settings
normed = False
bins = 10

## counts: gives expected results
counts = np.histogram(values,
                      normed=normed,
                      bins=bins)

ecdf_counts = np.hstack([1.0, counts[0].cumsum()])
ecdf_inv_counts = ecdf_counts[::-1]
# empirical inverse survival function of weights
eisf_counts = ecdf_counts[-1] - ecdf_counts


### sum: does have deviations
sums = np.histogram(values, weights=values,
                    normed=normed,
                    bins=bins)
ecdf_sums = np.hstack([1.0, sums[0].cumsum()])
ecdf_inv_sums = ecdf_sums[::-1]
# empirical inverse survival function of weights
eisf_sums = ecdf_sums[-1] - ecdf_sums

##
# configure plot
xlabel = 'Bins'
ylabel_left = 'Counts'
ylabel_right = 'Sum'


fig1 = plt.figure()
ax1 = fig1.add_subplot(111)

# counts
ax1.plot(counts[1], ecdf_inv_counts, 'r-')
ax1.set_xlabel(xlabel)
ax1.set_ylabel(ylabel_left, color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')

# sums
ax2 = ax1.twinx()
ax2.plot(sums[1], eisf_sums, 'b-')
ax2.set_ylabel(ylabel_right, color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
plt.show()



Re: help creating a reversed cumulative histogram

josef.pktd
On Thu, Sep 3, 2009 at 9:23 AM, Tim Michelsen<[hidden email]> wrote:

>> eisf_sums = ecdf_sums[-1] - ecdf_sums
>> # empirical inverse survival

this should have inverse in it, it was a cut and paste error

The empirical survival function would be just 1 - ecdf.

However, as distributions, both would need to be normed to 1.
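
A minimal sketch of that normed version, assuming made-up data: `values`, the bin count and the names `ecdf` and `esf` are illustrative and not taken from the thread.

```python
import numpy as np

# Hypothetical data; `values` and the bin count are assumptions for illustration.
values = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 4.0, 5.0])
counts, edges = np.histogram(values, bins=4)

# Empirical CDF at the bin edges, normed so that it ends at 1.
ecdf = np.hstack([0.0, counts.cumsum()]) / counts.sum()

# Empirical survival function: fraction of the data above each edge.
esf = 1.0 - ecdf
```

Dividing by `counts.sum()` is what turns the cumulative counts into a distribution. Without it, the "total minus cumulative" form from earlier in the thread gives the same curve in unnormed units.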

>> function of weights
> Can you recommend me a (literature) source where
> I can look up this term?
> I learned statistics in my mother tongue and seem
> to need a refresher on distributions...
> I would like to come up with the right terms
> next time.

My first stop is usually wikipedia:

http://en.wikipedia.org/wiki/Survival_function
http://de.wikipedia.org/wiki/Verteilungsfunktion#.C3.9Cberlebenswahrscheinlichkeit

and the ISI - INTERNATIONAL STATISTICAL INSTITUTE glossary for terms
in different languages
http://isi.cbs.nl/glossary/bloken83.htm

>
>> Are you sure you want cumulative weights in
>>the histogram?
> You mean it doesn't make sense at all?

It depends on what you want. The ecdf as it is calculated, with the
weights argument in the histogram, gives you the cumulative sum of the
values, not the count.
In the case of the weight of pigs, it would be the cumulative weight of
all pigs with a weight less than the given bin boundary weight.
If values were income, then it would be the aggregated income of all
individuals with an income below the bin boundary.
So it makes sense, given this is what you want (below).
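
The pig example can be written out as a short sketch. The weights and the bin edges below are invented for illustration.

```python
import numpy as np

# Hypothetical pig weights and bin edges (any positive measurements would do).
values = np.array([50.0, 60.0, 60.0, 80.0, 100.0])
edges = np.array([40.0, 70.0, 90.0, 110.0])

counts, _ = np.histogram(values, bins=edges)                  # pigs per bin
sums, _ = np.histogram(values, bins=edges, weights=values)    # total weight per bin

cum_counts = counts.cumsum()   # number of pigs below each upper bin edge
cum_sums = sums.cumsum()       # combined weight of those pigs
```

Passing `weights=values` makes every observation contribute its own value instead of 1, so the histogram holds sums rather than counts.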

>    => here I still obtain different values for the first
>    histogram value (eisf_sums[0]):
>    Numpy: eisf_sums
>    335.50026738, 319.21363636, 266.07724942,
>    198.10258741, 126.69270396, 67.98125874,
>    38.47335664,  24.75062937, 13.42121212,
>    2.48636364, 0.
>
>    Spreadsheet:
>    335.2351159, 319.2136364, 266.0772494,
>    198.1025874, 126.692704, 67.98125874,
>    38.47335664, 24.75062937, 13.42121212,
>    2.486363636, 0

There might be a mistake in the treatment of a cell when
reversing: when I run your example, the highest value is
not equal to values.sum().

This might match the spreadsheet, but I haven't compared:
isf = sums[0][::-1].cumsum()[::-1]

But I'm not sure yet what's going on.
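
A small sketch of how the reversed cumsum relates to the subtraction form, with a made-up `bin_sums` standing in for `sums[0]`. The last two lines show one possible reason for the highest value not matching the total, namely the leading 1.0 in the posted hstack call. That explanation is an assumption, not something confirmed in the thread.

```python
import numpy as np

bin_sums = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical stand-in for sums[0]

# Cumulate from the right: entry i is the total of bins i, i+1, ..., last.
isf = bin_sums[::-1].cumsum()[::-1]                       # [10, 9, 7, 4]

# Same numbers via "total minus cumulative sum", which needs a leading 0.0.
ecdf = np.hstack([0.0, bin_sums.cumsum()])
via_subtraction = ecdf[-1] - ecdf                         # [10, 9, 7, 4, 0]

# With a leading 1.0 the first entry is the total minus 1, not the total.
ecdf_leading_one = np.hstack([1.0, bin_sums.cumsum()])
shifted = ecdf_leading_one[-1] - ecdf_leading_one         # [9, 9, 7, 4, 0]
```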

Josef


Re: help creating a reversed cumulative histogram

Timmie
Administrator
> My first stop is usually wikipedia:
[...]
Thanks.
Had I known that I have to call the beast an
"empirical inverse survival function", Robert would
also have found it easier to help.
Anyway, step by step...

> In the case of the weight of pigs, it would be the cumulative weight of
> all pigs with a weight less than the given bin boundary weight.
> If values were income, then it would be the aggregated income of all
> individuals with an income below the bin boundary.
> So it makes sense, given this is what you want (below).
Exactly!

Or for precipitation:
a) count: number of precipitation events that
    occurred up to a certain limit
b) sum: precipitation total registered up to that limit

> there might be a mistake in the treatment of a cell when
> reversing, when I run your example the highest value is
> not equal to values.sum()
This has made me think again. Small point.

See here:
ecdf_sums = np.hstack([0.0, sums[0].cumsum() ])
ecdf_sums = np.hstack([sums[0].cumsum() ])

I had to adjust the classes in the spreadsheet by
replacing the first class limit with 0.0.
I had modified this yesterday to a different value
(0.265152) while I was testing the code.

from:
0.265152, 0.487273, 0.709394, 0.931515,
1.153636, 1.375758, 1.597879, 1.820000,
2.042121, 2.264242, 2.486364

to:
0.0, 0.487273, 0.709394, 0.931515,
1.153636, 1.375758, 1.597879, 1.820000,
2.042121, 2.264242, 2.486364

Now everything is fine. Results and curves match.
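
One reading of why the first class limit mattered, offered as an assumption rather than something stated in the thread: a strict "value > boundary" test at the smallest data value drops that value, while np.histogram keeps it in the first bin. The reported difference between the two first entries (335.50026738 and 335.2351159) is about 0.26515, close to the old first class limit. A sketch with made-up numbers:

```python
import numpy as np

values = np.array([0.5, 1.0, 1.5, 2.0])   # hypothetical data

# By default the first bin edge is values.min(), and that value lands in the first bin.
sums, edges = np.histogram(values, bins=3, weights=values)
total_from_histogram = sums.sum()                             # includes the minimum

# A strict "value > boundary" rule at the same first edge drops the minimum ...
total_strictly_above_min = values[values > edges[0]].sum()

# ... while a first boundary of 0.0 keeps every positive value.
total_strictly_above_zero = values[values > 0.0].sum()
```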

> But I'm not sure yet, what's going on.
1) first I didn't know how to develop the code for a
    "empirical inverse survival function" in numpy
2) I screwed my spreadsheet classes up while
    testing and verifying my numpy code.

Again, would a function for the
"empirical inverse survival function" qualify for the
inclusion into numpy or scipy?

Thanks for the help.

Best regards,
Timmie



Re: help creating a reversed cumulative histogram

josef.pktd
On Thu, Sep 3, 2009 at 12:58 PM, Tim Michelsen<[hidden email]> wrote:

> Again, would a function for the
> "empirical inverse survival function" qualify for the
> inclusion into numpy or scipy?

Sorry, I'm too distracted; correcting myself a second time:
 "this should *not* have inverse in it, using inverse was a cut and paste error"
It's the empirical survival function.

If it's just a one-liner with cumsum, then I don't think it's necessary
to have a function for it.

But following also the previous discussion, it would be useful to have
the combination of histogram and empirical cdf, sf, and/or pdf to
define an empirical distribution. For an interpretation in terms of a
distribution, normed=True would be necessary, but it could also be an
option.

One question about your application: in the plot you draw lines and not
histograms. Is there a reason to use histograms in the calculation
instead of the full ecdf (i.e. cumsum on the original values instead of
cumsum on the histogrammed values)?
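
A minimal sketch of what cumsum on the original values could look like. The data and all names are hypothetical, and the values are assumed to be distinct so that the running count is exact.

```python
import numpy as np

values = np.array([3.0, 1.0, 2.5, 2.0, 5.0])   # hypothetical, distinct values

x = np.sort(values)                      # sorted observations
ecdf_counts = np.arange(1, x.size + 1)   # observations <= x[i] (for distinct values)
ecdf_sums = x.cumsum()                   # total of all observations <= x[i]

# "Remaining" curves: what lies strictly above x[i].
sf_counts = x.size - ecdf_counts
sf_sums = ecdf_sums[-1] - ecdf_sums
```

Plotting `sf_counts` or `sf_sums` against `x` gives one point per observation instead of one per bin.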

Josef



Re: help creating a reversed cumulative histogram

Timmie
Administrator
>> Again, would a function for the
>> "empirical inverse survival function" qualify for the
>> inclusion into numpy or scipy?
>
> Sorry, I'm too distracted, correcting myself a second time
>  "this should *not* have inverse in it, using inverse was a cut and paste error"
> it's  empirical survival function
I think it was my fault for not paying enough attention to the exact terms.
The pages you sent on "survival function" are marked as to-read.

> If it's just a one-liner with cumsum, then I don't think it's necessary
> to have a function for it.
>
> But following also the previous discussion, it would be useful to have
> the combination of histogram and empirical cdf, sf, and/or pdf to
> define an empirical distribution. As interpretation in terms of
> distribution, normed=True would be necessary, but it could also be an
> option.
And it seems that this is just one call in R.

> One question to your application, in the plot you draw lines and not
> histograms. Is there a reason to use histograms in the calculation
> instead of the full ecdf. (i.e. cumsum on original values instead of
> cumsum on histogrammed values) ?
Well, I was not aware of cumsum and a way to create an ecdf with numpy.
I just searched the list archives for cdf or ecdf.
As my initial version was created in a spreadsheet, I first tried to
replicate that and get it validated.

Can you give an example of a full ecdf?

In the end I am interested in the points (x and y coordinates) where the
ecdf intersects with a certain threshold value.
This is the next task:
Get x,y of the cut-point between a vertical or horizontal line and a
curve with numpy and matplotlib.
Can you point out an example for that?
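
One possible approach, sketched under assumptions: if the curve is piecewise linear and monotonic, np.interp can read off both kinds of intersection. The arrays `x` and `y` and the thresholds `x0` and `y0` below are made up.

```python
import numpy as np

# Hypothetical curve: bin edges and a decreasing "remaining sum" curve.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 9.0, 7.0, 4.0, 0.0])

# Vertical line at x0: read y off the curve by linear interpolation.
x0 = 2.5
y_at_x0 = np.interp(x0, x, y)

# Horizontal line at y0: np.interp needs increasing sample points,
# so both arrays are reversed because y decreases with x.
y0 = 8.0
x_at_y0 = np.interp(y0, y[::-1], x[::-1])
```

The points (x0, y_at_x0) and (x_at_y0, y0) can then be marked on the matplotlib axes. For a curve that is not monotonic in y, the horizontal case would need a search for sign changes of `y - y0` instead.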

Best regards,
Timmie



