

On Wed, Sep 2, 2009 at 18:15, Tim Michelsen< [hidden email]> wrote:
> Hello fellow numpy users,
> I posted some questions on histograms recently [1, 2] but still couldn't
> find a solution.
>
> I am trying to create an inverse cumulative histogram [3] which shall
> look like [4] but with the higher values at the left.
Okay. That is completely different from what you've asked before.
> The classification shall follow this exemplary rule:
>
> class 1: 0
> all values > 0
>
> class 2: 10
> all values > 10
>
> class 3: 15
> all values > 15
>
> class 4: 20
> all values > 20
>
> class 5: 25
> all values > 25
>
> [...]
>
> I could get this easily in a spreadsheet by creating a matrix with
> conditional statements (if VALUES_COL > CLASS_BOUNDARY; VALUES_COL; '').
>
> With python (numpy or pylab) I was not successful. The plotted histogram
> envelope turned out to be just the inverted version of the curve created
> with the spreadsheet app.
> sums = np.histogram(values, weights=values,
>                     normed=normed,
>                     bins=bins)
> ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
> ecdf_inv_sums = ecdf_sums[::-1]

This is not the kind of "inversion" that you are looking for. You want

  ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums
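
A minimal sketch of the difference between the two operations, assuming a small made-up `values` array and three bins (neither is from the original post). Reversing the cumulative sums only flips their order, while subtracting them from the last element gives the total at or above each bin edge:

```python
import numpy as np

# Hypothetical sample data; not the values from the original post.
values = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

# Weighted histogram: each bin holds the sum of the values that fall in it.
bin_sums, edges = np.histogram(values, bins=3, weights=values)

# Cumulative sum up to each bin edge, with a leading 0 for the first edge.
ecdf_sums = np.hstack([0.0, bin_sums.cumsum()])

# Reversal only flips the array; it is still "sum below", read backwards.
reversed_sums = ecdf_sums[::-1]

# Total minus "sum below each edge" gives "sum at or above each edge".
above_sums = ecdf_sums[-1] - ecdf_sums

print(edges)
print(reversed_sums)
print(above_sums)
```

With the default bins spanning all the data, the first entry of `above_sums` is the grand total and the last entry is zero.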

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion


On Wed, Sep 2, 2009 at 7:26 PM, Robert Kern< [hidden email]> wrote:
>> sums = np.histogram(values, weights=values,
>>                      normed=normed,
>>                      bins=bins)
>> ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
>> ecdf_inv_sums = ecdf_sums[::-1]
>
> This is not the kind of "inversion" that you are looking for. You want
>
>   ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums

and you can plot the histogram with bar:

  # empirical inverse survival function of weights
  eisf_sums = ecdf_sums[-1] - ecdf_sums
  width = sums[1][1] - sums[1][0]
  rects1 = plt.bar(sums[1], eisf_sums, width, color='b')

Are you sure you want cumulative weights in the histogram?

Josef

Hello Robert and Josef,
thanks for the quick answers! I really appreciate this.
>>> I am trying to create an inverse cumulative histogram [3] which shall
>>> look like [4] but with the higher values at the left.
>> Okay. That is completely different from what you've asked before.
You are right.
But it's sometimes hard to describe a desired and expected output in
Python terms and pseudocode.
I still have to learn more numpy vocabulary...
I will evaluate your answers and give feedback.
Regards,
Timmie


On Wed, Sep 2, 2009 at 19:11, Tim Michelsen< [hidden email]> wrote:
> Hello Robert and Josef,
> thanks for the quick answers! I really appreciate this.
>
>>>> I am trying to create an inverse cumulative histogram [3] which shall
>>>> look like [4] but with the higher values at the left.
>>> Okay. That is completely different from what you've asked before.
> You are right.
> But it's sometimes hard to describe a desired and expected output in
> Python terms and pseudocode.
> I still have to learn more numpy vocabulary...
Actually, I apologize. I meant to delete that line before sending the
message. It was unnecessary and abusive.

Robert Kern

>>> Okay. That is completely different from what you've asked before.
>> You are right.
>> But it's sometimes hard to describe a desired and expected output in
>> Python terms and pseudocode.
>> I still have to learn more numpy vocabulary...
>
> Actually, I apologize. I meant to delete that line before sending the
> message. It was unnecessary and abusive.
Don't worry. I took it the way you meant it initially. No offence.
Coding and math problems become clearer once you make the effort to explain and
visualise them for others. You spend quite a lot of time responding here. I
appreciate that.
Best regards,
Timmie

Hello,
I have checked the snippets you proposed.
They do what I wanted to achieve.
Obviously, I had to subtract the values as Robert
demonstrated. This could also be seen from
the figure I posted.
I still have to see how I can optimise the code
(cf. below) or make it less complicated.
It seemed so simple in the spreadsheet...
> eisf_sums = ecdf_sums[-1] - ecdf_sums
> # empirical inverse survival function of weights
Can you recommend me a (literature) source where
I can look up this term?
I learned statistics in my mother tongue and seem
to need a refresher on distributions...
I would like to come up with the right terms
next time.
> Are you sure you want cumulative weights in
> the histogram?
You mean it doesn't make sense at all?
I need:
1) the count of occurrences sorted into each bin
   counts = np.histogram(values,
                         normed=normed,
                         bins=bins)
   => here I now obtain the same as in the
   spreadsheet
2) the sum of all values sorted into each bin
   sums = np.histogram(values, weights=values,
                       normed=normed,
                       bins=bins)
   => here I still obtain a different value for the first
   histogram value (eisf_sums[0]):
   Numpy: eisf_sums
   335.50026738, 319.21363636, 266.07724942,
   198.10258741, 126.69270396, 67.98125874,
   38.47335664, 24.75062937, 13.42121212,
   2.48636364, 0.
   Spreadsheet:
   335.2351159, 319.2136364, 266.0772494,
   198.1025874, 126.692704, 67.98125874,
   38.47335664, 24.75062937, 13.42121212,
   2.486363636, 0
Additionally, I would like to see these implemented
as convenience functions in numpy or scipy.
There should be out-of-the-box functions for all kinds
of distributions.
Where is the best place to contribute a final version?
scipy.stats?
Thanks again for your input,
Timmie
##### below the distilled code #####
import numpy as np
import matplotlib.pyplot as plt

# `values` is the 1-D array of input data (not shown here)

## histogram settings
# `normed` is the keyword of the NumPy version used here;
# newer releases call it `density`
normed = False
bins = 10

## counts: gives expected results
counts = np.histogram(values,
                      normed=normed,
                      bins=bins)
ecdf_counts = np.hstack([1.0, counts[0].cumsum()])
ecdf_inv_counts = ecdf_counts[::-1]
# empirical inverse survival function of weights
eisf_counts = ecdf_counts[-1] - ecdf_counts

### sum: does have deviations
sums = np.histogram(values, weights=values,
                    normed=normed,
                    bins=bins)
ecdf_sums = np.hstack([1.0, sums[0].cumsum()])
ecdf_inv_sums = ecdf_sums[::-1]
# empirical inverse survival function of weights
eisf_sums = ecdf_sums[-1] - ecdf_sums

##
# configure plot
xlabel = 'Bins'
ylabel_left = 'Counts'
ylabel_right = 'Sum'

fig1 = plt.figure()
ax1 = fig1.add_subplot(111)

# counts
ax1.plot(counts[1], ecdf_inv_counts, 'r')
ax1.set_xlabel(xlabel)
ax1.set_ylabel(ylabel_left, color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')

# sums
ax2 = ax1.twinx()
ax2.plot(sums[1], eisf_sums, 'b')
ax2.set_ylabel(ylabel_right, color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
plt.show()


On Thu, Sep 3, 2009 at 9:23 AM, Tim Michelsen< [hidden email]> wrote:
> Hello,
> I have checked the snippets you proposed.
> They do what I wanted to achieve.
> Obviously, I had to subtract the values as Robert
> demonstrated. This could also be seen from
> the figure I posted.
>
> I still have to see how I can optimise the code
> (cf. below) or make it less complicated.
> It seemed so simple in the spreadsheet...
>
>> eisf_sums = ecdf_sums[-1] - ecdf_sums
>> # empirical inverse survival function of weights

this should have inverse in it, it was a cut and paste error
empirical survival function would be just 1 - ecdf
however, as distributions they would need to be normed to 1.

> Can you recommend me a (literature) source where
> I can look up this term?
> I learned statistics in my mother tongue and seem
> to need a refresher on distributions...
> I would like to come up with the right terms
> next time.
My first stop is usually Wikipedia:
http://en.wikipedia.org/wiki/Survival_function
http://de.wikipedia.org/wiki/Verteilungsfunktion#.C3.9Cberlebenswahrscheinlichkeit
and the ISI - INTERNATIONAL STATISTICAL INSTITUTE glossary for terms
in different languages:
http://isi.cbs.nl/glossary/bloken83.htm
>
>> Are you sure you want cumulative weights in
>>the histogram?
> You mean it doesn't make sense at all?
It depends on what you want. The ecdf as it is calculated, with the
weights argument in the histogram, gives you the cumulative sum of the
values, not the count.
In the case of the weight of pigs, it would be the cumulative weight of
all pigs with a weight less than the given bin boundary weight.
If values were income, then it would be the aggregated income of all
individuals with an income below the bin boundary.
So it makes sense, given this is what you want (below).
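
The distinction between counting and summing can be sketched as follows. The pig weights and bin edges are invented for illustration; only `np.histogram` and its `weights` argument come from the thread:

```python
import numpy as np

# Hypothetical pig weights in kg (illustrative only).
weights_kg = np.array([40.0, 55.0, 60.0, 80.0, 95.0])
edges = [0, 50, 100]

# Without `weights`: number of pigs in each bin.
counts, _ = np.histogram(weights_kg, bins=edges)

# With `weights=weights_kg`: total weight of the pigs in each bin.
sums, _ = np.histogram(weights_kg, bins=edges, weights=weights_kg)

# Cumulative versions: pigs (or kg) below each upper bin edge.
cum_counts = counts.cumsum()
cum_sums = sums.cumsum()

print(cum_counts)
print(cum_sums)
```

Passing the data itself as `weights` is what turns the histogram of counts into a histogram of per-bin totals.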
>
> I need:
> 1) the count of occurrences sorted in each bin
> counts = np.histogram(values,
> normed=normed,
> bins=bins)
> => here I obtain now the same as in the
> spreadsheet
>
> 2) the sum of all values sorted in each bin
> sums = np.histogram(values, weights=values,
> normed=normed,
> bins=bins)
>
> => here I still obtain different values for the first
> histogram value (eisf_sums[0]):
> Numpy: eisf_sums
> 335.50026738, 319.21363636, 266.07724942,
> 198.10258741, 126.69270396, 67.98125874,
> 38.47335664, 24.75062937, 13.42121212,
> 2.48636364, 0.
>
> Spreadsheet:
> 335.2351159, 319.2136364, 266.0772494,
> 198.1025874, 126.692704, 67.98125874,
> 38.47335664, 24.75062937, 13.42121212,
> 2.486363636, 0
There might be a mistake in the treatment of a cell when
reversing; when I run your example, the highest value is
not equal to values.sum().
This might match the spreadsheet, but I haven't compared:

  isf = sums[0][::-1].cumsum()[::-1]

But I'm not sure yet what's going on.
Josef
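
As a rough check of how the reversed cumulative sum relates to the subtraction form used earlier in the thread, here is a sketch on made-up data (the `values` array and the bin count are assumptions):

```python
import numpy as np

values = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])  # hypothetical data
bin_sums, edges = np.histogram(values, bins=3, weights=values)

# Variant 1: total minus the cumulative sum (one entry per bin edge).
ecdf_sums = np.hstack([0.0, bin_sums.cumsum()])
via_subtraction = ecdf_sums[-1] - ecdf_sums

# Variant 2: cumulative sum taken from the right (one entry per bin).
via_reverse_cumsum = bin_sums[::-1].cumsum()[::-1]

# The two agree except for the trailing 0 that the first variant carries.
same = np.allclose(via_subtraction[:-1], via_reverse_cumsum)

# The first entry equals the grand total because the default bins span all data.
total_matches = np.isclose(via_reverse_cumsum[0], values.sum())

print(same, total_matches)
```

If the first entry does not equal `values.sum()`, either the leading element prepended to the cumulative sum is not 0 or the bins do not cover all of the data.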

> My first stop is usually wikipedia:
[...]
Thanks.
If I had known that I have to call the beast an
"empirical inverse survival function", Robert would
also have found it easier to help.
Anyway, step by step...
> In the case of the weight of pigs, it would be the cumulative weight of
> all pigs with a weight less than the given bin boundary weight.
> If values were income, then it would be the aggregated income of all
> individuals with an income below the bin boundary.
> So it makes sense, given this is what you want (below).
Exactly!
Or for precipitation:
a) count: number of precipitation events that
   occurred up to a certain limit
b) sum: precipitation total registered up to that limit
> there might be a mistake in the treatment of a cell when
> reversing, when I run your example the highest value is
> not equal to values.sum()
This has made me think again. Small point.
See here:
ecdf_sums = np.hstack([0.0, sums[0].cumsum() ])
ecdf_sums = np.hstack([sums[0].cumsum() ])
I had to adjust the classes in the spreadsheet by
replacing the first class limit by 0.0.
I had modified this yesterday to a different value
(0.265152) as I was testing the code.
from:
0.265152, 0.487273, 0.709394, 0.931515,
1.153636, 1.375758, 1.597879, 1.820000,
2.042121, 2.264242, 2.486364
to:
0.0, 0.487273, 0.709394, 0.931515,
1.153636, 1.375758, 1.597879, 1.820000,
2.042121, 2.264242, 2.486364
Now everything is fine. Results and curves match.
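
One plausible reading of the earlier discrepancy, offered as an assumption rather than a confirmed diagnosis: the two first values differed by 335.50026738 - 335.2351159 = 0.26515148, which is the old first class limit, so a strict "greater than" rule would have dropped a value sitting exactly on that limit. A sketch with made-up data:

```python
import numpy as np

values = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])  # hypothetical data

# Class limits as np.histogram would place them for bins=3:
# the first limit equals the smallest value.
edges = np.linspace(values.min(), values.max(), 4)

# Spreadsheet-style rule: sum of all values strictly greater than each limit.
strict = np.array([values[values > e].sum() for e in edges])

# Same rule with the first limit lowered to 0.0 so the minimum is included.
edges_fixed = edges.copy()
edges_fixed[0] = 0.0
fixed = np.array([values[values > e].sum() for e in edges_fixed])

print(strict[0], fixed[0], values.sum())
```

In this toy case the strict rule misses the smallest value in the first class only, and lowering the first limit to 0.0 restores the grand total.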
> But I'm not sure yet, what's going on.
1) first I didn't know how to develop the code for an
"empirical inverse survival function" in numpy
2) I screwed my spreadsheet classes up while
testing and verifying my numpy code.
Again, would a function for the
"empirical inverse survival function" qualify for
inclusion in numpy or scipy?
Thanks for the help.
Best regards,
Timmie


On Thu, Sep 3, 2009 at 12:58 PM, Tim Michelsen< [hidden email]> wrote:
> Again, would a function for the
> "empirical inverse survival function" qualify for the
> inclusion into numpy or scipy?
Sorry, I'm too distracted; correcting myself a second time:
"this should *not* have inverse in it, using inverse was a cut and paste error"
It's the empirical survival function.
If it's just a one-liner with cumsum, then I don't think it's necessary
to have a function for it.
But following the previous discussion, it would be useful to have
the combination of histogram and empirical cdf, sf, and/or pdf to
define an empirical distribution. For an interpretation in terms of a
distribution, normed=True would be necessary, but it could also be an
option.
One question about your application: in the plot you draw lines and not
histograms. Is there a reason to use histograms in the calculation
instead of the full ecdf (i.e. cumsum on the original values instead of
cumsum on the histogrammed values)?
Josef
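
A minimal sketch of what "cumsum on the original values" could look like, assuming a small made-up `values` array without ties. The variable names are illustrative and not from the thread:

```python
import numpy as np

values = np.array([2.0, 0.5, 1.5, 3.0, 1.0, 2.5])  # hypothetical data

# Sort once; every curve below is defined on these x positions.
x = np.sort(values)

# Empirical CDF: fraction of observations <= x[i].
ecdf = np.arange(1, x.size + 1) / x.size

# Empirical survival function: fraction of observations > x[i].
esf = 1.0 - ecdf

# Unnormalised, weighted analogue: sum of all values <= x[i] ...
cum_sum_below = x.cumsum()
# ... and sum of all values >= x[i] (cumulative sum taken from the right).
cum_sum_above = x[::-1].cumsum()[::-1]

print(x)
print(ecdf)
print(cum_sum_above)
```

Compared with the histogram route, this keeps one point per observation, so no bin width has to be chosen.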

>> Again, would a function for the
>> "empirical inverse survival function" qualify for the
>> inclusion into numpy or scipy?
>
> Sorry, I'm too distracted, correcting myself a second time
> "this should *not* have inverse in it, using inverse was a cut and paste error"
> it's empirical survival function
I think it was my fault for not paying enough attention to the exact terms.
The pages you sent on the "survival function" are marked as to-read.
> If it's just a one-liner with cumsum, then I don't think it's necessary
> to have a function for it.
>
> But following also the previous discussion, it would be useful to have
> the combination of histogram and empirical cdf, sf, and/or pdf to
> define an empirical distribution. As interpretation in terms of
> distribution, normed=True would be necessary, but it could also be an
> option.
And it seems that this is just one call in R.
> One question to your application, in the plot you draw lines and not
> histograms. Is there a reason to use histograms in the calculation
> instead of the full ecdf. (i.e. cumsum on original values instead of
> cumsum on histogrammed values) ?
Well, I was not aware of cumsum and a way to create an ecdf with numpy.
I just searched the list archives for cdf or ecdf.
As my initial version was created in a spreadsheet, I first tried to
replicate that and get it validated.
Can you give an example of a full ecdf?
In the end I am interested in the points (x and y coordinates) where the
ecdf intersects with a certain threshold value.
This is the next task:
Get x,y of the intersection point between a vertical or horizontal line and a
curve with numpy and matplotlib.
Can you point out an example for that?
Best regards,
Timmie
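
One possible approach to the intersection question, sketched under the assumption that the curve is treated as piecewise linear between its points. The helper `crossings` is hypothetical (not a NumPy function) and the data is made up; `np.interp` is the only library routine relied on for the vertical-line case:

```python
import numpy as np

def crossings(x, y, threshold):
    """Return x positions where the piecewise-linear curve (x, y) crosses threshold.

    Points lying exactly on the threshold are not reported by this sketch.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(y, dtype=float) - threshold
    # Indices i where the curve changes side between point i and i + 1.
    idx = np.nonzero(d[:-1] * d[1:] < 0)[0]
    # Linear interpolation inside each crossing segment.
    frac = d[idx] / (d[idx] - d[idx + 1])
    return x[idx] + frac * (x[idx + 1] - x[idx])

# Hypothetical decreasing curve, e.g. a survival-type cumulative sum.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([10.5, 9.0, 5.5, 0.0])

x_hit = crossings(x, y, threshold=7.25)  # horizontal line y = 7.25
y_hit = np.interp(1.5, x, y)             # vertical line x = 1.5

print(x_hit, y_hit)
```

For a vertical line, `np.interp` needs the x values to be increasing; the sign-change search for the horizontal line works for increasing and decreasing curves alike.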

