Hi, I'm looking for a way to draw a random sample of C different items out of N items, with some desired probability Pi for each item i.

I saw that numpy has a function that supposedly does this, numpy.random.choice (with replace=False and a probabilities array), but looking at the algorithm actually implemented, I am wondering in what sense the probabilities Pi are actually obeyed...

To me, the code doesn't seem to be doing the right thing... Let me explain:

Consider a simple numerical example: We have 3 items, and need to pick 2 different ones randomly. Let's assume the desired probabilities for items 1, 2 and 3 are 0.2, 0.4 and 0.4.

Since every draw contains 2 items, I would expect item i to appear in 2*Pi of the draws. Working out the equations, there is exactly one solution here: The random outcome of numpy.random.choice in this case should be [1,2] at probability 0.2, [1,3] at probability 0.2, and [2,3] at probability 0.6. That is indeed a solution for the desired probabilities because it yields item 1 in [1,2]+[1,3] = 0.2 + 0.2 = 0.4 = 2*P1 of the trials, item 2 in [1,2]+[2,3] = 0.2 + 0.6 = 0.8 = 2*P2, and item 3 in [1,3]+[2,3] = 0.2 + 0.6 = 0.8 = 2*P3.
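
To double-check that this solution is unique, here is a small sketch that solves the three equations numerically. The matrix A and the names q12, q13, q23 are just my own notation for the pair probabilities, not anything taken from numpy's code.

```python
import numpy as np

# Desired per-item probabilities from the example (they sum to 1).
P = np.array([0.2, 0.4, 0.4])
C = 2  # number of distinct items picked per draw

# Unknowns: q12, q13, q23, the probabilities of the three possible pairs.
# Each row says "the pairs containing item i must add up to C * P[i]".
A = np.array([
    [1.0, 1.0, 0.0],  # item 1 appears in [1,2] and [1,3]
    [1.0, 0.0, 1.0],  # item 2 appears in [1,2] and [2,3]
    [0.0, 1.0, 1.0],  # item 3 appears in [1,3] and [2,3]
])

# A is non-singular, so the system has exactly one solution.
q12, q13, q23 = np.linalg.solve(A, C * P)
print(q12, q13, q23)  # approximately 0.2, 0.2, 0.6
```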

However, the algorithm behind numpy.random.choice with replace=False generates, if I understand correctly, different probabilities for the outcomes: I believe in this case it generates [1,2] at probability 0.2333, [1,3] also at 0.2333, and [2,3] at probability 0.5333.
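
These numbers come from the following calculation, which is only a sketch of my reading of the algorithm: I am assuming it is equivalent to drawing one item according to the Pi, removing it, renormalizing the remaining probabilities and drawing again. The helper sequential_pair_probs is hypothetical, written just for this example, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import permutations

# Desired probabilities for items 1, 2 and 3, as exact fractions.
P = {1: Fraction(1, 5), 2: Fraction(2, 5), 3: Fraction(2, 5)}

def sequential_pair_probs(p):
    """Probability of each unordered pair when two items are drawn
    one after the other, renormalizing after the first draw."""
    out = {}
    for first, second in permutations(p, 2):
        # First item drawn with p[first]; the second is drawn from
        # what is left, so its probability is p[second] / (1 - p[first]).
        prob = p[first] * p[second] / (1 - p[first])
        key = tuple(sorted((first, second)))
        out[key] = out.get(key, 0) + prob  # add up both orderings
    return out

pair_probs = sequential_pair_probs(P)
for pair, prob in sorted(pair_probs.items()):
    print(pair, prob, float(prob))
```

Under that assumption the pairs [1,2] and [1,3] each come out at 7/30 and [2,3] at 8/15.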

My question is how does this result fit the desired probabilities?

If we get [1,2] at probability 0.2333 and [1,3] at probability 0.2333, then the expected number of "1" results we'll get per drawing is 0.2333 + 0.2333 = 0.4667, and similarly the expected number for "2" is 0.2333 + 0.5333 = 0.7667, and for "3" also 0.7667. As you can see, the proportions are off: Item 2 is NOT twice as common as item 1, as we originally desired (we asked for probabilities 0.2, 0.4, 0.4 for the individual items!).
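
One way to check this empirically is to call the function many times and count how often each item shows up. This is only a rough sketch: the number of trials and the seed are arbitrary choices of mine, and indices 0, 1, 2 stand for items 1, 2, 3.

```python
import numpy as np

P = [0.2, 0.4, 0.4]
trials = 20000
rng = np.random.RandomState(0)  # seeded only so the run is repeatable

counts = np.zeros(3)
for _ in range(trials):
    # Pick 2 distinct indices out of 0..2 with the given probabilities.
    picked = rng.choice(3, size=2, replace=False, p=P)
    counts[picked] += 1

freq = counts / trials       # fraction of draws that contain each item
print(freq)
print(freq[1] / freq[0])     # would be 2.0 if item 2 were twice as common as item 1
```

If my reading of the algorithm is right, the frequencies should land near 0.467, 0.767 and 0.767 rather than the 0.4, 0.8 and 0.8 I was hoping for.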

_______________________________________________

NumPy-Discussion mailing list

https://mail.scipy.org/mailman/listinfo/numpy-discussion