Low-level API for Random

classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

Low-level API for Random

bashtage
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names. I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?
3. Should the legacy C functions be part of the API -- these are mostly the ones that produce or depend on polar transform normals (Box-Muller). I have a feeling no, but there may be reasons to prefer BM since they do not depend on rejection sampling.
4. Should low-level API be consumable like any other numpy C API by including the usual header locations and library locations?  Right now, the pxd simplifies writing Cython but users have sp specify the location of the headers and source manually  An alternative would be to provide a function like np.get_include() -> np.random.get_include() that would specialize in random.

Kevin


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

ralfgommers


On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Before worrying about naming details, can we start with "what should be in the C/Cython API"? If I look through the current pxd files, there's a lot there that looks like it should be private, and what we expose as Python API is not all present as far as I can tell (which may be fine, if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead).

In the end we want to get to a doc section similar to http://scipy.github.io/devdocs/special.cython_special.html I'd think.

3. Should the legacy C functions be part of the API -- these are mostly the ones that produce or depend on polar transform normals (Box-Muller). I have a feeling no, but there may be reasons to prefer BM since they do not depend on rejection sampling.

Even if there would be a couple of users interested, it would be odd starting to do this after deeming the code "legacy". So I agree with your "no".
 
4. Should low-level API be consumable like any other numpy C API by including the usual header locations and library locations?  Right now, the pxd simplifies writing Cython but users have sp specify the location of the headers and source manually  An alternative would be to provide a function like np.get_include() -> np.random.get_include() that would specialize in random.

Good question. I'm not sure this is "like any other NumPy C API". We don't provide a C API for fft, linalg or other functionality further from core either. It's possible of course, but does it really help library authors or end users?

Cheers,
Ralf



_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

bashtage


On Thu, Sep 19, 2019 at 10:23 AM Ralf Gommers <[hidden email]> wrote:


On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

My gamma and f (I misspoke about t) I mean the names that appear as Generator methods:


If I understand your point (and with reference with page linked below), then there would be something like numpy.random.cython_random.gamma (which is currently called numpy.random.distributions.random_gamma). Maybe I'm not understanding your point about the Python API though.  
 

I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Before worrying about naming details, can we start with "what should be in the C/Cython API"? If I look through the current pxd files, there's a lot there that looks like it should be private, and what we expose as Python API is not all present as far as I can tell (which may be fine, if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead).

From the ground up, for someone who want to write a new distribution:
1. The bit generators.  These currently have no pxd files. These are always going to be Python obects and so it isn't absolutely essential to expose them with a low-level API.  All that is needed is the capsule which has the bitgen struct, which is what is really needed
2. bitgen_t which is in common.pxd.  This is essential since it enables access to the callables to produce basic psueod random values.
3. The distributions, which are in distributions.pdx. The integer generators are in bounded_integers.pxd.in, which would need to be processed and then included after processing (same for bounded_integers.pxd.in). 
    a. The legacy in legacy_distributions.pxd.   If the legacy is included, then aug_bitgen_t needs to also be included which is also in legacy_distributions.pxd
4. The "helpers" which are defined in common.pxd.  These simplify implementing complete distributions which support automatix broadcasting when needed. They are only provided to match the signatures for the functions in distributions.pxd. The highest level ones are cont() and disc(). Some of the lower-level ones could easily be marked as private.

1,2 and 3 are pretty important.  4 could be in or out. It could help if someone wanted to write a fully featured distribution w/ broadcasting, but I think this use case is less likely than someone say wanting to implement a custom rejection sampler.


For someone who wants to write a new BitGenerator

1. BitGenerator and SeedSequence in bit_generato.pxd are required. As is bitgen_t which is in common. bitgen_t should probably move to bit_generators.
2. aligned_malloc:  This has been requested on multiple occasions and is practically important when interfacing with SSE or AVX code. It is potentially more general than the random module. This lives in common.pxd.

 

In the end we want to get to a doc section similar to http://scipy.github.io/devdocs/special.cython_special.html I'd think.

3. Should the legacy C functions be part of the API -- these are mostly the ones that produce or depend on polar transform normals (Box-Muller). I have a feeling no, but there may be reasons to prefer BM since they do not depend on rejection sampling.

Even if there would be a couple of users interested, it would be odd starting to do this after deeming the code "legacy". So I agree with your "no".
 
4. Should low-level API be consumable like any other numpy C API by including the usual header locations and library locations?  Right now, the pxd simplifies writing Cython but users have sp specify the location of the headers and source manually  An alternative would be to provide a function like np.get_include() -> np.random.get_include() that would specialize in random.

Good question. I'm not sure this is "like any other NumPy C API". We don't provide a C API for fft, linalg or other functionality further from core either. It's possible of course, but does it really help library authors or end users?

SciPy provides a very useful Cython API to low-level linalg. But there is little reason to provide C APIs to fft or linalg since they are all directly available. The code is random is AFAICT, one of the more complete C implementations of functions needed to produce variates from many distributions (mostly due to its ancestor randomkit, which AFAICT isn't maintained). 

An ideal API would allow projects like https://github.com/deepmind/torch-randomkit/tree/master/randomkit or numba to consume the code in NumPy without vendoring it.

Best wishes,
Kevin 


Cheers,
Ralf


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Evgeni Burovski
In reply to this post by ralfgommers





1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Preferably, yes. Don't have an opinion on an exact prefix, as long as it allows me to e.g. swap a normal distribution generator in my cython/c++ user code without too much mess. 


if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead).

Is it the only goal?

If possible, it'd be worth IMO supporting something more like cython_lapack, so that one can use existing machinery from a cython application.

Use case: an MC application where drawing random variates is in a hot loop.
Then I can start from a python prototype and cythonize it gradually.
Sure I can reach into non-public parts 


In the end we want to get to a doc section similar to http://scipy.github.io/devdocs/special.cython_special.html I'd think.

3. Should the legacy C functions be part of the API -- these are mostly the ones that produce or depend on polar transform normals (Box-Muller). I have a feeling no, but there may be reasons to prefer BM since they do not depend on rejection sampling.

Even if there would be a couple of users interested, it would be odd starting to do this after deeming the code "legacy". So I agree with your "no".

Unless it's a big maintenance burden, is there an issue with exposing both ziggurat_normal and bm_normal?
Sure, I can cook up a BM transform myself (yet again) however. 

 
4. Should low-level API be consumable like any other numpy C API by including the usual header locations and library locations?  Right now, the pxd simplifies writing Cython but users have sp specify the location of the headers and source manually  An alternative would be to provide a function like np.get_include() -> np.random.get_include() that would specialize in random.

Good question. I'm not sure this is "like any other NumPy C API". We don't provide a C API for fft, linalg or other functionality further from core either. It's possible of course, but does it really help library authors or end users?

While I gave only anecdotal evidence, not hard data, I suspect that both cython and C API would be useful.
E.g. there are c++ applications which use boost::random, would be nice to be able to swap it for numpy.random.
Also reproducibility: it's *much* easier to debug the compiled app vs its python prototype if random streams are identical.

Like I said, take all I'm saying with enough salt, as I'm wearing my user hat here.

Cheers,
Evgeni

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Robert Kern-2
In reply to this post by ralfgommers
On Thu, Sep 19, 2019 at 5:24 AM Ralf Gommers <[hidden email]> wrote:

On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

As the implementations evolve, they aren't going to match one-to-one 100%. The implementations are shared by the legacy RandomState. When we update an algorithm, we'll need to make a new function with the better algorithm for Generator to use, then we'll have two C functions roughly corresponding to the same method name (albeit on different classes). C doesn't give us as many namespace options as Python. We could rely on conventional prefixes to distinguish between the two classes of function (e.g. legacy_normal vs random_normal). There are times when it would be nice to be more descriptive about the algorithm difference (e.g. random_normal_polar vs random_normal_ziggurat), most of our algorithm updates will be minor tweaks rather than changing to a new named algorithm.
 
I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Before worrying about naming details, can we start with "what should be in the C/Cython API"? If I look through the current pxd files, there's a lot there that looks like it should be private, and what we expose as Python API is not all present as far as I can tell (which may be fine, if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead)

Using the existing distributions from Cython was a requested feature and an explicit goal, yes. There are users waiting for this.
 
--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Stanley Seibert
In reply to this post by bashtage
Just to chime in: Numba would definitely appreciate C functions to access the random distribution implementations, and have a side-project (numba-scipy) that is making the Cython wrapped functions in SciPy visible to Numba.

On Thu, Sep 19, 2019 at 5:41 AM Kevin Sheppard <[hidden email]> wrote:


On Thu, Sep 19, 2019 at 10:23 AM Ralf Gommers <[hidden email]> wrote:


On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

My gamma and f (I misspoke about t) I mean the names that appear as Generator methods:


If I understand your point (and with reference with page linked below), then there would be something like numpy.random.cython_random.gamma (which is currently called numpy.random.distributions.random_gamma). Maybe I'm not understanding your point about the Python API though.  
 

I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Before worrying about naming details, can we start with "what should be in the C/Cython API"? If I look through the current pxd files, there's a lot there that looks like it should be private, and what we expose as Python API is not all present as far as I can tell (which may be fine, if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead).

From the ground up, for someone who want to write a new distribution:
1. The bit generators.  These currently have no pxd files. These are always going to be Python obects and so it isn't absolutely essential to expose them with a low-level API.  All that is needed is the capsule which has the bitgen struct, which is what is really needed
2. bitgen_t which is in common.pxd.  This is essential since it enables access to the callables to produce basic psueod random values.
3. The distributions, which are in distributions.pdx. The integer generators are in bounded_integers.pxd.in, which would need to be processed and then included after processing (same for bounded_integers.pxd.in). 
    a. The legacy in legacy_distributions.pxd.   If the legacy is included, then aug_bitgen_t needs to also be included which is also in legacy_distributions.pxd
4. The "helpers" which are defined in common.pxd.  These simplify implementing complete distributions which support automatix broadcasting when needed. They are only provided to match the signatures for the functions in distributions.pxd. The highest level ones are cont() and disc(). Some of the lower-level ones could easily be marked as private.

1,2 and 3 are pretty important.  4 could be in or out. It could help if someone wanted to write a fully featured distribution w/ broadcasting, but I think this use case is less likely than someone say wanting to implement a custom rejection sampler.


For someone who wants to write a new BitGenerator

1. BitGenerator and SeedSequence in bit_generato.pxd are required. As is bitgen_t which is in common. bitgen_t should probably move to bit_generators.
2. aligned_malloc:  This has been requested on multiple occasions and is practically important when interfacing with SSE or AVX code. It is potentially more general than the random module. This lives in common.pxd.

 

In the end we want to get to a doc section similar to http://scipy.github.io/devdocs/special.cython_special.html I'd think.

3. Should the legacy C functions be part of the API -- these are mostly the ones that produce or depend on polar transform normals (Box-Muller). I have a feeling no, but there may be reasons to prefer BM since they do not depend on rejection sampling.

Even if there would be a couple of users interested, it would be odd starting to do this after deeming the code "legacy". So I agree with your "no".
 
4. Should low-level API be consumable like any other numpy C API by including the usual header locations and library locations?  Right now, the pxd simplifies writing Cython but users have sp specify the location of the headers and source manually  An alternative would be to provide a function like np.get_include() -> np.random.get_include() that would specialize in random.

Good question. I'm not sure this is "like any other NumPy C API". We don't provide a C API for fft, linalg or other functionality further from core either. It's possible of course, but does it really help library authors or end users?

SciPy provides a very useful Cython API to low-level linalg. But there is little reason to provide C APIs to fft or linalg since they are all directly available. The code is random is AFAICT, one of the more complete C implementations of functions needed to produce variates from many distributions (mostly due to its ancestor randomkit, which AFAICT isn't maintained). 

An ideal API would allow projects like https://github.com/deepmind/torch-randomkit/tree/master/randomkit or numba to consume the code in NumPy without vendoring it.

Best wishes,
Kevin 


Cheers,
Ralf


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

ralfgommers
In reply to this post by Robert Kern-2


On Thu, Sep 19, 2019 at 4:53 PM Robert Kern <[hidden email]> wrote:
On Thu, Sep 19, 2019 at 5:24 AM Ralf Gommers <[hidden email]> wrote:

On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

As the implementations evolve, they aren't going to match one-to-one 100%. The implementations are shared by the legacy RandomState. When we update an algorithm, we'll need to make a new function with the better algorithm for Generator to use, then we'll have two C functions roughly corresponding to the same method name (albeit on different classes). C doesn't give us as many namespace options as Python. We could rely on conventional prefixes to distinguish between the two classes of function (e.g. legacy_normal vs random_normal).

That seems simple and clear

There are times when it would be nice to be more descriptive about the algorithm difference (e.g. random_normal_polar vs random_normal_ziggurat),

We decided against versioning algorithms in NEP 19, so an update to an algorithm would mean we'd want to get rid of the older version (unless it's still in use by legacy). So AFAICT we'd never have both random_normal_polar and random_normal_ziggurat present at the same time?

I may be missing your point here, but if we have in Python `Generator.normal` and can switch its implementation from polar to ziggurat or vice versa without any deprecation, then why would we want to switch names in the C API?

most of our algorithm updates will be minor tweaks rather than changing to a new named algorithm.
 
I tried this in a local branch and it was a bit ugly since some of the distributions have common math names (e.g., gamma) and others are very short (e.g., t or f).  I think a prefix is needed, and after looking through the C API docs npy_random_ seemed like a reasonable choice (since these live in numpy.random).  

Any thoughts on the following questions are welcome (others too):

1. Should there be a prefix on the C functions?
2. If so, what should the prefix be?

Before worrying about naming details, can we start with "what should be in the C/Cython API"? If I look through the current pxd files, there's a lot there that looks like it should be private, and what we expose as Python API is not all present as far as I can tell (which may be fine, if the only goal is to let people write new generators rather than use the existing ones from Cython without the Python overhead)

Using the existing distributions from Cython was a requested feature and an explicit goal, yes. There are users waiting for this.

Thanks, clear (also from other responses on this thread).

Cheers,
Ralf


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Robert Kern-2
On Thu, Sep 19, 2019 at 11:04 PM Ralf Gommers <[hidden email]> wrote:


On Thu, Sep 19, 2019 at 4:53 PM Robert Kern <[hidden email]> wrote:
On Thu, Sep 19, 2019 at 5:24 AM Ralf Gommers <[hidden email]> wrote:

On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

As the implementations evolve, they aren't going to match one-to-one 100%. The implementations are shared by the legacy RandomState. When we update an algorithm, we'll need to make a new function with the better algorithm for Generator to use, then we'll have two C functions roughly corresponding to the same method name (albeit on different classes). C doesn't give us as many namespace options as Python. We could rely on conventional prefixes to distinguish between the two classes of function (e.g. legacy_normal vs random_normal).

That seems simple and clear

There are times when it would be nice to be more descriptive about the algorithm difference (e.g. random_normal_polar vs random_normal_ziggurat),

We decided against versioning algorithms in NEP 19, so an update to an algorithm would mean we'd want to get rid of the older version (unless it's still in use by legacy). So AFAICT we'd never have both random_normal_polar and random_normal_ziggurat present at the same time?

Well, we must because one's used by the legacy RandomState and one's used by Generator. :-)
 
I may be missing your point here, but if we have in Python `Generator.normal` and can switch its implementation from polar to ziggurat or vice versa without any deprecation, then why would we want to switch names in the C API?

I didn't mean to suggest that we'd have an unbounded number of functions as we improve the algorithms, just that we might have 2 once we decide to change something about the algorithm. We need 2 to support both the improved algorithm in Generator and the legacy algorithm in RandomState. The current implementation of the C function would be copied to a new name (`legacy_foo` or whatever), then we'd make RandomState use that frozen copy, then we make the desired modifications to the main function that Generator is referencing (`random_foo`).

Or we could just make those legacy copies now so that people get to use them explicitly under the legacy names, whatever they are, and we can feel more free to modify the main implementations. I suggested this earlier, but convinced myself that it wasn't strictly necessary. But then I admit I was more focused on the Python API stability than any promises about the C/Cython API.

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

To your point, though, we don't have to use gratuitously different names when there _is_ a one-to-one relationship. `random_gauss_zig` should be `random_normal`.

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

mattip
Administrator

On 20/9/19 6:25 am, Robert Kern wrote:
>
> Well, we must because one's used by the legacy RandomState and one's
> used by Generator. :-)
>

I would prefer not to create a legacy C-API at all. Are we required to
from the NEP?

Matti

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

ralfgommers
In reply to this post by Robert Kern-2


On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:
On Thu, Sep 19, 2019 at 11:04 PM Ralf Gommers <[hidden email]> wrote:


On Thu, Sep 19, 2019 at 4:53 PM Robert Kern <[hidden email]> wrote:
On Thu, Sep 19, 2019 at 5:24 AM Ralf Gommers <[hidden email]> wrote:

On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially. 

When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator. 

Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).

Ralf Gommers suggested unprefixed names.

I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".

As the implementations evolve, they aren't going to match one-to-one 100%. The implementations are shared by the legacy RandomState. When we update an algorithm, we'll need to make a new function with the better algorithm for Generator to use, then we'll have two C functions roughly corresponding to the same method name (albeit on different classes). C doesn't give us as many namespace options as Python. We could rely on conventional prefixes to distinguish between the two classes of function (e.g. legacy_normal vs random_normal).

That seems simple and clear

There are times when it would be nice to be more descriptive about the algorithm difference (e.g. random_normal_polar vs random_normal_ziggurat),

We decided against versioning algorithms in NEP 19, so an update to an algorithm would mean we'd want to get rid of the older version (unless it's still in use by legacy). So AFAICT we'd never have both random_normal_polar and random_normal_ziggurat present at the same time?

Well, we must because one's used by the legacy RandomState and one's used by Generator. :-)
 
I may be missing your point here, but if we have in Python `Generator.normal` and can switch its implementation from polar to ziggurat or vice versa without any deprecation, then why would we want to switch names in the C API?

I didn't mean to suggest that we'd have an unbounded number of functions as we improve the algorithms, just that we might have 2 once we decide to change something about the algorithm. We need 2 to support both the improved algorithm in Generator and the legacy algorithm in RandomState. The current implementation of the C function would be copied to a new name (`legacy_foo` or whatever), then we'd make RandomState use that frozen copy, then we make the desired modifications to the main function that Generator is referencing (`random_foo`).

Or we could just make those legacy copies now so that people get to use them explicitly under the legacy names, whatever they are, and we can feel more free to modify the main implementations. I suggested this earlier, but convinced myself that it wasn't strictly necessary. But then I admit I was more focused on the Python API stability than any promises about the C/Cython API.

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort. I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Anyway, this is not an issue today. If we try to keep Python and C APIs matching, we can deal with possible difficulties with that if and when they arise - should be infrequent.

Cheers,
Ralf


To your point, though, we don't have to use gratuitously different names when there _is_ a one-to-one relationship. `random_gauss_zig` should be `random_normal`.

--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Neal Becker
I have used C-api in the past, and would like to see a convenient and
stable way to do this.  Currently I'm using randomgen, but calling
(from c++)
to the python api.  The inefficiency is amortized by generating and
caching batches of results.

I thought randomgen was supposed to be the future of numpy random, so
I've based on that.

On Fri, Sep 20, 2019 at 6:08 AM Ralf Gommers <[hidden email]> wrote:

>
>
>
> On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:
>>
>> On Thu, Sep 19, 2019 at 11:04 PM Ralf Gommers <[hidden email]> wrote:
>>>
>>>
>>>
>>> On Thu, Sep 19, 2019 at 4:53 PM Robert Kern <[hidden email]> wrote:
>>>>
>>>> On Thu, Sep 19, 2019 at 5:24 AM Ralf Gommers <[hidden email]> wrote:
>>>>>
>>>>>
>>>>> On Thu, Sep 19, 2019 at 10:28 AM Kevin Sheppard <[hidden email]> wrote:
>>>>>>
>>>>>> There are some users of the NumPy C code in randomkit.  This was never officially supported.  There has been a long open issue to provide this officially.
>>>>>>
>>>>>> When I wrote randomgen I supplied .pdx files that make it simpler to write Cython code that uses the components.  The lower-level API has not had much scrutiny and is in need of a clean-up.   I thought this would also encourage users to extend the random machinery themselves as part of their project or code so as to minimize the requests for new (exotic) distributions to be included in Generator.
>>>>>>
>>>>>> Most of the generator functions follow a pattern random_DISTRIBUTION.  Some have a bit more name mangling which can easily be cleaned up (like ranomd_gauss_zig, which should become PREFIX_standard_normal).
>>>>>>
>>>>>> Ralf Gommers suggested unprefixed names.
>>>>>
>>>>>
>>>>> I suggested that the names should match the Python API, which I think isn't quite the same. The Python API doesn't contain things like "gamma", "t" or "f".
>>>>
>>>>
>>>> As the implementations evolve, they aren't going to match one-to-one 100%. The implementations are shared by the legacy RandomState. When we update an algorithm, we'll need to make a new function with the better algorithm for Generator to use, then we'll have two C functions roughly corresponding to the same method name (albeit on different classes). C doesn't give us as many namespace options as Python. We could rely on conventional prefixes to distinguish between the two classes of function (e.g. legacy_normal vs random_normal).
>>>
>>>
>>> That seems simple and clear
>>>
>>>> There are times when it would be nice to be more descriptive about the algorithm difference (e.g. random_normal_polar vs random_normal_ziggurat),
>>>
>>>
>>> We decided against versioning algorithms in NEP 19, so an update to an algorithm would mean we'd want to get rid of the older version (unless it's still in use by legacy). So AFAICT we'd never have both random_normal_polar and random_normal_ziggurat present at the same time?
>>
>>
>> Well, we must because one's used by the legacy RandomState and one's used by Generator. :-)
>>
>>>
>>> I may be missing your point here, but if we have in Python `Generator.normal` and can switch its implementation from polar to ziggurat or vice versa without any deprecation, then why would we want to switch names in the C API?
>>
>>
>> I didn't mean to suggest that we'd have an unbounded number of functions as we improve the algorithms, just that we might have 2 once we decide to change something about the algorithm. We need 2 to support both the improved algorithm in Generator and the legacy algorithm in RandomState. The current implementation of the C function would be copied to a new name (`legacy_foo` or whatever), then we'd make RandomState use that frozen copy, then we make the desired modifications to the main function that Generator is referencing (`random_foo`).
>>
>> Or we could just make those legacy copies now so that people get to use them explicitly under the legacy names, whatever they are, and we can feel more free to modify the main implementations. I suggested this earlier, but convinced myself that it wasn't strictly necessary. But then I admit I was more focused on the Python API stability than any promises about the C/Cython API.
>>
>> We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.
>
>
> You're saying "be so strict" as if it were a bad thing, or a major effort. I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.
>
> Anyway, this is not an issue today. If we try to keep Python and C APIs matching, we can deal with possible difficulties with that if and when they arise - should be infrequent.
>
> Cheers,
> Ralf
>
>>
>> To your point, though, we don't have to use gratuitously different names when there _is_ a one-to-one relationship. `random_gauss_zig` should be `random_normal`.
>>
>> --
>> Robert Kern
>> _______________________________________________
>> NumPy-Discussion mailing list
>> [hidden email]
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion



--
Those who don't understand recursion are doomed to repeat it
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

mattip
Administrator
On 20/9/19 2:18 pm, Neal Becker wrote:
> I have used C-api in the past, and would like to see a convenient and
> stable way to do this.  Currently I'm using randomgen, but calling
> (from c++)
> to the python api.  The inefficiency is amortized by generating and
> caching batches of results.
>
> I thought randomgen was supposed to be the future of numpy random, so
> I've based on that.
>

It would be good to have actual users tell us what APIs they need.

Are you using the BitGenerators or only the higher level Generator
functions?


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Neal Becker
I'm using the low-level generator.  In this example I need to generate
small random integers of defined bit widths (e.g., 2 bit).  So
I get 64-bit uniform random uintegers, and cache the values, returning
them n-bits (e.g. 2 bits) at a time to the caller.

On Fri, Sep 20, 2019 at 9:03 AM Matti Picus <[hidden email]> wrote:

>
> On 20/9/19 2:18 pm, Neal Becker wrote:
> > I have used C-api in the past, and would like to see a convenient and
> > stable way to do this.  Currently I'm using randomgen, but calling
> > (from c++)
> > to the python api.  The inefficiency is amortized by generating and
> > caching batches of results.
> >
> > I thought randomgen was supposed to be the future of numpy random, so
> > I've based on that.
> >
>
> It would be good to have actual users tell us what APIs they need.
>
> Are you using the BitGenerators or only the higher level Generator
> functions?
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [hidden email]
> https://mail.python.org/mailman/listinfo/numpy-discussion



--
Those who don't understand recursion are doomed to repeat it
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Robert Kern-2
In reply to this post by ralfgommers


On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort.

I am. It's an unnecessary limitation on the C API without a corresponding benefit. Your original complaint is much more directly addressed by a "don't gratuitously name related C functions differently than the Python methods they implement" rule (e.g. "gauss" instead of "normal").
 
I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Not at all. In this example, neither one of those functions is useful without the other. If one is public, both must be.

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

ralfgommers


On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort.

I am. It's an unnecessary limitation on the C API without a corresponding benefit. Your original complaint

It's not a "complaint". We're having this discussion because we shipped a partial API in 1.17.0 that we will now have to go back and either take out or clean up in 1.17.3. The PR for the new numpy.random grew so large that we didn't notice or discuss that (such things happen, no big deal - we have limited reviewer bandwidth). So now that we do, it makes sense to actually think about what needs to be in the API. For now I think that's only the parts that are matching the Python API plus what is needed to use them from C/Cython. Future additions require similar review and criteria as adding to the Python API and the existing NumPy C API. To me, your example seems to (a) not deal with API stability, and (b) expose too much implementation detail.

To be clear about the actual status, we:
- shipped one header file (bitgen.h)
- shipped two pxd files (common.pxd, bit_generator.pxd)
- removed a header file we used to ship (randomkit.h)
- did not ship distributions.pxd, bounded_integers.pxd, legacy_distributions.pxd or related header files

bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be there. I think the intent was to ship at least distributions.pxd/h, and perhaps all of those pxd files.

is much more directly addressed by a "don't gratuitously name related C functions differently than the Python methods they implement" rule (e.g. "gauss" instead of "normal").
 
I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Not at all. In this example, neither one of those functions is useful without the other. If one is public, both must be.

If neither one is useful without the other, it sounds like both should be private and the third one that puts them together - the one that didn't change signature and implements `Generator.foo` - is the public one.

Cheers,
Ralf



_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Robert Kern-2
On Fri, Sep 20, 2019 at 11:33 PM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort.

I am. It's an unnecessary limitation on the C API without a corresponding benefit. Your original complaint

It's not a "complaint".

Please forgive me. That word choice was not intended to be dismissive. I don't view "complaints" as minor annoyances that the "complainer" should just shut up and deal with, or that the "complainer" is just being annoying, but I can see how it came across that I might. Please continue as if I said "The problem you originally noted...". It's a real problem that needs to be addressed. We just have different thoughts on exactly what is needed to address it.
 
We're having this discussion because we shipped a partial API in 1.17.0 that we will now have to go back and either take out or clean up in 1.17.3. The PR for the new numpy.random grew so large that we didn't notice or discuss that (such things happen, no big deal - we have limited reviewer bandwidth). So now that we do, it makes sense to actually think about what needs to be in the API. For now I think that's only the parts that are matching the Python API plus what is needed to use them from C/Cython. Future additions require similar review and criteria as adding to the Python API and the existing NumPy C API. To me, your example seems to (a) not deal with API stability, and (b) expose too much implementation detail.

To be clear about the actual status, we:
- shipped one header file (bitgen.h)
- shipped two pxd files (common.pxd, bit_generator.pxd)
- removed a header file we used to ship (randomkit.h)
- did not ship distributions.pxd, bounded_integers.pxd, legacy_distributions.pxd or related header files

bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be there. I think the intent was to ship at least distributions.pxd/h, and perhaps all of those pxd files.

is much more directly addressed by a "don't gratuitously name related C functions differently than the Python methods they implement" rule (e.g. "gauss" instead of "normal").
 
I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Not at all. In this example, neither one of those functions is useful without the other. If one is public, both must be.

If neither one is useful without the other, it sounds like both should be private and the third one that puts them together - the one that didn't change signature and implements `Generator.foo` - is the public one. 

That defeats the point of using the C API in this instance, though. The reason it got split into two (in this plausible hypothetical; I'm thinking of the binomial implementation here, which caches these intermediates in a passed-in struct) is because in C you want to call them in different ways (in this case, the prep function once and the sampling function many times). It's not that you always call them in lockstep pairs: `prep(); sample(); prep(); sample();`. A C function that combines them defeats the efficiency that one wanted to gain by using the C API. The C API has different needs than the Python API, because the Python API has a lot more support from the Python language and numpy data structures to be able to jam a lot of functionality into a single function signature that C just doesn't give us. The purpose of the C API is not just to avoid Python function call overhead. If there's a reason that the Generator method needs the implementation split up into multiple C functions, that's a really strong signal that *other* C code using the C API will need that same split. It's not just an implementation detail; it's a documented use case.

Given the prevalence of Cython, it's actually really easy to use the Python API pretty easily in "C", so it's actually a huge waste if the C API matches the Python API too closely. The power and utility of the C API will be in how it *differs* from the Python API. For the distribution methods, this is largely in how it lets you sample one number at a time without bothering with the numpy and broadcasting overhead. That's the driving motivation for having a C API for the distributions, and the algorithms that we choose have consequences for the C API that will best satisfy that motivation.

The issue that this raises with API stability is that if you require a one-to-one match between the C API function and the Generator method, we can never change the function signature of the C function. That's going to forbid us from moving from an algorithm that doesn't need any precomputation to one that does. That precomputation either requires a 2-function dance or a new argument to keep the cached values (c.f. `random_binomial()`), so it's always going to affect the API. To use such a new algorithm, we'll have to add a new function or two to the C API and document the deprecation of the older API function. We can't just swap it in under the same name, even if the new function is standalone. That's a significant constraint on future development when the main issue that led to the suggestion is that the names were sometimes gratuitously different between the C API and the Python API, which hindered discoverability. We can fix *that* problem easily without constraining the universe of algorithms that we might consider using in the future. 

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

ralfgommers


On Fri, Sep 20, 2019 at 9:31 PM Robert Kern <[hidden email]> wrote:
On Fri, Sep 20, 2019 at 11:33 PM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <[hidden email]> wrote:


On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <[hidden email]> wrote:

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort.

I am. It's an unnecessary limitation on the C API without a corresponding benefit. Your original complaint

It's not a "complaint".

Please forgive me. That word choice was not intended to be dismissive. I don't view "complaints" as minor annoyances that the "complainer" should just shut up and deal with, or that the "complainer" is just being annoying, but I can see how it came across that I might. Please continue as if I said "The problem you originally noted...". It's a real problem that needs to be addressed. We just have different thoughts on exactly what is needed to address it.

Okay, thank you:)

 
We're having this discussion because we shipped a partial API in 1.17.0 that we will now have to go back and either take out or clean up in 1.17.3. The PR for the new numpy.random grew so large that we didn't notice or discuss that (such things happen, no big deal - we have limited reviewer bandwidth). So now that we do, it makes sense to actually think about what needs to be in the API. For now I think that's only the parts that are matching the Python API plus what is needed to use them from C/Cython. Future additions require similar review and criteria as adding to the Python API and the existing NumPy C API. To me, your example seems to (a) not deal with API stability, and (b) expose too much implementation detail.

To be clear about the actual status, we:
- shipped one header file (bitgen.h)
- shipped two pxd files (common.pxd, bit_generator.pxd)
- removed a header file we used to ship (randomkit.h)
- did not ship distributions.pxd, bounded_integers.pxd, legacy_distributions.pxd or related header files

bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be there. I think the intent was to ship at least distributions.pxd/h, and perhaps all of those pxd files.

is much more directly addressed by a "don't gratuitously name related C functions differently than the Python methods they implement" rule (e.g. "gauss" instead of "normal").
 
I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Not at all. In this example, neither one of those functions is useful without the other. If one is public, both must be.

If neither one is useful without the other, it sounds like both should be private and the third one that puts them together - the one that didn't change signature and implements `Generator.foo` - is the public one. 

That defeats the point of using the C API in this instance, though. The reason it got split into two (in this plausible hypothetical; I'm thinking of the binomial implementation here, which caches these intermediates in a passed-in struct) is because in C you want to call them in different ways (in this case, the prep function once and the sampling function many times). It's not that you always call them in lockstep pairs: `prep(); sample(); prep(); sample();`. A C function that combines them defeats the efficiency that one wanted to gain by using the C API. The C API has different needs than the Python API, because the Python API has a lot more support from the Python language and numpy data structures to be able to jam a lot of functionality into a single function signature that C just doesn't give us. The purpose of the C API is not just to avoid Python function call overhead. If there's a reason that the Generator method needs the implementation split up into multiple C functions, that's a really strong signal that *other* C code using the C API will need that same split. It's not just an implementation detail; it's a documented use case.

Given the prevalence of Cython, it's actually really easy to use the Python API pretty easily in "C", so it's actually a huge waste if the C API matches the Python API too closely. The power and utility of the C API will be in how it *differs* from the Python API. For the distribution methods, this is largely in how it lets you sample one number at a time without bothering with the numpy and broadcasting overhead. That's the driving motivation for having a C API for the distributions, and the algorithms that we choose have consequences for the C API that will best satisfy that motivation.

The issue that this raises with API stability is that if you require a one-to-one match between the C API function and the Generator method, we can never change the function signature of the C function. That's going to forbid us from moving from an algorithm that doesn't need any precomputation to one that does. That precomputation either requires a 2-function dance or a new argument to keep the cached values (c.f. `random_binomial()`), so it's always going to affect the API. To use such a new algorithm, we'll have to add a new function or two to the C API and document the deprecation of the older API function. We can't just swap it in under the same name, even if the new function is standalone. That's a significant constraint on future development when the main issue that led to the suggestion is that the names were sometimes gratuitously different between the C API and the Python API, which hindered discoverability. We can fix *that* problem easily without constraining the universe of algorithms that we might consider using in the future. 

Fair enough, now the use case is clear to me. In summary: there may be real reasons to deviate and add more functions; let's do so if and when it makes sense to.

Cheers,
Ralf


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Stefan van der Walt
In reply to this post by Robert Kern-2
On Fri, Sep 20, 2019, at 21:30, Robert Kern wrote:
Given the prevalence of Cython, it's actually really easy to use the Python API pretty easily in "C", so it's actually a huge waste if the C API matches the Python API too closely. The power and utility of the C API will be in how it *differs* from the Python API. For the distribution methods, this is largely in how it lets you sample one number at a time without bothering with the numpy and broadcasting overhead. That's the driving motivation for having a C API for the distributions, and the algorithms that we choose have consequences for the C API that will best satisfy that motivation.

I'd like to clarify what exactly we mean by exposing a C API.  Do we have in mind that our random number generators can be used from standalone C code, or via Cython `cimport` like with the current numpy.pxd?

It sounds like we want to expose the highest level generators; do we also want to provide access to the bit streams?

Best regards,
Stéfan


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

Robert Kern-2
On Wed, Sep 25, 2019, 12:56 PM Stefan van der Walt <[hidden email]> wrote:
On Fri, Sep 20, 2019, at 21:30, Robert Kern wrote:
Given the prevalence of Cython, it's actually really easy to use the Python API pretty easily in "C", so it's actually a huge waste if the C API matches the Python API too closely. The power and utility of the C API will be in how it *differs* from the Python API. For the distribution methods, this is largely in how it lets you sample one number at a time without bothering with the numpy and broadcasting overhead. That's the driving motivation for having a C API for the distributions, and the algorithms that we choose have consequences for the C API that will best satisfy that motivation.

I'd like to clarify what exactly we mean by exposing a C API.  Do we have in mind that our random number generators can be used from standalone C code, or via Cython `cimport` like with the current numpy.pxd?

Cython is the priority. Numba and cffi/ctypes are also desired and relatively easy to do with capsules. Pure C (via #include) is desired, but can be added later because doing that is more annoying. 

It sounds like we want to expose the highest level generators; do we also want to provide access to the bit streams?

100%

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Low-level API for Random

bashtage
In reply to this post by Stefan van der Walt


I'd like to clarify what exactly we mean by exposing a C API.  Do we have in mind that our random number generators can be used from standalone C code, or via Cython `cimport` like with the current numpy.pxd?

It sounds like we want to expose the highest level generators; do we also want to provide access to the bit streams?


What do you mean by standalone C?  A Python extension is written in C (but not Cython)?  Or a C application that doesn't include Python.h?  The former is pretty easy since you can use a few PyObjects to simplify initializing the bit generator, and the rest of the code can be directly used in C without any more Python objects.  The latter is also doable although the low-level functions needed to initialize the bit generators (which are just C structs) have no standardization.  I think the only component in a standalone C application that would need some non-trivial work is SeedSequence (i.e., more than changing function names or reorganizing files).

Like Robert, I suspect that Cython users would be the largest immediate beneficiaries of a lower-level API.

numba end-users can already consume the bit generators through the exposed CFFI/ctypes interface they provide.  These can then be used with the higher-leverl generators, although end users have to build a shared lib/DLL first. Getting the C API in shape to be used directly by numba is probably a bigger task. 

Kevin




_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion