PEP 574 - zero-copy pickling with out of band data

11 messages
PEP 574 - zero-copy pickling with out of band data

Antoine Pitrou-2

Hello,

Some of you might know that I've been working on a PEP in order to
improve pickling performance of large (or huge) data.  The PEP,
numbered 574 and titled "Pickle protocol 5 with out-of-band data",
allows participating data types to be pickled without any memory copy.
https://www.python.org/dev/peps/pep-0574/

The PEP already has an implementation, which is backported as an
independent PyPI package under the name "pickle5".
https://pypi.org/project/pickle5/
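For those who haven't read the PEP yet, a minimal sketch of the protocol-5 API it defines (stdlib `pickle` on Python 3.8+, or `import pickle5 as pickle` with the backport; whether the NumPy array actually goes out of band depends on the NumPy version in use):

```python
import pickle  # or: import pickle5 as pickle, on older Pythons

import numpy as np

arr = np.arange(1_000_000)

# Producer side: collect out-of-band buffers instead of letting pickle
# copy the array's memory into the stream.
buffers = []
data = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Consumer side: supply the buffers back when unpickling.
arr2 = pickle.loads(data, buffers=buffers)
assert (arr == arr2).all()
```

A transport (socket, pipe, shared memory) can then ship `data` and the raw `buffers` separately, avoiding the intermediate copy that in-band pickling makes.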

I also have a working patch updating PyArrow to use the PEP-defined
extensions to allow for zero-copy pickling of Arrow arrays - without
breaking compatibility with existing usage:
https://github.com/apache/arrow/pull/2161
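The way a participating type opts in is by returning a `pickle.PickleBuffer` from its reduction when the protocol is 5 or higher; a toy sketch of the pattern (the `Blob` class is purely illustrative, not the actual Arrow code):

```python
import pickle


class Blob:
    """Toy container holding a large, contiguous binary payload."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Expose the payload as an out-of-band buffer: pickle hands
            # it to the buffer_callback instead of copying it into the
            # stream.
            return (Blob, (pickle.PickleBuffer(self.data),))
        # Older protocols: fall back to an in-band copy.
        return (Blob, (bytes(self.data),))


blob = Blob(b"x" * 10_000_000)
buffers = []
data = pickle.dumps(blob, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(data, buffers=buffers)
assert bytes(restored.data) == bytes(blob.data)
assert len(data) < 1000  # the stream holds only metadata, no payload
```

The in-band fallback keeps the type picklable with protocols 2-4, which is how compatibility with existing usage is preserved.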

Still, it is obvious one of the primary targets of PEP 574 is Numpy
arrays, as the most prevalent datatype in the Python scientific
ecosystem.  I'm personally satisfied with the current state of the PEP,
but I'd like to have feedback from Numpy core maintainers.  I haven't
tried (yet?) to draft a Numpy patch to add PEP 574 support, since that's
likely to be more involved due to the complexity of Numpy and due to
the core being written in C.  Therefore I would like some help
evaluating whether the PEP is likely to be a good fit for Numpy.

Regards

Antoine.
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: PEP 574 - zero-copy pickling with out of band data

Charles R Harris


On Mon, Jul 2, 2018 at 3:03 PM, Antoine Pitrou <[hidden email]> wrote:

<snip>


Maybe somewhat off topic, but we have had trouble with a 2 GiB limit on file writes on OS X. See https://github.com/numpy/numpy/issues/3858. Does your implementation work around that?

Chuck


Re: PEP 574 - zero-copy pickling with out of band data

Charles R Harris


On Mon, Jul 2, 2018 at 5:16 PM, Charles R Harris <[hidden email]> wrote:


<snip>

ISTR that some parallel processing applications sent pickled arrays around to different processes. I don't know if that is still the case, but if so, no copy might be a big gain for them.

Chuck


Re: PEP 574 - zero-copy pickling with out of band data

Andrew Nelson-6
<snip>

On Tue, 3 Jul 2018 at 09:31, Charles R Harris <[hidden email]> wrote:

ISTR that some parallel processing applications sent pickled arrays around to different processes. I don't know if that is still the case, but if so, no copy might be a big gain for them.

That is very much correct. One example is using MCMC, which is massively parallel. I do parallelisation with mpi4py, and this requires distribution of pickled data of a reasonable size to the entire MPI world. This pickling introduces quite a bit of overhead.
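To put a rough number on that overhead: under protocol 5 the pickle stream itself shrinks to metadata, while the payload travels as separate buffers that a transport such as MPI can send with its buffer-aware primitives. A sketch (assumes a NumPy version recent enough to implement out-of-band pickling):

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)  # ~8 MB payload

# Protocol 4: the array's bytes are copied into the pickle stream.
inband = pickle.dumps(arr, protocol=4)

# Protocol 5: the stream carries only metadata; the payload stays in
# out-of-band buffers the transport can send without an extra copy.
buffers = []
stream = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

print(f"in-band stream:     {len(inband)} bytes")
print(f"out-of-band stream: {len(stream)} bytes")
print(f"payload in buffers: {sum(b.raw().nbytes for b in buffers)} bytes")
```

The distribution step then only needs to pickle once and hand each raw buffer to the MPI layer, instead of pushing the whole payload through the pickle stream.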


Re: PEP 574 - zero-copy pickling with out of band data

Nathan Goldbaum


On Mon, Jul 2, 2018 at 7:42 PM Andrew Nelson <[hidden email]> wrote:
<snip>

On Tue, 3 Jul 2018 at 09:31, Charles R Harris <[hidden email]> wrote:

ISTR that some parallel processing applications sent pickled arrays around to different processes. I don't know if that is still the case, but if so, no copy might be a big gain for them.

That is very much correct. One example is using MCMC, which is massively parallel. I do parallelisation with mpi4py, and this requires distribution of pickled data of a reasonable size to the entire MPI world. This pickling introduces quite a bit of overhead.

Doesn’t mpi4py have support for buffered low-level communication of numpy arrays? See e.g. 

Although I guess that with Antoine’s proposal, uses of the “lowercase” mpi4py API, where data might get pickled, will see speedups.


Re: PEP 574 - zero-copy pickling with out of band data

Gael Varoquaux
In reply to this post by Charles R Harris
On Mon, Jul 02, 2018 at 05:31:05PM -0600, Charles R Harris wrote:
> ISTR that some parallel processing applications sent pickled arrays around to
> different processes. I don't know if that is still the case, but if so, no copy
> might be a big gain for them.

Yes, most parallel code that runs across processes or across computers
uses some form of pickle. I hope that this PEP will enable large
speed-ups. This would be a big deal for parallelism in numerical Python.

Re: PEP 574 - zero-copy pickling with out of band data

Andrea Gavana


On Tue, 3 Jul 2018 at 07.35, Gael Varoquaux <[hidden email]> wrote:
<snip>


This sounds so very powerful... it’s such a pity that these types of gems won’t be backported to Python 2 - we have so many legacy applications smoothly running in Python 2 and nowhere near the required resources to even start porting to Python 3, and pickle5 looks like a small revolution in the data-persistence world.

Andrea.




Re: PEP 574 - zero-copy pickling with out of band data

Gael Varoquaux
On Tue, Jul 03, 2018 at 08:54:51AM +0200, Andrea Gavana wrote:
> This sounds so very powerful... it’s such a pity that these types of gems won’t
> be backported to Python 2 - we have so many legacy applications smoothly
> running in Python 2 and nowhere near the required resources to even start
> porting to Python 3,

I am a strong defender of stability and long-term support in scientific
software. But what you are demanding is that developers who do free work
forgo the benefit of that work: a more powerful environment.

More recent versions of Python are improved compared to older ones and
make it much easier to write certain idioms. Developers make these
changes over years to ensure that codebases become simpler and more
robust. Backporting in effect means doing this work twice, the second
time with more constraints. I just allocated something like a man-year to
making robust parallel-computing features work on both Python 2 and Python
3. With this man-year we could have done many other things. Did I make
the correct decision? I am not sure, because this is just creating more
technical debt.

I understand that we all sit on piles of code that we wrote for a given
application at one point, and that we will not be able to modernise it
all. But the fact that we don't have the bandwidth to make it evolve
probably means that we need to triage what's important and write off the
rest. Just like if I have 5 old cars in my backyard, I won't be able to
keep them all on the road unless I am very rich.

People asking for indefinite backports to Python 2 are just asking
developers to write them a second free check, even larger than the one
they just got by having the feature under Python 3.


Gaël



Re: PEP 574 - zero-copy pickling with out of band data

Andrea Gavana

Hi,


On Tue, 3 Jul 2018 at 09.20, Gael Varoquaux <[hidden email]> wrote:
<snip>

Just to clarify: I wasn’t asking for anything, just complimenting Antoine’s work on something that appears to be a wonderful feature. There was a bit of a rant on my part for sure, but I never asked anyone to redo the work to make it run on Python 2.

Allocating resources to port hundreds of thousands of LOC is close to impossible in the industry I work in, especially because our big team (the two of us) doesn’t code for a living; we have many other duties. We code to make our lives easier.

I’m happy if you feel better after your tirade.

Andrea.








Re: PEP 574 - zero-copy pickling with out of band data

Gael Varoquaux
On Tue, Jul 03, 2018 at 09:42:08AM +0200, Andrea Gavana wrote:
> I’m happy if you feel better after your tirade.

Not really. I worry a lot that many users are going to be surprised when
Python 2 stops being supported, which is in a couple of years. I wrote
this tirade not to make myself feel better, but to underline that the
switch is happening, and that more and more of these exciting new things
will pop up in Python 3. Soon, new releases of projects like numpy and
scikit-learn won't support Python 2 anymore, which means that they will
be gaining exciting features too that don't benefit Python 2 users.

It is a pity that some people find themselves left behind, because Python
3 is more and more exciting, with cool asynchronous features, more robust
multiprocessing, better pickling, and many other great features.

I found that, given a good test suite, porting from 2 to 3 wasn't very
hard. The two key ingredients were a good test suite and no hand-written
C bindings (Cython makes supporting both 2 and 3 really easy).

My goal is not to shame or create uneasy discussions, but to encourage
people to upgrade, at least for their core dependencies. Maybe I am not
conveying the right message, or using the right tone; in which case, my
apologies. I am genuinely excited about the Python 3 future.

Best,

Gaël

Re: PEP 574 - zero-copy pickling with out of band data

Antoine Pitrou-2
In reply to this post by Charles R Harris
On Mon, 2 Jul 2018 17:16:00 -0600
Charles R Harris <[hidden email]> wrote:
> Maybe somewhat off topic, but we have had trouble with a 2 GiB limit on
> file writes on OS X. See https://github.com/numpy/numpy/issues/3858. Does
> your implementation work around that?  

No, it's not the same topic at all.  I'd recommend perhaps pinging on
the python-dev PR.

Regards

Antoine.