

Hi,
While working again on the fftpack module, to clean things up and
speed some backends (in particular fftw3, which is really suboptimal
right now), I remembered how much unaligned data pointer in numpy arrays
hurt performances. So I would like to relaunch the discussion on aligned
allocators and default alignement for numpy arrays :
http://www.mailarchive.com/numpydiscussion@.../msg04005.htmlBasically, what I have in mind is, in a first step (for numpy 1.2):
 define functions to allocate on a given alignement
 make PyMemData_NEW 16 byte aligned by default (to be compatible
with SSE and co).
The problem was, and still is, realloc. It is not possible to implement
realloc with malloc/free (in a portable way), and as such, it is not
possible to have an aligned realloc.
In numpy, we can always replace realloc by malloc/free, because we know
the size of the old block: would deprecating PyMemData_RENEW and
replacing them by PyMemeData_NEW/PyMemData_FREE be possible, such as to
make all numpy arrays follow a default alignement ? There are only a few
of them in numpy (6 of them), 0 in scipy, and I guess extensions never
really used them ?
cheers,
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Mon, May 5, 2008 at 5:44 AM, David Cournapeau < [hidden email]> wrote:
Hi,
While working again on the fftpack module, to clean things up and
speed some backends (in particular fftw3, which is really suboptimal
right now), I remembered how much unaligned data pointer in numpy arrays
hurt performances. So I would like to relaunch the discussion on aligned
allocators and default alignement for numpy arrays :
http://www.mailarchive.com/numpydiscussion@.../msg04005.html
Basically, what I have in mind is, in a first step (for numpy 1.2):
 define functions to allocate on a given alignement
 make PyMemData_NEW 16 byte aligned by default (to be compatible
with SSE and co).
The problem was, and still is, realloc. It is not possible to implement
realloc with malloc/free (in a portable way), and as such, it is not
possible to have an aligned realloc.
In numpy, we can always replace realloc by malloc/free, because we know
the size of the old block: would deprecating PyMemData_RENEW and
replacing them by PyMemeData_NEW/PyMemData_FREE be possible, such as to
make all numpy arrays follow a default alignement ? There are only a few
of them in numpy (6 of them), 0 in scipy, and I guess extensions never
really used them ? I don't think you would want to do this in the core of PyArray_FromIter; presumably realloc can sometimes reuse the existing pointer and save on allocating a new chunk of memory. Since there are lots of allocations in fromiter, this could potentially be a big performance hit. (At least I think so, realloc has always been kinda voodoo to me). One could use PyMemData_NEW/PyMemData_FREE in the final allocation to make sure that the data is alligned, we allready do a realloc there to dump any extra space. Or, possibly better, one could choose which allocation strategy to use here depending on whether the data was alligned or not.
cheers,
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion
 . __ . \ . . [hidden email]
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Mon, May 5, 2008 at 7:44 AM, David Cournapeau
< [hidden email]> wrote:
> In numpy, we can always replace realloc by malloc/free, because we know
> the size of the old block: would deprecating PyMemData_RENEW and
> replacing them by PyMemeData_NEW/PyMemData_FREE be possible, such as to
> make all numpy arrays follow a default alignement ? There are only a few
> of them in numpy (6 of them), 0 in scipy, and I guess extensions never
> really used them ?
I am in favor of at least trying this out. We will have to have a set
of benchmarks to make sure we haven't hurt the current uses of
PyMemData_RENEW which Tim points out.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Tue, May 6, 2008 at 1:59 AM, Timothy Hochberg < [hidden email]> wrote:
> I don't think you would want to do this in the core of PyArray_FromIter;
> presumably realloc can sometimes reuse the existing pointer and save on
> allocating a new chunk of memory. Since there are lots of allocations in
> fromiter, this could potentially be a big performance hit.
Looking at PyArray_FromIter, there is already a log(n) behavior for
allocation, so I am not sure it would hurt so much: I would guess that
realloc often need to allocate a new block if the size is the double
of the former pointer.
That's how realloc seems to work on FreeBSD, at least:
http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/stdlib/malloc.c?rev=1.171;contenttype=text%2Fplain> (At least I think
> so, realloc has always been kinda voodoo to me).
Once malloc is implemented, I would guess realloc to be simple, no ?
For each block of memory in the malloc chain, just check whereas there
is enough size to reuse the same block and adjacent block if any,
otherwise, free + malloc + copy, non ?
One could use
> PyMemData_NEW/PyMemData_FREE in the final allocation to make sure that the
> data is alligned, we allready do a realloc there to dump any extra space.
I guess that would be the easiest: directly use realloc in the loop,
and use PyDataMem_NEW/FREE at the end if necessary. Using realloc is
no problem if the buffer is always freed in the same function.
> Or, possibly better, one could choose which allocation strategy to use here
> depending on whether the data was alligned or not.
I mentioned step 1, because there would be a step 2 (but maybe only
numpy 2): extending the C api functions to create arrays such as
asking for an explicit alignment is possible.
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


2008/5/5 Robert Kern < [hidden email]>:
> On Mon, May 5, 2008 at 7:44 AM, David Cournapeau
> < [hidden email]> wrote:
>
> > In numpy, we can always replace realloc by malloc/free, because we know
> > the size of the old block: would deprecating PyMemData_RENEW and
> > replacing them by PyMemeData_NEW/PyMemData_FREE be possible, such as to
> > make all numpy arrays follow a default alignement ? There are only a few
> > of them in numpy (6 of them), 0 in scipy, and I guess extensions never
> > really used them ?
>
> I am in favor of at least trying this out. We will have to have a set
> of benchmarks to make sure we haven't hurt the current uses of
> PyMemData_RENEW which Tim points out.
realloc() may not be a performance win anyway. Allocating new memory
is quite fast, copying data is quite fast, and inplace realloc()
tends to cause memory fragmentation  take an arena full of 1024byte
blocks and suddenly make one of them 256 bytes long and you haven't
gained anything at all in most malloc() implementations. So yes,
benchmarks.
Anne
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Tue, May 6, 2008 at 2:11 AM, Robert Kern < [hidden email]> wrote:
>
> I am in favor of at least trying this out. We will have to have a set
> of benchmarks to make sure we haven't hurt the current uses of
> PyMemData_RENEW which Tim points out.
What would be a good stress test for PyArray_FromIter ? I think using
array iterator was the main hotspot for reading large arff files in
scipy.io.arff, would that be enough ?
cheers,
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


On Mon, May 5, 2008 at 1:30 PM, David Cournapeau < [hidden email]> wrote:
> On Tue, May 6, 2008 at 2:11 AM, Robert Kern < [hidden email]> wrote:
> >
> > I am in favor of at least trying this out. We will have to have a set
> > of benchmarks to make sure we haven't hurt the current uses of
> > PyMemData_RENEW which Tim points out.
>
> What would be a good stress test for PyArray_FromIter ? I think using
> array iterator was the main hotspot for reading large arff files in
> scipy.io.arff, would that be enough ?
Since there are only 6 places where PyMemData_RENEW is used, all 6
uses should be benchmarked. I would prefer a more targeted benchmark
so we know exactly what we are measuring.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 Umberto Eco
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


Robert Kern wrote:
>
> Since there are only 6 places where PyMemData_RENEW is used, all 6
> uses should be benchmarked. I would prefer a more targeted benchmark
> so we know exactly what we are measuring.
>
Ok, I started a new branch for aligned allocator:
http://projects.scipy.org/scipy/numpy/browser/branches/aligned_allocaFor now, I've only commited:
 a new bunch of tests for all functions using PyDataMem_RENEW (in
numpy/core/tests/test_renew.py)
 If NOUSE_PYDATAMEM_RENEW is defined, PyDataMEM_RENEW is not used:
instead, SYS_REALLOC is used in loops (this is defined to realloc), and
when outside a loop, _fake_realloc is used, which does not use realloc,
but use PyDataMem_NEW/FREE.
I got a small slowdown, but I am not sure it is really relevant: the
figures change almost as much between several runs as between runs of
different implementations. I am not convinced about the quality of my
tests either.
cheers,
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


2008/5/5 David Cournapeau < [hidden email]>:
> Basically, what I have in mind is, in a first step (for numpy 1.2):
>  define functions to allocate on a given alignement
>  make PyMemData_NEW 16 byte aligned by default (to be compatible
> with SSE and co).
>
> The problem was, and still is, realloc. It is not possible to implement
> realloc with malloc/free (in a portable way), and as such, it is not
> possible to have an aligned realloc.
How much does this matter? I mean, what if you simply left all the
reallocs asis? The arrays that resulted from reallocs would not
typically be aligned, but we cannot in any case expect all arrays to
be aligned.
Or, actually, depending on how you manage the alignment, it may be
possible to write a portable aligned realloc. As I understand it, a
portable aligned malloc allocates extra memory, then adds something to
the pointer to make the memory arena aligned. free() must then recover
the original pointer and call free() on it  though this can
presumably be avoided if garbage collection is smart enough. realloc()
also needs to recover the pointer to call the underlying realloc().
One answer is to ensure that at least one byte of padding is always
done at the beginning, so that the number of pad bytes used on an
aligned pointer p is accessible as ((uint8*)p)[1]. Other schemes are
available; I have a peculiar affection for the purepython solution of
constructing a view. In any of them, it seems to me, if free() can
recover the original pointer, so can realloc()...
I don't think any numpy function can assume that its input arrays are
aligned, since users may like to pass, say, mmap()ed hunks of a file,
over which numpy has no control of the alignment. In this case, then,
how much of a problem is it if a few stray realloc()s produce
nonaligned arrays?
Anne
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion


Anne Archibald wrote:
>
> How much does this matter? I mean, what if you simply left all the
> reallocs asis? The arrays that resulted from reallocs would not
> typically be aligned, but we cannot in any case expect all arrays to
> be aligned.
The problem would be the interaction between the aligned allocator and
realloc. If we do it by "ourselves", then it is no problem, but we
cannot do it if we rely on posix_memalign/memalign. I was not precise
enough when I mentioned portability: I meant that it was not possible to
implement realloc in terms of malloc/free portably, and as such, it is
not possible to implement an aligned realloc with memalign/free. Also, I
have not found any indication that the pointer given by posix_memalign
can be fed to realloc: depending on the implementation, it cannot be
freed...
Maybe the solution is to never use posix_memalign...
> As I understand it, a
> portable aligned malloc allocates extra memory, then adds something to
> the pointer to make the memory arena aligned.
Yes, that's how it works in fftw, and the implementation I have for
numpy is derived from fftw code.
> I don't think any numpy function can assume that its input arrays are
> aligned, since users may like to pass, say, mmap()ed hunks of a file,
> over which numpy has no control of the alignment. In this case, then,
> how much of a problem is it if a few stray realloc()s produce
> nonaligned arrays?
There are a few realloc, but they are used in core routines
(PyArray_FromIter, PyArray_Resize). The point is to get aligned arrays
most of the time, and avoiding losing alignement unless it is too
complicated to do otherwise.
cheers,
David
_______________________________________________
Numpydiscussion mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/numpydiscussion

