Byte aligned arrays

classic Classic list List threaded Threaded
39 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Byte aligned arrays

Henry Gomersall-2
I've written a few simple cython routines for assisting in creating
byte-aligned numpy arrays. The point being for the arrays to work with
SSE/AVX code.

https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi

The change recently has been to add a check on the CPU as to what flags
are supported (though it's not complete, I should make the default
return 0 or something).

It occurred to me that this is something that (a) other people almost
certainly need and are solving themselves and (b) I lack the necessary
platforms to test all the possible CPU/OS combinations to make sure
something sensible happens in all cases.

Is this something that can be rolled into Numpy (the feature, not my
particular implementation or interface - though I'd be happy for it to
be so)?

Regarding (b), I've written a test case that works for Linux on x86-64
with GCC (my platform!). I can test it on 32-bit windows, but that's it.
Is ARM supported by Numpy? Neon would be great to include as well. What
other platforms might need this?

Cheers,

Henry

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Nathaniel Smith
On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall <[hidden email]> wrote:

> I've written a few simple cython routines for assisting in creating
> byte-aligned numpy arrays. The point being for the arrays to work with
> SSE/AVX code.
>
> https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>
> The change recently has been to add a check on the CPU as to what flags
> are supported (though it's not complete, I should make the default
> return 0 or something).
>
> It occurred to me that this is something that (a) other people almost
> certainly need and are solving themselves and (b) I lack the necessary
> platforms to test all the possible CPU/OS combinations to make sure
> something sensible happens in all cases.
>
> Is this something that can be rolled into Numpy (the feature, not my
> particular implementation or interface - though I'd be happy for it to
> be so)?
>
> Regarding (b), I've written a test case that works for Linux on x86-64
> with GCC (my platform!). I can test it on 32-bit windows, but that's it.
> Is ARM supported by Numpy? Neon would be great to include as well. What
> other platforms might need this?

Your code looks simple and portable to me (at least the alignment
part). I can see a good argument for adding this sort of functionality
directly to numpy with a nice interface, though, since these kind of
requirements seem quite common these days. Maybe an interface like
  a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
bits or in bytes?
  b = np.empty((10, 10), order="C", base_alignment=32)
  # etc.
  assert a.base_alignment == 32
which underneath tries to use posix_memalign/_aligned_malloc when
possible, or falls back on the overallocation trick otherwise?

-n
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Charles R Harris


On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith <[hidden email]> wrote:
On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall <[hidden email]> wrote:
> I've written a few simple cython routines for assisting in creating
> byte-aligned numpy arrays. The point being for the arrays to work with
> SSE/AVX code.
>
> https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>
> The change recently has been to add a check on the CPU as to what flags
> are supported (though it's not complete, I should make the default
> return 0 or something).
>
> It occurred to me that this is something that (a) other people almost
> certainly need and are solving themselves and (b) I lack the necessary
> platforms to test all the possible CPU/OS combinations to make sure
> something sensible happens in all cases.
>
> Is this something that can be rolled into Numpy (the feature, not my
> particular implementation or interface - though I'd be happy for it to
> be so)?
>
> Regarding (b), I've written a test case that works for Linux on x86-64
> with GCC (my platform!). I can test it on 32-bit windows, but that's it.
> Is ARM supported by Numpy? Neon would be great to include as well. What
> other platforms might need this?

Your code looks simple and portable to me (at least the alignment
part). I can see a good argument for adding this sort of functionality
directly to numpy with a nice interface, though, since these kind of
requirements seem quite common these days. Maybe an interface like
  a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
bits or in bytes?
  b = np.empty((10, 10), order="C", base_alignment=32)
  # etc.
  assert a.base_alignment == 32
which underneath tries to use posix_memalign/_aligned_malloc when
possible, or falls back on the overallocation trick otherwise?


There is a thread about this from several years back. IIRC, David Cournapeau was interested in the same problem. At first glance, the alignment keyword looks interesting. One possible concern is keeping alignment for rows, views, etc., which is probably not possible in any sensible way. But people who need this most likely know what they are doing and just need memory allocated on the proper boundary.

Chuck


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Nathaniel Smith
On Wed, Dec 19, 2012 at 2:57 PM, Charles R Harris
<[hidden email]> wrote:

>
>
> On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith <[hidden email]> wrote:
>>
>> On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall <[hidden email]> wrote:
>> > I've written a few simple cython routines for assisting in creating
>> > byte-aligned numpy arrays. The point being for the arrays to work with
>> > SSE/AVX code.
>> >
>> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>> >
>> > The change recently has been to add a check on the CPU as to what flags
>> > are supported (though it's not complete, I should make the default
>> > return 0 or something).
>> >
>> > It occurred to me that this is something that (a) other people almost
>> > certainly need and are solving themselves and (b) I lack the necessary
>> > platforms to test all the possible CPU/OS combinations to make sure
>> > something sensible happens in all cases.
>> >
>> > Is this something that can be rolled into Numpy (the feature, not my
>> > particular implementation or interface - though I'd be happy for it to
>> > be so)?
>> >
>> > Regarding (b), I've written a test case that works for Linux on x86-64
>> > with GCC (my platform!). I can test it on 32-bit windows, but that's it.
>> > Is ARM supported by Numpy? Neon would be great to include as well. What
>> > other platforms might need this?
>>
>> Your code looks simple and portable to me (at least the alignment
>> part). I can see a good argument for adding this sort of functionality
>> directly to numpy with a nice interface, though, since these kind of
>> requirements seem quite common these days. Maybe an interface like
>>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
>> bits or in bytes?
>>   b = np.empty((10, 10), order="C", base_alignment=32)
>>   # etc.
>>   assert a.base_alignment == 32
>> which underneath tries to use posix_memalign/_aligned_malloc when
>> possible, or falls back on the overallocation trick otherwise?
>>
>
> There is a thread about this from several years back. IIRC, David Cournapeau
> was interested in the same problem. At first glance, the alignment keyword
> looks interesting. One possible concern is keeping alignment for rows,
> views, etc., which is probably not possible in any sensible way. But people
> who need this most likely know what they are doing and just need memory
> allocated on the proper boundary.

Right, my intuition is that it's like order="C" -- if you make a new
array by, say, indexing, then it may or may not have order="C", no
guarantees. So when you care, you call asarray(a, order="C") and that
either makes a copy or not as needed. Similarly for base alignment.

I guess to push this analogy even further we could define a set of
array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
alignment matters, I think, so the number of flags would remain
manageable?) That would make the C API easier to deal with too, no
need to add PyArray_FromAnyAligned.

-n
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Charles R Harris


On Wed, Dec 19, 2012 at 8:10 AM, Nathaniel Smith <[hidden email]> wrote:
On Wed, Dec 19, 2012 at 2:57 PM, Charles R Harris
<[hidden email]> wrote:
>
>
> On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith <[hidden email]> wrote:
>>
>> On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall <[hidden email]> wrote:
>> > I've written a few simple cython routines for assisting in creating
>> > byte-aligned numpy arrays. The point being for the arrays to work with
>> > SSE/AVX code.
>> >
>> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>> >
>> > The change recently has been to add a check on the CPU as to what flags
>> > are supported (though it's not complete, I should make the default
>> > return 0 or something).
>> >
>> > It occurred to me that this is something that (a) other people almost
>> > certainly need and are solving themselves and (b) I lack the necessary
>> > platforms to test all the possible CPU/OS combinations to make sure
>> > something sensible happens in all cases.
>> >
>> > Is this something that can be rolled into Numpy (the feature, not my
>> > particular implementation or interface - though I'd be happy for it to
>> > be so)?
>> >
>> > Regarding (b), I've written a test case that works for Linux on x86-64
>> > with GCC (my platform!). I can test it on 32-bit windows, but that's it.
>> > Is ARM supported by Numpy? Neon would be great to include as well. What
>> > other platforms might need this?
>>
>> Your code looks simple and portable to me (at least the alignment
>> part). I can see a good argument for adding this sort of functionality
>> directly to numpy with a nice interface, though, since these kind of
>> requirements seem quite common these days. Maybe an interface like
>>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
>> bits or in bytes?
>>   b = np.empty((10, 10), order="C", base_alignment=32)
>>   # etc.
>>   assert a.base_alignment == 32
>> which underneath tries to use posix_memalign/_aligned_malloc when
>> possible, or falls back on the overallocation trick otherwise?
>>
>
> There is a thread about this from several years back. IIRC, David Cournapeau
> was interested in the same problem. At first glance, the alignment keyword
> looks interesting. One possible concern is keeping alignment for rows,
> views, etc., which is probably not possible in any sensible way. But people
> who need this most likely know what they are doing and just need memory
> allocated on the proper boundary.

Right, my intuition is that it's like order="C" -- if you make a new
array by, say, indexing, then it may or may not have order="C", no
guarantees. So when you care, you call asarray(a, order="C") and that
either makes a copy or not as needed. Similarly for base alignment.

I guess to push this analogy even further we could define a set of
array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
alignment matters, I think, so the number of flags would remain
manageable?) That would make the C API easier to deal with too, no
need to add PyArray_FromAnyAligned.


Another possibility is an aligned datatype, basically an aligned structured array with floats/ints in chunks of the appropriate size. IIRC, gcc support for sse is something like that.

Chuck


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Nathaniel Smith
On Wed, Dec 19, 2012 at 3:27 PM, Charles R Harris
<[hidden email]> wrote:

>
>
> On Wed, Dec 19, 2012 at 8:10 AM, Nathaniel Smith <[hidden email]> wrote:
>> Right, my intuition is that it's like order="C" -- if you make a new
>> array by, say, indexing, then it may or may not have order="C", no
>> guarantees. So when you care, you call asarray(a, order="C") and that
>> either makes a copy or not as needed. Similarly for base alignment.
>>
>> I guess to push this analogy even further we could define a set of
>> array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
>> alignment matters, I think, so the number of flags would remain
>> manageable?) That would make the C API easier to deal with too, no
>> need to add PyArray_FromAnyAligned.
>>
>
> Another possibility is an aligned datatype, basically an aligned structured
> array with floats/ints in chunks of the appropriate size. IIRC, gcc support
> for sse is something like that.

True; right now it looks like structured dtypes have no special alignment:

In [13]: np.dtype("f4,f4").alignment
Out[13]: 1

So for this approach we'd need a way to create structured dtypes with
.alignment == .itemsize, and we'd need some way to request
dtype-aligned memory from array allocation functions. I guess existing
NPY_ALIGNED is a good enough public interface for the latter, but
AFAICT the current implementation is to just assume that whatever
malloc() returns will always be ALIGNED. This is true for all base C
types, but not for more exotic record types with larger alignment
requirements -- that would require some fancier allocation scheme.

Not sure which interface is more useful to users. On the one hand,
using funny dtypes makes regular non-SIMD access more cumbersome, and
it forces your array size to be a multiple of the SIMD word size,
which might be inconvenient if your code is smart enough to handle
arbitrary-sized arrays with partial SIMD acceleration (i.e., using
SIMD for most of the array, and then a slow path to handle any partial
word at the end). OTOH, if your code *is* that smart, you should
probably just make it smart enough to handle a partial word at the
beginning as well and then you won't need any special alignment in the
first place, and representing each SIMD word as a single numpy scalar
is an intuitively appealing model of how SIMD works. OTOOH, just
adding a single argument np.array() is a much simpler to explain than
some elaborate scheme involving the creation of special custom dtypes.

-n
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:

> Not sure which interface is more useful to users. On the one hand,
> using funny dtypes makes regular non-SIMD access more cumbersome, and
> it forces your array size to be a multiple of the SIMD word size,
> which might be inconvenient if your code is smart enough to handle
> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
> SIMD for most of the array, and then a slow path to handle any partial
> word at the end). OTOH, if your code *is* that smart, you should
> probably just make it smart enough to handle a partial word at the
> beginning as well and then you won't need any special alignment in the
> first place, and representing each SIMD word as a single numpy scalar
> is an intuitively appealing model of how SIMD works. OTOOH, just
> adding a single argument np.array() is a much simpler to explain than
> some elaborate scheme involving the creation of special custom dtypes.

If it helps, my use-case is in wrapping the FFTW library. This _is_
smart enough to deal with unaligned arrays, but it just results in a
performance penalty. In the case of an FFT, there are clearly going to
be issues with the powers of two indices in the array not lying on a
suitable n-byte boundary (which would be the case with a misaligned
array), but I imagine it's not unique.

The other point is that it's easy to create a suitable power of two
array that should always bypass any special case unaligned code (e.g.
with floats, any multiple of 4 array length will fill every 16-byte
word).

Finally, I think there is significant value in auto-aligning the array
based on an appropriate inspection of the cpu capabilities (or
alternatively, a function that reports back the appropriate SIMD
alignment). Again, this makes it easier to wrap libraries that may
function with any alignment, but benefit from optimum alignment.

Cheers,

Henry

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Francesc Alted-4
On 12/19/12 5:47 PM, Henry Gomersall wrote:

> On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
>> Not sure which interface is more useful to users. On the one hand,
>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>> it forces your array size to be a multiple of the SIMD word size,
>> which might be inconvenient if your code is smart enough to handle
>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>> SIMD for most of the array, and then a slow path to handle any partial
>> word at the end). OTOH, if your code *is* that smart, you should
>> probably just make it smart enough to handle a partial word at the
>> beginning as well and then you won't need any special alignment in the
>> first place, and representing each SIMD word as a single numpy scalar
>> is an intuitively appealing model of how SIMD works. OTOOH, just
>> adding a single argument np.array() is a much simpler to explain than
>> some elaborate scheme involving the creation of special custom dtypes.
> If it helps, my use-case is in wrapping the FFTW library. This _is_
> smart enough to deal with unaligned arrays, but it just results in a
> performance penalty. In the case of an FFT, there are clearly going to
> be issues with the powers of two indices in the array not lying on a
> suitable n-byte boundary (which would be the case with a misaligned
> array), but I imagine it's not unique.
>
> The other point is that it's easy to create a suitable power of two
> array that should always bypass any special case unaligned code (e.g.
> with floats, any multiple of 4 array length will fill every 16-byte
> word).
>
> Finally, I think there is significant value in auto-aligning the array
> based on an appropriate inspection of the cpu capabilities (or
> alternatively, a function that reports back the appropriate SIMD
> alignment). Again, this makes it easier to wrap libraries that may
> function with any alignment, but benefit from optimum alignment.

Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
systems (Linux and Mac OSX):

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b97b60, size 8, offset 0 at 0x102e7c130>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102ba64e0, size 8, offset 0 at 0x102e7c430>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b86700, size 8, offset 0 at 0x102e7c5b0>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x102b981d0, size 8, offset 0 at 0x102e7c5f0>

[Check that the last digit in the addresses above is always 0]

The only scenario that I see that this would create unaligned arrays is
for machines having AVX.  But provided that the Intel architecture is
making great strides in fetching unaligned data, I'd be surprised that
the difference in performance would be even noticeable.

Can you tell us which difference in performance are you seeing for an
AVX-aligned array and other that is not AVX-aligned?  Just curious.

--
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
<snip>

> > Finally, I think there is significant value in auto-aligning the
> array
> > based on an appropriate inspection of the cpu capabilities (or
> > alternatively, a function that reports back the appropriate SIMD
> > alignment). Again, this makes it easier to wrap libraries that may
> > function with any alignment, but benefit from optimum alignment.
>
> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes
> on
> systems (Linux and Mac OSX):
<snip>

That is not true at least under Windows 32-bit. I think also it's not
true for Linux 32-bit from my vague recollections of testing in a
virtual machine. (disclaimer: both those statements _may_ be out of
date).

But yes, under Linux 64-bit I always get my arrays aligned to 16 bytes.

>
> The only scenario that I see that this would create unaligned arrays
> is
> for machines having AVX.  But provided that the Intel architecture is
> making great strides in fetching unaligned data, I'd be surprised
> that
> the difference in performance would be even noticeable.
>
> Can you tell us which difference in performance are you seeing for an
> AVX-aligned array and other that is not AVX-aligned?  Just curious.

I don't know; I don't own a machine with AVX ;)

It might be that the difference is negligible, though I do think it
would be _nice_ to have the arrays properly aligned if it's not too
difficult.

Cheers,

Henry

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Nathaniel Smith
On Wed, Dec 19, 2012 at 6:25 PM, Henry Gomersall <[hidden email]> wrote:

> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
> <snip>
>> > Finally, I think there is significant value in auto-aligning the
>> array
>> > based on an appropriate inspection of the cpu capabilities (or
>> > alternatively, a function that reports back the appropriate SIMD
>> > alignment). Again, this makes it easier to wrap libraries that may
>> > function with any alignment, but benefit from optimum alignment.
>>
>> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes
>> on
>> systems (Linux and Mac OSX):
> <snip>
>
> That is not true at least under Windows 32-bit. I think also it's not
> true for Linux 32-bit from my vague recollections of testing in a
> virtual machine. (disclaimer: both those statements _may_ be out of
> date).
>
> But yes, under Linux 64-bit I always get my arrays aligned to 16 bytes.

Currently numpy just uses whatever the system malloc() returns, so the
alignment guarantees are entirely determined by your libc.

-n
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

David Cournapeau
In reply to this post by Francesc Alted-4
On Wed, Dec 19, 2012 at 6:03 PM, Francesc Alted <[hidden email]> wrote:

> On 12/19/12 5:47 PM, Henry Gomersall wrote:
>> On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
>>> Not sure which interface is more useful to users. On the one hand,
>>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>>> it forces your array size to be a multiple of the SIMD word size,
>>> which might be inconvenient if your code is smart enough to handle
>>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>>> SIMD for most of the array, and then a slow path to handle any partial
>>> word at the end). OTOH, if your code *is* that smart, you should
>>> probably just make it smart enough to handle a partial word at the
>>> beginning as well and then you won't need any special alignment in the
>>> first place, and representing each SIMD word as a single numpy scalar
>>> is an intuitively appealing model of how SIMD works. OTOOH, just
>>> adding a single argument np.array() is a much simpler to explain than
>>> some elaborate scheme involving the creation of special custom dtypes.
>> If it helps, my use-case is in wrapping the FFTW library. This _is_
>> smart enough to deal with unaligned arrays, but it just results in a
>> performance penalty. In the case of an FFT, there are clearly going to
>> be issues with the powers of two indices in the array not lying on a
>> suitable n-byte boundary (which would be the case with a misaligned
>> array), but I imagine it's not unique.
>>
>> The other point is that it's easy to create a suitable power of two
>> array that should always bypass any special case unaligned code (e.g.
>> with floats, any multiple of 4 array length will fill every 16-byte
>> word).
>>
>> Finally, I think there is significant value in auto-aligning the array
>> based on an appropriate inspection of the cpu capabilities (or
>> alternatively, a function that reports back the appropriate SIMD
>> alignment). Again, this makes it easier to wrap libraries that may
>> function with any alignment, but benefit from optimum alignment.
>
> Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on
> systems (Linux and Mac OSX):

Only by accident, at least on linux. The pointers returned by the  gnu
libc malloc are at least 8 bytes aligned, but they may not be 16 bytes
when you're above the threshold where mmap is used for malloc.

The difference between aligned and unaligned ram <-> sse registers
(e.g. movaps, movups) used to be significant. Don't know if that's
still the case for recent CPUs.

David
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
In reply to this post by Nathaniel Smith
On Wed, 2012-12-19 at 15:10 +0000, Nathaniel Smith wrote:
<snip>

> >> > Is this something that can be rolled into Numpy (the feature, not
> my
> >> > particular implementation or interface - though I'd be happy for
> it to
> >> > be so)?
> >> >
> >> > Regarding (b), I've written a test case that works for Linux on
> x86-64
> >> > with GCC (my platform!). I can test it on 32-bit windows, but
> that's it.
> >> > Is ARM supported by Numpy? Neon would be great to include as
> well. What
> >> > other platforms might need this?
> >>
> >> Your code looks simple and portable to me (at least the alignment
> >> part). I can see a good argument for adding this sort of
> functionality
> >> directly to numpy with a nice interface, though, since these kind
> of
> >> requirements seem quite common these days. Maybe an interface like
> >>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
> >> bits or in bytes?
> >>   b = np.empty((10, 10), order="C", base_alignment=32)
> >>   # etc.
> >>   assert a.base_alignment == 32
> >> which underneath tries to use posix_memalign/_aligned_malloc when
> >> possible, or falls back on the overallocation trick otherwise?
> >>
> >
> > There is a thread about this from several years back. IIRC, David
> Cournapeau
> > was interested in the same problem. At first glance, the alignment
> keyword
> > looks interesting. One possible concern is keeping alignment for
> rows,
> > views, etc., which is probably not possible in any sensible way. But
> people
> > who need this most likely know what they are doing and just need
> memory
> > allocated on the proper boundary.
>
> Right, my intuition is that it's like order="C" -- if you make a new
> array by, say, indexing, then it may or may not have order="C", no
> guarantees. So when you care, you call asarray(a, order="C") and that
> either makes a copy or not as needed. Similarly for base alignment.
>
> I guess to push this analogy even further we could define a set of
> array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
> alignment matters, I think, so the number of flags would remain
> manageable?) That would make the C API easier to deal with too, no
> need to add PyArray_FromAnyAligned.

So, if I were to implement this, I presume the proper way would be
through modifications to multiarray?

Would this basic description be a reasonable initial target?

Henry


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
On Thu, 2012-12-20 at 08:12 +0000, Henry Gomersall wrote:

> On Wed, 2012-12-19 at 15:10 +0000, Nathaniel Smith wrote:
> <snip>
> > >> > Is this something that can be rolled into Numpy (the feature,
> not
> > my
> > >> > particular implementation or interface - though I'd be happy
> for
> > it to
> > >> > be so)?
> > >> >
> > >> > Regarding (b), I've written a test case that works for Linux on
> > x86-64
> > >> > with GCC (my platform!). I can test it on 32-bit windows, but
> > that's it.
> > >> > Is ARM supported by Numpy? Neon would be great to include as
> > well. What
> > >> > other platforms might need this?
> > >>
> > >> Your code looks simple and portable to me (at least the alignment
> > >> part). I can see a good argument for adding this sort of
> > functionality
> > >> directly to numpy with a nice interface, though, since these kind
> > of
> > >> requirements seem quite common these days. Maybe an interface
> like
> > >>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be
> in
> > >> bits or in bytes?
> > >>   b = np.empty((10, 10), order="C", base_alignment=32)
> > >>   # etc.
> > >>   assert a.base_alignment == 32
> > >> which underneath tries to use posix_memalign/_aligned_malloc when
> > >> possible, or falls back on the overallocation trick otherwise?
> > >>
> > >
> > > There is a thread about this from several years back. IIRC, David
> > Cournapeau
> > > was interested in the same problem. At first glance, the alignment
> > keyword
> > > looks interesting. One possible concern is keeping alignment for
> > rows,
> > > views, etc., which is probably not possible in any sensible way.
> But
> > people
> > > who need this most likely know what they are doing and just need
> > memory
> > > allocated on the proper boundary.
> >
> > Right, my intuition is that it's like order="C" -- if you make a new
> > array by, say, indexing, then it may or may not have order="C", no
> > guarantees. So when you care, you call asarray(a, order="C") and
> that
> > either makes a copy or not as needed. Similarly for base alignment.
> >
> > I guess to push this analogy even further we could define a set of
> > array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only
> power-of-2
> > alignment matters, I think, so the number of flags would remain
> > manageable?) That would make the C API easier to deal with too, no
> > need to add PyArray_FromAnyAligned.
>
> So, if I were to implement this, I presume the proper way would be
> through modifications to multiarray?
>
> Would this basic description be a reasonable initial target?

There is this patch:

http://projects.scipy.org/numpy/attachment/ticket/568/aligned_v1.patch

Which, rather amusingly to me, was written by Steven Johnson of FFTW. It
looks like a good starting point.

Cheers,

Henry


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
In reply to this post by Francesc Alted-4
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
> The only scenario that I see that this would create unaligned arrays
> is
> for machines having AVX.  But provided that the Intel architecture is
> making great strides in fetching unaligned data, I'd be surprised
> that
> the difference in performance would be even noticeable.
>
> Can you tell us which difference in performance are you seeing for an
> AVX-aligned array and other that is not AVX-aligned?  Just curious.

Further to this point, from an Intel article...

http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors

"Aligning data to vector length is always recommended. When using Intel
SSE and Intel SSE2 instructions, loaded data should be aligned to 16
bytes. Similarly, to achieve best results use Intel AVX instructions on
32-byte vectors that are 32-byte aligned. The use of Intel AVX
instructions on unaligned 32-byte vectors means that every second load
will be across a cache-line split, since the cache line is 64 bytes.
This doubles the cache line split rate compared to Intel SSE code that
uses 16-byte vectors. A high cache-line split rate in memory-intensive
code is extremely likely to cause performance degradation. For that
reason, it is highly recommended to align the data to 32 bytes for use
with Intel AVX."

Though it would be nice to put together a little example of this!

Henry


_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Nathaniel Smith
In reply to this post by Henry Gomersall-2
On Thu, Dec 20, 2012 at 8:12 AM, Henry Gomersall <[hidden email]> wrote:

> On Wed, 2012-12-19 at 15:10 +0000, Nathaniel Smith wrote:
> <snip>
>> >> > Is this something that can be rolled into Numpy (the feature, not
>> my
>> >> > particular implementation or interface - though I'd be happy for
>> it to
>> >> > be so)?
>> >> >
>> >> > Regarding (b), I've written a test case that works for Linux on
>> x86-64
>> >> > with GCC (my platform!). I can test it on 32-bit windows, but
>> that's it.
>> >> > Is ARM supported by Numpy? Neon would be great to include as
>> well. What
>> >> > other platforms might need this?
>> >>
>> >> Your code looks simple and portable to me (at least the alignment
>> >> part). I can see a good argument for adding this sort of
>> functionality
>> >> directly to numpy with a nice interface, though, since these kind
>> of
>> >> requirements seem quite common these days. Maybe an interface like
>> >>   a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in
>> >> bits or in bytes?
>> >>   b = np.empty((10, 10), order="C", base_alignment=32)
>> >>   # etc.
>> >>   assert a.base_alignment == 32
>> >> which underneath tries to use posix_memalign/_aligned_malloc when
>> >> possible, or falls back on the overallocation trick otherwise?
>> >>
>> >
>> > There is a thread about this from several years back. IIRC, David
>> Cournapeau
>> > was interested in the same problem. At first glance, the alignment
>> keyword
>> > looks interesting. One possible concern is keeping alignment for
>> rows,
>> > views, etc., which is probably not possible in any sensible way. But
>> people
>> > who need this most likely know what they are doing and just need
>> memory
>> > allocated on the proper boundary.
>>
>> Right, my intuition is that it's like order="C" -- if you make a new
>> array by, say, indexing, then it may or may not have order="C", no
>> guarantees. So when you care, you call asarray(a, order="C") and that
>> either makes a copy or not as needed. Similarly for base alignment.
>>
>> I guess to push this analogy even further we could define a set of
>> array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2
>> alignment matters, I think, so the number of flags would remain
>> manageable?) That would make the C API easier to deal with too, no
>> need to add PyArray_FromAnyAligned.
>
> So, if I were to implement this, I presume the proper way would be
> through modifications to multiarray?

Yes, numpy/core/src/multiarray/ is the code you'd be modifying.

> Would this basic description be a reasonable initial target?

My feeling is that we're at the stage where we need to get more
information and feedback from people with experience in this area
before we'll be able to merge anything into numpy proper (since that
implies nailing down and committing to an API). One way to get that
might be to go ahead and implement something to experiment with, and
this "basic description" does seem like one of the plausible options,
so... yes it seems like a reasonable initial target to work on, but I
don't want to mislead you into thinking that I think it would
necessarily be a reasonable initial target to ship in 1.8 or whatever.
I feel like I don't have enough information to make a judgement there.

There must be other people working with SIMD and numpy, right? If
you're interested in this problem, another thing that might help would
be to spend some effort finding those people and convincing them to
get involved in discussing what they need.

-n
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Francesc Alted-4
In reply to this post by Henry Gomersall-2
On 12/20/12 9:53 AM, Henry Gomersall wrote:

> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>> The only scenario that I see that this would create unaligned arrays
>> is
>> for machines having AVX.  But provided that the Intel architecture is
>> making great strides in fetching unaligned data, I'd be surprised
>> that
>> the difference in performance would be even noticeable.
>>
>> Can you tell us which difference in performance are you seeing for an
>> AVX-aligned array and other that is not AVX-aligned?  Just curious.
> Further to this point, from an Intel article...
>
> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>
> "Aligning data to vector length is always recommended. When using Intel
> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
> bytes. Similarly, to achieve best results use Intel AVX instructions on
> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
> instructions on unaligned 32-byte vectors means that every second load
> will be across a cache-line split, since the cache line is 64 bytes.
> This doubles the cache line split rate compared to Intel SSE code that
> uses 16-byte vectors. A high cache-line split rate in memory-intensive
> code is extremely likely to cause performance degradation. For that
> reason, it is highly recommended to align the data to 32 bytes for use
> with Intel AVX."
>
> Though it would be nice to put together a little example of this!

Indeed, an example is what I was looking for.  So provided that I have
access to an AVX capable machine (having 6 physical cores), and that MKL
10.3 has support for AVX, I have made some comparisons using the
Anaconda Python distribution (it ships with most packages linked against
MKL 10.3).

Here it is a first example using a DGEMM operation.  First using a NumPy
that is not turbo-loaded with MKL:

In [34]: a = np.linspace(0,1,1e7)

In [35]: b = a.reshape(1000, 10000)

In [36]: c = a.reshape(10000, 1000)

In [37]: time d = np.dot(b,c)
CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
Wall time: 7.63 s

In [38]: time d = np.dot(c,b)
CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
Wall time: 78.89 s

This is getting around 2.6 GFlop/s.  Now, with a MKL 10.3 NumPy and
AVX-unaligned data:

In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
Out[7]: '0x7fcdef3b4010'  # 16 bytes alignment

In [8]: a = np.ndarray(1e7, "f8", p)

In [9]: a[:] = np.linspace(0,1,1e7)

In [10]: b = a.reshape(1000, 10000)

In [11]: c = a.reshape(10000, 1000)

In [37]: %timeit d = np.dot(b,c)
10 loops, best of 3: 164 ms per loop

In [38]: %timeit d = np.dot(c,b)
1 loops, best of 3: 1.65 s per loop

That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).

Now, using MKL 10.3 and AVX-aligned data:

In [21]: p2 = ctypes.create_string_buffer(int(8e7+16));
hex(ctypes.addressof(p))
Out[21]: '0x7f8cb9598010'

In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes
(now is 32-bytes aligned)

In [23]: a2[:] = np.linspace(0,1,1e7)

In [24]: b2 = a2.reshape(1000, 10000)

In [25]: c2 = a2.reshape(10000, 1000)

In [35]: %timeit d2 = np.dot(b2,c2)
10 loops, best of 3: 163 ms per loop

In [36]: %timeit d2 = np.dot(c2,b2)
1 loops, best of 3: 1.67 s per loop

So, again, around 120 GFlop/s, and the difference wrt to unaligned AVX
data is negligible.

One may argue that DGEMM is CPU-bounded and that memory access plays
little role here, and that is certainly true.  So, let's go with a more
memory-bounded problem, like computing a transcendental function with
numexpr.  First with a with NumPy and numexpr with no MKL support:

In [8]: a = np.linspace(0,1,1e8)

In [9]: %time b = np.sin(a)
CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
Wall time: 1.42 s

In [10]: import numexpr as ne

In [12]: %time b = ne.evaluate("sin(a)")
CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
Wall time: 0.37 s

This time is around 4x faster than regular 'sin' in libc, and about the
same speed than a memcpy():

In [13]: %time c = a.copy()
CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
Wall time: 0.39 s

Now, with a MKL-aware numexpr and non-AVX alignment:

In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
Out[8]: '0x7fce435da010'  # 16 bytes alignment

In [9]: a = np.ndarray(1e8, "f8", p)

In [10]: a[:] = np.linspace(0,1,1e8)

In [11]: %time b = ne.evaluate("sin(a)")
CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
Wall time: 0.15 s

That is, more than 2x faster than a memcpy() in this system, meaning
that the problem is truly memory-bounded.  So now, with an AVX aligned
buffer:

In [14]: a2 = a[2:]  # skip the first 16 bytes

In [15]: %time b = ne.evaluate("sin(a2)")
CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
Wall time: 0.16 s

Again, times are very close.  Just to make sure, let's use the timeit magic:

In [16]: %timeit b = ne.evaluate("sin(a)")
10 loops, best of 3: 159 ms per loop   # unaligned

In [17]: %timeit b = ne.evaluate("sin(a2)")
10 loops, best of 3: 154 ms per loop   # aligned

All in all, it is not clear that AVX alignment would have an advantage,
even for memory-bounded problems.  But of course, if Intel people are
saying that AVX alignment is important is because they have use cases
for asserting this.  It is just that I'm having a difficult time to find
these cases.

--
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Sturla Molden-2
In reply to this post by Henry Gomersall-2
On 19.12.2012 09:40, Henry Gomersall wrote:
> I've written a few simple cython routines for assisting in creating
> byte-aligned numpy arrays. The point being for the arrays to work with
> SSE/AVX code.
>
> https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi

Why use Cython?

http://mail.scipy.org/pipermail/scipy-user/2009-March/020289.html


def aligned_zeros(shape, boundary=16, dtype=float, order='C'):
      N = np.prod(shape)
      d = np.dtype(dtype)
      tmp = np.zeros(N * d.itemsize + boundary, dtype=np.uint8)
      address = tmp.__array_interface__['data'][0]
      offset = (boundary - address % boundary) % boundary
      return tmp[offset:offset+N]\
                .view(dtype=d)\
                .reshape(shape, order=order)


Sturla
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
On Thu, 2012-12-20 at 17:26 +0100, Sturla Molden wrote:
> On 19.12.2012 09:40, Henry Gomersall wrote:
> > I've written a few simple cython routines for assisting in creating
> > byte-aligned numpy arrays. The point being for the arrays to work
> with
> > SSE/AVX code.
> >
> > https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
>
> Why use Cython?

> http://mail.scipy.org/pipermail/scipy-user/2009-March/020289.html
>
>
> def aligned_zeros(shape, boundary=16, dtype=float, order='C'):
>       N = np.prod(shape)
>       d = np.dtype(dtype)
>       tmp = np.zeros(N * d.itemsize + boundary, dtype=np.uint8)
>       address = tmp.__array_interface__['data'][0]
>       offset = (boundary - address % boundary) % boundary
>       return tmp[offset:offset+N]\
>                 .view(dtype=d)\
>                 .reshape(shape, order=order)


Initially because it kept my module in a single file. That's legacy now,
but since I'm already in the Cython domain, it makes sense to get the
advantages (like speed - creating a 1000 length array with
n_byte_align_empty is about 7 times faster than with the code above).

The alignment functions is just a utility function for the FFTW wrapper.

Cheers,

Henry

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Henry Gomersall-2
In reply to this post by Sturla Molden-2
On Thu, 2012-12-20 at 17:26 +0100, Sturla Molden wrote:
>       return tmp[offset:offset+N]\
>                 .view(dtype=d)\
>                 .reshape(shape, order=order)

Also, just for the email record, that should be
       
        return tmp[offset:offset+N*d.itemsize]\
                 .view(dtype=d)\
                 .reshape(shape, order=order)

Cheers,

Henry
 

_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Reply | Threaded
Open this post in threaded view
|

Re: Byte aligned arrays

Sturla Molden-2
In reply to this post by Henry Gomersall-2
On 19.12.2012 19:25, Henry Gomersall wrote:

> That is not true at least under Windows 32-bit. I think also it's not
> true for Linux 32-bit from my vague recollections of testing in a
> virtual machine. (disclaimer: both those statements _may_ be out of
> date).

malloc is required to return memory on 16 byte boundary on Windows.

http://msdn.microsoft.com/en-us/library/ycsb6wwf.aspx

On Windows we can also use _aligned_malloc and _aligned_realloc to
produce the alignment we want.

  _aligned_malloc(N, 4096);  // page-aligned memory


Sturla







_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
12