C-coded dot 1000x faster than numpy?

C-coded dot 1000x faster than numpy?

Neal Becker
I have code that performs dot product of a 2D matrix of size (on the
order of) [1000,16] with a vector of size [1000].  The matrix is
float64 and the vector is complex128.  I was using numpy.dot but it
turned out to be a bottleneck.

So I coded dot2x1 in C++ (using xtensor-python just for the
interface).  No fancy SIMD was used, unless g++ did it on its own.

On a simple benchmark using timeit I find my hand-coded routine is on
the order of 1000x faster than numpy.  Here is the test code (my custom
C++ routine is dot2x1; I'm not copying it here because it has some
dependencies).  Any idea what is going on?

import numpy as np

from dot2x1 import dot2x1

a = np.ones ((1000,16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
       -0.80311816+0.80311816j, -0.80311816-0.80311816j,
        1.09707981+0.29396165j,  1.09707981-0.29396165j,
       -1.09707981+0.29396165j, -1.09707981-0.29396165j,
        0.29396165+1.09707981j,  0.29396165-1.09707981j,
       -0.29396165+1.09707981j, -0.29396165-1.09707981j,
        0.25495815+0.25495815j,  0.25495815-0.25495815j,
       -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
    d = dot2x1 (a, b)

def F2():
    d = np.dot (a, b)

from timeit import timeit
print (timeit ('F1()', globals=globals(), number=1000))
print (timeit ('F2()', globals=globals(), number=1000))

0.013910860987380147  << 1st timeit (dot2x1)
28.608758996007964    << 2nd timeit (np.dot)
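
A quick sanity check - just a sketch, assuming a, b and dot2x1 as defined
above - to confirm the two routines compute the same thing before comparing
speed:

import numpy as np
# dot2x1 is the custom C++ extension; np.dot is the numpy baseline
assert np.allclose(dot2x1(a, b), np.dot(a, b))

Each timeit call runs number=1000 iterations, so the totals work out to
roughly 14 microseconds per dot2x1 call versus about 29 milliseconds per
np.dot call.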
--
Those who don't understand recursion are doomed to repeat it

Re: C-coded dot 1000x faster than numpy?

Andrea Gavana
Hi,

I had a similar experience - albeit with an older numpy and Python 2.7, so my comments may well be outdated. This was on Windows 10 64-bit, with plenty of RAM.

It took me forever to find out that numpy.dot was the culprit, and I ended up using Fortran + f2py. Even with the overhead of going through the f2py bridge, the Fortran dot_product was several times faster.

Sorry if it doesn't help much.

Andrea.

Re: C-coded dot 1000x faster than numpy?

Roman Yurchak
For the first benchmark: A.dot(B) with A real and B complex is apparently a
known performance issue, see https://github.com/numpy/numpy/issues/10468

In general, it might be worth trying different BLAS backends. For
instance, if you install numpy from conda-forge you should be able to
switch between OpenBLAS, MKL and BLIS:
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation

Roman
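
One possible workaround for the mixed-dtype case - a sketch only, with
placeholder values for b - is to promote the real matrix to complex up
front, so the product can take the complex BLAS path instead of numpy's
fallback loop:

import numpy as np

a = np.ones((1000, 16))                      # float64, as in the benchmark
b = np.exp(1j * np.arange(16.0))             # any complex128 vector of length 16

d_slow = np.dot(a, b)                        # mixed float64/complex128: slow path
d_fast = np.dot(a.astype(np.complex128), b)  # both complex128: should hit BLAS
assert np.allclose(d_slow, d_fast)

The extra copy of a costs memory and time, but for a (1000, 16) array that
should be negligible compared to the fallback loop.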


Re: C-coded dot 1000x faster than numpy?

Neal Becker
One suspect is that the numpy version seems to be multi-threading.
This isn't useful here, because I'm running parallel Monte Carlo
simulations using all cores.  Perhaps this is perversely slowing
things down?  I don't know how to account for a 1000x slowdown, though.
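
One way to test that suspicion - a sketch, and it assumes the threadpoolctl
package, which is separate from numpy - is to pin the BLAS backend to a
single thread and re-run the timing:

import numpy as np
from threadpoolctl import threadpool_limits
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)

# limit OpenBLAS/MKL/BLIS to one thread for everything inside the block
with threadpool_limits(limits=1, user_api="blas"):
    print(timeit(lambda: np.dot(a, b), number=1000))

Setting OMP_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 in the environment
before starting Python should have a similar effect.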


Re: C-coded dot 1000x faster than numpy?

Neal Becker
I'm using the standard Fedora 33 numpy.
ldd says:

/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x00007ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever FlexiBLAS is doing controls the BLAS.
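
For comparison, numpy can report the BLAS/LAPACK configuration it was built
against directly (a quick check; with a FlexiBLAS build this presumably shows
the FlexiBLAS wrapper rather than the backend it dispatches to at run time):

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries numpy was built with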



Re: C-coded dot 1000x faster than numpy?

Daπid
On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak, <[hidden email]> wrote:
For the first benchmark apparently A.dot(B) with A real and B complex is
a known issue performance wise https://github.com/numpy/numpy/issues/10468
 
I split B into an array of size (N, 2) holding the real and imaginary parts, and that makes the multiplication twice as fast.
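
Roughly, a sketch of that split (the function name is made up, and the
(N, 2) layout is a guess at what was meant): stack the real and imaginary
parts of b as two real columns, do a single real-real matmul, and
reinterpret the result as complex:

import numpy as np

def dot_via_real_split(a, b):
    # a: real float64 of shape (M, K); b: complex128 of shape (K,)
    b2 = np.column_stack((b.real, b.imag))             # (K, 2), float64
    out = a @ b2                                       # real BLAS path, shape (M, 2)
    return np.ascontiguousarray(out).view(np.complex128).ravel()

# dot_via_real_split(a, b) should agree with np.dot(a, b) up to rounding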


My configuration (also on Fedora 33), from np.show_config():
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]


Re: C-coded dot 1000x faster than numpy?

Carl Kleffner
The stackoverflow link above contains a simple testcase:
>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True

It would be of interest to benchmark gemm against np.dot. Maybe np.dot doesn't use BLAS at all here, for whatever reason?
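
A rough sketch of such a comparison (array values are placeholders mirroring
the original benchmark; gemm wants matching dtypes and a 2-D second operand,
so b is cast and reshaped accordingly):

import numpy as np
from scipy.linalg import get_blas_funcs
from timeit import timeit

a = np.ones((1000, 16))
b = (np.arange(16) + 1j * np.arange(16)).astype(np.complex128)

ac = a.astype(np.complex128)             # zgemm needs both operands complex
gemm = get_blas_funcs("gemm", (ac, b))   # resolves to zgemm for complex128 inputs

print(timeit(lambda: gemm(1.0, ac, b.reshape(16, 1)), number=1000))  # direct BLAS
print(timeit(lambda: np.dot(a, b), number=1000))                     # mixed-dtype np.dot

If gemm is fast here while np.dot is slow, that would suggest np.dot is not
taking the BLAS path for the mixed float64/complex128 case.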


Re: C-coded dot 1000x faster than numpy?

Charles R Harris
I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.

Chuck


Re: C-coded dot 1000x faster than numpy?

Charles R Harris


What BLAS library are you using?  I get much better results using an 8-year-old i5 and ATLAS.

Chuck 


Re: C-coded dot 1000x faster than numpy?

Neal Becker
See my earlier email - this is Fedora 33, Python 3.9, with the standard
Fedora numpy linked against libflexiblas.so.3, so whatever FlexiBLAS is
doing controls the BLAS.

flexiblas print
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.


Configured BLAS libraries:
System-wide (/etc/flexiblasrc):

System-wide from config directory (/etc/flexiblasrc.d/)
 OPENBLAS-OPENMP
   library = libflexiblas_openblas-openmp.so
   comment =
 NETLIB
   library = libflexiblas_netlib.so
   comment =
 ATLAS
   library = libflexiblas_atlas.so
   comment =

User config (/home/nbecker/.flexiblasrc):

Host config (/home/nbecker/.flexiblasrc.nbecker8):

Available hooks:

Backend and hook search paths:
  /usr/lib64/flexiblas/

Default BLAS:
    System:       OPENBLAS-OPENMP
    User:         (none)
    Host:         (none)
    Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
   verbose = 0 (System)

So it looks to me like it is using openblas-openmp.



Re: C-coded dot 1000x faster than numpy?

Charles R Harris


ISTR that there have been problems with OpenMP. There are a ton of OpenBLAS versions available in Fedora 33, just via flexiblas:

  1. flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
  2. flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
  3. flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
  4. flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
  5. flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
  6. flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
  7. flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
I am not sure how to make use of flexiblas, but it would be worth exploring. We might need to do something with distutils to interoperate with it, or maybe you can control it through site.cfg. There are 12 versions available in total. I would suggest trying serial or pthreads.

Chuck
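
If FlexiBLAS honours its FLEXIBLAS environment variable here (an assumption;
the backend names below come from the flexiblas print output earlier in the
thread), switching backends for a single run could look roughly like this,
as long as the variable is set before numpy is first imported:

import os
os.environ["FLEXIBLAS"] = "NETLIB"   # or "ATLAS" / "OPENBLAS-OPENMP"

import numpy as np                   # FlexiBLAS picks its backend when loaded
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)
print(timeit(lambda: np.dot(a, b), number=1000))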


Re: C-coded dot 1000x faster than numpy?

Charles R Harris


Seems to be controlled in the /etc directory:

/etc/flexiblas64rc.d/openblas-openmp64.conf
 
On my machine it looks like openmp64 is the system default.

Chuck


Re: C-coded dot 1000x faster than numpy?

Neal Becker
Supposedly it can be controlled through environment variables, but I didn't see any effect.
