# C-coded dot 1000x faster than numpy?

14 messages

## C-coded dot 1000x faster than numpy?

I have code that performs a dot product of a 2D matrix of size (on the order of) [1000,16] with a vector of size [1000]. The matrix is float64 and the vector is complex128. I was using numpy.dot, but it turned out to be a bottleneck.

So I coded dot2x1 in C++ (using xtensor-python just for the interface). No fancy SIMD was used, unless g++ did it on its own.

On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy? My custom C++ code is dot2x1; I'm not copying it here because it has some dependencies. Any idea what is going on? Here is the test code:

```python
import numpy as np
from dot2x1 import dot2x1

a = np.ones((1000, 16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
              -0.80311816+0.80311816j, -0.80311816-0.80311816j,
               1.09707981+0.29396165j,  1.09707981-0.29396165j,
              -1.09707981+0.29396165j, -1.09707981-0.29396165j,
               0.29396165+1.09707981j,  0.29396165-1.09707981j,
              -0.29396165+1.09707981j, -0.29396165-1.09707981j,
               0.25495815+0.25495815j,  0.25495815-0.25495815j,
              -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
    d = dot2x1(a, b)

def F2():
    d = np.dot(a, b)

from timeit import timeit
print(timeit('F1()', globals=globals(), number=1000))
print(timeit('F2()', globals=globals(), number=1000))
```

```
0.013910860987380147  << 1st timeit (dot2x1)
28.608758996007964    << 2nd timeit (np.dot)
```

--
Those who don't understand recursion are doomed to repeat it
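The dot2x1 source isn't posted, but its semantics can be inferred from the shapes above: `a` is (1000, 16) float64 and `b` is (16,) complex128, so the product is a plain sum-product over the length-16 axis. A reference version for checking correctness might look like this (my sketch, not the author's implementation; `dot2x1_reference` is a hypothetical name):

```python
import numpy as np

def dot2x1_reference(m, v):
    # m: (N, K) float64, v: (K,) complex128 -> (N,) complex128, i.e. m @ v.
    out = np.zeros(m.shape[0], dtype=np.complex128)
    for k in range(m.shape[1]):       # K is small (16 here), so loop over it
        out += m[:, k] * v[k]         # one vectorized real*complex axpy per column
    return out
```

With `a` and `b` as defined in the benchmark, `np.allclose(dot2x1_reference(a, b), np.dot(a, b))` should hold.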

## Re: C-coded dot 1000x faster than numpy?

Hi,

On Tue, 23 Feb 2021 at 19:11, Neal Becker wrote:

> I have code that performs a dot product of a 2D matrix of size (on the order of) [1000,16] with a vector. The matrix is float64 and the vector is complex128. [...] On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy? Any idea what is going on?

I had a similar experience - albeit with an older numpy and Python 2.7, so my comments are easily outdated and irrelevant. This was on Windows 10 64-bit, with far more than enough RAM.

It took me forever to find out that numpy.dot was the culprit, and I ended up using Fortran + f2py. Even with the overhead of having to go through the f2py bridge, the Fortran dot_product was several times faster.

Sorry if it doesn't help much.

Andrea.

## Re: C-coded dot 1000x faster than numpy?

For the first benchmark: A.dot(B) with A real and B complex is a known performance issue, see https://github.com/numpy/numpy/issues/10468

In general, it might be worth trying different BLAS backends. For instance, if you install numpy from conda-forge you should be able to switch between OpenBLAS, MKL and BLIS: https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation

Roman
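A quick way to check whether the mixed-dtype path is what's being hit (a sketch of my own, not from the thread): convert the real matrix to complex once, so np.dot can take the homogeneous complex BLAS path, and compare timings.

```python
import numpy as np
from timeit import timeit

a = np.ones((1000, 16))                  # float64
b = np.exp(1j * np.arange(16.0))         # stand-in complex128 vector
ac = a.astype(np.complex128)             # pay the conversion cost once, up front

print(timeit(lambda: np.dot(a, b), number=1000))    # mixed real/complex path
print(timeit(lambda: np.dot(ac, b), number=1000))   # all-complex BLAS path
```

If the second timing is dramatically faster, the slowdown is the mixed-dtype handling described in the issue above rather than the BLAS backend itself.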

## Re: C-coded dot 1000x faster than numpy?

One suspect is that it seems the numpy version was multi-threading. This isn't useful here, because I'm running parallel Monte Carlo simulations using all cores. Perhaps this is perversely slowing things down? I don't know how to account for a 1000x slowdown, though.
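The threading hypothesis can be tested without rebuilding anything. One way (a sketch; threadpoolctl is a separate package and isn't mentioned in the thread) is to pin the BLAS thread pool to a single thread and re-run the same benchmark:

```python
# pip install threadpoolctl
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits
from timeit import timeit

a = np.ones((1000, 16))
b = np.exp(1j * np.arange(16.0))     # stand-in complex vector

print(threadpool_info())             # shows which BLAS is loaded and its thread count

with threadpool_limits(limits=1):    # force single-threaded BLAS
    print(timeit(lambda: np.dot(a, b), number=1000))
```

If oversubscription is the problem, the single-threaded timing should drop sharply.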

## Re: C-coded dot 1000x faster than numpy?

I'm using Fedora 33 standard numpy. ldd says:

```
/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
    linux-vdso.so.1 (0x00007ffdd1487000)
    libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)
```

So whatever FlexiBLAS is doing controls BLAS.

## Re: C-coded dot 1000x faster than numpy?

Maybe `C_CONTIGUOUS` vs. `F_CONTIGUOUS`?

Carl

## Re: C-coded dot 1000x faster than numpy?

On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak wrote:

> For the first benchmark: A.dot(B) with A real and B complex is a known performance issue, see https://github.com/numpy/numpy/issues/10468

I split B into a vector of size (N, 2) for the real and imaginary parts, and that makes the multiplication twice as fast.

My configuration (also on Fedora 33):

```
np.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
```
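A reconstruction of that split (my sketch of what the message describes, not the poster's code): stack the real and imaginary parts of b as two real columns, do a single real-real matrix product, and recombine into a complex result.

```python
import numpy as np

a = np.ones((1000, 16))                    # float64 matrix
b = np.exp(1j * np.arange(16.0))           # stand-in complex128 vector

b2 = np.stack([b.real, b.imag], axis=1)    # (16, 2) float64
r = a @ b2                                 # one real dgemm, result (1000, 2)
d = r[:, 0] + 1j * r[:, 1]                 # recombine into complex128

assert np.allclose(d, np.dot(a, b))
```

This keeps everything on the fast real-real BLAS path, at the cost of one extra copy of b and the recombination step.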

## Re: C-coded dot 1000x faster than numpy?

The Stack Overflow link above contains a simple test case:

```python
>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True
```

It would be of interest to benchmark gemm against np.dot. Maybe np.dot doesn't use BLAS at all for whatever reason?

Carl
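A direct comparison along those lines might look like this (a sketch; for these mixed dtypes get_blas_funcs resolves to the complex gemm, so the real matrix gets cast to complex inside the wrapper):

```python
import numpy as np
from scipy.linalg import get_blas_funcs
from timeit import timeit

X = np.ones((1000, 16))                        # float64
Y = np.exp(1j * np.arange(16.0))[:, None]      # complex128, shape (16, 1)

gemm = get_blas_funcs("gemm", (X, Y))          # resolves to zgemm here
assert np.allclose(gemm(1, X, Y), np.dot(X, Y))

print(timeit(lambda: gemm(1, X, Y), number=1000))   # BLAS zgemm (with casting)
print(timeit(lambda: np.dot(X, Y), number=1000))    # numpy's mixed-dtype path
```

If gemm comes out much faster, np.dot is not reaching BLAS efficiently for this dtype combination.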

## Re: C-coded dot 1000x faster than numpy?

On Tue, Feb 23, 2021 at 11:10 AM Neal Becker wrote:

> I have code that performs a dot product of a 2D matrix of size (on the order of) [1000,16] with a vector. [...] On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy? Any idea what is going on?

I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.

Chuck

## Re: C-coded dot 1000x faster than numpy?

On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris wrote:

> I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.

What BLAS library are you using? I get much better results using an 8-year-old i5 and ATLAS.

Chuck

## Re: C-coded dot 1000x faster than numpy?

See my earlier email - this is Fedora 33, Python 3.9. I'm using Fedora 33 standard numpy. ldd says:

```
/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
    linux-vdso.so.1 (0x00007ffdd1487000)
    libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)
```

So whatever FlexiBLAS is doing controls BLAS.

```
$ flexiblas print
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

Configured BLAS libraries:
System-wide (/etc/flexiblasrc):
System-wide from config directory (/etc/flexiblasrc.d/)
  OPENBLAS-OPENMP
    library = libflexiblas_openblas-openmp.so
    comment =
  NETLIB
    library = libflexiblas_netlib.so
    comment =
  ATLAS
    library = libflexiblas_atlas.so
    comment =
User config (/home/nbecker/.flexiblasrc):
Host config (/home/nbecker/.flexiblasrc.nbecker8):
Available hooks:
Backend and hook search paths:
  /usr/lib64/flexiblas/
Default BLAS:
    System:       OPENBLAS-OPENMP
    User:         (none)
    Host:         (none)
    Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
   verbose = 0 (System)
```

So it looks to me like it is using OPENBLAS-OPENMP.

## Re: C-coded dot 1000x faster than numpy?

On Wed, Feb 24, 2021 at 5:36 AM Neal Becker wrote:

> So it looks to me like it is using OPENBLAS-OPENMP.

ISTR that there have been problems with OpenMP. There are a ton of OpenBLAS versions available in Fedora 33, just the ones available via flexiblas:

```
flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
```

I am not sure how to make use of flexiblas, but would explore that. We might need to do something with distutils to interoperate with it, or maybe you can control it through site.cfg. There are 12 versions available in total. I would suggest trying serial or pthreads.

Chuck
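For what it's worth, FlexiBLAS is designed to make this switch without touching distutils or site.cfg: the backend is selected via the FLEXIBLAS environment variable when the library is loaded (my understanding of FlexiBLAS's documented behavior, not something verified in this thread). A minimal sketch:

```python
# Sketch: choose a FlexiBLAS backend before NumPy (and hence FlexiBLAS)
# is loaded. Backend names are the ones `flexiblas print` listed above,
# e.g. NETLIB, ATLAS, OPENBLAS-OPENMP.
import os
os.environ["FLEXIBLAS"] = "NETLIB"   # must be set before importing numpy

import numpy as np
from timeit import timeit

a = np.ones((1000, 16))
b = np.exp(1j * np.arange(16.0))     # stand-in complex vector
print(timeit(lambda: np.dot(a, b), number=1000))
```

Running the benchmark under each configured backend would show whether the OpenMP build is the outlier.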