I have code that performs a dot product of a 2D matrix of size (on the order of) [1000, 16] with a vector of size [16]. The matrix is float64 and the vector is complex128. I was using numpy.dot, but it turned out to be a bottleneck.

So I coded dot2x1 in C++ (using xtensor-python just for the interface). No fancy SIMD was used, unless g++ did it on its own.

On a simple benchmark using timeit, I find my hand-coded routine is on the order of 1000x faster than numpy. Here is the test code. My custom C++ code is dot2x1; I'm not copying it here because it has some dependencies. Any idea what is going on?

import numpy as np

from dot2x1 import dot2x1

a = np.ones((1000, 16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
              -0.80311816+0.80311816j, -0.80311816-0.80311816j,
               1.09707981+0.29396165j,  1.09707981-0.29396165j,
              -1.09707981+0.29396165j, -1.09707981-0.29396165j,
               0.29396165+1.09707981j,  0.29396165-1.09707981j,
              -0.29396165+1.09707981j, -0.29396165-1.09707981j,
               0.25495815+0.25495815j,  0.25495815-0.25495815j,
              -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
    d = dot2x1(a, b)

def F2():
    d = np.dot(a, b)

from timeit import timeit
print(timeit('F1()', globals=globals(), number=1000))
print(timeit('F2()', globals=globals(), number=1000))

In [13]: 0.013910860987380147  << 1st timeit
         28.608758996007964    << 2nd timeit

--
Those who don't understand recursion are doomed to repeat it
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion
Hi,

On Tue, 23 Feb 2021 at 19:11, Neal Becker <[hidden email]> wrote:
> I have code that performs dot product of a 2D matrix of size (on the [...]

I had a similar experience - albeit with an older numpy and Python 2.7, so my comments are easily outdated and irrelevant. This was on Windows 10 64-bit, with way more than plenty of RAM.

It took me forever to find out that numpy.dot was the culprit, and I ended up using Fortran + f2py. Even with the overhead of having to go through the f2py bridge, the Fortran dot_product was several times faster.

Sorry if it doesn't help much.

Andrea.
For the first benchmark: apparently A.dot(B) with A real and B complex is a known performance issue: https://github.com/numpy/numpy/issues/10468

In general, it might be worth trying different BLAS backends. For instance, if you install numpy from conda-forge you should be able to switch between OpenBLAS, MKL and BLIS:
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation

Roman

On 23/02/2021 19:32, Andrea Gavana wrote:
> [quoted message snipped]
One suspect is that it seems the numpy version was multi-threading. This isn't useful here, because I'm running parallel Monte Carlo simulations using all cores. Perhaps this is perversely slowing things down? I don't know how to account for a 1000x slowdown, though.

On Tue, Feb 23, 2021 at 1:40 PM Roman Yurchak <[hidden email]> wrote:
> [quoted message snipped]

--
Those who don't understand recursion are doomed to repeat it
I'm using Fedora 33 standard numpy.

ldd says:
/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
        linux-vdso.so.1 (0x00007ffdd1487000)
        libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever flexiblas is doing controls BLAS.

On Tue, Feb 23, 2021 at 1:51 PM Neal Becker <[hidden email]> wrote:
> [quoted message snipped]

--
Those who don't understand recursion are doomed to repeat it
Maybe C_CONTIGUOUS vs F_CONTIGUOUS?

Carl

On Tue, Feb 23, 2021 at 19:52, Neal Becker <[hidden email]> wrote:
> One suspect is that it seems the numpy version was multi-threading. [...]
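Whether memory layout is a factor here can be checked directly; a quick sketch, using the same shapes as the benchmark (the `a_f` variant is hypothetical, for comparison only):

```python
import numpy as np

a = np.ones((1000, 16))        # numpy default: C-contiguous (row-major)
a_f = np.asfortranarray(a)     # same values, Fortran (column-major) layout

assert a.flags["C_CONTIGUOUS"] and not a.flags["F_CONTIGUOUS"]
assert a_f.flags["F_CONTIGUOUS"] and not a_f.flags["C_CONTIGUOUS"]

b = np.ones(16, dtype=np.complex128)
# Timing np.dot(a, b) against np.dot(a_f, b) would expose a layout effect;
# the results themselves are identical either way.
assert np.allclose(np.dot(a, b), np.dot(a_f, b))
```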
On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak, <[hidden email]> wrote:
> For the first benchmark apparently A.dot(B) with A real and B complex is [...]

I split B into a vector of size (N, 2) for the real and imaginary parts, and that makes the multiplication twice as fast.

My configuration (also Fedora 33):

np.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
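The split-by-parts workaround can also be written without reshaping B at all: do two homogeneous float64 products (each of which dispatches to a fast BLAS path) and recombine. A sketch of the idea; the exact (N, 2) layout David used may differ, and the test data here is made up:

```python
import numpy as np

a = np.ones((1000, 16))                        # float64 matrix
b = (np.arange(16) % 4 - 1.5) * (1.0 + 1.0j)   # complex128 vector (made-up data)

# Mixed real/complex product: may take a slow non-BLAS fallback path
# (see numpy issue #10468 mentioned above).
direct = np.dot(a, b)

# Two real-valued products, recombined; both calls stay in float64.
split = np.dot(a, b.real) + 1j * np.dot(a, b.imag)

assert np.allclose(direct, split)
```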
The stackoverflow link above contains a simple testcase:
On Tue, Feb 23, 2021 at 20:46, David Menéndez Hurtado <[hidden email]> wrote:
> [quoted message snipped]
On Tue, Feb 23, 2021 at 11:10 AM Neal Becker <[hidden email]> wrote:
> I have code that performs dot product of a 2D matrix of size (on the [...]

I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.

Chuck
On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris <[hidden email]> wrote:

What BLAS library are you using? I get much better results using an 8-year-old i5 and ATLAS.

Chuck
See my earlier email - this is Fedora 33, Python 3.9.

I'm using Fedora 33 standard numpy. ldd says:
/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
        linux-vdso.so.1 (0x00007ffdd1487000)
        libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever flexiblas is doing controls BLAS.

flexiblas print:
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.

Configured BLAS libraries:
System-wide (/etc/flexiblasrc):
System-wide from config directory (/etc/flexiblasrc.d/)
 OPENBLAS-OPENMP
   library = libflexiblas_openblas-openmp.so
   comment =
 NETLIB
   library = libflexiblas_netlib.so
   comment =
 ATLAS
   library = libflexiblas_atlas.so
   comment =
User config (/home/nbecker/.flexiblasrc):
Host config (/home/nbecker/.flexiblasrc.nbecker8):
Available hooks:
Backend and hook search paths:
  /usr/lib64/flexiblas/
Default BLAS:
  System: OPENBLAS-OPENMP
  User:   (none)
  Host:   (none)
  Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
  verbose = 0 (System)

So it looks to me like it is using openblas-openmp.

On Tue, Feb 23, 2021 at 8:00 PM Charles R Harris <[hidden email]> wrote:
> [quoted message snipped]

--
Those who don't understand recursion are doomed to repeat it
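Assuming FlexiBLAS honors its `FLEXIBLAS` environment variable (the backend names are the ones shown by `flexiblas print` above), the backend can be switched per-process without rebuilding numpy. It just has to happen before numpy is imported, since the backend is chosen when the shared library loads. A sketch:

```python
import os

# Must be set before numpy is imported: libflexiblas picks its backend at
# load time. Valid names are system-specific; these come from the
# `flexiblas print` output above.
os.environ["FLEXIBLAS"] = "NETLIB"   # or "OPENBLAS-OPENMP", "ATLAS", ...

import numpy as np

a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)
d = np.dot(a, b)            # served by the selected backend (if flexiblas
                            # is present; otherwise the variable is ignored)
assert np.allclose(d, 16.0 + 0j)
```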
On Wed, Feb 24, 2021 at 5:36 AM Neal Becker <[hidden email]> wrote:
> See my earlier email - this is Fedora 33, Python 3.9. [...]

ISTR that there have been problems with OpenMP. There are a ton of OpenBLAS versions available in Fedora 33, just available via flexiblas.

I am not sure how to make use of flexiblas, but would explore that. We might need to do something with distutils to interoperate with it, or maybe you can control it through site.cfg. There are 12 versions available in total. I would suggest trying serial or pthreads.

Chuck
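A quick way to test the serial hypothesis without touching the flexiblas configuration is to cap the BLAS thread pools through environment variables before numpy loads. The variable names depend on the backend; the two below cover the common OpenBLAS pthreads and OpenMP builds, and are harmlessly ignored otherwise:

```python
import os

# Set before importing numpy; thread pools are sized at library load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS pthreads build
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP builds (e.g. openblas-openmp)

import numpy as np
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)
t = timeit(lambda: np.dot(a, b), number=1000)
print(t)                                   # compare against the threaded run
assert np.allclose(np.dot(a, b), 16.0 + 0j)
```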
On Wed, Feb 24, 2021 at 8:02 AM Charles R Harris <[hidden email]> wrote:

Seems to be controlled in the /etc directory: /etc/flexiblas64rc.d/openblas-openmp64.conf. On my machine it looks like openmp64 is the system default.

Chuck
Supposedly it can be controlled through env variables, but I didn't see any effect.

On Wed, Feb 24, 2021, 10:12 AM Charles R Harris <[hidden email]> wrote:
> [quoted message snipped]