Numpy doesn't use RAM


Numpy doesn't use RAM

Keyvis Damptey
Hi Numpy dev community,

I'm Keyvis, a statistical data scientist.

I'm currently using numpy in Python 3.8.2 (64-bit) for a clustering problem, on a machine with 1.9 TB of RAM. When I try using np.zeros to create a 600,000 by 600,000 matrix of dtype=np.float32, it says
"Unable to allocate 1.31 TiB for an array with shape (600000, 600000) and data type float32"

I used psutil to determine how much RAM Python thinks it has access to, and it returned approximately 1.8 TB.

Is there some way I can get numpy to create these large arrays?
Thanks for your time and consideration


Re: Numpy doesn't use RAM

Sebastian Berg
On Tue, 2020-03-24 at 13:59 -0400, Keyvis Damptey wrote:

> Hi Numpy dev community,
>
> I'm Keyvis, a statistical data scientist.
>
> I'm currently using numpy in Python 3.8.2 (64-bit) for a clustering
> problem, on a machine with 1.9 TB of RAM. When I try using np.zeros
> to create a 600,000 by 600,000 matrix of dtype=np.float32, it says
> "Unable to allocate 1.31 TiB for an array with shape (600000, 600000)
> and data type float32"
>
If this error happens, allocating the memory failed. This should be
pretty much a simple `malloc` call in C, so this is the kernel
complaining, not Python/NumPy.

I am not quite sure, but maybe memory fragmentation plays its part, or the process simply is actually out of memory; 1.44 TB (1.31 TiB) is a significant portion of the total memory, after all.

Not sure what to say, but I think you should probably look into other
solutions, maybe using HDF5, zarr, or memory-mapping (although I am not
sure the last actually helps). It will be tricky to work with arrays of
a size that is close to the available total memory.
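For illustration, a minimal sketch of the HDF5 route (assuming h5py is installed; the file name and chunk shape here are arbitrary choices):

    import h5py

    # A chunked float32 dataset backed by a file instead of RAM.
    with h5py.File("big_matrix.h5", "w") as f:
        dset = f.create_dataset("m", shape=(600_000, 600_000),
                                dtype="float32", chunks=(4096, 4096))
        # Only the chunks you touch are read or written, so a small
        # window can be updated without holding 1.44 TB in memory.
        dset[:4096, :4096] = 1.0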

Maybe someone who works more with such data here can give you tips on
what projects can help you or what solutions to look into.

- Sebastian



> I used psutil to determine how much RAM Python thinks it has access
> to, and it returned approximately 1.8 TB.
>
> Is there some way I can get numpy to create these large arrays?
> Thanks for your time and consideration


Re: Numpy doesn't use RAM

Stanley Seibert
In reply to this post by Keyvis Damptey
In addition to what Sebastian said about memory fragmentation and OS limits on memory allocations, I do think it will be hard to work with an array that close to the memory limit in NumPy regardless.  Almost any operation will need to make a temporary array, which would exceed your memory limit.  You might want to look at Dask Array for a NumPy-like API for working with chunked arrays that can be staged in and out of memory:


As a bonus, Dask will also let you make better use of the large number of CPU cores that you likely have in your 1.9 TB RAM system.  :)
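A minimal sketch of what that can look like (assuming dask is installed; the chunk size is an arbitrary choice):

    import dask.array as da

    # Lazily define the array split into 10,000 x 10,000 chunks;
    # nothing close to 1.44 TB is ever allocated at once.
    x = da.zeros((600_000, 600_000), dtype="float32",
                 chunks=(10_000, 10_000))

    # Reductions stream over chunks, keeping only a few 400 MB
    # blocks in memory at a time.
    total = x.sum().compute()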


Re: Numpy doesn't use RAM

Benjamin Root
In reply to this post by Sebastian Berg
Another thing to point out about an array that takes up that large a fraction of the available memory is that it severely restricts what you can do with it. Since you are above 50% of the available memory, you won't be able to create another array of the same size to hold the result of computing something with it. So you are restricted to querying (which you could do without having everything in memory) or to in-place operations.
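A small illustration of the difference (tiny shape, just to show the pattern):

    import numpy as np

    a = np.ones((10_000, 10_000), dtype=np.float32)

    # b = a * 2               # allocates a second array of the same size
    np.multiply(a, 2, out=a)   # reuses a's buffer, no extra allocation
    a += 1                     # augmented assignment is also in place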

Dask arrays might be what you are really looking for.

Ben Root


Re: Numpy doesn't use RAM

YueCompl
In reply to this post by Stanley Seibert

If you are sure your subsequent computation against the array data has enough locality to avoid thrashing, I think numpy.memmap would work for you, i.e. using an explicit disk file as the swap space.
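A minimal sketch of that approach (the file name is a placeholder):

    import numpy as np

    # float32 array backed by a file on disk; the OS pages data in
    # and out on demand instead of keeping it all in RAM.
    m = np.memmap("big_matrix.dat", dtype=np.float32, mode="w+",
                  shape=(600_000, 600_000))

    m[:1000, :1000] = 1.0   # only the touched pages come into memory
    m.flush()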

My environment does a lot of mmap'ing of on-disk data files from C++ (after Python has read the metadata), then wraps the buffers as ndarrays. That is enough to run out-of-core programs as long as the data access pattern fits in physical RAM at any instant; even scanning the whole dataset is okay when spread along the time axis (real-world time, that is, not a time axis in the data).
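In pure Python the same wrapping can be sketched with the standard mmap module (the file name and row width are hypothetical):

    import mmap
    import numpy as np

    f = open("chunk_0001.dat", "rb")
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Zero-copy, read-only view of the mapped file as an ndarray;
    # the reshape assumes rows of 600,000 float32 values.
    arr = np.frombuffer(buf, dtype=np.float32).reshape(-1, 600_000)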

Memory (address space) fragmentation is a problem, as is the OS's `nofile` limit (the number of file handles held open) when too many small data files are involved; we are switching to a FUSE-based filesystem that presents the many small files on a remote storage server as one large virtual file.

Cheers,
Compl



_______________________________________________
NumPy-Discussion mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/numpy-discussion