record data previous to Numpy use

record data previous to Numpy use

paul.carrico

Dear all


I'm sorry if my question is too basic (not fully related to Numpy, though it is about building matrices to work with Numpy afterward), but I've spent a lot of time and effort trying to find a way to record data from an ASCII file and reassign it into a matrix/array ... without success!


The only way I found is to use the 'append()' instruction, which involves dynamic memory allocation. :-(


From my current experience under Scilab (a Matlab-like scientific solver), the workflow is well known:

  1. Step 1: initialize the matrix, e.g. 'np.zeros((n,n))'
  2. Step 2: record the data
  3. Step 3: write it into the matrix


I’m obviously influenced by my current experience, but I’m interested in moving to Python and its packages


For huge ASCII files (involving dozens of millions of lines), my strategy is to work by 'blocks', as follows:

  • Find the line indices of the beginning and the end of one block (this implies that the file is read once)
  • Read the block
  • (the process is repeated on the other blocks)


I tried different codes such as the one below, but each time Python tells me I cannot mix iteration and read methods

#############################################

position = []
j = 0

with open(PATH + file_name, "r") as rough_data:
    for line in rough_data:
        if my_criteria in line:
            position.append(j)   ## huge blocks, but limited in number
        j = j + 1

i = 0
blockdata = np.zeros((size_block,), dtype=np.float)

with open(PATH + file_name, "r") as f:
    for line in itertools.islice(f, 1, size_block):
        blockdata[i] = float(f.readline())
        i = i + 1

#############################################


Should I work on lists using f.readlines() (but this implies loading the whole file into memory)?


Additional question: can I use recording with vectorization, e.g. with 'i = np.arange(0,65406)', if I stay with the previous example?



Thanks for your time and understanding

(I'm obviously interested in doc references covering those specific tasks)


Paul


PS: for Chuck: I'll have a look at the pandas package, but in a code optimization step :-) (nearly 2000 doc pages)







Re: record data previous to Numpy use

Thomas Caswell
Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a better storage format for what you are describing.
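For illustration, a minimal h5py sketch of the kind of storage suggested here (file name, dataset name and shape are made up):

import h5py
import numpy as np

## write a large array once ...
with h5py.File('blocks.h5', 'w') as f:
    f.create_dataset('block_0', data=np.zeros((1000, 29)))

## ... then read back only the slices you need, without loading everything
with h5py.File('blocks.h5', 'r') as f:
    first_rows = f['block_0'][:10]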

Tom

On Wed, Jul 5, 2017 at 8:42 AM <[hidden email]> wrote:
[...]

Re: record data previous to Numpy use

paul.carrico

Hi


Thanks for the answer:


ASCII is the input format (and the only one I can deal with).

HDF5 might be an export format (it's one of the options) in order to speed up the post-processing stage.


Paul



On 2017-07-05 20:19, Thomas Caswell wrote:
[...]

Re: record data previous to Numpy use

Derek Homeier
Hi Paul,

> ASCII is the input format (and the only one I can deal with).
>
> HDF5 might be an export format (it's one of the options) in order to speed up the post-processing stage.
>
> Paul
>
> [...]

if you are indeed tied to using ASCII input data, you will of course have to deal with significant
performance handicaps, but there are at least some gains to be had by using an input parser
that does not do all the conversions at the Python level, but with a compiled (C) reader - either
pandas as Tom already mentioned, or astropy - see e.g.
https://github.com/dhomeier/astropy-notebooks/blob/master/io/ascii/ascii_read_bench.ipynb
for the almost one order of magnitude speed gains you may get.

In your example it is not clear what “record” method you were trying to use that raised the errors
you mention - we would certainly need a full traceback of the error to find out more.

In principle your approach of allocating the numpy matrix first and reading the data in chunks
makes sense, as it will avoid the much larger temporary lists created during read-in.
But it might be more convenient to just read in the block into a list of lines and pass that to a
higher-level reader like np.genfromtxt or the faster astropy.io.ascii.read or pandas.read_csv
to speed up the parsing of the numbers themselves.
That said, on most systems these readers should still be able to handle files up to a few 10^8
items (expect ~ 25-55 bytes of memory for each input number allocated for temporary lists),
so if saving memory is not an absolute priority, directly reading the entire file might still be the
best choice (and would also save the first pass reading).
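
A minimal sketch of the block-to-reader approach described here; the block boundaries (start, stop) are assumed to be known from a first pass over the file:

import itertools
import numpy as np

with open('data.txt') as f:                  ## illustrative filename
    block_lines = list(itertools.islice(f, start, stop))

## genfromtxt accepts any iterable of lines, so no temporary file is needed
block = np.genfromtxt(block_lines, dtype=float)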

Cheers,
                                        Derek


Re: record data previous to Numpy use

Robert Kern-2
In reply to this post by paul.carrico
On Wed, Jul 5, 2017 at 5:41 AM, <[hidden email]> wrote:
>
> Dear all
>
> I'm sorry if my question is too basic (not fully related to Numpy, though it is about building matrices to work with Numpy afterward), but I've spent a lot of time and effort trying to find a way to record data from an ASCII file and reassign it into a matrix/array ... without success!
>
> The only way I found is to use the 'append()' instruction, which involves dynamic memory allocation. :-(

Are you talking about appending to Python list objects? Or the np.append() function on numpy arrays?

In my experience, it is usually fine to build a list with the `.append()` method while reading the file of unknown size and then converting it to an array afterwards, even for dozens of millions of lines. The list object is quite smart about reallocating memory so it is not that expensive. You should generally avoid the np.append() function, though; it is not smart.
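
A minimal sketch of that idiom (the filename is made up):

import numpy as np

values = []
with open('data.txt') as f:
    for line in f:
        values.append(float(line))       ## list.append is amortized O(1)
arr = np.array(values, dtype=np.float64)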

> From my current experience under Scilab (a Matlab-like scientific solver), the workflow is well known:
>
> Step 1: initialize the matrix, e.g. 'np.zeros((n,n))'
> Step 2: record the data
> Step 3: write it into the matrix
>
> I'm obviously influenced by my current experience, but I'm interested in moving to Python and its packages
>
> For huge ASCII files (involving dozens of millions of lines), my strategy is to work by 'blocks', as follows:
>
> Find the line indices of the beginning and the end of one block (this implies that the file is read once)
> Read the block
> (the process is repeated on the other blocks)

Are the blocks intrinsic parts of the file? Or are you just trying to break up the file into fixed-size chunks?

> I tried different codes such as the one below, but each time Python tells me I cannot mix iteration and read methods
>
> #############################################
>
> position = []
> j = 0
> with open(PATH + file_name, "r") as rough_data:
>     for line in rough_data:
>         if my_criteria in line:
>             position.append(j)   ## huge blocks, but limited in number
>         j = j + 1
>
> i = 0
> blockdata = np.zeros((size_block,), dtype=np.float)
> with open(PATH + file_name, "r") as f:
>     for line in itertools.islice(f, 1, size_block):
>         blockdata[i] = float(f.readline())

For what it's worth, this is the line that is causing the error that you describe. When you iterate over the file with the `for line in itertools.islice(f, ...):` loop, you already have the line text. You don't (and can't) call `f.readline()` to get it again. It would mess up the iteration if you did and cause you to skip lines.
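
A sketch of the corrected loop (PATH, file_name and size_block as in the snippet above; the exact islice bounds depend on how many header lines should be skipped):

import itertools
import numpy as np

blockdata = np.zeros(size_block)
with open(PATH + file_name, "r") as f:
    for i, line in enumerate(itertools.islice(f, 1, size_block + 1)):
        blockdata[i] = float(line)   ## the loop variable already holds the text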

By the way, it is useful to help us help you if you copy-paste the exact code that you are running as well as the full traceback instead of paraphrasing the error message.

--
Robert Kern


Re: record data previous to Numpy use

Robert McLeod
While I'm going to bet that the fastest way to build an ndarray from ascii is with an `io.BytesIO` stream, NumPy does have a function to load from text, `numpy.loadtxt`, that works well enough for most purposes.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

It's hard to tell from the original post if the ascii is being continuously generated or not.  If it's being produced in an on-going fashion then a stream object is definitely the way to go, as the array chunks can be produced by `numpy.frombuffer()`.
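
A rough sketch of that streaming idea, under the assumption that the incoming data is binary (np.frombuffer needs raw bytes; ASCII text would still have to be parsed first):

import io
import numpy as np

buf = io.BytesIO()
## stand-ins for chunks arriving from an on-going producer
buf.write(np.arange(5, dtype=np.float64).tobytes())
buf.write(np.arange(5, 10, dtype=np.float64).tobytes())

## view the accumulated bytes as an array without copying
arr = np.frombuffer(buf.getbuffer(), dtype=np.float64)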



Robert


On Wed, Jul 5, 2017 at 3:21 PM, Robert Kern <[hidden email]> wrote:
[...]

Re: record data previous to Numpy use

paul.carrico
In reply to this post by Derek Homeier

Dear All


First of all, thanks for the answers and the information (I'll dig into it), and let me try to add some comments on what I want to do:

  1. My ASCII file mainly contains data (float and int) in a single column
  2. (it is not always the case, but I can easily manage it - as well, I saw I can use the 'split' instruction if necessary)
  3. Comments/text indicate the beginning of a block, immediately followed by the number of sub-blocks
  4. So I need to read/record all the values in order to build a matrix before working on it (using Numpy & vectorization)
    • Columns 2 and 3 have been added for further treatment
    • The '0' values will be treated specifically afterward


Numpy won't be a problem, I guess (I did some basic tests and I'm quite confident about how to proceed), but I'm really blocked on data recording ... I'm trying to find a way to efficiently read and record data in a matrix:

  • avoiding dynamic memory allocation (here using 'append' in the Python sense, not np),
  • dealing with huge ASCII files: the latest file I got contains more than 60 million lines


Please find attached an extract of the input format ('example_of_input'), and the matrix I'm trying to create and manage with Numpy


Thanks again for your time

Paul


#######################################

##BEGIN -> line number x in the original file
42    -> indicates the number of sub-blocks
1     -> number of the 1st sub-block
6     -> gives how many values belong to the sub-block
12
47
2
46
3
51
...
13    -> another type of sub-block, with 25 values
25
15
88
21
42
22
76
19
89
0
18
80
23
38
24
73
20
81
0
90
0
41
0
39
0
77
42    -> another type of sub-block, with 2 values
2
115
109

#######################################

The matrix result:

1 0 0 6 12 47 2 46 3 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 6 3 50 11 70 12 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 8 11 50 3 49 4 54 5 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 8 12 70 11 66 9 65 10 68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 8 2 47 12 68 10 44 1 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 8 5 56 6 58 7 61 11 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 8 11 61 7 60 8 63 9 66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 19 12 47 2 46 3 51 0 13 97 14 92 15 96 0 72 0 48 0 52 0 0 0 0 0 0
9 0 0 19 13 97 14 92 15 96 0 16 86 17 82 18 85 0 95 0 91 0 90 0 0 0 0 0 0
10 0 0 19 3 50 11 70 12 51 0 15 89 19 94 13 96 0 52 0 71 0 72 0 0 0 0 0 0
11 0 0 19 15 89 19 94 13 96 0 18 81 20 84 16 85 0 90 0 77 0 95 0 0 0 0 0 0
12 0 0 25 3 49 4 54 5 57 11 50 0 15 88 21 42 22 76 19 89 0 52 0 53 0 55 0 71
13 0 0 25 15 88 21 42 22 76 19 89 0 18 80 23 38 24 73 20 81 0 90 0 41 0 39 0 77
14 0 0 25 11 66 9 65 10 68 12 70 0 19 78 25 99 26 98 13 94 0 71 0 67 0 69 0 72
...

#######################################

An example of the code I started to write:

# -*- coding: utf-8 -*-
import time, sys, os, re
import itertools
import numpy as np

PATH = str(os.path.abspath(''))
input_file_name = '/example_of_input.txt'

## check if the file exists, then if it's empty or not
if os.path.isfile(PATH + input_file_name):
    if os.stat(PATH + input_file_name).st_size > 0:

        ## go through the file in order to find specific sentences
        ## specific blocks will be defined afterward
        Block_position = []
        j = 0
        with open(PATH + input_file_name, "r") as data:
            for line in data:
                if '##BEGIN' in line:
                    Block_position.append(j)
                j = j + 1

        ## just tests to get all the values
#        i = 0
#        data = np.zeros((505,), dtype=np.int)
#        with open(PATH + input_file_name, "r") as f:
#            for i in range(0, 505):
#                data[i] = int(f.read(Block_position[0]+1+i))
#                print ("i = ", i)
#
#            for line in itertools.islice(f, Block_position[0], 516):
#                data[i] = f.read(0+i)
#                i = i + 1

    else:
        print "The file %s is empty: post-processing cannot be performed!!!\n" % input_file_name
else:
    print "Error: the file %s does not exist: post-processing stops!!!\n" % input_file_name


Re: record data previous to Numpy use

Robert Kern-2
On Thu, Jul 6, 2017 at 1:49 AM, <[hidden email]> wrote:
[...]
> avoiding dynamic memory allocation (here using 'append' in the Python sense, not np),
Although you can avoid some list appending in your case (because the blocks self-describe their length), I would caution you against prematurely avoiding it. It's often the most natural way to write the code in Python, so go ahead and write it that way first. Once you have it working correctly, if it turns out to be too slow or memory-intensive, you can puzzle over how to preallocate the numpy arrays later. But quite often, it's fine. In this case, the reading and handling of the text data itself is probably the bottleneck, not appending to the lists. As I said, Python lists are cleverly implemented to make appending fast. Accumulating numbers in a list and then converting to an array afterwards is a well-accepted numpy idiom.

> dealing with huge ASCII files: the latest file I got contains more than 60 million lines
>
> Please find attached an extract of the input format ('example_of_input'), and the matrix I'm trying to create and manage with Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks. Each block will itself be a list of numpy arrays, which are the sub-blocks themselves. I didn't bother adding the first three columns to the sub-blocks or trying to assemble them all into a uniform-width matrix by padding with trailing 0s. Since you say that the trailing 0s are going to be "specially treated afterwards", I suspect that you can more easily work with the lists of arrays instead. I assume floating-point data rather than trying to figure out whether int or float from the data. The code can handle multiple data values on one line (not especially well-tested, but it ought to work), but it assumes that the number of sub-blocks, the index of the sub-block, and the sub-block size are each on their own line. The code gets a little more complicated if that's not the case.
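
A rough sketch of that kind of block reader (not the actual attachment; it assumes one value per line and the self-describing counts shown in the example above):

import numpy as np

def read_blocks(filename):
    """Parse ##BEGIN-delimited blocks of single-column values.

    Returns a list of blocks; each block is a list of 1-D float
    arrays, one per sub-block.
    """
    blocks = []
    with open(filename) as f:
        line = f.readline()
        while line:
            if line.startswith('##BEGIN'):
                n_sub = int(f.readline())        ## number of sub-blocks
                sub_blocks = []
                for _ in range(n_sub):
                    f.readline()                 ## sub-block index (unused here)
                    count = int(f.readline())    ## how many values follow
                    values = [float(f.readline()) for _ in range(count)]
                    sub_blocks.append(np.array(values))
                blocks.append(sub_blocks)
            line = f.readline()
    return blocks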

--
Robert Kern

read_blocks.py (4K)

Re: record data previous to Numpy use

paul.carrico

Thanks Robert for your effort - I'll have a look at it.

... the goal is to be guided in how to proceed (and to understand), not to have a "ready-made solution" ... but I appreciate it, honestly :-)

Paul


On 2017-07-06 11:51, Robert Kern wrote:
[...]

Re: record data previous to Numpy use

Chris Barker - NOAA Federal
In reply to this post by paul.carrico
OK, you have two performance "issues":

1) memory use: If you need to read a file to build a numpy array, and don't know how big it is when you start, you need to accumulate the values first, and then make an array out of them. And numpy arrays are fixed size, so they cannot efficiently accumulate values.

The usual way to handle this is to read the data into a list with .append() or the like, and then make an array from it. This is quite fast -- lists are fast and efficient to extend. However, you are then storing (at least) a pointer and a Python float object for each value, which is a lot more memory than a single float value in a numpy array, and you need to make the array from it, which means you have the full list and all its Python floats AND the array in memory at once.

Frankly, computers have a lot of memory these days, so this is a non-issue in most cases.

Nonetheless, a while back I wrote an extendable numpy array object to address just this issue. You can find the code on gitHub here:


I have not tested it with recent numpy versions, but I expect it still works fine. It's also py2, but it wouldn't take much to port.

In practice, it uses less memory than the "build a list, then make it into an array" approach, but isn't any faster unless you add (.extend) a bunch of values at once rather than one at a time (if you do it one at a time, the Python-float-to-numpy-float conversion and the function call overhead take just as long).

But it will generally be as fast as or faster than using a list, and use less memory, so it is a fine basis for a big ASCII file reader.
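
A minimal sketch of the grow-by-doubling idea behind such an extendable array (class and parameter names are made up):

import numpy as np

class GrowableArray:
    """Preallocated buffer that doubles in size when full,
    so appends are amortized O(1)."""
    def __init__(self, capacity=1024, dtype=np.float64):
        self._buf = np.empty(capacity, dtype=dtype)
        self._n = 0

    def append(self, value):
        if self._n == len(self._buf):
            self._buf = np.resize(self._buf, 2 * len(self._buf))
        self._buf[self._n] = value
        self._n += 1

    def to_array(self):
        return self._buf[:self._n].copy()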

However, it looks like while your files may be huge, they hold a number of arrays, so each array may not be large enough to bother with any of this.

2) parsing and converting overhead -- for the most part, python/numpy text file reading code reads the text into a python string, converts it to python number objects, then puts them in a list or converts them to native numbers in an array. This whole process is a bit slow (though reading files is slow anyway, so it's usually not worth worrying about, which is why the built-in file reading methods do this). To improve this, you need code that reads the file and parses it in C and puts it straight into a numpy array without passing through Python. This is what the pandas (and I assume astropy) text file readers do.

But if you don't want those dependencies, there is the "fromfile()" function in numpy -- it is not very robust, but if your files are well-formed, then it is quite fast. So your code would look something like:

with open(the_filename) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        # work with line to figure out the next block
        if ready_to_read_a_block:
            arr = np.fromfile(infile, dtype=np.int32, count=num_values, sep=' ')
            # sep specifies that you are reading text, not binary!
            arr.shape = the_shape_it_should_be


But Robert is right -- get it to work with the "usual" methods -- i.e. put numbers in a list, then make an array out of it -- first, and then worry about making it faster.

-CHB


On Thu, Jul 6, 2017 at 1:49 AM, <[hidden email]> wrote:
[...]




--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: record data previous to Numpy use

paul.carrico

Thanks all for your advice


Well, many things to look into, but it's obvious now that I first have to work on a (better) strategy than the one I was considering previously (i.e. loading all the files and results in one step).


It is just a thought, but for huge files one solution might be to first split/write/build the array in a dedicated file (2 O(n) passes - one to identify the block sizes, an additional one to read and write), and then to load it into memory and work with numpy - at this stage the dimensions are known, and some packages will be fast and better suited (pandas or astropy, as suggested).


Thanks all for your time and help


Paul





Re: record data previous to Numpy use

Robert Kern-2
In reply to this post by paul.carrico
On Thu, Jul 6, 2017 at 3:19 AM, <[hidden email]> wrote:
>
> Thanks Robert for your effort - I'll have a look at it.
>
> ... the goal is to be guided in how to proceed (and to understand), not to have a "ready-made solution" ... but I appreciate it, honestly :-)

Sometimes it's easier to just write the code than to try to explain in prose what to do. :-)

--
Robert Kern


Re: record data previous to Numpy use

Chris Barker - NOAA Federal
In reply to this post by paul.carrico
On Thu, Jul 6, 2017 at 10:55 AM, <[hidden email]> wrote:

It is just a thought, but for huge files one solution might be to first split/write/build the array in a dedicated file (2 O(n) passes - one to identify the block sizes, an additional one to read and write), and then to load it into memory and work with numpy -


I may have your use case confused, but if you have a huge file with multiple "blocks" in it, there shouldn't be any problem with loading it in one go -- start at the top of the file and load one block at a time (accumulating in a list) -- then you only have the memory overhead issues for one block at a time, should be no problem.

at this stage the dimensions are known, and some packages will be fast and better suited (pandas or astropy, as suggested).

pandas at least is designed to read variations of CSV files; I'm not sure whether you could use the optimized part to read an array out of part of an open file from a particular point or not.

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[hidden email]


Re: record data previous to Numpy use

Derek Homeier
On 7 Jul 2017, at 1:59 am, Chris Barker <[hidden email]> wrote:

>
> On Thu, Jul 6, 2017 at 10:55 AM, <[hidden email]> wrote:
> It is just a thought, but for huge files one solution might be to first split/write/build the array in a dedicated file (2 O(n) passes - one to identify the block sizes, an additional one to read and write), and then to load it into memory and work with numpy -
>
> I may have your use case confused, but if you have a huge file with multiple "blocks" in it, there shouldn't be any problem with loading it in one go -- start at the top of the file and load one block at a time (accumulating in a list) -- then you only have the memory overhead issues for one block at a time, should be no problem.
>
> at this stage the dimensions are known, and some packages will be fast and better suited (pandas or astropy, as suggested).
>
> pandas at least is designed to read variations of CSV files; I'm not sure whether you could use the optimized part to read an array out of part of an open file from a particular point or not.
>
The fragmented structure indeed would probably be the biggest challenge, although astropy,
while it cannot read from an open file handle, at least should be able to directly parse a block
of input lines, e.g. collected with readline() in a list. Guess pandas could do the same.
Alternatively the line positions of the blocks could be directly passed to the data_start and
data_end keywords, but that would require opening and at least partially reading the file
multiple times. In fact, if the blocks are relatively small, the overhead may be too large to
make it worth using the faster parsers - if you look at the timing notebooks I had linked to
earlier, it takes at least ~100 input lines before they show any speed gains over genfromtxt,
and ~1000 to see roughly linear scaling. In that case writing your own customised reader
could be the best option after all.
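
A hedged sketch of the two options mentioned ('blocks.txt', start and stop are illustrative; data_start/data_end count lines of the data section):

from astropy.io import ascii

## (a) parse a block of already-collected lines
with open('blocks.txt') as f:
    lines = f.readlines()
block = ascii.read(lines[start:stop], format='no_header')

## (b) let the reader skip to the block itself (re-reads the file each time)
block = ascii.read('blocks.txt', format='no_header',
                   data_start=start, data_end=stop)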

Cheers,
                                        Derek

Re: record data previous to Numpy use

paul.carrico

Hi (all)


Once again I would like to thank the community for the support.

I'm progressing in moving my code to Python.

In my mind some parts remain quite ugly (and burn my eyes), but it works and I'll optimize it in the future; so far I can work with the data in a single read.

I built some blocks in a text file and used Astropy to read it (works fine now - I'll test pandas as a next step).

Not finished yet, but significant progress compared to yesterday :-)

Have a good weekend


Paul

PS: I'd like to use the following code, which is much more familiar to me :-)

COMP_list = np.asarray(COMP_list, dtype=np.float64)
i = np.arange(1, NumberOfRecords, 2)
COMP_list = np.delete(COMP_list, i)


On 2017-07-07 12:04, Derek Homeier wrote:
[...]

Re: record data previous to Numpy use

Derek Homeier
On 07 Jul 2017, at 4:24 PM, [hidden email] wrote:
>
> PS: I'd like to use the following code, which is much more familiar to me :-)
>
> COMP_list = np.asarray(COMP_list, dtype=np.float64)
> i = np.arange(1, NumberOfRecords, 2)
> COMP_list = np.delete(COMP_list, i)
>
Not sure about the background of this, but if you want to remove every second entry
(if NumberOfRecords is the full length of the list, that is), it would always be preferable
to make changes to the list, or even better, extract only the entries you want:

COMP_list = np.asarray(COMP_list[::2], dtype = np.float64)
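
For illustration, a quick check (with made-up data) that the slice gives the same result without building the full-size array first:

import numpy as np

COMP_list = [1.0, 99, 2.0, 99, 3.0, 99]           ## made-up values

via_delete = np.delete(np.asarray(COMP_list, dtype=np.float64),
                       np.arange(1, len(COMP_list), 2))
via_slice = np.asarray(COMP_list[::2], dtype=np.float64)

assert (via_delete == via_slice).all()            ## both give [1. 2. 3.]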

Have a good weekend

                                        Derek
