Data filtering with np.genfromtxt

Éric Depagne-2

Hi all,

I am reading a large CSV file (8.5 million lines, 216 columns) using genfromtxt.

I don't need all 216 columns, so I select the ones I want using the "usecols" and "converters" parameters.

That works very well, but in my original large file, not all of the columns I extract are filled with values. As expected in these cases, genfromtxt fills the missing entries with nan, so the final array has rows that contain nans.

I'd like to know if there is a way to filter out, at the genfromtxt level, the lines that contain these nans, so that they do not appear in my final array.

I'd like to have something like:

genfromtxt extracts the line using the parameters I need.

If the extracted line contains a NaN, do nothing and process the next line.

If it has no NaNs, add it to the output array as usual.
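
One way to approximate the behaviour described above is to filter the raw text lines before genfromtxt sees them, since genfromtxt accepts a generator of strings as input. The following is only a sketch under assumptions: the helper name complete_lines and the sample data are hypothetical, the file is assumed to be plain comma-separated text without quoted fields, and a "missing" value is assumed to be an empty field (values that fail to parse would still become nan).

```python
import io
import numpy as np

def complete_lines(lines, usecols, delimiter=","):
    """Yield only the lines whose selected fields are all non-empty.

    Hypothetical helper: assumes no quoted fields containing the delimiter.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if all(fields[i].strip() for i in usecols):
            yield line

# Tiny stand-in for the real file (made-up data).
raw = io.StringIO("1,2,3\n4,,6\n7,8,\n10,11,12\n")
usecols = (0, 1)

# genfromtxt only ever receives lines that are complete in the chosen columns.
data = np.genfromtxt(complete_lines(raw, usecols), delimiter=",", usecols=usecols)
```

Here the second line is skipped because column 1 is empty, while the third line is kept because its empty field is in a column that is not selected.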

I could of course drop every row that contains a nan from the array returned by genfromtxt() (x[~np.isnan(x).any(axis=1)] does it nicely), but I'd like to control the size of the output array.
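
For reference, a small illustration of that post-filtering step on a made-up array (the values are hypothetical). It shows why the final size is hard to predict: three rows are read, but only two survive.

```python
import numpy as np

# Made-up array standing in for the genfromtxt() result.
x = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# True for rows without any nan, False otherwise.
keep = ~np.isnan(x).any(axis=1)
clean = x[keep]
```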

The idea is to get, for instance, the first 10000 (or any number of) lines of the input file in which all the columns I need are filled, not just the first 10000 lines.
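
A possible sketch of that "first N complete lines" idea combines a lazy line filter with itertools.islice, so reading stops as soon as enough complete lines have been found. The names n_wanted and complete, the sample data, and the assumption that every field must be non-empty (plain comma-separated text, no quoting) are all illustrative.

```python
import io
from itertools import islice
import numpy as np

# Made-up input: lines 2 and 3 each have an empty field.
raw = io.StringIO("1,2\n3,\n,4\n5,6\n7,8\n9,10\n")
n_wanted = 3  # hypothetical target number of complete rows

# Lazily keep only lines in which every field is non-empty.
complete = (
    line for line in raw
    if all(field.strip() for field in line.rstrip("\n").split(","))
)

# islice stops pulling lines once n_wanted complete ones were produced.
first = np.genfromtxt(islice(complete, n_wanted), delimiter=",")
```

Because both the generator and islice are lazy, the rest of the file is never parsed once the requested number of rows has been collected.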

I've found a few examples on Stack Overflow that do some filtering, but none of them operate on the extracted lines.

Any help appreciated.

Éric.

--

An AZERTY keyboard is worth two

----------------------------------------------------------

Éric Depagne

_______________________________________________
NumPy-Discussion mailing list
https://mail.python.org/mailman/listinfo/numpy-discussion