`missing` argument in genfromtxt only a string?


jseabold
Is there a reason that the missing argument in genfromtxt only takes a string?

For instance, I have a dataset that in most columns has a zero for
some observations, while for other observations the field was just left
blank, which is the equivalent of zero.  I would like to set all of the
missing values to 0 (the default is -1 now) when loading the data.  I
suppose I could do this with a converter, but I have too many columns
for that.

Before I try to work on a patch, I'd just like to know if I'm missing
something. Maybe there's already a way to do this (without using a
mask)?

-Skipper
_______________________________________________
NumPy-Discussion mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: `missing` argument in genfromtxt only a string?

jseabold
On Sun, Sep 13, 2009 at 1:29 PM, Skipper Seabold <[hidden email]> wrote:

> Is there a reason that the missing argument in genfromtxt only takes a string?
>
> For instance, I have a dataset that in most columns has a zero for
> some observations but in others it was just left blank, which is the
> equivalent of zero.  I would like to set all of the missing to 0 (it
> defaults to -1 now) when loading in the data.  I suppose I could do
> this with a converter, but I have too many columns for this.
>
> Before I try to work on a patch, I'd just like to know if I'm missing
> something, maybe there's already a way to do this (without using a
> mask)?
>
> -Skipper
>

To be a little more concrete, here are the two problems I am having right now.

from StringIO import StringIO
import numpy as np

s = StringIO('D01N01,10/1/2003  ,1, 1,  0, 400, 600,0,   0,  0,0,0,
0,0,0,    0,   0,0,0,   0,0,0,0,0,0,   0,   0,0,   0,0,   0,0,0,3,0,
50,  80,0,  0,0,0,0,0, 4,0, 3380, 1070,   0,  0,  0,0,0,0,1,0, 600,
900,0,   0,    0,0,0,0, 0,0,   0,   0,0,0,  0,0,  0,0, 0,0,   0,
0,0,0,0,  0,0,0,2,0,1000, 900,0,   0,   0,0,0,0,0,0,   0,   0,0,0,
0,0,0,0,0,0,   0,   0,0,0,0,0,0,0,0,0,  0,  0,0,  0,0,0,0,0,0,0,   0,
 0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,    0,    0,0,   0,0,0,0,0,1,0, 500, 800,0,  0,
0,0,0,0,0,0,    0,    0,0,   0,   0,0,0,0,1,  0,  300,    0,0,   0,
0,0,0,0, 1,0, 1600,  900,   0,   0,0,0,   0,0,0,0,     0,
0,0,0,0,0,0,0,0,0,    0,    0,0,0,0,0,0,0, 0,0,    0,   0,
0,0,0,0,0,0,0,0,   0,   0,0,0,0,0,0,0, 0, 0,0,0,0, 0,0, 0,0,0,0,0,
0,0,0,0,0,0,0,0, 0,0,   0,  0,0, 0,0,0,0,0, 0,0,
0,0,0,0,0,0,0,0,0,0,   0,   0,0,0,0,0,0,0,0,0,  0,  0,0,0,0,0,0,0,0,0,
 0,0,0,  0,0,0,0,0, 0,0,    0,    0,    0,    0,0,   0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\r\nL24U05,12/25/2003
,2,  ,   ,    ,    , ,    ,   , , ,   , , ,     ,    , , ,    , , , ,
, ,    ,    , ,    , ,    , , , , ,   ,    , ,   , , , , ,  , ,     ,
   ,    ,   ,   , , , , , ,    ,    , ,    ,     , , , ,  , ,    ,
, , ,   , ,   , ,  , ,    ,    , , , ,   , , , , ,    ,    , ,    ,
, , , , , ,    ,    , , ,   , , , , , ,    ,    , , , , , , , , ,   ,
 , ,   , , , , , , ,    ,    , , , , , , , , ,   , , , , , , , , , , ,
, , , , , , , , ,    , , , , , , , , , ,     ,     , ,    , , , , , ,
,    ,    , ,   ,    , , , ,0,0,    0,    0,0,   0,   0,0,0,0, ,   ,
  ,     , ,    ,   , , , ,  , ,     ,     ,    ,    , , ,    , , , ,
   ,     , , , , , , , , ,     ,     , , , , , , ,  , ,     ,    ,
, , , , , , , ,    ,    , , , , , , ,  ,  , , , ,  , ,  , , , , ,   ,
, , , , , , ,  , ,    ,   , ,  , , , , ,  , ,     , , , , , , , , , ,
  ,    , , , , , , , , ,   ,   , , , , , , ,0,0,  0,0,0,  0,0,0,0,0,
, ,     ,     ,     ,     , ,    ,   , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, , , , \r\n')

data = np.genfromtxt(s, dtype=None, delimiter=",", names=None)

All of the missing values in the second observation are now -1.  Also,
I'm having trouble defining a converter for my dates.

I have the function

from datetime import datetime

def str2date(date):
    day,month,year = date.strip().split('/')
    return datetime(*map(int, [year, month, day]))

conv = {1 : lambda s: str2date(s)}
s.seek(0)
data = np.genfromtxt(s, dtype=None, delimiter=",", names=None, converters=conv)

I get

/usr/local/lib/python2.6/dist-packages/numpy/lib/io.pyc in genfromtxt(fname, dtype, comments, delimiter, skiprows, converters, missing, missing_values, usecols, names, excludelist, deletechars, case_sensitive, unpack, usemask, loose)
    990         if dtype is None:
    991             for (converter, item) in zip(converters, values):
--> 992                 converter.upgrade(item)
    993         # Store the values
    994         append_to_rows(tuple(values))

/usr/local/lib/python2.6/dist-packages/numpy/lib/_iotools.pyc in upgrade(self, value)
    469             # Raise an exception if we locked the converter...
    470             if self._locked:
--> 471                 raise ValueError("Converter is locked and cannot be upgraded")
    472             _statusmax = len(self._mapper)
    473             # Complains if we try to upgrade by the maximum

ValueError: Converter is locked and cannot be upgraded

Does anyone know what I'm doing wrong?

Thanks,

Skipper

Re: `missing` argument in genfromtxt only a string?

Pierre GM-2

On Sep 13, 2009, at 3:51 PM, Skipper Seabold wrote:

> On Sun, Sep 13, 2009 at 1:29 PM, Skipper Seabold  
> <[hidden email]> wrote:
>> Is there a reason that the missing argument in genfromtxt only  
>> takes a string?

Because we check strings. Note that you can specify several strings
at once, provided they're separated by commas, like missing="0,nan,n/a".

>> For instance, I have a dataset that in most columns has a zero for
>> some observations but in others it was just left blank, which is the
>> equivalent of zero.  I would like to set all of the missing to 0 (it
>> defaults to -1 now) when loading in the data.  I suppose I could do
>> this with a converter, but I have too many columns for this.

OK, I see. Gonna try to find some fix.

> All of the missing values in the second observation are now -1.  Also,
> I'm having trouble defining a converter for my dates.
>
> I have the function
>
> from datetime import datetime
>
> def str2date(date):
>    day,month,year = date.strip().split('/')
>    return datetime(*map(int, [year, month, day]))
>
> conv = {1 : lambda s: str2date(s)}
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=",", names=None,  
> converters=conv)

OK, I see the problem...
When no dtype is defined, we try to guess what a converter should
return by testing its inputs. At first we check whether the input is a
boolean, then whether it's an integer, then a float, and so on. When
you explicitly define a converter, there's no need for all those
checks, so we lock the converter to a particular state, which sets the
conversion function and the value to return in case of missing data.
Except that I messed it up and it fails in that case (the conversion
function is set properly, but the dtype of the output is still
undefined). That's a bug, I'll try to fix that once I've tamed my snow
kitten.
Meanwhile, you can use tsfromtxt (in scikits.timeseries), or even
simpler, define a dtype for the output (you know that your first
column is a str, your second an object, and the others ints or floats...).
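
The guess-then-lock behaviour described above can be sketched in plain Python. This is only a toy model written for illustration: the class name `SketchConverter` and its ladder are assumptions, the boolean step is left out for brevity, and none of it is NumPy's actual converter code. It shows why a converter that was locked by a user-supplied function refuses to be upgraded.

```python
class SketchConverter:
    """Toy model of a converter that is upgraded until it fits its column."""

    LADDER = (int, float, str)  # tried in order, most restrictive first

    def __init__(self, func=None):
        # A user-supplied function locks the converter: no guessing allowed.
        self.locked = func is not None
        self.func = func if func is not None else self.LADDER[0]

    def upgrade(self, value):
        """Step along the ladder until `value` converts; return the result."""
        while True:
            try:
                return self.func(value.strip())
            except ValueError:
                if self.locked:
                    raise ValueError("Converter is locked and cannot be upgraded")
                position = self.LADDER.index(self.func)
                if position + 1 == len(self.LADDER):
                    raise
                self.func = self.LADDER[position + 1]


# Guessing: the column starts as int and is upgraded once a float shows up.
guessed = SketchConverter()
for field in ["1", " 400", "2.5"]:
    guessed.upgrade(field)
print(guessed.func.__name__)
```

In this sketch a locked converter only complains when its function fails on a value; the bug reported in the thread is a different situation, where the upgrade step was attempted on a locked converter at all.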




Re: `missing` argument in genfromtxt only a string?

jseabold
On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM <[hidden email]> wrote:

>
> On Sep 13, 2009, at 3:51 PM, Skipper Seabold wrote:
>
>> On Sun, Sep 13, 2009 at 1:29 PM, Skipper Seabold
>> <[hidden email]> wrote:
>>> Is there a reason that the missing argument in genfromtxt only
>>> takes a string?
>
> Because we check strings. Note that you can specify several characters
> at once, provided they're separated by a comma, like missing="0,nan,n/a"
>
>>> For instance, I have a dataset that in most columns has a zero for
>>> some observations but in others it was just left blank, which is the
>>> equivalent of zero.  I would like to set all of the missing to 0 (it
>>> defaults to -1 now) when loading in the data.  I suppose I could do
>>> this with a converter, but I have too many columns for this.
>
> OK, I see. Gonna try to find some fix.
>

I actually figured out a workaround with converters, since my missing
values are " ", "  ", "   ", i.e., an irregular number of spaces, and the
values aren't stripped of whitespace.  I just define {# : lambda s:
float(s.strip() or 0)} (where # is the column number) and have a loop
build all of the converters.  I then have to go through and drop the
ones for columns that are supposed to be strings or dates, which is
still pretty tedious: I have a number of datasets like this, they all
contain different data in different orders, and there's no (computer)
logical order to it that I've discovered yet.
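
The converter loop described above might look roughly like the following sketch. The helper names `blank_as_zero` and `build_converters`, the column count, and the skipped column indices are all made up for illustration; the resulting dictionary is the kind of object that would be passed as `converters=` to genfromtxt.

```python
def blank_as_zero(field):
    """Convert a numeric field, treating an empty or all-space field as 0."""
    return float(field.strip() or 0)


def build_converters(n_columns, skip=()):
    """Map every column index that is not in `skip` to `blank_as_zero`."""
    return {i: blank_as_zero for i in range(n_columns) if i not in skip}


# Hypothetical layout: column 0 is an ID string and column 1 is a date,
# so both are left out of the numeric converters.
converters = build_converters(6, skip={0, 1})

print(sorted(converters))
print(converters[3]("   "), converters[3](" 400"))
```

Passing the set of non-numeric columns to the builder avoids the separate pass that drops string and date converters afterwards.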

>> All of the missing values in the second observation are now -1.  Also,
>> I'm having trouble defining a converter for my dates.
>>
>> I have the function
>>
>> from datetime import datetime
>>
>> def str2date(date):
>>    day,month,year = date.strip().split('/')
>>    return datetime(*map(int, [year, month, day]))
>>
>> conv = {1 : lambda s: str2date(s)}
>> s.seek(0)
>> data = np.genfromtxt(s, dtype=None, delimiter=",", names=None,
>> converters=conv)
>
> OK, I see the problem...
> When no dtype is defined, we try to guess what a converter should
> return by testing its inputs. At first we check whether the input is a
> boolean, then whether it's an integer, then a float, and so on. When
> you define explicitly a converter, there's no need for all those
> checks, so we lock the converter to a particular state, which sets the
> conversion function and the value to return in case of missing.
> Except that I messed it up and it fails in that case (the conversion
> function is set properly, but the dtype of the output is still
> undefined). That's a bug, I'll try to fix that once I've tamed my snow
> kitten.

No worries.  I really like genfromtxt (having recently gotten pretty
familiar with it) and would like to help out with extending it towards
these kind of cases if there's an interest and this is feasible.

I tried another workaround for the dates with my converters defined as conv

conv.update({date : lambda s : datetime(*map(int,
s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})

Where `date` is the column that contains a date.  The problem was that
my dates are "mm/dd/yyyy" and datetime needs (yyyy, mm, dd).  It worked
for a test case when my dates were "dd/mm/yyyy" and I just used
reversed, but inside genfromtxt it gave an error about not finding the
day in the third position, even though that lambda function worked for
a test case outside of genfromtxt.
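
As an alternative to reordering the split pieces by hand, the standard library's `datetime.strptime` can parse "mm/dd/yyyy" directly. The sketch below is an illustration only: the function name `parse_us_date` is made up, and it assumes the converter receives a text string.

```python
from datetime import datetime


def parse_us_date(field):
    """Parse a month/day/year date such as '10/1/2003', ignoring padding."""
    return datetime.strptime(field.strip(), "%m/%d/%Y")


print(parse_us_date("10/1/2003  "))
print(parse_us_date("12/25/2003"))
```

A day-first file would only need the format string changed to "%d/%m/%Y".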

> Meanwhile, you can use tsfromtxt (in scikits.timeseries), or even
> simpler, define a dtype for the output (you know that your first
> column is a str, your second an object, and the others ints or floats...
>

I started to look at the timeseries for this, but I installed it
incorrectly and it gave an error about being compiled with the wrong
endianness.  I've since fixed that and will take another look when I
get a chance.

I also tried the new datetime dtype, but I wasn't sure how to do this
without defining the whole dtype.  I have 500 columns that aren't
homogeneous across several datasets, and each one is pretty huge, so
this is tedious and takes some time to read the data (not using a test
case) and see that it didn't work correctly.

Cheers,

Skipper

Re: `missing` argument in genfromtxt only a string?

Pierre GM-2

On Sep 14, 2009, at 10:31 PM, Skipper Seabold wrote:

>
> I actually figured out a workaround with converters, since my missing
> values are " ","  ","   " ie., irregular number of spaces and the
> values aren't stripped of white spaces.  I just define {# : lambda s:
> float(s.strip() or 0)}, and I have a loop build all of the converters,
> but then I have to go through and drop the ones that are supposed to
> be strings or dates, which is still pretty tedious, since I have a
> number of datasets that are like this, but they all contain different
> data in different orders and there's no (computer) logical order to it
> that I've discovered yet.

I understand your frustration... We could think about some kind of  
global default for the missing values...
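
For readers on a recent NumPy: genfromtxt now accepts `missing_values` and `filling_values` arguments, and a single `filling_values` entry acts as the kind of global default discussed here. The sketch below assumes such a release and uses a small made-up sample rather than the dataset from this thread; `autostrip=True` removes the padding spaces so that blank fields are seen as empty strings.

```python
from io import StringIO

import numpy as np

# Made-up sample: the middle field of the second row is blank.
text = "1, 2,  3\n4,  , 6\n"

data = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    autostrip=True,    # strip whitespace around every field
    filling_values=0,  # value substituted for missing (empty) fields
)
print(data)
```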

> I tried another workaround for the dates with my converters defined  
> as conv
>
> conv.update({date : lambda s : datetime(*map(int,
> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>
> Where `date` is the column that contains a date.  The problem was that
> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
> but gave an error about not finding the day in the third position,
> though that lambda function worked for a test case outside of
> genfromtxt.

Check the archives of the mailing list, there's an example using  
dateutil.parser that may be just what you need.



Re: `missing` argument in genfromtxt only a string?

jseabold
On Mon, Sep 14, 2009 at 10:41 PM, Pierre GM <[hidden email]> wrote:

>
> On Sep 14, 2009, at 10:31 PM, Skipper Seabold wrote:
>>
>> I actually figured out a workaround with converters, since my missing
>> values are " ","  ","   " ie., irregular number of spaces and the
>> values aren't stripped of white spaces.  I just define {# : lambda s:
>> float(s.strip() or 0)}, and I have a loop build all of the converters,
>> but then I have to go through and drop the ones that are supposed to
>> be strings or dates, which is still pretty tedious, since I have a
>> number of datasets that are like this, but they all contain different
>> data in different orders and there's no (computer) logical order to it
>> that I've discovered yet.
>
> I understand your frustration... We could think about some kind of
> global default for the missing values...

I'm not too frustrated, I'd just like to do this as few times as
humanly (or machine-ly, rather) possible in the future...

The main thing I'd like right now, I think, is for whitespace to be
stripped, but maybe there is a good reason that it isn't.  I didn't
realize this was the source of my confusion at first.  Also, just being
able to define missing as a number would be nice.  I started a patch for
this, but I reverted it when I realized I could make the converters as I
did.

While we're on the subject, the other thing on my wishlist (unless I
just don't know how to do this) is being able to define a "column map"
for datasets that have no delimiters.  At first each observation of my
data was just one long string with no gaps or regular breaks, but I
knew which columns had what.  E.g., the first variable was (not
zero-indexed) columns 1-6, the second columns 11-15, the third column
16, etc., so I would just say delimiter = [1:6,11:15,16,...].

>> I tried another workaround for the dates with my converters defined
>> as conv
>>
>> conv.update({date : lambda s : datetime(*map(int,
>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>
>> Where `date` is the column that contains a date.  The problem was that
>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>> but gave an error about not finding the day in the third position,
>> though that lambda function worked for a test case outside of
>> genfromtxt.
>
> Check the archives of the mailing list, there's an example using
> dateutil.parser that may be just what you need.
>

Ah ok.  I looked for a bit, but I was sure I missed something.  Thanks.

Skipper

Re: `missing` argument in genfromtxt only a string?

jseabold
On Mon, Sep 14, 2009 at 10:55 PM, Skipper Seabold <[hidden email]> wrote:

> On Mon, Sep 14, 2009 at 10:41 PM, Pierre GM <[hidden email]> wrote:
>>
>> On Sep 14, 2009, at 10:31 PM, Skipper Seabold wrote:
>>>
>>> I actually figured out a workaround with converters, since my missing
>>> values are " ","  ","   " ie., irregular number of spaces and the
>>> values aren't stripped of white spaces.  I just define {# : lambda s:
>>> float(s.strip() or 0)}, and I have a loop build all of the converters,
>>> but then I have to go through and drop the ones that are supposed to
>>> be strings or dates, which is still pretty tedious, since I have a
>>> number of datasets that are like this, but they all contain different
>>> data in different orders and there's no (computer) logical order to it
>>> that I've discovered yet.
>>
>> I understand your frustration... We could think about some kind of
>> global default for the missing values...
>
> I'm not too frustrated, I'd just like to do this as few times as
> humanly (or machine-ly, rather) possible in the future...
>
> The main thing I'd like right now I think is for whitespace to be
> stripped, but maybe there is a good reason for this.  I didn't realize
> this was the source of my confusion at first.  Also just being able to
> define missing as a number would be nice.  I started a patch for this,
> but I reverted when I realized I could make the converters as I did.
>
> While we're on the subject, the other thing on my wishlist (unless I
> just don't know how to do this) is being able to define a "column map"
> for datasets that have no delimiters.  At first each observation of my
> data was just one long string with no gaps or regular breaks but I
> knew which columns had what.  Eg., the first variable was (not
> zero-indexed) columns 1-6, the second columns 11-15, the third column
> 16, etc.  so I would just say delimiter = [1:6,11:15,16,...].
>

Err, 1-6, 7-10, 11-15, 16...  I need some sleep.


Re: `missing` argument in genfromtxt only a string?

Pierre GM-2
In reply to this post by jseabold

On Sep 14, 2009, at 10:55 PM, Skipper Seabold wrote:
>
> While we're on the subject, the other thing on my wishlist (unless I
> just don't know how to do this) is being able to define a "column map"
> for datasets that have no delimiters.  At first each observation of my
> data was just one long string with no gaps or regular breaks but I
> knew which columns had what.  Eg., the first variable was (not
> zero-indexed) columns 1-6, the second columns 11-15, the third column
> 16, etc.  so I would just say delimiter = [1:6,11:15,16,...].

Fixed-width fields should already be supported. Instead of delimiter=
[1-6, 7-10, 11-15, 16]..., use delimiter=[6, 4, 5, 1] (that is, just
give the widths of the fields).
Note that I wouldn't be surprised at all if it failed for some corner
cases (e.g., if you need to read the name from the first line).
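
Turning a "column map" of 1-indexed, inclusive ranges into the list of widths is simple arithmetic, sketched below. The helper names and the sample line are made up for illustration. Note that the range 11-15 covers five characters, so the ranges 1-6, 7-10, 11-15, 16 correspond to the widths 6, 4, 5, 1.

```python
def ranges_to_widths(ranges):
    """Turn contiguous 1-indexed, inclusive (first, last) ranges into widths."""
    widths = []
    expected_start = 1
    for first, last in ranges:
        if first != expected_start:
            raise ValueError(f"gap or overlap before column {first}")
        widths.append(last - first + 1)
        expected_start = last + 1
    return widths


def split_fixed(line, widths):
    """Cut `line` into consecutive fields of the given widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields


widths = ranges_to_widths([(1, 6), (7, 10), (11, 15), (16, 16)])
print(widths)
print(split_fixed("D01N01200312345X", widths))
```

The resulting list is what would be handed to genfromtxt as `delimiter=widths`.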


Re: `missing` argument in genfromtxt only a string?

jseabold
On Mon, Sep 14, 2009 at 11:40 PM, Pierre GM <[hidden email]> wrote:

>
> On Sep 14, 2009, at 10:55 PM, Skipper Seabold wrote:
>>
>> While we're on the subject, the other thing on my wishlist (unless I
>> just don't know how to do this) is being able to define a "column map"
>> for datasets that have no delimiters.  At first each observation of my
>> data was just one long string with no gaps or regular breaks but I
>> knew which columns had what.  Eg., the first variable was (not
>> zero-indexed) columns 1-6, the second columns 11-15, the third column
>> 16, etc.  so I would just say delimiter = [1:6,11:15,16,...].
>
> Fixed-width fields should already be supported. Instead of delimiter=
> [1-6, 7-10, 11-15, 16]..., use delimiter=[6, 4, 5, 1] (that is, just
> give the widths of the fields).
> Note that I wouldn't be surprised at all if it failed for some corner
> cases (eg, if you need to read the name from the first line).
>

Doh, so it does!  The docstring could probably note this unless I just
missed it somewhere.

Thanks,

Skipper

Re: `missing` argument in genfromtxt only a string?

Pierre GM-2

On Sep 15, 2009, at 12:50 AM, Skipper Seabold wrote:

>>
>> Fixed-width fields should already be supported. Instead of delimiter=
>> [1-6, 7-10, 11-15, 16]..., use delimiter=[6, 4, 5, 1] (that is, just
>> give the widths of the fields).
>> Note that I wouldn't be surprised at all if it failed for some corner
>> cases (eg, if you need to read the name from the first line).
>>
>
> Doh, so it does!  The docstring could probably note this unless I just
> missed it somewhere.

Well, we sure do need some docs and more examples. </wink>


Re: `missing` argument in genfromtxt only a string?

Gael Varoquaux
In reply to this post by jseabold
On Tue, Sep 15, 2009 at 12:50:55AM -0400, Skipper Seabold wrote:
> Doh, so it does!  The docstring could probably note this unless I just
> missed it somewhere.

Hey Skipper,

You sent a patch a while ago to fix a docstring. I am not sure it has
been applied ( :( ).

I just wanted to point out that there is an easy way of making a
difference, and making sure that the docstrings get fixed (which is
indeed very important). If you go to http://docs.scipy.org/, register,
and send your login name to this mailing list, we will add you to the
list of editors, and you will be able to easily edit the docstrings in
scipy SVN.

Cheers,

Gaël

Re: `missing` argument in genfromtxt only a string?

jseabold
On Tue, Sep 15, 2009 at 3:37 AM, Gael Varoquaux
<[hidden email]> wrote:

> On Tue, Sep 15, 2009 at 12:50:55AM -0400, Skipper Seabold wrote:
>> Doh, so it does!  The docstring could probably note this unless I just
>> missed it somewhere.
>
> Hey Skipper,
>
> You sent a patch a while ago to fix a docstring. I am not sure it has
> been applied ( :( ).
>
> I just wanted to point out that there is an easy way of making a
> difference, and making sure that the docstrings get fixed (which is
> indeed very important). If you go to http://docs.scipy.org/ and register,
> send your login name on this mailing list, we will add you to the list of
> editors, and you will be able to edit easily the docstrings of scipy SVN.
>

Yes, of course.  I have a login already, thanks.  How quickly I
forget.  I will have a look at the docs and add some examples.

Skipper

Re: `missing` argument in genfromtxt only a string?

Bruce Southey
In reply to this post by jseabold
On 09/14/2009 09:31 PM, Skipper Seabold wrote:
> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<[hidden email]>  wrote:
>    
[snip]

>> OK, I see the problem...
>> When no dtype is defined, we try to guess what a converter should
>> return by testing its inputs. At first we check whether the input is a
>> boolean, then whether it's an integer, then a float, and so on. When
>> you define explicitly a converter, there's no need for all those
>> checks, so we lock the converter to a particular state, which sets the
>> conversion function and the value to return in case of missing.
>> Except that I messed it up and it fails in that case (the conversion
>> function is set properly, but the dtype of the output is still
>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>> kitten.
>>      
> No worries.  I really like genfromtxt (having recently gotten pretty
> familiar with it) and would like to help out with extending it towards
> these kind of cases if there's an interest and this is feasible.
>
> I tried another workaround for the dates with my converters defined as conv
>
> conv.update({date : lambda s : datetime(*map(int,
> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>
> Where `date` is the column that contains a date.  The problem was that
> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
> but gave an error about not finding the day in the third position,
> though that lambda function worked for a test case outside of
> genfromtxt.
>
>    
>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>      
In SAS there are multiple ways to define formats especially dates:
http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm

It would be nice to accept the common variants (USA vs English dates) as
well as two digit vs 4 digit year codes.



>> or even
>> simpler, define a dtype for the output (you know that your first
>> column is a str, your second an object, and the others ints or floats...
>>
>>      
How do you specify different dtypes in genfromtxt?
I could not see the information in the docstring and the dtype argument
does not appear to allow multiple dtypes.

Bruce


Re: `missing` argument in genfromtxt only a string?

jseabold
On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <[hidden email]> wrote:

> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<[hidden email]>  wrote:
>>
> [snip]
>>> OK, I see the problem...
>>> When no dtype is defined, we try to guess what a converter should
>>> return by testing its inputs. At first we check whether the input is a
>>> boolean, then whether it's an integer, then a float, and so on. When
>>> you define explicitly a converter, there's no need for all those
>>> checks, so we lock the converter to a particular state, which sets the
>>> conversion function and the value to return in case of missing.
>>> Except that I messed it up and it fails in that case (the conversion
>>> function is set properly, but the dtype of the output is still
>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>> kitten.
>>>
>> No worries.  I really like genfromtxt (having recently gotten pretty
>> familiar with it) and would like to help out with extending it towards
>> these kind of cases if there's an interest and this is feasible.
>>
>> I tried another workaround for the dates with my converters defined as conv
>>
>> conv.update({date : lambda s : datetime(*map(int,
>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>
>> Where `date` is the column that contains a date.  The problem was that
>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>> but gave an error about not finding the day in the third position,
>> though that lambda function worked for a test case outside of
>> genfromtxt.
>>
>>
>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>
> In SAS there are multiple ways to define formats especially dates:
> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>
> It would be nice to accept the common variants (USA vs English dates) as
> well as two digit vs 4 digit year codes.
>

This is relevant to what I've been doing.  I parsed a SAS input file
to get the information to pass to genfromtxt, and it might be useful
to have these types defined.  Again, I'm wondering about whether the
new datetime dtype might eventually be used for something like this.

Do you know if SAS publishes the format of its datasets, similar to
Stata?  http://www.stata.com/help.cgi?dta

>
>
>>> or even
>>> simpler, define a dtype for the output (you know that your first
>>> column is a str, your second an object, and the others ints or floats...
>>>
>>>
> How do you specify different dtypes in genfromtxt?
> I could not see the information in the docstring and the dtype argument
> does not appear to allow multiple dtypes.
>

I have also been struggling with this (and with modifying the dtype of
a field in a structured array in place, btw).  To give a quick example,
here are some of the ways that I expected to work but didn't, and a few
ways that do work.

from StringIO import StringIO
import numpy as np

# a few incorrect ones

s = StringIO("11.3abcde")
data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])

In [42]: data
Out[42]: array([ 1,  1, -1])

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])

In [45]: data
Out[45]: array([ 1. ,  1.3,  NaN])

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])

In [48]: data
Out[48]:
array(['1', '1.3', 'abcde'],
      dtype='|S5')

# a few correct ones

s.seek(0)
data = np.genfromtxt(s,
dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
delimiter=[1,3,5])

In [52]: data
Out[52]:
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

s.seek(0)
data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])

In [55]: data
Out[55]:
array((1, 1.3, 'abcde'),
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])

# one I expected to work but have probably made an obvious mistake

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
names=['myint','myfloat','mystring'], delimiter=[1,3,5])

In [64]: data
Out[64]: array([ 1,  1, -1])

# "ugly" way to do this, but it works

s.seek(0)
data = np.genfromtxt(s,
dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
names=['myint','myfloat','mystring'], delimiter=[1,3,5])

In [69]: data
Out[69]:
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
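
For anyone rerunning the inference case (dtype=None) on Python 3 with a current NumPy, a minimal sketch could look like the following. The two-row input and the encoding=None setting are assumptions added for illustration and are not part of the session above.

```python
from io import StringIO  # Python 3 home of StringIO

import numpy as np

# Hypothetical two-row, fixed-width input: field widths of 1, 3 and 5 characters.
s = StringIO("11.3abcde\n24.5fghij")

# dtype=None asks genfromtxt to infer one type per column.
# encoding=None keeps text columns as str rather than bytes.
data = np.genfromtxt(s, dtype=None, delimiter=[1, 3, 5], encoding=None)

print(data.dtype.names)  # auto-generated field names
print(data["f1"])        # the column inferred as float
```

The field names are generated automatically here; passing names=[...] replaces them.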


Skipper

Re: `missing` argument in genfromtxt only a string?

jseabold
On Tue, Sep 15, 2009 at 10:44 AM, Skipper Seabold <[hidden email]> wrote:

> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <[hidden email]> wrote:
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<[hidden email]>  wrote:
>>>
>> [snip]
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>> No worries.  I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kind of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date.  The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>>> but gave an error about not finding the day in the third position,
>>> though that lambda function worked for a test case outside of
>>> genfromtxt.
>>>
>>>
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>
> This is relevant to what I've been doing.  I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined.  Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata?  http://www.stata.com/help.cgi?dta
>
>>
>>
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>
> I have also been struggling with this (and modifying the dtype of
> field in structured array in place, btw).  To give a quick example,
> here are some of the ways that I expected to work and didn't and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1,  1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. ,  1.3,  NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
>      dtype='|S5')
>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
>      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1,  1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>

Btw, you don't have to pass it as a dtype object.  It just needs to be something that gets through

if dtype is not None:
    dtype = np.dtype(dtype)

I would like to see something like this, as it does when dtype is
None, but then we would have to have a type argument, maybe rather
than a dtype argument.

names = ['var1','var2','var3']
type = ['i', 'f', 'str']

dtype = zip(names,type)
if dtype is not None:
   ....

Again, while I'm on it... I noticed that the autostrip argument that
can be provided to _iotools.LineSplitter is always False.  If this
does what I think it does (no time to test yet), it might be nice to
be able to specify it in genfromtxt.
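
The names-plus-types idea above can already be approximated by zipping the two lists into a structured dtype before calling genfromtxt. This is only a sketch: the type codes ("i8", "f8", "U5"), the sample data, and the variable names are assumptions chosen for illustration.

```python
from io import StringIO

import numpy as np

names = ["var1", "var2", "var3"]
types = ["i8", "f8", "U5"]  # assumed type codes, one per column

# A list of (name, type) pairs is one of the forms np.dtype() accepts.
dtype = np.dtype(list(zip(names, types)))

# Hypothetical comma-delimited input with one row per line.
s = StringIO("1,1.3,abcde\n2,4.5,fghij")
data = np.genfromtxt(s, dtype=dtype, delimiter=",", encoding=None)

print(data.dtype)
print(data["var2"])
```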

Skipper

Re: `missing` argument in genfromtxt only a string?

josef.pktd
In reply to this post by jseabold
On Tue, Sep 15, 2009 at 10:44 AM, Skipper Seabold <[hidden email]> wrote:

> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <[hidden email]> wrote:
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<[hidden email]>  wrote:
>>>
>> [snip]
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>> No worries.  I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kind of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date.  The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>>> but gave an error about not finding the day in the third position,
>>> though that lambda function worked for a test case outside of
>>> genfromtxt.
>>>
>>>
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>
> This is relevant to what I've been doing.  I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined.  Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata?  http://www.stata.com/help.cgi?dta
>
>>
>>
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>
> I have also been struggling with this (and modifying the dtype of
> field in structured array in place, btw).  To give a quick example,
> here are some of the ways that I expected to work and didn't and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1,  1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. ,  1.3,  NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
>      dtype='|S5')

These are not problems with genfromtxt; the dtype construction is not
what you think it is. I don't know what the second and third arguments
are.

>>> np.dtype(int,float,str)
dtype('int32')
>>> np.dtype(float,float,str)
dtype('float64')
>>> np.dtype(str,float,str)
dtype('|S0')

I think the versions below are the correct way of specifying a structured dtype.
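
As a sketch of the distinction: going by the documented signature np.dtype(dtype, align=False, copy=False), only the first positional argument describes the type, so extra types passed positionally land in the align and copy slots. The two spellings below both build a three-field structured dtype. The 'S5' code is used here instead of 'a5' as an assumption about what current NumPy versions accept.

```python
import numpy as np

# One comma-separated string: fields get default names f0, f1, f2.
by_string = np.dtype("i8,f8,S5")

# A list of (name, type) pairs: fields get the names you choose.
by_list = np.dtype([("myint", "i8"), ("myfloat", "f8"), ("mystring", "S5")])

print(by_string.names)
print(by_list.names)
```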

Josef



>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
>      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1,  1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
>
> Skipper

Re: `missing` argument in genfromtxt only a string?

Bruce Southey
In reply to this post by jseabold
On 09/15/2009 09:44 AM, Skipper Seabold wrote:

> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey<[hidden email]>  wrote:
>    
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>      
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<[hidden email]>    wrote:
>>>
>>>        
>> [snip]
>>      
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
>>>>          
>>> No worries.  I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kind of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> Where `date` is the column that contains a date.  The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs "yyyy,mm,dd," it worked
>>> for a test case if my dates were "dd/mm/yyyy" and I just use reversed,
>>> but gave an error about not finding the day in the third position,
>>> though that lambda function worked for a test case outside of
>>> genfromtxt.
>>>
>>>
>>>        
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>>>>          
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>>      
> This is relevant to what I've been doing.  I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined.  Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata?  http://www.stata.com/help.cgi?dta
>    
I am not exactly sure what you mean. Most of the type formats are
available under the data set informat statement, but really you need to
address special ones, like defining strings with sufficient length and
time, when reading data. Usually I read dates as strings and then
convert them back to dates as needed, since they are not always correct
and do not always have the same format in the data.
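
A minimal sketch of that read-as-string-then-convert approach, in plain Python, might look like this. The helper name parse_us_date, the "mm/dd/yyyy" format, and the sample values are assumptions for illustration only.

```python
from datetime import datetime


def parse_us_date(text):
    """Return a datetime for 'mm/dd/yyyy' text, or None if it does not parse."""
    try:
        # strip() drops the padding that fixed-width exports leave behind.
        return datetime.strptime(text.strip(), "%m/%d/%Y")
    except ValueError:
        return None


# Hypothetical raw values as they might come out of a string column.
raw = ["10/1/2003  ", "1/15/2004", "not a date"]
parsed = [parse_us_date(r) for r in raw]
print(parsed)
```

Entries that fail to parse come back as None, so bad dates can be inspected afterwards instead of aborting the load.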

SAS is rather complex as it has multiple ways to create what it calls
permanent datasets and these are even incompatible across OS's in the
same version. So really these are not very useful outside of the
specific version of SAS that is being used. There are many ways to
transfer files like using the xport engine that R can read (see
read.xport  in foreign package - has link to format). However, usually
it is just easier to create a new file within SAS.

>    
>>
>>      
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...
>>>>
>>>>
>>>>          
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>>      
> I have also been struggling with this (and modifying the dtype of
> field in structured array in place, btw).  To give a quick example,
> here are some of the ways that I expected to work and didn't and a few
> ways that work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1,  1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. ,  1.3,  NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
>        dtype='|S5')
>
> # correct few
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
>        dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
>        dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1,  1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
>        dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
>
> Skipper
>    
Thanks for these examples, as these make sense now. I was confused
because the display shows the dtype as a list, not as a single dtype.


Bruce

Re: `missing` argument in genfromtxt only a string?

Gael Varoquaux
In reply to this post by jseabold
On Tue, Sep 15, 2009 at 09:22:41AM -0400, Skipper Seabold wrote:
> Yes, of course.  I have a login already, thanks.  How quickly I
> forget.  I will have a look at the docs and add some examples.

Thanks a lot. Such contributions are very valuable to the community.

Gaël

Re: `missing` argument in genfromtxt only a string?

Pierre GM-2
In reply to this post by jseabold

On Sep 15, 2009, at 10:44 AM, Skipper Seabold wrote:
>>>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype  
>> argument
>> does not appear to allow multiple dtypes.

Just give a regular dtype, or something that could be interpreted as  
such. Have a look at
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=
> [1,3,5])

Non-legit at all, but a good idea in that case.

>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])

But this one works:
data=np.genfromtxt(s, dtype=np.dtype("i8,f8,a5"), names=
['myint','myfloat','mystring'], delimiter=[1,3,5])
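
A Python 3 sketch of the same call, for reference. The second data row, the 'U5' string type, and encoding=None are assumptions added so the example runs on a current NumPy and returns str rather than bytes.

```python
from io import StringIO

import numpy as np

# Hypothetical fixed-width input: widths of 1, 3 and 5 characters.
s = StringIO("11.3abcde\n24.5fghij")

# The comma-separated string defines three fields with default names;
# names=[...] then replaces those defaults.
data = np.genfromtxt(
    s,
    dtype="i8,f8,U5",
    names=["myint", "myfloat", "mystring"],
    delimiter=[1, 3, 5],
    encoding=None,
)

print(data.dtype.names)
print(data["myfloat"])
```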

>
> Btw, you don't have to pass it as a dtype.  It just needs to be able  
> to pass
>
> if dtype is not None:
>    dtype = np.dtype(dtype)
>
> I would like to see something like this, as it does when dtype is
> None, but then we would have to have a type argument, maybe rather
> than a dtype argument.

'k. Gonna see what I can do.

> Again, while I'm on it...I noticed the argument to specify the
> autostrip argument that can be provided to _iotools.LineSplitter is
> always False.  If this does, what I think (no time to test yet), it
> might be nice to be able to specify this in genfromtxt.

Would you mind giving me an example of usage with the corresponding  
expected output, so that I can work on it ?




Re: `missing` argument in genfromtxt only a string?

jseabold
On Tue, Sep 15, 2009 at 1:56 PM, Pierre GM <[hidden email]> wrote:

>
> On Sep 15, 2009, at 10:44 AM, Skipper Seabold wrote:
>>>>> How do you specify different dtypes in genfromtxt?
>>> I could not see the information in the docstring and the dtype
>>> argument
>>> does not appear to allow multiple dtypes.
>
> Just give a regular dtype, or something that could be interpreted as
> such. Have a look at
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
>
>> # a few incorrect ones
>>
>> s = StringIO("11.3abcde")
>> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=
>> [1,3,5])
>
> Non-legit at all, but a good idea in that case.
>
>>
>> # one I expected to work but have probably made an obvious mistake
>>
>> s.seek(0)
>> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
>> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> But this one works:
> data=np.genfromtxt(s, dtype=np.dtype("i8,f8,a5"), names=
> ['myint','myfloat','mystring'], delimiter=[1,3,5])
>
>>
>> Btw, you don't have to pass it as a dtype.  It just needs to be able
>> to pass
>>
>> if dtype is not None:
>>    dtype = np.dtype(dtype)
>>
>> I would like to see something like this, as it does when dtype is
>> None, but then we would have to have a type argument, maybe rather
>> than a dtype argument.
>
> 'k. Gonna see what I can do.
>

Oh, given that this works though, I don't think my gripe is that
legitimate.  This is essentially the same thing, I just need to read
up on declaring a dtype and stick some examples in the docstrings, so
I don't forget...

data = np.genfromtxt(s, dtype=np.dtype("i8,f8,a5"),
names=['myint','myfloat','mystring'], delimiter=[1,3,5])

>> Again, while I'm on it...I noticed the argument to specify the
>> autostrip argument that can be provided to _iotools.LineSplitter is
>> always False.  If this does, what I think (no time to test yet), it
>> might be nice to be able to specify this in genfromtxt.
>
> Would you mind giving me an example of usage with the corresponding
> expected output, so that I can work on it ?
>

Sure, I gave a longer example of this in the 2nd email in this
thread, where my "missing" fields were " ,     ,         ,     ,   ,",
i.e., fixed-width white space that I wanted to just strip down to "".
Also, notice that when it reads the date I still have
"mm/dd/yyyy  " with the trailing whitespace.  I don't know how big of
a deal this is, though.  I think you can just define an autostrip
argument in genfromtxt and then use split_line=LineSplitter(...,
autostrip=autostrip).  I haven't tested this yet though.

http://article.gmane.org/gmane.comp.python.numeric.general/32821
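
For reference, current NumPy releases do expose autostrip on genfromtxt, alongside filling_values. The sketch below shows the behaviour being asked for, with blank fields read as 0 and trailing whitespace stripped from the date column. The two-row input and the column types are assumptions made up for illustration, not the dataset from the 2nd email.

```python
from io import StringIO

import numpy as np

# Hypothetical rows: a padded date, then two integer columns with blanks.
raw = StringIO("10/1/2003  ,1,   \n10/2/2003  ,  ,4\n")

data = np.genfromtxt(
    raw,
    dtype="U10,i8,i8",
    delimiter=",",
    autostrip=True,                # strip whitespace around every field
    filling_values={1: 0, 2: 0},   # blanks in columns 1 and 2 become 0
    encoding=None,
)

print(data["f0"])
print(data["f1"], data["f2"])
```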

Skipper