[R] Strange column shifting with read.table

James Pirruccello james.pirruccello at gmail.com
Mon Aug 3 02:37:48 CEST 2009


To add to Rolf's point, one tool for multiple imputation in R is
aregImpute() in Frank Harrell's Hmisc package.
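
As a rough sketch of how that might look (the data frame and the
columns hours and score are made up for illustration, and I am assuming
Hmisc is installed; this is not a recommendation for any particular
settings):

```r
library(Hmisc)  # assumes the Hmisc package is installed

set.seed(1)
n <- 200
# Made-up data: 'hours' and 'score' are hypothetical predictors
d <- data.frame(hours = rnorm(n, 10, 2), score = rnorm(n, 70, 10))
d$past_gpa <- 2 + 0.02 * d$score + rnorm(n, sd = 0.3)
d$past_gpa[sample(n, 30)] <- NA   # knock out 30 values at random

# Multiple imputation: 5 imputed values for each missing past_gpa
fit <- aregImpute(~ past_gpa + hours + score, data = d,
                  n.impute = 5, pr = FALSE)

# One row per missing value, one column per imputation
dim(fit$imputed$past_gpa)

# Build one completed data set from the first imputation
imputed <- impute.transcan(fit, imputation = 1, data = d,
                           list.out = TRUE, pr = FALSE, check = FALSE)
completed <- d
completed[names(imputed)] <- imputed
```

With several imputations you would presumably repeat the downstream
fit on each completed data set rather than rely on a single fill-in.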

I am not sure if the discussion of past GPA as the missing variable is
literal or merely illustrative. If literal, is the GPA missing because
it was not reported (i.e., it exists but was not reported), or because
it does not exist? If the latter, you may wish to analyze the
individuals with no prior GPA separately, since that seems to be a
profound difference.
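
A minimal sketch of that split in base R, using a hypothetical data
frame (the names students, past_gpa and score are only for
illustration):

```r
# Hypothetical data: past_gpa is NA for students with no prior record
students <- data.frame(
  id       = 1:6,
  past_gpa = c(3.2, NA, 2.8, NA, 3.9, 1.5),
  score    = c(78, 65, 70, 81, 92, 55)
)

no_history <- is.na(students$past_gpa)

# Students with a prior GPA: a model here can use past_gpa
with_gpa <- students[!no_history, ]

# Students without one: drop the column and model them separately
without_gpa <- students[no_history, names(students) != "past_gpa"]
```

Each subset is then free of NAs in the columns it keeps, so it could be
passed to an SVM without any fill-in value.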

Regards,

James



On Aug 2, 2009, at 8:22 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:

>
> On 3/08/2009, at 11:32 AM, Noah Silverman wrote:
>
>> Rolf,
>>
>> Point taken.
>>
>> However, some of the variables in the experiment simply don't have  
>> data for some of the examples.
>>
>> Since I'm training an SVM that will complain about an NA, how do
>> you suggest I handle this?
>>
>>
>> Imagine a model predicting student performance/grades/whatever.
>>
>> One variable might be "past_gpa".
>>
>> If we have some students with no history, what do you put for that
>> column?  NA is more "correct", but won't work with an SVM.
>>
>> I'm always happy to learn...
>
> I know next to nothing about support vector machines.  Despite my
> ignorance I remain suspicious of the concept.  I suspect that
> fortune("machine learning") is relevant.
>
> If you have a data set that contains intrinsic NAs and you wish to
> apply SVM methods to these data, then you will need to understand how
> SVMs work and decide what *should* be done to handle these NAs.  My
> vague understanding is that SVM tries to build pairs of hyperplanes,
> as widely separated as possible, between classes of data.  This
> requires that each datum be representable as a point in n-dimensional
> space.  A datum one of whose entries is NA is not (really) such a
> point.  Moreover it sure as hell isn't the same as the point produced
> by replacing that NA by 0.
>
> To take your example involving past_gpa --- a student who has no
> past gpa is very likely to be very different from a student who has
> previously studied and failed everything!
>
> What you need is a *metric* which tells you the distance between a
> point with an NA in it and another point.  The other point may have
> no NAs amongst its coordinates, or it might have an NA in a
> *different* coordinate.  I.e. you need to define a distance between
> points, some of whose coordinates may be missing, in a *meaningful*
> way.
>
> After doing that, you will need (!!!) to adapt the SVM software to
> work with this new metric/distance instead of the Euclidean metric.
> This may possibly all have been done already by someone, somewhere.
> I dunno.
>
> Of course your proposed technique of replacing NAs by zeroes does
> define a distance between such points.  But I doubt me an it be
> meaningful.
>
> OTOH how meaningful is the Euclidean metric between points whose
> entries are numeric but in completely unrelated units (gpa, age,
> weight, income, ...) ???
>
> I'm sure this is little-to-no help in reality.  But I suspect that
> little-to-no help is possible.
>
> A thought that just occurred to me:  there ***might*** be some
> mileage in trying to "impute" values for the NAs in your data.
> However sensible imputation requires (so I believe) pretty stringent
> conditions --- like multivariate Gaussianity? --- on your data, which
> are unlikely to be satisfied.  (Else why are you using SVM techniques
> in the first place?)  Frank Harrell might have something useful --- or
> caustic (or both) --- to say on this issue.
>
>    cheers,
>
>        Rolf Turner