[R] function on factors - how best to proceed

Gustaf Rydevik gustaf.rydevik at gmail.com
Wed Sep 19 14:43:22 CEST 2007


On 9/19/07, Karin Lagesen <karin.lagesen at medisin.uio.no> wrote:
>
> Sorry about this one being long, and I apologise beforehand if there
> is something obvious here that I have missed. I am new to creating my
> own functions in R, and I am uncertain of how they work.
>
> I have a data set that I have read into a data frame:
>
> > gctable[1:5,]
>      refseq geometry X60_origin X60_terminus  length  kingdom
> 1 NC_009484      cir    1790000       773000 3389227 Bacteria
> 2 NC_009484      cir    1790000       773000 3389227 Bacteria
> 3 NC_009484      cir    1790000       773000 3389227 Bacteria
> 4 NC_009484      cir    1790000       773000 3389227 Bacteria
> 5 NC_009484      cir    1790000       773000 3389227 Bacteria
>                   grp feature gene begin dir gc_content replicor LEADLAG
> 1 Alphaproteobacteria     CDS  CDS   261   +   0.654244    RIGHT    LEAD
> 2 Alphaproteobacteria     CDS  CDS  1737   -   0.651408    RIGHT     LAG
> 3 Alphaproteobacteria     CDS  CDS  2902   +   0.607843    RIGHT    LEAD
> 4 Alphaproteobacteria     CDS  CDS  3693   +   0.617647    RIGHT    LEAD
> 5 Alphaproteobacteria     CDS  CDS  4227   +   0.699208    RIGHT    LEAD
> >
>
> Most of these columns are factors.
>
> Now, I have a function that I would like to employ on this data
> frame. Right now I cannot get it to work, and that seems to be due to
> the columns in the data frame being factors. I tested it with a data
> frame created from vectors, and it worked fine.
>
> The function:
>
> percentdistance <- function(origin, terminus, length, begin, replicor){
> print(c(origin, terminus, length, begin, repl))
> d = 0
> if (terminus>origin) {
>   if(replicor=="LEFT") {
>     d = -((origin-begin)%%length)
>   }
> else {
>     d = (begin-origin)
>   }
> }
> else {
>   if (replicor=="LEFT") {
>     d=(origin-begin)
>   }
>   else{
>     d = -((begin-origin)%%length)
>   }
> }
> d/length*2
> }
>
> The error I get:
> > percentdistance(gctable$X60_origin, gctable$X60_terminus, gctable$length, gctable$begin, gctable$replicor)
>     [1]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>    [19]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>    [37]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>    [55]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>    [73]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>    [91]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>   [109]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
>   [127]  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87  87
> .....[99919]   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
> [99937]   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
> [99955]   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
> [99973]   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
> [99991]   2   2   2   2   2   2   2   2   2
>  [ reached getOption("max.print") -- omitted 8526091 entries ]]
> Error in if (terminus > origin) { : missing value where TRUE/FALSE needed
> In addition: Warning messages:
> 1: > not meaningful for factors in: Ops.factor(terminus, origin)
> 2: the condition has length > 1 and only the first element will be used in: if (terminus > origin) {
> >
>
> This worked nice when the input were columns from a data frame created
> from vectors.
>
> I have also tried the different apply-functions, although I am
> uncertain of which one would be appropriate here.
>
>
...
>
> Karin
> --
> Karin Lagesen, PhD student
> karin.lagesen at medisin.uio.no
> http://folk.uio.no/karinlag


Hej Karin!

A couple of things:
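First, the first warning message tells you:
1: > not meaningful for factors in: Ops.factor(terminus, origin)

So terminus and origin are factors, and ">" is not defined for
(unordered) factors: the comparison returns NA, which is why "if" then
fails with "missing value where TRUE/FALSE needed". You have to convert
them to numeric variables first (see the R FAQ, "How do I convert
factors to numeric?").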
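
As a rough illustration of that conversion (a minimal sketch; the toy
factor origin_f is made up here and only stands in for a column such
as gctable$X60_origin):

```r
# Hypothetical toy factor, standing in for a factor column of the data frame
origin_f <- factor(c("1790000", "773000", "1790000"))

# as.numeric() applied directly to a factor returns the internal level
# codes, not the numbers the labels spell out
codes <- as.numeric(origin_f)

# Going through the level labels recovers the actual values
values <- as.numeric(as.character(origin_f))

# Equivalent form that converts each level only once
values2 <- as.numeric(levels(origin_f))[origin_f]

print(codes)
print(values)
```

The small integers printed by the function in the original post (87,
2, ...) look like such level codes, which would be consistent with the
columns having been read in as factors.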

The second warning message tells you:
2: the condition has length > 1 and only the first element will be
used in: if (terminus > origin)

You are comparing two vectors, which produces a vector of TRUE/FALSE values.
The "if" statement needs a single TRUE/FALSE value.
Either use a for loop:
for (i in seq_along(terminus)) { if (terminus[i] > origin[i]) ... }
or a nested ifelse() call (which is preferable on such a big data set).
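
A nested ifelse() version could look roughly like the sketch below. It
keeps the branch logic of the posted function unchanged and assumes
the numeric columns have already been converted from factors; the name
percentdistance_vec is made up for this example.

```r
# Vectorised sketch of the posted function (hypothetical name).
# Assumes origin, terminus, length and begin are numeric vectors of
# equal length; replicor may be a factor or a character vector.
percentdistance_vec <- function(origin, terminus, length, begin, replicor) {
  left <- as.character(replicor) == "LEFT"
  d <- ifelse(terminus > origin,
              # terminus lies after origin
              ifelse(left, -((origin - begin) %% length), begin - origin),
              # terminus lies before (or at) origin
              ifelse(left, origin - begin, -((begin - origin) %% length)))
  d / length * 2
}

# Small made-up inputs, one row per branch
res <- percentdistance_vec(
  origin   = c(10, 10, 50, 50),
  terminus = c(50, 50, 10, 10),
  length   = c(100, 100, 100, 100),
  begin    = c(30, 5, 20, 70),
  replicor = c("RIGHT", "LEFT", "LEFT", "RIGHT")
)
print(res)
```

On the real data this would be called with the converted columns,
e.g. as.numeric(as.character(gctable$X60_origin)) and so on, and
returns one value per row without any explicit loop.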


best,

Gustaf


-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address:Essingetorget 40,112 66 Stockholm, SE
skype:gustaf_rydevik


