[R] Dataframe of factors transform speed?

Fri Jul 20 09:29:52 CEST 2007

set.seed(123)
## simulate 240000 SNPs x 1000 samples; one genotype is rare at each SNP
genoT = lapply(1:240000, function(i)
    factor(sample(c("AA", "AB", "BB"), 1000,
                  prob=sample(c(1, 1000, 1000), 3), replace=TRUE)))
names(genoT) = paste("snp", 1:240000, sep="")
genoT = as.data.frame(genoT)
dim(genoT)
class(genoT)
## recode each column: AA -> 0, AB -> 1, BB -> 2
system.time(out <- lapply(genoT, function(x)
    match(x, c("AA", "AB", "BB")) - 1))
##
##
    user  system elapsed
119.288   0.004 119.339

(that is for all 240K SNPs)

best,
b

ps: note that "out" is a list.
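
If a data frame is wanted instead of a list, the same match() idea can be
sketched on a toy example. The names geno_small, lev, codes and codes_df,
and the 6 x 3 size, are made up for illustration; they are not taken from
the real data.

```r
# Toy data: 6 samples x 3 SNPs (hypothetical names and values)
geno_small <- data.frame(
  snp1 = factor(c("AA", "AB", "AB", "AA", "AB", "AA")),  # "BB" never observed
  snp2 = factor(c("BB", "AB", "BB", "BB", "AB", "BB")),  # "AA" never observed
  snp3 = factor(c("AA", "AB", "BB", NA,   "AB", "AA"))   # one missing call
)
lev <- c("AA", "AB", "BB")

# match() compares the factor's labels (not its internal codes) against
# the full genotype set, so a dropped level cannot shift the coding
codes <- lapply(geno_small, function(x) match(x, lev) - 1L)

# the list has one element per SNP; as.data.frame() restores the shape
codes_df <- as.data.frame(codes)
str(codes_df)
```

Missing calls stay NA, and every column uses the coding AA = 0, AB = 1,
BB = 2 regardless of which genotypes were actually observed.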

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

> Hi,
>
>> -----Original Message-----
>> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
>> Sent: Friday, July 20, 2007 12:25 AM
>> To: Latchezar Dimitrov
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Dataframe of factors transform speed?
>>
>> it looks like that whatever method you used to genotype the
>> 1002 samples on the STY array gave you a transposed matrix of
>> genotype calls. :-)
>
> It only looks like it :-)
>
> Otherwise it is a correctly created dataframe of 1002 samples X (big
> number) of columns (SNP genotypes). It worked perfectly until I decided
> to put together two cohorts independently processed in R already. I got
> stuck with my lack of foresight. Otherwise I would have put 3 dummy
> lines w/ AA, AB, and BB on each one to make sure all 3 genotypes are
> present and that's it! Lesson for the future :-)
>
> Maybe I am not using columns and rows appropriately here but the
> dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as
> str says 1002 observ. of (big number) vars.
>
>>
>> i'd use:
>>
>> genoT = read.table(yourFile, stringsAsFactors = FALSE)
>>
>> as a starting point... but I don't think that would be
>> efficient (as you'd need to fix one column at a time - lapply).
>
> No, it was not efficient at all. As a matter of fact nothing is more
> efficient than loading already read data, alas :-(
>
>>
>> i'd preprocess yourFile before trying to load it:
>>
>> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
>>
>> and, now, in R:
>>
>> genoT = read.table(outFile, header=TRUE)
>
> ... Too late ;-) As must be clear by now, I have two dataframes I want to
> put together with rbind(geno1,geno2). The issue again is
> "uniformization" of factor variables w/ missing levels - they ended up
> with levels AA,BB on one of them and levels AB,BB on the other, which
> means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the
> second - complete mess. That's why I tried to make both uniform, i.e.
> levels "AA","AB", and "BB" for every SNP and then rbind works.
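
A minimal sketch of that uniformization on two toy data frames follows.
geno1 and geno2 here are tiny stand-ins, and relevel_all() is a
hypothetical helper name, not something from the thread. The point is
that factor() with an explicit levels argument re-codes by label, and
that assigning into df[] replaces all columns in one step.

```r
lev <- c("AA", "AB", "BB")

# Stand-ins for the two cohorts (hypothetical values)
geno1 <- data.frame(snp1 = factor(c("AA", "BB", "AA")),  # levels AA, BB
                    snp2 = factor(c("AB", "AB", "BB")))
geno2 <- data.frame(snp1 = factor(c("AB", "BB")),        # levels AB, BB
                    snp2 = factor(c("AA", "AB")))

# Hypothetical helper: give every column the same three levels.
# df[] <- ... keeps the data-frame shape and assigns all columns at once.
relevel_all <- function(df) {
  df[] <- lapply(df, factor, levels = lev)
  df
}

both <- rbind(relevel_all(geno1), relevel_all(geno2))
levels(both$snp1)
as.integer(both$snp1) - 1L   # AA = 0, AB = 1, BB = 2 in both cohorts
```

After this, as.integer() means the same thing for every SNP and every
cohort, which is what as.numeric() on the original factors did not
guarantee.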
>
> In any case my 1st question remains: "What's wrong with me?" :-)
>
> Thanks,
> Latchezar
>
>>
>> b
>>
>> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
>>
>>> Hello,
>>>
>>> This is a speed question. I have a dataframe genoT:
>>>
>>>> dim(genoT)
>>> [1]   1002 238304
>>>
>>>> str(genoT)
>>> 'data.frame':   1002 obs. of  238304 variables:
>>>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
>>>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
>>>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
>>>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>>>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
>>>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>>>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
>>>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
>>>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
>>>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
>>>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
>>>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
>>>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
>>>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
>>>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>>>
>>> Its columns are factors with different numbers of levels (from 1 to 3 -
>>> that's what I got from read.table, i.e., it dropped missing levels). I
>>> want to convert it to uniform factors with 3 levels. The 1st 10 rows
>>> above show already converted columns and the rest are not yet
>>> converted.
>>> Here's my attempt, which is a complete failure speed-wise:
>>>
>>>> system.time(
>>> +     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
>>> +        gt<-genoT[[j]]          #-- this is to avoid 2D indices
>>> +        for(l in 1:length(gt@levels)){
>>> +          levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
>>> +          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
>>> +        }
>>> +     }
>>> + )
>>> [1] 785.085   4.358 789.454   0.000   0.000
>>>
>>> 789s for 10 columns only!
>>>
>>> To me it seems like replacing 10 x 3 levels and then making a factor
>>> of a 1002-element vector x 10 is a "negligible" amount of operations.
>>>
>>> So, what's wrong with me? Any idea how to significantly accelerate the
>>> transformation or (to go to the very beginning) to make read.table use
>>> a fixed set of levels ("AA","AB", and "BB") and not drop any (missing)
>>> level?
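
One way to get a fixed set of levels at read time, sketched under the
assumption that the file holds plain "AA"/"AB"/"BB" strings: read every
column as character with colClasses, then build the factors with an
explicit levels argument so nothing is dropped. The inline text and the
column names below are a made-up stand-in for the real file.

```r
lev <- c("AA", "AB", "BB")

# Hypothetical stand-in for the genotype file: 3 samples x 2 SNPs
txt <- "SNP_A.1 SNP_A.2
AA BB
AA BB
AB BB"

# colClasses = "character" stops read.table from building factors itself
raw <- read.table(text = txt, header = TRUE, colClasses = "character")

# Build the factors afterwards with all three levels, observed or not
geno <- as.data.frame(lapply(raw, factor, levels = lev))
sapply(geno, nlevels)
```

SNP_A.2 contains only "BB" in this toy input, yet it still ends up with
the three levels AA, AB, BB and an internal code of 3 for every sample.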
>>>
>>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>>>
>>> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
>>> it.
>>>
>>> Thank you very much for the help,
>>>
>>> Latchezar Dimitrov,
>>> Analyst/Programmer IV,
>>> Wake Forest University School of Medicine, Winston-Salem, North
>>> Carolina, USA
>>>
>>> ______________________________________________
>>> R-help at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>