[R] Fwd: which is faster "for" or "apply"

Karim Mezhoud kmezhoud at gmail.com
Wed Dec 31 17:55:15 CET 2014


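To handle both cases (factor and character columns):
cidx <- !(sapply(df, is.numeric))
## go through as.character() so factors give their labels, not their codes
df[cidx] <- lapply(df[cidx], function(x) as.numeric(as.character(x)))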
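
A note on the factor case: as.numeric() applied directly to a factor returns the integer level codes rather than the printed values, as the f0(dff) output quoted below shows. Here is a minimal sketch of the conversion that goes through as.character() first. The toy data frame and the helper name to_num are made up for illustration; they are not from cgdsr or from the messages below.

```r
## Hypothetical toy data: numbers stored as character and as factor.
df <- data.frame(
    a = c("1.5", "20", "3"),
    b = factor(c("10", "2.5", "30")),
    c = c(1, 2, 3),
    stringsAsFactors = FALSE
)

## What a direct coercion of the factor column gives: the level codes.
codes <- as.numeric(df$b)

## Select every column that is not already numeric.
cidx <- !vapply(df, is.numeric, logical(1))

## Hypothetical helper: convert via as.character() so that factors
## yield their labels instead of their integer codes.
to_num <- function(x) as.numeric(as.character(x))

## Assigning into df[cidx] keeps the row and column names of df.
df[cidx] <- lapply(df[cidx], to_num)

str(df)
```

Values that cannot be parsed as numbers become NA with a warning, so this is probably only safe for columns that are known to hold numbers.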


  Ô__
 c/ /'_;~~~~kmezhoud
(*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/



On Wed, Dec 31, 2014 at 5:51 PM, Karim Mezhoud <kmezhoud at gmail.com> wrote:

> Yes, the last one is the best. But I need to test whether the returned
> data.frame contains factor or character columns:
>   cidx <- sapply(df, is.factor) or cidx <- sapply(df, is.character)
> Thanks
>
>
> On Wed, Dec 31, 2014 at 5:24 PM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>
>> Concretely, I query cBioPortal through the cgdsr package.
>> Depending on the cases and genetic profiles, I generally receive a
>> data.frame with a heterogeneous structure. The bad case is when the
>> returned data.frame is composed of numeric and character columns: the
>> numeric columns are then treated as factors. This happens when I
>> explore/extract information from Clinical Data (age, gender, tumor
>> stage, ...). In this case I need to convert only the numeric columns and
>> not the character ones. I am using
>> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
>>
>> But this heterogeneity appears even with a data.frame that is supposed
>> to be purely numeric (gene expression). Here is an example:
>>
>>
>> library(cgdsr)
>> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
>> ,"ATR", "MDC1" ,"PARP1")
>> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>>
>>
>> On my computer the two calls do not return the same structure (numeric
>> vs. factor).
>>
>> I also need to preserve row and column names ;)
>> So I am working on resolving these details depending on the data
>> returned by cBioPortal...
>>
>> Thank you
>>
>>
>> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <kmezhoud at gmail.com>
>> wrote:
>>
>>> Many Many Many thanks!
>>> It is an instructive lesson. I need time to test all the examples :)
>>> Thank you for your time and support.
>>> Happy and Healthy New Year
>>>
>>>
>>> On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <mtmorgan at fredhutch.org>
>>> wrote:
>>>
>>>> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
>>>>
>>>>> Thanks,
>>>>> It seems the for loop takes less time ;)
>>>>>
>>>>> with
>>>>> dim(DataFrame)
>>>>> [1] 338  70
>>>>>
>>>>> For loop has
>>>>>     user  system elapsed
>>>>>    0.012   0.000   0.012
>>>>>
>>>>> and apply has
>>>>>    user  system elapsed
>>>>>    0.020   0.000   0.021
>>>>>
>>>>
>>>> The timings are so short that the answer in terms of speed is 'it does
>>>> not matter'.
>>>>
>>>> Here is a selection of approaches
>>>>
>>>> f0 <- function(df) {
>>>>     for (i in seq_along(df))
>>>>         df[,i] <- as.numeric(df[,i])
>>>>     df
>>>> }
>>>>
>>>> f0a <- function(df) {
>>>>     ## data.frame is a list-of-equal-length vectors; access each
>>>>     ## column with "[["
>>>>     for (i in seq_along(df))
>>>>         df[[i]] <- as.numeric(df[[i]])
>>>>     df
>>>> }
>>>>
>>>> f0c <- compiler::cmpfun(f0)  ## loops sometimes benefit from compilation
>>>>
>>>> f1 <- function(df)
>>>>     as.data.frame(apply(df, 2, as.numeric))
>>>>
>>>> f2 <- function(df) {
>>>>     ## replace all columns of df with list-of-vectors
>>>>     df[] <- lapply(df, as.numeric)
>>>>     df
>>>> }
>>>>
>>>> f3 <- function(df) {
>>>>     ## coerce to matrix to avoid the explicit loop, use mode<- to
>>>>     ## change storage of elements
>>>>     m <- as.matrix(df)
>>>>     mode(m) <- "numeric"
>>>>     as.data.frame(m)
>>>> }
>>>>
>>>> f4 <- function(df) {
>>>>     ## if it's a matrix, why are we returning a data.frame?
>>>>     m <- as.matrix(df)
>>>>     mode(m) <- "numeric"
>>>>     m
>>>> }
>>>>
>>>> f4a <- function(df)
>>>>     ## unlist to single vector, coerce, then format as matrix
>>>>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
>>>>            dimnames=dimnames(df))
>>>>
>>>> It's important to test that different methods return the same result
>>>> (perhaps allowing for differences in attributes such as row or column
>>>> names). The microbenchmark package repeats timings across multiple trials
>>>> (default 100 times).
>>>>
>>>> library(microbenchmark)
>>>> test <- function(df) {
>>>>     stopifnot(
>>>>         identical(f0(df), f0a(df)),
>>>>         identical(f0(df), f0c(df)),
>>>>         identical(f0(df), f1(df)),
>>>>         identical(f0(df), f2(df)),
>>>>         identical(f0(df), f3(df)),
>>>>         identical(as.matrix(f0(df)), f4(df)),
>>>>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
>>>>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
>>>>                    f4(df), f4a(df))
>>>> }
>>>>
>>>> Here are some data sets
>>>>
>>>> m <- matrix(rnorm(338 * 70), 338)
>>>> df <- as.data.frame(m)
>>>> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
>>>> dff <- as.data.frame(lapply(df, as.character))
>>>>
>>>> and results
>>>>
>>>> > test(df)
>>>> Unit: microseconds
>>>>     expr      min        lq      mean    median        uq      max neval
>>>>   f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
>>>>  f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
>>>>  f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
>>>>   f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
>>>>   f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
>>>>   f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
>>>>   f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
>>>>  f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
>>>> > test(dfc)
>>>> Unit: milliseconds
>>>>     expr       min        lq      mean    median        uq       max neval
>>>>   f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
>>>>  f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
>>>>  f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
>>>>   f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
>>>>   f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
>>>>   f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
>>>>   f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
>>>>  f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
>>>> > test(dff)
>>>> Error: identical(f0(df), f1(df)) is not TRUE
>>>>
>>>> Except when dealing with factors, the use of explicit loops is the
>>>> slowest. With factors, matrix-based methods coerce the level labels to
>>>> numeric, whereas vector-based methods coerce the underlying codes (level
>>>> values) of the factor; obviously great care needs to be taken.
>>>>
>>>> > f0(dff)[1:5, 1:5]
>>>>    V1  V2  V3  V4  V5
>>>> 1 150 232 294  88  56
>>>> 2 159   8  89  59  10
>>>> 3 132 171  40 205 119
>>>> 4 214 273  26 262 216
>>>> 5 281  49 255  31 233
>>>> > f1(dff)[1:5, 1:5]
>>>>           V1          V2         V3         V4          V5
>>>> 1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
>>>> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
>>>> 3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
>>>> 4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
>>>> 5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
>>>>
>>>> In terms of 'best practice', I would represent my data in the
>>>> appropriate data structure in the first place (as a matrix of appropriate
>>>> type, rather than data.frame, so the entire coercion is irrelevant). If
>>>> faced with a data.frame with specific columns to coerce I would use the
>>>> approach
>>>>
>>>>     cidx <- sapply(df, is.character)      # index of columns to coerce
>>>>     df[cidx] <- lapply(df[cidx], as.numeric)
>>>>
>>>> which seems to be reasonably correct, expressive, compact, and speedy.
>>>>
>>>> Martin Morgan
>>>>
>>>>
>>>>
>>>>>
>>>>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <bhh at xs4all.nl>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>  On 31-12-2014, at 08:40, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> I would like to choose between these two ways of converting a data
>>>>>>> frame. Which is faster?
>>>>>>>
>>>>>>> for (i in 1:ncol(DataFrame)) {
>>>>>>>     DataFrame[,i] <- as.numeric(DataFrame[,i])
>>>>>>> }
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> DataFrame <- as.data.frame(apply(DataFrame, 2, function(x) as.numeric(x)))
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Try it and use system.time.
>>>>>>
>>>>>> Berend
>>>>>>
>>>>>>> Thanks
>>>>>>> Karim
>>>>>>>
>>>>>>>        [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>
>>>
>>
>
