[R] faster version of split()?

Søren Højsgaard Soren.Hojsgaard at agrsci.dk
Fri Jan 16 11:55:55 CET 2009


Hi,

R version 2.2.1 is rather old. You may want to upgrade to the current version, R 2.8.1.

You can, for example, do:

library(doBy)
dd <- data.frame(x=c(1,1,1,2,2,2), y=c(1,1,2, 1,1,1))
summaryBy(y~x, data=dd, FUN=function(x)length(unique(x)))
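
If doBy is not available, the same count can probably be had in base R. The following is only a sketch on the toy data above; the object names n_unique, pairs and n_unique2 are made up for illustration, and I have not timed either variant on a data frame of the size mentioned in the question.

```r
dd <- data.frame(x = c(1, 1, 1, 2, 2, 2), y = c(1, 1, 2, 1, 1, 1))

# Option 1: apply the counting function to y within each level of x
n_unique <- tapply(dd$y, dd$x, function(v) length(unique(na.omit(v))))

# Option 2: drop duplicated (x, y) pairs once, then tabulate x.
# This avoids calling a function per group, which may help when
# there are many groups.
pairs <- unique(dd[!is.na(dd$y), c("x", "y")])
n_unique2 <- table(pairs$x)
```

Note that Option 2 silently omits any level of x whose y values are all NA, whereas Option 1 reports 0 for it.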
 
Regards
Søren


-----Original message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On behalf of Simon Pickett
Sent: 16 January 2009 11:10
To: R help
Subject: [R] faster version of split()?

Hi all,

I want to calculate the number of unique observations of "y" in each level of "x" from my data frame "df".

This does the job, but it is very slow for this big data frame (159503 rows,
11 columns):

group.list <- split(df$y, df$x)
count <- function(x) length(unique(na.omit(x)))
sapply(group.list, count, USE.NAMES=TRUE)

I couldn't find the answer by searching the help forum for "slow split" and "split time".

I am running R version 2.2.1 on a machine with 4 GB of memory, and I'm using Windows 2000.

Thanks in advance,

Simon.







----- Original Message -----
From: "Wacek Kusnierczyk" <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
To: "Gundala Viswanath" <gundalav at gmail.com>
Cc: "R help" <R-help at stat.math.ethz.ch>
Sent: Friday, January 16, 2009 9:30 AM
Subject: Re: [R] Value Lookup from File without Slurping


> you might try to iteratively read a limited number of lines in a
> batch using readLines:
>
> # filename, the name of your file
> # n, the maximal count of lines to read in a batch
> connection = file(filename, open="rt")
> while (length(lines <- readLines(con=connection, n=n))) {
>   # do your stuff here
> }
> close(connection)
>
> ?file
> ?readLines
>
> vQ
>
>
> Gundala Viswanath wrote:
>> Dear all,
>>
>> I have a repository file (let's call it repo.txt)
>> that contains two columns like this:
>>
>> # tag  value
>> AAA    0.2
>> AAT    0.3
>> AAC   0.02
>> AAG   0.02
>> ATA    0.3
>> ATT   0.7
>>
>> Given another query vector
>>
>>
>>> qr <- c("AAC", "ATT")
>>>
>>
>> I would like to find the corresponding value for each query above,
>> yielding:
>>
>> 0.02
>> 0.7
>>
>> However, I want to avoid slurping the whole of repo.txt into an object
>> (e.g. a hash). Is there any way to do that?
>>
>> The reason is that repo.txt is very large (millions of lines,
>> with tag length > 30 bp), and my PC's memory is too small to hold it.
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
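
For the batch-reading suggestion quoted above, the "do your stuff here" step could be filled in roughly as follows. This is only a sketch: lookup_tags and batch_size are hypothetical names, and it assumes the file is whitespace-separated with the tag in the first column, the value in the second, and comment lines starting with "#", as in the example in the quoted message.

```r
# Hypothetical helper: look up the values for the tags in `query`
# while holding only `batch_size` lines in memory at a time.
lookup_tags <- function(filename, query, batch_size = 10000) {
  found <- setNames(rep(NA_real_, length(query)), query)
  con <- file(filename, open = "rt")
  on.exit(close(con))
  while (length(lines <- readLines(con, n = batch_size)) > 0) {
    lines <- trimws(lines)
    lines <- lines[!grepl("^(#|$)", lines)]  # skip comment and blank lines
    if (length(lines) == 0) next
    fields <- strsplit(lines, "[[:space:]]+")
    tags <- vapply(fields, `[`, character(1), 1)
    values <- as.numeric(vapply(fields, `[`, character(1), 2))
    hit <- tags %in% query
    found[tags[hit]] <- values[hit]
    if (!anyNA(found)) break  # stop early once every query tag is resolved
  }
  found
}

# Small demonstration on a temporary copy of the example file
repo <- tempfile(fileext = ".txt")
writeLines(c("# tag  value", "AAA    0.2", "AAT    0.3", "AAC   0.02",
             "AAG   0.02", "ATA    0.3", "ATT   0.7"), repo)
qr <- c("AAC", "ATT")
res <- lookup_tags(repo, qr, batch_size = 2)
```

Tags that never occur in the file stay NA in the result, and the whole file is then scanned once. If a tag occurs more than once, the last occurrence read before the loop stops wins.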
