[R] Does SQL group by have a heavy duty equivalent in R
Charles C. Berry
cberry at tajo.ucsd.edu
Mon Jan 1 04:46:06 CET 2007
On Sun, 31 Dec 2006, Charles C. Berry wrote:
> On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
>> I have hundreds of humans who have undergone SNP genotyping at hundreds of
>> loci. Some have even undergone the procedure twice or thrice (kind of an
>> internal control).
>> So obviously I need to find those replications, and confirm that the
>> are the same. If there is discordance then I need to address it.
> Why not use duplicated() ?
unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] )
gives a list of those IDs for which the SNPs in all replicates of an ID
are not the same.
> For a data.frame with 200 rows of which about 50 are duplicates and 201
> columns finding the (non) duplicates takes little time on my year old AMD 64
> running Windows XP:
>> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
>> snp.dat <- lapply(1:200,function(x) 0:1 )
>> snp.frame <- as.data.frame(do.call(cbind,snp.dat))
>> my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
>> system.time( table(duplicated(my.dat)) )
>  0.03 0.00 0.03 NA NA
> Finding the non-duplicated rows for which there is at least one replication:
>> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
>> names(which(table(my.dat$ID)>1)) ) ))
>  0.05 0.00 0.05 NA NA
>> I tried to use the aggregate function
>> This was simply to figure out how many times the same piece of information
>> had been obtained. I ran out of patience. It took beyond forever and
>> did not perform much better. The reshape package did not help - it implied
>> one was out of luck if the data was not numeric. All of my data is
>> or factor.
>> Instead I used RODBC
>> to push the table into a Microsoft Access database
>> Then a sql query, courtesy of the Microsoft Access Query Wizard a la
>> SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
>> MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
>> RawSeq.rownames) AS CountOfrownames
>> FROM RawSeq
>> WHERE (((RawSeq.GENOTYPE_ID)<>""))
>> GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
>> ORDER BY Count(RawSeq.rownames) DESC;
>> This way I could easily use the minimum and maximum values to see if they
>> were discordant.
>> Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
>> result of the SQL query back into R.
>> This is the first time I have seen Microsoft Access outpace R.
>> Is my observation correct or am I missing something. I would much rather
>> perform all data manipulation and analyses in R.
>> Farrel Buchinsky
>> [[alternative HTML version deleted]]
>> R-help at stat.math.ethz.ch mailing list
>> PLEASE do read the posting guide
>> and provide commented, minimal, self-contained, reproducible code.
> Charles C. Berry (858) 534-2098
> Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu UC San Diego
> http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
More information about the R-help