[R] Does SQL group by have a heavy duty equivalent in R

Mon Jan 1 04:46:06 CET 2007

On Sun, 31 Dec 2006, Charles C. Berry wrote:

> On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
>
>>  I have hundreds of humans who have undergone SNP genotyping at hundreds of
>>  loci. Some have even undergone the procedure twice or thrice (kind of an
>>  internal control).
>>
>>  So obviously I need to find those replications, and confirm that the
>>  results
>>  are the same. If there is discordance then I need to address it.
>
> Why not use  duplicated() ?

More specifically:

 	unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] )

gives a list of those IDs for which the SNPs in all replicates of an ID 
are not the same.

>
> For a data.frame with 200 rows of which about 50 are duplicates and 201 
> columns finding the (non) duplicates takes little time on my year old AMD 64 
> running Windows XP:
>
>>  my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
>>  snp.dat <- lapply(1:200,function(x) 0:1 )
>>  snp.frame <- as.data.frame(do.call(cbind,snp.dat))
>>  my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
>>  system.time( table(duplicated(my.dat)) )
> [1] 0.03 0.00 0.03   NA   NA
>> 
>
> Finding the non-duplicated rows for which there is at least one replication:
>
>>  system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
>>  names(which(table(my.dat$ID)>1)) ) ))
> [1] 0.05 0.00 0.05   NA   NA
>
>
>> 
>>
>>  I tried to use the aggregate function
>>
>>  nr.attempts
>>  <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
>>  This was simply to figure out how many times the same piece of information
>>  had been obtained. I ran out of patience. It took beyond forever and
>>  tapply
>>  did not perform much better. The reshape package did not help - it implied
>>  one was out of luck if the data was not numeric. All of my data is
>>  character
>>  or factor.
>>
>>  Instead I used RODBC
>>
>>  sqlSave(channel,RawSeq)
>>  to push the table into a Microsoft Access database
>>  Then a sql query, courtesy of the Microsoft Access Query Wizard a la
>>  design
>>  mode.
>>
>>  SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
>>  MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
>>  RawSeq.rownames) AS CountOfrownames
>>  FROM RawSeq
>>  WHERE (((RawSeq.GENOTYPE_ID)<>""))
>>  GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
>>  ORDER BY Count(RawSeq.rownames) DESC;
>>
>>  This way I could easily use the minimum and maximum values to see if they
>>  were discordant.
>>  Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
>>  result of the SQL query back into R.
>>
>>  This is the first time I have seen Microsoft Access outpace R.
>>  Is my observation correct or am I missing something. I would much rather
>>  perform all data manipulation and analyses in R.
>> 
>> 
>>
>>  --
>>  Farrel Buchinsky
>>
>>   [[alternative HTML version deleted]]
>>
>>  ______________________________________________
>>  R-help at stat.math.ethz.ch mailing list
>>  https://stat.ethz.ch/mailman/listinfo/r-help
>>  PLEASE do read the posting guide
>>  http://www.R-project.org/posting-guide.html
>>  and provide commented, minimal, self-contained, reproducible code.
>> 
>
> Charles C. Berry                        (858) 534-2098
>                                         Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu	         UC San Diego
> http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717
>
>
>
>

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717