[BioC] finding and deleting repeated observations

Tue Jun 1 17:50:59 CEST 2010

Hi Mervi,

One solution is to order your data frame by "pvalue" using the order function and then to remove duplicate "GeneSymbol" using !duplicated.

> A<-c(12,2,4,15,11,9)
> B<-c(44,32,55,25,27,18)
> pvalue<-c(.01,.05,.2,.005,.002,.0001)
> GeneSymbol<-c(rep("ABC1",2),"AB",rep("ABCD1",3))
> tmp<-as.data.frame(cbind(A,B,pvalue))
> tmp<-cbind(GeneSymbol,tmp)
> tmp
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
4      ABCD1 15 25  5e-03
5      ABCD1 11 27  2e-03
6      ABCD1  9 18  1e-04
## reorder your dataframe by pvalue
> tmp.ordered <- tmp[order(tmp$pvalue),]
> tmp.ordered
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
5      ABCD1 11 27  2e-03
4      ABCD1 15 25  5e-03
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
## select the first instance of each gene symbol and remove all others. Because the rows are ordered by pvalue, the first instance of each gene symbol is the one with the lowest pvalue.
> tmp.sub<- tmp.ordered[!duplicated(tmp.ordered$GeneSymbol),]
> tmp.sub
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
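## restore the original row order using the rownames. Convert them to numbers first, because rownames are character strings and would otherwise sort as text ("10" before "2").
> tmp.sub<-tmp.sub[order(as.numeric(rownames(tmp.sub))),]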
> tmp.sub
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
6      ABCD1  9 18  1e-04
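
If you need this more than once, the same three steps can be wrapped in a small function. This is only a sketch: the function name keep_lowest and its arguments key and score are made up for illustration, and it assumes the score column is numeric. It works on row positions rather than rownames, so it does not depend on the rownames being numbers.

```r
## Hypothetical helper (name and arguments are illustrative only):
## for each value of `key`, keep the row with the smallest `score`.
keep_lowest <- function(df, key = "GeneSymbol", score = "pvalue") {
  # Row positions sorted by score; the second argument breaks ties
  # in favour of the row that appears first in the data frame.
  ord <- order(df[[score]], seq_len(nrow(df)))
  # Positions of the first (lowest-score) row for each key value.
  kept <- ord[!duplicated(df[[key]][ord])]
  # Sorting the positions restores the original row order.
  df[sort(kept), , drop = FALSE]
}

tmp <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A = c(12, 2, 4, 15, 11, 9),
  B = c(44, 32, 55, 25, 27, 18),
  pvalue = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)
result <- keep_lowest(tmp)
```

Here result should contain the same three rows (1, 3 and 6) as tmp.sub above.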

Scott

Scott A. Ochsner, PhD
One Baylor Plaza BCM130, Houston, TX 77030
Voice: (713) 798-6227  Fax: (713) 790-1275 
-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of mervi.alanne at wri.fi
Sent: Friday, May 28, 2010 12:27 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] finding and deleting repeated observations

Dear all,

I'm a novice with R and could use some help. How could I find repeated observations based on one column and select the one to keep based on another column?

In more detail, this is the thing I want to achieve: 
-data.frame has 4 columns GeneSymbol, A, B, pvalue
-data in column GeneSymbol may be repeated 1-6 times
-data also contains unique observations
-Of the repeated obs, keep the obs which has the lowest pvalue
-Do not discard data from cols A and B

Example input data
GeneSymbol A B pvalue
ABC1 12 44 0.01
ABC1 2 32 0.05
AB 4 55 0.2
ABCD1 15 25 0.005
ABCD1 11 27 0.002
ABCD1 9 18 0.0001

I'd like the output to look like this:
GeneSymbol A B pvalue
ABC1 12 44 0.01
AB 4 55 0.2
ABCD1 9 18 0.0001

Any suggestions? 

-Mervi
Wihuri Research Institute
