[BioC] Problem with removing duplicated probes of datasets without annotation

Sun Jun 22 10:49:31 CEST 2014

Kaj

This is an R question; code like the following would do the job:

# x is a data.frame with columns ‘probeid’ and ‘pvalue’
s = split( seq_len(nrow(x)), x$probeid)
# for each probe ID, the index of the row with the smallest p-value
uniqueids = sapply( s, function(i) i[which.min(x$pvalue[i])] )
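
As a minimal sketch of how this behaves, here is the same idea applied to a small made-up data frame. The probe IDs and p-values are invented purely for illustration, and ‘xu’ is just a name chosen here for the de-duplicated result.

```r
# Toy data, invented for illustration: probes "a" and "b" occur twice
x = data.frame(probeid = c("a", "b", "a", "c", "b"),
               pvalue  = c(0.04, 0.01, 0.002, 0.03, 0.02),
               stringsAsFactors = FALSE)

# Row indices of x, grouped by probe ID
s = split(seq_len(nrow(x)), x$probeid)

# For each probe ID, the index of the row with the smallest p-value
uniqueids = sapply(s, function(i) i[which.min(x$pvalue[i])])

# Keep exactly one row per probe ID
xu = x[uniqueids, ]
```

Note that ‘uniqueids’ holds row indices (named by probe ID), not the IDs themselves, so it can be used directly to subset ‘x’ or any object whose rows are in the same order.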

And you can replace what’s inside the ‘which.min(…)’ expression with whatever selection criterion pleases you.
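
For instance, to keep the most variable probe of each set of duplicates, one could swap in ‘which.max’ on the per-row variance. The following is only a sketch under assumptions: ‘e’ stands for an expression matrix with probes in rows, ‘ids’ for the vector of (possibly duplicated) identifiers of those rows, and both are filled with simulated values here.

```r
# Hypothetical inputs, simulated for illustration:
# 'e' is an expression matrix (rows = probes, columns = samples),
# 'ids' is the identifier of each row, with duplicates
set.seed(1)
e   = matrix(rnorm(5 * 6), nrow = 5)
ids = c("a", "b", "a", "c", "b")

# Variance of each row across samples
v = apply(e, 1, var)

# Row indices grouped by identifier
s = split(seq_len(nrow(e)), ids)

# For each identifier, the index of the row with the largest variance
keep = sapply(s, function(i) i[which.max(v[i])])

# One row per identifier, labelled by that identifier
eu = e[keep, , drop = FALSE]
rownames(eu) = names(keep)
```

With an ExpressionSet, ‘e’ would typically come from ‘exprs()’, but how the identifiers are obtained depends on the dataset.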

There are plenty of places in vignettes etc. where this type of operation is done. One I happen to be aware of right now is inside the function ‘myHeatmap’ of the ‘Hiiragi2013’ package.

Wolfgang Huber

On 22 Jun 2014, at 10:02, Kaj Chokeshaiusaha [guest] <guest at bioconductor.org> wrote:

> Dear R helpers,
> 
> I'm working with a goat dataset for which no annotation db is available. For this reason, I use the 'genefilter' function instead of 'nsFilter', with ANOVA (p<0.05) (available in the 'genefilter' package). The problem is that the filtered data contain 500 duplicated probes, which I want to remove.
> 
> Due to my limited ability, I cannot figure out how to do this. It would be great if I could select, for each set of duplicates, the probe with either the lowest p-value or the highest variance.
> 
> Would you please help me with some examples? 
> 
> Best Regards,
> Kaj
> 
> -- output of sessionInfo(): 
> 
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
> [8] base     
> 
> other attached packages:
> [1] Biobase_2.24.0      BiocGenerics_0.10.0 genefilter_1.46.1  
> 
> loaded via a namespace (and not attached):
> [1] annotate_1.42.0      AnnotationDbi_1.26.0 DBI_0.2-7           
> [4] GenomeInfoDb_1.0.2   IRanges_1.22.9       RSQLite_0.11.4      
> [7] splines_3.1.0        stats4_3.1.0         survival_2.37-7     
> [10] tcltk_3.1.0          tools_3.1.0          XML_3.98-1.1        
> [13] xtable_1.7-3
> 
> --
> Sent via the guest posting facility at bioconductor.org.