[BioC] Duplicate gene names after summarization with RMA (hugene.1.0.st.v1)

Sun Apr 6 22:24:02 CEST 2014

Hello,

I am new to analyzing array files. I am attempting to generate a CSV file that contains a gene symbol and RMA-processed expression data for a set of arrays for input into an online pathway ID tool (TNBCtype, http://cbc.mc.vanderbilt.edu/tnbc/).

My problem/question (I am not sure whether this is a real problem or whether I simply misunderstand the process):

When I export the CSV file, there are duplicate entries for some gene names (e.g. ESR1). I am under the impression that RMA, as I am running it (target = 'core'), summarizes at the gene level, so I am not sure why I get duplicate entries for certain (not all) genes after writing the expression file. I have gone through this process with some mouse array data (Mouse Gene 1.0 ST arrays) and did not run into this problem of duplicate gene names.
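A minimal sketch of how the duplicated symbols can be listed, using a toy annotation table rather than the real arrays. The IDs and the ID-to-symbol mapping here are made up for illustration; in the script further down, the corresponding columns would be the `ID` and `Symbol` columns of `tmpframe`.

```r
# Toy stand-in for the annotated table: the IDs are unique,
# but two of them map to the same symbol. All IDs are made up.
annot <- data.frame(
  ID     = c("1001", "1002", "1003", "1004"),
  Symbol = c("ESR1", "ESR1", "TP53", NA),
  stringsAsFactors = FALSE
)

# Symbols attached to more than one ID (unannotated rows are ignored)
sym <- annot$Symbol[!is.na(annot$Symbol)]
dupSymbols <- unique(sym[duplicated(sym)])
dupSymbols

# The rows involved, for inspection
dupRows <- annot[annot$Symbol %in% dupSymbols, ]
dupRows
```

Looking at the rows returned this way shows whether the repeated symbols come from distinct IDs or from the same ID appearing twice.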

Any insights on what I might be doing incorrectly, or in understanding the output I should expect, would be greatly appreciated.  

Is averaging the values of these instances of duplicate gene names a valid thing to do?
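For concreteness, here is a sketch of the mechanics of two possible ways to collapse rows that share a symbol: averaging them, or keeping the row with the highest mean expression. It uses made-up log2 values, and it only shows how each option would be computed; whether either one is appropriate for the downstream tool is exactly the open question.

```r
# Made-up log2 expression values: rows are IDs, columns are arrays.
expr <- matrix(
  c(8, 9,
    6, 7,
    5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1001", "1002", "1003"), c("array1", "array2"))
)
symbol <- c("ESR1", "ESR1", "TP53")

# Option 1: average the rows that share a symbol.
sums     <- rowsum(expr, group = symbol)             # per-symbol column sums
counts   <- as.vector(table(symbol)[rownames(sums)]) # number of rows per symbol
averaged <- sums / counts

# Option 2: for each symbol, keep the row with the highest mean expression.
rowMeansAll <- rowMeans(expr)
keep <- sapply(split(seq_along(symbol), symbol),
               function(i) i[which.max(rowMeansAll[i])])
strongest <- expr[keep, , drop = FALSE]
rownames(strongest) <- names(keep)

averaged
strongest
```

Both results have one row per symbol, so either could be written out as the gene-level table.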

Thank you!

-Ed O'Donnell
Postdoctoral Scholar
Oregon State University

My commands (Analysis.R), run as source("Analysis.R"):
---------------------

#install packages for analysis of the human array

source("http://bioconductor.org/biocLite.R")
biocLite("hugene10sttranscriptcluster.db")
biocLite("oligo")
biocLite("annotate")

#load required packages

library(oligo)
library(hugene10sttranscriptcluster.db)
library(annotate)

#set wd to myworkingdirectory

setwd("myworkingdirectory")  

#read in the raw data from the files and the pData

rawData <- read.celfiles(list.celfiles())

#rma normalization

rmaCore <- rma(rawData, target = 'core')

#annotation

ID <- featureNames(rmaCore)
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))

#make a temporary data frame with all the identifiers...

tmpframe <- data.frame(ID = ID, Symbol = Symbol, Name = Name, stringsAsFactors = FALSE)
tmpframe[tmpframe=="NA"] <- NA

#assign data frame to rma-results

fData(rmaCore) <- tmpframe

#expression table with gene name and annotation info, processed with sed after export to get the quotations in the right spot and remove NA lines

write.table(cbind(pData(featureData(rmaCore))[,"Symbol"],exprs(rmaCore)),file="better_annotation.csv", quote = FALSE, sep = ",")

----------
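The export step above relies on sed afterwards to fix the quoting and remove the NA lines. A possible alternative, sketched here on toy data, is to drop the unannotated rows and write the header inside R. The values, the column names, and the use of a temporary file are assumptions for illustration only; the real table would come from `exprs(rmaCore)` and the `Symbol` column.

```r
# Toy expression matrix and symbols; the second row has no annotation.
expr <- matrix(c(8, 6, 5, 9, 7, 5), nrow = 3,
               dimnames = list(c("1001", "1002", "1003"),
                               c("array1", "array2")))
symbol <- c("ESR1", NA, "TP53")

# Put the symbol in the first column and drop rows without a symbol.
out <- data.frame(Symbol = symbol, expr,
                  check.names = FALSE, stringsAsFactors = FALSE)
out <- out[!is.na(out$Symbol), ]

# Write a plain CSV with a header row and no row names or quotes.
csvFile <- tempfile(fileext = ".csv")
write.csv(out, file = csvFile, row.names = FALSE, quote = FALSE)
csvLines <- readLines(csvFile)
csvLines
```

Note that `quote = FALSE` is only safe when no field contains a comma, which holds for gene symbols but not necessarily for full gene names.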

 -- output of sessionInfo(): 

R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.8.0            gplots_2.12.1                       
 [3] annotate_1.40.1                      hugene10sttranscriptcluster.db_8.0.1
 [5] org.Hs.eg.db_2.10.1                  RSQLite_0.11.4                      
 [7] DBI_0.2-7                            AnnotationDbi_1.24.0                
 [9] limma_3.18.13                        oligo_1.26.6                        
[11] Biostrings_2.30.1                    XVector_0.2.0                       
[13] IRanges_1.20.7                       Biobase_2.22.0                      
[15] oligoClasses_1.24.0                  BiocGenerics_0.8.0                  
[17] BiocInstaller_1.12.0                

loaded via a namespace (and not attached):
 [1] affxparser_1.34.2     affyio_1.30.0         bit_1.1-11           
 [4] bitops_1.0-6          caTools_1.16          codetools_0.2-8      
 [7] ff_2.2-12             foreach_1.4.1         gdata_2.13.2         
[10] GenomicRanges_1.14.4  gtools_3.3.1          iterators_1.0.6      
[13] KernSmooth_2.23-12    preprocessCore_1.24.0 splines_3.0.3        
[16] stats4_3.0.3          tcltk_3.0.3           tools_3.0.3          
[19] XML_3.95-0.2          xtable_1.7-3          zlibbioc_1.8.0   

--
Sent via the guest posting facility at bioconductor.org.