[BioC] Duplicate gene names after summarization with RMA (hugene.1.0.st.v1)
Ed O'Donnell [guest]
guest at bioconductor.org
Sun Apr 6 22:24:02 CEST 2014
Hello,
I am new to analyzing array files. I am attempting to generate a CSV file that contains a gene symbol and RMA-processed expression data for a set of arrays for input into an online pathway ID tool (TNBCtype, http://cbc.mc.vanderbilt.edu/tnbc/).
My problem/question (I am not sure whether it is a real problem or whether I simply misunderstand the process):
When I export the CSV file, there are duplicate entries for some gene names (e.g. ESR1). I am under the impression that RMA with the option I am using (target = 'core') summarizes at the gene level, so I am not sure why I get duplicate entries for certain (not all) genes after writing the expression file. I have gone through this process with some mouse array data (mouse gene 10 st arrays) and did not run into this problem of duplicate gene names.
Any insights on what I might be doing incorrectly, or in understanding the output I should expect, would be greatly appreciated.
Is averaging the values of these instances of duplicate gene names a valid thing to do?
Thank you!
-Ed O'Donnell
postdoctoral scholar
Oregon State University
My commands (Analysis.R), run as source("Analysis.R"):
---------------------
#install packages for analysis of the human array (hugene.1.0.st.v1)
source("http://bioconductor.org/biocLite.R")
biocLite("hugene10sttranscriptcluster.db")
biocLite("oligo")
biocLite("annotate")
#load required packages
library(oligo)
library(hugene10sttranscriptcluster.db)
library(annotate)
#set wd to myworkingdirectory
setwd("myworkingdirectory")
#read in the raw data from the CEL files in the working directory
rawData <- read.celfiles(list.celfiles())
#rma normalization
rmaCore <- rma(rawData, target = 'core')
#annotation
ID <- featureNames(rmaCore)
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
#make a temporary data frame with all the identifiers...
tmpframe <- data.frame(ID = ID, Symbol = Symbol, Name = Name, stringsAsFactors = FALSE)
tmpframe[tmpframe=="NA"] <- NA
#assign data frame to rma-results
fData(rmaCore) <- tmpframe
#expression table with gene name and annotation info, processed with sed after export to get the quotations in the right spot and remove NA lines
write.table(cbind(pData(featureData(rmaCore))[,"Symbol"],exprs(rmaCore)),file="better_annotation.csv", quote = FALSE, sep = ",")
----------
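To make the duplication concrete, here is a minimal base-R sketch on made-up values, not my actual arrays. It assumes that more than one transcript-cluster ID can carry the same gene symbol, which would give repeated symbols even after core-level summarization. The IDs, sample names, and intensities are hypothetical, and averaging is shown only as one possible way to collapse the rows, not as a recommendation.

```r
# Toy expression matrix: rows are hypothetical transcript-cluster IDs,
# columns are hypothetical samples (values are invented log2 intensities).
expr <- matrix(c(8.0, 8.2,
                 6.0, 6.4,
                 7.0, 7.1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("id1", "id2", "id3"),
                               c("sampleA", "sampleB")))

# Hypothetical annotation: two of the three IDs share the symbol ESR1.
symbol <- c("ESR1", "ESR1", "GATA3")

# Count how often each symbol occurs and list the repeated ones.
counts <- table(symbol)
dupSymbols <- names(counts)[counts > 1]

# One possible way to collapse: average the rows that share a symbol.
# rowsum() adds up rows per group; dividing by the group sizes gives the mean.
sums <- rowsum(expr, group = symbol)
collapsed <- sums / as.vector(counts[rownames(sums)])
```

On real data, rows whose symbol is NA would presumably need to be dropped before grouping, since they do not correspond to a named gene.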
-- output of sessionInfo():
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.8.0 gplots_2.12.1
[3] annotate_1.40.1 hugene10sttranscriptcluster.db_8.0.1
[5] org.Hs.eg.db_2.10.1 RSQLite_0.11.4
[7] DBI_0.2-7 AnnotationDbi_1.24.0
[9] limma_3.18.13 oligo_1.26.6
[11] Biostrings_2.30.1 XVector_0.2.0
[13] IRanges_1.20.7 Biobase_2.22.0
[15] oligoClasses_1.24.0 BiocGenerics_0.8.0
[17] BiocInstaller_1.12.0
loaded via a namespace (and not attached):
[1] affxparser_1.34.2 affyio_1.30.0 bit_1.1-11
[4] bitops_1.0-6 caTools_1.16 codetools_0.2-8
[7] ff_2.2-12 foreach_1.4.1 gdata_2.13.2
[10] GenomicRanges_1.14.4 gtools_3.3.1 iterators_1.0.6
[13] KernSmooth_2.23-12 preprocessCore_1.24.0 splines_3.0.3
[16] stats4_3.0.3 tcltk_3.0.3 tools_3.0.3
[19] XML_3.95-0.2 xtable_1.7-3 zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.