[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Naxerova, Kamila naxerova at fas.harvard.edu
Thu Mar 7 19:36:04 CET 2013


Haha, I don't mind rubbing against that belly. It's like playing a game... tyring to figure out how to get to the 40th level to face the "final enemy".

Your strategy worked, thank you! I am including all code up to RMA normalization (yay! I am there!) below, perhaps it will save somebody a few hours of work.

library(pdInfoBuilder)

baseDir <- "/Users/naxerova/Documents/xxx"
(pgf <- list.files(baseDir, pattern = ".pgf",
full.names = TRUE))
(clf <- list.files(baseDir, pattern = ".clf",
full.names = TRUE))
(prob <- list.files(baseDir, pattern = ".probeset.csv",
full.names = TRUE))
mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)
trans <- list.files(baseDir, pattern="transcript",full.names=TRUE)

seed <- new("AffyGenePDInfoPkgSeed",
pgfFile = pgf, clfFile = clf, coreMps=mps, transFile=trans,
probeFile = prob, author = "Kamila Naxerova",
email = "naxerova at fas.harvard.edu",
biocViews = "AnnotationData",
organism = "Mouse", species = "Mus Musculus")
makePdInfoPackage(seed, destDir = ".")

## This is what the beginning of your output should look like
Building annotation package for Affymetrix Gene ST Array
PGF.........: MoGene-2_0-st.pgf
CLF.........: MoGene-2_0-st.clf
Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
Transcript..: MoGene-2_0-st-v1.na33.mm10.transcript.csv
Core MPS....: MoGene-2_0-st.mps

install.packages("/Users/naxerova/pd.mogene.2.0.st/", repos=NULL, type="source")

> abatch <- read.celfiles(list.celfiles())
Loading required package: pd.mogene.2.0.st
Platform design info loaded.
Reading in : xxx.CEL
[etc.]
> eset <- rma(abatch)
Background correcting
Normalizing
Calculating Expression

On Mar 7, 2013, at 11:03 AM, James W. MacDonald <jmacdon at uw.edu> wrote:

> Wow. This is really an education on the vast unwashed underbelly of 
> BioC, no?
> 
> There is a file called MoGene-2_0-st.mps that came in the zip file you 
> downloaded. Add
> 
> mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)
> 
> and then
> 
> coreMps = mps
> 
> when you create your AffyGenePDInfoPkgSeed. This file is used to 
> distinguish between the probeset and transcript probe mappings.
> 
> Best,
> 
> Jim
> 
> 
> 
> On 3/7/2013 10:36 AM, Naxerova, Kamila wrote:
>> Thanks Jim. Of course the annotation package does not contain probe -->  probe set information. What was I thinking?!??
>> 
>> What I had not realized was that I needed to build the pd.mogene.2.0.st package myself first, because it also does not exist on Bioconductor. So I just downloaded all the required files from Affy, but again I am stuck with an error message I don't understand... what is the coreMPS file that gives me the error?
>> 
>>> library(pdInfoBuilder)
>>> baseDir<- "/Users/naxerova/Documents/xxx"
>>> (pgf<- list.files(baseDir, pattern = ".pgf",
>> + full.names = TRUE))
>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.pgf"
>>> (clf<- list.files(baseDir, pattern = ".clf",
>> + full.names = TRUE))
>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.clf"
>>> (prob<- list.files(baseDir, pattern = ".probeset.csv",
>> + full.names = TRUE))
>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st-v1.na33.mm10.probeset.csv"
>>> seed<- new("AffyGenePDInfoPkgSeed",
>> + pgfFile = pgf, clfFile = clf,
>> + probeFile = prob, author = "Kamila Naxerova",
>> + email = "naxerova at fas.harvard.edu",
>> + biocViews = "AnnotationData",
>> + organism = "Mouse", species = "Mus Musculus")
>>> makePdInfoPackage(seed, destDir = ".")
>> ===============================================================================================================================================
>> Building annotation package for Affymetrix Gene ST Array
>> PGF.........: MoGene-2_0-st.pgf
>> CLF.........: MoGene-2_0-st.clf
>> Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
>> Transcript..: TheTranscriptFile
>> Core MPS....: coreMps
>> ===============================================================================================================================================
>> Parsing file: MoGene-2_0-st.pgf... OK
>> Parsing file: MoGene-2_0-st.clf... OK
>> Creating initial table for probes... OK
>> Creating dictionaries... OK
>> Parsing file: MoGene-2_0-st-v1.na33.mm10.probeset.csv... OK
>> Parsing file: coreMps... Error in file(file, "rt") : cannot open the connection
>> In addition: Warning message:
>> In file(file, "rt") : cannot open file 'coreMps': No such file or directory
>> 
>> 
>> 
>> 
>> 
>> On Mar 7, 2013, at 10:06 AM, "James W. MacDonald"<jmacdon at uw.edu>  wrote:
>> 
>>> Hi Kamila,
>>> 
>>> On 3/7/2013 9:54 AM, Naxerova, Kamila wrote:
>>>> Dear all,
>>>> 
>>>> I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I created it successfully, but when I try to use it to read in cel files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!
>>> You don't use the annotation package at this step. There are two
>>> packages that are used for the analysis of this chip type. The first is
>>> the pd.mogene.2.0.st.v1 package, which is used by oligo to map probes to
>>> probesets when doing the normalization/summarization step. This package
>>> will be automagically installed if you don't have it, so there is
>>> nothing to be done at the first step but
>>> 
>>> abatch<- read.celfiles(list.celfiles())
>>> eset<- rma(abatch)
>>> 
>>> This will give you the summarized and normalized data at the transcript
>>> level. You then will normally fit some model(s) using the modeling
>>> package of your choice, and then might want to output a set of
>>> significant genes, at which time you will use the
>>> mogene20sttranscriptcluster.db package to map probeset IDs to gene
>>> information.
>>> 
>>> Best,
>>> 
>>> Jim
>>> 
>>> 
>>>>> abatch<- read.celfiles(list.celfiles(),pkgname="mogene20sttranscriptcluster.db")
>>>> Platform design info loaded.
>>>> Reading in : xxx.CEL
>>>> Reading in : xxx.CEL
>>>> Reading in : xxx.CEL
>>>> [... more cel files listed]
>>>> 
>>>> Error in function (classes, fdef, mtable)  :
>>>>   unable to find an inherited method for function ‘kind’ for signature ‘"ChipDb"’
>>>> 
>>>> Thanks
>>>> Kamila
>>>> 
>>>> On Mar 6, 2013, at 6:16 PM, "Naxerova, Kamila"<naxerova at fas.harvard.edu>   wrote:
>>>> 
>>>>> Dear Christian and Jim,
>>>>> 
>>>>> many thanks to both of you for your explanations.
>>>>> 
>>>>> Your hard work paid off, and I have finally understood everything and managed to build my annotation package!!!! I wrote a little script similar to what Jim was suggesting, namely picking the first RefSeq-like thing I came across. Jim called it "naive" -- but I think there is no downside to this approach, right? I have looked at various examples in the Affy file for a long time, and simply picking the first Refseq ID seems to be kosher.
>>>>> 
>>>>> data<-read.csv("MoGene-transcript-noheader.csv",header=T,stringsAsFactors=F,sep=",")
>>>>> sdata<- data[,c(1,9)]
>>>>> 
>>>>> returnRef=function(x){
>>>>> 	refst<- strsplit(x,split="///")[[1]][grep("RefSeq",strsplit(x,split="///")[[1]])[1]]
>>>>> 	refid<- gsub(" ","",strsplit(refst,split="//")[[1]][1])
>>>>> 	return(refid)
>>>>> }
>>>>> 
>>>>> sdata$refseqids<- sapply(sdata[,2],returnRef)
>>>>> fdata<- sdata[,-2]
>>>>> write.table(fdata,"AnnotBuild.txt", sep="\t",quote=F,row.names=F,col.names=F)
>>>>> 
>>>>> library(AnnotationForge)
>>>>> library(mouse.db0)
>>>>> library(org.Mm.eg.db)
>>>>> makeDBPackage("MOUSECHIP_DB",
>>>>> affy=F,
>>>>> prefix="mogene20sttranscriptcluster",
>>>>> fileName="AnnotBuild.txt",
>>>>> outputDir = ".",
>>>>> version="2.11.1",
>>>>> baseMapType="refseq",
>>>>> manufacturer = "Affymetrix",
>>>>> chipName = "Mouse Gene 2.0 ST Array",
>>>>> manufacturerUrl = "http://www.affymetrix.com",
>>>>> author = "Kamila Naxerova",
>>>>> maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>>>>> 
>>>>>> install.packages("mogene20sttranscriptcluster.db",repos=NULL, type="source")
>>>>> * installing *source* package ‘mogene20sttranscriptcluster.db’ ...
>>>>> ** R
>>>>> ** inst
>>>>> ** preparing package for lazy loading
>>>>> ** help
>>>>> *** installing help indices
>>>>> ** building package indices
>>>>> ** testing if installed package can be loaded
>>>>> *** arch - i386
>>>>> *** arch - x86_64
>>>>> 
>>>>> * DONE (mogene20sttranscriptcluster.db)
>>>>> 
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>> 
> 
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
> 



More information about the Bioconductor mailing list