[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Wed Mar 6 21:28:04 CET 2013

Hi Guido,

It wouldn't be difficult. Starting from something like

awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}' 
MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,3-6 > tmp.csv

and then in R

dat <- read.csv("tmp.csv", header = FALSE, stringsAsFactors = FALSE)
dat <- dat[dat[,2] != "---",]
gr <- GRanges(dat[,2], IRanges(start = dat[,4], end = dat[,5]), 
strand=dat[,3])
library(Mus.musculus)
gr <- GRanges(dat[,2], IRanges(start = dat[,4], end = dat[,5]), 
strand=dat[,3], probeset = dat[,1])
mcols(gr)$egid <- names(tx)[findOverlaps(gr, tx, select="first")]
 > head(as.data.frame(mcols(gr)))
   probeset egid
1 17210850 <NA>
2 17210852 <NA>
3 17210855   18777
4 17210869   21399
5 17210883 <NA>
6 17210887  108664

Best,

Jim

On 3/6/2013 10:55 AM, Hooiveld, Guido wrote:
> Just an additional thought: would it not be most straight-forward and unambiguous to use the info on genome location available in the NetAffx file to create the annotation file? If so, what would be the best approach of doing so? Just using one of the TxDb's for this? I don't have any experience with this.
> G
>
> First relevant line from HuGen ST 2.0 file
> probeset_id	seqname	strand	start	stop
> 16657437	chr1	+	12200	12224
>
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Hooiveld, Guido
> Sent: Wednesday, March 06, 2013 16:29
> To: 'James W. MacDonald'; Naxerova, Kamila
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays
>
> Hi James and others,
> Since I will soon get my first set of Human Gene ST v2.0 arrays, your remark about creating a BioC-compatible annotation library for such new arrays was of particular interest for me. I therefore tried to generate such library based on the code you suggested. However, it doesn't work (see below): the CSV annotation file downloaded from Affymetrix has no column labeled "Representative Public ID", as is expected by the function makeBaseMaps. Note that the NetAffx annotation csv file for e.g. the (old) human HGU133plus2 arrays does indeed contain such column. Moreover, I also noticed that for the Gene ST arrays target sequences have been derived from multiple databases, including both NCBI and ENSEMBL. As a consequence the NetAffx annotation file for the Gene ST arrays contain in the column "gene_assignment" or "mrna_assignment" multiple ID types. Would the library 'annotationForge' be able to handle this at all?
>
> Thanks,
> Guido
>
> BTW: in case it is relevant: the NetAffx annotation file for the 3'-IVT array (HGU133plus2) was build according to " #%netaffx-annotation-tabular-format-version=1.0", whereas the file for the HuGene ST v2.0 was build according to " #%netaffx-annotation-tabular-format-version=1.1"
>
>
>> makeDBPackage("HUMANCHIP_DB",
> + affy=TRUE,
> + prefix="hugene20sttranscriptcluster",
> + fileName="HuGene-2_0-st-v1.na33.hg19.probeset.csv",
> + outputDir = ".",
> + version="2.11.1",
> + manufacturer = "Affymetrix",
> + chipName = "Human Gene 2.0 ST Array",
> + manufacturerUrl = "http://www.affymetrix.com", author = "N.N.",
> + maintainer = "me at mail.com")
>          Error in `[.data.frame`(csvFile, , GenBankIDName) :
>    undefined columns selected
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] human.db0_2.8.0       AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0
> [4] RSQLite_0.11.2        DBI_0.2-5             AnnotationDbi_1.20.3
> [7] Biobase_2.18.0        BiocGenerics_0.4.0
>

> loaded via a namespace (and not attached):
> [1] IRanges_1.16.6  parallel_2.15.2 stats4_2.15.2
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of James W. MacDonald
> Sent: Tuesday, March 05, 2013 23:23
> To: Naxerova, Kamila
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays
>
> Hi Kamila,
>
> On 3/5/2013 4:45 PM, Naxerova, Kamila wrote:
>> Dear all,
>>
>> I am analyzing a set of Affymetrix Mouse Gene 2.0 ST arrays. I am quite familiar with 3'-biased chips, but this is my first time looking at data from WT arrays. I have a few general questions -- any advice would be appreciated to speed up my learning process.
>>
>> 1) I have already read on this mailing list that the good old affy package does not work well with WT arrays (can anybody point me to any literature on why that is?). So I have installed the oligo and xps packages -- what are the advantages/disadvantages for each? Any opinions on which one is the right "starter kit"?
> The affy package was never intended to work with these arrays - it was designed specifically for the 3' biased arrays, which had pre-defined probesets, and which didn't share probes between probesets. In addition, the makecdfenv package is designed to work with the old style CDF packages, and Affy has never released a CDF for these new chips that they are willing to support in any meaningful way.
>
> There were some changes made to the affy package in order to accommodate the fact that probes could be shared between probesets, and it is possible to use functions in affxparser to re-create conventional CDF packages using the newer pgf and clf files. So hypothetically you could still use the affy package (and hypothetically you could still use an Apple IIe for all your computing needs, but that's crazy, so let's move on).
>
> I don't think you will find much difference between oligo and xps, other than the fact that xps requires the additional installation of ROOT. You might play around with both and see which suits you better.
>
> I should throw in my obligatory cautionary statement about summarizing Gene ST data at the probeset (as compared to the transcript) level. If you look at the number of probes/probeset, there are a huge number with<  4 probes. So hypothetically you can do this, but I wouldn't.
>
>> 2) I see with some dread that there seems to be no annotation package for the 2.0 array yet. I have never built my own... any quick bullet points on how I would go about doing that for a WT array?
> No dread should be required. All you need to do is get the transcript-level annotation file from Affy
> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene/MoGene-2_0-st-v1.na33.mm10.transcript.csv.zip)
> and then the AnnotationForge, mouse.db0, and org.Mm.eg.db packages. Then something like
>
> library(AnnotationForge)
> library(mouse.db0)
> library(org.Mm.eg.db)
> makeDBPackage("MOUSECHIP_DB",
> affy=TRUE,
> prefix="mogene20sttranscriptcluster",
> fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> outputDir = ".",
> version="2.11.1",
> manufacturer = "Affymetrix",
> chipName = "Human Gene 2.1 ST Array",
> manufacturerUrl = "http://www.affymetrix.com", author = "Kamila Naxerova", maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>
> should do the trick. You can then install directly from within R by
>
> install.packages("mogene20sttranscriptcluster.db", repos=NULL,
> type="source")
>
> And see
> http://bioconductor.org/packages/2.11/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf
>
>
>> 3) It seems that RMA is also used for normalization of WT arrays, so that part I am comfortable with. But are there any differences in preprocessing between 3' and WT arrays that I should watch out for?
> Not really. I don't use xps, so cannot say for certain how you do things with that package, but with oligo it's a simple
>
> abatch<- read.celfiles(list.celfiles()) eset<- rma(abatch)
>
> To normalize and summarize at the transcript level. Note however that the annotation for the resulting ExpressionSet will be the
> pd.mogene.2.0.st.v1 package, and if you use annotation(eset) in any further calls to do gene annotation, it won't work out. You need to first do
>
> annotation(eset)<- "mogene20sttranscriptcluster.db"
>
> One further note: the intronic controls (especially) have an irritating habit of popping up in lists of differentially expressed genes. This is IMO likely due to mRNA that has not been fully processed to excise the introns, but regardless, these probesets tend to have no annotation at all, so are not useful without extra work to figure out what they are supposed to be measuring. My usual MO is to just summarily excise them after e.g., the eBayes() step of an analysis using limma. If you are interested, there is a function in the affycoretools package called
> getMainProbes() that will do this for you.
>
> Best,
>
> Jim
>
>
>> Thanks so much!
>> Kamila
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099