[BioC] Queries about getting annotation post-Limma analysis

Wed May 15 17:02:29 CEST 2013

Hi Jeremy,

On 5/14/2013 8:35 PM, Jeremy Ng wrote:
> Hi James and Paul,
>
> Thanks really much for the advice (and thanks James for the package, it
> works like a charm)
>
> Just want to clarify (about the Affymetrix Human Exon array as a whole); it
> seems to be that there are no available db packages that can be used? (the
> closest to an annotation package for use is the pd.huex.1.0.st.v2 package,
> which is used for Oligo RMA normalization).

Not exactly. What you have to understand about the exon arrays is that 
they are really complex, and any analysis is not likely to be simple or 
straightforward. This is, IMO, why they now have the ability to 
summarize at the transcript level, so people can do fairly naive analyses.

The reason the exon arrays are complex has to do with a couple of 
factors. First, there is a ton of speculative content on the chip, where 
the evidence for transcription at a given site is fairly speculative. 
Back in the day, you could only summarize at the probeset level, and the 
only difference between core, full, and extended was how much 
speculative content was included.

Second, unlike the 3'-biased arrays, the Exon (and Gene ST) arrays share 
probes between probesets, and the probes may target multiple regions. In 
other words, for the 3'-biased arrays Affy tried to ensure that a given 
probe and/or probeset only bound to a single region, so when you got 
binding on the array you could attribute that binding to a single 
transcript. In addition, a given probe was only part of a single 
probeset. For the Exon and Gene arrays, a single probe may be part of 
many different probesets. In addition, there may be multiple probesets 
that target a single exon.

It wouldn't be difficult to make an annotation package for the Exon 
arrays, but I don't think it is really advisable. This would 
intentionally mask the underlying complexity of the arrays, and allow 
naive users to just Do Something(tm) and then report the results. 
Instead, if you are going to analyze your data at the probeset level, 
you should really be using the annmap package. See for example, the 
annmap cookbook:

http://bioconductor.org/packages/2.12/bioc/vignettes/annmap/inst/doc/cookbook.pdf

in particular, starting on the bottom of page 10.

You will see right away that analyzing the Exon arrays at the probeset 
level is really complicated.

If you still want to do something really naive, building the .db package 
isn't that difficult. You need the probeset.csv annotation file from 
Affy, and the AnnotationForge package. You will then need to parse 
whatever column has the correct annotation (it is the mrna_assignment 
column in the transcript.csv file, so maybe the same column?).

That column will likely have something like this:

ENST00001 // some // cruft // here /// NM_0329091 // different // cruft 
// here /// etc etc

In general you want the NM_ or NR_ (RefSeq) or GenBank IDs, but those 
are sort of hard to get as they can be lots of different things. You can 
also use the Ensembl transcript IDs, but will have to map. I usually 
parse like so:

dat <- read.csv("probesest filename goes here", comment.char = "#", 
stringsAsFactors=FALSE, na.string = "---")
mrna <- sapply(strsplit(dat$mrna_assignment, " // | /// "), function(x) 
grep("^NM|^NR|^[A-E][A-Z]+[0-9]+", x, value = TRUE)[1])

So the mrna vector will contain RefSeq IDs, or GenBank if there were no 
RefSeq, or Ensembl Transcript IDs if none of the former two, or NA if 
there wasn't anything. You can then extract the Ensembl Transcript IDs 
and map to RefSeq or GenBank

ens <- grep("^ENST", mrna, value = TRUE)
library(org.Hs.eg.db)
ens <- select(org.Hs.eg.db, ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
ens <- ens[!duplicated(ens[,1]),]
## use accnum if refseq is NA
ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
## put mapped data back in mrna vector
mrna[match(ens[,1], mrna)] <- ens[,2]
## write out
write.table(cbind(dat[,1], mrna), "huex_mapper.txt", sep = "\t", quote = 
FALSE, row.names = FALSE, col.names  = FALSE)

Then use this file as input to makeDBPackage() from AnnotationForge.

Best,

Jim

Best,

Jim

>
> I asked because other than summarization at the transcript level, I would
> also be doing summarization at the probeset level, and then again,
> post-limma, I would need to retrieve the annotations for the probes for
> further analysis.
>
> Thanks for your time, and thanks James for the package!
>
> Jeremy
>
>
> On Wed, May 15, 2013 at 3:28 AM, James W. MacDonald<jmacdon at uw.edu>  wrote:
>
>> Hi Paul (and Jeremy),
>>
>> On 5/14/2013 3:11 PM, Paul Shannon wrote:
>>
>>> Hi Jeremy,
>>>
>>>   how can I map the Affy ID which is found in the results from topTable to
>>>> an ENSEMBL and an ENTREZ gene ID
>>>>
>>> The bioc annotation package "hugene10stprobeset.db" and the "select"
>>> interface should provide all of you need.
>>>
>> Jeremy is using the Human Exon ST array, and summarizing at the transcript
>> level. So he needs the (non-existant) huex10sttranscriptcluster.db package
>> to do what you suggest.
>>
>> I am building such a beast as we speak, but this array has almost 340K
>> probesets when summarized at the transcript level, so this is going
>> sloooowly.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>>        biocLite("hugene10stprobeset.**db")
>>>        library(hugene10stprobeset.db)
>>>
>>>           # what kinds of data (what columns) are store in this annotation
>>> package?
>>>        keytypes(hugene10stprobeset.**db)
>>>
>>>           # do a quick survey of each column
>>>        for(key in keytypes(hugene10stprobeset.**db)){
>>>            print(paste("---", key))
>>>            print(head(keys(**hugene10stprobeset.db, keytype=key)))
>>>            }
>>>
>>>            # get a random sample of probe ids to use for testing
>>>        sample.probe.ids<- sample(keys(**hugene10stprobeset.db,keytype=**"PROBEID"),
>>> size=10)
>>>
>>>            # look these up using the select command. a data.frame is
>>> returned
>>>        select(hugene10stprobeset.db, keys=sample.probe.ids,
>>> cols=c("ENTREZID", "ENSEMBL", "SYMBOL"))
>>>           PROBEID ENTREZID         ENSEMBL SYMBOL
>>>        1  8165444    29952 ENSG00000176978   DPP7
>>>        2  7970230    23263 ENSG00000126217  MCF2L
>>>        3  8045081     2840 ENSG00000144230  GPR17
>>>        4  7989809    54878 ENSG00000074603   DPP8
>>>        5  8015557    23415 ENSG00000089558  KCNH4
>>>        6  7930787     5406 ENSG00000175535  PNLIP
>>>        7  7984142     9960 ENSG00000140455   USP3
>>>        8  7894624<NA>              <NA>     <NA>
>>>        9  8170511    79057 ENSG00000130032  PRRG3
>>>        10 8105007    55100 ENSG00000082068  WDR70
>>>
>>>
>>>    - Paul
>>>
>>>
>>>
>>> On May 14, 2013, at 8:42 AM, Jeremy Ng wrote:
>>>
>>>   Dear all,
>>>> Following the RMA normalization of data from Affy Human Exon ST1.0 array
>>>> using the package Oligo (at the transcript level using target="core"), I
>>>> then conducted a limma analysis.
>>>>
>>>> The topTable argument in limma would then retrieve the top genes (in my
>>>> case, cause I am interested in subsequently doing GSEA analysis, I set
>>>> number=100) which are differentially addressed.
>>>>
>>>> The question I have is how can I map the Affy ID which is found in the
>>>> results from topTable to an ENSEMBL and an ENTREZ gene ID. Intuitively,
>>>> biomaRt comes to mind, and I did a biomaRt query for the list of top 100
>>>> genes which I had gotten, but I get only 14 hgnc symbols. I'd like to
>>>> think
>>>> that it's due to a lack of annotations, but I highly doubt so (14 in a
>>>> list
>>>> of 100 seems too little to me).
>>>>
>>>> My code for biomaRt is as follows:
>>>> mart<- useMart("ensembl", "hsapiens_gene_ensembl")
>>>>
>>>> hgnc<- getBM(attributes=c("hgnc_**symbol",
>>>> "ensembl_gene_id"),values=**top100$ID, filters="affy_huex_1_0_st_v2",
>>>> mart=mart)
>>>>
>>>> I was wondering 2 things:
>>>> 1. Is there any plausible explanation to why the query only returns 14
>>>> IDs;
>>>> and
>>>> 2. Are there other ways that I can use to fetch annotations from a
>>>> post-Limma analysis?
>>>>
>>>> My session info is as follows:
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> Thanks for any advice!
>>>>
>>>> Jeremy
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099