[BioC] Affymetrix mouse 430_2 array - annotation problem

Tue Jul 22 18:49:45 CEST 2014

Hi Xiayu,

On 7/22/2014 12:15 PM, Rao,Xiayu wrote:
> Hi, Jim
>
> Thanks a lot for your previous help! I now have an annotation problem.
>
> I used select to annotate as you suggested.
>> fData(eset) <- select(mouse4302.db, featureNames(eset),c("SYMBOL","GENENAME","ENTREZID"))
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' resulted in 1:many mapping between keys and return rows

Hmm. My bad - I somehow thought the mouse4302 array had no multiple 
mapping probes.

>
> (1) Regarding the warning message, I read in the forum that you suggested to remove the duplicates or collapse them to comma-separated vectors and then incorporate. So for my condition, should I do
> fData(eset) <- fData(eset)[!duplicated(fData(eset)$PROBEID),]

Oh heck no! Don't do that. You want to do this in two steps:

gns <- select(mouse4302.db, featureNames(eset),
              c("SYMBOL","GENENAME","ENTREZID"))

and then

fData(eset) <- gns[!duplicated(gns[,1]),]
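
To make the two steps concrete, here is a minimal base R sketch on a made-up table. The probe IDs and symbols are invented for illustration, and 'gns' only stands in for what select() would return; nothing here touches mouse4302.db or a real ExpressionSet.

```r
# Toy stand-in for select() output: probe_b maps to two genes (1:many).
# All IDs and symbols below are invented for illustration.
feature_ids <- c("probe_a", "probe_b", "probe_c")
gns <- data.frame(
  PROBEID = c("probe_a", "probe_b", "probe_b", "probe_c"),
  SYMBOL  = c("GeneA", "GeneB1", "GeneB2", "GeneC"),
  stringsAsFactors = FALSE
)

# Step 2: keep only the first row for each probeset
first_hit <- gns[!duplicated(gns$PROBEID), ]
rownames(first_hit) <- first_hit$PROBEID

# Sanity check before assigning anything: same IDs, same order
stopifnot(identical(first_hit$PROBEID, feature_ids))
```

With the real data, 'feature_ids' would be featureNames(eset), and the same check is worth running before the fData(eset) assignment.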

> OR
> eset2 <- tapply(fData(eset)$ENTREZID, fData(eset)[,1], paste, collapse = ",")

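Same idea applies here; do this in two steps: build 'gns' with select() 
first, run tapply() on that, and only then put the result into 
fData(eset).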

> OR
> Can I just ignore the warning and do nothing, as I want to leave everything there as generated by select()??
>

No, unfortunately you cannot ignore the warnings. If you generate a 
'gns' data.frame as I show above, and then check the number of rows 
prior to subsetting, you will note that there are more rows than you 
have in your ExpressionSet, so just stuffing it into the ExpressionSet 
will result in mismatched annotations (and trying to fix that after the 
fact won't work).
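
The row-count check can be sketched in base R as follows. The table is invented for illustration; with the real objects you would compare nrow(gns) against length(featureNames(eset)). The positional comparison at the end is only meant to show how one extra row shifts everything after it.

```r
# Invented example: 3 features, but the annotation table has 4 rows
feature_ids <- c("probe_a", "probe_b", "probe_c")
gns <- data.frame(
  PROBEID = c("probe_a", "probe_b", "probe_b", "probe_c"),
  SYMBOL  = c("GeneA", "GeneB1", "GeneB2", "GeneC"),
  stringsAsFactors = FALSE
)

# Number of extra rows introduced by 1:many mappings
n_extra <- nrow(gns) - length(feature_ids)

# Probesets that appear more than once
multi <- unique(gns$PROBEID[duplicated(gns$PROBEID)])

# Lining the table up by position: rows after the first duplicated
# probeset no longer correspond to the feature in the same position
aligned <- gns$PROBEID[seq_along(feature_ids)] == feature_ids
```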

You can do either of the above suggestions. I tend to do the first, 
because I like to use ReportingTools to make HTML tables, and I also 
like to generate links for the Gene IDs, which is a bit more difficult 
if you do comma-separated IDs (not insurmountable, mind you, just more 
difficult).

Plus, the gene names can be long and may already contain commas, so 
you might want to use pipe (|) separators or something else. And if you 
have, like, four or five genes for a given probeset, you end up with a 
whole paragraph of gene names. Nobody likes that.
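
If you do go the collapsing route, a pipe-separated version might look like the base R sketch below. The table is invented (note the comma inside one gene name), and split()/vapply() is used here in place of tapply() purely as a matter of taste; treat it as an illustration rather than the one right way.

```r
# Invented stand-in for select() output; one gene name contains a comma
feature_ids <- c("probe_a", "probe_b", "probe_c")
gns <- data.frame(
  PROBEID  = c("probe_a", "probe_b", "probe_b", "probe_c"),
  SYMBOL   = c("GeneA", "GeneB1", "GeneB2", "GeneC"),
  GENENAME = c("made-up protein A", "made-up protein B, subunit 1",
               "made-up protein B, subunit 2", "made-up protein C"),
  stringsAsFactors = FALSE
)

# Collapse one annotation column to a single string per probeset
collapse_col <- function(values, ids, sep = " | ") {
  vapply(split(values, ids), paste, character(1), collapse = sep)
}

# Build a one-row-per-probeset table, ordered like the features
collapsed <- data.frame(
  PROBEID  = feature_ids,
  SYMBOL   = unname(collapse_col(gns$SYMBOL, gns$PROBEID)[feature_ids]),
  GENENAME = unname(collapse_col(gns$GENENAME, gns$PROBEID)[feature_ids]),
  stringsAsFactors = FALSE
)
```

Indexing the collapsed vectors by 'feature_ids' matters because split() orders its groups by sorted ID, which need not be the order of the features.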

Another alternative is to randomize which one you choose (if you do the 
gns[!duplicated(gns[,1]),] business, you are selecting the first 
annotation for each probeset that has more than one).
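
One way to randomize, sketched in base R on an invented table: shuffle the rows, keep the first row per probeset, then put the result back in feature order. The set.seed() call is there only so the sketch is reproducible.

```r
# Invented stand-in for select() output
feature_ids <- c("probe_a", "probe_b", "probe_c")
gns <- data.frame(
  PROBEID = c("probe_a", "probe_b", "probe_b", "probe_c"),
  SYMBOL  = c("GeneA", "GeneB1", "GeneB2", "GeneC"),
  stringsAsFactors = FALSE
)

set.seed(1)  # reproducibility of the sketch only

# Shuffle rows so that "first per probeset" becomes a random pick
shuffled <- gns[sample(nrow(gns)), ]
random_hit <- shuffled[!duplicated(shuffled$PROBEID), ]

# Restore the original feature order
random_hit <- random_hit[match(feature_ids, random_hit$PROBEID), ]
```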

>
> (2) It is strange to see that for the topTable, the row names and the first column (PROBEID) do not match. As you can see below, 1436717_x_at and 1435289_at are different for the 1st row. Why?
>> topTableF(fit2, adjust="BH")
>                 PROBEID SYMBOL                          GENENAME ENTREZID M129.15-M129.13
> 1436717_x_at 1435289_at Engase endo-beta-N-acetylglucosaminidase   217364       -1.946299
> 1436823_x_at 1435390_at   Eri2                 exoribonuclease 2    71151       -1.975441
>
>              M129.17-M129.15   AveExpr         F      P.Value    adj.P.Val
> 1436717_x_at     -6.32963614 11.009177 3145.6769 8.379499e-17 3.499204e-12
> 1436823_x_at     -6.46817108 10.999412 2832.7874 1.551719e-16 3.499204e-12

Exactly. Those are the mismatched annotations I mentioned above.

Best,

Jim

>
>
> Thanks,
> Xiayu
>
>
>
>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Monday, July 21, 2014 11:43 AM
> To: Rao,Xiayu; 'bioconductor at r-project.org'
> Subject: Re: [BioC] Affymetrix mouse 430_2 array - gene expression and annotation
>
> Hi Xiayu,
>
>> 2) and add annotation thereafter? For the transcript level annotation,
>> I have used the following code before. But not sure for this mouse
>> array, is there a similar way or similar transcript database to do
>> such? I know there is a database called mouse4302.db.
>> ID <- featureNames(geneCore2)
>> Symbol <- getSYMBOL(ID,"hugene10sttranscriptcluster.db")
>> fData(geneCore2) <- data.frame(ID=ID,Symbol=Symbol)
>
> This is an old way of annotating things, and has been superseded (for like five years now) by a more compact API:
>
> fData(geneCore2) <- select(mouse4302.db, featureNames(geneCore2), "SYMBOL")
>
> And note you can add in other more useful things like the Gene ID as well (while biologists tend to like HUGO symbols, they are not, as advertised, actually unique things, so you always run the risk of thinking you have <a gene you care about> when in fact you are looking at the data for <some other gene with the same HUGO symbol>).
>
> fData(geneCore2) <- select(mouse4302.db, featureNames(geneCore2),
> c("SYMBOL","GENENAME","ENTREZID"))
>
>
> Best,
>
> Jim
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099