[BioC] annotation of a probe in hgu133plus2

Mayte Suarez-Farinas farinam at mail.rockefeller.edu
Sun Jan 12 22:02:32 CET 2014


Hi James
Thanks for guiding me in the right direction. I am updating my pipeline to accommodate this. 
My only problem was the line 
publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks))
gave me a msg error
Error in .self$prepare(value, .toHTML = .toHTML, .toDF = .toDF, .modifyDF = .modifyDF,  : 
  object 'affyLinks' not found

or equivalently, 

publish(tt, htab, .modifyDF = list(entrezLinks ,affyLinks ))
Error in .self$prepare(value, .toHTML = .toHTML, .toDF = .toDF, .modifyDF = .modifyDF,  : 
  object 'entrezLinks' not found

I am sure this is an oversight of my part but after spending some Sunday hours trying to figure it out, I decided to ask you.
Best,
mayte

> 
> On Thursday, January 09, 2014 2:09:07 PM, Mayte Suarez-Farinas wrote:
>> Hi Jim
>> 
>> Thanks for your answer. i am using the annaffy functions to crete the
>> the annotation tables. can you suggest a newer and easy package/set of
>> functions
>> to do this, avoiding getting a NA in probes with 2 entrezid??
> 
> Well, I can make some recommendations, but I don't have any easy suggestions. The reason these are coming up as NA in the first place is because it isn't clear what you are measuring with this probeset, or any other probeset that interrogates more than one gene. In fact you could have a scenario where one sample type is only expressing DEFB4A and the other is only expressing DEFB4B, but you would think there isn't a difference!
> 
> For this gene it may not matter - I don't know anything about this gene - but whatever you do, you will have to make some assumptions about ALL of the multiply-mapped probesets, and what might be reasonable for one gene may not be reasonable at all for another gene.
> 
> That is one of the reasons that MBNI came up with the re-mapped cdfs, so you can have some assurance that you are measuring the transcript from a single gene without confounding. So that is one thing you could consider, switching to the MBNI re-mapped cdf for this array. There are caveats to using those cdfs as well (mainly that the number of probes in a probeset varies widely, so the accuracy of your expression estimate varies as well. However, when you make comparisons you don't then take this variance into account, unless you use something like the puma package).
> 
> These days I am using ReportingTools rather than annaffy. But just switching still won't help, as ReportingTools uses similar methods to annotate. Instead you will have to do some work yourself first. Let's assume you are using limma to do the analysis, and you are creating an MArrayLM object called 'fit2'.
> 
> fit <- lmFit(eset, design)
> fit2 <- contrasts.fit(fit, contrast)
> fit2 <- eBayes(fit2)
> 
> Now we can add annotations to the fit2 object so topTable() will append the annotations.
> 
> gns <- select(hgu133plus2.db, featureNames(eset), c("ENTREZID","SYMBOL","GENENAME"))
> 
> You can add other things as well. The problem now is that the gns data.frame is not the right dimension because of the multiply-mapped probesets. This is where the tough decisions come in. As I see it you can:
> 
> 1.) Subset gns to have just one row per probeset.
>    a.) by doing something really naive like taking just the first instance of a probeset
>    b.) or by randomly selecting one of the multiple mappings
> 2.) Collapse multiple probesets so e.g. your favorite gene would now be
> 
> 207356_at DEFB4A|DEFB4B 1673|100289462
> 
> and you wouldn't lose any information. However, it will be trickier to make links to e.g., Entrez Gene for these cells if you care about those things. If you do 1a or 1b (let's use 1a as an example), you can do:
> 
> library(affycoretools) ## shameless plug, I know
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
> 
> tab <- topTable(fit2, coef = 1, <other arguments go here>)
> 
> htab <- HTMLReport("coef1", "Data for contrast 1", reportDirectory = "reports/")
> publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks))
> finish(htab)
> indx <- HTMLReport("index", "Main page")
> publish(Link(htab), indx)
> finish(indx)
> 
> 
> And then you will have a file called index.html that you can double click, and on that page will be a link you can use to see your data. The link will bring up an HTML table and there will be links to netaffx and Entrez Gene for the probeset IDs and Gene IDs (that's why you load affycoretools; to get the entrezLinks and affyLinks functions).
> 
> If you want to go through door #2, you won't be able to make links to Entrez Gene without creating a function of your own. Instead you could forgo those links and just do
> 
> gnlst <- tapply(1:nrow(gns), gns$PROBEID, function(x) gns[x,])
> gnlst <- lapply(gnlst, function(x) apply(x, 2, function(y) paste(unique(y), collapse = " | ")))
> gns <- do.call("rbind", gnlst)
> 
> fit2$genes <- gns
> 
> and everything else the same as above except no entrezLinks for .modifyDF.
> 
> Best,
> 
> Jim
> 
>> 
>> thanks!!!
>> 
>> Mayte Suarez-Farinas
>> Research Assistant Professor
>> Laboratory of Investigative Dermatology
>> Biostatistician, Center for Clinical and Translational Science
>> The Rockefeller University
>> 1230 York Ave, Box 178
>> New York, NY 10065
>> Phone:  +1(212) 327-8213
>> Fax:       +1(212) 327-8232
>> 
>> 
>> 
>> On Jan 9, 2014, at 1:11 PM, James W. MacDonald wrote:
>> 
>>> Hi Mayte,
>>> 
>>> The gene isn't missing, instead this probeset interrogates two
>>> different genes. Plus you are using really old methods to get the
>>> data you want. These days you should use select().
>>> 
>>> > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID"))
>>>   PROBEID SYMBOL  ENTREZID
>>> 1 207356_at DEFB4A      1673
>>> 2 207356_at DEFB4B 100289462
>>> Warning message:
>>> In .generateExtraRows(tab, keys, jointype) :
>>> 'select' resulted in 1:many mapping between keys and return rows
>>> 
>>> And you get a warning saying that this probeset maps to more than one
>>> gene and symbol.
>>> 
>>> Best,
>>> 
>>> Jim
>>> 
>>> On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote:
>>>> Dear annotation builders
>>>> 
>>>> I notice that one of my favorite genes DEFB4A is missing from the
>>>> hgu133plus2.db
>>>> the NetAffy database says that probeset 207356_at correspond to gene
>>>> DEFB4A
>>>> yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen
>>>> in newer versions of
>>>> this package, since it has been tehre before.. This gene is
>>>> extremely important in
>>>> lots skin diseases, inflammation etc, so it is a concern for my
>>>> colleagues
>>>> Any help is appreciated,
>>>> 
>>>> Mayte Suarez-Farinas
>>>> Research Assistant Professor
>>>> Laboratory of Investigative Dermatology
>>>> Biostatistician, Center for Clinical and Translational Science
>>>> The Rockefeller University
>>>> 1230 York Ave, Box 178
>>>> New York, NY 10065
>>>> Phone:  +1(212) 327-8213
>>>> Fax:       +1(212) 327-8232
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>> 
>> 
> 
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099



More information about the Bioconductor mailing list