[BioC] Mapping Affymetrix annotations with Bioconductor annotations

Fri Oct 4 21:31:39 CEST 2013

Hi Joao,

There isn't a standard way that I am familiar with. But this 
illustrates a conceptual difference between the purpose of these arrays 
and what people end up using them for.
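
For reference, the naive approach asked about below (take the first mRNA
listed for a transcript cluster) amounts to something like the following
sketch. It assumes that assignments in the mrna_assignment column are
separated by " /// " and the fields within an assignment by " // "; the
helper name firstAccession and the example string are made up for
illustration and do not come from a real annotation file.

```r
## Hypothetical helper: return the first accession listed in one
## mrna_assignment string, i.e. trust the Affymetrix ordering as-is.
firstAccession <- function(x) {
    if (is.na(x)) return(NA_character_)
    ## split into individual assignments, then keep the first field of each
    assignments <- strsplit(x, " /// ", fixed = TRUE)[[1]]
    ids <- vapply(strsplit(assignments, " // ", fixed = TRUE),
                  function(f) f[1], character(1))
    ids[1]
}

## Made-up example: two assignments, from different genes, on one cluster
ex <- "NM_000001 // RefSeq // gene A /// ENSDART00000000002 // ENSEMBL // gene B"
firstAccession(ex)
```

Whether the first entry is the right one remains a judgment call; this
only shows what relying on the existing ordering looks like in practice.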

I have run headlong into this issue lately, trying to create annotation 
packages for the new 2.X ST arrays. The annotations for these arrays 
are primarily directed towards the _transcripts_ that a given probeset 
measures, rather than the underlying gene. So the data we get from 
these arrays are supposed to represent the relative abundance of a 
given transcript, and the 'duplicate' probesets on the array are 
supposed to measure transcript variants (at least I assume this is in 
general true, as the new TAC software is supposed to work with Gene ST 
arrays).

We know that there actually are transcript variants for various genes, 
and that these variants may give rise to phenotypic differences. So it 
may well be interesting to measure these variants and try to figure out 
if they have a meaningful effect on a phenotype we might be interested 
in.

However, 100% of the researchers I come into contact with are 
completely uninterested in such things, and just want to know if there 
are differences in expression at the _gene_ level. This is true BTW for 
RNA-Seq as well. This may have more to do with the crowd I run with, 
rather than the general desires of the average biologist, so I may just 
be suffering from confirmation bias here.

But I think it is a bit ironic that Affymetrix keeps trying to push 
transcript level data on us (Exon arrays, Gene ST arrays, now HTA 
arrays), and we push back just as hard, collapsing all these data to 
gene level. I am not sure if this is a lack of imagination on our part 
or a failure to understand the customer on Affy's part. Or maybe it's 
just that I don't hang with the cool kids.
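
The collapsing step itself can be sketched in a few lines of base R. This
is a minimal sketch, assuming you already have a probeset-by-sample matrix
and one gene ID per row; the names collapseToGene, expr and gene are
hypothetical, and averaging is only one of several possible choices
(keeping the probeset with the largest mean or variance is another).

```r
## Hypothetical helper: average all probesets that map to the same gene.
## expr: numeric matrix, rows = probesets, columns = samples
## gene: character vector of gene IDs, one per row (NA = unannotated)
collapseToGene <- function(expr, gene) {
    keep <- !is.na(gene)
    ## rowsum() adds up the rows within each gene; divide by group size
    sums <- rowsum(expr[keep, , drop = FALSE], gene[keep])
    n <- as.vector(table(gene[keep])[rownames(sums)])
    sums / n
}

## Toy data: three annotated probesets for two genes, one unannotated
expr <- matrix(c(1, 3, 5, 7,
                 2, 4, 6, 8), ncol = 2,
               dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))
gene <- c("geneA", "geneA", "geneB", NA)
collapseToGene(expr, gene)
```

Unannotated probesets are dropped here, which is itself an assumption
about what you want to do with them.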

Best,

Jim

On Friday, October 04, 2013 1:29:36 PM, Joao Sollari Lopes wrote:
> Hi Jim,
>
> Following on the discussion on annotation in Affymetrix Gene ST
> arrays, I wonder if there is a standard way to deal with multiple
> mRNAs (from different genes) that are assigned to the same transcript
> cluster. Is it generally accepted to follow the naive approach of
> picking the first mRNA in the list?
> I know that the mRNA assignments are ranked, so is it safe just to
> rely on the ranking already performed by Affymetrix?
> Joao
>
> On 08/29/2013 04:22 PM, James W. MacDonald wrote:
>> Hi Joao,
>>
>> Unfortunately there are no readily available packages for annotating
>> all the new model organism arrays from Affy. However, the functions
>> to create your own annotation package do exist. If you look at the
>> AnnotationForge package, specifically the SQLForge vignette
>> (http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf),
>> it is pretty straightforward to make your own annotation package.
>>
>> I am assuming you are summarizing at the transcript level, so would
>> want to make a zebgene11sttranscriptcluster.db package. For this you
>> need the transcript csv file from Affy
>> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip).
>> From this you want to generate a two-column file with the probeset ID
>> in the first column, and then GenBank or RefSeq IDs in the second.
>>
>> This is the tough part, as the annotation files need to be parsed to
>> create this file.
>>
>> I wrote an Rscript to parse these files that you could use. It is
>> pretty naive, but seems to do a relatively reasonable job. You will
>> obviously need to change the first line to point to the correct
>> directory, and will have to have the org.Dr.eg.db package installed,
>> but this should work.
>>
>> <copy from below>
>>
>> #!/data/programs/lib64/R/bin/Rscript
>> ## Parse an Affymetrix transcript csv into a two-column file
>> ## (probeset ID, GenBank/RefSeq ID) for input to AnnotationForge.
>> args <- commandArgs(TRUE)
>> if(length(args) < 3)
>>     stop("Usage: parseAffyTranscriptCsv.R <transcript.csv> ",
>>          "<organism.db package> <output file name> ",
>>          "<mrna column header> (optional)\n", call. = FALSE)
>> probefile <- args[1]
>> orgpkg <- args[2]
>> fileout <- args[3]
>> headercol <- if(length(args) == 4) args[4] else "mrna_assignment"
>>
>> dat <- read.csv(probefile, comment.char = "#",
>>                 stringsAsFactors = FALSE, na.strings = "---")
>> ## keep the first accession-like ID listed for each probeset
>> mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
>>     grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST", x,
>>          value = TRUE)[1])
>>
>> ## map Ensembl transcript IDs to RefSeq or GenBank accessions
>> ens <- grep("^ENS", mrna, value = TRUE)
>> require(orgpkg, character.only = TRUE) ||
>>     stop("You need to install the ", orgpkg, " package first!")
>> ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
>> ens <- ens[!duplicated(ens[,1]),]
>> ## use accnum if refseq is NA
>> ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
>> ## put mapped data back in mrna vector
>> mrna[match(ens[,1], mrna)] <- ens[,2]
>> mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
>> ## write out
>> write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote = FALSE,
>>             row.names = FALSE, col.names = FALSE, na = "")
>>
>> <to here>
>>
>> Paste this into a file, make it executable (if on Linux or Mac OS X),
>> and change the path in the first line to point to the location of
>> your Rscript and it should create a fairly reasonable file for input
>> to AnnotationForge.
>>
>> You just call this script from the command line:
>>
>> parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv
>> org.Dr.eg.db zebgene_mapper.txt
>>
>> Then, after a while, you will have a file zebgene_mapper.txt that you
>> can use as input to AnnotationForge.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>>
>> On Thursday, August 29, 2013 10:39:38 AM, Joao Sollari Lopes wrote:
>>> Hi Jim,
>>>
>>> Thanks for your quick reply. Actually I was able to do some kind of
>>> mapping through the position of the probes in the strips using the
>>> files:
>>>
>>> zebgene11stdrentrezgprobe_17.1.0.tar.gz
>>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz)
>>>
>>>
>>>
>>> and
>>>
>>> pd.zebgene.1.1.st (provided by Bioconductor)
>>>
>>> The annotations compare very well with each other; however, the info
>>> provided by Affymetrix (available in pd.zebgene.1.1.st) is somewhat
>>> more complete.
>>>
>>> The trouble with working with Affymetrix Array Strips is that there
>>> seems to be little support for them in Bioconductor as far as
>>> annotation is concerned, particularly because the packages "annotate"
>>> and "annaffy" seem to work only with Affymetrix Chips.
>>>
>>> I know I have plenty of reading to do, but is there a best way to work
>>> with Array Strips and still use packages "annotate" and "annaffy"? At
>>> the moment I am using package "oligo".
>>>
>>> Thanks,
>>> Joao
>>>
>>> On 08/29/2013 03:15 PM, James W. MacDonald wrote:
>>>> Hi Joao,
>>>>
>>>> On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
>>>>> Hi,
>>>>>
>>>>> I am trying to compare the annotations provided by Affymetrix with
>>>>> the ones provided by Bioconductor for
>>>>>
>>>>> Zebrafish Gene 1.1 ST Array Strip
>>>>>
>>>>> I have compared the files
>>>>>
>>>>> zebgene11stdrentrezg.db_17.1.0.tar.gz
>>>>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz)
>>>>>
>>>>>
>>>>
>>>> That file isn't supplied by Bioconductor, it is supplied by MBNI at
>>>> University of Michigan.
>>>>
>>>> In addition, (if you read what they have on their site to know what
>>>> you are using) the probesets for that CDF no longer correspond in any
>>>> way to the original probesets that Affy defined. So comparing the two
>>>> doesn't make any sense.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>>
>>>>>
>>>>> ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip
>>>>> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The trouble is that the first identifies the Units as "100000002_at",
>>>>> "100000006_at", ...,  "84703_at" and the second as "12943944",
>>>>> "12943954", ..., "13276104". Is there an easy way to know which
>>>>> correspond to which?
>>>>>
>>>>> Thanks in advance,
>>>>> Joao Lopes
>>>>> Instituto Gulbenkian de Ciencia, Portugal
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>
>>
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099