[BioC] Mapping Affymetrix annotations with Bioconductor annotations

Thu Aug 29 17:22:25 CEST 2013

Hi Joao,

Unfortunately there are no readily available packages for annotating 
all the new model organism arrays from Affy. However, the functions to 
create your own annotation package do exist. If you look at the 
AnnotationForge package, specifically the SQLForge vignette 
(http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf), 
it is pretty straightforward to make your own annotation package.

I am assuming you are summarizing at the transcript level, so you would 
want to make a zebgene11sttranscriptcluster.db package. For this you 
need the transcript csv file from Affy 
(http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip). 
From this you want to generate a two-column file with the probeset ID 
in the first column, and GenBank or RefSeq IDs in the second.

This is the tough part, as the annotation files need to be parsed to 
create this file.
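
To make the parsing step concrete, here is a small base R sketch of what 
has to happen for a single row. The input string is a made-up value in 
the style of the mrna_assignment column (the IDs and descriptions are 
hypothetical, not taken from the real file); the split pattern and the 
accession regex are the ones used in the script below.

```r
## Hypothetical mrna_assignment value: entries are separated by " /// ",
## and the fields within an entry by " // "
x <- paste("ENSDART00000012345 // ENSEMBL // some description // chr1 // 100",
           "NM_001002332 // RefSeq // another description // chr1 // 100",
           sep = " /// ")

## break the string into its individual fields
fields <- strsplit(x, " // | /// ")[[1]]

## keep only the fields that look like RefSeq, GenBank or Ensembl IDs
pat <- "^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST"
hits <- grep(pat, fields, value = TRUE)

## the first hit wins, whatever database it comes from
first_hit <- hits[1]
print(hits)
print(first_hit)
```

Because the first hit is taken regardless of its source, an Ensembl 
transcript ID can end up being chosen even when a RefSeq ID is listed 
later in the same row, which is why Ensembl IDs need to be translated 
afterwards using the organism package.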

I wrote an Rscript to parse these files that you could use. It is 
pretty naive, but seems to do a reasonable job. You will need to change 
the first line (the #! line) to point to the directory where your 
Rscript lives, and you will need to have the org.Dr.eg.db package 
installed, but otherwise this should work as is.

<copy from below>

#!/data/programs/lib64/R/bin/Rscript
## Parse an Affy transcript csv into a two-column (probeset ID, accession) file
args <- commandArgs(TRUE)
if(length(args) < 3)
    stop(paste("Usage: parseAffyTranscriptCsv.R <transcript.csv>",
               "<organism.db package> <output file name>",
               "<mrna column header> (optional)\n"), call. = FALSE)
probefile <- args[1]
orgpkg <- args[2]
fileout <- args[3]
## column holding the mRNA assignments; defaults to "mrna_assignment"
if(length(args) == 4) headercol <- args[4] else headercol <- "mrna_assignment"

## header lines start with '#', and Affy uses '---' for missing values
dat <- read.csv(probefile, comment.char = "#", stringsAsFactors = FALSE,
                na.strings = "---")
## split each assignment into fields and keep the first one that looks
## like a RefSeq, GenBank or Ensembl transcript ID
mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
               grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST",
                    x, value = TRUE)[1])

## map Ensembl transcript IDs to RefSeq or GenBank via the organism package
ens <- grep("^ENS", mrna, value = TRUE)
require(orgpkg, character.only = TRUE) ||
    stop(paste("You need to install the", orgpkg, "package first!"))
ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
ens <- ens[!duplicated(ens[,1]),]
## use accnum if refseq is NA
ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
## put mapped data back in mrna vector (every occurrence, not just the first)
idx <- match(mrna, ens[,1])
mrna[!is.na(idx)] <- ens[idx[!is.na(idx)],2]
## drop anything still unmapped (Ensembl or GENSCAN IDs)
mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
## write out
write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "")

<to here>

Paste this into a file, make it executable (if on Linux or Mac OS X), 
and change the path in the first line to point to the location of your 
Rscript. It should then create a fairly reasonable file for input to 
AnnotationForge.

You just call this script from the command line:

parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv \
    org.Dr.eg.db zebgene_mapper.txt

After a while you will have a file, zebgene_mapper.txt, that you can 
use as input to AnnotationForge.
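
For the AnnotationForge step itself, a minimal sketch might look like 
the following. It follows the makeDBPackage() pattern from the SQLForge 
vignette, but several values are assumptions rather than anything stated 
above: the schema name "ZEBRAFISHCHIP_DB" (check available.dbschemas() 
for the exact name), the "gbNRef" base map type (chosen because the 
mapper file mixes GenBank and RefSeq IDs), and the version and chipName 
strings, which are placeholders. Building a zebrafish chip package should 
also need the zebrafish.db0 package installed.

```r
## two-column file written by the parsing script
mapper <- "zebgene_mapper.txt"

## arguments for makeDBPackage(); schema, baseMapType, version and
## chipName are assumptions for illustration, adjust as needed
pkg_args <- list(
    schema = "ZEBRAFISHCHIP_DB",
    affy = FALSE,                  # input is not an Affy csv, but our own file
    prefix = "zebgene11sttranscriptcluster",
    fileName = mapper,
    baseMapType = "gbNRef",        # mixed GenBank and RefSeq IDs
    outputDir = ".",
    version = "0.0.1",
    manufacturer = "Affymetrix",
    chipName = "ZebGene-1_1-st",
    manufacturerUrl = "http://www.affymetrix.com"
)

## only build if the mapper file exists and AnnotationForge can be loaded
if(file.exists(mapper) && require("AnnotationForge", quietly = TRUE)) {
    do.call(makeDBPackage, pkg_args)
}
```

If that succeeds there should be a zebgene11sttranscriptcluster.db 
directory in the output directory, which can then be installed with 
install.packages("zebgene11sttranscriptcluster.db", repos = NULL, 
type = "source").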

Best,

Jim

On Thursday, August 29, 2013 10:39:38 AM, Joao Sollari Lopes wrote:
> Hi Jim,
>
> Thanks for your quick reply. Actually I was able to do some kind of
> mapping through the position of the probes in the strips using the files:
>
> zebgene11stdrentrezgprobe_17.1.0.tar.gz
> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz)
>
>
> and
>
> pd.zebgene.1.1.st (provided by Bioconductor)
>
> The annotations compare very well with each other, however the info
> provided by Affymetrix (available in pd.zebgene.1.1.st) are somewhat
> more complete.
>
> The trouble of working with Affymetrix Array Strip is that there seems
> to be little support in bioconductor for it in what concerns
> annotation. Particularly, because packages "annotate" and "annaffy"
> seem to work only with Affymetrix Chips.
>
> I know I have plenty of reading to do, but is there a best-way to work
> with Array Strips and still use packages "annotate" and "annaffy"? At
> the moment I am using package "oligo".
>
> Thanks,
> Joao
>
> On 08/29/2013 03:15 PM, James W. MacDonald wrote:
>> Hi Joao,
>>
>> On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
>>> Hi,
>>>
>>> I am trying to compare the annotations provided by Affymetrix with the
>>> ones provided by Bioconductor for
>>>
>>> Zebrafish Gene 1.1 ST Array Strip
>>>
>>> I have compared the files
>>>
>>> zebgene11stdrentrezg.db_17.1.0.tar.gz
>>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz)
>>>
>>
>> That file isn't supplied by Bioconductor, it is supplied by MBNI at
>> University of Michigan.
>>
>> In addition, (if you read what they have on their site to know what
>> you are using) the probesets for that CDF no longer correspond in any
>> way to the original probesets that Affy defined. So comparing the two
>> doesn't make any sense.
>>
>> Best,
>>
>> Jim
>>
>>
>>>
>>>
>>> ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip
>>> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip)
>>>
>>>
>>>
>>> The trouble is that the first identifies the Units as "100000002_at",
>>> "100000006_at", ...,  "84703_at" and the second as "12943944",
>>> "12943954", ..., "13276104". Is there an easy way to know which
>>> correspond to which?
>>>
>>> Thanks in advance,
>>> Joao Lopes
>>> Instituto Gulbenkian de Ciencia, Portugal
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099