[BioC] Mapping Affymetrix annotations with Bioconductor annotations

Thu Aug 29 17:50:50 CEST 2013

Hi Jim,

Many thanks for that!

All the best,
Joao

On 08/29/2013 04:22 PM, James W. MacDonald wrote:
> Hi Joao,
>
> Unfortunately there are no readily available packages for annotating 
> all the new model organism arrays from Affy. However, the functions to 
> create your own annotation package do exist. If you look at the 
> AnnotationForge package, specifically the SQLForge vignette 
> (http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf), 
> it is pretty straightforward to make your own annotation package.
>
> I am assuming you are summarizing at the transcript level, so you would 
> want to make a zebgene11sttranscriptcluster.db package. For this you 
> need the transcript csv file from Affy 
> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip). 
> From this you want to generate a two-column file with the probeset ID 
> in the first column, and then GenBank or RefSeq IDs in the second.
>
> This is the tough part, as the annotation files need to be parsed to 
> create this file.
>
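To make the parsing step concrete, here is a minimal sketch in base R. The annotation strings are made up for illustration (real mrna_assignment entries are longer), and the pattern is the one used in the script quoted further down. It splits each entry on the " // " and " /// " separators and keeps the first token that looks like an accession.

```r
# Made-up mrna_assignment entries, for illustration only
assignments <- c(
  paste("NM_131101 // RefSeq // Danio rerio example gene // chr1 // 100",
        "/// ENSDART00000012345 // ENSEMBL // example // chr1 // 100"),
  "ENSDART00000054321 // ENSEMBL // example transcript // chr2 // 100",
  NA
)

# Accession-like tokens: RefSeq (NM_, XM_, NR_, XR_), GenBank-style IDs,
# and Ensembl transcript IDs
pattern <- "^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST"

# Split each entry into fields and keep the first field matching the pattern;
# entries with no match (or NA) give NA
first_hit <- function(x) grep(pattern, x, value = TRUE)[1]
picked <- unname(sapply(strsplit(assignments, " // | /// "), first_hit))
print(picked)
```

Only the first matching accession per probeset is kept, which is why the approach is described as naive.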
> I wrote an Rscript to parse these files that you could use. It is 
> pretty naive, but seems to do a relatively reasonable job. You will 
> obviously need to change the first line to point to the correct 
> directory, and will have to have the org.Dr.eg.db package installed, 
> but this should work.
>
> <copy from below>
>
> #!/data/programs/lib64/R/bin/Rscript
> args <- commandArgs(TRUE)
> ## call. = FALSE belongs to stop(), not paste()
> if(length(args) < 3) stop("Usage: parseAffyTranscripts.R <transcript.csv> ",
>                          "<organism.db package> <output file name> ",
>                          "<mrna column header> (optional)\n", call. = FALSE)
> probefile <- args[1]
> orgpkg <- args[2]
> fileout <- args[3]
> if(length(args) == 4) headercol <- args[4] else headercol <- 
> "mrna_assignment"
>
> dat <- read.csv(probefile, comment.char = "#", stringsAsFactors = FALSE,
>                 na.strings = "---")
> mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
> grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST", x, value = 
> TRUE)[1])
>
> ens <- grep("^ENS", mrna, value = TRUE)
> require(orgpkg, character.only = TRUE) ||
>     stop(paste("You need to install the", orgpkg, "package first!"))
> ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
> ens <- ens[!duplicated(ens[,1]),]
> ## use accnum if refseq is NA
> ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
> ## put mapped data back in mrna vector (every occurrence of each ID)
> idx <- match(mrna, ens[,1])
> mrna[!is.na(idx)] <- ens[idx[!is.na(idx)], 2]
> mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
> ## write out
> write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote =
> FALSE, row.names = FALSE, col.names  = FALSE, na = "")
>
> <to here>
>
> Paste this into a file, make it executable (if on Linux or Mac OS X), 
> and change the path in the first line to point to the location of your 
> Rscript; it should then create a fairly reasonable file for input to 
> AnnotationForge.
>
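For anyone adapting the script, the step that maps Ensembl transcript IDs back to RefSeq or GenBank accessions can be checked on toy data without any annotation package. In this sketch all IDs are made up, and `lookup` is only a stand-in shaped like the data frame that select() returns (ENSEMBLTRANS, REFSEQ, ACCNUM). It indexes with match(mrna, ...) so that an ID occurring for several probesets is filled everywhere.

```r
# Toy stand-ins; every ID here is invented for illustration
mrna <- c("NM_000001", "ENSDART00000000001", "ENSDART00000000002",
          "ENSDART00000000001", "ENSDART00000000003", NA)
lookup <- data.frame(
  ENSEMBLTRANS = c("ENSDART00000000001", "ENSDART00000000001",
                   "ENSDART00000000002"),
  REFSEQ = c("NM_000002", "NP_000002", NA),
  ACCNUM = c("BC000002", "BC000002", "BC000003"),
  stringsAsFactors = FALSE
)

# Keep one row per Ensembl transcript
lookup <- lookup[!duplicated(lookup$ENSEMBLTRANS), ]

# Fall back to ACCNUM where REFSEQ is missing
no_refseq <- is.na(lookup$REFSEQ)
lookup$REFSEQ[no_refseq] <- lookup$ACCNUM[no_refseq]

# Fill every occurrence of a mapped ID, then blank whatever is still
# an Ensembl or GENSCAN ID
idx <- match(mrna, lookup$ENSEMBLTRANS)
mrna[!is.na(idx)] <- lookup$REFSEQ[idx[!is.na(idx)]]
mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
print(mrna)
```

Probesets whose Ensembl ID has no RefSeq or GenBank counterpart end up as NA, which the script writes out as an empty second column.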
> You just call this script from the command line:
>
> parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv \
>     org.Dr.eg.db zebgene_mapper.txt
>
> After a while you will have a file, zebgene_mapper.txt, that you 
> can use as input to AnnotationForge.
>
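The final step, turning the mapper file into an annotation package, might look roughly like the sketch below. It follows the makeDBPackage() call pattern shown in the SQLForge vignette. The version number, chip name and output directory are placeholder choices, and it assumes that AnnotationForge and the zebrafish.db0 base package are installed. The call is guarded so that nothing happens if the package or the file is missing.

```r
# Sketch only: build a chip annotation package from the two-column mapper file.
# Placeholder values: version, chipName, outputDir.
mapper <- "zebgene_mapper.txt"
can_build <- requireNamespace("AnnotationForge", quietly = TRUE) &&
  file.exists(mapper)

if (can_build) {
  AnnotationForge::makeDBPackage(
    "ZEBRAFISHCHIP_DB",
    affy = FALSE,            # input is a plain two-column file, not an Affy csv
    prefix = "zebgene11sttranscriptcluster",
    fileName = mapper,
    baseMapType = "gbNRef",  # second column mixes GenBank and RefSeq IDs
    outputDir = tempdir(),
    version = "0.0.1",
    manufacturer = "Affymetrix",
    chipName = "Zebrafish Gene 1.1 ST Array Strip",
    manufacturerUrl = "http://www.affymetrix.com"
  )
} else {
  message("AnnotationForge or ", mapper, " not available; nothing built.")
}
```

The resulting source package directory in outputDir can then be installed with install.packages(..., repos = NULL, type = "source").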
> Best,
>
> Jim
>
> On Thursday, August 29, 2013 10:39:38 AM, Joao Sollari Lopes wrote:
>> Hi Jim,
>>
>> Thanks for your quick reply. Actually, I was able to do a kind of
>> mapping based on the positions of the probes on the strips, using the
>> files:
>>
>> zebgene11stdrentrezgprobe_17.1.0.tar.gz
>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz) 
>>
>> and
>>
>> pd.zebgene.1.1.st (provided by Bioconductor)
>>
>> The annotations compare very well with each other; however, the info
>> provided by Affymetrix (available in pd.zebgene.1.1.st) is somewhat
>> more complete.
>>
>> The trouble with Affymetrix Array Strips is that there seems to be
>> little annotation support for them in Bioconductor, particularly
>> because the packages "annotate" and "annaffy" seem to work only with
>> Affymetrix Chips.
>>
>> I know I have plenty of reading to do, but is there a best way to work
>> with Array Strips and still use the packages "annotate" and "annaffy"? At
>> the moment I am using the package "oligo".
>>
>> Thanks,
>> Joao
>>
>> On 08/29/2013 03:15 PM, James W. MacDonald wrote:
>>> Hi Joao,
>>>
>>> On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
>>>> Hi,
>>>>
>>>> I am trying to compare the annotations provided by Affymetrix with the
>>>> ones provided by Bioconductor for
>>>>
>>>> Zebrafish Gene 1.1 ST Array Strip
>>>>
>>>> I have compared the files
>>>>
>>>> zebgene11stdrentrezg.db_17.1.0.tar.gz
>>>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz) 
>>>>
>>>>
>>>
>>> That file isn't supplied by Bioconductor, it is supplied by MBNI at
>>> University of Michigan.
>>>
>>> In addition (if you read what they have on their site to know what
>>> you are using), the probesets for that CDF no longer correspond in any
>>> way to the original probesets that Affy defined. So comparing the two
>>> doesn't make any sense.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>>
>>>>
>>>> ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip
>>>> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip) 
>>>>
>>>> The trouble is that the first identifies the Units as "100000002_at",
>>>> "100000006_at", ...,  "84703_at" and the second as "12943944",
>>>> "12943954", ..., "13276104". Is there an easy way to know which
>>>> correspond to which?
>>>>
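The two ID styles in the question come from different probeset definitions. In the Brainarray entrezg packages the unit names are, as far as the Brainarray documentation describes them, Entrez Gene IDs with an "_at" suffix, while the Affymetrix csv uses transcript cluster IDs, so there is no one-to-one correspondence between the two (as Jim points out above). A small sketch of recovering the Entrez Gene ID part, using the unit names quoted in the question:

```r
# Unit names as quoted in the question (Brainarray entrezg style)
brainarray_units <- c("100000002_at", "100000006_at", "84703_at")

# Strip the trailing "_at" to get the presumed Entrez Gene IDs
entrez_ids <- sub("_at$", "", brainarray_units)
print(entrez_ids)
```

These IDs could then be compared at the gene level with whatever gene annotation is derived for the Affymetrix transcript clusters, rather than probeset by probeset.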
>>>> Thanks in advance,
>>>> Joao Lopes
>>>> Instituto Gulbenkian de Ciencia, Portugal
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>
>