[BioC] Mapping Affymetrix annotations with Bioconductor annotations

Fri Oct 4 19:29:36 CEST 2013

Hi Jim,

Following up on the discussion of annotation for Affymetrix Gene ST arrays,
I wonder if there is a standard way to deal with multiple mRNAs (from
different genes) that are assigned to the same transcript cluster. Is it
generally accepted to follow the naive approach of picking the first
mRNA in the list?
I know that the mRNA assignments are ranked, so is it safe simply to
rely on the ranking already performed by Affymetrix?
Joao
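
One way to judge how much the first-pick approach throws away is to list
every accession and gene description in a single mrna_assignment entry.
The following is a minimal sketch in base R, assuming assignments are
separated by " /// " and fields by " // " (the separators used in Jim's
script below), and assuming the accession is the first field and the
description the third. The example string is made up, not taken from the
ZebGene file.

```r
# Hypothetical mrna_assignment value; accessions and descriptions are made up.
x <- paste(
  "NM_000001 // RefSeq // hypothetical gene A // chr1 // 100",
  "ENSDART00000000001 // ENSEMBL // hypothetical gene A // chr1 // 100",
  "NM_000002 // RefSeq // hypothetical gene B // chr1 // 80",
  sep = " /// ")

# Split into assignments, then split each assignment into its fields.
assignments <- strsplit(x, " /// ", fixed = TRUE)[[1]]
fields <- strsplit(assignments, " // ", fixed = TRUE)

# Assumed layout: field 1 is the accession, field 3 the description.
accessions <- vapply(fields, function(f) f[1], character(1))
genes <- unique(vapply(fields, function(f) f[3], character(1)))

first_pick <- accessions[1]        # what "take the first one" keeps
n_candidates <- length(accessions) # how many assignments were listed
n_genes <- length(genes)           # how many distinct descriptions
```

If n_genes is larger than 1 for a transcript cluster, taking first_pick
silently drops the other genes, so it may be worth tabulating n_genes over
all clusters before deciding whether the ranking alone is good enough.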

On 08/29/2013 04:22 PM, James W. MacDonald wrote:
> Hi Joao,
>
> Unfortunately there are no readily available packages for annotating 
> all the new model organism arrays from Affy. However, the functions to 
> create your own annotation package do exist. If you look at the 
> AnnotationForge package, specifically the SQLForge vignette 
> (http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf), 
> it is pretty straightforward to make your own annotation package.
>
> I am assuming you are summarizing at the transcript level, so would 
> want to make a zebgene11sttranscriptcluster.db package. For this you 
> need the transcript csv file from Affy 
> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip). 
> From this you want to generate a two-column file with the probeset ID 
> in the first column, and then GenBank or RefSeq IDs in the second.
>
> This is the tough part, as the annotation files need to be parsed to 
> create this file.
>
> I wrote an Rscript to parse these files that you could use. It is 
> pretty naive, but seems to do a relatively reasonable job. You will 
> obviously need to change the first line to point to the correct 
> directory, and will have to have the org.Dr.eg.db package installed, 
> but this should
>
> <copy from below>
>
> #!/data/programs/lib64/R/bin/Rscript
> args <- commandArgs(TRUE)
> ## call. = FALSE belongs to stop(), not to paste()
> if(length(args) < 3)
>     stop("Usage: parseAffyTranscripts.R <transcript.csv> ",
>          "<organism.db package> <output file name> ",
>          "<mrna column header> (optional)\n", call. = FALSE)
> probefile <- args[1]
> orgpkg <- args[2]
> fileout <- args[3]
> headercol <- if(length(args) == 4) args[4] else "mrna_assignment"
>
> ## read the Affy transcript csv; "---" marks missing values
> dat <- read.csv(probefile, comment.char = "#", stringsAsFactors = FALSE,
>                 na.strings = "---")
> ## keep the first accession-like entry for each transcript cluster
> mrna <- sapply(strsplit(dat[,headercol], " // | /// "), function(x)
>     grep("^[NX][MR]|^[A-G][A-Z]+[0-9]+|^[A-Z][0-9]+|^ENST", x,
>          value = TRUE)[1])
>
> ## map Ensembl transcript IDs to RefSeq/GenBank via the organism package
> ens <- grep("^ENS", mrna, value = TRUE)
> require(orgpkg, character.only = TRUE) ||
>     stop(paste("You need to install the", orgpkg, "package first!"))
> ens <- select(get(orgpkg), ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
> ens <- ens[!duplicated(ens[,1]),]
> ## use accnum if refseq is NA
> ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
> ## put mapped data back in mrna vector
> mrna[match(ens[,1], mrna)] <- ens[,2]
> ## drop anything that could not be mapped
> mrna[grep("^ENS|^GENSCAN", mrna)] <- NA
> ## write out
> write.table(cbind(dat[,1], mrna), fileout, sep = "\t", quote = FALSE,
>             row.names = FALSE, col.names = FALSE, na = "")
>
> <to here>
>
> Paste this into a file, make it executable (if on Linux or Mac OS X),
> and change the path in the first line to point to the location of your
> Rscript. It should then create a fairly reasonable file for input to
> AnnotationForge.
>
> You just call this script from the command line:
>
> parseAffyTranscriptCsv.R ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv 
> org.Dr.eg.db zebgene_mapper.txt
>
> After a while you will have a file zebgene_mapper.txt that you
> can use as input to AnnotationForge.
>
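
As a rough sketch of that last step, the mapper file can be handed to
makeDBPackage() from AnnotationForge. The schema name "ZEBRAFISHCHIP_DB",
the base map type "gbNRef" (a mix of GenBank and RefSeq IDs), the version
string, the chip name and the wrapper name build_zebgene_pkg are all
assumptions for illustration and should be checked against the SQLForge
vignette before use.

```r
# Hypothetical wrapper around AnnotationForge::makeDBPackage().
# mapper_file: the two-column, tab-delimited file written by the script above.
# out_dir:     directory in which the package source tree is created.
build_zebgene_pkg <- function(mapper_file, out_dir = ".") {
  if (!requireNamespace("AnnotationForge", quietly = TRUE))
    stop("Install the AnnotationForge package first.", call. = FALSE)
  AnnotationForge::makeDBPackage(
    "ZEBRAFISHCHIP_DB",                      # assumed schema for zebrafish chips
    affy = FALSE,                            # input is our own two-column file
    prefix = "zebgene11sttranscriptcluster", # package becomes <prefix>.db
    fileName = mapper_file,
    baseMapType = "gbNRef",                  # assumed: GenBank and RefSeq IDs mixed
    outputDir = out_dir,
    version = "0.0.1",                       # placeholder version
    manufacturer = "Affymetrix",
    chipName = "ZebGene-1_1-st-v1",
    manufacturerUrl = "http://www.affymetrix.com")
}

# Example call (not run here):
# build_zebgene_pkg("zebgene_mapper.txt")
```

The resulting source directory would then still need to be installed, for
example with R CMD INSTALL, before it can be loaded.
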
> Best,
>
> Jim
>
>
>
>
>
> On Thursday, August 29, 2013 10:39:38 AM, Joao Sollari Lopes wrote:
>> Hi Jim,
>>
>> Thanks for your quick reply. Actually I was able to do some kind of
>> mapping through the position of the probes in the strips using the 
>> files:
>>
>> zebgene11stdrentrezgprobe_17.1.0.tar.gz
>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezgprobe_17.1.0.tar.gz) 
>>
>>
>>
>> and
>>
>> pd.zebgene.1.1.st (provided by Bioconductor)
>>
>> The annotations compare very well with each other, but the info
>> provided by Affymetrix (available in pd.zebgene.1.1.st) is somewhat
>> more complete.
>>
>> The trouble with Affymetrix Array Strips is that there seems to be
>> little annotation support for them in Bioconductor, particularly
>> because the packages "annotate" and "annaffy" seem to work only with
>> Affymetrix chips.
>>
>> I know I have plenty of reading to do, but is there a best way to work
>> with Array Strips and still use the packages "annotate" and "annaffy"?
>> At the moment I am using the package "oligo".
>>
>> Thanks,
>> Joao
>>
>> On 08/29/2013 03:15 PM, James W. MacDonald wrote:
>>> Hi Joao,
>>>
>>> On Thursday, August 29, 2013 7:07:02 AM, Joao Sollari Lopes wrote:
>>>> Hi,
>>>>
>>>> I am trying to compare the annotations provided by Affymetrix with the
>>>> ones provided by Bioconductor for
>>>>
>>>> Zebrafish Gene 1.1 ST Array Strip
>>>>
>>>> I have compared the files
>>>>
>>>> zebgene11stdrentrezg.db_17.1.0.tar.gz
>>>> (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/17.1.0/entrezg.download/zebgene11stdrentrezg.db_17.1.0.tar.gz) 
>>>>
>>>>
>>>
>>> That file isn't supplied by Bioconductor, it is supplied by MBNI at
>>> University of Michigan.
>>>
>>> In addition, (if you read what they have on their site to know what
>>> you are using) the probesets for that CDF no longer correspond in any
>>> way to the original probesets that Affy defined. So comparing the two
>>> doesn't make any sense.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>>
>>>>
>>>> ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip
>>>> (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-33_3/ZebGene-1_1-st-v1.na33.3.zv9.transcript.csv.zip) 
>>>>
>>>>
>>>>
>>>>
>>>> The trouble is that the first identifies the units as "100000002_at",
>>>> "100000006_at", ..., "84703_at" and the second as "12943944",
>>>> "12943954", ..., "13276104". Is there an easy way to know which
>>>> corresponds to which?
>>>>
>>>> Thanks in advance,
>>>> Joao Lopes
>>>> Instituto Gulbenkian de Ciencia, Portugal
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>