[BioC] affymetrix probe databases

Mon Sep 15 17:33:50 CEST 2014

Hi Pau,

On Mon, Sep 15, 2014 at 10:40 AM, Pau Marc Muñoz Torres <paumarc at gmail.com>
wrote:

> Good afternoon to everybody,
>
>  I'm just doing my firsts steps working with  affymetrix data and i have
> some questions.
>
>  I  started to working with CEL files by moving the data contained in them
> to a csv file. Then I tried to relate affymetrix codes with uniprot codes.
> I  performed it as follow:

> library(affy)
> library("xxxx.db")
> setwd("/home/paumarc/affy/Data/Exp/Cel")
> data <- ReadAffy()
> my_frame <- data.frame(exprs(data))
> Annot <- data.frame(ACCNUM=sapply(contents(xxxACCNUM), paste, collapse=",
> "), SYMBOL=sapply(contents(xxxSYMBOL), paste, collapse=", "),
> DESC=sapply(contents(xxxGENENAME), paste, collapse=",
> "),DESC=sapply(contents(xxxUNIPROT), paste, collapse=", "))
> all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T)
> write.csv(all, file = "xxx.csv")
>

It is common for people new to Bioconductor to want to write things out to
disk. You should resist that urge. There is almost no reason to write data
to disk, and every reason to keep data in the structures designed to hold
them.

In addition, you have not summarized the data, so there won't be a
one-to-one correspondence between the contents of my_frame, and Annot. In
other words, for the arrays you are analyzing, there are (mostly) 11 probes
that are intended to interrogate a single transcript (the collection of
probes for a given transcript is called a probeset). If you were to do

eset <- rma(data)

then the ExpressionSet called 'eset' would contain the data after
summarizing each probeset to give a single expression value per transcript.
At this point, you could hypothetically extract the data from the
ExpressionSet and write to disk, but there is no profit in doing that.
Presumably the next step is to make some comparisons, and all the software
in Bioconductor that you can use to make those comparisons (limma,
multtest, siggenes, samr, etc) will happily use an ExpressionSet as input.
If you want to add annotation to the ExpressionSet, then you can use the
select method:

annot <- select(xxx.db, keys(xxx.db),
c("REFSEQ","SYMBOL","GENENAME","UNIPROT"))

But note that you will get a warning:

Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows

which is because there are any number of probesets on these arrays that can
measure more than one thing. In addition, using either ACCNUM or REFSEQ
isn't the best idea, as there are any number of genes that have like a
bazillion different transcripts. For example, 1007_s_at measures DDR1 (and
an internal miRNA, miR-4640). If you get all the ACCNUM, you will have
370(!) rows for that gene that you will be concatenating. If instead you
use ENTREZID, which is the more 'usual' thing to do, you get this:

   ENTREZID  SYMBOL                                    GENENAME UNIPROT
1       780    DDR1 discoidin domain receptor tyrosine kinase 1  Q08345
2       780    DDR1 discoidin domain receptor tyrosine kinase 1  Q96T62
3       780    DDR1 discoidin domain receptor tyrosine kinase 1  Q96T61
4 100616237 MIR4640                               microRNA 4640    <NA>

Which is still a bit crazy, given the replication in UNIPROT, but not as
extreme as ACCNUM or REFSEQ. You can then concatenate as you did above:

annotlst <- split(annot[,-1], annot[,1])
annot <- do.call("rbind", lapply(annotlst, function(x) apply(x, 2,
function(y) paste(unique(y), collapse = ","))))

Best,

Jim

> where XXX is one of the follwing database
>
> hgu133a.db
> hgu133b.db
> pd.hg.u133.plus.2
>

This isn't an annotation database (the pd package above), so you don't want
to be using that for this purpose.

> hgu133plus2.db
>
>  unfortuntally the codes cointained at the CEL  files and the database do
> not match well. Some examples are:
>
> 137707    208130    NA    NA    NA    NA    202.5    192    209.3    313.8
> 137708    208130_s_at    NM_030984    TBXAS1    thromboxane A synthase 1
> (platelet)    P24557, Q53F23, B4DJG6, Q16843    NA    NA    NA    NA
> 137709    208131    NA    NA    NA    NA    187.5    194.5    167.5    224
> 137710    208131_s_at    NM_000961    PTGIS    prostaglandin I2
> (prostacyclin) synthase    Q16647    NA    NA    NA    NA
>
> I suppose that     208131_s_at should mach with 208131  and 208130_s_at
> with 208130, is that supposition correct? does some body knows to which
> affymetrics database correspond the identification codes 208130_s_at and
> 208131_s_at

> Thank you
>
> pau
>
> Pau Marc Muñoz Torres
> skype: pau_marc
> http://www.linkedin.com/in/paumarc
> http://www.researchgate.net/profile/Pau_Marc_Torres3/info/
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

	[[alternative HTML version deleted]]