[BioC] Issues about how to filter and annotate the MoGene-2_0-st and MoEx-1_0-st-v1 array probe sets

Tue Jun 10 16:26:47 CEST 2014

Hi Chao,

On 6/8/2014 10:37 AM, 张超 wrote:
> Dear list,
>
>
> I would like to use paCalls from the oligo package to filter out probe sets whose transcripts are absent. My data are from the MoGene-2_0-st and MoEx-1_0-st-v1 arrays (Affymetrix). After reading the CEL files, my data are a GeneFeatureSet with 12 samples (6 control and 6 experimental). What should I do with the values computed by paCalls (PSDABG), as below?
>> library(oligo)
>> OligoRawData<-read.celfiles(CEL file lists)
>> eset<-rma(OligoRawData)
>> dagbPS <- paCalls(OligoRawData, "PSDABG")
> What should I do next to filter the probe sets? Could you please send me a complete example and a detailed explanation?
>

You need to decide what constitutes 'present' and how many samples have 
to be present in order to keep the probeset.

So if I were to say that p < 0.05 is present and I needed 6 such 
samples (the size of one group in your 12-sample design), I could do

keep <- rowSums(dagbPS < 0.05) >= 6
eset <- eset[keep,]

If the above code is mysterious to you, then you need to read 'An 
Introduction to R'.
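
To make the filtering step concrete, here is a minimal sketch on a made-up 
matrix of detection p-values. The object names (pvals, n_present) and the 
values are assumptions for illustration only; with real data the matrix 
would come from paCalls(). Before subsetting a real eset this way, it is 
worth checking that the rows of the p-value matrix line up with 
featureNames(eset).

```r
# Toy stand-in for the matrix returned by paCalls(): rows are probesets,
# columns are samples, entries are detection p-values (values made up).
set.seed(1)
pvals <- matrix(runif(5 * 12), nrow = 5,
                dimnames = list(paste0("ps", 1:5), paste0("s", 1:12)))
pvals["ps1", ] <- 0.001   # called present in all 12 samples
pvals["ps2", ] <- 0.9     # called present in none

# pvals < 0.05 is a logical matrix; rowSums() counts the TRUEs in each row,
# i.e. the number of samples in which each probeset is called present.
n_present <- rowSums(pvals < 0.05)

# Keep probesets that are present in at least 6 samples.
keep <- n_present >= 6
```

The resulting logical vector keep has one entry per probeset and can be 
used as a row index, as in the eset subsetting above.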

>
> In addition, moex10sttranscriptcluster.db can be used to annotate data from the MoEx-1_0-st-v1 array, and both mogene20stprobeset.db and mogene20sttranscriptcluster.db can be used for data from the MoGene-2_0-st array (which includes both gene and lncRNA content). But only a little more than half of the probe sets are annotated with gene symbols by the commands below.
>> results<-decideTests(fit2, method="global", adjust.method="fdr", p.value=0.05, lfc=0.5) #DEGs determination by t tests
>> genesymbol = getText(aafSymbol(rownames(results), "moex10sttranscriptcluster.db" ));#annotated by moex10sttranscriptcluster.db for data get from MoEx-1_0-st-v1 array
> Only 1217 and 24709 can be annotated by mogene20stprobeset.db and mogene20sttranscriptcluster.db, respectively, for the MoGene-2_0-st data (length(genesymbol[which(genesymbol!="")])), but the total number is 41345 (length(results)). Only 14966 can be mapped by moex10sttranscriptcluster.db for the MoEx-1_0-st-v1 data (the total number is 23332, from length(results)). Do I need to add some other db for the annotation?
>

The annotation packages with 'transcriptcluster' in their names are for 
instances where you have summarized probesets at the transcript level 
(which is the default for rma() in oligo). If you want to summarize at 
the probeset level (which I would not recommend doing, btw), you need to 
use target = "probeset" in your call to rma().

In other words, you should only be using the transcriptcluster 
annotation packages. Although please note that the 
moex10sttranscriptcluster.db package is for the Mouse Exon 1.0 ST array, 
not the Gene ST array.

There are any number of reasons that only a subset of probesets on the 
array have symbols. First, there are lots of controls, which won't have 
gene symbols. Second, the lincRNA/snoRNA/miRNA probesets that Affy put 
on these arrays won't have gene symbols either (because they aren't 
genes). Third, there is still some speculative content on these arrays; 
things that might end up being genes, with gene names, in the future, 
but which are just hypothetical at this point in time. Fourth, the 
annaffy package uses the old style methods of getting annotations, in 
which case any probeset that matches more than one gene symbol will be 
masked.

You would be much better served by doing something like this (shown for 
the MoGene-2_0-st data)

gns <- select(mogene20sttranscriptcluster.db, featureNames(eset), 
c("ENTREZID","SYMBOL","GENENAME"))

This will result in a warning that you have multiple mappings. You will 
have to deal with those multiple mappings as you see fit. But after 
doing so, you can then do

fit$genes <- gns

and your topTable object will then be populated with the annotations. 
You might then consider using the ReportingTools package, which is under 
active development and maintenance, rather than the annaffy package 
which may still be actively maintained, but is no longer AFAICT under 
active development.
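
As one possible way to deal with the multiple mappings mentioned above, 
here is a minimal sketch on a made-up data.frame shaped like the output of 
select(). The IDs, symbols, and the names gns1 and ids are assumptions for 
illustration, and keeping the first mapping per probeset is just one 
choice among several (you could instead drop ambiguous probesets or 
collapse the symbols into one string).

```r
# Toy stand-in for the data.frame returned by select(); IDs are made up.
gns <- data.frame(
  PROBEID  = c("1001", "1001", "1002", "1003"),
  ENTREZID = c("11", "12", "13", NA),
  SYMBOL   = c("GeneA", "GeneB", "GeneC", NA),
  stringsAsFactors = FALSE
)
ids <- c("1003", "1001", "1002")   # stand-in for featureNames(eset)

# Keep only the first mapping listed for each probeset ID.
gns1 <- gns[!duplicated(gns$PROBEID), ]

# Reorder to match the row order of the expression data, so that the
# annotation lines up row for row with the fitted model object.
gns1 <- gns1[match(ids, gns1$PROBEID), ]
rownames(gns1) <- NULL
```

After this step there is exactly one annotation row per probeset, in the 
same order as the expression data, which is what the fit$genes assignment 
needs.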

Best,

Jim

>
> BTW, I am a beginner in this field. I have found very few documents with examples of how to use the functions in the oligo package. Could you please give me some suggestions as well? Looking forward to your reply. I really appreciate any help.
>
>
> Thanks again.
>
>
> Best regards.
>
>
> Chao

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099