[BioC] Issues about how to filter and annotate the MoGene-2_0-st and MoEx-1_0-st-v1 array probe sets

Thu Jun 12 17:06:45 CEST 2014

Hi Chao,

One more thing; see below.

On 6/12/2014 10:55 AM, James W. MacDonald wrote:
> Hi Chao,
>
> Please don't take conversations off-list (e.g., use 'Reply-all' when
> responding).
>
> On 6/11/2014 5:02 PM, 张超 wrote:
>> Hi Jim,
>>
>> Thanks a lot for your prompt reply and I really learned a lot from it.
>> But I don't understand what do you mean as you said below,
>>
>>   >You will have to deal with those multiple mappings as you see fit.
>> But after doing so, you can then do
>>
>>> fit$genes <- gns
>>
>>  >and your topTable object will then be populated with the annotations.
>>
>> My way to deal with those multiple mappings is to retain only the first
>> one probe set of the multiple mappings(i.e. gns_rm_multmap).And I have
>> finished it. But what to do next and what is fit$genes? Could you please
>> explain it in more details? I have found a way to populated with the
>> annotations with "for,if else" commands, and it costs too much time for
>> my machine to run these commands. It seems that your way is much easier,
>> but I haven't got it.
>
> The MArrayLM object (your 'fit2' object below) has a slot for gene
> annotations, and as long as the annotation object is in the correct
> order you can just put it in there. So you have done something like this:
>
> fit <- lmFit(eset, design)
> fit2 <- contrasts.fit(fit, contrast)
> fit2 <- eBayes(fit2)
>
> And then if you annotate those genes
>
> gns <- select(hugene10sttranscriptcluster.db, row.names(fit2$coef),
> c("ENTREZID","SYMBOL","GENENAME"))
>
> and then remove duplicates like you said you did
>
> gns <- gns[duplicated(gns[,1]),]
>
> you can check first that things line up
>
> all.equal(row.names(fit2$coef), gns[,1])
>
> and if that is TRUE (it should be), you can now put those data into your
> fit2 object
>
> fit2$genes <- gns
>
> and then if you do
>
> topTable(fit2, 1)
>
> you will have all the annotation in the result as well.
>
> And then you can do something like
>
> idx <- HTMLReport("index", "Look at these results!")
> publish(fit2, idx, eset, coef=1, pvalueCutoff = 0.05)
> finish(idx)
>
> and you will have an index.html file that you can open to look at your
> results. See the ReportingTools vignettes for more information.

I should note that ReportingTools does the annotation of your data 
separately (last time I checked), so you don't necessarily have to 
annotate things by hand. However, the code used in ReportingTools is 
based on 'old school' methods from the annotate package that by default 
will not annotate any probeset that has a one-to-many mapping.

Because of this, I tend to do things a different way, using some 
functions in my affycoretools package:

idx <- HTMLReport("index", "Woah, nice data!")
tab <- makeImages(topTable(fit2, 1), eset, <a grouping factor goes 
here>, design, contrast, 1)
publish(tab, idx, .modifyDF = list(entrezLinks, affyLinks))
finish(idx)

Best,

Jim

>
>>
>> In addition, as you said, you would not recommend summarizing the oligo
>> data at the probeset level. However if rma(target = "core") is used, how
>> to use paCalls, and if paCalls("DABG") is used, how to use rma to make
>> them to be consistent?
>
> I have no idea. I don't use paCalls().
>
> Best,
>
> Jim
>
>
>>
>> Looking forward to your reply. Many thanks in advance..
>>
>> Best Regards
>>
>> Chao
>> ‍
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2014-06-10 10:26:47, "James W. MacDonald" <jmacdon at uw.edu> wrote:
>>> Hi Chao,
>>>
>>> On 6/8/2014 10:37 AM, 张超 wrote:
>>>> Dear list,
>>>>
>>>>
>>>> I would like to use the paCalls from oligo package for filtering
>>>> probe sets with absence of transcripts. My data are from
>>>> MoGene-2_0-st and MoEx-1_0-st-v1 array (Affymetrix). My data after
>>>> reading CEL files is a GeneFeatureSet with 12 samples (6 for control
>>>> groups, and 6 for experimental groups). What should I do with these
>>>> data computed by paCalls(PSDABG) as below ?
>>>>> library(oligo)
>>>>> OligoRawData<-read.celfiles(CEL file lists)
>>>>> eset<-rma(OligoRawData)
>>>>> dagbPS <- paCalls(OligoRawData, "PSDABG")
>>>> What to do next to filter the probe sets? Could you please send me a
>>>> complete examples and a detailed explanation for it?
>>>>
>>>
>>> You need to decide what constitutes 'present' and how many samples have
>>> to be present in order to keep the probeset.
>>>
>>> So if I were to say that a p < 0.05 is present and I needed 20 such
>>> samples, I could do
>>>
>>> keep <- rowSums(dagbPS < 0.05) > 19
>>> eset <- eset[keep,]
>>>
>>> If the above code is mysterious to you, then you need to read 'An
>>> Introduction to R'.
>>>
>>>>
>>>> In addition, moex10sttranscriptcluster.db can be used for annotation
>>>> of data from MoEx-1_0-st-v1 array, and both of mogene20stprobeset.db
>>>> and mogene20sttranscriptcluster.db can be used for that of data from
>>>> MoGene-2_0-st (including both of gene and lncRNA lists). But only
>>>> more than half of the probe sets are anotated with gene symbols by
>>>> below commands.
>>>>> results<-decideTests(fit2, method="global", adjust.method="fdr",
>>>>> p.value=0.05, lfc=0.5) #DEGs determination by t tests
>>>>> genesymbol = getText(aafSymbol(rownames(results),
>>>>> "moex10sttranscriptcluster.db" ));#annotated by
>>>>> moex10sttranscriptcluster.db for data get from MoEx-1_0-st-v1 array
>>>> Only 1217 and 24709 can be annotated by mogene20stprobeset.db and
>>>> mogene20sttranscriptcluster.db seperately for data of MoGene-2_0-st
>>>> (length(genesymbol[which(genesymbol!="")])). But the total num is
>>>> 41345 (length(results)). Only 14966 can be mapped by
>>>> moex10sttranscriptcluster.db for data of  MoEx-1_0-st-v1 (total num
>>>> is 23332 - length(results)). Should I need to add some more db for
>>>> the annotation?
>>>>
>>>
>>> The annotation packages with 'transcriptcluster' in their names are for
>>> instances where you have summarized probesets at the transcript level
>>> (which is the default for rma() in oligo). If you want to summarize at
>>> the probeset level (which I would not recommend doing, btw), you need to
>>> use target = "probeset" in your call to rma().
>>>
>>> In other words, you should only be using the transcriptcluster
>>> annotation packages. Although please note that the
>>> moex10transcriptcluster.db package is for the Mouse Exon 10 ST array,
>>> not the Gene ST array.
>>>
>>> There are any number of reasons that only a subset of probesets on the
>>> array have symbols. First, there are lots of controls, which won't have
>>> gene symbols. Second, the lincRNA/snoRNA/miRNA probesets that Affy put
>>> on these array won't have gene symbols either (because, they aren't
>>> genes). Third, there is still some speculative content on these arrays;
>>> things that might end up being genes, with gene names, in the future,
>>> but which are just hypothetical at this point in time. Fourth, the
>>> annaffy package uses the old style methods of getting annotations, in
>>> which case any probeset that matches more than one gene symbol will be
>>> masked.
>>>
>>> You will be much better served if you were to do something like
>>>
>>> gns <- select(mogene10sttranscriptcluster.db, featureNames(eset),
>>> c("ENTREZID","SYMBOL","GENENAME"))
>>>
>>> Which will result in a warning that you have multiple mappings. You will
>>> have to deal with those multiple mappings as you see fit. But after
>>> doing so, you can then do
>>>
>>> fit$genes <- gns
>>>
>>> and your topTable object will then be populated with the annotations.
>>> You might then consider using the ReportingTools package, which is under
>>> active development and maintenance, rather than the annaffy package
>>> which may still be actively maintained, but is no longer AFAICT under
>>> active development.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>>
>>>> BTW, I am a beginner of this field. I found there are too few
>>>> documents for examples about how to use functions of oligo package.
>>>> Could you please also give me some suggestions? Looking forword to
>>>> your reply. I really appreciate for your any helps.
>>>>
>>>>
>>>> Thanks again.
>>>>
>>>>
>>>> Best regards.
>>>>
>>>>
>>>> Chao
>>>>
>>>>
>>>>
>>>>
>>>>     [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>
>>
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099