[BioC] probe summarization

Thu Sep 6 20:42:23 CEST 2007

Hi Bogdan,

Bogdan Tanasa wrote:
> Hi James,
> 
> I used the following instructions in R: mydata <- ReadAffy(), mycomp <-
> gcrma(mydata), write.table(mycomp, "mytext.txt", sep="\t"),
> or I called mydata <- expresso(..., methods.summarization="median.polish",
> ...). In the results table, I obtained an expression value
> per PROBE, and I would like to have an expression value per GENE. I know
> that RMA/GCRMA can use median polish to summarize
> the probes for a gene. To ask the question more specifically: is there
> anything that the code I use is missing? In the final results
> table, I would like to have the expression values for 10000-12000 genes
> instead of having expression values for 22000 probes. Thanks,

There is a bit of terminology here that is incorrect. You have 
expression values for 22283 _probesets_, which are based on ~250000 probes.

You are correct, however, that there is some duplication: several 
probesets can map to the same gene. How you deal with that duplication 
is not a trivial question to answer. I suppose the easiest thing to do 
would be to use the MBNI re-mapped cdfs that we supply. For instance, to 
use the Entrez Gene remapped cdf you would do something like this:

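library(gcrma)  # also attaches affy, which provides ReadAffy()
# Read the CEL files using the Entrez Gene remapped cdf
dat <- ReadAffy(cdfname="hs133av2hsentrezgcdf")
# gcrma() also needs the matching probe package
biocLite("hs133av2hsentrezgprobe")
eset <- gcrma(dat)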

As with all things, there are positive and negative aspects to using the 
MBNI cdfs. The bad is that the number of probes per probeset is now 
highly variable, so one would usually want to have standard errors that 
could be propagated through to any differential expression calculations. 
I think the puma package might be useful here, but I haven't tried it yet.
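
To make the point about variable probeset sizes concrete, here is a small base R sketch. It is only an illustration under stated assumptions: a plain mean stands in for the real summarization, and the per-probe noise level and the two probeset sizes are made-up numbers, not values taken from any cdf.

```r
# Toy illustration: the precision of a probeset summary depends on how many
# probes go into it. A plain mean stands in for the real summarization here.
set.seed(1)
probe_sd <- 0.5                       # assumed per-probe noise on the log2 scale
n_probes <- c(small = 4, large = 40)  # made-up probeset sizes

# Theoretical standard error of a mean of n independent probes.
se_theory <- probe_sd / sqrt(n_probes)

# Empirical check: simulate many summaries and take their standard deviation.
se_sim <- sapply(n_probes, function(n) {
  sd(replicate(2000, mean(rnorm(n, mean = 8, sd = probe_sd))))
})

print(round(rbind(se_theory, se_sim), 3))
```

The summary for the small probeset is noticeably less precise than the one for the large probeset, which is why treating all probesets as equally reliable can be misleading.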

You could also make the assumption that the probeset that has the 
largest statistic in whatever comparison you are making is 'the right 
one', and simply use that. The findLargest() function in genefilter is 
useful in that respect.
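
As a rough illustration of the 'largest statistic' idea, here is a minimal base R sketch. The probeset IDs, gene names, and t-statistics are made up, and pick_largest() is a hypothetical helper written for this example, not the genefilter function. Using the absolute value is also an assumption about what 'largest' should mean for a signed statistic.

```r
# Toy data: made-up t-statistics for five probesets mapping to three genes.
tstat <- c(ps1 = 2.1, ps2 = -4.3, ps3 = 0.5, ps4 = 3.2, ps5 = -1.0)
gene  <- c(ps1 = "GENEA", ps2 = "GENEA", ps3 = "GENEB",
           ps4 = "GENEC", ps5 = "GENEC")

# Hypothetical helper: for each gene, keep the probeset whose statistic
# is largest in absolute value.
pick_largest <- function(stat, gene) {
  by_gene <- split(stat, gene[names(stat)])
  vapply(by_gene, function(s) names(s)[which.max(abs(s))], character(1))
}

keep <- pick_largest(tstat, gene)
print(keep)
```

The result is a character vector of probeset IDs named by gene, which can then be used to subset the rows of the expression matrix.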

Best,

Jim

> 
> Bogdan
> 
> 
> 
> # Read Affy CEL files
> data <- ReadAffy()
> # Normalize and do summarization using gcrma
> eset <- gcrma (data)
> #
> # Now eset contains all the information that you require
> #
> # to get a matrix of expression values, use exprs command
> evals <- exprs (eset)
> #
> # The command below will tell you that it is a matrix
> class (evals)
> #
> # You can write out tab separated expression values to be used by other
> # programs using the command
> write.table (evals, "expressvals.txt", sep="\t")
> #
> #
> Send me questions if you have any
> 
> On 9/6/07, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> 
>>Hi Bogdan,
>>
>>Bogdan Tanasa wrote:
>>
>>>Hi all,
>>>
>>>I would like to ask for some information: I am carrying out the array analysis for a
>>>large dataset (40 samples * 2 replicates);
>>>the arrays are Affy U133A, and I use GCRMA and invariant set normalization.
>>>Please could you let me know
>>>the way I could do the probe summarization for these arrays. Thanks and best
>>
>>GCRMA _is_ a method to do probe summarization. Maybe you are asking a
>>different question?
>>
>>Best,
>>
>>Jim
>>
>>
>>
>>>regards,
>>>
>>>Bogdan
>>>
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>--
>>James W. MacDonald
>>University of Michigan
>>Affymetrix and cDNA Microarray Core
>>1500 E Medical Center Drive
>>Ann Arbor MI 48109
>>734-647-5623
>>
>>
> 
> 
