[BioC] finding and averaging replicate gene records

Wed Mar 16 14:31:14 CET 2005

I agree with Sean here. The last time I had to reduce each gene to a single 
metric, using Affy data, I found that taking the probe set with the maximum 
average value across all chips in the dataset worked well [e.g. in two-group 
situations the resulting choices tended to be probe sets with smaller (if 
not the smallest) P values].
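
A minimal sketch of that selection rule, in plain Python for illustration only. The probe set names, gene names, and expression values are made-up toy data, and `pick_max_mean_probe` is a hypothetical helper, not a Bioconductor function:

```python
from statistics import mean

# Hypothetical toy data: (probe_set_id, gene_id, expression values across all chips).
rows = [
    ("ps_1", "GENE_A", [8.0, 8.5, 11.0, 11.5]),
    ("ps_2", "GENE_A", [4.0, 4.2, 4.1, 3.9]),
    ("ps_3", "GENE_B", [6.0, 6.2, 6.1, 5.9]),
]


def pick_max_mean_probe(rows):
    """For each gene, keep the probe set whose mean across all chips is largest.

    Returns {gene_id: (probe_set_id, values)}. On an exact tie the probe set
    seen first is kept.
    """
    best = {}  # gene_id -> (mean, probe_set_id, values)
    for probe_id, gene_id, values in rows:
        avg = mean(values)
        if gene_id not in best or avg > best[gene_id][0]:
            best[gene_id] = (avg, probe_id, values)
    return {gene: (probe_id, values) for gene, (_, probe_id, values) in best.items()}


selected = pick_max_mean_probe(rows)
for gene, (probe_id, values) in sorted(selected.items()):
    print(gene, probe_id, values)
```

The selection uses only the overall mean, not the group labels, so it does not peek at which probe set separates the groups best.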

Tom

----- Original Message ----- 
From: "Sean Davis" <sdavis2 at mail.nih.gov>
To: "zhihua li" <lzhtom at hotmail.com>
Cc: <bioconductor at stat.math.ethz.ch>
Sent: Wednesday, March 16, 2005 6:51 AM
Subject: Re: [BioC] finding and averaging replicate gene records

>
> On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
>
>> Hi netter!
>>
>> In most microarray slides a single gene will be represented by multiple 
>> items. Sometimes this is unforeseeable because they have different GenBank 
>> accession numbers, and you will not find them until you get a UniGene list 
>> for all your gene items.
>>
>> Now I have a data frame. Each row is a gene record: the 1st column is the 
>> GenBank accession number, the 2nd column is the UniGene ID, and the 
>> columns from the 3rd on are expression values in different conditions. 
>> All the accession numbers are unique, but the UniGene IDs show that some 
>> items, though they have different accession numbers, in fact share the 
>> same UniGene ID. I would like to find the gene records with replicate 
>> UniGene IDs and merge them into one record by averaging the expression 
>> values within each condition.
>>
>> Could anyone give me a clue about how to write the code? Or are there any 
>> contributed functions that can do this?
>>
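
For reference, the grouping and averaging asked about above can be sketched as follows. This is plain Python for illustration only, not R code from the thread; the accession numbers, UniGene-style IDs, and values are made-up, and `average_by_unigene` is a hypothetical helper. Whether averaging is advisable at all is a separate matter, which Sean's reply below addresses.

```python
from collections import defaultdict

# Hypothetical records: (accession, unigene_id, expression values per condition).
records = [
    ("AA000001", "Hs.1001", [2.0, 4.0, 6.0]),
    ("AA000002", "Hs.1001", [4.0, 6.0, 8.0]),
    ("AA000003", "Hs.2002", [1.0, 1.5, 2.0]),
]


def average_by_unigene(records):
    """Merge records sharing a UniGene ID by averaging within each condition.

    Returns {unigene_id: [mean for condition 1, mean for condition 2, ...]}.
    """
    groups = defaultdict(list)  # unigene_id -> list of per-record value lists
    for _accession, unigene_id, values in records:
        groups[unigene_id].append(values)

    merged = {}
    for unigene_id, value_lists in groups.items():
        # zip(*value_lists) yields one tuple per condition across the records.
        merged[unigene_id] = [sum(col) / len(col) for col in zip(*value_lists)]
    return merged


merged = average_by_unigene(records)
for unigene_id, means in sorted(merged.items()):
    print(unigene_id, means)
```

In R, the same grouping is usually done with base functions such as `aggregate()` applied to the data frame, keyed on the UniGene ID column.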
>
> I generally do NOT do this.  While it seems that there should be one 
> gene/one value, we know that this isn't generally true in practice.  
> Averaging gains you little (a few fewer genes going into 
> multiple-testing correction), but you stand to lose a huge amount.  In the 
> worst-case scenario, you take a "differentially-expressed" probe and 
> average it with a poor-performing probe, and end up not finding the gene 
> of interest.  If you do not merge those probes, you find one probe 
> representing the gene IS differentially-expressed and the other is not. 
> You, of course, have to determine why the two probes for the same gene 
> behave differently, but there are many explanations including things like 
> probe sequence contamination, transcript variants, array-specific effects 
> (like non-uniform background, etc.), and faulty bioinformatics (Unigene 
> may place two sequences for different genes into the same cluster, for 
> example).
>
> In short, you probably agree that you want to find ALL genes of interest 
> and then use biologic validation where necessary to determine the 
> relevance of your found genes.  However, averaging expression values per 
> gene nearly guarantees that you will sometimes miss genes of interest and 
> so is, in my opinion, not warranted.
>
> Sean
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>