[BioC] finding and averaging replicate gene records

Thu Mar 17 16:39:31 CET 2005

Not only will you lose information, but you might obtain the wrong
information! If someone has one foot in a bucket of freezing ice and the
other in a bucket of boiling water, then on average he _should_ be
comfortable at 50 degrees Celsius.

I had a look at the HG-U133 Plus 2 CDF, which has 54675 probesets, of
which 47297 had a unigene ID mapping. This was the distribution of the
number of probesets per unigene ID.

       1     2     3     4     5     6     7     8
   12590  5501  2815  1508   741   384   170   106

       9    10    11    12    13    14    15    19
      45    27    18     8     7     3     4     1

(That means 12590 unigene IDs are represented by a single probeset on
the array, 5501 by two probesets, ..., and 1 unigene ID by 19 probesets.)

In short, you can reduce from 47297 probesets to 23928 unique unigene
IDs. Add the 7378 probesets without a unigene ID and your final reduced
dataset has 31306 rows.
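
As a rough check, the totals can be recomputed from the occurrence table
alone. This is only a minimal Python sketch; the variable names are my
own and nothing here comes from a Bioconductor package.

```python
# Occurrence table from above:
# {probesets per unigene ID: number of unigene IDs with that many probesets}
occurrence = {
    1: 12590, 2: 5501, 3: 2815, 4: 1508, 5: 741, 6: 384, 7: 170, 8: 106,
    9: 45, 10: 27, 11: 18, 12: 8, 13: 7, 14: 3, 15: 4, 19: 1,
}
TOTAL_PROBESETS = 54675  # all probesets on the array

# Each unigene ID with k probesets contributes k mapped probesets.
mapped_probesets = sum(k * n for k, n in occurrence.items())
# Each unigene ID collapses to exactly one row after merging.
unique_ids = sum(occurrence.values())
# Probesets without a unigene ID cannot be merged and stay as they are.
unmapped_probesets = TOTAL_PROBESETS - mapped_probesets
reduced_rows = unique_ids + unmapped_probesets

print("mapped probesets:  ", mapped_probesets)
print("unique unigene IDs:", unique_ids)
print("unmapped probesets:", unmapped_probesets)
print("rows after merging:", reduced_rows)
```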

I do not think that the computational savings from working with about
31300 rows instead of 54675 justify the risk of averaging important
genes with noisy ones. Besides, unigene IDs change every couple of
months, so you may have to redo your analysis over and over again,
thereby diminishing any computational savings you may have had.
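
To make the risk concrete, here is a small Python sketch of averaging
rows that share a unigene ID. The accession numbers, unigene IDs and
expression values are invented for illustration, and the function names
are hypothetical. The layout follows the one described in the original
question (accession, unigene ID, then one value per condition).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (accession, unigene ID, expression per condition).
# Conditions are [control, control, treated, treated]; values are made up.
rows = [
    ("ACC1", "Hs.1", [7.0, 7.0, 10.0, 10.0]),  # responds to treatment
    ("ACC2", "Hs.1", [8.0, 9.0, 8.0, 7.0]),    # noisy, no clear response
    ("ACC3", "Hs.2", [5.0, 5.0, 5.0, 5.0]),    # only record for its ID
]

def average_by_unigene(rows):
    """Merge rows sharing a unigene ID by averaging within each condition."""
    groups = defaultdict(list)
    for _accession, unigene, values in rows:
        groups[unigene].append(values)
    return {
        unigene: [mean(column) for column in zip(*value_lists)]
        for unigene, value_lists in groups.items()
    }

def treatment_effect(values):
    """Mean of the two treated conditions minus mean of the two controls."""
    return mean(values[2:]) - mean(values[:2])

averaged = average_by_unigene(rows)
print("averaged Hs.1:        ", averaged["Hs.1"])
print("effect, ACC1 alone:   ", treatment_effect(rows[0][2]))
print("effect, averaged Hs.1:", treatment_effect(averaged["Hs.1"]))
```

In this toy case the responsive record shows a difference of 3 between
treated and control, while the merged record shows only 1, so the signal
is diluted by the noisy partner.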

I am in favour of approaches that work on summary statistics (e.g. the
minimum p-value for a unigene ID).
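
A minimal Python sketch of that idea, assuming a test has already been
run on every probeset separately. The probeset names, unigene IDs and
p-values below are invented, and the function name is hypothetical.

```python
# Hypothetical per-probeset results: (probeset, unigene ID or None, p-value).
probeset_pvalues = [
    ("ps_a", "Hs.1", 0.0004),
    ("ps_b", "Hs.1", 0.62),
    ("ps_c", "Hs.2", 0.03),
    ("ps_d", None, 0.20),  # no unigene mapping: kept as its own row
]

def min_pvalue_per_unigene(records):
    """For each unigene ID, keep the probeset with the smallest p-value."""
    best = {}
    for probeset, unigene, p in records:
        # Unmapped probesets are keyed by their own name so they are not merged.
        key = unigene if unigene is not None else probeset
        if key not in best or p < best[key][1]:
            best[key] = (probeset, p)
    return best

summary = min_pvalue_per_unigene(probeset_pvalues)
for key, (probeset, p) in summary.items():
    print(key, probeset, p)
```

The expression data are never merged here; only the test results are
summarised, so a good probeset is not diluted by a poor one. Bear in
mind that the minimum of several p-values tends to be smaller by chance
for IDs with many probesets, so some adjustment may be wanted.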

Regards, Adai

On Thu, 2005-03-17 at 03:19 +0000, zhihua li wrote: 
> Thanks to all for your replies.
> 
> It is true that by averaging expression values for (putatively) the same 
> gene we will lose some information. But sometimes the reduction in data 
> size matters more, especially when one is trying to run a 
> computationally expensive algorithm on one's data. So I think averaging 
> is sometimes worthwhile.
> 
> Thanks again!
> 
> >From: "Tomas Radivoyevitch" <radivot at hal.EPBI.cwru.edu>
> >To: "Sean Davis" <sdavis2 at mail.nih.gov>, "zhihua li" <lzhtom at hotmail.com>
> >CC: <bioconductor at stat.math.ethz.ch>
> >Subject: Re: [BioC] finding and averaging replicate gene records
> >Date: Wed, 16 Mar 2005 08:31:14 -0500
> >
> >Agreeing with Sean here, in my last experience where I had to reduce 
> >each gene to a single metric, using Affy data I found that taking 
> >the probe set with the maximum average value across all chips in the 
> >dataset worked well [e.g. in two group situations the resulting 
> >choices tended to be probe sets with smaller (if not the smallest) P 
> >values].
> >
> >Tom
> >
> >----- Original Message ----- From: "Sean Davis" 
> ><sdavis2 at mail.nih.gov>
> >To: "zhihua li" <lzhtom at hotmail.com>
> >Cc: <bioconductor at stat.math.ethz.ch>
> >Sent: Wednesday, March 16, 2005 6:51 AM
> >Subject: Re: [BioC] finding and averaging replicate gene records
> >
> >
> >>
> >>On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> >>
> >>>Hi netter!
> >>>
> >>>In most microarray slides a single gene will be represented by 
> >>>multiple items. Sometimes it's unforeseeable because they have 
> >>>different genbank accession numbers and you will not find them 
> >>>until you get a unigene list for all your gene items.
> >>>
> >>>Now I have a data frame. The rows are gene records: the 1st 
> >>>column is genbank accession numbers, the 2nd column is unigene 
> >>>IDs, and from the 3rd column on are expression values in 
> >>>different conditions. All the accession numbers are unique, but 
> >>>through unigene IDs I can find that some items, though with 
> >>>different accession numbers, are in fact sharing the same 
> >>>unigene ID. I would like to find the gene records containing 
> >>>replicate unigene IDs and merge them into one record by 
> >>>averaging different expression values in the same condition.
> >>>
> >>>Could anyone give me a clue about how to write the code? Or are 
> >>>there any contributed functions that can do this?
> >>>
> >>
> >>I generally do NOT do this.  While it seems that there should be 
> >>one gene/one value, we know that this isn't generally true in 
> >>practice.  By averaging you gain little (a few fewer genes going 
> >>into multiple-testing correction), but you stand to lose a 
> >>huge amount.  In the worst-case scenario, you take a 
> >>"differentially-expressed" probe and average it with a 
> >>poor-performing probe, and end up not finding the gene of interest. 
> >>  If you do not merge those probes, you find one probe representing 
> >>the gene IS differentially-expressed and the other is not. You, of 
> >>course, have to determine why the two probes for the same gene 
> >>behave differently, but there are many explanations including 
> >>things like probe sequence contamination, transcript variants, 
> >>array-specific effects (like non-uniform background, etc.), and 
> >>faulty bioinformatics (Unigene may place two sequences for 
> >>different genes into the same cluster, for example).
> >>
> >>In short, you probably agree that you want to find ALL genes of 
> >>interest and then use biologic validation where necessary to 
> >>determine the relevance of your found genes.  However, averaging 
> >>expression values per gene nearly guarantees that you will 
> >>sometimes miss genes of interest and so is, in my opinion, not 
> >>warranted.
> >>
> >>Sean
> >>
> >>_______________________________________________
> >>Bioconductor mailing list
> >>Bioconductor at stat.math.ethz.ch
> >>https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>
> >
> >
> 