[BioC] HGU133Plus2 CDF vs hgu133plus2hsentrezgcdf CDF (30% difference in results)

Mon Sep 15 16:58:20 CEST 2014

Another thing to consider is that the probesets for that array are based on
UniGene build 133, which was current somewhere around 10 years ago (if not
longer). That is a long time ago, considering the speed with which the
human genome has been updated, so there may be many probesets on that array
that no longer measure anything recognizable.

If you care to find out how bad (or good) the conventional Affymetrix
probeset definitions are, you could re-align the probe sequences against
the current genome and see how many are still measuring the intended
target. Or you could assume that the updated alignments from MBNI (the
Brainarray custom CDFs) are better, and just go with that (certainly
easier, but you know what they say about assumptions...).
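
As a rough illustration of the re-alignment check, here is a minimal
Python sketch. It assumes you already have the probe sequences grouped by
probeset and the current sequence of each intended target; the function
name, IDs and sequences are made up for illustration. Exact substring
matching on one strand stands in for a real aligner run against the
current genome, so treat it as a toy version of the idea only.

```python
from collections import defaultdict

def fraction_on_target(probes, targets):
    """Return, per probeset, the fraction of probes found in the intended target.

    probes  : iterable of (probeset_id, probe_sequence, intended_target_id)
    targets : dict mapping target_id -> current target sequence
    Exact substring matching is a simplification of a real alignment step.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for probeset_id, seq, target_id in probes:
        totals[probeset_id] += 1
        # A missing target counts as a miss for every probe aimed at it.
        if seq in targets.get(target_id, ""):
            hits[probeset_id] += 1
    return {ps: hits[ps] / totals[ps] for ps in totals}

# Hypothetical toy data (real probes are 25-mers; these are shortened).
targets = {
    "GENE_A": "ACGTACGTTTGACCA",
    "GENE_B": "TTTTGGGGCCCCAAAA",
}
probes = [
    ("ps1", "ACGTA", "GENE_A"),
    ("ps1", "TTGAC", "GENE_A"),
    ("ps2", "GGGGC", "GENE_B"),
    ("ps2", "ACGTA", "GENE_B"),  # no longer matches its intended target
]

result = fraction_on_target(probes, targets)
print(result)
```

Probesets with a low fraction are the ones worth a closer look before
trusting either CDF for them.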

Personally, I would go with the first option (re-aligning the probes
yourself), which would have two benefits. One, you would get to have some
fun learning how to do something different. And really, who doesn't like
that? Two, it would give you a rock-solid rationale for your choice of
CDF, which should be impressive to your advisor because you a) thought
about the problem and then b) did something to actively quantify the
differences, so you can make an informed choice.

Best,

Jim

On Sun, Sep 14, 2014 at 9:51 AM, Steve Lianoglou <lianoglou.steve at gene.com>
wrote:

> Hi,
>
> On Sat, Sep 13, 2014 at 11:31 AM, Mahes Muniandy [guest]
> <guest at bioconductor.org> wrote:
> > Hello,
> > My name is Mahes Muniandy and I am a doctoral student. I have been
> > analysing Affymetrix HGU133Plus2 CEL files to determine differential
> > expression in twin pairs (within-pair differences). I have used affy,
> > gcrma, nsFilter and limma to do my analysis. I have run my analysis using
> > the HGU133Plus2 CDF available in Bioconductor and then tried the whole
> > analysis again using the HGU133Plus2 CDF from Brainarray. The limma results
> > differ significantly (2351 differentially expressed genes for the former
> > and 2700 genes for the latter analysis). 630 genes (about 30%) from the
> > 2351 genes do not exist in the list of 2700 genes.
> >
> > I have read "Evolving Gene/Transcript Definitions Significantly Alter
> > the Interpretation of GeneChip Data" by M. Dai et al. and see some
> > convincing arguments there. But I am confused about which limma results to
> > go with. Could you advise me on the guiding principles that I should follow
> > in order to decide which CDF to use? I do realise that the onus is on me to
> > decide but sadly, I am quite lost in this matter. I would appreciate any
> > help available.
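
For reference, the overlap reported above can be worked out with a few
lines of Python. This is only a sketch: the gene IDs are placeholders
sized to match the reported counts (2351, 2700, and 630 missing), and it
assumes both lists use the same kind of gene identifier with each gene
listed once. The function name is made up for illustration.

```python
def overlap_summary(first, second):
    """Summarise how two lists of significant genes overlap."""
    first, second = set(first), set(second)
    only_first = first - second
    return {
        "shared": len(first & second),
        "only_first": len(only_first),
        "only_second": len(second - first),
        # Share of the first list that is absent from the second list.
        "fraction_of_first_missing": len(only_first) / len(first) if first else 0.0,
    }

# Placeholder IDs, constructed only to reproduce the reported counts.
standard = {f"gene{i}" for i in range(2351)}
brainarray = {f"gene{i}" for i in range(630, 630 + 2700)}

summary = overlap_summary(standard, brainarray)
print(summary)
```

With these counts, 630 of 2351 is closer to 27% than 30%, and under the
same assumptions 1721 genes are shared while 979 appear only in the
Brainarray list.
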
>
> I'd start by investigating whether or not the genes included in one
> analysis and not the other seem reasonable for your experiment (i.e., do
> some GO analysis on the differences and see if they are relevant to
> the data/treatment you are studying).
>
> Another thing to check is to plot the t-statistics from each analysis
> against each other. Are the differences you are seeing just the result of
> genes dancing around thresholds of significance? If you define
> significance by a certain FDR *and* a minimum absolute log-fold-change,
> you might get better concordance. If the concordance still isn't good,
> I'd go back and start looking at the differing genes and try to
> interpret the differences to see which result makes more sense.
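
A minimal Python sketch of that comparison, assuming the two limma result
tables have been exported and keyed by a shared gene ID. The field names
"t", "adj_p" and "logFC", the thresholds, and all the numbers are
illustrative assumptions, not values from the analysis discussed here.

```python
import math

def significant(results, fdr=0.05, min_abs_lfc=0.0):
    """Gene IDs passing both the FDR and the absolute log-fold-change cutoff."""
    return {g for g, r in results.items()
            if r["adj_p"] <= fdr and abs(r["logFC"]) >= min_abs_lfc}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up results for four genes under each CDF.
standard = {
    "G1": {"t": 8.0, "adj_p": 0.001, "logFC": 2.0},
    "G2": {"t": 3.1, "adj_p": 0.040, "logFC": 0.4},
    "G3": {"t": -6.5, "adj_p": 0.002, "logFC": -1.5},
    "G4": {"t": 2.6, "adj_p": 0.060, "logFC": 0.3},
}
brainarray = {
    "G1": {"t": 7.5, "adj_p": 0.001, "logFC": 1.9},
    "G2": {"t": 2.7, "adj_p": 0.060, "logFC": 0.35},
    "G3": {"t": -6.0, "adj_p": 0.003, "logFC": -1.4},
    "G4": {"t": 3.0, "adj_p": 0.045, "logFC": 0.35},
}

shared = sorted(set(standard) & set(brainarray))
t_corr = pearson([standard[g]["t"] for g in shared],
                 [brainarray[g]["t"] for g in shared])
fdr_only = jaccard(significant(standard), significant(brainarray))
fdr_and_lfc = jaccard(significant(standard, min_abs_lfc=1.0),
                      significant(brainarray, min_abs_lfc=1.0))
print(t_corr, fdr_only, fdr_and_lfc)
```

In this toy case the t-statistics agree closely, and the disagreement
under the FDR-only cutoff comes entirely from two genes sitting near the
threshold; adding the fold-change cutoff removes it.
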
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Genentech

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
