[BioC] Annotation for Nonspecificity of Affymetrix Probes?

Robert Gentleman rgentlem@jimmy.harvard.edu
Mon, 5 Aug 2002 13:23:27 -0400


Hi,
 We are doing some research that is probably not completely
 dissimilar. I don't know of research regarding the affinity of
 particular probes. However, G-C pairs bind more tightly than A-T
 pairs, so the GC content is undoubtedly important. The former have
 3 H-bonds, the latter only 2.
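
 As a rough illustration of the GC-content point, here is a minimal
 Python sketch. The function names and the example 25mer are made up
 for illustration, and counting H-bonds is only a crude proxy for
 binding strength, not a hybridization model.

```python
def gc_fraction(probe: str) -> float:
    """Return the fraction of bases in the probe that are G or C."""
    seq = probe.upper()
    if not seq:
        raise ValueError("empty probe sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def hydrogen_bonds(probe: str) -> int:
    """Count Watson-Crick H-bonds of a perfectly matched duplex.

    Uses 3 bonds per G or C and 2 per A or T; any other character
    raises KeyError.
    """
    bonds = {"G": 3, "C": 3, "A": 2, "T": 2}
    return sum(bonds[base] for base in probe.upper())


# Hypothetical 25mer, not taken from any real chip.
example_probe = "GCGCGATATATGCGCATATGCGCAT"
print(len(example_probe), gc_fraction(example_probe), hydrogen_bonds(example_probe))
```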

 Probably more important are the cross-hybridization issues. I do have
 some ideas on how to deal with them.
 
 Stated quite simply, one completely ignores the mappings that
 Affymetrix has provided. The only ones that are appropriate are the
 ones that you (and lots of others) have developed by mapping the
 25mers to the transcriptome (I'm not at all sure that there is a
 reliable estimate of the transcriptome, but that's another issue).
 Note also that in this mapping there is no such thing as PM or MM;
 there are just 25mers, and they map into particular genes (or not).
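
 To make the idea of mapping 25mers to transcripts concrete, here is a
 minimal Python sketch using exact matching only. The names
 index_transcripts and map_probes, the toy sequences, and the short k
 used in the demo are all assumptions for illustration. It also
 assumes probes and transcripts are given in the same orientation;
 if they are not, one side would have to be reverse-complemented first.

```python
from __future__ import annotations

from collections import defaultdict

PROBE_LEN = 25  # length of an Affymetrix probe


def index_transcripts(transcripts: dict[str, str], k: int = PROBE_LEN) -> dict[str, set[str]]:
    """Map every k-mer found in any transcript to the ids of the genes containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for gene_id, seq in transcripts.items():
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].add(gene_id)
    return index


def map_probes(probes: dict[str, str], index: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, for each probe id, the set of genes it matches exactly (possibly empty)."""
    return {pid: set(index.get(seq.upper(), ())) for pid, seq in probes.items()}


# Toy data with k=5 so the example stays readable; real use would keep k=25.
toy_transcripts = {"geneA": "ACGTACGTAA", "geneB": "TTACGTACCC"}
toy_probes = {"p1": "ACGTA", "p2": "CGTAA", "p3": "GGGGG"}
toy_map = map_probes(toy_probes, index_transcripts(toy_transcripts, k=5))
print(toy_map)
```

 Indexing the transcripts once and looking probes up in the index
 avoids scanning every sequence for every probe, which matters with
 half a million probes.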

 This allows you to do some interesting things. Say, you have a
 favorite gene but Affy has not indicated that it is on the chip. All
 you need is its sequence and if you can find a handful of 25mers that
 match then you can estimate its abundance.
 You might want to look at "Gene Expression Analysis with Universal
 n-mer Arrays", by van Dam and Quake in Genome Research.
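
 The "favorite gene" case can be sketched as follows, again in Python
 and again with exact matching only. The function names, the
 min_probes threshold, and the toy data are hypothetical, and the
 median of log2 intensities is just a placeholder summary chosen for
 illustration, not a recommended estimator.

```python
import math
from statistics import median


def probes_hitting(gene_seq, probes):
    """Return {probe_id: first match position} for probes found in gene_seq."""
    gene_seq = gene_seq.upper()
    hits = {}
    for pid, pseq in probes.items():
        pos = gene_seq.find(pseq.upper())
        if pos != -1:
            hits[pid] = pos
    return hits


def crude_abundance(hits, intensities, min_probes=3):
    """Median log2 intensity over the matching probes, or None if too few match."""
    values = [
        math.log2(intensities[pid])
        for pid in hits
        if intensities.get(pid, 0) > 0  # skip missing or non-positive readings
    ]
    if len(values) < min_probes:
        return None
    return median(values)


# Toy example with 5-mers standing in for 25mers.
gene = "AAACCCGGGTTT"
probes = {"a": "AAACC", "b": "GGGTT", "c": "CCCCC", "d": "CCGGG"}
intensities = {"a": 4.0, "b": 16.0, "c": 1000.0, "d": 64.0}
hits = probes_hitting(gene, probes)
print(hits, crude_abundance(hits, intensities))
```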

 Going back one paragraph, these mappings are many-to-one in both
 directions, i.e. many-to-many. Some genes are matched by multiple
 probe sets, and some probe sets match multiple genes. (If this sounds
 a lot like SAGE data, it should. You can think of SAGE as digital and
 Affy as analog, and it sort of works.)
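
 A small Python sketch of what the mapping looks like in both
 directions. It assumes a probe-to-genes mapping is already available
 as a dict of sets; the function names and the toy mapping are
 illustrative only.

```python
from collections import defaultdict


def invert(probe_to_genes):
    """Turn {probe: {genes}} into {gene: {probes}}."""
    gene_to_probes = defaultdict(set)
    for pid, genes in probe_to_genes.items():
        for gene in genes:
            gene_to_probes[gene].add(pid)
    return dict(gene_to_probes)


def split_by_specificity(probe_to_genes):
    """Partition probe ids into (unique, shared, unmapped) sets."""
    unique = {p for p, g in probe_to_genes.items() if len(g) == 1}
    shared = {p for p, g in probe_to_genes.items() if len(g) > 1}
    unmapped = {p for p, g in probe_to_genes.items() if not g}
    return unique, shared, unmapped


# Toy mapping: p1 hits two genes, p2 and p4 hit one each, p3 hits none.
toy = {"p1": {"geneA", "geneB"}, "p2": {"geneA"}, "p3": set(), "p4": {"geneB"}}
by_gene = invert(toy)
unique, shared, unmapped = split_by_specificity(toy)
print(by_gene, unique, shared, unmapped)
```

 Probes in the shared set are the cross-hybridization candidates;
 whether to drop them or model them is a separate decision.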

 Things are a bit simpler with SAGE -- I don't want to say much about
 Affy because we are in the midst of figuring it out.

 Regards,
    Robert

 ps I encourage you to look at all resources used to construct our
 data packages. There are a lot of people doing similar things.


On Mon, Aug 05, 2002 at 11:49:45AM -0500, Jeff Sorenson wrote:
> I would like to thank all of the contributors to the bioconductor project
> for putting their tools into the public domain.  I'm embarking on a project
> using Affymetrix U133A/B chips and have been in the process of setting up a
> database of probe/sequence information and other annotation information
> (mysql), and learning to use the various R packages.  Looking over the probe
> sequences and putative gene sequences that affymetrix provides on their
> website, it is clear that many of the probes are nonspecific - i.e., they
> perfectly match portions of gene sequences that are different from the one
> they were derived from.  In some cases, it appears that affymetrix has
> simply generated multiple probe sets for transcriptional variants of the
> same gene.  In other cases, it appears that some probes are simply
> nonspecific.  Affymetrix does warn us that some probe sets are less specific
> than others, and this is indeed incorporated into their probe set
> nomenclature, but I have found no downloadable file that lists the
> specifics.  My computer should be done testing the half million probes for
> perfect matches against the ~45000 sequences some time later this week.
> After that, I will probably test the mismatch probes.
> 
> My question to this community is this:  is there already an annotation file
> or package that takes this consideration into account?  If so, can this
> information be readily adapted into the R packages for probe level analysis
> and gene expression estimation?
> 
> In a related question, can anyone point me to an algorithm for accurately
> estimating the hybridization probability of an arbitrary probe against an
> arbitrary mRNA.  Would it correlate closely to the BLAST score?  Has anyone
> done theoretical studies on the nature of the mismatch probes and their
> usefulness in measuring "nonspecific" binding?  It would be nice to be able
> to predict how strongly a particular mRNA should bind to each of the probes
> on a chip (both PM and MM).  If this is feasible, has anyone done in computo
> chip hybridization experiments to see how closely the estimated expression
> levels are to the actual input?
> 
> 
> Thanks,
> 
> Jeff Sorenson
> 

-- 
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: M1B20                            |
| Harvard School of Public Health  email: rgentlem@jimmy.dfci.harvard.edu   |
+---------------------------------------------------------------------------+