[BioC] Does the strand of a microarray probe matter?

Kasper Daniel Hansen khansen at stat.berkeley.edu
Mon Nov 24 02:36:37 CET 2008

On Nov 21, 2008, at 19:30 , Nick Henriquez wrote:

> And sorry for perhaps not making this absolutely clear. To be
> completely certain there is no misunderstanding about this:
> Regardless of annotation, even if a piece of DNA encodes a gene on  
> both strands only ONE of these will hybridise to your probe. The  
> reverse-complement is NOT a perfect match, except in vanishingly
> rare cases, i.e. palindromic sequences such as restriction enzyme
> recognition sites. These are usually excluded from probe sets due to
> ambiguity/crosshybridising potential. RC sequences are completely
> different and do not crosshybridise with cDNA. Take any sequence
> (actgctgacag becomes ctgtcagcagt) and you will see that, and why,
> this is the case.
> Given that we know the sequence of the probe we can always tell from  
> which strand the hybridising cDNA is derived. So  there is no doubt  
> whatsoever which gene was involved/altered in expression. If geneX  
> is on the "opposite strand" geneX was NOT the gene which was altered  
> in its expression, geneX is not detected by the probe in question.  
> This annotation incontrovertibly proves that geneX is not measured by
> this probe. Therefore it was geneY encoded by the relevant strand of  
> DNA. You may have to figure out what geneY is depending on quality  
> of annotation but there are sufficient secondary databases to do  
> that. You may even discover a "new gene".

This is only true if the assay does not lose strandedness. Let us say
your assay involves making double-stranded cDNA, as e.g. some
high-throughput sequencing does. In that case you have no way of telling
what strand your original material came from.
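
To make the point concrete, here is a minimal sketch in Python. It assumes an idealised perfect-match hybridisation rule and reuses the example sequence from Nick's message; the function names and the "gene X"/"gene Y" labels are hypothetical, and real protocols differ in which strand ends up labelled.

```python
# Illustrative sketch only: idealised perfect-match hybridisation,
# hypothetical names, toy sequence taken from the message quoted above.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def labelled_targets(transcript: str, double_stranded: bool) -> set:
    """Material that reaches the array for one transcript.

    Assumption: a strand-preserving protocol yields only first-strand
    cDNA; a protocol that makes double-stranded cDNA yields both strands.
    """
    first_strand = reverse_complement(transcript)
    if double_stranded:
        return {first_strand, reverse_complement(first_strand)}
    return {first_strand}


def probe_detects(probe: str, transcript: str, double_stranded: bool) -> bool:
    """True if any labelled target is the exact reverse complement of the probe."""
    wanted = reverse_complement(probe)
    return wanted in labelled_targets(transcript, double_stranded)


gene_x = "actgctgacag"                # sense sequence of gene X
gene_y = reverse_complement(gene_x)   # overlapping gene on the other strand
probe = gene_x                        # probe designed against gene X

print(probe_detects(probe, gene_x, double_stranded=False))
print(probe_detects(probe, gene_y, double_stranded=False))
print(probe_detects(probe, gene_y, double_stranded=True))
```

In this toy model the probe separates gene X from gene Y only while the protocol keeps a single strand. Once both strands are present, material derived from gene Y also matches the probe.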


> If 10% of genes may be affected, that means 10% of the genes in your  
> dataset. Usually we're not talking about thousands so it's fairly  
> easy to check. E.g. by looking for "encoded by" in the annotation  
> etc. If you use affy chips, their expression console provides an
> excel/openoffice compatible output which will allow this, even if
> within R/BioC some of the annotated information might be lost. As
> long as the "strand identity" annotation is retained you will always
> see from BioC output whether geneX was in fact measured or not.
> Perhaps code can be adjusted to ignore "other strand" annotations
> altogether. I don't write code, but it seems a relatively easy
> command to me, whatever the correct syntax: something like
> probes with "other strand" in the description = FALSE.
> Best, Nick
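
The filter Nick describes could be sketched roughly as follows, in Python rather than R. The probe IDs, gene names and description strings are invented for illustration, and the exact wording used to mark opposite-strand probes in a given manufacturer's annotation is an assumption that would need checking against the real file.

```python
# Hypothetical sketch: flag probes whose free-text description mentions
# the other/opposite strand. All identifiers and descriptions are made up.
STRAND_MARKERS = ("opposite strand", "other strand")


def is_opposite_strand(description: str) -> bool:
    """True if the description contains one of the assumed strand markers."""
    text = description.lower()
    return any(marker in text for marker in STRAND_MARKERS)


def split_by_strand_annotation(probes):
    """Split annotation records into (kept, flagged) lists."""
    kept, flagged = [], []
    for record in probes:
        if is_opposite_strand(record["description"]):
            flagged.append(record)
        else:
            kept.append(record)
    return kept, flagged


annotation = [
    {"probe_id": "probe_1", "description": "geneX, mRNA"},
    {"probe_id": "probe_2", "description": "geneY, Opposite Strand"},
    {"probe_id": "probe_3", "description": "geneZ (other strand)"},
]

kept, flagged = split_by_strand_annotation(annotation)
print([r["probe_id"] for r in kept])
print([r["probe_id"] for r in flagged])
```

Flagging rather than silently dropping keeps the affected probes available for inspection, which matters if the assay does not preserve strand.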
> From: seandavi at gmail.com [mailto:seandavi at gmail.com] On Behalf Of  
> Sean Davis
> Sent: 20 November 2008 22:51
> To: Cei Abreu-Goodger
> Cc: n.henriquez at ion.ucl.ac.uk; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Does the strand of a microarray probe matter?
> On Thu, Nov 20, 2008 at 3:48 PM, Cei Abreu-Goodger <cei at ebi.ac.uk>  
> wrote:
> Hi Nick, and others,
> Apologies for not making my question more clear, but I guess there  
> have been some interesting answers anyway. I was in fact thinking of  
> expression arrays. And my main interest was from the standpoint of  
> probe annotation.
> It now does seem pretty clear that there are many regions in the  
> genome that encode transcripts on both strands. If a probe is  
> designed to such a region, the expression microarrays will be  
> measuring both transcripts, and you will essentially have a  
> "perfectly" cross-hybridizing probe.
> Not really.  It depends on the protocol being used.  For illumina,
> you will end up with a product that goes on the array that is
> strand-specific.  That is not true of all array platforms.
> Now, annotation-wise, what should we do? Ignore such probes? At  
> least flag them up? The problem is, many bioconductor annotation  
> packages only allow a single gene to be assigned to each probe. So,  
> in many cases you may be led to believe that your experiment has
> measured differential expression for a particular gene (with its set  
> of GO terms, KEGG pathways, etc) when in fact the changing gene was  
> the one on the other strand.
> I don't think this comes up very often, but it is always possible  
> that for any given gene there is another explanation for  
> differential expression as observed.  That is why for a given gene,  
> it is important to validate using a different technology.  Globally  
> (as in sets of genes), it hopefully won't be too much a factor.
> These "problems" tend to show up on the list occasionally, for
> example when people find out that different databases
> (Ensembl/Biomart, NCBI, the manufacturer or a bioC annotation package)
> list different genes for the same probe. Obviously not all, but many of
> these differences have been due to overlapping transcripts. In fact,
> Ensembl recently patched their probe mapping pipeline to be
> "strand-aware". If you think that this would affect a tiny portion of
> probes, think again: the proportion of Affymetrix probes affected on
> the human and mouse genomes was around 10%:
> http://osdir.com/ml/science.biology.ensembl.devel/2008-06/msg00052.html
> Also, from talking to some of the NuID/Illumina mapping people it  
> seems that they simply don't consider the strand of the probe. But  
> they do calculate a "uniqueness" score to avoid probes that map to  
> multiple genes.
> In the end, I would ideally prefer "cross-hybridizing" probes (of  
> whatever sort) to be annotated in a way that they could be  
> identified. But I have no idea of how much a nightmare that would be  
> for the developers of the current annotation packages...
> There is no attempt to map probes in bioconductor annotation  
> packages (at least those maintained by the core).  The annotation  
> from which the annotation packages are derived come directly from  
> the manufacturers, generally.  Herve Pages just posted some code to  
> the list that will allow you to align your own probes to the genome  
> or, more probably, to a transcript database of your choice.  Then,  
> you can use your own definitions for probes.  I used to do this on a  
> large scale for all arrays that we used, but I have backed away  
> because the answers that one gets are very similar for the vast  
> majority of probes.
> Sean
> Nick Henriquez wrote:
> Dear Cei, Steve,
> There are two versions of the correct answer depending on whether we
> are talking about an expression or CGH/SNP type array;
> If we are using an EXPRESSION array
> 1) It does not matter on which strand the gene resides.
> 2) It is not a matter of bad probe design. It is either a negative
> control or a misnomer derived from genome annotation.
> For ANY probe to hybridise it has to be the RC of cDNA and therefore
> the DNA homologue of the original RNA sequence. (I'll let you work
> that one out for yourself).
> If the probe WAS encoded on "the opposite strand" your labelled
> target would not hybridise as it would be the reverse complement of
> the actual sequence.
> The annotation "opposite strand" stems from the convention that we
> call one strand the "coding strand" and the other strand the
> non-coding or "opposite" strand. By definition then a gene cannot be
> encoded by the "opposite" strand.
> However, what often happens when sequencing genomes is that we find
> several genes encoded on one strand (which we will then call the
> coding strand) and then somewhat later also one or more genes on the
> "opposite" strand. This annotation is (wrongly in my opinion)
> retained when genomes are assembled and thus part of the annotation
> of the probes.
> So an opposite strand probe is at best a kind of negative control,
> at worst a misnomer annotation retained when the genome was
> assembled. Mostly we now try to use terms like + and - but even that
> has the drawback that we generally associate + with coding and - with
> noncoding. As we all know BOTH strands encode functional RNAs of
> various kinds including those coding for proteins.....
> If we are talking about DNA targets, e.g. a SNP array
> 1) It does not matter on which strand a gene resides, any overlap is a
> matter of coincidence; "genes" are rare events on the genome.
> 2) It is not a matter of bad probe design. Usually it simply does not
> matter and this is a sequence that was used historically without
> knowledge of the gene (often discovered later). Sometimes the
> sequence on the coding strand may have a problem with background or
> sequence similarity. To get around this one can try to use the RC
> (i.e. "opposite strand" sequence) which is often different enough. Of
> course if more than 2 similar sequences exist the problem remains as
> we can use this trick only once.
> Hope this helps,
> Nick
> N.V. Henriquez, Senior Research Associate
> Dept. Of Neurodegenerative Diseases
> Institute of Neurology, UCL, Queen Square House rm 124
> Queen Square
> London WC1N 3BG
> Message: 8
> Date: Wed, 19 Nov 2008 10:45:52 -0500
> From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
> Subject: Re: [BioC] Does the strand of a microarray probe matter?
> To: Cei Abreu-Goodger <cei at ebi.ac.uk>
> Cc: Bioconductor Newsgroup <bioconductor at stat.math.ethz.ch>
> Hi Cei,
> On Nov 19, 2008, at 3:51 AM, Cei Abreu-Goodger wrote:
> Hello all,
> Related issues have arisen before, where the probe of a particular   
> array platform was annotated to a gene on the opposite strand. But  
> I  was just asked if this even matters, or should it simply be   
> considered a case of bad probe design.
> Does the protocol for different manufacturers' arrays always
> produce amplified product of both strands for the transcript to be
> measured? I could imagine that protocols that amplify based on
> poly-A tails would tend to produce an anti-sense biased amplification
> product (older Affy arrays?), whereas those based on random priming
> could produce products of both strands (and so the actual strand
> that is on the array becomes meaningless).
> Does someone know what is the case in particular for Illumina   
> Beadarrays?
> I've never worked on the bench-side of a microarray experiment, but   
> for gene expression arrays I was under the impression that most   
> protocols:
> (i) extract the RNA from cell lysate using their poly-A tails as
> targets
> (ii) reverse transcribe to cDNA and amplify the cDNA w/ random  
> primers.
> (iii) hybridize amplified cDNA to the array
> If that's the case, I don't think that the strand of the probe  
> should  be an issue.
> I'd be interested, of course, to hear other people's thoughts on  
> this,  too (while this info should be easily available from the   
> manufacturer's site, or the Methods section of many papers, let's  
> see  if the lazy-web can help :-).
> -steve
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Medical College of Cornell University
> http://cbio.mskcc.org/~lianos <http://cbio.mskcc.org/%7Elianos>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
