[BioC] some questions about RNAi hit selection

Fri Jul 3 11:14:12 CEST 2009

Hi Rajarshi,

    I spent a long time thinking about this problem when I did some 
screening. My problem was slightly different because I had 2 siRNAs for 
each gene and 2 replicates for each siRNA, but that was still not enough 
to do traditional stats. The first thing I suggest is that you analyse 
the data with the Bioconductor package cellHTS2, if you are not already 
doing so. After performing several rounds of low-throughput confirmation 
experiments I came to the following conclusions:

1) Without more data you cannot really do better than a threshold for 
selecting hit siRNAs. The only significance you can put on an siRNA in 
this situation is the rank in the hit list.
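
As a rough illustration of threshold-plus-rank selection, here is a 
minimal Python sketch. It assumes each siRNA has one normalised score 
per plate and that a higher score means a stronger phenotype. The 
robust z-score (median/MAD), the cutoff of 2, the function names and 
the example values are all my own assumptions for illustration, not 
anything cellHTS2 prescribes.

```python
from statistics import median

def robust_z_scores(values):
    """Return (x - median) / (1.4826 * MAD) for each value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the data are roughly normal.
    scale = 1.4826 * mad
    if scale == 0:
        raise ValueError("MAD is zero; cannot scale scores")
    return [(v - med) / scale for v in values]

def select_hits(scores_by_sirna, threshold=2.0):
    """Rank siRNAs by robust z-score (highest first) and flag
    those at or above the threshold."""
    names = list(scores_by_sirna)
    z = robust_z_scores([scores_by_sirna[n] for n in names])
    ranked = sorted(zip(names, z), key=lambda pair: pair[1], reverse=True)
    return [(rank, name, score, score >= threshold)
            for rank, (name, score) in enumerate(ranked, start=1)]

# Hypothetical normalised readouts for one plate (higher = stronger).
plate = {"siA": 1.02, "siB": 0.97, "siC": 1.61,
         "siD": 1.05, "siE": 0.99, "siF": 0.93}
hits = select_hits(plate)
```

Each entry of `hits` is (rank, siRNA, robust z-score, passed threshold); 
the rank is the only "significance" you really get out of it.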

I have been thinking about some sort of FDR measure that considers the 
position of the siRNA relative to the distributions of both the 
positive and negative controls. But I've never really taken it 
anywhere.
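
To make the idea a little more concrete, here is one possible reading 
of it as a Python sketch. This is not a worked-out or validated method. 
It assumes higher scores are stronger, uses the empirical tail 
fractions of the two control sets, and needs a guessed prior fraction 
of true hits (`prior_hit`), which is entirely hypothetical. With only a 
handful of control wells the estimate will be very coarse.

```python
def tail_fraction(controls, score):
    """Fraction of control wells scoring at or above `score`."""
    return sum(1 for c in controls if c >= score) / len(controls)

def control_based_fdr(score, neg_controls, pos_controls, prior_hit=0.01):
    """Crude FDR-like estimate for calling everything >= score a hit.

    prior_hit is an assumed fraction of siRNAs with a real effect.
    Returns None when no control of either kind is this extreme.
    """
    fpr = tail_fraction(neg_controls, score)  # negatives passing
    tpr = tail_fraction(pos_controls, score)  # positives passing
    false_mass = (1 - prior_hit) * fpr
    true_mass = prior_hit * tpr
    if false_mass + true_mass == 0:
        return None
    return false_mass / (false_mass + true_mass)

# Hypothetical control scores.
neg = [0.5, -0.1, -0.3, 0.2, 0.0, -0.4, 1.0, -0.2, 0.1, -0.6]
pos = [4.0, 3.5, 5.0, 2.5, 3.0]

strict = control_based_fdr(2.0, neg, pos)
lenient = control_based_fdr(0.5, neg, pos)
```

Here the lenient cutoff lets through 2 of the 10 negative controls, so 
with a 1% prior almost all calls at that cutoff would be expected to be 
false.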

2) The fact that an siRNA is a hit doesn't mean that the gene is. When I 
looked at the correlation between the two siRNAs targeting the same 
gene, I saw that it was pretty much zero, while there was a substantial 
correlation between replicates. The reasons for this are probably 
twofold. Firstly, different siRNAs have different efficiencies in 
knocking down the gene. Secondly, different siRNAs have different 
off-target effects. If you are screening thousands of siRNAs, then those 
that have off-target effects relevant to your screen will score highly. 
If there are many of these (which there are likely to be when you are 
screening 20,000 genes x 4 siRNAs), off-target effects are likely to 
dominate the top end of your list.
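
The comparison I mean can be sketched as follows in Python. The data 
layout (gene -> siRNA -> two replicate scores), the siRNA labels and 
the numbers are made up for illustration; they are not from my screen.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: gene -> siRNA -> (replicate 1, replicate 2).
screen = {
    "GENE1": {"si1": (-2.1, -1.9), "si2": (0.2, 0.1)},
    "GENE2": {"si1": (0.3, 0.4), "si2": (-1.5, -1.7)},
    "GENE3": {"si1": (-0.2, 0.0), "si2": (0.1, -0.1)},
    "GENE4": {"si1": (1.0, 1.2), "si2": (0.9, 0.8)},
}

# Replicate agreement: pair up the two replicates of every siRNA.
rep1 = [r1 for sirnas in screen.values() for r1, _ in sirnas.values()]
rep2 = [r2 for sirnas in screen.values() for _, r2 in sirnas.values()]
replicate_r = pearson(rep1, rep2)

# siRNA agreement: pair up the replicate means of the two siRNAs
# that target the same gene.
first = [sum(s["si1"]) / 2 for s in screen.values()]
second = [sum(s["si2"]) / 2 for s in screen.values()]
sibling_r = pearson(first, second)
```

If `replicate_r` is high while `sibling_r` is near zero, the assay is 
reproducible but the individual siRNAs are telling you little about the 
gene.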

You could score genes based on the minimum/mean score for the 4 siRNAs. 
When I did this (using the minimum of the 2 siRNAs that I had) I found 
that I had to set my threshold so low that none of my putative hits 
confirmed. If you do find some that do, you could be finding cases where 
both siRNAs are having off-target effects (because of the massive 
multiple testing). This might seem unlikely, but I have seen it happen.
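
A minimal Python sketch of gene-level scoring, assuming higher scores 
are stronger, so that taking the minimum asks the weakest siRNA for the 
gene to pass the threshold. The function names, threshold and scores 
are hypothetical.

```python
from statistics import mean

def gene_scores(sirna_scores, summary=min):
    """Collapse per-siRNA scores to one score per gene."""
    return {gene: summary(scores) for gene, scores in sirna_scores.items()}

def gene_hits(sirna_scores, threshold, summary=min):
    """Genes whose summary score is at or above the threshold,
    strongest first."""
    scored = gene_scores(sirna_scores, summary)
    passing = [g for g, s in scored.items() if s >= threshold]
    return sorted(passing, key=lambda g: scored[g], reverse=True)

# Hypothetical per-siRNA scores for each gene.
scores = {
    "GENE1": [3.1, 0.2],   # one strong siRNA, one inactive
    "GENE2": [2.4, 2.2],   # both siRNAs agree
    "GENE3": [0.1, -0.3],  # no effect
    "GENE4": [4.0, 1.6],   # strong and moderate
}

by_min = gene_hits(scores, threshold=2.0)                 # weakest must pass
by_mean = gene_hits(scores, threshold=2.0, summary=mean)  # average must pass
```

The minimum is the stricter summary: it keeps only genes where every 
siRNA passes, whereas the mean can be carried by a single strong (and 
possibly off-target) siRNA.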

My conclusion from this is that, as you say, hit selection is just the 
first step. You could use other information to winnow the initial 
selection of hits, but I don't really think that there is any substitute 
for experimental confirmation of hits using independent siRNAs. 
Winnowing based on GO/pathway analysis might help you select which hits 
you want to confirm.

Hope all this waffle helps in some way,

Ian
---

Rajarshi Guha wrote:
> Hi, I have recently started working with RNAi screening data and have been
> getting up to speed on the literature. I have a few questions, which are not
> directly related to Bioconductor (or R) but I figured that members of the
> list would probably be able to help out. If there are more appropriate
> places to post such questions I'd appreciate pointers.
>
> My main question is about hit selection. I'm working with assays in which
> each gene is targeted by 4 different siRNAs and the plates have no
> replicates. My understanding is that in this situation, one cannot really
> use statistical tests to select siRNAs. Instead, one employs threshold
> approaches (mean, MAD, quartile etc). Is this correct? In such a
> thresholding approach, is there any way one can provide some sort of
> significance/score to a selection of hits?
>
> Would it be correct to say that hit selection is simply a first step and one
> should use other information (GO enrichment, pathway analysis) to further
> winnow an initial selection of hits?
>
> I am also working on a sensitization screen, where I am trying to identify
> genes that are differentially knocked down. This problem seems analogous to
> microarray studies and in that vein, I have been considering the 4 signals
> (i.e., 4 siRNAs) for each gene, in the two conditions and used a t-test to
> determine whether there is a difference in the means.
>
> What I'm a little confused about is to what extent I need to perform
> multiple test corrections on the p-values - does the 'multiple' refer to the
> number of conditions in which the assay is run (drug and no drug) or the
> number of genes being considered?
>
> Thanks,
>
>   
