[BioC] Generating random gene lists: does sample/resample generate random sets

Thu Sep 11 17:40:19 CEST 2008

On Thu, Sep 11, 2008 at 10:59 AM, Ochsner, Scott A <sochsner at bcm.tmc.edu> wrote:
>  Thomas, Sean,
>
> I see your reasoning and will leave the curated list of genes in the
> universe prior to random list generation.  However, what about the
> following scenario?  Let's say the genes are left in, 1000 random gene
> lists are generated, and each of these is tested for its classification
> performance.  Suppose I repeat this whole process, say, 1000 times and
> find that on average 70 random gene lists out of 1000 perform as well as
> myCuratedList in sample classification.  That is not a ringing
> endorsement for myCuratedList.  Yet there is the possibility that the
> random gene lists which perform equally well have significant overlap
> with myCuratedList and are essentially the same as myCuratedList.  How
> do I adjust the probability to account for this?
>

Hi, Scott.  That is exactly the test that you want to do and, yes, it
could have just the outcome that you suggest, in which case you do
not have terribly strong evidence for your gene list being better than
random.  But stacking the deck (or depleting the deck, as the case may
be) is a definite no-no.  The results obtained by leaving your gene
list out are not interpretable, so don't bother comparing the two
different methods, one leaving your gene list in and one taking it
out; the latter is simply not valid.

Sean
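
A minimal sketch of the test discussed above, run on simulated data. The
universe size, the list size, the simulated gene_signal vector, and the
score_list() function are all placeholders invented for illustration; a
real analysis would replace score_list() with the classification
performance actually being measured. The random lists are drawn from the
full universe, curated genes included, and the overlap with the curated
list is computed only so it can be inspected.

```r
set.seed(1)

n_genes   <- 500   # hypothetical universe size
list_size <- 50    # hypothetical list size
n_random  <- 1000  # number of random lists, as in the thread

# Hypothetical per-gene signal; stands in for real expression data.
gene_signal <- rnorm(n_genes)
curated <- order(gene_signal, decreasing = TRUE)[seq_len(list_size)]

# Placeholder score (higher is better); swap in real classifier performance.
score_list <- function(idx) mean(gene_signal[idx])

# Draw from the FULL universe; the curated genes are NOT removed.
universe <- seq_len(n_genes)
random_lists <- replicate(n_random, sample(universe, list_size))

random_scores <- apply(random_lists, 2, score_list)
curated_score <- score_list(curated)

# Empirical p-value; the +1 terms keep it from being exactly zero.
p_emp <- (sum(random_scores >= curated_score) + 1) / (n_random + 1)

# Number of genes each random list shares with the curated list.
overlap <- apply(random_lists, 2, function(idx) length(intersect(idx, curated)))
summary(overlap)
```

If the universe is the 22277 rows implied by the session output quoted
below, a random list of 2000 genes would share about 2000 * 2000 / 22277,
or roughly 180, genes with the curated list on average, so lists that are
"essentially the same" as myCuratedList should be very rare.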

> Scott
> -----Original Message-----
> From: Thomas Hampton [mailto:Thomas.H.Hampton at Dartmouth.EDU]
> Sent: Thursday, September 11, 2008 7:36 AM
> To: Ochsner, Scott A
> Subject: Re: [BioC] Generating random gene lists: does sample/resample generate random sets
>
> I kind of get your reasoning, but not quite.
>
> Does this analogy make sense?
>
> Suppose I shuffle a deck and draw a full house off the top, specifically, a pair of 2s and three 10s.
>
> If I want to know how often a full house would happen by chance, I would not set these five cards aside to do the simulation, because the resulting deck of 47 cards does not give the same chance of a full house as a real deck of 52.
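
The card analogy can be checked by exact counting rather than simulation.
The sketch below is an illustration added here, not part of the original
message; count_full_houses() is a hypothetical helper. Worked through, the
depleted deck comes out somewhat more likely to give a full house than the
full deck, so the bias from setting cards aside is real, whichever
direction one might have guessed.

```r
# Count five-card full houses given how many cards of each rank remain.
# A full house takes 3 cards from one rank and 2 from a different rank.
count_full_houses <- function(rank_counts) {
  total <- 0
  for (i in seq_along(rank_counts)) {
    for (j in seq_along(rank_counts)) {
      if (i != j) {
        total <- total + choose(rank_counts[i], 3) * choose(rank_counts[j], 2)
      }
    }
  }
  total
}

full_deck <- rep(4, 13)           # 13 ranks, 4 cards each
depleted  <- c(2, 1, rep(4, 11))  # after removing two 2s and three 10s

p_full     <- count_full_houses(full_deck) / choose(52, 5)  # 3744 / 2598960, about 0.00144
p_depleted <- count_full_houses(depleted)  / choose(47, 5)  # 2684 / 1533939, about 0.00175
```
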
>
> In a more extreme version of your test, suppose your curated list constituted a larger percentage of the entire gene list. Suppose you removed 75% and ran your simulations on the remaining 25%.  We could not offer that 25% as representative of the whole, could we?
>
> Cheers
>
> T
> On Sep 11, 2008, at 12:28 AM, Ochsner, Scott A wrote:
>
>> Thomas,
>>
>> I wanted to assess the performance of random gene lists which do not
>> have any overlap with myCuratedList hence the step to remove them from
>> the universe of possible genes prior to random gene selection.  If I
>> leave the curated genes in, random lists could potentially be produced
>> with significant similarity to myCuratedList.  I'm interested in the
>> chance occurrence of unique gene lists with similar classification
>> performance as myCuratedList.  I certainly have an open mind on this
>> point if others can come up with good reasons why this may be a bad idea.
>>
>> Scott
>>
>> ________________________________
>>
>> From: Thomas Hampton [mailto:Thomas.H.Hampton at Dartmouth.EDU]
>> Sent: Wed 9/10/2008 3:40 PM
>> To: Ochsner, Scott A
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Generating random gene lists: does sample/resample generate random sets
>>
>>
>>
>> I would not have taken the curated list out. That strikes me as a
>> significant bias. Am I missing something?
>>
>> Tom
>>
>> On Sep 10, 2008, at 4:03 PM, Ochsner, Scott A wrote:
>>
>>> Dear BioC,
>>>
>>> I would like feedback as to the appropriateness of the following
>>> procedure to produce a set of 1000 random gene lists, each list of
>>> length 2000.  The idea is to use the set of random gene lists to
>>> assess how often random gene lists of size x can reproduce or improve
>>> the classification performance of myCuratedList.
>>>
>>>
>>> #remove myCuratedList from the universe of possible genes.  The
>>> "eset" object is your standard ExpressionSet object.
>>>> length(myCuratedList)
>>>  [1] 2000
>>>> Index<-setdiff(1:length(rownames(exprs(eset))),myCuratedList)
>>>> length(Index)
>>>  [1] 20277
>>> #generate 1000 random gene lists using the genes in Index.  The code
>>> for resample is taken from the help pages for sample.
>>>
>>>> randomMatrix<-replicate(1000,resample(Index,2000))
>>>> dim(randomMatrix)
>>>  [1] 2000 1000
>>>
>>>
>>> I've verified that each column does not contain repeated genes as
>>> should be the case with resample without replacement.
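
For readers who want to reproduce the procedure, here is a small
self-contained sketch on toy sizes. The resample() helper is the one shown
in the Examples of the current ?sample help page, to the best of my
knowledge; the version available in 2008 may have differed. The universe,
n_lists, and list_len values are made up for illustration, and the
universe here is every row index, with nothing set aside.

```r
# Helper as shown in the Examples section of ?sample (recent R versions).
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(42)
universe <- seq_len(300)  # hypothetical row indices of an expression matrix
n_lists  <- 20            # hypothetical number of random lists
list_len <- 25            # hypothetical list length

# Each column is one random gene list, drawn without replacement.
random_matrix <- replicate(n_lists, resample(universe, list_len))

dim(random_matrix)

# TRUE when no column contains a repeated gene.
no_repeats <- all(apply(random_matrix, 2, anyDuplicated) == 0)
no_repeats
```

Plain sample(universe, list_len) gives the same kind of draw here; the
resample() wrapper only matters when the vector being sampled can have
length one, where sample() would instead sample from 1:x.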
>>>
>>> Is there a standard procedure for doing the above or is what I've
>>> done kosher?
>>>
>>>
>>> Scott A. Ochsner, Ph.D.
>>> NURSA Bioinformatics
>>> Molecular and Cellular Biology
>>> Baylor College of Medicine
>>> Houston, TX. 77030
>>> phone: 713-798-6227
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>