[BioC] Classification

David martin vilanew at gmail.com
Wed Jun 29 11:30:09 CEST 2011


Thanks for the great discussion. I'm just wondering whether someone already
has a tutorial or R script that runs this pipeline? I guess it could easily
be adapted.


On 06/24/2011 09:07 PM, Tim Triche, Jr. wrote:
> Kellie Archer at VCU has done some work with weighting ordinal model
> selection in exactly this manner (rpart for recursive partitioning, for
> example), since in a model with categories ranging from, say, "progressive
> disease" to "complete remission", it is a much smaller sin for the model to
> guess "partial remission" for a patient who experiences a complete remission
> than it is to guess "progressive disease" (the opposite end of the scale).
> More recently she has been doing the same for, e.g., lasso fits:
>
> http://cran.r-project.org/web/packages/glmnetcr/index.html
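
A minimal sketch of that weighted-misclassification idea with rpart, assuming
a data frame dataf whose three-level ordered factor Group is the outcome (both
names are placeholders, and the quadratic penalties are an illustrative choice,
not something from this thread): rpart accepts a misclassification-cost matrix
through parms, so errors two categories away can be penalized more heavily than
adjacent ones.

library(rpart)

# Penalty matrix: rows = true class, columns = predicted class.
# Zero on the diagonal, 1 for adjacent errors, 4 for errors two categories away.
K <- nlevels(dataf$Group)
loss <- abs(outer(seq_len(K), seq_len(K), "-"))^2

fit <- rpart(Group ~ ., data = dataf, method = "class",
             parms = list(loss = loss))
printcp(fit)  # inspect the cost-complexity table before pruning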
>
>
>
> On Fri, Jun 24, 2011 at 11:21 AM, Kevin R. Coombes<
> kevin.r.coombes at gmail.com>  wrote:
>
>> (Note that I'm taking this back to the mailing list in case others are
>> interested.)
>>
>> Orthogonal. One strategy is:
>>
>> 1. Randomly separate the data into training and test sets (using whatever
>> proportions you think are appropriate for the size of your dataset).
>> 2. On the training set, use the combination of polr + step to find an
>> optimal model.
>> 3. Repeat this lots of times.
>> 4. Collect data on how often each predictor gets selected in the optimal
>> model (which depends on the exact composition of the training set).
>> 5. Also collect data on how well the trained model fits its test data.
>> (This is tricky with an ordinal outcome; the key question is how to weight
>> the penalties for prediction errors that are off by one ordinal category as
>> opposed to two or more. You might want something like a weighted Cohen's
>> kappa.)
>>
>> Finally, you need to summarize the cross-validation results to decide which
>> predictors are best. With only five possible predictors, one idea is simply
>> to combine the held-out prediction results for each of the 2^5 possible
>> model structures (or whatever subset actually gets selected some number of
>> times). The other possibility is to claim that every time a predictor gets
>> selected in one of the optimal models, it gets credit (or blame) for all of
>> the predictions those models make. Intuitively, I prefer the first of these
>> alternatives.
>>
>>      Kevin
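
A rough sketch of that resampling strategy, assuming a data frame dataf with
the ordered factor Group plus the five gene columns (placeholder names): polr
comes from MASS, step performs the AIC selection, and the quadratic-weighted
kappa is computed by hand on the held-out samples. The 70/30 split and 200
repeats are arbitrary choices for illustration.

library(MASS)  # polr()

# Quadratic-weighted Cohen's kappa for ordinal predictions
wkappa <- function(obs, pred) {
  K <- nlevels(obs)
  O <- table(obs, factor(pred, levels = levels(obs)))    # observed confusion
  E <- outer(rowSums(O), colSums(O)) / sum(O)            # expected under independence
  W <- (outer(seq_len(K), seq_len(K), "-") / (K - 1))^2  # quadratic penalty weights
  1 - sum(W * O) / sum(W * E)
}

set.seed(1)
n_rep  <- 200
chosen <- character(0)
kappas <- numeric(n_rep)

for (b in seq_len(n_rep)) {
  idx   <- sample(nrow(dataf), size = round(0.7 * nrow(dataf)))
  train <- dataf[idx, ]
  test  <- dataf[-idx, ]

  full <- polr(Group ~ ., data = train, Hess = TRUE)  # base model, all genes
  best <- step(full, trace = 0)                       # AIC-based selection

  chosen    <- c(chosen, attr(terms(best), "term.labels"))
  kappas[b] <- wkappa(test$Group, predict(best, newdata = test))
}

sort(table(chosen), decreasing = TRUE)  # how often each gene was retained
summary(kappas)                         # held-out weighted-kappa distribution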
>>
>>
>> On 6/24/2011 12:37 PM, Tim Triche, Jr. wrote:
>>
>> Do you prefer AIC to cross-validation for model selection, or do you feel
>> they're orthogonal?
>>
>> Thanks for the tip about polr. I had a vague recollection of it, but this
>> is the first time I've actually read the man page. I appreciate your taking
>> the time to send it.
>>
>> --t
>>
>>
>> On Fri, Jun 24, 2011 at 10:26 AM, Kevin R. Coombes<
>> kevin.r.coombes at gmail.com>  wrote:
>>
>>> The standard MASS package includes the "polr" function to perform ordinal
>>> regression.  After running polr to fit the base model with all predictors,
>>> you can pass the result through the "step" function to use AIC to select
>>> the best set of predictors.
>>>
>>>     Kevin
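
The basic single fit that this message describes, again with the placeholder
dataf / Group names: fit the proportional-odds model on all five genes, then
let step() prune it by AIC.

library(MASS)

full <- polr(Group ~ ., data = dataf, Hess = TRUE)  # base model with all predictors
best <- step(full)                                  # AIC-driven selection
summary(best)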
>>>
>>>
>>> On 6/24/2011 10:38 AM, Tim Triche, Jr. wrote:
>>>
>>>>   You have an ordinal response, so you might consider an ordered probit
>>>> model
>>>> with interaction terms and a penalized likelihood fit, and determine the
>>>> best penalty by cross-validation.  I don't recall whether CMA supports
>>>> ordered probit models, but it's probably the best approach, and you could
>>>> just brute-force it -- you've only got 120 different models to fit under
>>>> this scheme.  At the very least, CMA would generate the cross-validation
>>>> sets for you.
>>>>
>>>> You might also want to consider recursively fitting a shrunken LDA model
>>>> (diseased/healthy, moderate/severe) and see how that compares to an
>>>> ordinal
>>>> model.  Regardless, cross-validation is the obvious answer to how to pick
>>>> one.
>>>>
>>>> Hope this helps,
>>>> -t
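
For the ordered-probit option, MASS::polr(..., method = "probit") fits the
unpenalized version of that model. The recursive two-stage idea might look
like the sketch below, using plain lda from MASS rather than a shrunken
variant; dataf, Group and newdat are placeholders.

library(MASS)

# Stage 1: healthy vs. any disease
dataf$diseased <- factor(ifelse(dataf$Group == "healthy", "healthy", "diseased"))
fit1 <- lda(diseased ~ . - Group, data = dataf)

# Stage 2: moderate vs. severe, fit only on the diseased samples
sick <- droplevels(subset(dataf, Group != "healthy"))
fit2 <- lda(Group ~ . - diseased, data = sick)

# Recursive prediction for new samples in newdat (gene columns only)
stage1 <- predict(fit1, newdata = newdat)$class
stage2 <- predict(fit2, newdata = newdat)$class
pred   <- ifelse(stage1 == "healthy", "healthy", as.character(stage2))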
>>>>
>>>> On Fri, Jun 24, 2011 at 8:24 AM, David martin<vilanew at gmail.com>   wrote:
>>>>
>>>>> Thanks. It's not binary, since I have three categories and 5 genes. I
>>>>> have tried LDA and stepclass:
>>>>>
>>>>> # stepwise LDA (stepclass from the klaR package)
>>>>> library(klaR)
>>>>> disc <- stepclass(Group ~ ., data = dataf, method = "lda",
>>>>>                   improvement = 0.001)
>>>>>
>>>>> where Group contains my three categories ("healthy", "moderate disease",
>>>>> "severe disease") and dataf the PCR values for my 5 genes.
>>>>>
>>>>> The problem I have is that stepclass generates a different signature each
>>>>> time (because it randomly picks a gene to start with). That is fine for
>>>>> me, but how many times do I need to run stepclass to find the genes that
>>>>> most reliably classify my groups? Do I need to run stepclass in a loop?
>>>>>
>>>>> thanks
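
On the "do I need a loop" question: one sketch is simply to rerun stepclass
many times and tabulate how often each gene is kept. Where the selected
variable names live in the returned object is an assumption here
(sc$model$name); check str(sc) for your version of klaR.

library(klaR)

set.seed(1)
runs   <- 100
picked <- character(0)

for (i in seq_len(runs)) {
  sc <- stepclass(Group ~ ., data = dataf, method = "lda", improvement = 0.001)
  # Assumption: selected variable names are in sc$model$name; verify with str(sc).
  picked <- c(picked, as.character(sc$model$name))
}

sort(table(picked), decreasing = TRUE)  # selection frequency per gene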
>>>>>
>>>>>
>>>>>
>>>>> On 06/24/2011 05:17 PM, Kevin R. Coombes wrote:
>>>>>
>>>>>     .. and probably should ...
>>>>>>
>>>>>> For a binary classification with only a few predictors, you can, for
>>>>>> example, use logistic regression with some standard criterion like AIC,
>>>>>> BIC, or Bayesian model averaging to decide which predictors should be
>>>>>> retained.
>>>>>>
>>>>>> Kevin
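
A minimal sketch of that binary case, with a placeholder data frame dat and a
0/1 outcome status: step() uses AIC by default, and k = log(n) gives a
BIC-style penalty (Bayesian model averaging would need an extra package such
as BMA and is not shown).

full <- glm(status ~ ., family = binomial, data = dat)
aic_model <- step(full, trace = 0)                      # AIC
bic_model <- step(full, trace = 0, k = log(nrow(dat)))  # BIC-style penalty
summary(aic_model)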
>>>>>>
>>>>>> On 6/23/2011 6:10 PM, Moshe Olshansky wrote:
>>>>>>
>>>>>>     If you have just 5 genes and a decent number of samples you can use
>>>>>>> any of
>>>>>>> the "conventional" (i.e. not high throughput) methods like LDA, trees,
>>>>>>> Random Forest, SVM, etc.
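
As one example of those conventional options, a random forest handles the
three-class problem directly and reports per-gene importance; dataf and Group
are the same placeholder names as above.

library(randomForest)

set.seed(1)
rf <- randomForest(Group ~ ., data = dataf, importance = TRUE)
rf              # OOB error rate and confusion matrix
importance(rf)  # per-gene importance (mean decrease in accuracy / Gini)
varImpPlot(rf)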
>>>>>>>
>>>>>>> I will have a look at both packages. It's PCR data, by the way.
>>>>>>>
>>>>>>>>   thanks
>>>>>>>>
>>>>>>>> On 06/23/2011 05:56 PM, Tim Triche, Jr. wrote:
>>>>>>>>
>>>>>>>>> Or CMA, which is perhaps a more systematic approach to
>>>>>>>>> classification (the package name stands for Classification of
>>>>>>>>> MicroArrays). Very well thought out.
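
From memory, the CMA workflow looks roughly like the sketch below; the
function and argument names should be checked against the CMA vignette before
relying on them. X (a samples-by-genes matrix) and y (the class factor) are
placeholders.

library(CMA)

set.seed(1)
ls  <- GenerateLearningsets(y = y, method = "CV", fold = 5, niter = 10,
                            strat = TRUE)
res <- classification(X = X, y = y, learningsets = ls, classifier = ldaCMA)
ev  <- evaluation(res, measure = "misclassification")
show(ev)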
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 23, 2011 at 8:02 AM, Sean Davis<sdavis2 at mail.nih.gov>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Jun 23, 2011 at 10:58 AM, David martin<vilanew at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> I have 5 genes of interest. I would like to know which combination(s)
>>>>>>>>>>> of genes gives the best disease separation. Which test could I use in
>>>>>>>>>>> my training set to see which combination is the best classifier
>>>>>>>>>>> between my diseased and my healthy populations?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any comment or test that could be useful to answer that
>>>>>>>>>>> question.
>>>>>>>>>>
>>>>>>>>>> Check out the MLInterfaces package. It should give you some ideas
>>>>>>>>>> on
>>>>>>>>>> where to start.
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
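
Since the original question asks which combination of the 5 genes separates
best, a brute-force sketch for the binary (disease vs. healthy) case:
enumerate all 31 non-empty subsets, fit a logistic regression for each, and
compare 10-fold cross-validated misclassification. dat, status (coded 0/1)
and the gene column names are placeholders.

genes   <- c("g1", "g2", "g3", "g4", "g5")   # placeholder gene column names
subsets <- unlist(lapply(seq_along(genes),
                         function(k) combn(genes, k, simplify = FALSE)),
                  recursive = FALSE)

set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(dat)))

cv_err <- sapply(subsets, function(vars) {
  f <- reformulate(vars, response = "status")
  err <- sapply(1:10, function(k) {
    fit <- glm(f, family = binomial, data = dat[fold != k, ])
    p   <- predict(fit, newdata = dat[fold == k, ], type = "response")
    mean((p > 0.5) != dat$status[fold == k])   # misclassification in held-out fold
  })
  mean(err)
})

names(cv_err) <- sapply(subsets, paste, collapse = "+")
head(sort(cv_err))   # best-scoring gene combinations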
>>
>>
>> --
>> When you emerge in a few years, you can ask someone what you missed, and
>> you'll find it can be summed up in a few minutes.
>>
>>   Derek Sivers<http://sivers.org/berklee>
>>
>>
>
>



More information about the Bioconductor mailing list