[R] Classification methods - which one?

Mon Nov 19 20:53:10 CET 2012

Dear Max, 
first of all, thanks a lot for your suggestion and for the frank words about how these methods fare in real life. I guess that is exactly my problem.
Regarding my analysis: yes, I am being coerced into doing it, because there is no time to start over with something else or with other methods.
So you suggest linear discriminant analysis. Is there a particular package you recommend? I tried nearest shrunken centroids with the package PAMR (http://www-stat.stanford.edu/~tibs/PAM/Rdist/doc/readme.html).
The example works fine, but I guess I have too many rows (in this case genes) for the analysis. My main problem is that I cannot reduce the number of genes, because some of the bosses want to compare the output of the classification methods with a rule-based algorithm that works with all genes on the array (after P/A calls and an alternative CDF). So a reduction of the 17 000 genes is only possible to a limited extent (to around 7000 genes after some pre-processing steps).
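For reference, a minimal sketch of a nearest shrunken centroids run at roughly this scale, using the pamr package from CRAN. Everything in it is an assumption for illustration: the data are simulated (7000 genes, 33 patients, 3 classes), and the class labels "A", "B", "C" and the gene names are hypothetical. It is not the analysis described above, only an outline of the calls involved.

```r
library(pamr)  # nearest shrunken centroids (PAM)

set.seed(1)
n_genes <- 7000   # assumed: roughly the gene count after pre-processing
n_pat   <- 33     # assumed: 33 patients, 11 per disease group

# Simulated expression matrix: rows = genes, columns = patients
x <- matrix(rnorm(n_genes * n_pat), nrow = n_genes)
y <- factor(rep(c("A", "B", "C"), each = 11))  # hypothetical disease labels

# Plant some signal so the classes are separable in the simulation
x[1:50,   y == "A"] <- x[1:50,   y == "A"] + 2
x[51:100, y == "B"] <- x[51:100, y == "B"] + 2

# pamr expects a list with x (genes x samples) and y (class labels)
dat <- list(x = x, y = y,
            geneid    = as.character(seq_len(n_genes)),
            genenames = paste0("gene", seq_len(n_genes)))

fit   <- pamr.train(dat)     # fits centroids over a grid of shrinkage thresholds
cvfit <- pamr.cv(fit, dat)   # cross-validated error for each threshold

# Largest threshold (i.e. fewest genes) among those with minimal CV error
best <- max(cvfit$threshold[cvfit$error == min(cvfit$error)])

# Genes whose shrunken centroids are non-zero at that threshold
top <- pamr.listgenes(fit, dat, threshold = best)
head(top)
```

The gene list at the chosen threshold is the embedded feature selection: genes whose centroids are shrunk to zero for every class drop out of the classifier. Note that the cross-validated error at a threshold picked this way is optimistic, so with only 33 patients it should not be read as an estimate of real-world accuracy.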
I would be more than happy about any tips and suggestions.
Best
Peter

On 19.11.2012 at 16:36, Max Kuhn <mxkuhn at gmail.com> wrote:

> My suggestion is not to do any predictive modeling. Basically, the
> data doesn't support a sensible and reproducible model. Yes, the
> literature is saturated with this type of analysis but almost none of
> the examples have any utility in real life.
> 
> Stick to differential expression analysis, investigate the results
> statistically and biologically then design a prospective experiment
> with a specific set of genes and a more refined measurement system.
> 
> If you are doing this analysis to learn something from the data (as
> opposed to generating accurate predictions), a predictive model is one
> of the worst ways of going about it.
> 
> If you are coerced to do this analysis, stick to linear methods
> (regularized LDA, nearest shrunken centroids, etc.) that are less
> likely to over-fit, and bias yourself towards those that have embedded
> feature selection.
> 
> Max
> 
> 
> On Mon, Nov 19, 2012 at 10:16 AM, Peter Kupfer <peter.kupfer at me.com> wrote:
>> Dear all,
>> I have been looking into classification methods and have no clue whether I picked the right ones.
>> My problem: I have a matrix with 17000 rows and 33 columns (genes and patients). The patients are grouped into 3 diseases.
>> Now I want to classify the patients and, of course, I want to know which rows are more helpful for the classification than others.
>> 
>> I tried SVM and random forest. Do you think these are the right classification methods? Maybe there are some hints you can give me. I am more familiar with the Bioconductor packages. Furthermore, this has not been my field of study so far, but I want to understand it and am willing to work my way into it.
>> It would be great if one of the (more) mathematical people could give me a hint.
>> Thanks and all the best
>> 
>> Peter
>> 
>> 
>> PS: I can upload my underlying data if somebody is interested
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
> -- 
> 
> Max