[R] Text mining: Narrowing a field of 27,855 predictors using semi-partial correlations or some other means

Wed Apr 18 17:57:02 CEST 2012

Hello Everyone,

I'm trying to learn a little bit about data mining. I'm working on a text mining project that will attempt to predict whether cancer patients got a particular type of genetic testing. A subsequent stage will then aim to predict what the results of that testing were.

I've used the tm package to prepare my data and am planning to use rattle to do the actual data mining. The tm package has proved to be a great help so far. I've managed to perform a variety of transformations of my data. I've also managed to create a document-term matrix that has a row for each of my patients and columns for each of the terms in my patient medical records. 

Because I'm not yet a particularly good R programmer, I've converted my document-term matrix to a data frame and then added information about the genetic testing. 
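For concreteness, the conversion looks roughly like the sketch below. The three toy records and the test statuses are made up for illustration, and the object names (records, corpus, dtm) are placeholders; only Tested, TestStatus and nchar_record match the names used later in this post.

```r
library(tm)

# Hypothetical toy records standing in for the real patient notes.
records <- c("patient referred for genetic testing after biopsy",
             "routine follow up visit with no new findings",
             "family history of cancer discussed testing options")

# Build a corpus and a document-term matrix (one row per patient).
corpus <- VCorpus(VectorSource(records))
dtm    <- DocumentTermMatrix(corpus)

# Convert to an ordinary data frame with one column per term.
Tested <- as.data.frame(as.matrix(dtm))

# Add the record length and the (made-up) genetic testing outcome.
Tested$nchar_record <- nchar(records)
Tested$TestStatus   <- c("Yes", "No", "Yes")
```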

So here's the thing. The tm package has a feature that would allow me to drop words that occur infrequently in patient medical records. However, I've been asked not to use it because it's believed that even infrequently occurring terms may be highly diagnostic. The consequence is that my data frame has a large number of columns for the various words. In fact, over 27,000 of them.

So my question is how to reduce this to a more manageable number. One thought has been to look at semi-partial correlations between tested (y/n) and each predictor, controlling for the length of the medical record. The idea would be to carry only the predictors with significant semi-partial correlations forward into the actual data mining.

Is this likely to be a good approach? Or is there likely to be a better way of doing it?

If it is a good approach, I’m wondering how to go about obtaining the necessary results. I’ve managed to figure out how to compute semi-partial correlations using the spcor.test() function in the ppcor package, as in:

> spcor.test(as.numeric(Tested$TestStatus=="Yes"), Tested$predictor, Tested$nchar_record)

   estimate      p.value statistic   n gp  Method
1 0.3853547 2.307562e-08  5.587203 182  1 pearson

This is fine for a single pair of variables. What I’d need though is to combine a whole series of such outputs, one for each of my predictors. After that, I’d need to be able to determine which semi-partial correlations were significant (or perhaps substantial) and to create a list that I could use to eliminate a lot of the predictors from my data frame. I’m just beginning to use R in my day-to-day work. So it’s not clear to me how to do this. 
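A rough sketch of the kind of loop this would seem to take is below. It is only an illustration under stated assumptions: the toy data and the term columns (termA, termB, termC) are made up, the 0.05 cutoff is arbitrary, and the Benjamini-Hochberg adjustment via p.adjust() is an added assumption to allow for running thousands of tests, not something I have settled on.

```r
library(ppcor)

# Hypothetical toy data standing in for the real Tested data frame.
set.seed(1)
n <- 100
Tested <- data.frame(
  TestStatus   = sample(c("Yes", "No"), n, replace = TRUE),
  nchar_record = rpois(n, 5000),
  termA        = rpois(n, 2),
  termB        = rpois(n, 1),
  termC        = rpois(n, 3),
  stringsAsFactors = FALSE
)

y <- as.numeric(Tested$TestStatus == "Yes")   # outcome coded 0/1
z <- Tested$nchar_record                      # control variable
predictors <- setdiff(names(Tested), c("TestStatus", "nchar_record"))

# Constant columns have no variance, so a correlation is undefined; skip them.
predictors <- predictors[sapply(Tested[predictors], function(x) sd(x) > 0)]

# One spcor.test() call per predictor, stacked into a single data frame.
res <- do.call(rbind, lapply(predictors, function(p) {
  out <- spcor.test(y, Tested[[p]], z)
  data.frame(term = p, estimate = out$estimate, p.value = out$p.value,
             stringsAsFactors = FALSE)
}))

# Assumed choice: adjust p-values for multiple testing, then apply a cutoff.
res$p.adj <- p.adjust(res$p.value, method = "BH")
keep <- res$term[res$p.adj < 0.05]

# Data frame restricted to the outcome, the control, and the retained terms.
Reduced <- Tested[, c("TestStatus", "nchar_record", keep), drop = FALSE]
```

To filter on size rather than significance, the cutoff line could instead be something like keep <- res$term[abs(res$estimate) > 0.2], where 0.2 is again just a placeholder.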

Thanks,

Paul