[R] Text mining? Text manipulation? Both? Predicting KRAS test results in cancer patients

Fri Sep 28 21:57:35 CEST 2012

Happy Friday Everyone,

Hope Friday afternoon doesn't turn out to be a terrible time to post a question. I've been doing a little data mining of patient text medical records as of late. I started out trying to predict whether or not cancer patients had received KRAS mutation testing and did quite well with that. Now I'm trying to predict the results of KRAS testing (mutated vs. wild type). This is proving to be a little more difficult.

With the first classification task, I created counts of terms (e.g., "kras", "mutated") in the text medical records using the tm package and then used those counts to predict whether or not patients had had KRAS mutation testing. I tried a few different analyses here, but found that random forests worked best.

Predicting the results of testing is harder, though, because of the way physicians and other healthcare professionals write about testing. For example, I'm finding phrases like "KRAS mutation returned wild-type". In this example, if we're counting, we get one instance each of "kras", "mutation", and "wild". The counts suggest a mutation just as strongly as a wild-type result, even though the sentence clearly reports wild type, so you can see how it might be difficult to accurately predict the results of testing based on counts alone.
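To make the problem concrete, here is a toy illustration in base R. It is not my actual tm pipeline; the helper count_term and the stem "mutat" (standing in for "mutation"/"mutated") are made up just for this example.

```r
# Toy illustration only: base R, not the tm-based pipeline described above.
note  <- "KRAS mutation returned wild-type"
terms <- c("kras", "mutat", "wild")  # "mutat" is an assumed stem for mutation/mutated

# Hypothetical helper: count literal occurrences of a term, ignoring case.
count_term <- function(term, text) {
  hits <- gregexpr(term, tolower(text), fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match
}

counts <- vapply(terms, count_term, integer(1), text = note)
print(counts)
```

Each of the three terms gets a count of 1, so a classifier looking only at these counts has nothing to tell it that the result was wild type.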

My question is how best to deal with this. Are there any R text mining packages or related software that would be particularly suited to my problem? I took a look at the CRAN Task View: Natural Language Processing, and there were so many options that I didn't really know where to start (and it's not even clear that an R-based solution will work best for my problem). Alternatively, is there any real chance one could simply write code to identify true references to the results of KRAS testing and then count only those likely true references?
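For what it's worth, this is roughly the kind of hand-written rule I have in mind for that second option. It is only a sketch in base R: the function classify_kras and the cue phrases are invented for illustration, have not been validated against real records, and would certainly miss many of the ways results actually get written up.

```r
# Hypothetical rule-based sketch: the cue lists below are illustrative guesses.
classify_kras <- function(sentence) {
  s <- tolower(sentence)
  # Only sentences that mention KRAS are considered at all.
  if (!grepl("kras", s, fixed = TRUE)) return(NA_character_)
  wild_cues <- "wild[- ]?type|no mutation|not mutated|negative for"
  mut_cues  <- "mutation (was )?(detected|identified|present)|mutated|mutant|positive for"
  # Wild-type and negation cues are checked first, so a bare "mutation"
  # in the same sentence does not override them.
  if (grepl(wild_cues, s)) return("wild type")
  if (grepl(mut_cues, s))  return("mutated")
  "unclear"
}

record    <- "Pt seen today. KRAS mutation returned wild-type. Follow up in two weeks."
# Naive sentence split on whitespace that follows ., ! or ?
sentences <- strsplit(record, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
results   <- vapply(sentences, classify_kras, character(1), USE.NAMES = FALSE)
print(results)
```

The idea would be to count only the sentences labelled "wild type" or "mutated" per patient and feed those counts to the classifier instead of the raw term counts.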

I'd greatly appreciate it if someone could point me in the right direction.

Thanks,

Paul