[R] Finding words that are within +/- X words of "KRAS" using tm package or other means

Paul Miller pjmiller_57 at yahoo.com
Wed May 16 20:27:08 CEST 2012


Hello All,

This will probably be easy for some but isn't for me. I'm currently working on a text mining exercise. I want to predict whether cancer patients got KRAS testing and, if so, whether the test yielded a result of wild type/negative or mutant/positive. I've begun with a "bag-of-words" approach that looks at the count of specific terms in the medical records and then uses some of those as predictors.

This works great for predicting whether or not patients got tested. It's not so good though when it comes to predicting the outcome of testing. Trouble is that patients can have a reference to KRAS testing and also have a lot of references to, say, "positive" where that term has nothing to do with the result of their KRAS testing. 

So I'd like to be able to count the instances in a patient's medical record where relevant terms like "wild type", "negative", "mutant", or "positive" come either shortly before or shortly after "KRAS". It would be great if there were a way to do this within the tm package, which I've found very helpful for preparing my data thus far.

If not though, I have a data frame that contains patient number in one column and the patient's complete text medical record in another. So some sort of regular expression likely would work just fine. 
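A minimal base-R sketch of the regular-expression route, assuming a data frame shaped like the one described. The names `records`, `patient`, `text`, and `near_pat` are made up for illustration, the term list is only a guess, and this version looks only forward from "KRAS" (here up to 3 intervening words):

```r
# Hypothetical data frame: one row per patient, full record text in `text`.
records <- data.frame(
  patient = 1:3,
  text = c("Tumor is KRAS negative",
           "KRAS (mutated)",
           "Will conduct KRAS testing prior to initiation of therapy. Bilirubin positive."),
  stringsAsFactors = FALSE
)

# "kras", then at most 3 intervening words, then one of the result terms.
# The term list is illustrative, not a vetted vocabulary.
near_pat <- "\\bkras\\b(?:\\W+\\w+){0,3}?\\W+(?:negative|positive|mutat\\w*|mutant|wild)"

# TRUE when a result term follows "KRAS" closely enough.
records$kras_result_nearby <- grepl(near_pat, records$text,
                                    ignore.case = TRUE, perl = TRUE)
```

Changing the `{0,3}` quantifier widens or narrows the window. Note that `\w+` treats a date like xx/xx/xxxx as three words, so word counts here will not always match a count done by eye.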

Here are some examples of the sort of thing I'm looking to count:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the presence of a mutation."

"Tumor is KRAS negative"

"KRAS (mutated)" 

"Tumor is positive for KRAS mutation" 

And here's an example of something I want to ignore.

"Will conduct KRAS testing prior to initiation of therapy. ... (Several lines of material) ... Bilirubin positive."

A couple of things stand out here. The first is that I need to be able to pick up on variations of the relevant terms. So, for example, that means being able to identify that either "mutant" or "mutated" came in close proximity to "KRAS". 
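One way to catch such variations, sketched in base R, is to match on word stems rather than exact words. The stems below are assumptions chosen for illustration and would need checking against real records:

```r
# Illustrative stems only; each pattern matches any word that starts with it.
result_patterns <- c(
  mutant   = "^(mutant|mutat)",  # mutant, mutated, mutation
  wildtype = "^wild",            # wild, wildtype, wild-type
  positive = "^positive",
  negative = "^negative"
)

# A few lower-cased example tokens.
tokens <- c("kras", "mutated", "tumor", "mutation", "positive", "bilirubin")

# Logical matrix: one row per token, one column per result category.
hits <- sapply(result_patterns, function(p) grepl(p, tokens))
rownames(hits) <- tokens
```

Here `hits["mutated", "mutant"]` and `hits["mutation", "mutant"]` are both TRUE, so both spellings land in the same category.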

The other thing is that while increasing the number of words to look forward and backward will identify more valid cases, it will also tend to identify more invalid ones. For example, looking as many as 12 words after KRAS will lead to correct identification of:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the presence of a mutation."

but also incorrect identification of:

"Will conduct KRAS testing prior to initiation of therapy. Note that patient was positive for Lynch mutation."

I'm thinking I will need to keep the window short in order to obtain the best results. It would be nice if I could easily increase or decrease the number of words to look forward and backward though. It would also be good if I could, say, select a relatively small number of words to look backward and a larger number of words to look forward.
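A rough base-R sketch of such an adjustable window, assuming whitespace-separated words with punctuation stripped. The function `count_near` and its arguments are hypothetical names, not part of tm or any other package:

```r
# Count words matching `term_pattern` that fall within `before` words before
# or `after` words after any occurrence of `keyword`.
count_near <- function(text, term_pattern, keyword = "kras",
                       before = 3, after = 6) {
  # Lower-case, split on whitespace, then strip punctuation from each word,
  # so "(mutated)" becomes "mutated" and "xx/xx/xxxx." stays one word.
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- gsub("[^a-z0-9]", "", words)
  words <- words[nzchar(words)]

  key_pos <- which(words == keyword)
  if (length(key_pos) == 0) return(0L)

  # Positions inside any window; unique() keeps a term from being counted
  # twice when two windows overlap.
  in_window <- unique(unlist(lapply(key_pos, function(i) {
    seq(max(1, i - before), min(length(words), i + after))
  })))
  in_window <- setdiff(in_window, key_pos)

  sum(grepl(term_pattern, words[in_window]))
}

valid   <- "Received KRAS testing results on xx/xx/xxxx. Test results indicate the presence of a mutation."
invalid <- "Will conduct KRAS testing prior to initiation of therapy. Note that patient was positive for Lynch mutation."

count_near(valid,   "^mutat",    after = 12)  # 1: "mutation" is the 12th word after KRAS
count_near(invalid, "^positive", after = 12)  # 1: the false hit described above
count_near(invalid, "^positive", after = 6)   # 0: a shorter window avoids it
```

For a whole data frame, something like `vapply(records$text, count_near, integer(1), term_pattern = "^mutat", USE.NAMES = FALSE)` would give one count per patient, where `records$text` is again an assumed column name.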

Having gotten to the end of this description it occurs to me this is actually harder than I thought.

If one of you gurus could help me out, that would be greatly appreciated.

Thanks,

Paul


