[R] Complex text parsing task

Paul Miller pjmiller_57 at yahoo.com
Mon May 21 17:31:48 CEST 2012


Hello Everyone,

I have what I think is a complex text parsing task. I've provided some sample data below. There's a relatively simple version of the coding that needs to be done and a more complex version. If someone could help me out with either version, I'd greatly appreciate it.

Here are my sample data.

haveData <- 
structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ", 
"001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "
), class = "factor"), encounter_date = structure(c(9L, 10L, 11L, 
12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c(" 2009-03-01 ", 
" 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ", 
" 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", 
" 2011-10-24 ", " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "
), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L, 10L, 
7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(" ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ", 
" ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ", 
" ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ", 
" ... KRAS (Wild). ...", " ... KRAS results are in. Patient has the mutation. ... ", 
" ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...", 
" ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...", 
" ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...", 
" ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...", 
" ... Ordered KRAS testing. Waiting for results. ...", " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...", 
" ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...", 
" ... Still need to order KRAS mutation testing. ... ", " ... Tumor is negative for KRAS mutation. ...", 
" ... Tumor is wild type. Patient is eligible to receive Eribtux. ...", 
" ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
), class = "factor")), .Names = c("profile_key", "encounter_date", 
"raw"), row.names = c(NA, -16L), class = "data.frame")

The following code displays the results of so-called "simple" coding.

#### Simple coding ####

KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", "001-006",  "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
simpleData

Here, KRAStested is calculated by summing all references to "KRAS" for each patient. KRASwild is calculated by summing all references to "wild type", "wild", and "negative" that occur within 20 words of the nearest reference to KRAS. KRASmutant is calculated by summing all references to "mutant", "mutated", and "positive" that occur within 20 words of the nearest reference to KRAS.
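To make the "simple" rule concrete, here is a rough sketch of one way it might be coded in base R. The helper names (tokenize, count_simple), the tokenizing regular expression, and the patient IDs "P1" and "P2" are my own assumptions, not an established method, and the sketch is not expected to reproduce the hand-coded counts above exactly.

```r
# Split a note into lowercase words; "/" is kept so dates and "ER/PR" stay whole.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Count KRAS mentions, plus result terms within `window` words of the nearest KRAS.
count_simple <- function(text, window = 20) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  near_kras <- function(terms) {
    pos <- which(words %in% terms)
    if (length(kras_pos) == 0 || length(pos) == 0) return(0L)
    nearest <- sapply(pos, function(p) min(abs(p - kras_pos)))
    sum(nearest <= window)
  }
  c(tested = length(kras_pos),
    wild   = near_kras(c("wild", "negative")),  # "wild" also covers "wild type"
    mutant = near_kras(c("mutant", "mutated", "positive")))
}

# Toy input: three notes copied from the sample data, with hypothetical patient IDs.
notes <- data.frame(
  patient = c("P1", "P1", "P2"),
  raw = c(" ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
          " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
          " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ..."),
  stringsAsFactors = FALSE
)

# One row of counts per note, then summed within patient.
counts <- t(sapply(notes$raw, count_simple, USE.NAMES = FALSE))
per_patient <- aggregate(as.data.frame(counts),
                         by = list(patient = notes$patient), FUN = sum)
per_patient
```

For the real data, the same two lines at the end could be run with as.character(haveData$raw) and trimws(as.character(haveData$profile_key)) in place of the toy columns.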

The second kind of coding is what I'm referring to as "complex coding".  The following code displays the results of this type of coding.

#### Complex coding ####

KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
complexData

The results of "complex coding" differ substantially from those obtained under "simple coding" and I think illustrate the potential problems with that approach. With "complex coding", the goal would be to identify and sum only true references to KRAS testing and true references to the result of that testing (either wild type/negative or mutant/positive).

True references to KRAS testing would be identified using a set of qualifiers that eliminate the false references. So, for example, one of the patients in my (made up) sample data has the phrase "Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux" in their medical record. In this case, "Will" is a qualifier that indicates this is not a true reference to KRAS testing. For this exercise, other qualifiers related to KRAS testing would include "need", "order" (but not the past tense "ordered"), "wait", "waiting", "await", and "awaiting".
To be a qualifier, these terms would need to occur within 12 words of the closest true reference to KRAS.
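As an illustration of the qualifier idea, the following sketch treats a KRAS mention as "not a true test reference" when any qualifier word falls within 12 words of it. The function names are hypothetical, and the reading of the rule is an assumption: distance is measured in whole words from each KRAS mention, ignoring sentence boundaries. Because words are matched exactly, "ordered" does not trigger the "order" qualifier. Hand coding may well differ from this on individual notes.

```r
# Split a note into lowercase words; "/" is kept so dates stay in one piece.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Qualifier words listed in the post for references to KRAS testing.
test_qualifiers <- c("will", "need", "order", "wait", "waiting", "await", "awaiting")

# Count KRAS mentions that have no qualifier within `window` words.
count_true_tests <- function(text, window = 12) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- which(words %in% test_qualifiers)
  if (length(kras_pos) == 0) return(0L)
  if (length(qual_pos) == 0) return(length(kras_pos))
  qualified <- sapply(kras_pos, function(k) any(abs(k - qual_pos) <= window))
  sum(!qualified)
}

count_true_tests("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")
count_true_tests("Patient is KRAS negative. Started Erbitux on 03/01/2011.")
```

A stricter version might only look at qualifiers in the same sentence as the KRAS mention; that would need the text split on sentence boundaries before tokenizing.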

True references to the results of testing would also be identified using a set of qualifiers that eliminate false references. Here the list of qualifiers would include "if", "lynch", "kras mutation test", "kras mutation testing" and "for kras mutation". Qualifiers would need to come within 12 words of a true reference to KRAS testing.

There's an additional wrinkle for identifying true references to the results of testing. One also needs to take into account the presence of what I'm calling "nullifiers". For purposes of this exercise, nullifiers include "Ua Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" comes closer to one of these words than to a true reference to KRAS, then it should not be used to identify the results of KRAS testing.
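The nullifier rule can be sketched as a nearest-anchor comparison: each "positive" or "negative" is kept only if it is at least as close to a KRAS mention as to any nullifier. The names below are hypothetical, the sketch treats every KRAS mention as a true reference (the qualifier rules are not applied here), and ties are kept, which is my reading of "closer to one of these words than to ... KRAS".

```r
# Tokenize, first gluing the two-word nullifier "Ua Protein" into a single token.
tokenize_results <- function(text) {
  text <- gsub("ua protein", "ua_protein", tolower(text), fixed = TRUE)
  words <- unlist(strsplit(text, "[^a-z0-9/_]+"))
  words[nzchar(words)]
}

nullifiers <- c("ua_protein", "er/pr", "her2/neu")

# Return the "positive"/"negative" words that are not nearer to a nullifier than to KRAS.
kras_results <- function(text) {
  words <- tokenize_results(text)
  kras_pos <- which(words == "kras")
  null_pos <- which(words %in% nullifiers)
  res_pos  <- which(words %in% c("positive", "negative"))
  if (length(kras_pos) == 0 || length(res_pos) == 0) return(character(0))
  # Distance to the nearest anchor; Inf when there are no anchors of that kind.
  nearest <- function(p, anchors) {
    if (length(anchors) == 0) Inf else min(abs(p - anchors))
  }
  keep <- sapply(res_pos, function(p) nearest(p, kras_pos) <= nearest(p, null_pos))
  words[res_pos[keep]]
}

kras_results("Patient is KRAS negative. ER/PR positive. HER2/neu positive.")
```

Exact matching will miss misspellings such as "positve" in the sample notes; approximate matching (for example with base R's agrep) might be worth considering for real records.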

Help with either type of coding would be greatly appreciated.

Thanks,

Paul


