[R] Complex text parsing task

Joshua Wiley jwiley.psych at gmail.com
Mon May 21 18:01:23 CEST 2012


Hi Paul,

I do not think that Nick's comment was really meant to be directed at
you.  He is probably just tired of getting so many emails from R-help.

Nick, if you no longer want to receive these emails, follow the link at
the bottom of every email you have received from R-help; you can
unsubscribe yourself from there.  If you like R-help but just do not
like the quantity of emails, you could switch your subscription to a
daily digest so you just get one email.  Alternatively, you could
create a dedicated folder in your email client for R-help messages and
set up a filter that automatically sends all messages from R-help to
that folder, so you still have them all but they do not clutter up
your inbox.

Cheers,

Josh

On Mon, May 21, 2012 at 8:53 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
> Hi Nick,
>
> Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?
>
> Thanks,
>
> Paul
>
> --- On Mon, 5/21/12, Nick Gayeski <nick at wildfishconservancy.org> wrote:
>
>> From: Nick Gayeski <nick at wildfishconservancy.org>
>> Subject: RE: [R] Complex text parsing task
>> To: "'Paul Miller'" <pjmiller_57 at yahoo.com>, r-help at r-project.org
>> Received: Monday, May 21, 2012, 10:36 AM
>> Please stop sending these emails!
>>
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Miller
>> Sent: Monday, May 21, 2012 8:32 AM
>> To: r-help at r-project.org
>> Subject: [R] Complex text parsing task
>>
>> Hello Everyone,
>>
>> I have what I think is a complex text parsing task. I've provided some
>> sample data below. There's a relatively simple version of the coding that
>> needs to be done and a more complex version. If someone could help me out
>> with either version, I'd greatly appreciate it.
>>
>> Here are my sample data.
>>
>> haveData <-
>> structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L,
>> 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ",
>> "001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "
>> ), class = "factor"), encounter_date = structure(c(9L, 10L,
>> 11L, 12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c(
>> " 2009-03-01 ", " 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ",
>> " 2010-10-15 ", " 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ",
>> " 2011-10-10 ", " 2011-10-24 ", " 2012-09-15 ", " 2012-10-05 ",
>> " 2012-10-17 "
>> ), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L,
>> 10L, 7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(
>> " ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ",
>> " ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
>> " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
>> " ... KRAS (Wild). ...",
>> " ... KRAS results are in. Patient has the mutation. ... ",
>> " ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
>> " ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
>> " ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
>> " ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
>> " ... Ordered KRAS testing. Waiting for results. ...",
>> " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
>> " ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
>> " ... Still need to order KRAS mutation testing. ... ",
>> " ... Tumor is negative for KRAS mutation. ...",
>> " ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
>> " ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
>> ), class = "factor")), .Names = c("profile_key", "encounter_date", "raw"),
>> row.names = c(NA, -16L), class = "data.frame")
>>
>> The following code displays the results of so-called "simple" coding.
>>
>> #### Simple coding ####
>>
>> KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005",
>>                  "001-006", "001-007")
>> KRAStested <- c(2,3,2,2,2,3,3)
>> KRASwild <- c(1,0,2,0,3,1,3)
>> KRASmutant <- c(4,2,2,3,1,2,2)
>> simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
>> simpleData
>>
>> Here, KRAStested is calculated by summing all references to "KRAS" for each
>> patient. Wild is calculated by summing all references to "wild type",
>> "wild", and "negative" that come within 20 words of the closest reference to
>> KRAS. Mutant is calculated by summing all references to "mutant", "mutated",
>> and "positive" that occur within 20 words of the closest reference to KRAS.
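
For what it is worth, the "simple" rule described above could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested solution for the full data: the helper names (tokenize, simple_counts) and the toy notes are made up for the example, words are split on anything that is not a letter, digit, or slash, and "wild type" is counted through the single token "wild". It matches terms exactly, so it has not been checked against the counts posted in simpleData.

```r
# Split a note into lower-case word tokens (keeps "/" so "er/pr" stays whole).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Counts for one note: KRAS mentions, plus wild/mutant terms whose
# closest KRAS mention is at most `window` words away.
simple_counts <- function(text, window = 20L) {
  words <- tokenize(text)
  kras <- which(words == "kras")
  near <- function(keys) {
    hits <- which(words %in% keys)
    if (length(kras) == 0 || length(hits) == 0) return(0L)
    # Distance (in words) from each hit to its closest KRAS mention
    dist <- vapply(hits, function(p) min(abs(p - kras)), integer(1))
    sum(dist <= window)
  }
  c(tested = length(kras),
    wild   = near(c("wild", "negative")),
    mutant = near(c("mutant", "mutated", "positive")))
}

# Hypothetical toy notes, not the haveData from the post
notes <- data.frame(
  profile_key = c("001-001", "001-001", "001-002"),
  raw = c("Ordered KRAS testing. Waiting for results.",
          "Received KRAS results. Tumor is wild type.",
          "KRAS (mutated). Did not prescribe Erbitux."),
  stringsAsFactors = FALSE
)

# One row of counts per note, then summed within each patient
per_note <- t(vapply(notes$raw, simple_counts, integer(3)))
per_patient <- rowsum(per_note, group = notes$profile_key)
per_patient
```

The same two steps (count per note, then rowsum() by profile_key) should carry over to the real data frame once the raw column is converted from factor to character.
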
>>
>>
>> The second kind of coding is what I'm referring to as "complex coding". The
>> following code displays the results of this type of coding.
>>
>> #### Complex coding ####
>>
>> KRAStested <- c(2,1,0,2,2,2,3)
>> KRASwild <- c(1,0,0,0,3,0,3)
>> KRASmutant <- c(0,0,0,3,0,1,0)
>> complexData <- data.frame(KRASpatient, KRAStested,
>> KRASwild, KRASmutant)
>> complexData
>>
>> The results of "complex coding" differ substantially from those obtained
>> under "simple coding" and I think illustrate the potential problems with
>> that approach. With "complex coding", the goal would be to identify and sum
>> only true references to KRAS testing and true references to the result of
>> that testing (either wild type/negative or mutant/positive).
>>
>> True references to KRAS testing would be identified using a set of
>> qualifiers that eliminate the false references. So, for example, one of the
>> patients in my (made up) sample data has the phrase "Will conduct KRAS
>> mutation testing prior to initiation of therapy with Erbitux" in their
>> medical record. In this case, "Will" is a qualifier that indicates this is
>> not a true reference to KRAS testing. For this exercise, other qualifiers
>> related to KRAS testing would include "need", "order" (but not the past
>> tense "ordered"), "wait", "waiting", "await", and "awaiting". To be a
>> qualifier, these terms would need to occur within 12 words of the
>> closest true reference to KRAS.
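
A literal reading of the qualifier rule above might look something like the sketch below. The function names (tokenize, count_true_kras) are hypothetical, and it assumes exact whole-word matching, which is what keeps "ordered" from matching "order". It simply drops any KRAS mention that has a qualifier within 12 words on either side, and it has not been checked against the counts posted in complexData.

```r
# Split a note into lower-case word tokens.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Number of KRAS mentions that have no qualifier within `window` words.
count_true_kras <- function(text,
                            qualifiers = c("will", "need", "order", "wait",
                                           "waiting", "await", "awaiting"),
                            window = 12L) {
  words <- tokenize(text)
  kras <- which(words == "kras")
  qual <- which(words %in% qualifiers)
  if (length(kras) == 0) return(0L)
  if (length(qual) == 0) return(length(kras))
  # TRUE for each KRAS mention that sits close to at least one qualifier
  qualified <- vapply(kras, function(k) any(abs(k - qual) <= window),
                      logical(1))
  sum(!qualified)
}

count_true_kras("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")
count_true_kras("Ordered KRAS testing on 10/10/2010. Results not yet available.")
```

Whether a qualifier should cancel a mention in a neighbouring sentence is a judgment call that this sketch does not try to settle.
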
>>
>> True references to the results of testing would also be identified using a
>> set of qualifiers that eliminate false references. Here the list of
>> qualifiers would include "if", "lynch", "kras mutation test", "kras mutation
>> testing" and "for kras mutation". Qualifiers would need to come within 12
>> words of a true reference to KRAS testing.
>>
>> There's an additional wrinkle for identifying true references to the results
>> of testing. One also needs to take into account the presence of what I'm
>> calling "nullifiers". For purposes of this exercise, nullifiers include "Ua
>> Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" come closer to
>> one of these words than to a true reference to KRAS, then they should not be
>> used to identify the results of KRAS testing.
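
The nullifier rule above could be sketched along these lines. Again this is a rough illustration with made-up names (tokenize, kept_result_terms). It assumes every "KRAS" token is a true reference, stands in for "Ua Protein" with the single token "protein", and keeps a result term when it is at least as close to KRAS as to any nullifier (how to break ties is my assumption, the post does not say).

```r
# Split a note into lower-case word tokens (keeps "/" so "er/pr" stays whole).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Return the "positive"/"negative" tokens that are NOT closer to a
# nullifier than to a KRAS mention.
kept_result_terms <- function(text,
                              result_terms = c("positive", "negative"),
                              nullifiers = c("protein", "er/pr", "her2/neu")) {
  words <- tokenize(text)
  kras <- which(words == "kras")
  nul  <- which(words %in% nullifiers)
  hits <- which(words %in% result_terms)
  if (length(hits) == 0 || length(kras) == 0) return(character(0))
  keep <- vapply(hits, function(p) {
    d_kras <- min(abs(p - kras))
    d_null <- if (length(nul) > 0) min(abs(p - nul)) else Inf
    d_kras <= d_null
  }, logical(1))
  words[hits[keep]]
}

kept_result_terms("Received KRAS results. Tumor is wild type. Ua Protein positive. ER/PR positive. HER2/neu positive.")
kept_result_terms("Patient is KRAS negative.")
```

In the first note every "positive" sits right next to a nullifier, so nothing is kept; in the second, "negative" is kept because no nullifier is present.
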
>>
>> Help with either type of coding would be greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>> ______________________________________________
>> R-help at r-project.org
>> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible
>> code.
>>
>>
>>
>>
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


