[R] Changing cell value for MANY unique pairings of values in 2 columns

Alicia Ellis alicia.m.ellis at gmail.com
Mon Mar 20 23:51:25 CET 2017


The solution proposed below does not accomplish my goal.  In the column
labeled NEW_VALUE, there are two NAs where the value in NUMERIC_VALUE for
that row should be.

Can anyone else suggest a solution to my problem?

Thanks!

On Sat, Mar 18, 2017 at 1:33 PM, David Winsemius <dwinsemius at comcast.net>
wrote:

>
> > On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsemius at comcast.net>
> wrote:
> >
> >
> >> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.ellis at gmail.com>
> wrote:
> >>
> >> am cleaning some very messy health record lab data.  Several of the rows
> >> in the VALUE column have text entries and they need to be converted to
> >> numeric in the NUMERIC_VALUE column based on the values in VALUE and
> >> DESCRIPTION.  For example:
> >>
> >> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
> 0.30",
> >> "12%", "<0.2", "Unknown"),
> >>                DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
> >>                NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
> >> df
> >>
> >> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION ==
> "A"]=999999999
> >>
> >>
> >> However, I need to do this for ~500 unique pairings of VALUE and
> >> DESCRIPTION entries.  I'm trying to find an easy way to do this without
> >> having to have 500 lines of code for each unique pairing.  Some of the
> >> pairings will be changed to the same value (e.g., 99999999, or
> -999999999)
> >> but many will be unique numeric values.
> >
> > I'm not convinced that you have the necessary scientific background to
> do this job properly. There are different sorts of lab tests: enzymatic
> activity, solute concentrations, and viral titers come to mind immediately
> for which the handling would be materially different. If you have a serum
> sodium level of less than 100 mEq/ml, then that is a value inconsistent
> with human life and the value should be set to NA. If you have a value of
> alkaline phosphatase that is 0 it is most suggestive of specimen
> mishandling. If you have a hepatitis B antigen level of "Positive", then it
> seems a perfectly informative value that should not be changed.
> >
> > Furthermore the attempt to change vlaues that you think are missing to
> 99999999, or -999999999 is simply wrongheaded. Learn to use the missing
> value indicator NA rather thna setting these to a numeric value.
> >
> > You should seek advice within your organization before you charge ahead
> with this strategy.
>
> I've been informed that my concerns are misplaced and that the scientific
> concerns I raised were inflated. The "answer" then might be:
>
> merge( df, rules, by=1:2, all.x=TRUE)  # Or
>
> merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE)
>
>
> Which returns in this example:
>
>
>            VALUE DESCRIPTION NUMERIC_VALUE  NEW_VALUE
> 1           <0.2           E         9e+00       0.10
> 2            <60           A         9e+00      60.00
> 3            12%           D         9e+00         NA
> 4 Less than 0.30           C         9e+00       0.29
> 5       Negative           B         9e+00 -999999.00
> 6       Positive           A         1e+09         NA
> 7        Unknown           E         9e+00  777777.00
>
> I'm leaving it in this form because I remain worried. I have no particular
> concerns about setting a value of "<0.2" to 0.1 but worry about setting
> values of  "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord
> with practice that I have seen in analysis of laboratory values. And I
> remain concerned that using flagrantly false numeric values as indicators
> will create serious errors down the line if this data is ever used by
> someone who has not been involved with its manipulation or has even
> forgotten months later how it was massaged.
>
> --
> David.
> > --
> > David.
> >>
> >>
> >> I've started by creating a new object called rules where a SUBSET of df
> >> rows are included with the new value they should be changed to.
> >>
> >>
> >> rules <- data.frame(VALUE = c("<60",  "Negative", "Less than 0.30",
> "<0.2",
> >> "Unknown"),
> >>                   DESCRIPTION = c("A", "B", "C", "E", "E"),
> >>                   NEW_VALUE=c(60, -999999,0.29,0.1,777777))
> >> rules
> >>
> >>
> >> I tried doing a loop to change the values in df based on the suggested
> >> value in rules:
> >>
> >> for (i in (1 : nrow(rules))) {
> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> >> [i,2]]=rules[i,3]
> >> }
> >> df
> >>
> >> This gives the error:
> >>
> >> Error in Ops.factor(df$VALUE, rules[i, 1]) :
> >> level sets of factors are differentwork and I think because when I write
> >>
> >> If I create rules using the exact same levels as df it works:
> >>
> >> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
> >> 0.30", "12%", "<0.2", "Unknown"),
> >>                   DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
> >>                   NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
> >> rules
> >>
> >> for (i in (1 : nrow(rules))) {
> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> >> [i,2]]=rules[i,3]
> >> }
> >> df
> >>
> >>
> >> Can anyone suggest a way to modify my for loop so that it works for a
> >> subset of rows in df and accomplish what I want?  Or suggest a
> completely
> >> different method that works?
> >>
> >>
> >> Thanks!
> >>
> >>      [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > David Winsemius
> > Alameda, CA, USA
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list