[R] Updating a data frame based on if condition

arun smartpink111 at yahoo.com
Tue Feb 18 21:17:58 CET 2014





Hi, 
I don't know whether the 'mydata' object was updated before you ran table(). Assigning the result of within() back to 'mydata' gives:



mydata <- within(mydata,FNAME_SUSPECT <- FNAME_TOKEN_COUNT >10|FNAME_LENGTH>45|regexpr("9",FNAME_PATTERN)==0)
table(mydata$FNAME_SUSPECT)
#
#FALSE 
#   50 
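
To illustrate why the assignment matters, here is a minimal sketch on a made-up data frame. The name 'toy' and its values are hypothetical stand-ins for 'mydata', not part of the original data. within() returns a modified copy, so calling it without assigning the result leaves the original object untouched.

```r
# Toy data frame (hypothetical; stands in for 'mydata')
toy <- data.frame(FNAME_TOKEN_COUNT = c(1L, 5L, 12L),
                  FNAME_SUSPECT = FALSE)

# Not assigned: the modified copy is printed and then discarded
within(toy, FNAME_SUSPECT <- FNAME_TOKEN_COUNT > 10)
any(toy$FNAME_SUSPECT)   # 'toy' itself has not changed

# Assigned back: 'toy' now carries the updated column
toy <- within(toy, FNAME_SUSPECT <- FNAME_TOKEN_COUNT > 10)
toy$FNAME_SUSPECT
```

If table() is run after an unassigned within() call, it reports whatever FNAME_SUSPECT held beforehand.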


Now, your second condition (reply to David).
 indx <- with(mydata,FNAME_TOKEN_COUNT >3| FNAME_LENGTH>55|regexpr("9",FNAME_PATTERN)==0)

 indx1 <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE,
            ifelse(mydata$FNAME_LENGTH > 55, TRUE,
              ifelse(regexpr("9", mydata$FNAME_PATTERN) == 0, TRUE, FALSE)))
 identical(indx,indx1)
#[1] TRUE
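
One caveat about the third term, offered as a sketch and not as a correction of the intended rule: regexpr() returns the position of the first match (1 or greater) and -1 when there is no match, so a comparison with 0 is never TRUE. If the intent is "the pattern contains no 9" (an assumption on my part), comparing with -1 or negating grepl() would express it. The vector 'pat' below is a hypothetical sample built from three patterns in the dput() output of the original post.

```r
# Three sample patterns copied from the posted data (the vector name is hypothetical)
pat <- c("9999999", "AAAA9_A", "AA_AAA_AAA_AAAA")

pos <- regexpr("9", pat)   # position of the first "9", or -1 when there is none
as.integer(pos)            # 1 5 -1

any(pos == 0)              # FALSE: regexpr() never returns 0

# Two equivalent ways to flag patterns without a "9"
no_digit_a <- as.integer(pos) == -1
no_digit_b <- !grepl("9", pat, fixed = TRUE)
```

grepl() returns a plain logical vector, which tends to read more clearly inside a combined condition than a position test.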

A.K.


On Tuesday, February 18, 2014 12:57 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:

Hmm, I don't think as constructed the within clause is yielding the desired results. The test case you suggested works. However, if I try another test case:

within(mydata,FNAME_SUSPECT <- FNAME_TOKEN_COUNT >10|FNAME_LENGTH>45|regexpr("9",FNAME_PATTERN)==0)


which I read as: if a row has more than 10 tokens, is longer than 45 characters, OR does not have a number (9), the result (FALSE in this case) should be assigned to FNAME_SUSPECT.

table(mydata$FNAME_SUSPECT)

TRUE 
  50 




On Tue, Feb 18, 2014 at 9:38 AM, arun <smartpink111 at yahoo.com> wrote:


>
>I think it doesn't even need ifelse()
>
>  within(mydata,FNAME_SUSPECT <- FNAME_TOKEN_COUNT >3|FNAME_LENGTH>35|regexpr("9",FNAME_PATTERN)>0)
>A.K.
>
>
>
>On , arun <smartpink111 at yahoo.com> wrote:
>Hi,
>Try ?ifelse()
>A.K.
>
>
>
>
>
>
>On Tuesday, February 18, 2014 12:26 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:
>I have a subset of data that I have identified as "suspect" (for example,
>the first name has excessive spaces, is longer than 35 characters or has a
>number).
>
>What I want to do is update the FNAME_SUSPECT field in "mydata" to TRUE if
>any of those conditions are met.
>
>Here's my data:
>> dput(mydata)
>structure(list(PERSON_FIRST_NAME = c("1298530", "JULIA, TAYLOR, CS AND JEFF",
>"88", "4465891170098562", "1124211", "LEWIS & MARY KAY", "KARL R O S",
>"5466181820076010", "JULI0 C", "WAYNE   T.", "1124211", "1124211",
>"ROBERT B & VIONA D", "DENNIS and MARY SUE", "BRIAN   JOANNE",
>"1124211", "RONALD and  GAIL", "Mike and Mary Lou", "31763006",
>"7", "11460735", "Paul and Mary Beth", "JIMMY and RUTH MARIE",
>"1124211", "WAYNE & LU ANN", "SCOTT & ANNA MARIE", "1124211",
>"1124211", "952714", "DAVID, RHONDA and NATALIE", "VIRGINIA   S",
>"707069", "4397836190001917", "MARIA DE LA LUZ", "MARIA DE LA LUZ",
>"G & S COMPUTERIZED GRADING", "1124211", "1124211", "1124211",
>"1124211", "MARIA DE LA LUZ", "ED AND JANICE KISHI", "1124211",
>"Garrett A. and Jenny E.", "1124211", "1124211", "Hiram T. and A. Judith",
>"MA DE LA LUZ", "STEVE, Bev, and Caleb", "MR AND MRS EVER"),
>    FNAME_SUSPECT = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
>    FNAME_LENGTH = c(7L, 26L, 2L, 16L, 7L, 16L, 10L, 16L, 7L,
>    10L, 7L, 7L, 18L, 19L, 14L, 7L, 16L, 17L, 8L, 1L, 8L, 18L,
>    20L, 7L, 14L, 18L, 7L, 7L, 6L, 25L, 12L, 6L, 16L, 15L, 15L,
>    26L, 7L, 7L, 7L, 7L, 15L, 19L, 7L, 23L, 7L, 7L, 22L, 12L,
>    21L, 15L), FNAME_PATTERN = c("9999999", "AAAAA,_AAAAAA,_AA_AAA_AAAA",
>    "99", "9999999999999999", "9999999", "AAAAA_&_AAAA_AAA",
>    "AAAA_A_A_A", "9999999999999999", "AAAA9_A", "AAAAA___A.",
>    "9999999", "9999999", "AAAAAA_A_&_AAAAA_A", "AAAAAA_AAA_AAAA_AAA",
>    "AAAAA___AAAAAA", "9999999", "AAAAAA_AAA__AAAA", "AAAA_AAA_AAAA_AAA",
>    "99999999", "9", "99999999", "AAAA_AAA_AAAA_AAAA",
>"AAAAA_AAA_AAAA_AAAAA",
>    "9999999", "AAAAA_&_AA_AAA", "AAAAA_&_AAAA_AAAAA", "9999999",
>    "9999999", "999999", "AAAAA,_AAAAAA_AAA_AAAAAAA", "AAAAAAAA___A",
>    "999999", "9999999999999999", "AAAAA_AA_AA_AAA", "AAAAA_AA_AA_AAA",
>    "A_&_A_AAAAAAAAAAAA_AAAAAAA", "9999999", "9999999", "9999999",
>    "9999999", "AAAAA_AA_AA_AAA", "AA_AAA_AAAAAA_AAAAA", "9999999",
>    "AAAAAAA_A._AAA_AAAAA_A.", "9999999", "9999999",
>"AAAAA_A._AAA_A._AAAAAA",
>    "AA_AA_AA_AAA", "AAAAA,_AAA,_AAA_AAAAA", "AA_AAA_AAA_AAAA"
>    ), FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 1L, 1L, 4L, 4L, 1L,
>    2L, 4L, 1L, 1L, 5L, 4L, 4L, 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L,
>    1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 1L, 1L, 4L, 4L, 5L, 1L, 1L,
>    1L, 1L, 4L, 4L, 1L, 5L, 1L, 1L, 5L, 4L, 4L, 4L)), .Names =
>c("PERSON_FIRST_NAME",
>"FNAME_SUSPECT", "FNAME_LENGTH", "FNAME_PATTERN", "FNAME_TOKEN_COUNT"
>), row.names = c(6717L, 11035L, 11626L, 14965L, 17874L, 24341L,
>25582L, 25834L, 26851L, 30134L, 36385L, 45244L, 46947L, 61449L,
>67564L, 71465L, 73782L, 75278L, 78977L, 79037L, 80577L, 81644L,
>84427L, 86286L, 89963L, 91208L, 94054L, 99518L, 114658L, 128305L,
>129082L, 137492L, 137573L, 138556L, 139489L, 148757L, 153956L,
>155546L, 160533L, 162386L, 162681L, 165220L, 168063L, 173003L,
>175322L, 179935L, 180991L, 181215L, 183787L, 184573L), class = "data.frame")
>
>Note I defaulted all of the FNAME_SUSPECT to FALSE. I plan to change that
>later.
>
>I've tried running this:
>if(mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 | regexpr("9",
>mydata$FNAME_PATTERN) > 0)
>        mydata$FNAME_SUSPECT <- TRUE
>
>however, I get this warning:
>Warning message:
>In if (mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 |  :
>  the condition has length > 1 and only the first element will be used
>
>Would I be better doing this in a for loop? I had once heard that if you're
>doing a for loop in R, you're doing something wrong.
>--
>Jeff
>
>
>


-- 

Jeff



