[R] remove

Mon Feb 13 08:26:53 CET 2017

Val,

Working with R's special missing value indicator (NA) would be useful 
here. You could use the na.strings arg in read.table() to recognise "-" 
as a missing value:

dfr <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE, na.strings = c("NA", "-"))

and then modify the function used by ave() or by() to exclude missing 
values from the count of unique last names. Here's one approach adapting 
code from earlier in this thread:

err <- ave(dfr$last, dfr$first, FUN = function(x) 
length(unique(x[!is.na(x)])))
res <- dfr[err == 1 , ]
res <- res[order(res$first) , ]
res

   first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 <NA>
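
One detail that may be worth knowing: because dfr$last is character, ave() hands back a character vector here, so err == 1 only works through R's implicit coercion. The sketch below is a variant, borrowing the seq_along() idiom from Jeff's earlier message, that keeps the counts numeric. The names err_chr, err_num and n_last are made up for illustration and are not from the thread.

```r
dfr <- read.table(text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE, na.strings = c("NA", "-"))

# Hypothetical helper: number of distinct non-missing values in x
n_last <- function(x) length(unique(x[!is.na(x)]))

# ave() on the character column returns a character vector
err_chr <- ave(dfr$last, dfr$first, FUN = n_last)

# ave() on the row indices returns an integer vector instead
err_num <- ave(seq_along(dfr$first), dfr$first,
               FUN = function(i) n_last(dfr$last[i]))

# Both versions select the same rows
identical(dfr[err_chr == 1, ], dfr[err_num == 1, ])
```

Either form should give the same subset; the numeric one just avoids comparing a string with a number.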

Alternatively, if not using na.strings, change "-" to NA after first 
reading the data in: identify last names recorded as "-" using an index, 
and assign NA to these elements, before proceeding as above.
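
A minimal sketch of that second route, assuming the data are read without na.strings so that "-" arrives as an ordinary string (idx is just an illustrative name):

```r
dfr <- read.table(text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE)

# Positions of the last names recorded as "-"
idx <- which(dfr$last == "-")

# Assign NA to those elements
dfr$last[idx] <- NA

# Then proceed as before: count unique non-missing last names per first name
err <- ave(dfr$last, dfr$first, FUN = function(x)
  length(unique(x[!is.na(x)])))
res <- dfr[err == 1, ]
res <- res[order(res$first), ]
res
```

Using which() rather than the bare logical comparison is a small precaution in case the column already contains some NA values.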

Philip

On 13/02/2017 3:18 PM, Val wrote:
> Hi Jeff and All,
>
> When I examined the excluded data, i.e., first names with
> different last names, I noticed that some last names were not
> recorded.
> For instance, I modified the data as follows:
> DF<- read.table( text=
> 'first  week last
> Alex    1  West
> Bob     1  John
> Cory    1  Jack
> Cory    2     -
> Bob     2  John
> Bob     3  John
> Alex    2  Joseph
> Alex    3  West
> Alex    4  West
> ', header = TRUE, as.is = TRUE )
>
>
> err2<- ave( seq_along( DF$first )
>             , DF[ , "first", drop = FALSE]
>             , FUN = function( n ) {
>                length( unique( DF[ n, "last" ] ) )
>               }
>             )
> result2<- DF[ 1 == err2, ]
> result2
>
>   first week last
> 2   Bob    1 John
> 5   Bob    2 John
> 6   Bob    3 John
>
> However, I want to keep Cory's records. It is assumed that a last name
> that was not recorded is the same as the recorded one.
>
> Final output should be
>
> first week last
>     Bob    1 John
>     Bob    2 John
>     Bob    3 John
>    Cory    1  Jack
>    Cory    2   -
>
> Thank you again!
>
> On Sun, Feb 12, 2017 at 7:28 PM, Val<valkremk at gmail.com>  wrote:
>> Sorry Jeff, I did not finish my email. I accidentally touched the send button.
>> My question was about what happened when I used this one
>> length(unique(result2$first))
>>       vs
>> dim(result2[!duplicated(result2[,c('first')]),]) [1]
>>
>> I did get different results but now I found out the problem.
>>
>> Thank you!
>>
>> On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
>> <jdnewmil at dcn.davis.ca.us>  wrote:
>>> Your question mystifies me, since it looks to me like you already know the answer.
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On February 12, 2017 3:30:49 PM PST, Val<valkremk at gmail.com>  wrote:
>>>> Hi Jeff and all,
>>>> How do I get the number of unique first names in the two data sets?
>>>>
>>>> For the first one,
>>>> result2<- DF[ 1 == err2, ]
>>>> length(unique(result2$first))
>>>>
>>>> On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
>>>> <jdnewmil at dcn.davis.ca.us>  wrote:
>>>>> The "by" function aggregates and returns a result with generally fewer
>>>>> rows than the original data. Since you are looking to index the rows in
>>>>> the original data set, the "ave" function is better suited because it
>>>>> always returns a vector that is just as long as the input vector:
>>>>>
>>>>> # I usually work with character data rather than factors if I plan
>>>>> # to modify the data (e.g. removing rows)
>>>>> DF<- read.table( text=
>>>>> 'first  week last
>>>>> Alex    1  West
>>>>> Bob     1  John
>>>>> Cory    1  Jack
>>>>> Cory    2  Jack
>>>>> Bob     2  John
>>>>> Bob     3  John
>>>>> Alex    2  Joseph
>>>>> Alex    3  West
>>>>> Alex    4  West
>>>>> ', header = TRUE, as.is = TRUE )
>>>>>
>>>>> err<- ave( DF$last
>>>>>            , DF[ , "first", drop = FALSE]
>>>>>            , FUN = function( lst ) {
>>>>>                length( unique( lst ) )
>>>>>              }
>>>>>            )
>>>>> result<- DF[ "1" == err, ]
>>>>> result
>>>>>
>>>>> Notice that the ave function returns a vector of the same type as was
>>>>> given to it, so even though the function returns a numeric the err
>>>>> vector is character.
>>>>>
>>>>> If you wanted to be able to examine more than one other column in
>>>>> determining the keep/reject decision, you could do:
>>>>>
>>>>> err2<- ave( seq_along( DF$first )
>>>>>             , DF[ , "first", drop = FALSE]
>>>>>             , FUN = function( n ) {
>>>>>                length( unique( DF[ n, "last" ] ) )
>>>>>               }
>>>>>             )
>>>>> result2<- DF[ 1 == err2, ]
>>>>> result2
>>>>>
>>>>> and then you would have the option to re-use the "n" index to look at
>>>>> other columns as well.
>>>>>
>>>>> Finally, here is a dplyr solution:
>>>>>
>>>>> library(dplyr)
>>>>> result3<- (   DF
>>>>>             %>% group_by( first ) # like a prep for ave or by
>>>>>             %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>>>>             %>% filter( 1 == err ) # drop the rows with too many last names
>>>>>             %>% select( -err ) # drop the temporary column
>>>>>             %>% as.data.frame # convert back to a plain-jane data frame
>>>>>             )
>>>>> result3
>>>>>
>>>>> which uses a small set of verbs in a pipeline of functions to go from
>>>>> input to result in one pass.
>>>>>
>>>>> If your data set is really big (running out of memory big) then you
>>>>> might want to investigate the data.table or sqlite packages, either of
>>>>> which can be combined with dplyr to get a standardized syntax for
>>>>> managing larger amounts of data. However, most people actually aren't
>>>>> running out of memory so in most cases the extra horsepower isn't
>>>>> actually needed.
>>>>>
>>>>>
>>>>> On Sun, 12 Feb 2017, P Tennant wrote:
>>>>>
>>>>>> Hi Val,
>>>>>>
>>>>>> The by() function could be used here. With the dataframe dfr:
>>>>>>
>>>>>> # split the data by first name and check for more than one last name
>>>>>> # for each first name
>>>>>> res<- by(dfr, dfr['first'], function(x) length(unique(x$last))>  1)
>>>>>> # make the result more easily manipulated
>>>>>> res<- as.table(res)
>>>>>> res
>>>>>> # first
>>>>>> # Alex   Bob  Cory
>>>>>> # TRUE FALSE FALSE
>>>>>>
>>>>>> # then use this result to subset the data
>>>>>> nw.dfr<- dfr[!dfr$first %in% names(res[res]) , ]
>>>>>> # sort if needed
>>>>>> nw.dfr[order(nw.dfr$first) , ]
>>>>>>
>>>>>>   first week last
>>>>>> 2   Bob    1 John
>>>>>> 5   Bob    2 John
>>>>>> 6   Bob    3 John
>>>>>> 3  Cory    1 Jack
>>>>>> 4  Cory    2 Jack
>>>>>>
>>>>>>
>>>>>> Philip
>>>>>>
>>>>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>>>> Hi all,
>>>>>>> I have a big data set and want to remove rows conditionally.
>>>>>>> In my data file each person was recorded for several weeks. Somehow
>>>>>>> during the recording periods, their last name was misreported. For
>>>>>>> each person, the last name should be the same. Otherwise remove them
>>>>>>> from the data. For example, in the following data set, Alex was found
>>>>>>> to have two last names.
>>>>>>>
>>>>>>> Alex   West
>>>>>>> Alex   Joseph
>>>>>>>
>>>>>>> Alex should be removed from the data. If this happens then I want to
>>>>>>> remove all rows with Alex. Here is my data set
>>>>>>>
>>>>>>> df<- read.table(header=TRUE, text='first  week last
>>>>>>> Alex    1  West
>>>>>>> Bob     1  John
>>>>>>> Cory    1  Jack
>>>>>>> Cory    2  Jack
>>>>>>> Bob     2  John
>>>>>>> Bob     3  John
>>>>>>> Alex    2  Joseph
>>>>>>> Alex    3  West
>>>>>>> Alex    4  West ')
>>>>>>>
>>>>>>> Desired output
>>>>>>>
>>>>>>>         first  week last
>>>>>>> 1     Bob     1   John
>>>>>>> 2     Bob     2   John
>>>>>>> 3     Bob     3   John
>>>>>>> 4     Cory     1   Jack
>>>>>>> 5     Cory     2   Jack
>>>>>>>
>>>>>>> Thank you in advance
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>> DCN:<jdnewmil at dcn.davis.ca.us>         Basics: ##.#.       ##.#.  Live Go...
>>>>>                                        Live:   OO#.. Dead: OO#..  Playing
>>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>> ---------------------------------------------------------------------------