[R] Delete rows with duplicate field...

kMan kchamberln at gmail.com
Wed May 5 06:01:37 CEST 2010


Dear someone,

Jorge's solution is excellent, assuming it is what you had in mind. Please
note that the help page for unique() has duplicated() listed in its "See
Also" section. Thus, when you studied ?unique(), it would have made sense to
read about duplicated() as well. Or perhaps you did look into it, but have
not yet seen how to use logical indexes, and as a result, you did not think
duplicated() was relevant for what you wanted. 

I gather that unique() does not work for you because you want the rest of
the information, not just the levels of a particular column to be listed? 

Keep in mind that you got responses that did not make sense because it was
difficult to make sense out of what you were asking. This is a common theme
on the list, not just you (I get called on it from time to time, too), and
is a normal part of learning the art of asking for R-help. Part of the
solution for not only the "R noob" paradox, but also the
cross-cultural/language ones, is to include clear examples showing what is
needed, in addition to the narrative. For example:
-----------------------------------------
# Data to test
> (df <- data.frame(ID=c("userA", "userB", "userA", "userC"),
+    OS=c("Win","OSX","Win", "Win64"),
+    time=c("12:22","23:22","04:44","12:28")))
# Output of df
     ID    OS  time
1 userA   Win 12:22
2 userB   OSX 23:22
3 userA   Win 04:44
4 userC Win64 12:28

# Desired outcome of manipulation (ID as unique; unique(df) and
# unique(df$ID) do NOT work for this)
1 userA   Win 12:22
2 userB   OSX 23:22
4 userC Win64 12:28
--------------------------------------------
would have helped readers orient to your question. If you tried other
things, showing them would also help readers give more useful replies.

# Cannot use the results of unique() as an index either: ID is a factor,
# so its integer codes (1, 2, 3) end up being used as row numbers
> df[unique(df$ID),]
     ID  OS  time
1 userA Win 12:22
2 userB OSX 23:22
3 userA Win 04:44 # <- not unique, expected "userC"
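Here is a small sketch of why that index misbehaves. It assumes ID is stored as a factor, which I force with stringsAsFactors = TRUE so the result does not depend on your R version; the names u and codes are just mine for illustration.

```r
# Rebuild the example data. stringsAsFactors = TRUE is set explicitly
# (an assumption, to make ID a factor regardless of the R version).
df <- data.frame(ID   = c("userA", "userB", "userA", "userC"),
                 OS   = c("Win", "OSX", "Win", "Win64"),
                 time = c("12:22", "23:22", "04:44", "12:28"),
                 stringsAsFactors = TRUE)

u     <- unique(df$ID)   # factor holding userA, userB, userC
codes <- as.integer(u)   # the integer codes behind the factor
print(codes)

# Indexing by a factor uses those codes as row numbers, not the labels,
# so the row labels that come back are simply the codes.
print(rownames(df[u, ]))
```

So the index picks rows by position and never looks at which ID sits in each row.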

# duplicated() on the entire data.frame shows no duplicates
> duplicated(df)
[1] FALSE FALSE FALSE FALSE
> df[!duplicated(df),] # same as unique(df)

An example like the above orients readers to the specific problem far
better than an approximate description in narrative. It also tends to shape
how readers respond: the effort you put into clear examples, details, and
working code can make writing a longer reply feel like a fairer exchange
for the person responding.

>duplicated(df$ID) # makes sense, so Jorge suggested using this as an index
[1] FALSE FALSE  TRUE FALSE

# using the raw results, only the duplicated row is returned, which is why
# Jorge inverted the values
>df[duplicated(df$ID),]
     ID  OS  time
3 userA Win 04:44

# selects just the not-duplicated rows. This is what you needed, right?
> df[!duplicated(df$ID),]
     ID    OS  time
1 userA   Win 12:22
2 userB   OSX 23:22
4 userC Win64 12:28 # note that the original row labels are intact
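If it helps, here is the same idea as a self-contained sketch. It also shows the fromLast argument of duplicated(), in case you would rather keep the last row for each ID instead of the first. That variation is my assumption about what might be useful, not something you asked for, and the names first and last are only illustrative.

```r
# Example data from above.
df <- data.frame(ID   = c("userA", "userB", "userA", "userC"),
                 OS   = c("Win", "OSX", "Win", "Win64"),
                 time = c("12:22", "23:22", "04:44", "12:28"))

# duplicated() flags every repeat of a value already seen,
# so negating it keeps the first row for each ID.
first <- df[!duplicated(df$ID), ]
print(first)

# Scanning from the end instead flags the earlier occurrences,
# so negating it keeps the last row for each ID.
last <- df[!duplicated(df$ID, fromLast = TRUE), ]
print(last)
```

Either way the other columns come along with the rows, which is what unique(df$ID) alone cannot give you.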

# not sure why I'd do this, though it occurs to me that I could, given
# the above
> rownames(df[!duplicated(df$ID),])

By the way, if you keep your indexes, then all you have to store in the R
environment is the original dataset, and your indexes. It helps me stay
organized. 
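A minimal sketch of what I mean by keeping the index, assuming the same example data; keep, kept and dropped are names I made up for this illustration.

```r
# Example data from above.
df <- data.frame(ID   = c("userA", "userB", "userA", "userC"),
                 OS   = c("Win", "OSX", "Win", "Win64"),
                 time = c("12:22", "23:22", "04:44", "12:28"))

# Store the logical index once instead of storing a second data frame.
keep <- !duplicated(df$ID)

kept    <- df[keep, ]    # one row per ID
dropped <- df[!keep, ]   # the rows that were filtered out

# The same index also tells you which row positions survived.
print(which(keep))
```

The original df is never modified, so either subset can be rebuilt from df and keep whenever it is needed.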

Welcome, someone, aka "R noob" (as you put it). Keep after it; you won't be
a "noob" for long.

Sincerely,
KeithC.


