[R] Re: duplicated() and unique() problems

Petr PIKAL petr.pikal at precheza.cz
Tue Jun 8 12:58:08 CEST 2010


Hi

r-help-bounces at r-project.org wrote on 08.06.2010 08:44:39:

> Hi everybody
> 
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
> 
> Consider the following
> 
> # this works as expected
> 
> ex=sample(1:20, replace=TRUE)
> 
> ex
> 
> duplicated(ex)
> 
> ex=sort(ex)

This is OK, because sort() returns your data in sorted order.


> 
> ex
> 
> duplicated(ex)
> 
> 
> # but why does duplicated() not work after order()?
> 
> ex=sample(1:20, replace=TRUE)
> 
> ex
> 
> duplicated(ex)
> 
> ex=order(ex)

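This is not OK, because order() gives you the positions of the sorted elements, not your data. Those positions are all distinct, so duplicated() is FALSE for every one of them.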

> ex=sample(letters[1:5],20, replace=TRUE)
> ex
 [1] "b" "b" "b" "e" "d" "c" "e" "a" "a" "d" "d" "d" "a" "e" "b" "c" "e" "d" "a"
[20] "a"
> ex<-order(ex)
> ex
 [1]  8  9 13 19 20  1  2  3 15  6 16  5 10 11 12 18  4  7 14 17
>

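ex=ex[order(ex)]

should give you the same result as sort(ex). Ties do not matter here, because tied values are identical. The one difference is missing values: sort() drops NA by default, while order() places them last.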
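
The difference between the two functions can be checked directly. The following is only a minimal sketch: the seed is arbitrary, and the names idx, by_order and by_sort are made up for illustration. It compares the positions returned by order() with the values returned by sort().

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
ex <- sample(letters[1:5], 20, replace = TRUE)

idx      <- order(ex)   # positions (indices) into ex, not the values
by_order <- ex[idx]     # indexing with those positions gives the sorted values
by_sort  <- sort(ex)    # sort() returns the sorted values directly

identical(by_order, by_sort)   # TRUE: both routes give the same vector
any(duplicated(idx))           # FALSE: a permutation of 1:20 has no repeats
any(duplicated(by_sort))       # TRUE: 20 draws from 5 letters must repeat
```

So duplicated() does work after order(). It is just being asked about the index vector, in which every position occurs exactly once.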

> 
> duplicated(ex)
> 
> Why does duplicated() not work after order() has been applied but it works
> fine after sort()? Is this an error or is there something I don't
> understand.
> 
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real life problem
> 
> > str(Masechaba$PROPDESC)
>  Factor w/ 24545 levels "     06","   71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator if the PROPDESC is unique. Default false
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
> 
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.

No.

ex=sample(letters[1:5],10, replace=TRUE)
ex
 [1] "b" "d" "d" "b" "a" "c" "b" "c" "d" "d"
unique(ex)
[1] "b" "d" "a" "c"
duplicated(ex)
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

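The two functions give you different answers because you ask them different questions. unique() returns the distinct values, so its result is usually shorter than the input. duplicated() returns one logical value per element, which is TRUE for every repeat of a value that has already appeared.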
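
How the two results relate can be sketched with the same ten-element vector as above. The names u and d are made up for illustration.

```r
ex <- c("b", "d", "d", "b", "a", "c", "b", "c", "d", "d")

u <- unique(ex)       # distinct values, in order of first appearance
d <- duplicated(ex)   # one logical per element: TRUE if the value was seen earlier

length(u)                          # 4
length(d)                          # 10
identical(u, ex[!d])               # TRUE: unique(x) is x with the repeats dropped
sum(d) + length(u) == length(ex)   # TRUE: repeats plus distinct values cover the input
```

Because the two results have different lengths, the elements of unique(ex) cannot be matched position by position against the elements of ex.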

> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
This looks strange. At first sight I am puzzled about what result I should expect from such a construction.
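
One plausible reading of what this construction does: unique() returns one element per distinct value, so (assuming there are no NA values) is.na(...)==FALSE is TRUE throughout and which() simply yields 1, 2, ..., k, where k is the number of distinct values. The assignment then marks the first k rows, whatever they contain. That would explain why the totals agree with duplicated() while the individual rows do not. A minimal sketch follows, in which the data frame df, the column id and the flag names are hypothetical stand-ins for the real data:

```r
# Hypothetical stand-in for the real data; 'df' and 'id' are made-up names.
df <- data.frame(id = c("x", "y", "x", "z", "y", "x"))

# The construction from the post: the index runs over unique values, not rows.
k_idx <- which(is.na(unique(df$id)) == FALSE)
k_idx                                 # 1 2 3, simply the first three positions

df$unique_flag <- FALSE
df$unique_flag[k_idx] <- TRUE         # marks rows 1 to 3, whatever they contain

# Row-wise alternatives that do line up with the data:
df$first_seen <- !duplicated(df$id)   # first occurrence of each value
df$repeated   <- duplicated(df$id)    # later occurrences only
df$in_dup_set <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)
```

The last column flags every row whose value occurs more than once, including the first occurrence, which may be what was wanted for the test on a known duplicate.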

Regards
Petr

> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
> 
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
> 
> I don't understand this. Is there something I am missing?
> 
> Best regards
> Christaan
> 
> 
> P.S
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40
> [5] Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
> 
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1    tools_2.11.1
> 


