Bert Gunter
gunter.berton at gene.com
Thu May 14 23:31:10 CEST 2009
Thanks, Bill. I also had some concerns about how reliable numeric values
converted to character might be, so I'm glad to have an authoritative
criticism. Of course, I was really just being cute with R's versatility.
But Jim Holtman's solution seems like the best way to go, anyway, does it
not?
-- Bert
The table()-based solution can have problems when there are
very closely spaced floating point numbers in x, as in
x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
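To see why closely spaced values are a problem for any character-based approach, here is a small sketch. The variable names other than x1 are only for illustration, and the exact labels produced by as.character() may differ between R versions, so no output is shown.

```r
# Three distinct doubles that differ from 1 by only a few ulps.
x1 <- c(1, 1 - .Machine$double.eps, 1 + 2 * .Machine$double.eps)[c(1, 2, 3, 2, 3)]

# Number of distinct values when compared as doubles.
n_numeric <- length(unique(x1))

# Number of distinct labels after conversion to character,
# which is what a factor/table-based solution ends up comparing.
n_labels <- length(unique(as.character(x1)))

print(c(numeric = n_numeric, character = n_labels))

# Printing with 17 significant digits shows the values really are different.
print(format(x1, digits = 17))
```

If the character count is smaller than the numeric count, distinct values have been merged into one label and will be reported as duplicates of each other.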
It also relies on table(x) turning x into a factor with the default
levels=as.character(sort(x)) and that default may change.
It omits NA's from the result. (I think it also ought to put the results in
the original order of the data, so one can, e.g., omit or select values
which are duplicated.)
The ave()-based solution fails when there are NA's or NaN's in the data.
x2 <- c(1,2,3,NA,10,6,3)
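A quick way to look at the NA behaviour, as a sketch only: the grouping variable passed to ave() has no level for NA, so the element at the NA position is expected to come back as NA rather than FALSE. The name dup_ave is just for illustration.

```r
x2 <- c(1, 2, 3, NA, 10, 6, 3)

# ave() replaces each element by the size of its group; NA has no group.
dup_ave <- ave(x2, x2, FUN = length) > 1
print(dup_ave)

# Check whether the result contains missing values.
print(anyNA(dup_ave))
```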
The ave()-based solution can be slower than necessary on long datasets,
especially ones with few or no duplicates.
x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
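A rough timing sketch for the long-vector case. The comparison uses duplicated() scanned from both ends, which is a common base-R idiom and not one of the solutions quoted in this message; the seed is set only so the sketch is repeatable. Timings depend on the machine and the R version, so none are shown.

```r
set.seed(1)  # assumption: fixed seed, only for repeatability
x3 <- sample(1e5, replace = FALSE)
x3[17] <- x3[length(x3) - 17]   # introduce exactly one duplicated value

# ave() builds one group per distinct value, here almost 1e5 groups.
t_ave <- system.time(dup_ave <- ave(x3, x3, FUN = length) > 1)

# Comparison: flag an element if it is a duplicate scanning from either end.
t_dup <- system.time(dup_base <- duplicated(x3) | duplicated(x3, fromLast = TRUE))

print(c(ave = t_ave[["elapsed"]], duplicated = t_dup[["elapsed"]]))
print(which(dup_ave))           # positions of the two copies
```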
I think the following function avoids these problems. It never converts
the data to character, but uses match() on the original data to convert
it to a set of unique integers that tabulate can handle.
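f2 <- function(x){
   ix <- match(x, x)                       # index of the first occurrence of each value
   tix <- tabulate(ix, nbins = length(x))  # one bin per element, so trailing duplicates are counted too
   retval <- logical(length(x))
   retval[which(tix != 1)] <- TRUE         # 0 = later copy of a value, >1 = first of several copies
   retval
}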
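The match()/tabulate() idea can be checked against the three test vectors above. The sketch below uses a slightly different formulation under a hypothetical name, dup_all, that indexes the counts by ix instead of filling a logical vector; it is meant as an illustration of the technique, not as a replacement for f2.

```r
# Hypothetical helper: TRUE for every element whose value occurs more than once.
dup_all <- function(x) {
  ix <- match(x, x)                          # first-occurrence index of each element
  counts <- tabulate(ix, nbins = length(x))  # occurrences, stored at the first index
  counts[ix] > 1                             # look the count up for every element
}

x  <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)
x1 <- c(1, 1 - .Machine$double.eps, 1 + 2 * .Machine$double.eps)[c(1, 2, 3, 2, 3)]
x2 <- c(1, 2, 3, NA, 10, 6, 3)

print(rbind(x, dup = dup_all(x)))
print(dup_all(x1))   # only the repeated 2nd and 3rd values are flagged
print(dup_all(x2))   # the single NA is not a duplicate
```

Because match() compares the original doubles, nearby values stay distinct, and NA is matched like any other value, so repeated NAs would be flagged as duplicates of each other.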
>
> ... or, similar in character to Gabor's solution:
>
> tbl <- table(x)
> (tbl[as.character(sort(x))]>1)+0
>
>
> Bert Gunter
> Nonclinical Biostatistics
> 467-7374
>
>
> Noting that:
>
> > ave(x, x, FUN = length) > 1
> [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
>
> try this:
>
> > rbind(x, dup = ave(x, x, FUN = length) > 1)
>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> x      1    2    3    4    4    5    6    7    8     9
> dup    0    0    0    1    1    0    0    0    0     0
>
>
> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> <cjpauw at gmail.com> wrote:
> > Hi everybody.
> > I want to identify not only the duplicate number but also the original
> > number that has been duplicated.
> > Example:
> > x=c(1,2,3,4,4,5,6,7,8,9)
> > y=duplicated(x)
> > rbind(x,y)
> >
> > gives:
> >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x    1    2    3    4    4    5    6    7    8     9
> > y    0    0    0    0    1    0    0    0    0     0
> >
> > i.e. the second 4 [,5] is a duplicate.
> >
> > What I want is both the first and the second 4, i.e. [,4] and [,5], to be TRUE:
> >
> >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x    1    2    3    4    4    5    6    7    8     9
> > y    0    0    0    1    1    0    0    0    0     0
> >
> > I assume it can be done by sorting the vector and then checking whether the
> > next or the previous entry matches, using identical(). I am just unsure how
> > to write such a loop, the logic of which (I think) is as follows:
> >
> > sort x
> > for every value of x, check if the next value is identical and return TRUE
> > (or 1) if it is and FALSE (or 0) if it is not
> > AND
> > check if the previous value is identical and return TRUE (or 1) if it is
> > and FALSE (or 0) if it is not
> >
> > Am I thinking correctly, and can someone help me write such a function?
> >
> > regards
> > Christiaan
