[R] Finding the rows with the duplicated index

Tue Nov 23 17:49:09 CET 2010

Thanks. Sorry about the naming (just cobbled the example).
The rows are duplicate in the actual data. I am looking for a way to find
the number of duplicate rows for each unique rowindex combination which is
based on columns 1:6.

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
Sent: 23 November 2010 21:54
To: Santosh Srinivas
Cc: r-help at r-project.org
Subject: Re: [R] Finding the rows with the duplicated index

On Tue, Nov 23, 2010 at 11:04 AM, Santosh Srinivas
<santosh.srinivas at gmail.com> wrote:
> Hello Group,
>
> I have a huge time series dataset with sample below. I am basically trying
> to read it into a zoo object with columns 1:6 to index. Zoo issues a
warning
> that some of the rows have duplicated index.
>
> dput(z)
> structure(list(TrdTimestamp = structure(list(sec = c(19, 19,
> 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L,
> 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L,
> 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
>    year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L,
>    109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
>    4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L,
>    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min",
> "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class =
c("POSIXt",
>
> "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = "NIFTY", class = "factor"), InstTyp =
structure(c(2L,
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX", "OPTIDX"
> ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L,
> 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L,
> 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = c("CE", "FF", "PE"), class = "factor"),
>    Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L,
>    2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5,
>    349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L,
>    50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names = c("TrdTimestamp",
> "Ticker", "InstTyp", "ExpDt", "OptTyp", "Strike", "TrdPrice",
> "TrdQty"), row.names = c(NA, 10L), class = "data.frame")
>
>
>
> Z should ideally have a unique index based on columns 1:6 ... but looks
like
> some are duplicated.
> I want to get the count against each unique set .. i.e. z[,1:6] and the
> number of rows with that combination?
>
> I can put the unique index sets into a different object and do a rowcount
> using apply but was wondering if there is an easier way already?
>

Data frames should not hold POSIXlt objects and its very confusing
naming the data frame z since that makes one think its a zoo object.
Without your code its hard to know what the problem is but perhaps you
were reading the series in with Date index in which case all the rows
would have the same date.  Specify tz = "" to use POSIXct.  Also note
that you cannot have a mixture of character and numeric variables in a
zoo object.  A 2d zoo object is a matrix with an index.  Assuming that
DF is a data frame in long format with multiple stocks identified by
the second column we can split it according to the stock like this and
there are no duplicates, at least in the data above.

library(zoo)
DF <- z
rm(z)
DF2 <- transform(DF, TrdTimestamp = as.POSIXct(TrdTimestamp))
z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "")

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com