[R] R issue with unequal large data frames with multiple columns

arun smartpink111 at yahoo.com
Thu May 2 16:08:39 CEST 2013


Hi, maybe this helps:


dat1<-structure(list(X.DATE = c("01052007", "01072007", "01072007", 
"02182007", "02182007", "02242007", "03252007"), X.TIME = c("0230", 
"0330", "0440", "0440", "0440", "0330", "0230"), VALUE = c(37, 
42, 45, 45, 45, 42, 45), VALUE2 = c(29, 24, 28, 27, 35, 32, 32
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame", row.names = c(NA, 
-7L))
dat2<- structure(list(X.DATE = c("01052007", "01182007", "01242007", 
"02142007", "02182007", "03242007", "03252007"), X.TIME = c("0230", 
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(34, 
41, 42, 44, 45, 21, 42), VALUE2 = c(28, 25, 26, 28, 32, 35, 36
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame", row.names = c(NA, 
-7L))
dat3<- structure(list(X.DATE = c("01052007", "01182007", "01252007", 
"02142007", "02182007", "03222007", "03252007"), X.TIME = c("0230", 
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(32, 
42, 44, 44, 47, 42, 46), VALUE2 = c(24, 29, 32, 34, 38, 39, 42
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame", row.names = c(NA, 
-7L))


library(xts)
#convert each data frame to an xts object indexed by the combined date and time
lst1<-lapply(list(dat1,dat2,dat3),function(x){
  xts(x[,-c(1,2)], order.by=as.POSIXct(paste0(x[,1],x[,2]),format="%m%d%Y%H%M"))
})
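
The format string in the lapply() call above reads the pasted X.DATE and X.TIME as month, day, four-digit year, hour, minute. A minimal check of that parsing in base R follows. The names date_str, time_str, ts and bad are made up for illustration, and tz="UTC" is an assumption added here so the result does not depend on the session's time zone (the call above uses the local time zone).

```r
# Parse one date/time pair with the same format string as above.
# tz = "UTC" is an assumption for reproducibility, not part of the original code.
date_str <- "01052007"   # MMDDYYYY
time_str <- "0230"       # HHMM
ts <- as.POSIXct(paste0(date_str, time_str), format = "%m%d%Y%H%M", tz = "UTC")
print(format(ts, "%Y-%m-%d %H:%M"))

# A string that cannot be parsed (month 99 here) gives NA instead of an error,
# so counting NAs after the conversion is a cheap way to spot malformed rows.
bad <- as.POSIXct(paste0("99052007", "0230"), format = "%m%d%Y%H%M", tz = "UTC")
print(is.na(bad))
```

With 2 million rows per file it is probably worth running such an NA check once per data frame before building the xts objects, since xts() will not accept NA values in order.by.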

#subset by date and time
lapply(lst1,function(x) x['2007-01-05 02:30:00/2007-01-25 04:30:00'])
#[[1]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-01-07 03:30:00    42     24
#2007-01-07 04:40:00    45     28
#
#[[2]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    34     28
#2007-01-18 03:30:00    41     25
#2007-01-24 04:30:00    42     26
#
#[[3]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    32     24
#2007-01-18 03:30:00    42     29
#2007-01-25 04:30:00    44     32

#subset by time
lapply(lst1,function(x) x['T02:30/T03:30'])

#merge all series on the time index, then keep only the timestamps present in every series
res<-na.omit(Reduce(function(...) merge(...),lst1))
res
#                    VALUE VALUE2 VALUE.1 VALUE2.1 VALUE.2 VALUE2.2
#2007-01-05 02:30:00    37     29      34       28      32       24
#2007-02-18 04:40:00    45     27      45       32      47       38
#2007-03-25 02:30:00    45     32      42       36      46       42
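
If the data need to stay as plain data frames, the same alignment can be sketched in base R without xts. This is an alternative under stated assumptions, not what the code above does: df_list, make_key, common and aligned are hypothetical names, the toy data is invented for illustration, and "redundant" is taken to mean a repeated date/time pair, of which only the first row is kept.

```r
# Hypothetical toy inputs standing in for the real data frames.
df_list <- list(
  data.frame(X.DATE = c("01052007", "02182007", "02182007", "03252007"),
             X.TIME = c("0230", "0440", "0440", "0230"),
             VALUE  = c(37, 45, 45, 45), stringsAsFactors = FALSE),
  data.frame(X.DATE = c("01052007", "01182007", "02182007", "03252007"),
             X.TIME = c("0230", "0330", "0440", "0230"),
             VALUE  = c(34, 41, 45, 42), stringsAsFactors = FALSE)
)

# Build one key per row from the date and time columns.
make_key <- function(x) paste(x$X.DATE, x$X.TIME)

# Keys that occur in every data frame.
common <- Reduce(intersect, lapply(df_list, make_key))

# Keep rows whose key is common, and only the first row for each key
# (assumption: later rows with the same date/time are the redundant ones).
aligned <- lapply(df_list, function(x) {
  k <- make_key(x)
  x[k %in% common & !duplicated(k), , drop = FALSE]
})

print(sapply(aligned, nrow))
```

After this step every element of aligned has the same set of date/time pairs, and the original row order within each data frame is preserved.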

#split the merged columns back into one object per original data frame
lst2<-as.list(res)
lst3<- lapply(list(c("VALUE","VALUE2"),c("VALUE.1","VALUE2.1"),c("VALUE.2","VALUE2.2")),function(x) do.call(cbind,lst2[x]))
#or, grouping the columns in consecutive pairs
#(change the 2 in %/%2 to the number of value columns per data frame)
lst3<- lapply(split(names(lst2),((seq_along(names(lst2))-1)%/%2)+1),function(x) do.call(cbind,lst2[x]))

lst3
#$`1`
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-02-18 04:40:00    45     27
#2007-03-25 02:30:00    45     32
#
#$`2`
#                    VALUE.1 VALUE2.1
#2007-01-05 02:30:00      34       28
#2007-02-18 04:40:00      45       32
#2007-03-25 02:30:00      42       36
#
#$`3`
#                    VALUE.2 VALUE2.2
#2007-01-05 02:30:00      32       24
#2007-02-18 04:40:00      47       38
#2007-03-25 02:30:00      46       42
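
The grouping expression in the second lst3 call above relies on integer division to number the columns in consecutive blocks. A small base R illustration of just that step; nms, n_per_frame and grp are names made up here, and n_per_frame would be 6 for the real data if each of the 8-column files contributes 6 value columns (an assumption based on the description in the question).

```r
# Column names as they come out of the merge above.
nms <- c("VALUE", "VALUE2", "VALUE.1", "VALUE2.1", "VALUE.2", "VALUE2.2")

# Number of value columns contributed by each original data frame.
n_per_frame <- 2

# 0-based position, integer-divided by the block size, shifted back to 1-based:
# positions 1,2 fall in group 1, positions 3,4 in group 2, and so on.
grp <- (seq_along(nms) - 1) %/% n_per_frame + 1
print(grp)

# split() then collects the column names belonging to each group.
print(split(nms, grp))
```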
A.K.




----- Original Message -----
From: Adeel Amin <adeel.amin at gmail.com>
To: r-help at r-project.org
Cc: 
Sent: Thursday, May 2, 2013 2:28 AM
Subject: [R] R issue with unequal large data frames with multiple columns

I'm a bit of an amateur R programmer.  I can do simple R scenarios but my
handle on complex grammatical issues isn't steady.

I have 12 CSV files that I've read into dataframes.  Each has 8 columns and
over 2000000 rows.  Each dataframe has data associated by time component
and a date component in the format of:

X.DATE and then X.TIME

X.DATE is in the format MMDDYYYY and X.TIME is in the format HHMM.  The issue
is that even though each dataframe begins and ends with the same X.DATE and
X.TIME values, each data frame has a different number of rows.  One may have
as many as 100000 rows more than another.

I want to do two things:

1) I want to extract a certain portion of data depending on date and time
(easy)

2) In lock step with number 1 I want to eliminate values from the data
frame that are a) redundant or b) do not appear in the other data sets.

When step 2 is done, all the time/date data within all 12 dataframes will
be the same.

Suggestions?  Thanks R Community --




