[R] take data from a file to another according to their correlation coefficient

Rui Barradas ruipbarradas at sapo.pt
Mon Apr 23 17:48:04 CEST 2012


Hello,



jeff6868 wrote
> 
> Hi Sarah,
> 
> Thank you for your answer.
> Yes I know that my proposition is not necessary the better way to do it.
> But my problem concerns only big gaps of course (more than half a day of
> missing data, till several months of missing data).
> I've already filled small gaps with the interpolation that you were
> talking in your message (with the function na.approx of the package zoo).
> For the study, it's not important to have perfectly  identical values
> between the 2 correlated stations, because I'll calculate after the
> reconstruction the daily mean of each station. For my boss, it's enough to
> work on daily means. But before that, I need to rebuild the big missing
> data gaps of my stations (by the way I explained in the first message of
> my topic).
> Do you have any idea of the way to do it on R according to my first post?
> I forgot to precise that my examples are completely fakes! I chose these
> numbers in order for you to understand what I want to do (I chose easy and
> readable numbers). I tested on excel with 2 stations, it was not too bad
> when I filled the gaps (between the data of the 2 well correlated
> stations).
> 

I remember this data set from some time ago. (Weeks?)

First of all, please use ?dput to post your data; it makes it much easier
for everyone to just copy and paste it into an R session. The output you
should post looks like this:

> dput(s1)
structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15", 
"01/01/2008 00:30", "01/01/2008 00:45"), data = c(1L, 2L, NA, 
4L)), .Names = c("time", "data"), row.names = c(NA, -4L), class =
"data.frame")
> dput(s2)
structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15", 
"01/01/2008 00:30", "01/01/2008 00:45"), data = 8:11), .Names = c("time", 
"data"), row.names = c(NA, -4L), class = "data.frame")
> dput(s3)
structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15", 
"01/01/2008 00:30", "01/01/2008 00:45"), data = c(123L, NA, NA, 
NA)), .Names = c("time", "data"), row.names = c(NA, -4L), class =
"data.frame")
> dput(m)
structure(c(1, 0.9, 0.8, 0.9, 1, 0.7, 0.8, 0.7, 1), .Dim = c(3L, 
3L), .Dimnames = list(c("Station1", "Station2", "Station3"), 
    c("Station1", "Station2", "Station3")))


I've named your data.frames 's1', 's2' and made up an 's3'; 'm' is the
correlation matrix.
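
In case it helps, here is a minimal sketch of why dput output is convenient: the printed text is itself R code, so whoever copies it gets the same object back. The name 's1.copy' is just for illustration, and capture.output/parse stand in for the copy and paste step.

```r
# Rebuild 's1' from the post, then round-trip it through dput().
s1 <- data.frame(
  time = c("01/01/2008 00:00", "01/01/2008 00:15",
           "01/01/2008 00:30", "01/01/2008 00:45"),
  data = c(1L, 2L, NA, 4L),
  stringsAsFactors = FALSE
)

txt <- capture.output(dput(s1))      # the text you would paste in a message
s1.copy <- eval(parse(text = txt))   # what a reader does after copying it

all.equal(s1, s1.copy)               # compares the original with the copy
```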

Now the problem.
Sarah's comment seems sensible: just filling in missing values from some
other dataset isn't very canonical, but here it goes.
The code assumes the data frames are in a list.

lst <- list(s1, s2, s3)
names(lst) <- paste("Station", seq.int(length(lst)), sep="")
lst



# station - list number or name, not the data.frame
# mat - correlation matrix
get.max.cor <- function(station, mat){
	mat[row(mat) == col(mat)] <- -Inf
	which( mat[station, ] == max(mat[station, ]) )
}

# x - data.frame to be transformed
# y - data.frame with greater correlation
na.fill <- function(x, y){
	i <- is.na(x$data)
	x$data[i] <- y$data[i]
	x
}

mx.cor <- get.max.cor(1, m)
mx.cor
na.fill(lst[[1]], lst[[mx.cor]])
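
One caveat: 'na.fill' fills by row position, so it assumes both data frames have exactly the same time stamps in the same order. If that might not hold, a possible variant (only a sketch, with a made-up name 'na.fill.by.time' and made-up toy data 'x' and 'y') looks the time stamps up with match() instead. The different name also avoids masking the na.fill function that, as far as I know, the zoo package exports.

```r
# Hypothetical variant of 'na.fill': fill NAs in x$data with the value that
# y has at the same time stamp, instead of at the same row number.
na.fill.by.time <- function(x, y){
	i <- is.na(x$data)
	j <- match(x$time[i], y$time)   # NA where y has no such time stamp
	x$data[i] <- y$data[j]
	x
}

# Toy data whose rows are NOT aligned
x <- data.frame(time = c("01/01/2008 00:00", "01/01/2008 00:15",
                         "01/01/2008 00:30"),
                data = c(1L, NA, NA), stringsAsFactors = FALSE)
y <- data.frame(time = c("01/01/2008 00:15", "01/01/2008 00:30",
                         "01/01/2008 00:45"),
                data = c(20L, 30L, 40L), stringsAsFactors = FALSE)

filled <- na.fill.by.time(x, y)
filled
```

Time stamps missing from 'y' simply leave the NA in place.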

As the comments before the function say, the first function can also be
called with the station name:

get.max.cor("Station1", m)
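
Note that which(... == max(...)) returns more than one index when two stations tie for the highest correlation, and lst[[ ]] does not take a vector of several indices the way one might hope. A possible alternative, just a sketch with the made-up name 'get.max.cor1' and a made-up tied matrix 'm.tie', uses which.max, which always returns the first maximum only.

```r
# Correlation matrix from the post
m <- structure(c(1, 0.9, 0.8, 0.9, 1, 0.7, 0.8, 0.7, 1), .Dim = c(3L, 3L),
    .Dimnames = list(c("Station1", "Station2", "Station3"),
        c("Station1", "Station2", "Station3")))

# Hypothetical variant: always exactly one index, even with ties
get.max.cor1 <- function(station, mat){
	diag(mat) <- -Inf           # a station must not pick itself
	which.max(mat[station, ])   # first maximum in that row
}

get.max.cor1(1, m)
get.max.cor1("Station3", m)

# Made-up matrix where Station1 correlates equally with the other two
m.tie <- m
m.tie["Station1", ] <- c(1, 0.9, 0.9)
get.max.cor1("Station1", m.tie)
```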

The two functions above solve the problem; all that's left to do is to
automate their calls.
Note that two passes through 'na.fill' may be needed if the data.frame with
the highest correlation also has NAs. This is the case for Station1 filling
in values for Station3. Try commenting out the second pass in the function
below.


process.all <- function(df.list, mat){
	f <- function(station)
		na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
	#
	n <- length(df.list)
	nms <- names(df.list)
	# First the max on each row (use the 'mat' argument, not the global 'm')
	max.cor <- sapply(seq.int(n), get.max.cor, mat)
	# Note the two passes
	df.list <- lapply(seq.int(n), f)
	df.list <- lapply(seq.int(n), f)
	# Makes nicer output
	names(df.list) <- nms
	df.list
}

process.all(lst, m)
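
Two passes are enough for this toy data, but a longer chain of stations with NAs could need more. Here is a sketch of one way around that, assuming it is acceptable to repeat the fill until the number of NAs stops dropping. The function name 'fill.until.stable' and the 'max.iter' argument are made up for illustration, and the code is self-contained so it can be pasted on its own.

```r
# Toy data from the post
time <- c("01/01/2008 00:00", "01/01/2008 00:15",
          "01/01/2008 00:30", "01/01/2008 00:45")
lst <- list(
	Station1 = data.frame(time = time, data = c(1L, 2L, NA, 4L)),
	Station2 = data.frame(time = time, data = 8:11),
	Station3 = data.frame(time = time, data = c(123L, NA, NA, NA))
)
m <- matrix(c(1, 0.9, 0.8, 0.9, 1, 0.7, 0.8, 0.7, 1), 3, 3,
	dimnames = list(names(lst), names(lst)))

# Hypothetical helper: repeat the fill until no more NAs get filled
fill.until.stable <- function(df.list, mat, max.iter = 10L){
	diag(mat) <- -Inf
	best <- apply(mat, 1, which.max)   # best correlated station, per row
	count.na <- function(l) sum(sapply(l, function(d) sum(is.na(d$data))))
	for(k in seq_len(max.iter)){
		before <- count.na(df.list)
		# 'df.list[] <-' keeps the station names on the list
		df.list[] <- lapply(seq_along(df.list), function(s){
			x <- df.list[[ s ]]
			y <- df.list[[ best[s] ]]
			i <- is.na(x$data)
			x$data[i] <- y$data[i]
			x
		})
		if(count.na(df.list) == before) break   # nothing changed, stop
	}
	df.list
}

res <- fill.until.stable(lst, m)
res
```

If a gap is missing in every station of a chain, the NAs stay and the loop simply stops once a pass changes nothing.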



Hope this helps,

Rui Barradas




