[R] take data from a file to another according to their correlation coefficient

Mon Apr 23 13:29:16 CEST 2012

Hi,

Even your example should show why this is a bad way to fill in missing weather data: you end up with a sequence for station 1 of 1, 2, 10, 4 even though that's certainly wrong because Station 2 is reliably 7 units above Station 1. "Correlated" doesn't mean "identical."

There are other, better options. If you're only missing a single value, interpolating between the values you do have for that station is likely better. If you're missing a lot, regressing that station on another, well-correlated station would be a more reasonable way to do what you're proposing here.
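
A minimal sketch of both options in base R, using the toy numbers from the original post. The variable names (station1, station2, interp, regfill) are made up for illustration, and a real series would need far more than three complete points for the regression to mean anything:

```r
# Toy data from the original post: four 15-minute steps, one gap in Station 1
station1 <- c(1, 2, NA, 4)
station2 <- c(8, 9, 10, 11)
idx <- seq_along(station1)
gap <- is.na(station1)

# Option 1: linear interpolation within Station 1 itself
interp <- approx(x = idx[!gap], y = station1[!gap], xout = idx)$y

# Option 2: regress Station 1 on Station 2, then predict only the gaps
d <- data.frame(s1 = station1, s2 = station2)
fit <- lm(s1 ~ s2, data = d)   # rows with NA in s1 are dropped by default
regfill <- station1
regfill[gap] <- predict(fit, newdata = d[gap, , drop = FALSE])

print(interp)
print(regfill)
```

In this toy case both approaches put the missing value at about 3 rather than 10, because the regression picks up the constant offset between the two stations instead of copying Station 2's value.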

But in fact interpolation of weather data is very complicated, and the subject of a lot of research. The most realistic methods use elevation as a covariate. These may well be overkill for your situation, though, unless you are missing whole days of data.

Sarah

On Apr 23, 2012, at 6:42 AM, jeff6868 <geoffrey_klein at etu.u-bourgogne.fr> wrote:

> Hi everyone.
> 
> I have a question about a work on R I have to do for my job.
> I have temperature data coming from 70 weather stations. One data file
> corresponds to one station for one year (so 70 files for one year). Each
> file looks like this (important: each file contains NAs):
> 
> time                          data
> 01/01/2008 00:00 -0.25 
> 01/01/2008 00:15 -0.18 
> 01/01/2008 00:30 -0.25 
> 01/01/2008 00:45 -0.25 
> 
> (one column with date + time every 15mn for the whole year, and one column
> with data). 
> 
> I already did correlation matrices between my weather stations (in order to
> find the nearest). For example:
> 
>                  Station1 Station2 Station3 [...]
> Station1        1              0.9            0.8
> Station2        0.9             1             0.7
> Station3        0.8           0.7             1
> [...]
> 
> Now, I would like to fill the NA data gaps of a station with data from
> another station according to their correlation coefficient.
> Let's take an example for the Station 1: if the most correlated Station with
> Station 1 is Station 2, it has to take data from Station 2 to fill NA gaps
> of Station 1, for the same date and hour of course (or same lines as I'm
> doing correlations for the same year). 
> So for year 2008 (for example), if the correlation is the highest between
> Station 1 and 2 (according to all the Stations), and if the data are:
> 
> time                            data
> 01/01/2008 00:00   1
> 01/01/2008 00:15   2       FOR STATION 1
> 01/01/2008 00:30   *NA* 
> 01/01/2008 00:45   4 
> 
> and 
> 
> time                            data
> 01/01/2008 00:00   8
> 01/01/2008 00:15   9          FOR STATION 2 (same year, same time)
> 01/01/2008 00:30   *10 *
> 01/01/2008 00:45   11
> 
> The Station1 file should become:
> 
> time                            data
> 01/01/2008 00:00   1
> 01/01/2008 00:15   2       STATION 1
> 01/01/2008 00:30   *10 *
> 01/01/2008 00:45   4 
> 
> Hope you've understood what I would like to do :)
> Thanks a lot for your ideas and your replies!
> 
>