[R] Distance and Aggregate Data - Again...

Fri Feb 27 03:31:43 CET 2004

dsheuman at rogers.com wrote:

> I appreciate the help I've been given so far.  The issue I face is
> that the data I'm working with has 53000 rows, so in calculating
> distance, finding all recids that fall within 2km and summing the
> population, etc. - a) takes too long and b) have no sense of progress.
>
Well, there are ways to speed this up.

> Below is a loop that reads each recid one at a time, calculates the
> distance and identifies the recids that fall within 2 km.  It iterates
> through all records successfully.
>
But you don't need to subset d[, 2:3] every time, e.g.  Also, my
experience is that writing a single line appended to a file every time
around the loop is very inefficient.

> Where I'm stuck is how to get the sum of population and dwellings and
> the mean age for the records that are selected.  Also, the desired
> output should have the following fields:  recid, sum(pop), sum(dwell),
> mean(age).  I don't know how to write only those fields out to the
> file.
>
Well, that part is easy, essentially, you have to save the indices, then
extract the relevant rows of pop, dwell and age. So I would modify your
code as follows:

library(fields)
d <- as.matrix( read.csv("filein.csv") )

lonlat2 <- d[,2:3]
results <- matrix(0, nrow=nrow(d), ncol=4)
for(i in 1:nrow(d)){
        lonlat1 <- d[i, 2:3]
        whichdist <- which(rdist.earth(t(as.matrix(lonlat1)),
                as.matrix(lonlat2), miles=F) < 2)
        distval <- d[, 1][whichdist]
        sumpop <- sum(data[, "pop"][whichdist])
        sumdwell <- sum(data[, "dwell"][whichdist])
        meanage <- mean(data[, "age"][whichdist])
        results[i, ] <- c(d[i, "recid"], sumpop, sumdwell, meanage)
}
write(t(results), file="C:\\outfile.out", ncol=ncol(results))

Then there is a trick you can use to speed up the process.  Essentially
you reduce the 'inner loop' which is inside rdist.earth().  This is
achieved by initially computing a single distance vector with reference
to a fixed point [(0, 0) seems reasonable since it is in the Atlantic
Ocean].  Then you select a subset of your points that are the same
distance is your circle centre from the fixed point plus or minus the
radius you want (and a bit of tolerance).  This will generate an
annulus of points rather than a circle, but in particular all the
points within the circle of interest will also be within this annulus.
Then you apply rdist.earth() to find the distance of each of these
points from your circle centre.

This is what I have used to process a 52000 length matrix in about 40
minutes (and ~50MB memory):

DistAgg <-
function(data, dist=2) {
  results <- matrix(0, nrow=nrow(data), ncol=4)
  colnames(results) <- c("recid", "sumpop", "sumdwell", "meanage")
  lonlat2 <- data[, 2:3]
  basedist <- rdist.earth(t(as.matrix(c(0, 0))), as.matrix(lonlat2))
  for(i in 1:nrow(data)){
    lonlat1 <- data[i, 2:3]
    approxval <- which(abs(basedist - basedist[i]) < dist*1.001)
    if (length(approxval) > 1) {
      whichval <- approxval[which(rdist.earth(t(as.matrix(lonlat1)),
        as.matrix(lonlat2[approxval, ]), miles=F) < dist)]
    } else {
      whichval <- approxval
    }
    sumpop <- sum(data[, "pop"][whichval])
    sumdwell <- sum(data[, "dwell"][whichval])
    meanage <- mean(data[, "age"][whichval])
    results[i, ] <- c(data[i, "recid"], sumpop, sumdwell, meanage)
  }
  write(t(results), file="C:\\outfile.out", ncol=ncol(results))
}

The data I used was based on yours, but randomised latitude and
longitude (in restricted ranges).

Hope this helps,
Ray Brownrigg