[R] Inserting missing seq number

Wed Mar 30 23:19:24 CEST 2022

Several perfectly fine "off the shelf" solutions have been offered for
this post, so what follows here should be considered only for
amusement.

This seemed like a simple problem to me, so I wondered if one could
write simple code in plain old base R to solve it. Of course, one can.
What follow are two such approaches.

First, as several have already noted, one can use the merge() trick to
obtain the full 'seq' vector and the corresponding 'count' vector with
NA's for the missing seq values. Here is a slightly more complicated
example than the OP posted that allows for more than one missing 'seq'
value in a row and starts at something other than 1(one could use diff
if NA's only occur one at a time I think).

dat <- data.frame(seq = c(3,4,6,9,10), count = c(4,7,3,5,2))
> dat
  seq count
1   3     4
2   4     7
3   6     3
4   9     5
5  10     2

The merge() trick than gives:
dat <- merge(data.frame(
   seq = seq.int(dat[1,'seq'], tail(dat,1)[1,'seq'])),
   dat, all.x = TRUE)
> dat
  seq count
1   3     4
2   4     7
3   5    NA
4   6     3
5   7    NA
6   8    NA
7   9     5
8  10     2

So focusing on the 'count' vector, one needs a 'fill in' function to
appropriately fill in the missing values. Here are two. fillin1() does
this in the obvious way, moving sequentially from the beginning to the
end, filling in the previous value whenever NA is encountered:

fillin1 <- function(x){
   for(i in seq_along(x))
      if(is.na(x[i]))x[i] <- x[i-1]
   x
}

fillin2() is a bit trickier, working recursively from end to
beginning. Still, it's only a few lines of code, and might be improved
in some way I didn't think of:

fillin2 <- function(x){
   if(length(x) > 1){
      z <- Recall(head(x, -1))
      if(is.na(tail(x,1))){
         x <-c(z,tail(z,1))
      } else x[-length(x)] <- z
   }
   x
}

It might be interesting to compare the performance of all the
suggestions, but I'm too lazy to do that and will compare it only to
Bill's suggestion of approx(). To make the comparison fairer, I'll
remove unnecessary overhead and put 'seq' and 'count' in the global
environment.

seq <- c(3,4,6, 9, 10)
count <- c(4,7,3,5,2)

I'll also use the post merged 'count' from the above merge()
merged_count <- dat$count

> merged_count
[1]  4  7 NA  3 NA NA  5  2

First, check that the fillinx functions actually work:
> fillin1(merged_count)
[1] 4 7 7 3 3 3 5 2
> fillin2(merged_count)
[1] 4 7 7 3 3 3 5 2

## and of course!
> approx(x=seq, y=count, xout=3:10, method="constant", f=0)
$x
[1]  3  4  5  6  7  8  9 10

$y
[1] 4 7 7 3 3 3 5 2

Timing the execution of each 5000 times:

> system.time(replicate(5000,approx(x=seq, y=count, xout=3:10, method="constant", f=0)))
   user  system elapsed
  0.062   0.001   0.063

> system.time(replicate(5000, fillin1(merged_count)))
   user  system elapsed
  0.008   0.000   0.007

> system.time(replicate(5000, fillin2(merged_count)))
   user  system elapsed
  0.222   0.001   0.223

I was not surprised that the recursive solution was the slowest, but
was a little surprised that approx() was considerably slower than the
iterative fillin1() . Of course, one shouldn't invest too much faith
in this little exercise. Results might change drastically for very
long vectors with different proportions/patterns of missings, for
example. And approx is much more general and the above comparison may
be a cheat of sorts anyway since I've omitted the overhead of the
merge(), which I assumed would be small for reasonably sized examples.
That might be wrong.

Anyway, as I said, for amusement only.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Mar 30, 2022 at 8:41 AM Bill Dunlap <williamwdunlap using gmail.com> wrote:
>
> stats::approx can do the job:
>
> > approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
> $x
> [1] 1 2 3 4 5 6 7
>
> $y
> [1] 4 7 7 3 5 5 2
>
> -Bill
>
> On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj using sbcglobal.net>
> wrote:
>
> > R-help
> >
> > Is there a R function that will insert missing sequence number(s) and then
> > fill a missing observation with the preceding value.
> >
> > For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))
> >
> >   seq count
> > 1    1        4
> > 2    2        7
> > 3    4        3
> > 4    5        5
> > 5    7        2
> >
> > What I need is
> >
> >   seq count
> > 1    1        4
> > 2    2        7
> > 3    3        7
> > 4    4        3
> > 5    5        5
> > 6    6        5
> > 7    7        2
> >
> > Jeff
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.