[R] Calculate NAs from known data: how to?

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Tue Oct 17 18:07:52 CEST 2006


On 17-Oct-06 Torleif Markussen Lunde wrote:
> Hi
> 
> In a dataset I have length and age for cod. The age, however,
> is only given for 40-100% of the fish. What I need to do is
> to fill in the NAs in a correct way, so that age has a value
> for each length. This is to be done for each sample seperately
> (there are 324 samples), meaning the NAs for sample no 1 shall
> be calculated from the known values from sample no 1.
> 
> As for example length 55 cm can be both 4 and 5 years, I guess
> a fish with NA age and length 55 cm should be given a "random"
> age given a probability for example "55 cm = 4 years has a p=75%,
> while 55 cm = 4 years has a p=25%". Those "p-values" should be
> calculated from the real data.
> 
> How can this be done in R, and what is the right way to do it?
> 
> Sample number 1 is given below.
[snip]

A question with many ramifications!

First of all, there are several possible approaches to imputing
missing values. You are wise in recognising that there is
uncertainty in this, in general and also for your data set.
For this, I would normally recommend a Multiple Imputation
approach, since this would proceed by sampling from a posterior
distribution for Age given Length, as estimated by Maximum
Likelihood from your data. The differing results of successive
imputations then exhibit a variability corresponding to the
uincertainty about what value to impute. Furthermore, when
subsequent analyses (such as estimating the paameters of a
growth curve from Age and Length, or estimating population
dynamics from Age distributions in successive years) are
carried out, these can be done for each of the imputations
in the multiple set, and the results (and etimated standard
errors) can be combined to give an overall estimate and the
unceertainty in this--which not only includes the variability
in the complete data, but also the uncertainty due to imputation.

For this approach, I would be inclined to start with Shafer's
"norm" or "mix" packages, available in R. But see below.

However, I have had a look at the data for "Sample 1" which
you included. This throws up several features which should
be taken into account, and which indicate that "blind" use
of an imputation package may not be the best approach.

First, I made a CSV file from your table (3 columns: length,
age, sample). Then:

  D<-read.csv("LengthAge.csv")
  A<-D$age
  L<-D$length

Now: index which lines have "NA" for age, then a histogram of
Length when Age is present:

  ix0 <- is.na(A)
  hist(L[!ix0],breaks=5*(0:20))

Superimpose a histogram of Length when Age is NA:

  hist(L[ix0],add=TRUE,breaks=5*(0:20),col="red")
  hist(L[!ix0],add=TRUE,breaks=5*(0:20))

(the repetition of L[!ix0] is done because the red has overlaid
it for one length range, and the repetition restores it).

This immediately shows that, in general, missing Age is unusual,
except for Length in the range (25:50) in which it is the
majority. An alternative picture of the same scene appears
if you first make the histogram of all lengths:

  X11()  ## to get a second graphics window
  hist(L,breaks=5*(0:20))
  hist(L[ix0],add=TRUE,breaks=5*(0:20),col="red")

Now you can ask why there should be so many NAs for Age
in the Length range (25:30) and, indeed (2nd histogram)
why there are so many specimens anyway in that Length
range compared with their neghbours. Comparing the two
histograms idicates that the excess in that Length range
arises from the fish with NA Ages: in the first histogram
the number of non-NA Ages in )25:30) is very comparable
with the numbers of all fish in other Length ranges.

So my immediate suspicion is that there is something
special about the Length range (25:30) in relation to
whether Age is recorded or not.

The next thing that emerges from the first histogram
is the sharp decline in numbers above Length=55.

I therefore wonder why this also happens.

Is 55cm a magic (e.g. legal) length threshold for cod?
And is 30cm also special in some way? If so, could there
be some pressure on whoever records the data not to take
measurements of Age when the fish is near but under 30cm,
or to record a value of Length just below the threshold
(i.e. incorrectly)? Does this happen in your other data
sets?

As well as these "why" questions, however, there is a
a technical issue arising from the fact that many missing
Ages occur in a very limited range of Length:

  sum(ix0)
  [1] 27  ### total number of Age = NA

  sum((L[ix0]>25)&(L[ix0]<=30))
  [1] 12  ### Age = NA in (25:50)
  > sum((L[!ix0]>25)&(L[!ix0]<=30))
  [1] 11  ### Age is known in (25:30)

This is: whether "Age=NA" is "uninformative" about Age given
recorded Length. More precisely, whether

  Prob[Age=NA | true Age, recorded Length]

  = Prob[Age= NA | recorded Length]

since many imputation techniques depend on this. If it is
not true, then being missing is "informative" about Age
(i.e. the distribution of Age at Length is different for
Age=NA cases than for Age != NA cases), though you would
not be able to ascertain this without a suitable model
for Prob[missing | Age, Length].

The above remarks concern possible traps and pitfalls in
approaching the problem, which need to be considered.

Next, it is useful to look at the relationship between
Length and Age where bothe are known. First plot L vs A:

  plot(c(0,A),c(0,L),ylim=c(0,100))

(I've added (0,0) to the list of (Age,Length) to indicate
the difference between 1st-year growth and subsequent years).

This shows a smooth growth curve, with quite regular scatter
with the possible exception of the 3 highest values of L
(which I'm inclined to treat as "outliers").

Now look at it logarithmically:

  plot(log(A),log(L))

The three outliers still stand out, but for the rest it
seems that a straight line with constant variance is
a good description.

Therefore, onitting these three, and working with
non-NA cases:

  ix1<-(L<70)
  L1<-L[ix1&(!ix0)]
  A1<-A[ix1&(!ix0)]

now:

  LA.lm <- lm(log(L1) ~ log(A1))
  summary(LA.lm)
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  3.05062    0.03139   97.20   <2e-16 ***
  log(A1)      0.55405    0.02454   22.58   <2e-16 ***
  Residual standard error: 0.1255 on 91 degrees of freedom

indicating a growth curve

  L(A) = exp(3.05062)*(A ^ 0.55405)

Now re-do

  plot(c(0,A),c(0,L), ylim=c(0,100))

and now

  AA <-0.1*(0:70)
  lines(AA,exp(3.05062)*(AA ^ 0.55405),col="green")

which looks pretty good (at least for Ages 1:5); maybe
it is visubly starting to level off at Age 6 (ignoring
outliers) as growth curves do. Nevertheless, for the
sake of illustration, I'll stick with the above.

Note that the intercept (3.05..) and coefficent (0.55...)
are estimated with relatively small SE, so I'll adopt
a normally distributed log(L), conditional on log(A),
with mean given by the above fit and SD = 0.1255.

Now, in order to infer what A might be, given L, we
need to get a distribution for A given L; and for
this we need a distribution for A.

So:

  hist(A,breaks=0.5 + (0:7))

Here we see that the two extreme Ages (1 and 6, since
Age=7 is 2 outliers) are a bit less frequent than the
other; but it looks as though it might be feasible
(at least as an approximation) to see if we can work
with a uniform distribution of Age:

  table(A1)
  A1
   1  2  3  4  5  6 
   9 19 16 16 21 12 

  chisq.test(table(A1))
    Chi-squared test for given probabilities
    data:  table(A1) 
    X-squared = 6.2903, df = 5, p-value = 0.279

so the data are quite consistent with a uniform distribution,
i.e. p1 = P[Age=1] = p2 = P[Age=2] = ... = 1/6.

Now

  P[Age = A | Length = L]
  = Const*pA*exp((log(L)-mean.log.LA)^2/2*sd.log.LA^2)

for A = 1,2,3,4,5,6 and Const is such that P[Age=A|Length=L]
sums to 1 over these Ages, and

  mean.log.LA = 3.05062)*(A ^ 0.55405)

  sd.log.LA = 0.1255

and here we are proceeding as though pA is constant (1/6).

Now we are close. For a given Length L, compute the
probabilities that Age = 1,2,3,4,5,6 and then sample from
this diatribution to make a "random" imputation.

Here (simplified because of the uniform prior on pA)
the result is simply proportional to

  exp(-(log(L)-mean.log.LA)^2/2*sd.log.LA^2)

so

  As <- (1:6)
  mean.log.LA <- 3.05062 + 0.55405*log(As)
  sd.log.LA <- 0.1255

  pA <- function(L){
  temp<-exp(-(log(L)-mean.log.LA)^2/(2*sd.log.LA^2))
  temp/sum(temp)
  }

Now (e.g.) for L=30, we get

  round(pA(30),digits=4)
  0.0182 0.8695 0.1087 0.0036 0.0001 0.0000

so that A=2 is likely, A=3 is possible, others unlikely.

Likewise

  round(pA(40),digits=4)
  0.0000 0.0700 0.5299 0.3191 0.0709 0.0101

  round(pA(50),digits=4)
  0.0000 0.0003 0.0540 0.3108 0.3980 0.2370

  round(pA(60),digits=4)
  0.0000 0.0000 0.0016 0.0600 0.3216 0.6167

  round(pA(70),digits=4)
  0.0000 0.0000 0.0001 0.0090 0.1610 0.8299

Thus, in particular, for L=50, at least 3 Ages (5,6,7)
are likely to occur. This shows the desirability of
making multiple imputations to represent this undertainty.

So, for example, you could sample 10 imputed values for
a Length=50 fish as

  sample((1:6),10,replace=TRUE,prob=pA(50))
  5 5 5 6 6 4 6 4 4 3  ### --> 3, 4^3, 5^3, 6^3

Such considerations could be developed into an approach
for your probalem. The problem with many Multiple Imputation
packages (and Shafer's are among them) is that they generally
do not have provision to incorporate a user-defined model
for the relation between variables more sophisticated
than a multivariate normal, or a model where continuous
variable means depend in an unstructured way on categorical
covariates.

Given what was observed in your Sample1, I moght try with
"norm", using log(A) and log(L) as if they werre bivariate
normals (and hence a linear regression of log(L) on log(A)).
This might work (and looked reasonable for Sample 1). But
if you want to incorporate a more complex growth curve
model, then you may have difficulty doing so using a package.

Yu may like to have a look at the packages "gee" and "geepack"
in R to see what has been done, but -- for istance -- I don't
see how you would incorporate a "von Bertalanffy" link function
without doing a lot of work!

Probably, developing you own approach along the lines shown
above will do what you want.


CAVEATS

1. The above approach is derived from observation of the
   particular sample, as are the remarks about problematic
   features of the data. Hence it is purely illustrative
   of a possible way to approach the problem. Whether it
   would be useful for the rest of your 324 samples is
   beyond my knowledge!

2. The fitted linear growth rate of log(L) with Age is
   again compatible with the data in your sample. More
   generally, and in particular given the possible
   flattening at Age = 6 which may be discernible in
   your data, it would be better in a general context
   to fit a more realistic growth curve (e.g. von
   Bertalanffy). This would not affect the general
   strategy of the approach, and would affect only
   the expression used to evaluate mean.log.LA in
   the equations.

3. The adoption of a uniform prior distribution for Age
   was chosen (rather, leapt on) for simplicity since
   it was compatible with the data. In real life (i.e.
   324 samples) things may look different. In any case,
   you may have a model for the distribution of Age
   in your fish populations (though the features noted
   in the histograms of Age need to be accounted for).

Hoping this help!
Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Oct-06                                       Time: 17:07:48
------------------------------ XFMail ------------------------------



More information about the R-help mailing list