[R] Recoding multiple columns consistently

Jim Lemon jim at bitwrit.com.au
Wed Aug 29 13:01:00 CEST 2007


Ron Crump wrote:
> Hi,
> 
> I have a dataframe that contains pedigree information;
> that is individual, sire and dam identities as separate
> columns. It also has date of birth.
> 
> These identifiers are not numeric, or not sequential.
> 
> Obviously, an identifier can appear in one or two columns,
> depending on whether it was a parent or not. These should
> be consistent.
> 
> Not all identifiers appear in the individual column - it
> is possible for a parent not to have its own record if its
> parents were not known.
> 
> Missing parental (sire and/or dam) identifiers can occur.
> 
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
> 
> How would I go about doing this?
>
Hi Ron,
Without the genealogical coding system for the output, I can only make a 
guess. It seems as though you are going from a series of records for 
which the index is the individual, followed by fields containing sire, 
dam and date of birth (perhaps not in that order).

I think you want to transform this into a network (maybe hierarchical 
unless consanguinity intervenes) with individuals coded as positive 
integers (and maybe some or all of the original information attached to 
those identifiers). At a guess, I would recode the birthdates as 
integers, preserving the order and including a rule for breaking ties.
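
A rough sketch of that recoding, under assumptions of my own: the column names (id, sire, dam, dob) and the data are made up, ties in date of birth are broken by the original identifier, and parents with no record of their own are assumed to be older than every recorded animal, so they get the lowest codes.

```r
# Hypothetical pedigree; "Q1" is a parent that has no record of its own.
ped <- data.frame(
  id   = c("K9", "A7", "Z2", "M4"),
  sire = c("A7", NA,   "A7", "Q1"),
  dam  = c("Z2", NA,   NA,   "Z2"),
  dob  = as.Date(c("2005-03-01", "2001-06-15", "2003-01-10", "2005-03-01")),
  stringsAsFactors = FALSE
)

# Parents that never appear in the id column.
parents  <- unique(c(ped$sire, ped$dam))
founders <- sort(setdiff(parents[!is.na(parents)], ped$id))

# Recorded animals ordered by date of birth; ties broken by original id.
recorded <- ped$id[order(ped$dob, ped$id)]

# Position in this vector is the new integer code.
key <- c(founders, recorded)

# Apply the same lookup to every column so the codes stay consistent.
recode <- function(x) {
  code <- match(x, key)      # NA is not in key, so missing stays NA here
  code[is.na(code)] <- 0L    # missing parents are coded as 0
  code
}

out <- data.frame(
  id   = recode(ped$id),
  sire = recode(ped$sire),
  dam  = recode(ped$dam)
)
out <- out[order(out$id), ]
print(out)
```

Because all three columns go through the same lookup vector, an identifier gets the same code wherever it appears, and every parent code is lower than the code of its offspring as long as the dates of birth are themselves consistent.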

Assuming that you want an inverted tree for each individual, construct a 
linked list beginning with the individual with two pointers to the 
parents (their integer identifiers). Each parent has two links pointing 
to their parents, and so on. Whenever a pointer is zero, the linking 
stops. I don't know whether this can be represented in any of the tree 
diagrams in R, but it certainly could be coded.
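
The pointer-following idea above could be coded roughly like this. It is only a sketch: the data frame is invented, and it assumes the pedigree is already integer coded so that row i describes individual i and 0 marks an unknown parent.

```r
# Hypothetical integer-coded pedigree; row i describes individual i.
ped <- data.frame(
  id   = 1:5,
  sire = c(0L, 0L, 2L, 2L, 1L),
  dam  = c(0L, 0L, 0L, 3L, 3L)
)

# Collect all known ancestors of one individual by following the two
# parent pointers until a zero is reached.
ancestors <- function(i, ped) {
  if (i == 0L) return(integer(0))   # a zero pointer ends the chain
  s <- ped$sire[i]
  d <- ped$dam[i]
  found <- c(s, d, ancestors(s, ped), ancestors(d, ped))
  # Drop the zeros; unique() keeps an ancestor reached by two paths only once.
  sort(unique(found[found != 0L]))
}

print(ancestors(4L, ped))
```

Individual 4 reaches individual 2 both directly (as sire) and through its dam 3, which is the consanguinity case; the call to unique() stops it being listed twice.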

I think a bit more information for non-genealogists about the formats 
might elicit a more specific answer.

> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?
> 
> I can solve both problems in fortran, so could use loops to
> do it in R, but feel there should be quicker, more elegant,
> "more R" solution.
> 
This sounds like a job for "sort".
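
For what it is worth, here is a small sketch of recoding a column to 1..n with base R. The vector x is made-up data, and which numbering you want (sorted order or order of first appearance) is an assumption you would have to settle yourself.

```r
# Hypothetical column with repeated, non-sequential labels.
x <- c("pen7", "pen2", "pen7", "pen31", "pen2")

# Codes follow the sorted order of the distinct values.
codes_sorted <- match(x, sort(unique(x)))

# Codes follow the order in which each value first appears.
codes_seen <- match(x, unique(x))
print(codes_seen)   # 1 2 1 3 2

# A factor gives the same result as the sorted version, since factor()
# sorts the distinct values to build its levels.
codes_factor <- as.integer(factor(x))
```

All three are vectorised, so no explicit loop is needed; repeated values always receive the same code.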

Jim


