[R] Advice on obscuring unique IDs in R

Wed Jan 5 23:01:15 CET 2011

On Wed, Jan 05, 2011 at 09:19:49PM +0000, Anthony Staines wrote:
> Dear colleagues,
> 
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
> 
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!

Producing unique integer codes for character values may be
done using a factor, for example

  s <- c("cd", "bc", "ab", "bc", "ab")
  f <- factor(s)
  as.integer(f) # [1] 3 2 1 2 1
  levels(f) # [1] "ab" "bc" "cd"

If the codes should be ordered by the first occurrence in the
data, then use

  f <- factor(s, levels=unique(s))
  as.integer(f) # [1] 1 2 3 2 3
  levels(f) # [1] "cd" "bc" "ab"
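
As a rough sketch of how this could be applied to a whole table:
the data frame d, its column names and the values below are invented
for illustration. Shuffling the levels with sample() is my own
assumption, not a requirement; it only ensures that the codes reveal
neither the sort order nor the order of appearance of the real
identifiers. Note that this is a lookup table, not a hash, so the
table "key" must be kept as secure as the original identifiers.

```r
# Hypothetical example data; "id" holds the real identifiers
d <- data.frame(id = c("cd", "bc", "ab", "bc", "ab"),
                event = 1:5, stringsAsFactors = FALSE)

# Distinct identifiers in random order (assumption: order should not leak)
lev <- sample(unique(d$id))

# Integer code = position of each identifier in the shuffled levels
d$code <- as.integer(factor(d$id, levels = lev))

# Lookup table linking real identifiers to codes; store it securely
key <- data.frame(id = lev, code = seq_along(lev),
                  stringsAsFactors = FALSE)

# Remove the real identifiers from the working data
d$id <- NULL
```

Rows that had the same identifier still share the same code, so
events for the same person can be linked without the real numbers.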

This does not perform any approximate matching. The codes are
assigned based on exact equality of the strings. If approximate
matching is required, then an example of the identifiers would be
helpful.

Filtering out different types of delimiters may be done as
a preprocessing step, for example, using gsub()

  s <- c("ab cd", "ab  cd", "a b cd")
  gsub(" ", "", s) # [1] "abcd" "abcd" "abcd"

The first argument of gsub() may also be a general regular
expression.
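
For instance, a character class can remove several kinds of
delimiters in one step. The identifiers below are invented, and
treating upper and lower case as equal (via toupper()) is an
assumption that may or may not be appropriate for the real data.

```r
# Hypothetical identifiers written with different delimiters and case
s <- c("AB-12 3", "ab.123", "XY/9", "AB 123")

# Drop every character that is not a letter or digit, then unify case
norm <- toupper(gsub("[^[:alnum:]]", "", s))
norm # [1] "AB123" "AB123" "XY9" "AB123"

# Codes are then assigned on the normalized strings
codes <- as.integer(factor(norm))
codes # [1] 1 1 2 1
```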

Petr Savicky.