[R] Converting unique strings to unique numbers

William Dunlap wdunlap at tibco.com
Fri May 29 22:48:46 CEST 2015


>I'm not sure why which particular ID gets assigned to each string would
>matter but maybe I'm missing something. What really matters is that each
>string receives a unique ID. match(x, x) does that.

I think each row of the OP's dataset represents an individual (column 2)
followed by its two parents (columns 3 and 4).  I assume that the marker
"0" means that a parent is not in the dataset.  If you match against the
strings in column 2 only, in their original order, then the resulting
numbers give the row number of each individual, making it straightforward
to look up information about an individual's ancestors.  Hence the choice
of numeric IDs may be important.
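
A minimal sketch of that idea, assuming the pedigree has been read into a
data frame with character (not factor) columns and that the strings in
column 2 are unique.  The column names fam, id, p1 and p2 are made up for
illustration; the OP's file has no header.

```r
# Hypothetical column names; only columns 1-4 of the OP's data are shown.
ped <- data.frame(
  fam = "X0001",
  id  = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  p1  = c("0", "0", "BYX894", "BYX894", "BYX894", "BYX894"),
  p2  = c("0", "0", "BYX859", "BYX859", "BYX859", "BYX859"),
  stringsAsFactors = FALSE
)

# Save the individual IDs before overwriting them, then match every
# ID column against that one key, in its original order.
# "0" (or any parent that is not listed as an individual) becomes 0.
key <- ped$id
for (j in c("id", "p1", "p2")) {
  ped[[j]] <- match(ped[[j]], key, nomatch = 0L)
}
ped
```

With IDs assigned this way, a nonzero entry in p1 or p2 is the row number
of that parent, so ped[ped$p1[3], ] is the row for one parent of the
individual in row 3.  An entry of 0 selects no rows.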



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, May 29, 2015 at 1:29 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

> Hi Sarah,
>
> On 05/29/2015 12:04 PM, Sarah Goslee wrote:
>
>> On Fri, May 29, 2015 at 2:16 PM, Hervé Pagès <hpages at fredhutch.org>
>> wrote:
>>
>>> Hi Kate,
>>>
>>> I found that matching the character vector to itself is a very
>>> effective way to do this:
>>>
>>>    x <- c("a", "bunch", "of", "strings", "whose", "exact", "content",
>>>           "is", "of", "little", "interest")
>>>    ids <- match(x, x)
>>>    ids
>>>    # [1]  1  2  3  4  5  6  7  8  3 10 11
>>>
>>> By using this trick, many manipulations on character vectors can
>>> be replaced by manipulations on integer vectors, which are sometimes
>>> way more efficient.
>>>
>>
>> Hm. I hadn't thought of that approach - I use the
>> as.numeric(factor(...)) approach.
>>
>> So I was curious, and compared the two:
>>
>>
>> set.seed(43)
>> x <- sample(letters, 10000, replace=TRUE)
>>
>> system.time({
>>    for(i in seq_len(20000)) {
>>    ids1 <- match(x, x)
>> }})
>>
>> #   user  system elapsed
>> #  9.657   0.000   9.657
>>
>> system.time({
>>    for(i in seq_len(20000)) {
>>    ids2 <- as.numeric(factor(x, levels=letters))
>> }})
>>
>> #   user  system elapsed
>> #   6.16    0.00    6.16
>>
>> Using factor() is faster.
>>
>
> That's an unfair comparison, because you already know what the levels
> are so you can supply them to your call to factor(). Most of the time
> you don't know what the levels are so either you just do factor(x) and
> let the factor() constructor compute the levels for you, or you compute
> them yourself upfront with something like factor(x, levels=unique(x)).
>
>   library(microbenchmark)
>
>   microbenchmark(
>     {ids1 <- match(x, x)},
>     {ids2 <- as.integer(factor(x, levels=letters))},
>     {ids3 <- as.integer(factor(x))},
>     {ids4 <- as.integer(factor(x, levels=unique(x)))}
>   )
>   Unit: microseconds
>                                                       expr     min       lq
>                                {     ids1 <- match(x, x) } 245.979 262.2390
>    {     ids2 <- as.integer(factor(x, levels = letters)) } 214.115 219.2320
>                      {     ids3 <- as.integer(factor(x)) } 380.782 388.7295
>  {     ids4 <- as.integer(factor(x, levels = unique(x))) } 332.250 342.6630
>        mean   median      uq     max neval
>    267.3210 264.4845 268.348 293.894   100
>    226.9913 220.9870 226.147 314.875   100
>    402.2242 394.7165 412.075 481.410   100
>    349.7405 345.3090 353.162 383.002   100
>
>> More importantly, using factor() lets you
>> set the order of the indices in an expected fashion, where match()
>> assigns them in the order of occurrence.
>>
>> head(data.frame(x, ids1, ids2))
>>
>>    x ids1 ids2
>> 1 m    1   13
>> 2 x    2   24
>> 3 b    3    2
>> 4 s    4   19
>> 5 i    5    9
>> 6 o    6   15
>>
>> In a problem like Kate's where there are several columns for which the
>> same ordering of indices is desired, that becomes really important.
>>
>
> I'm not sure why which particular ID gets assigned to each string would
> matter but maybe I'm missing something. What really matters is that each
> string receives a unique ID. match(x, x) does that.
>
> In Kate's problem, where the strings are in more than one column,
> and you want the ID to be unique across the columns, you need to do
> match(x, x) where 'x' contains the strings from all the columns
> that you want to replace:
>
>   m <- matrix(c(
>     "X0001", "BYX859",        0,        0,  2,  1, "BYX859",
>     "X0001", "BYX894",        0,        0,  1,  1, "BYX894",
>     "X0001", "BYX862", "BYX894", "BYX859",  2,  2, "BYX862",
>     "X0001", "BYX863", "BYX894", "BYX859",  2,  2, "BYX863",
>     "X0001", "BYX864", "BYX894", "BYX859",  2,  2, "BYX864",
>     "X0001", "BYX865", "BYX894", "BYX859",  2,  2, "BYX865"
>   ), ncol=7, byrow=TRUE)
>
>   x <- m[ , 2:4]
>   id <- match(x, x, nomatch=0, incomparables="0")
>   m[ , 2:4] <- id
>
> No factor needed. No loop needed. ;-)
>
> Cheers,
> H.
>
>
>> If you take Bill Dunlap's modification of the match() approach, it
>> resolves both problems: matching against the pooled unique values is
>> both faster than the factor() version and gives the same result:
>>
>>
>> On Fri, May 29, 2015 at 1:31 PM, William Dunlap <wdunlap at tibco.com>
>> wrote:
>>
>>> match() will do what you want.  E.g., run your data through
>>> the following function.
>>>
>>>  f <- function (data)
>>>  {
>>>      uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
>>>      uniqStrings <- setdiff(uniqStrings, "0")
>>>      for (j in 2:4) {
>>>          data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
>>>      }
>>>      data
>>>  }
>>
>> ##
>>
>> y <- data.frame(id = 1:5000,
>>                 v1 = sample(letters, 5000, replace=TRUE),
>>                 v2 = sample(letters, 5000, replace=TRUE),
>>                 v3 = sample(letters, 5000, replace=TRUE),
>>                 stringsAsFactors=FALSE)
>>
>>
>> system.time({
>>    for(i in seq_len(20000)) {
>>      ids3 <- f(data.frame(y))
>> }})
>>
>> #   user  system elapsed
>> # 22.515   0.000  22.518
>>
>>
>>
>> ff <- function(data)
>> {
>>      uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
>>      uniqStrings <- setdiff(uniqStrings, "0")
>>      for (j in 2:4) {
>>          data[[j]] <- as.numeric(factor(data[[j]], levels=uniqStrings))
>>      }
>>      data
>> }
>>
>> system.time({
>>    for(i in seq_len(20000)) {
>>      ids4 <- ff(data.frame(y))
>> }})
>>
>> #    user  system elapsed
>> #  26.083   0.002  26.090
>>
>> head(ids3)
>>
>>    id v1 v2 v3
>> 1  1  1  2  8
>> 2  2  2 19 22
>> 3  3  3 21 16
>> 4  4  4 10 17
>> 5  5  1  8 18
>> 6  6  1 12 26
>>
>> head(ids4)
>>
>>    id v1 v2 v3
>> 1  1  1  2  8
>> 2  2  2 19 22
>> 3  3  3 21 16
>> 4  4  4 10 17
>> 5  5  1  8 18
>> 6  6  1 12 26
>>
>> Kate, if you're getting all zeros, check str(yourdataframe) - it's
>> likely that when you imported your data into R the strings were
>> already converted to factors, which is not what you want (ask me how I
>> know this!).
>>
>> Sarah
>>
>>
>>
>>  On 05/29/2015 09:58 AM, Kate Ignatius wrote:
>>>
>>>>
>>>> I have a pedigree file as so:
>>>>
>>>> X0001 BYX859      0      0  2  1 BYX859
>>>> X0001 BYX894      0      0  1  1 BYX894
>>>> X0001 BYX862 BYX894 BYX859  2  2 BYX862
>>>> X0001 BYX863 BYX894 BYX859  2  2 BYX863
>>>> X0001 BYX864 BYX894 BYX859  2  2 BYX864
>>>> X0001 BYX865 BYX894 BYX859  2  2 BYX865
>>>>
>>>> And I was hoping to change all unique string values to numbers.
>>>>
>>>> That is:
>>>>
>>>> BYX859 = 1
>>>> BYX894 = 2
>>>> BYX862 = 3
>>>> BYX863 = 4
>>>> BYX864 = 5
>>>> BYX865 = 6
>>>>
>>>> But only in columns 2 - 4.  Essentially I would like the data to look
>>>> like
>>>> this:
>>>>
>>>> X0001 1 0 0  2  1 BYX859
>>>> X0001 2 0 0  1  1 BYX894
>>>> X0001 3 2 1  2  2 BYX862
>>>> X0001 4 2 1  2  2 BYX863
>>>> X0001 5 2 1  2  2 BYX864
>>>> X0001 6 2 1  2  2 BYX865
>>>>
>>>> Is this possible with factors?
>>>>
>>>> Thanks!
>>>>
>>>> K.
>>>>
>>>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
