[R] Converting unique strings to unique numbers
Sarah Goslee
sarah.goslee at gmail.com
Fri May 29 21:04:25 CEST 2015
On Fri, May 29, 2015 at 2:16 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi Kate,
>
> I found that matching the character vector to itself is a very
> effective way to do this:
>
> x <- c("a", "bunch", "of", "strings", "whose", "exact", "content",
> "is", "of", "little", "interest")
> ids <- match(x, x)
> ids
> # [1] 1 2 3 4 5 6 7 8 3 10 11
>
> By using this trick, many manipulations on character vectors can
> be replaced by manipulations on integer vectors, which are sometimes
> way more efficient.
Hm. I hadn't thought of that approach - I use the
as.numeric(factor(...)) approach.
So I was curious, and compared the two:
set.seed(43)
x <- sample(letters, 10000, replace=TRUE)
system.time({
  for(i in seq_len(20000)) {
    ids1 <- match(x, x)
  }})
#    user  system elapsed
#   9.657   0.000   9.657
system.time({
  for(i in seq_len(20000)) {
    ids2 <- as.numeric(factor(x, levels=letters))
  }})
#    user  system elapsed
#    6.16    0.00    6.16
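Using factor() is faster here (about 6.2 s versus 9.7 s for 20000
repetitions), at least when the levels are supplied up front. More
importantly, using factor() lets you set the order of the indices
explicitly, whereas match(x, x) returns the position of each string's
first occurrence.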
head(data.frame(x, ids1, ids2))
  x ids1 ids2
1 m    1   13
2 x    2   24
3 b    3    2
4 s    4   19
5 i    5    9
6 o    6   15
In a problem like Kate's where there are several columns for which the
same ordering of indices is desired, that becomes really important.
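
A toy illustration of the difference (the vectors below are made up for
the example, not Kate's data). It is only a sketch, but it shows why
coding each column separately can give the same string different
numbers, while matching against one pooled set of values does not:

```r
# Made-up example vector with one repeated value.
x <- c("m", "x", "m", "b")

match(x, x)            # 1 2 1 4  (position of first occurrence, gaps possible)
match(x, unique(x))    # 1 2 1 3  (consecutive ids, in order of appearance)
as.integer(factor(x))  # 2 3 2 1  (ids follow the sorted levels b, m, x)

# Two hypothetical columns that share the value "b".
v1 <- c("b", "a")
v2 <- c("c", "b")

# Coded separately, "b" is 2 in v1 but 1 in v2.
sep1 <- as.integer(factor(v1))   # 2 1
sep2 <- as.integer(factor(v2))   # 2 1

# Coded against the pooled unique values, "b" is 1 in both columns.
pooled <- unique(c(v1, v2))      # "b" "a" "c"
pool1 <- match(v1, pooled)       # 1 2
pool2 <- match(v2, pooled)       # 3 1
```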
If you take Bill Dunlap's modification of the match() approach, it
resolves both problems: matching against the pooled unique values is
faster than the equivalent factor() version, and on this test data
(which has no "0" entries) it gives the same indices:
On Fri, May 29, 2015 at 1:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
> match() will do what you want. E.g., run your data through
> the following function.
>
f <- function (data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
    }
    data
}
##
y <- data.frame(id = 1:5000,
                v1 = sample(letters, 5000, replace=TRUE),
                v2 = sample(letters, 5000, replace=TRUE),
                v3 = sample(letters, 5000, replace=TRUE),
                stringsAsFactors=FALSE)
system.time({
  for(i in seq_len(20000)) {
    ids3 <- f(data.frame(y))
  }})
#    user  system elapsed
#  22.515   0.000  22.518
ff <- function(data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- as.numeric(factor(data[[j]], levels=uniqStrings))
    }
    data
}
system.time({
  for(i in seq_len(20000)) {
    ids4 <- ff(data.frame(y))
  }})
#    user  system elapsed
#  26.083   0.002  26.090
head(ids3)
  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26
head(ids4)
  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26
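
One caveat before treating the two versions as interchangeable: the test
data above contains only letters, so the "0" placeholder never comes up.
A minimal sketch with a made-up vector suggests the two approaches
differ for values that are not among the levels, and also in storage
type:

```r
# Hypothetical levels and a vector containing the "0" placeholder.
uniq <- c("a", "b")
v <- c("b", "0", "a")

# match() maps the unmatched "0" to nomatch (0 here) and returns integers.
m <- match(v, uniq, nomatch = 0L)

# factor() turns values outside the levels into NA; as.numeric() gives doubles.
g <- as.numeric(factor(v, levels = uniq))

m          # 2 0 1
g          # 2 NA 1
typeof(m)  # "integer"
typeof(g)  # "double"
```

For a pedigree file with "0" for missing parents, the match() version
with nomatch = 0L therefore looks like the safer choice.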
Kate, if you're getting all zeros, check str(yourdataframe) - it's
likely that when you imported your data into R the strings were
already converted to factors, which is not what you want (ask me how I
know this!).
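
Something along these lines should show the problem and fix it. This is
only a sketch: the data frame and its column names (fam, id, fa, mo) are
typed in by hand from the first four columns of the example in the
original question, and stringsAsFactors=TRUE is set on purpose to mimic
an import that converted the strings.

```r
# Hand-typed stand-in for the pedigree data; column names are invented.
ped <- data.frame(
  fam = "X0001",
  id  = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  fa  = c("0", "0", "BYX894", "BYX894", "BYX894", "BYX894"),
  mo  = c("0", "0", "BYX859", "BYX859", "BYX859", "BYX859"),
  stringsAsFactors = TRUE   # mimic strings imported as factors
)
str(ped)                    # the string columns show up as Factor

# Convert every factor column back to character before matching.
is_fac <- vapply(ped, is.factor, logical(1))
ped[is_fac] <- lapply(ped[is_fac], as.character)

# Same idea as f() above: match columns 2 to 4 against the pooled values.
uniq <- setdiff(unique(c(ped$id, ped$fa, ped$mo)), "0")
for (j in 2:4) {
  ped[[j]] <- match(ped[[j]], uniq, nomatch = 0L)
}
ped
```

When reading the file yourself, passing stringsAsFactors=FALSE to
read.table() avoids the conversion in the first place.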
Sarah
> On 05/29/2015 09:58 AM, Kate Ignatius wrote:
>>
>> I have a pedigree file as so:
>>
>> X0001 BYX859 0 0 2 1 BYX859
>> X0001 BYX894 0 0 1 1 BYX894
>> X0001 BYX862 BYX894 BYX859 2 2 BYX862
>> X0001 BYX863 BYX894 BYX859 2 2 BYX863
>> X0001 BYX864 BYX894 BYX859 2 2 BYX864
>> X0001 BYX865 BYX894 BYX859 2 2 BYX865
>>
>> And I was hoping to change all unique string values to numbers.
>>
>> That is:
>>
>> BYX859 = 1
>> BYX894 = 2
>> BYX862 = 3
>> BYX863 = 4
>> BYX864 = 5
>> BYX865 = 6
>>
>> But only in columns 2 - 4. Essentially I would like the data to look like
>> this:
>>
>> X0001 1 0 0 2 1 BYX859
>> X0001 2 0 0 1 1 BYX894
>> X0001 3 2 1 2 2 BYX862
>> X0001 4 2 1 2 2 BYX863
>> X0001 5 2 1 2 2 BYX864
>> X0001 6 2 1 2 2 BYX865
>>
>> Is this possible with factors?
>>
>> Thanks!
>>
>> K.
>>
--
Sarah Goslee
http://www.functionaldiversity.org