[BioC] revmap question

Fri Oct 10 04:20:42 CEST 2008

Hi Laurent, Robert,

Maybe a simple solution for the unlist() problem would be to have an extra argument
that gives the user more control over what to do with the names.

For example, with the unlist2() function below:

   > x <- list(A=c(b=-4, 2, b=7), B=3:-1, c(a=1, a=-2), C=list(c(2:-1, d=55), e=99))

   > unlist2(x, what.names="inherited")
    b  A  b  B  B  B  B  B  a  a  C  C  C  C  d  e
   -4  2  7  3  2  1  0 -1  1 -2  2  1  0 -1 55 99

names are just "inherited" without being mangled.

By default (what.names="mangled"), unlist2() behaves like base::unlist()

   > unlist2(x)
   A.b  A2 A.b  B1  B2  B3  B4  B5   a   a  C1  C2  C3  C4 C.d C.e
    -4   2   7   3   2   1   0  -1   1  -2   2   1   0  -1  55  99

It supports an extra mode (what.names="full"), not necessary very useful, to get
the full names of the elements:

   > unlist2(x, what.names="full")
    A.b   A.  A.b   B.   B.   B.   B.   B.   .a   .a  C..  C..  C..  C.. C..d C.e.
     -4    2    7    3    2    1    0   -1    1   -2    2    1    0   -1   55   99

The full name of an element describes the path that was followed to walk from the
top of level of the hierarchy to the element. The names of all the intermediate
elements that were walked on are concatenated together with the dot as the separator.

make.name.tree <- function(x, recursive, what.names)
{
     if (!is.character(what.names) || length(what.names) != 1)
         stop("'what.names' must be a single string")
     what.names <- match.arg(what.names, c("inherited" , "full"))
     .make.name.tree.rec <- function(x, parent_name, depth)
     {
         if (length(x) == 0)
             return(character(0))
         x_names <- names(x)
         if (is.null(x_names))
             x_names <- rep.int(parent_name, length(x))
         else if (what.names == "full")
             x_names <- paste(parent_name, x_names, sep="")
         else
             x_names[x_names == ""] <- parent_name
         if (!is.list(x) || (!recursive && depth >= 1L))
             return(x_names)
         if (what.names == "full")
             x_names <- paste(x_names, ".", sep="")
         lapply(seq_len(length(x)),
                function(i) .make.name.tree.rec(x[[i]], x_names[i], depth + 1L))
     }
     .make.name.tree.rec(x, "", 0L)
}

unlist2 <- function(x, recursive = TRUE, use.names = TRUE, what.names = "mangled")
{
     if (!is.character(what.names) || length(what.names) != 1)
         stop("'what.names' must be a single string")
     what.names <- match.arg(what.names, c("mangled", "inherited" , "full"))
     if (!use.names || what.names == "mangled")
         return(unlist(x, recursive, use.names))
     ans <- unlist(x, recursive, FALSE)
     names(ans) <- unlist(make.name.tree(x, recursive, what.names), recursive, FALSE)
     ans
}

Cheers,
H.

lgautier at altern.org wrote:
>>
>> lgautier at altern.org wrote:
>>>> James W. MacDonald wrote:
>>>>> Hi Raffaele,
>>>>>
>>>>> rcaloger wrote:
>>>>>> Hi,
>>>>>> I  found very interesting the possibility of using reversing the
>>>>>> mapping by revmap in the XXXX.db annotation databases.
>>>>>>
>>>>>> However, I have two problems:
>>>>>> 1) if  I use:
>>>>>> egs <- c("1", "100", "1000")
>>>>>> unlist(mget(egs, revmap(hgu133plus2ENTREZID)))
>>>>>>
>>>>>> I am getting not only the probesets associated to the three EGs:
>>>>>>            1          1001          1002          1003         10001
>>>>>>  "229819_at"  "1556117_at"   "204639_at" "216705_s_at"   "203440_at"
>>>>>>        10002         10003
>>>>>> "203441_s_at"   "237305_at"
>>>>> Well, not really. This appears to be so because you are unlisting a
>>>>> named list. Since the names have to be unique,
>>>> Well, that's were I don't follow the logic behind unlist() and I've
>>>> always
>>>> found this "feature" pretty strange. unlist() won't even make a good
>>>> job
>>>> at
>>>> keeping the names unique:
>>>>    > unlist(list(AA=letters[1:3], AA2="bb"))
>>>>     AA1  AA2  AA3  AA2
>>>>     "a"  "b"  "c" "bb"
>>>> So mangling the names doesn't solve anything but just adds confusion.
>>>>
>>>> IMO it would be better if unlist() was keeping the original names, even
>>>> if
>>>> that
>>>> means that they are not unique in the returned vector. At least I can
>>>> do
>>>> something
>>>> with it programmatically, and it's easy. With the mangled names, it's
>>>> much
>>>> harder
>>>> (there are a couple of serious pitfalls).
>>>>
>>> The problem might originate in what one could perceive a flaw with lists
>>> (or any named vectors for that matter) in allowing non-unique names.
>>>
>>> Mangled names are shurely a headache, as well as the "get only the first
>>> element with the given name while it was not known there were several
>>> elements with the same name" behavior in R.
>>    I disagree - I think that requiring unique row names in R is/was a
>> mistake - restrictions are often expensive - as they limit what can be
>> done.
> 
> I am certain that the best design decision was made at the time, and my
> comment is obviously a retrospective one.
> 
> I do suspect that computational cost weighted a lot in the decision, and
> uniqueness of names *was* necessarily not a mistake.
> However, with the growing presence and of lookup structures in R (as well
> as awareness of them and their benefit by programmers from all horizons),
> these restrictions might be appearing much less expensive, and the benefit
> from having them might be becoming competitive with not having them.
> 
> Regarding the freedom, it might not be completely challenged as one might
> use a data.frame (with vector of either mode "factor" or "character"
> whenever non-uniqueness is required - and those would just not be "names"
> but "tags").
> 
> 
> Best,
> 
> 
> Laurent
> 
> 
> 
>>  Yes there are issues about dealing with non-unique row names, but
>> those can be dealt with, by careful programming. Such methods would work
>> in all cases of duplicate row names, but with name-mangling schemes, one
>> needs to know what name mangling scheme was used to be able to
>> disentangle - and that means every solution is different -- not exactly
>> the kind of situation I would personally engineer in.
>>
>>   best wishes
>>     Robert
>>
>>>
>>> L.
>>>
>>>> H.
>>>>
>>>>
>>>>> R adds an additional
>>>>> integer to the end of duplicate names:
>>>>>
>>>>>  > egs <- c("1", "100", "1000")
>>>>>  > mget(egs, revmap(hgu133plus2ENTREZID))
>>>>> $`1`
>>>>> [1] "229819_at"
>>>>>
>>>>> $`100`
>>>>> [1] "1556117_at"  "204639_at"   "216705_s_at"
>>>>>
>>>>> $`1000`
>>>>> [1] "203440_at"   "203441_s_at" "237305_at"
>>>>>
>>>>>> There is any possibility to avoid this problem?
>>>>>>
>>>>>> 2) if in the egs vector is present an eg (6333) that is not present
>>>>>> in
>>>>>> the annotation database I get the following error:
>>>>>> egs <- c("1", "100", "1000", "6333")
>>>>>> unlist(mget(egs, revmap(hgu133plus2ENTREZID)))
>>>>>>
>>>>>> Error in .checkKeys(value, Rkeys(x), x at ifnotfound) :
>>>>>>  value for "6333" not found
>>>>>>
>>>>>> There is any possibility to make a query that simply avoid the
>>>>>> unmapped keys?
>>>>> Yes. The help for mget is a bit confusing on this point, but you need
>>>>> to
>>>>> use the argument ifnotfound = NA.
>>>>>
>>>>>  > egs <- c("1", "100", "1000", "6333")
>>>>>  > mget(egs, revmap(hgu133plus2ENTREZID), ifnotfound = NA)
>>>>> $`1`
>>>>> [1] "229819_at"
>>>>>
>>>>> $`100`
>>>>> [1] "1556117_at"  "204639_at"   "216705_s_at"
>>>>>
>>>>> $`1000`
>>>>> [1] "203440_at"   "203441_s_at" "237305_at"
>>>>>
>>>>> $`6333`
>>>>> [1] NA
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>> Many thanks
>>>>>> Raffaele
>>>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>> --
>> Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> 206-667-7700
>> rgentlem at fhcrc.org
>>
> 
>