[BioC] convert IDs in a data frame?

Marc Carlson mcarlson at fhcrc.org
Sat Jul 27 04:38:55 CEST 2013


On 07/26/2013 04:25 PM, Hervé Pagès wrote:
> Putting this back on the list.
>
> On 07/26/2013 01:59 PM, Marc Carlson wrote:
>> On 07/26/2013 01:46 PM, Hervé Pagès wrote:
>>> Hi Marc,
>>>
>>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>>> ...
>>>> Hello everyone,
>>>>
>>>> Sorry that I saw this thread so late.  Basically, select() does
>>>> *try* to keep your initial keys and map each of them to an equal
>>>> number of unique values.  We did actually anticipate that people
>>>> would *want* to cbind() their results.
>>>>
>>>> But as you discovered there are many circumstances where the data make
>>>> this kind of behavior impossible.
>>>>
>>>> So passing in NAs as keys, for example, can't ever find anything
>>>> meaningful.  Those will simply have to be removed before we can
>>>> proceed.  And it is also impossible to maintain a 1:1 mapping if you
>>>> retrieve fields that have many-to-one relationships with your
>>>> initial keys (also seen here).
>>>>
>>>> For convenience, when this kind of 1:1 output is already impossible
>>>> (as it is for most of your examples), select() will also try to
>>>> simplify the output by removing rows that are identical all the way
>>>> across, etc.
>>>>
>>>> My aim was that select() should try to do the most reasonable thing
>>>> possible based on the data we have in each case.  The rationale is
>>>> that in the case where there are 1:many mappings, you should not be
>>>> planning to bind those directly onto any other data.frames anyway
>>>> (as this circumstance would require you to call merge() instead).
>>>> So in that case, non-destructive simplification seems beneficial.
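The cbind()-vs-merge() point above can be sketched with toy base-R data (made-up keys and values, not the real annotation API): a 1:many mapping makes the annotation table longer than the key vector, so only merge() can line the two up.

```r
## Toy sketch (hypothetical keys, plain base R): 'dat' holds one row per
## key, while 'ann' mimics a select() result in which key "A" maps to
## two values, so it has 3 rows instead of 2.
dat <- data.frame(key = c("A", "B"), score = c(1.2, 3.4))
ann <- data.frame(key = c("A", "A", "B"), value = c("x1", "x2", "y1"))

## cbind(dat, ann) would fail (2 rows vs. 3 rows); merge() instead
## repeats the "A" row of 'dat' once per mapped value.
merged <- merge(dat, ann, by = "key")
```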
>>>
>>> Other tools in our infrastructure use an extra argument to pick up
>>> one thing in case of multiple mappings, e.g. findOverlaps() has the
>>> 'select' argument with possible values "all", "first", "last", and
>>> "arbitrary". Also nearest() and family have this argument and accept
>>> similar values.
>>>
>>> Couldn't select() use a similar approach? The default would be "all",
>>> so the current behavior is preserved, but if it's something else then
>>> the returned data.frame would align with the input.
>>>
>>> Thanks,
>>> H.
>>
>> Hi Herve,
>>
>> I know that for things like findOverlaps() it can sometimes make
>> sense to allow this kind of behavior.  But in this case, I really
>> don't think that this is a good idea.  Biologically speaking, this is
>> not something anyone should really ever do with annotations like
>> this, so I really see no upside to making it more convenient for
>> people to do stuff that they should basically never do anyway.
>
> I agree that picking up one thing when your key is mapped to more than
> one thing should be done carefully and the user needs to understand the
> consequences so I'm all for keeping the current behavior as it is.
> However I think there are a few legitimate situations where the user
> might actually want to get rid of this complexity. For example:
>
>   (a) One of them is the "multiple probe" situation, i.e. when a probe
>       is mapped to more than 1 gene. This is currently handled through
>       toggleProbes() when working directly with the Bimap API. However
>       there is no equivalent for the select() API. This could be
>       handled by supporting "none" in addition to the disambiguation
>       modes "all", "first", "last", and "arbitrary".
>
Hi Herve,

This is an extremely specialized situation that was motivated in large 
part by the fact that the default behavior for probes was unexpected 
for many users.  I think that the defaults we have with select() are 
much better.

>   (b) Another situation is when the user only wants to know whether
>       the keys map to something or not. If s/he could disambiguate
>       with "first" (or "last", or "arbitrary", it doesn't matter) then
>       s/he would get back a data.frame that aligns with his/her vector
>       of keys which is very convenient.
>

This sort of thing is already supported in devel by using the keys() 
method.  You can do it like this:

## gets ALL the keys of type "ENTREZID"
allKeys = keys(org.Hs.eg.db, keytype="ENTREZID")

## and this gets all the keys of type "ENTREZID" that have a KEGG
## pathway ID associated with them
pathKeys = keys(org.Hs.eg.db, keytype="ENTREZID", column="PATH")

This is even more powerful than what you proposed because it also 
allows searching for keys based on fuzzy or partial matches via the 
(also new) pattern argument.

And as you know, there are already plenty of standard R operators for 
comparing things like character vectors.
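For the "do my keys map to anything?" question in particular, the keys() output combines with ordinary vector operators; here is a minimal sketch with made-up stand-ins for the real key vectors:

```r
## Stand-ins for real keys() output (hypothetical IDs): 'pathKeys'
## plays the role of keys(org.Hs.eg.db, keytype="ENTREZID", column="PATH").
pathKeys <- c("100", "10000")

## %in% returns a logical vector that aligns 1:1 with any input vector
## of keys, with no disambiguation argument needed:
myKeys  <- c("1000", "100", "10")
hasPath <- myKeys %in% pathKeys      # FALSE TRUE FALSE
```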

> Given that those mappings are retrieved from a database, the notion of
> "first" and "last" is maybe not clear and we might just want to support
> "all", "arbitrary", and "none".
>
>>
>> On the other hand, I suppose that I could add support for DataFrames
>> as an output.  But I worry that this would be a ton of work, since
>> the code would have to compress things at the right times.
>
> Very roughly, each column (except the 1st one) of the data.frame
> returned by select() would need to go through splitAsList() using the
> 1st column as the common split factor, and then go through subsetting
> by the user-supplied 'keys' vector. Here is a simple wrapper around
> select() that does this, but it doesn't know how to handle NAs in
> the input:
>
> selectAsDataFrame <- function(x, keys, columns, keytype)
> {
>     if (any(is.na(keys)))
>         stop("NAs in 'keys' are not supported yet")
>     keys0 <- unique(keys)
>     ans0 <- select(x, keys0, columns, keytype)
>     stopifnot(names(ans0)[1L] == keytype)
>     f <- ans0[[1]]
>     if (ncol(ans0) == 1L) {
>         ans_col1 <- unique(f)
>         m <- match(keys, ans_col1)
>         ans <- DataFrame(ans_col1[m])
>     } else {
>         ans_col2 <- unique(splitAsList(ans0[[2L]], f))
>         ans_col1 <- names(ans_col2)
>         m <- match(keys, ans_col1)
>         ans <- DataFrame(ans_col1[m], unname(ans_col2)[m])
>         if (ncol(ans0) >= 3L) {
>             ans_cols <- lapply(ans0[-(1:2)],
>                                function(col)
>                                  unname(unique(splitAsList(col, f)[m])))
>             ans <- cbind(ans, DataFrame(ans_cols))
>         }
>     }
>     colnames(ans) <- colnames(ans0)
>     ans
> }
>
> Then:
>
>   > library(org.Hs.eg.db)
>   > selectAsDataFrame(org.Hs.eg.db, keys=c("ALOX5", "HTR7", "XX", "ALOX5"),
>                       keytype="ALIAS", columns=c("ENTREZID", "ENSEMBL"))
>   DataFrame with 4 rows and 3 columns
>           ALIAS        ENTREZID                         ENSEMBL
>     <character> <CharacterList>                 <CharacterList>
>   1       ALOX5             240 ENSG00000012779,ENSG00000262552
>   2        HTR7            3363                 ENSG00000148680
>   3          XX              NA                              NA
>   4       ALOX5             240 ENSG00000012779,ENSG00000262552
>
> An unpleasant effect of this approach is that every column is now a
> List object even when that's not needed (e.g. ENTREZID).
>
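If a plain character column is preferred over a List column, one workaround (a rough sketch, using an ordinary list to stand in for a CharacterList like the ENSEMBL column above) is to collapse each element into a delimited string:

```r
## 'ensembl' stands in for a list-valued column; collapse each element
## into one comma-separated string per row. Note that NA is pasted as
## the literal string "NA".
ensembl <- list(c("ENSG00000012779", "ENSG00000262552"),
                "ENSG00000148680",
                NA_character_)
flat <- vapply(ensembl, function(x) paste(x, collapse = ","), character(1))
```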
>>  Is there an
>> automagical way to have DataFrames pack up redundant
>> columns?  Of course, what then?  Could they then cbind() a DataFrame
>> to a data.frame?
>
> No, but they could cbind() it to another DataFrame.
>
> My feeling is that the DataFrame route is maybe a little bit more
> complicated than introducing an extra argument for disambiguating
> the ordinary data.frame but I could be wrong. They are complementary
> though so we could also support both ;-) Depends what the people on
> the list find more useful...
>
> Thanks,
> H.
>

And with that I think you have just convinced me (at least) that I 
probably don't want to add support for DataFrames either.  In any case 
I don't think that we need to add any more complexity to the interface. 
Ideally, I think we want it to be easy for people to create select() 
interfaces to other database resources.  The more extra stuff we layer 
on top, the harder this will become.


   Marc




