[BioC] convert IDs in a data frame?

Sat Jul 27 01:25:26 CEST 2013

Putting this back on the list.

On 07/26/2013 01:59 PM, Marc Carlson wrote:
> On 07/26/2013 01:46 PM, Hervé Pagès wrote:
>> Hi Marc,
>>
>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>> ...
>>> Hello everyone,
>>>
>>> Sorry that I saw this thread so late.  Basically, select() does *try* to
>>> keep your initial keys and map them each to an equivalent number of
>>> unique values.  We did actually anticipate that people would *want* to
>>> cbind() their results.
>>>
>>> But as you discovered there are many circumstances where the data make
>>> this kind of behavior impossible.
>>>
>>> So passing in NAs as keys for example can't ever find anything
>>> meaningful.  Those will simply have to be removed before we can
>>> proceed.  And, it is also impossible to maintain a 1:1 mapping if you
>>> retrieve fields that have many to one relationships with your initial
>>> keys (also seen here).
>>>
>>> For convenience, when this kind of 1:1 output is already impossible (as
>>> it is for most of your examples), select will also try to simplify the
>>> output by removing rows that are identical all the way across, etc.
>>>
>>> My aim was that select should try to do the most reasonable thing
>>> possible based on the data we have in each case.  The rationale is that
>>> in the case where there are 1:many mappings, you should not be planning
>>> to bind those directly onto any other data.frames anyways (as this
>>> circumstance would require you to call merge() instead). So in that
>>> case, non-destructive simplification seems beneficial.
>>
>> Other tools in our infrastructure use an extra argument to pick 1
>> thing in case of multiple mappings, e.g. findOverlaps() has the 'select'
>> argument with possible values "all", "first", "last", and "arbitrary".
>> Also nearest() and family have this argument and it accepts similar
>> values.
>>
>> Couldn't select() use a similar approach? The default should be "all"
>> so the current behavior is preserved but if it's something else then
>> the returned data.frame should align with the input.
>>
>> Thanks,
>> H.
>
> Hi Herve,
>
> I know that for things like findOverlaps it can sometimes make some
> sense to allow this kind of behavior.  But in this case, I really don't
> think that this is a good idea.  Biologically speaking, this is not
> something anyone should really ever do with annotations like this, so I
> really see no upside to making it more convenient for people to do stuff
> that they should basically never do anyways.

I agree that picking one thing when your key is mapped to more than
one thing should be done carefully, and that the user needs to understand
the consequences, so I'm all for keeping the current behavior as it is.
However I think there are a few legitimate situations where the user
might actually want to get rid of this complexity. For example:

   (a) One of them is the "multiple probe" situation, i.e. when a probe
       is mapped to more than 1 gene. This is currently handled through
       toggleProbes() when working directly with the Bimap API. However
       there is no equivalent for the select() API. This could be
       handled by supporting "none" in addition to the disambiguation
       modes "all", "first", "last", and "arbitrary".

   (b) Another situation is when the user only wants to know whether
       the keys map to something or not. If s/he could disambiguate
       with "first" (or "last", or "arbitrary", it doesn't matter) then
       s/he would get back a data.frame that aligns with his/her vector
       of keys which is very convenient.

Given that those mappings are retrieved from a database, the notion of
"first" and "last" is maybe not clear and we might just want to support
"all", "arbitrary", and "none".

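To make the idea a bit more concrete, here is a rough sketch of what
such a disambiguation step could look like. It works on a plain
data.frame shaped like the output of select() (keys in the 1st column),
not on a real annotation package. The name disambiguate() and its
'multi' argument are made up for illustration and are not part of
AnnotationDbi. I'm also assuming that "none" means "a key with more
than 1 hit gets NA", and that "arbitrary" can simply pick the first
row found:

```r
## Hypothetical helper (not part of AnnotationDbi): collapse a
## select()-like data.frame so that it aligns with 'keys'.
disambiguate <- function(ans0, keys, multi=c("all", "arbitrary", "none"))
{
    multi <- match.arg(multi)
    if (multi == "all")
        return(ans0)                    # current behavior, unchanged
    f <- ans0[[1L]]
    ## Row index of one hit per key (NA if the key is not found).
    m <- match(keys, f)
    ans <- ans0[m, , drop=FALSE]
    ans[[1L]] <- keys                   # keep the keys that had no hit
    if (multi == "none") {
        ## Blank out every key that maps to more than 1 row.
        is_multi <- keys %in% f[duplicated(f)]
        for (j in seq_len(ncol(ans))[-1L])
            ans[[j]][is_multi] <- NA
    }
    rownames(ans) <- NULL
    ans
}

## Toy data mimicking what select() returns for 3 distinct keys.
ans0 <- data.frame(ALIAS=c("ALOX5", "ALOX5", "HTR7", "XX"),
                   ENSEMBL=c("ENSG00000012779", "ENSG00000262552",
                             "ENSG00000148680", NA),
                   stringsAsFactors=FALSE)
keys <- c("ALOX5", "HTR7", "XX", "ALOX5")

disambiguate(ans0, keys, multi="arbitrary")  # 1 row per key
disambiguate(ans0, keys, multi="none")       # ALOX5 rows are blanked out
```

With anything other than "all" the result has exactly length(keys)
rows, so it can be cbind()'ed to the data the keys came from. Note that
this sketch decides "multiple" at the row level, so in "none" mode a
column that is actually unambiguous for a key is blanked out too.
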
>
> On the other hand, I suppose that I could have support for DataFrames as
> an output.  But I worry that this would be a ton of work since the code
> would have to compress things at the right times?

Very roughly, each column (except the 1st one) of the data.frame
returned by select() would need to go through splitAsList() using the
1st column as the common split factor, and then be subsetted by the
user-supplied 'keys' vector. Here is a simple wrapper around select()
that does this. Note that it doesn't know how to handle NAs in the
input:

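## Wrapper around select() that returns a DataFrame aligned with 'keys'.
## Multiple hits for a key are packed into List columns.
selectAsDataFrame <- function(x, keys, columns, keytype)
{
     if (any(is.na(keys)))
         stop("NAs in 'keys' are not supported yet")
     ## Query each distinct key only once.
     keys0 <- unique(keys)
     ans0 <- select(x, keys0, columns, keytype)
     stopifnot(names(ans0)[1L] == keytype)
     ## The 1st column (the keys) is the common split factor.
     f <- ans0[[1L]]
     if (ncol(ans0) == 1L) {
         ans_col1 <- unique(f)
         m <- match(keys, ans_col1)
         ans <- DataFrame(ans_col1[m])
     } else {
         ## Group the values of the 2nd column by key, dropping
         ## duplicates within each group.
         ans_col2 <- unique(splitAsList(ans0[[2L]], f))
         ans_col1 <- names(ans_col2)
         ## 'm' maps each user-supplied key to its group.
         m <- match(keys, ans_col1)
         ans <- DataFrame(ans_col1[m], unname(ans_col2)[m])
         if (ncol(ans0) >= 3L) {
             ## Same treatment for the remaining columns.
             ans_cols <- lapply(ans0[-(1:2)],
                                function(col)
                                  unname(unique(splitAsList(col, f)[m])))
             ans <- cbind(ans, DataFrame(ans_cols))
         }
     }
     colnames(ans) <- colnames(ans0)
     ans
}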

Then:

   > library(org.Hs.eg.db)
   > selectAsDataFrame(org.Hs.eg.db, keys=c("ALOX5", "HTR7", "XX", "ALOX5"),
                       keytype="ALIAS", columns=c("ENTREZID", "ENSEMBL"))
   DataFrame with 4 rows and 3 columns
           ALIAS        ENTREZID                         ENSEMBL
     <character> <CharacterList>                 <CharacterList>
   1       ALOX5             240 ENSG00000012779,ENSG00000262552
   2        HTR7            3363                 ENSG00000148680
   3          XX              NA                              NA
   4       ALOX5             240 ENSG00000012779,ENSG00000262552

An unpleasant effect of this approach is that every column is now a
List object even when not needed (e.g. ENTREZID).
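
One possible way around this would be to turn a column back into an
ordinary vector whenever none of its elements holds more than 1 value.
Below is a rough sketch of that check. It uses plain base R lists as
stand-ins for the List columns, and simplifyListCol() is a made-up
name, not an existing function:

```r
## Hypothetical helper: turn a list-like column back into an atomic
## vector when no element holds more than 1 value; otherwise return
## the column untouched.
simplifyListCol <- function(col)
{
    if (all(lengths(col) <= 1L)) {
        col[lengths(col) == 0L] <- NA   # empty elements become NA
        return(unlist(col, use.names=FALSE))
    }
    col
}

## Stand-ins for the ENTREZID and ENSEMBL columns of the result above.
entrezid <- list("240", "3363", NA_character_, "240")
ensembl <- list(c("ENSG00000012779", "ENSG00000262552"),
                "ENSG00000148680", NA_character_,
                c("ENSG00000012779", "ENSG00000262552"))

simplifyListCol(entrezid)   # ordinary character vector of length 4
simplifyListCol(ensembl)    # left as a list (ALOX5 has 2 values)
```

The downside is that the type of a column would then depend on the
data, which might be surprising for code that consumes the result.
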

>  Is there an
> automagical way to have DataFrames automagically pack up redundant
> columns?  Of course, what then?  Could they then cbind() a DataFrame to
> a data.frame?

No but they could cbind() it to another DataFrame.

My feeling is that the DataFrame route is maybe a little bit more
complicated than introducing an extra argument for disambiguating
the ordinary data.frame but I could be wrong. They are complementary
though so we could also support both ;-) Depends what the people on
the list find more useful...

Thanks,
H.

>
>    Marc
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319