[BioC] Quickest way to convert IDs in a data frame?

Hervé Pagès hpages at fhcrc.org
Thu Jul 25 22:32:50 CEST 2013


Hi Enrico,

On 07/25/2013 01:20 PM, James W. MacDonald wrote:
> Hi Enrico,
>
> Please don't take things off-list (e.g., use reply-all).
>
>
> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>> Hi James,
>>
>> Thanks very much for your help.
>> There is an issue that needs to be solved before thinking about what's
>> the best approach in my opinion.
>>
>> I don't understand why, but the object created with the call to select
>> (test in my example, first.two in yours) has a different number of
>> rows from the original object (df in my example). Specifically it has
>> *less* rows.

I'm surprised it has fewer rows. It can definitely have more, when some
of the keys passed to select() are mapped to more than 1 row, but my
understanding was that select() would propagate unmapped keys to the
output by placing them in rows stuffed with NAs. So maybe I
misunderstood how select() works, or its behavior was changed, or
there is a bug somewhere. Could you please send the code that allows
us to reproduce this? Thanks.
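
The behaviour described above is essentially that of a left join: every
key shows up at least once, unmapped keys get NA, and keys with several
matches are repeated. A minimal base-R sketch of that idea follows. It
uses merge() on an invented lookup table, not select() itself; the
objects `map` and `keys` and all symbols/IDs in them are hypothetical.

```r
# Toy lookup table; the symbols and IDs are invented for illustration.
map <- data.frame(SYMBOL   = c("G1", "G2", "G2"),
                  ENTREZID = c("101", "202", "203"),
                  stringsAsFactors = FALSE)
keys <- data.frame(SYMBOL = c("G1", "G2", "G3"),
                   stringsAsFactors = FALSE)

# all.x = TRUE keeps every key and fills unmapped ones with NA.
res <- merge(keys, map, by = "SYMBOL", all.x = TRUE)

nrow(res)                    # more rows than keys, because G2 maps twice
res[is.na(res$ENTREZID), ]   # G3 is kept, with an NA ENTREZID
```

If a result has fewer rows than there are distinct keys, something other
than this left-join behaviour is going on.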

H.

>> If all symbols were converted to all possible Entrez IDs,
>> I would expect it to have more rows, not less. To me, it looks like
>> not all rows are looked up and returned.
>>
>> Do you see what I mean?
>
> Sure. You could be using outdated gene symbols. Or perhaps you are using
> a mixture of symbols and aliases. Which is even cooler than just all
> symbols:
>
>  > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10], Rkeys(org.Hs.egALIAS2EG)[31:45])
>  > symb
>   [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>   [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
> [25] "ABCA1"
>  > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>       SYMBOL ENTREZID
> 1      A1BG        1
> 2       A2M        2
> 3     A2MP1        3
> 4      NAT1        9
> 5      NAT2       10
> 6      AACP       11
> 7  SERPINA3       12
> 8     AADAC       13
> 9      AAMP       14
> 10    AANAT       15
> 11     AAMP       14
> 12    AANAT       15
> 13     DSPS <NA>
> 14     SNAT <NA>
> 15     AARS       16
> 16    CMT2N <NA>
> 17      AAV <NA>
> 18    AAVS1       17
> 19     ABAT       18
> 20  GABA-AT <NA>
> 21    GABAT <NA>
> 22   NPD009 <NA>
> 23    ABC-1 <NA>
> 24     ABC1 <NA>
> 25    ABCA1       19
>  > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>        ALIAS ENTREZID
> 1      A1BG        1
> 2       A2M        2
> 3     A2MP1        3
> 4      NAT1        9
> 5      NAT1     1982
> 6      NAT1     6530
> 7      NAT1    10991
> 8      NAT2       10
> 9      NAT2    81539
> 10     AACP       11
> 11 SERPINA3       12
> 12    AADAC       13
> 13     AAMP       14
> 14    AANAT       15
> 15     DSPS       15
> 16     SNAT       15
> 17     AARS       16
> 18    CMT2N       16
> 19      AAV       17
> 20    AAVS1       17
> 21     ABAT       18
> 22  GABA-AT       18
> 23    GABAT       18
> 24   NPD009       18
> 25    ABC-1       19
> 26     ABC1       19
> 27     ABC1    63897
> 28    ABCA1       19
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' and duplicate query keys resulted in 1:many mapping between
> keys and return rows
>  > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
> $`1982`
> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>
> $`6530`
> [1] "solute carrier family 6 (neurotransmitter transporter,
> noradrenalin), member 2"
>
> $`10991`
> [1] "solute carrier family 38, member 3"
>
> Best,
>
> Jim
>
>>
>> On 25 July 2013 18:17, James W. MacDonald <jmacdon at uw.edu> wrote:
>>> Hi Enrico,
>>>
>>>
>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>> Dear James,
>>>>
>>>> Thanks very much for your prompt reply.
>>>> I knew the problem was the for loop and the select function is indeed
>>>> a lot faster than that and works perfectly with toy data.
>>>>
>>>> However, this is what happens when I try to use it with real data:
>>>>
>>>>> test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>                columns=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>> Warning message:
>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>     'select' and duplicate query keys resulted in 1:many mapping
>>>>     between keys and return rows
>>>>
>>>> which is probably the warning you mentioned.
>>>
>>> That's not the warning I mentioned, but it does point out the same
>>> issue, which is that there is a one-to-many mapping between symbol
>>> and Entrez Gene ID.
>>>
>>> So now you have to decide if you want to be naive (or stupid,
>>> depending on your perspective) or not. You could just cover your
>>> eyes and do this:
>>>
>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>
>>> which will choose for you the first symbol -> gene ID mapping and
>>> nuke the rest. That's nice and quick, but you are making huge
>>> assumptions.
>>>
>>> Or you could decide to be a bit more sophisticated and do something like
>>>
>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>> first.two[x,])
>>>
>>> At this point you can take a look at e.g., thelst[1:10] to see what
>>> we just did
>>>
>>> thelst <- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>     paste(x[,2], collapse = "|"))))
>>>
>>> and here you can look at head(thelst).
>>>
>>> Then you can check to ensure that the first column of thelst is
>>> identical to the first column of df, and proceed as before.
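
The collapse-and-realign idea can be sketched compactly in base R. This
is an illustration on invented data, not the code from the thread: `map`
and `df` are hypothetical, and split()/vapply() are used here in place
of tapply()/do.call().

```r
# Toy lookup table and data; all values are invented for illustration.
map <- data.frame(SYMBOL   = c("G1", "G2", "G2"),
                  ENTREZID = c("101", "202", "203"),
                  stringsAsFactors = FALSE)
df <- data.frame(GeneSymbol = c("G2", "G3", "G1"),
                 Value1     = c(3, 2.5, 1),
                 stringsAsFactors = FALSE)

# One string per symbol, with multiple IDs joined by "|".
collapsed <- vapply(split(map$ENTREZID, map$SYMBOL),
                    paste, character(1), collapse = "|")

# Realign to the rows of df; symbols without a mapping become NA.
df$EntrezGeneID <- unname(collapsed[match(df$GeneSymbol, names(collapsed))])
df
```

This keeps every candidate ID visible instead of silently picking one,
at the cost of a column that needs splitting again before further use.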
>>>
>>> But there is still the problem of the multiple mappings. As an example:
>>>
>>>> thelst[1:5]
>>> $HBD
>>>       SYMBOL  ENTREZID
>>> 2535    HBD      3045
>>> 2536    HBD 100187828
>>>
>>> $KIR3DL3
>>>         SYMBOL  ENTREZID
>>> 17513 KIR3DL3    115653
>>> 17514 KIR3DL3 100133046
>>>
>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>> $`3045`
>>> [1] "hemoglobin, delta"
>>>
>>> $`100187828`
>>> [1] "hypophosphatemic bone disease"
>>>
>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>> $`115653`
>>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>>> cytoplasmic tail, 3"
>>>
>>> $`100133046`
>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>> cytoplasmic
>>> tail 3"
>>>
>>>
>>> So HBD is the gene symbol for two different genes! If this gene
>>> symbol is in your data, you will now have attributed your data to
>>> two genes that apparently are not remotely similar. If KIR3DL3 is in
>>> your data, then it worked out OK for that gene.
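
One way to find such cases before merging anything is to count the
distinct IDs per symbol and list the symbols with more than one. A
minimal sketch on an invented table (`map` and its contents are
hypothetical):

```r
# Toy lookup table; all values are invented for illustration.
map <- data.frame(SYMBOL   = c("G1", "G2", "G2", "G3"),
                  ENTREZID = c("101", "202", "203", "304"),
                  stringsAsFactors = FALSE)

# Number of distinct IDs attached to each symbol.
n_ids <- vapply(split(map$ENTREZID, map$SYMBOL),
                function(x) length(unique(x)), integer(1))

# Symbols that map to more than one ID and need a manual decision.
ambiguous <- names(n_ids)[n_ids > 1]
map[map$SYMBOL %in% ambiguous, ]
```

The rows printed at the end are the ones worth checking by hand, for
example against the gene names.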
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>>
>>>> The real problem is that the number of rows is now different for the 2
>>>> objects:
>>>>> nrow(df); nrow(test)
>>>> [1] 573
>>>> [1] 201
>>>>
>>>> So I obviously can't put the new data into the original df. My
>>>> impression is that when the 1 to many mapping arises, the select
>>>> functions exits, with that warning message. As a result, my test
>>>> object is incomplete.
>>>>
>>>> On top of that, and I can't really explain this, the row positions are
>>>> messed up, e.g.
>>>>
>>>>> all.equal(df[100,],test[100,])
>>>> returns FALSE.
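
When the row counts of the input and the result disagree, it can help to
separate three things: keys missing from the result, keys repeated in
the input, and keys that came back with several rows. A minimal sketch,
assuming invented stand-ins for `df` and `test` (the values are
hypothetical):

```r
# Toy input and result; all values are invented for illustration.
df   <- data.frame(GeneSymbol = c("G1", "G2", "G3", "G1"),
                   stringsAsFactors = FALSE)
test <- data.frame(ALIAS    = c("G1", "G2", "G2"),
                   ENTREZID = c("101", "202", "203"),
                   stringsAsFactors = FALSE)

# Keys that have no row at all in the result.
missing_keys <- setdiff(df$GeneSymbol, test$ALIAS)

# Keys that occur more than once in the input.
repeated_in <- unique(df$GeneSymbol[duplicated(df$GeneSymbol)])

# Keys that came back with more than one row (1:many mappings).
multi_out <- unique(test$ALIAS[duplicated(test$ALIAS)])
```

Row-by-row comparisons such as all.equal(df[100,], test[100,]) are only
meaningful once these three sets are empty.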
>>>>
>>>> How can I work around this?
>>>>
>>>> Thanks a  lot!
>>>>
>>>> Best,
>>>>
>>>> On 25 July 2013 16:58, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>>> Hi Enrico,
>>>>>
>>>>>
>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I often have data frames where I need to perform ID conversions
>>>>>> on one or more of the columns while preserving the order of the
>>>>>> rows, e.g.:
>>>>>>
>>>>>> GeneSymbol    Value1    Value2
>>>>>> GS1           2.5       0.1
>>>>>> GS2           3         0.2
>>>>>> ..
>>>>>>
>>>>>> And I want to obtain:
>>>>>>
>>>>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>>>>> GS1           EG1             2.5       0.1
>>>>>> GS2           EG2             3         0.2
>>>>>> ..
>>>>>>
>>>>>> What I've done so far was to create a function that uses
>>>>>> org.Hs.eg.db to loop over the rows of the column and does the
>>>>>> conversion:
>>>>>>
>>>>>> library(org.Hs.eg.db)
>>>>>> alias2EG <- function(x) {
>>>>>>   for (i in seq_along(x)) {
>>>>>>     if (!is.na(x[i])) {
>>>>>>       # take the first Entrez ID mapped to this alias
>>>>>>       repl <- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>       if (!is.null(repl)) {
>>>>>>         x[i] <- repl
>>>>>>       } else {
>>>>>>         x[i] <- NA
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>   return(x)
>>>>>> }
>>>>>
>>>>> I should first note that gene symbols are not unique, so you are
>>>>> taking a chance on your mappings. Is there no other annotation for
>>>>> your data?
>>>>>
>>>>> In addition, you should note that it is almost always better to
>>>>> think of objects as vectors and matrices in R, rather than as
>>>>> things that need to be looped over (e.g., R isn't Perl or C).
>>>>>
>>>>> first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol),
>>>>>                     "ENTREZID", "SYMBOL")
>>>>>
>>>>> Note that there used to be a warning or an error (don't remember
>>>>> which) when you did something like this, stating that gene symbols
>>>>> are not unique, and that you shouldn't do this sort of thing.
>>>>> Apparently this warning has been removed, but the issue remains
>>>>> valid.
>>>>>
>>>>> ## check yourself
>>>>>
>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>
>>>>> ## if true, proceed
>>>>>
>>>>> df<- data.frame(first.two, df[,-1])
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>> and then call the function like this:
>>>>>>
>>>>>> df$EntrezGeneID <- alias2EG(df$GeneSymbol)
>>>>>>
>>>>>> This works well, but gets very slow when I need to do multiple
>>>>>> conversions on large datasets.
>>>>>>
>>>>>> Is there any way I can achieve the same result but in a quicker, more
>>>>>> efficient way?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>
>>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


