[BioC] Quickest way to convert IDs in a data frame?

Fri Jul 26 00:54:25 CEST 2013

Hi both,

Thanks for your insights, this is extremely interesting!

While I (kind of) understand why NAs get removed, deliberately
truncating the output that way is probably not what most people
expect. It may be worth considering filing a bug report for this?

This also brings me back to my original question: what's the simplest
and most effienct way to create an exact copy of a column containing
converted IDs in a data.frame?

I'm surprised there doesn't seem to be an easy ready-to-go solution,
as I would imagine it is a rather common task to perform. As I
mentioned in my first post, the for loop function works, but it's
highly inefficient.

Any help is greatly appreciated, thank you.

Best,

On 25 July 2013 23:18, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi James,
>
> You're right.
>
> It's actually both: NAs *and* duplicated keys that are mapped to
> more than 1 row are removed from the input. I don't think this
> is documented.
>
> I wonder if select() behavior couldn't be a little bit simpler by
> either preserving or removing all duplicated keys, and not just some
> of them (on a somewhat arbitrary criteria).
>
> Thanks,
> H.
>
>
>
> On 07/25/2013 02:57 PM, James W. MacDonald wrote:
>>
>> Hi Enrico and Herve,
>>
>> This has to do with duplicate entries, but only when the duplicate entry
>> maps to many ENTREZID:
>>
>>  > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>      ALIAS ENTREZID
>> 1 ADORA2A      135
>> 2 ADORA2A      135
>> 3 ADORA2A      135
>> 4 ADORA2A      135
>>
>>  > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>    ALIAS ENTREZID
>> 1   AGT      183
>> 2   AGT      189
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>>    'select' and duplicate query keys resulted in 1:many mapping between
>> keys and return rows
>>
>>  > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>>    ALIAS ENTREZID
>> 1   AGT      183
>> 2   AGT      189
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>>    'select' resulted in 1:many mapping between keys and return rows
>>
>>
>> So in the instances where a gene symbol maps to more than one ENTREZID,
>> the output gets truncated, whereas if it is a one-to-one mapping, it
>> does not.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
>>>
>>> Hi,
>>>
>>> Hervé, that's exactly what I'm trying to say.
>>>
>>> Attached to this email is a tab delimited file with two columns of
>>> GeneSymbols (or Aliases), and here is some simple code to reproduce
>>> the unexpected behaviour:
>>>
>>> library(org.Hs.eg.db)
>>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
>>> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>> # check that mytest has less rows than mydf
>>> nrow(mydf)
>>> nrow(mytest)
>>> # pick a random row: they don't match
>>> mydf[250,]
>>> mytest[250,]
>>>
>>> Ideally, mytest should have the same number and position of rows of
>>> mydf so that I can then cbind them.
>>> If mytest has more rows because of multiple mappings that's also fine:
>>> I can always use merge(mydf, mytest), right?
>>>
>>> Thanks a lot to both for your help, it's very appreciated.
>>> Best,
>>>
>>>
>>> On 25 July 2013 21:32, Hervé Pagès<hpages at fhcrc.org>  wrote:
>>>>
>>>> Hi Enrico,
>>>>
>>>>
>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>>>
>>>>> Hi Enrico,
>>>>>
>>>>> Please don't take things off-list (e.g., use reply-all).
>>>>>
>>>>>
>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> Thanks very much for your help.
>>>>>> There is an issue that needs to be solved before thinking about what's
>>>>>> the best approach in my opinion.
>>>>>>
>>>>>> I don't understand why, but the object created with the call to select
>>>>>> (test in my example, first.two in yours) has a different number of
>>>>>> rows from the original object (df in my example). Specifically it has
>>>>>> *less* rows.
>>>>
>>>>
>>>> I'm surprised it has less rows. It can definitely have more, when some
>>>> of the keys passed to select() are mapped to more than 1 row, but my
>>>> understanding was that select() would propagate unmapped keys to the
>>>> output by placing them in rows stuffed with NAs. So maybe I
>>>> misunderstood how select() works, or its behavior was changed, or
>>>> there is a bug somewhere. Could you please send the code that allows
>>>> us to reproduce this? Thanks.
>>>>
>>>> H.
>>>>
>>>>
>>>>> If all symbols were converted to all possible Entrez IDs,
>>>>>>
>>>>>> I would expect it to have more rows, not less. To me, it looks like
>>>>>> not all rows are looked up and returned.
>>>>>>
>>>>>> Do you see what I mean?
>>>>>
>>>>>
>>>>> Sure. You could be using outdated gene symbols. Or perhaps you are
>>>>> using
>>>>> a mixture of symbols and aliases. Which is even cooler than just all
>>>>> symbols:
>>>>>
>>>>>   >  symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>>>   >  symb
>>>>>    [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>>>>    [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>>>>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>>>>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>>>>> [25] "ABCA1"
>>>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>>>        SYMBOL ENTREZID
>>>>> 1      A1BG        1
>>>>> 2       A2M        2
>>>>> 3     A2MP1        3
>>>>> 4      NAT1        9
>>>>> 5      NAT2       10
>>>>> 6      AACP       11
>>>>> 7  SERPINA3       12
>>>>> 8     AADAC       13
>>>>> 9      AAMP       14
>>>>> 10    AANAT       15
>>>>> 11     AAMP       14
>>>>> 12    AANAT       15
>>>>> 13     DSPS<NA>
>>>>> 14     SNAT<NA>
>>>>> 15     AARS       16
>>>>> 16    CMT2N<NA>
>>>>> 17      AAV<NA>
>>>>> 18    AAVS1       17
>>>>> 19     ABAT       18
>>>>> 20  GABA-AT<NA>
>>>>> 21    GABAT<NA>
>>>>> 22   NPD009<NA>
>>>>> 23    ABC-1<NA>
>>>>> 24     ABC1<NA>
>>>>> 25    ABCA1       19
>>>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>>>         ALIAS ENTREZID
>>>>> 1      A1BG        1
>>>>> 2       A2M        2
>>>>> 3     A2MP1        3
>>>>> 4      NAT1        9
>>>>> 5      NAT1     1982
>>>>> 6      NAT1     6530
>>>>> 7      NAT1    10991
>>>>> 8      NAT2       10
>>>>> 9      NAT2    81539
>>>>> 10     AACP       11
>>>>> 11 SERPINA3       12
>>>>> 12    AADAC       13
>>>>> 13     AAMP       14
>>>>> 14    AANAT       15
>>>>> 15     DSPS       15
>>>>> 16     SNAT       15
>>>>> 17     AARS       16
>>>>> 18    CMT2N       16
>>>>> 19      AAV       17
>>>>> 20    AAVS1       17
>>>>> 21     ABAT       18
>>>>> 22  GABA-AT       18
>>>>> 23    GABAT       18
>>>>> 24   NPD009       18
>>>>> 25    ABC-1       19
>>>>> 26     ABC1       19
>>>>> 27     ABC1    63897
>>>>> 28    ABCA1       19
>>>>> Warning message:
>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>     'select' and duplicate query keys resulted in 1:many mapping
>>>>> between
>>>>> keys and return rows
>>>>>   >  mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>>> $`1982`
>>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>>>
>>>>> $`6530`
>>>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>>>> noradrenalin), member 2"
>>>>>
>>>>> $`10991`
>>>>> [1] "solute carrier family 38, member 3"
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu>   wrote:
>>>>>>>
>>>>>>> Hi Enrico,
>>>>>>>
>>>>>>>
>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>>>
>>>>>>>> Dear James,
>>>>>>>>
>>>>>>>> Thanks very much for your prompt reply.
>>>>>>>> I knew the problem was the for loop and the select function is
>>>>>>>> indeed
>>>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>>>
>>>>>>>> However, this is what happens when I try to use it with real data:
>>>>>>>>
>>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>
>>>>>>>> Warning message:
>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>      'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>> between
>>>>>>>> keys and return rows
>>>>>>>>
>>>>>>>> which is probably the warning you mentioned.
>>>>>>>
>>>>>>>
>>>>>>> That's not the warning I mentioned, but it does point out the same
>>>>>>> issue,
>>>>>>> which is that there is a one to many mapping between symbol and
>>>>>>> entrez gene
>>>>>>> ID.
>>>>>>>
>>>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>>>> depending on
>>>>>>> your perspective) or not. You could just cover your eyes and do this:
>>>>>>>
>>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>>
>>>>>>> which will choose for you the first symbol ->   gene ID mapping and
>>>>>>> nuke the
>>>>>>> rest. That's nice and quick, but you are making huge assumptions.
>>>>>>>
>>>>>>> Or you could decide to be a bit more sophisticated and do
>>>>>>> something like
>>>>>>>
>>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>>>> first.two[x,])
>>>>>>>
>>>>>>> At this point you can take a look at e.g., thelst[1:10] to see what
>>>>>>> we just
>>>>>>> did
>>>>>>>
>>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>>>> paste(x[,2],
>>>>>>> collapse = "|")))
>>>>>>>
>>>>>>> and here you can look at head(thelst).
>>>>>>>
>>>>>>> Then you can check to ensure that the first column of thelst is
>>>>>>> identical to
>>>>>>> the first column of df, and proceed as before.
>>>>>>>
>>>>>>> But there is still the problem of the multiple mappings. As an
>>>>>>> example:
>>>>>>>
>>>>>>>> thelst[1:5]
>>>>>>>
>>>>>>> $HBD
>>>>>>>        SYMBOL  ENTREZID
>>>>>>> 2535    HBD      3045
>>>>>>> 2536    HBD 100187828
>>>>>>>
>>>>>>> $KIR3DL3
>>>>>>>          SYMBOL  ENTREZID
>>>>>>> 17513 KIR3DL3    115653
>>>>>>> 17514 KIR3DL3 100133046
>>>>>>>
>>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>>>
>>>>>>> $`3045`
>>>>>>> [1] "hemoglobin, delta"
>>>>>>>
>>>>>>> $`100187828`
>>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>>
>>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>>>
>>>>>>> $`115653`
>>>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>>>>>>> cytoplasmic tail, 3"
>>>>>>>
>>>>>>> $`100133046`
>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>>>>>> cytoplasmic
>>>>>>> tail 3"
>>>>>>>
>>>>>>>
>>>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>>>> symbol is in
>>>>>>> your data, you will now have attributed your data to two genes that
>>>>>>> apparently are not remotely similar. if KIR3DL3 is in your data,
>>>>>>> then it
>>>>>>> worked out OK for that gene.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> The real problem is that the number of rows is now different for
>>>>>>>> the 2
>>>>>>>> objects:
>>>>>>>>>
>>>>>>>>> nrow(df); nrow(test)
>>>>>>>>
>>>>>>>> [1] 573
>>>>>>>> [1] 201
>>>>>>>>
>>>>>>>> So I obviously can't put the new data into the original df. My
>>>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>>>> functions exits, with that warning message. As a result, my test
>>>>>>>> object is incomplete.
>>>>>>>>
>>>>>>>> On top of that, and I can't really explain this, the row
>>>>>>>> positions are
>>>>>>>> messed up, e.g.
>>>>>>>>
>>>>>>>>> all.equal(df[100,],test[100,])
>>>>>>>>
>>>>>>>> returns FALSE.
>>>>>>>>
>>>>>>>> How can I work around this?
>>>>>>>>
>>>>>>>> Thanks a  lot!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu>    wrote:
>>>>>>>>>
>>>>>>>>> Hi Enrico,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I often have data frames where I need to perform ID conversions on
>>>>>>>>>> one
>>>>>>>>>> or
>>>>>>>>>> more of the columns while preserving the order of the rows, e.g.:
>>>>>>>>>>
>>>>>>>>>> GeneSymbol    Value1    Value2
>>>>>>>>>> GS1    2.5    0.1
>>>>>>>>>> GS2    3    0.2
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>> And I want to obtain:
>>>>>>>>>>
>>>>>>>>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>>>>>>>>> GS1    EG1    2.5    0.1
>>>>>>>>>> GS2    EG2    3    0.2
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>> What I've done so far was to create a function that uses
>>>>>>>>>> org.Hs.eg.db to
>>>>>>>>>> loop over the rows of the column and does the conversion:
>>>>>>>>>>
>>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>>> alias2EG<- function(x) {
>>>>>>>>>> for (i in 1:length(x)) {
>>>>>>>>>> if (!is.na(x[i])) {
>>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>>>>> if (!is.null(repl)) {
>>>>>>>>>> x[i]<- repl
>>>>>>>>>> }
>>>>>>>>>> else {
>>>>>>>>>> x[i]<- NA
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> return(x)
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I should first note that gene symbols are not unique, so you are
>>>>>>>>> taking a
>>>>>>>>> chance on your mappings. Is there no other annotation for your
>>>>>>>>> data?
>>>>>>>>>
>>>>>>>>> In addition, you should note that it is almost always better to
>>>>>>>>> think of
>>>>>>>>> objects as vectors and matrices in R, rather than as things that
>>>>>>>>> need to
>>>>>>>>> be
>>>>>>>>> looped over (e.g., R isn't Perl or C).
>>>>>>>>>
>>>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol),
>>>>>>>>> "ENTREZID",
>>>>>>>>> "SYMBOL")
>>>>>>>>>
>>>>>>>>> Note that there used to be a warning or an error (don't remember
>>>>>>>>> which)
>>>>>>>>> when
>>>>>>>>> you did something like this, stating that gene symbols are not
>>>>>>>>> unique,
>>>>>>>>> and
>>>>>>>>> that you shouldn't do this sort of thing. Apparently this
>>>>>>>>> warning has
>>>>>>>>> been
>>>>>>>>> removed, but the issue remains valid.
>>>>>>>>>
>>>>>>>>> ## check yourself
>>>>>>>>>
>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>>>
>>>>>>>>> ## if true, proceed
>>>>>>>>>
>>>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> and then call the function like this:
>>>>>>>>>>
>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>>>>>>
>>>>>>>>>> This works well, but gets very slow when I need to do multiple
>>>>>>>>>> conversions
>>>>>>>>>> on large datasets.
>>>>>>>>>>
>>>>>>>>>> Is there any way I can achieve the same result but in a
>>>>>>>>>> quicker, more
>>>>>>>>>> efficient way?
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>> Biostatistician
>>>>>>>>> University of Washington
>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>>> Seattle WA 98105-6099
>>>>>>>>>
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>>>
>>>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>
>>>
>>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

-- 
Enrico Ferrero
PhD Student
Steve Russell Lab - Department of Genetics
FlyChip - Cambridge Systems Biology Centre
University of Cambridge

e.ferrero at gen.cam.ac.uk
http://flypress.gen.cam.ac.uk/