[BioC] Quickest way to convert IDs in a data frame?

James W. MacDonald jmacdon at uw.edu
Thu Jul 25 19:17:10 CEST 2013


Hi Enrico,

On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
> Dear James,
>
> Thanks very much for your prompt reply.
> I knew the problem was the for loop and the select function is indeed
> a lot faster than that and works perfectly with toy data.
>
> However, this is what happens when I try to use it with real data:
>
>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' and duplicate query keys resulted in 1:many mapping between
> keys and return rows
>
> which is probably the warning you mentioned.

That's not the warning I mentioned, but it does point out the same 
issue, which is that there is a one-to-many mapping between gene symbol 
and Entrez Gene ID.

So now you have to decide if you want to be naive (or stupid, depending 
on your perspective) or not. You could just cover your eyes and do this:

first.two <- first.two[!duplicated(first.two$SYMBOL),]

which will choose for you the first symbol -> gene ID mapping and nuke 
the rest. That's nice and quick, but you are making huge assumptions.
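
Before discarding anything, it may help to see how many symbols are affected. The following is a minimal sketch on toy data, not on the real select() output: the HBD and KIR3DL3 rows reuse the Entrez Gene IDs shown later in this message, while GS1/EG1 are the placeholders from the original question, and the object names `ambiguous` and `dedup` are made up for illustration.

```r
# Toy stand-in for the data frame returned by select(); GS1/EG1 are
# placeholders, not real identifiers.
first.two <- data.frame(
    SYMBOL   = c("HBD", "HBD", "KIR3DL3", "KIR3DL3", "GS1"),
    ENTREZID = c("3045", "100187828", "115653", "100133046", "EG1"),
    stringsAsFactors = FALSE
)

# Symbols that map to more than one Entrez Gene ID
n.ids <- table(first.two$SYMBOL)
ambiguous <- names(n.ids)[n.ids > 1]

# The quick approach: keep only the first mapping for each symbol
dedup <- first.two[!duplicated(first.two$SYMBOL), ]
```

Here `ambiguous` lists the symbols for which the quick approach silently throws away at least one mapping.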

Or you could decide to be a bit more sophisticated and do something like

thelst <- tapply(seq_len(nrow(first.two)), first.two$SYMBOL,
                 function(x) first.two[x, ])

At this point you can take a look at e.g., thelst[1:10] to see what we 
just did

thelst <- do.call("rbind", lapply(thelst, function(x)
    c(x[1,1], paste(x[,2], collapse = "|"))))

and here you can look at head(thelst).

Then you can check to ensure that the first column of thelst is 
identical to the first column of df, and proceed as before.
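
One thing to watch for is that grouping by symbol returns the groups in sorted order, which need not be the row order of df. The following is a minimal sketch, on toy data with hypothetical values, of collapsing the IDs per symbol and then realigning them to df by name, so that the number and order of rows in df never change. It uses split() and vapply() rather than the tapply()/rbind route above, purely to keep the example short.

```r
# Toy inputs (hypothetical values): df is the original data,
# first.two stands in for the select() result.
df <- data.frame(
    GeneSymbol = c("KIR3DL3", "GS1", "HBD"),
    Value1     = c(2.5, 3, 1.2),
    Value2     = c(0.1, 0.2, 0.3),
    stringsAsFactors = FALSE
)
first.two <- data.frame(
    SYMBOL   = c("KIR3DL3", "KIR3DL3", "GS1", "HBD", "HBD"),
    ENTREZID = c("115653", "100133046", "EG1", "3045", "100187828"),
    stringsAsFactors = FALSE
)

# One string per symbol, with multiple Entrez Gene IDs joined by "|"
collapsed <- vapply(split(first.two$ENTREZID, first.two$SYMBOL),
                    paste, character(1), collapse = "|")

# Look the symbols up by name, so the result follows the row order of df;
# a symbol with no mapping would come back as NA.
df$EntrezGeneID <- unname(collapsed[df$GeneSymbol])
```

Because the lookup is done by name, df keeps its original rows even when the select() result has more or fewer rows than df.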

But there is still the problem of the multiple mappings. As an example, 
going back to the list of per-symbol data frames from the tapply step:

 > thelst[1:5]
$HBD
      SYMBOL  ENTREZID
2535    HBD      3045
2536    HBD 100187828

$KIR3DL3
        SYMBOL  ENTREZID
17513 KIR3DL3    115653
17514 KIR3DL3 100133046

 > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
$`3045`
[1] "hemoglobin, delta"

$`100187828`
[1] "hypophosphatemic bone disease"

 > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
$`115653`
[1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3"

$`100133046`
[1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"


So HBD is the gene symbol for two different genes! If this gene symbol 
is in your data, you will now have attributed your data to two genes 
that apparently are not remotely similar. If KIR3DL3 is in your data, 
then it worked out OK for that gene, since both Entrez Gene IDs carry 
essentially the same gene name.
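
The comparison just made by eye could be roughed out in code. The following is only a sketch: the gene names are typed in from the mget() output above instead of being looked up, and stripping punctuation before comparing is an assumed heuristic, not a rule from the annotation package. The names `gene.names`, `normalize` and `conflicting` are hypothetical.

```r
# Gene names copied from the mget() output above; in practice they would
# come from the annotation package rather than being typed in by hand.
gene.names <- list(
    HBD = c("hemoglobin, delta",
            "hypophosphatemic bone disease"),
    KIR3DL3 = c(
        "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3",
        "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"
    )
)

# Heuristic (an assumption): ignore punctuation and case, so that names
# differing only in commas compare as equal.
normalize <- function(x) tolower(gsub("[[:punct:]]", "", x))

# TRUE when the Entrez Gene IDs for a symbol carry more than one distinct name
conflicting <- vapply(gene.names,
                      function(x) length(unique(normalize(x))) > 1,
                      logical(1))
```

Symbols flagged in `conflicting` are the ones worth checking by hand before attributing data to a gene.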

Best,

Jim




>
> The real problem is that the number of rows is now different for the 2 objects:
>> nrow(df); nrow(test)
> [1] 573
> [1] 201
>
> So I obviously can't put the new data into the original df. My
> impression is that when the 1 to many mapping arises, the select
> functions exits, with that warning message. As a result, my test
> object is incomplete.
>
> On top of that, and I can't really explain this, the row positions are
> messed up, e.g.
>
>> all.equal(df[100,],test[100,])
> returns FALSE.
>
> How can I work around this?
>
> Thanks a lot!
>
> Best,
>
> On 25 July 2013 16:58, James W. MacDonald <jmacdon at uw.edu> wrote:
>> Hi Enrico,
>>
>>
>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>> Hello,
>>>
>>> I often have data frames where I need to perform ID conversions on one or
>>> more of the columns while preserving the order of the rows, e.g.:
>>>
>>> GeneSymbol    Value1    Value2
>>> GS1    2.5    0.1
>>> GS2    3    0.2
>>> ..
>>>
>>> And I want to obtain:
>>>
>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>> GS1    EG1    2.5    0.1
>>> GS2    EG2    3    0.2
>>> ..
>>>
>>> What I've done so far was to create a function that uses org.Hs.eg.db to
>>> loop over the rows of the column and does the conversion:
>>>
>>> library(org.Hs.eg.db)
>>> alias2EG<- function(x) {
>>> for (i in 1:length(x)) {
>>> if (!is.na(x[i])) {
>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>> if (!is.null(repl)) {
>>> x[i]<- repl
>>> }
>>> else {
>>> x[i]<- NA
>>> }
>>> }
>>> }
>>> return(x)
>>> }
>>
>> I should first note that gene symbols are not unique, so you are taking a
>> chance on your mappings. Is there no other annotation for your data?
>>
>> In addition, you should note that it is almost always better to think of
>> objects as vectors and matrices in R, rather than as things that need to be
>> looped over (e.g., R isn't Perl or C).
>>
>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID",
>> "SYMBOL")
>>
>> Note that there used to be a warning or an error (don't remember which) when
>> you did something like this, stating that gene symbols are not unique, and
>> that you shouldn't do this sort of thing. Apparently this warning has been
>> removed, but the issue remains valid.
>>
>> ## check yourself
>>
>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>
>> ## if true, proceed
>>
>> df<- data.frame(first.two, df[,-1])
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> and then call the function like this:
>>>
>>> df$EntrezGeneID<- alias2EG(df$GeneSymbol)
>>>
>>> This works well, but gets very slow when I need to do multiple conversions
>>> on large datasets.
>>>
>>> Is there any way I can achieve the same result but in a quicker, more
>>> efficient way?
>>>
>>> Thank you.
>>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


