[BioC] merging GRange objects

Tue Jun 25 16:14:03 CEST 2013

Thanks Hervé, that helps a lot. Indeed I was getting a lot of duplicates as you pointed out. I shall use your approach. 
Cheers../Murli

-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org] 
Sent: Tuesday, June 25, 2013 2:50 AM
To: Nair, Murlidharan T
Cc: Murli [guest]; bioconductor at r-project.org
Subject: Re: [BioC] merging GRange objects

Murli,

On 06/24/2013 06:00 PM, Nair, Murlidharan T wrote:
> Hi Hervé ,
> I am annotating paired end reads and code is for mate1, mate2 is the same. As you can see I am using the genomic coordinates and trying to annotate them using the UCSC known genes table. I need to ultimately make an association with the coordinates, a reason why I am trying to merge the outputs. I am converting them into data frames and merging them because select returns a data frame, so I have to convert the GRanges object to a data frame to merge them.  I want to make sure is that I am not messing my data when I am merging them. Would the following lines correctly combine them?
>
>> mrg.data1=merge(as.data.frame(trans.names), 
>> as.data.frame(trans.info), by.x="ENTREZID", by.y="GENEID")
>
>> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID",   by.y="TXID")
>
> When I reviewed the first few lines and they seemed ok, but there could always be exceptions. If there is a better way please let me know. I am very new to Bioconductor.

You didn't send us code we can reproduce but here is code that tries to achieve something similar:

   library(GenomicRanges)
   mate.range <- GRanges("chr1", IRanges(c(39800000, 90400000,
39800066),width=500))

   library(VariantAnnotation)
   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
   codingRegions <- refLocsToLocalLocs(mate.range, txdb)

   trans.info <- select(txdb, key=mcols(codingRegions)$TXID,
                        cols=c("GENEID","TXNAME"), keytype="TXID")

   library(org.Hs.eg.db)
   trans.names <- select(org.Hs.eg.db, trans.info$GENEID,
                         c("GENENAME", "SYMBOL"))

   mrg.data1 <- merge(trans.names, trans.info,
                      by.x="ENTREZID", by.y="GENEID")

   mrg.data2 <- merge(mrg.data1, as.data.frame(codingRegions),
                      by.x="TXID", by.y="TXID")

   > mrg.data2
      TXID ENTREZID                                          GENENAME SYMBOL
   1   963    23499           microtubule-actin crosslinking factor 1  MACF1
   2   963    23499           microtubule-actin crosslinking factor 1  MACF1
   3   963    23499           microtubule-actin crosslinking factor 1  MACF1
   4   963    23499           microtubule-actin crosslinking factor 1  MACF1
   5   963    23499           microtubule-actin crosslinking factor 1  MACF1
   6   963    23499           microtubule-actin crosslinking factor 1  MACF1
   7   963    23499           microtubule-actin crosslinking factor 1  MACF1
   8   963    23499           microtubule-actin crosslinking factor 1  MACF1
   9  1687    55144 leucine rich repeat containing 8 family, member D LRRC8D
   10 1687    55144 leucine rich repeat containing 8 family, member D LRRC8D
   11 1688    55144 leucine rich repeat containing 8 family, member D LRRC8D
   12 1688    55144 leucine rich repeat containing 8 family, member D LRRC8D
          TXNAME seqnames    start      end width strand CDSLOC.start 
CDSLOC.end
   1  uc021olw.1     chr1 39800000 39800499   500      +         3060 
     3559
   2  uc021olw.1     chr1 39800066 39800565   500      +         3126 
     3625
   3  uc021olw.1     chr1 39800000 39800499   500      +         3060 
     3559
   4  uc021olw.1     chr1 39800066 39800565   500      +         3126 
     3625
   5  uc021olw.1     chr1 39800000 39800499   500      +         3060 
     3559
   6  uc021olw.1     chr1 39800066 39800565   500      +         3126 
     3625
   7  uc021olw.1     chr1 39800000 39800499   500      +         3060 
     3559
   8  uc021olw.1     chr1 39800066 39800565   500      +         3126 
     3625
   9  uc001dnm.3     chr1 90400000 90400499   500      +         1373 
     1872
   10 uc001dnm.3     chr1 90400000 90400499   500      +         1373 
     1872
   11 uc001dnn.3     chr1 90400000 90400499   500      +         1373 
     1872
   12 uc001dnn.3     chr1 90400000 90400499   500      +         1373 
     1872
      CDSLOC.width PROTEINLOC QUERYID CDSID
   1           500 1020, 1187       1  2948
   2           500 1042, 1209       3  2948
   3           500 1020, 1187       1  2948
   4           500 1042, 1209       3  2948
   5           500 1020, 1187       1  2948
   6           500 1042, 1209       3  2948
   7           500 1020, 1187       1  2948
   8           500 1042, 1209       3  2948
   9           500   458, 624       2  5132
   10          500   458, 624       2  5132
   11          500   458, 624       2  5132
   12          500   458, 624       2  5132

At first glance it looks like the merging worked. But there are so many duplicated rows in the final data.frame that I don't find this approach very appealing:

   > duplicated(mrg.data2)
    [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

Querying the Homo.sapiens package allows you to retrieve all the annotations you want in one single call:

   > library(Homo.sapiens)
   > mrg.data1b <- select(Homo.sapiens, keys=mcols(codingRegions)$TXID,
                          cols=c("GENEID", "TXNAME", "ENTREZID", "GENENAME", "SYMBOL"),
                          keytype="TXID")
   > mrg.data2b <- merge(mrg.data1b, as.data.frame(codingRegions),
                         by.x="TXID", by.y="TXID")

   > mrg.data2b
     TXID GENEID     TXNAME SYMBOL
   1  963  23499 uc021olw.1  MACF1
   2  963  23499 uc021olw.1  MACF1
   3 1687  55144 uc001dnm.3 LRRC8D
   4 1688  55144 uc001dnn.3 LRRC8D
   5 1689   <NA> uc021opq.1   <NA>
                                              GENENAME ENTREZID seqnames 
    start
   1           microtubule-actin crosslinking factor 1    23499     chr1 
39800000
   2           microtubule-actin crosslinking factor 1    23499     chr1 
39800066
   3 leucine rich repeat containing 8 family, member D    55144     chr1 
90400000
   4 leucine rich repeat containing 8 family, member D    55144     chr1 
90400000
   5                                              <NA>     <NA>     chr1 
90400000
          end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC QUERYID
   1 39800499   500      +         3060       3559          500 1020, 
1187       1
   2 39800565   500      +         3126       3625          500 1042, 
1209       3
   3 90400499   500      +         1373       1872          500   458, 
624       2
   4 90400499   500      +         1373       1872          500   458, 
624       2
   5 90400499   500      +         1373       1872          500   458, 
624       2
     CDSID
   1  2948
   2  2948
   3  5132
   4  5132
   5  5132

Note that this solution does not produce all the duplicated rows and it also preserves those rows that correspond to transcripts not linked to a gene id (e.g. transcript uc021opq.1).

Personally, I prefer to merge the annotations to the metadata columns of the GRanges object:

   m <- match(mcols(codingRegions)$TXID, mrg.data1b$TXID)
   mcols(codingRegions) <- cbind(mcols(codingRegions),
                                 DataFrame(mrg.data1b)[m, -1])

   > codingRegions
   GRanges with 5 ranges and 10 metadata columns:
         seqnames               ranges strand |       CDSLOC    PROTEINLOC
            <Rle>            <IRanges>  <Rle> |    <IRanges> <IntegerList>
     [1]     chr1 [39800000, 39800499]      + | [3060, 3559]     1020,1187
     [2]     chr1 [90400000, 90400499]      + | [1373, 1872]       458,624
     [3]     chr1 [90400000, 90400499]      + | [1373, 1872]       458,624
     [4]     chr1 [90400000, 90400499]      + | [1373, 1872]       458,624
     [5]     chr1 [39800066, 39800565]      + | [3126, 3625]     1042,1209
           QUERYID        TXID     CDSID      GENEID      TXNAME      SYMBOL
         <integer> <character> <integer> <character> <character> <character>
     [1]         1         963      2948       23499  uc021olw.1       MACF1
     [2]         2        1687      5132       55144  uc001dnm.3      LRRC8D
     [3]         2        1688      5132       55144  uc001dnn.3      LRRC8D
     [4]         2        1689      5132        <NA>  uc021opq.1        <NA>
     [5]         3         963      2948       23499  uc021olw.1       MACF1
                                                  GENENAME ENTREZID
                                               <character> <factor>
     [1]           microtubule-actin crosslinking factor 1    23499
     [2] leucine rich repeat containing 8 family, member D    55144
     [3] leucine rich repeat containing 8 family, member D    55144
     [4]                                              <NA>     <NA>
     [5]           microtubule-actin crosslinking factor 1    23499
     ---
     seqlengths:
      chr1
        NA

Maybe this is what you wanted i.e. add the GENEID, TXNAME, SYMBOL, GENENAME, and ENTREZID metadata cols to the GRanges object 'codingRegions'?

H.

>
> Thanks for your help.
>
> Cheers../Murli
>
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Monday, June 24, 2013 8:29 PM
> To: Murli [guest]
> Cc: bioconductor at r-project.org; Nair, Murlidharan T
> Subject: Re: [BioC] merging GRange objects
>
> Hi Murli,
>
> On 06/24/2013 04:55 PM, Murli [guest] wrote:
>>
>> Hi,
>>
>> I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation.
>>
>> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> mate.range=
>> GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[isSameCzome]
>> )
>> -offset,start(mate1[isSameCzome])+offset))
>>
>> codingRegions = refLocsToLocalLocs(mate.range, txdb)
>>
>> trans.info=select(txdb, key=values(codingRegions)$TXID, 
>> cols=c("GENEID","TXNAME"), keytype="TXID")
>>
>> trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME",
>> "SYMBOL"))
>>
>> mrg.data1=merge(as.data.frame(trans.names), 
>> as.data.frame(trans.info), by.x="ENTREZID", by.y="GENEID")
>>
>> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID",
>> by.y="TXID")
>
> It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code.
>
> The exact meaning of "merging" depends on the objects involved.
> Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind().
> Note that since GRanges and GappedAlignments objects are conceptually 
> vector-like objects of dimension 1, only c() works on them. That is, 
> rbind(), cbind(), and merge() (which are typically operating on 2-D
> objects) are not supported on those objects.
>
> Cheers,
> H.
>
>>
>> Thanks ../Murli
>>
>>
>>
>>    -- output of sessionInfo():
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>    [7] LC_PAPER=C                 LC_NAME=C
>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>>    [1] biomaRt_2.16.0
>>    [2] org.Hs.eg.db_2.9.0
>>    [3] RSQLite_0.11.4
>>    [4] DBI_0.2-7
>>    [5] VariantAnnotation_1.6.6
>>    [6] Rsamtools_1.12.3
>>    [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19
>>    [8] BSgenome_1.28.0
>>    [9] Biostrings_2.28.0
>> [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2
>> [11] GenomicFeatures_1.12.2
>> [12] AnnotationDbi_1.22.6
>> [13] Biobase_2.20.0
>> [14] GenomicRanges_1.12.4
>> [15] IRanges_1.18.1
>> [16] BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5       RCurl_1.95-4.1     rtracklayer_1.20.2 stats4_3.0.1
>> [5] tcltk_3.0.1        tools_3.0.1        XML_3.98-1.1       zlibbioc_1.6.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319