[BioC] GenomicRanges : How to retrieve exon ID and gene ID from exon coordinates?

Tue Sep 11 22:24:35 CEST 2012

Hi Ying,

On 9/11/2012 4:06 PM, ying chen wrote:
> Hi,
>
> I read the vignettes and tried the code Jim suggested, but got an 
> error message I could not figure out why.
>
> Here is what I did:
>
>     > library(GenomicFeatures)
>     > library(GenomicRanges)
>     > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>     > TCGA_CRC <-
>     read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAsFactors=FALSE)
>     > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns =
>     c("exon_id","gene_id","tx_id"))
>     > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom,
>     ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$ExonID)
>     > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC,
>     ex),]
>     Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] :
>       selecting rows: subscript contains NAs or out of bounds indices
>

You will get that error if there are ranges in TCGA_CRC that are not in 
ex. So the assumption here is that the exons from your RNA-seq 
experiment are exons that match up with what is in the knownGene table 
from UCSC.

The fact that you have counts from exons that don't align to the 
expected ranges from the version of the knownGene table you are using is 
problematic, and implies that something is amiss. A first check is to 
ensure that your RNA-seq data (and counts) refer to the hg19 build.

You can find out which ranges are different between the two using the 
%in% operator:

x <- which(!TCGA_CRC %in% ex)

Which will at least tell you how far off you are. Hypothetically, if 
there are only a few mismatching ranges, you could subset the TCGA_CRC 
GRanges object, but personally I would want to know why things aren't 
agreeing.

Alternatively you could use the HTSeq python scripts to get the counts, 
in which case you won't need to bother with all of this, or you could 
use the summarizeOverlaps() function in GenomicRanges to do the 
counting, which will again ensure that the counts you have align to the 
exons from the knownGene table.

Best,

Jim

>     >
>
> I cannot tell what is subscript it mentioned.
>
> Any suggestion?
>
> Thanks a lot fo the help!
>
> Ying
>
> The following are the some info:
>
> > sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252    LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                           LC_TIME=English_United 
> States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 
> GenomicFeatures_1.8.3                   AnnotationDbi_1.18.3
> [4] Biobase_2.16.0                          
> GenomicRanges_1.8.13                    IRanges_1.14.4
> [7] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
>  [1] biomaRt_2.12.0     Biostrings_2.24.1  bitops_1.0-4.1     
> BSgenome_1.24.0    DBI_0.2-5          RCurl_1.91-1.1     Rsamtools_1.8.6
>  [8] RSQLite_0.11.1     rtracklayer_1.16.3 stats4_2.15.1      
> tools_2.15.1       XML_3.9-4.1        zlibbioc_1.2.0
> >
>
> > TCGA_CRC
> GRanges with 239886 ranges and 1 elementMetadata col:
>            seqnames                 ranges strand   
> |                        name
> <Rle> <IRanges> <Rle>   | <character>
>        [1]    chr10 [100003848, 100004653]      +   | 
> chr10:100003848-100004653:+
>        [2]    chr10 [100007443, 100008748]      -   | 
> chr10:100008748-100007443:-
>        [3]    chr10 [100010822, 100010933]      -   | 
> chr10:100010933-100010822:-
>        [4]    chr10 [100011323, 100011459]      -   | 
> chr10:100011459-100011323:-
>        [5]    chr10 [100012110, 100012225]      -   | 
> chr10:100012225-100012110:-
>        [6]    chr10 [100013310, 100013553]      -   | 
> chr10:100013553-100013310:-
>        [7]    chr10 [100015334, 100015496]      -   | 
> chr10:100015496-100015334:-
>        [8]    chr10 [100016537, 100016704]      -   | 
> chr10:100016704-100016537:-
>        [9]    chr10 [100017407, 100017561]      -   | 
> chr10:100017561-100017407:-
>        ...      ...                    ...    ... 
> ...                         ...
>   [239878]     chrY     [9611654, 9611928]      -   |      
> chrY:9611928-9611654:-
>   [239879]     chrY     [9638762, 9638916]      +   |      
> chrY:9638762-9638916:+
>   [239880]     chrY     [9642383, 9642494]      +   |      
> chrY:9642383-9642494:+
>   [239881]     chrY     [9646920, 9646994]      +   |      
> chrY:9646920-9646994:+
>   [239882]     chrY     [9647680, 9647718]      +   |      
> chrY:9647680-9647718:+
>   [239883]     chrY     [9650809, 9650854]      +   |      
> chrY:9650809-9650854:+
>   [239884]     chrY     [9748407, 9748463]      +   |      
> chrY:9748407-9748463:+
>   [239885]     chrY     [9748577, 9748722]      +   |      
> chrY:9748577-9748722:+
>   [239886]     chrY     [9749263, 9749571]      +   |      
> chrY:9749263-9749571:+
>   ---
>   seqlengths:
>     chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  
> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  
> chrM  chrX  chrY
>       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    
> NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    
> NA    NA    NA    NA
> >
> > ex
> GRanges with 286852 ranges and 3 elementMetadata cols:
>            seqnames               ranges strand   |   
> exon_id                   gene_id                   tx_id
> <Rle> <IRanges> <Rle>   | <integer> <CompressedCharacterList> 
> <CompressedIntegerList>
>        [1]     chr1     [ 11874,  12227]      +   |         
> 1                        NA                   1,2,3
>        [2]     chr1     [ 12595,  12721]      +   |         
> 5                        NA                       3
>        [3]     chr1     [ 12613,  12721]      +   |         
> 2                        NA                       1
>        [4]     chr1     [ 12646,  12697]      +   |         
> 4                        NA                       2
>        [5]     chr1     [ 13221,  14409]      +   |         
> 3                        NA                     1,2
>        [6]     chr1     [ 13403,  14409]      +   |         
> 6                        NA                       3
>        [7]     chr1     [ 69091,  70008]      +   |        
> 37                     79501                      25
>        [8]     chr1     [321084, 321115]      +   |        
> 44                        NA                      28
>        [9]     chr1     [321146, 321207]      +   |        
> 45                        NA                      29
>        ...      ...                  ...    ... ...       
> ...                       ...                     ...
>   [286844]     chrY [27603458, 27603499]      -   |    
> 142741                        NA                   40070
>   [286845]     chrY [27604352, 27604379]      -   |    
> 142742                        NA                   40071
>   [286846]     chrY [27605645, 27605678]      -   |    
> 142743                        NA                   40072
>   [286847]     chrY [27606394, 27606421]      -   |    
> 142744                        NA                   40073
>   [286848]     chrY [27607404, 27607432]      -   |    
> 142745                        NA                   40074
>   [286849]     chrY [27635919, 27635954]      -   |    
> 142750                        NA                   40078
>   [286850]     chrY [59358329, 59359508]      -   |    
> 142802                        NA                   40099
>   [286851]     chrY [59360007, 59360115]      -   |    
> 142803                        NA                   40099
>   [286852]     chrY [59360501, 59360854]      -   |    
> 142804                        NA                   40099
>   ---
>   seqlengths:
>                     chr1                  chr2                  chr3 
> ...                  chrM        chrUn_gl000226 chr18_gl000207_random
>                249250621             243199373             198022430 
> ...                 16571                 15008                  4262
> >
>
> > Date: Mon, 10 Sep 2012 17:29:24 -0400
> > From: jmacdon at uw.edu
> > To: ying_chen at live.com
> > CC: bioconductor at r-project.org
> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon 
> coordinates?
> >
> > Hi Ying,
> >
> > On 9/10/2012 4:54 PM, ying chen wrote:
> > >
> > >
> > > Hi guys, I have a RNASeq data table which has exon cooridinates 
> (chrom, start. end) and raw count. I want to use DEXseq to see 
> differential transcripts. To do it I need to get geneIDs and exonIDs 
> from corresponding exon cooridinates. Any suggestion how to do it? 
> Thanks a lot for the help!
> >
> > You don't give much to go on. Assuming you are working with a common
> > species, it is simple. Let's assume you are working with mice.
> >
> > Something like this should work:
> >
> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE)
> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns =
> > c("exon_id","gene_id"))
> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start,
> > end=yourdata$end))
> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),]
> >
> > If you are planning on doing this sort of stuff, do yourself a favor 
> and
> > read the GenomicFeatures and GenomicRanges vignettes. They are chock
> > full of info that you will need.
> >
> > Best,
> >
> > Jim
> >
> >
> >
> > > Ying
> > > [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > --
> > James W. MacDonald, M.S.
> > Biostatistician
> > University of Washington
> > Environmental and Occupational Health Sciences
> > 4225 Roosevelt Way NE, # 100
> > Seattle WA 98105-6099
> >

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099