[BioC] Problem with genomicFeatures: id2name

Marc Carlson mcarlson at fhcrc.org
Thu Sep 30 23:36:16 CEST 2010


Hi Paul,

The NAs are there because there are no unique IDs (that we can find) for
these elements.  In practice we almost never get unique IDs for CDS or
exons from either Ensembl or UCSC.  But there is always hope that this
will change in the future.
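
In the meantime, one possible workaround (only a sketch, not something
GenomicFeatures provides) is to build surrogate labels from the coordinates
themselves, keyed by the internal IDs.  The small data frame below stands in
for the first rows of cds(txdb) from your session, its column layout is an
assumption about what as.data.frame() would give you, and
make_surrogate_names is a hypothetical helper name:

```r
# Toy stand-in for the first three rows of cds(txdb) in the quoted session.
# Assumption: a real table with these columns could be obtained from the
# GRanges object, e.g. via as.data.frame(the.cds).
cds_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr1"),
  start    = c(69091L, 367659L, 721406L),
  end      = c(70008L, 368597L, 721912L),
  strand   = c("+", "+", "+"),
  cds_id   = c(10762L, 10763L, 10765L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT part of GenomicFeatures): build a
# "chr:start-end:strand" label for every row and name the result
# by the internal ID column.
make_surrogate_names <- function(df, id_col) {
  labels <- paste0(df$seqnames, ":", df$start, "-", df$end, ":", df$strand)
  setNames(labels, df[[id_col]])
}

cds.id.to.label <- make_surrogate_names(cds_df, "cds_id")

# Look up the label for one internal cds_id
cds.id.to.label[["10762"]]
```

Labels of this kind are not real exon or CDS names, but they might serve as
a join key against coordinates retrieved separately (for example with
biomaRt) when annotation is needed.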

  Marc



On 09/23/2010 06:21 PM, Paul Leo wrote:
> id2name(txdb, feature.type="cds") and id2name(txdb,
> feature.type="exon") both return all NAs for Ensembl and RefSeq.
>
> The cds_ids perhaps don't have names? But the exon result is strange
> for Ensembl.
> Using the.cds<-cds(txdb,columns=c("cds_id","tx_id","tx_name")) takes a
> *VERY* long time, but is perhaps not intended for use on a whole-genome
> scale (often)?
>
> Looking for a quick way to map the cds_ids or exon_ids to exon_names etc.
> so I can complete the annotations with biomaRt when needed.
>
>
>   
>> txdb
>>     
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: ensGene
> | Type of Gene ID: Ensembl gene ID
> | Full dataset: yes
> | transcript_nrow: 151222
> | exon_nrow: 470051
> | cds_nrow: 264558
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-09-24 11:00:14 +1000 (Fri, 24 Sep 2010)
> | GenomicFeatures version at creation time: 1.1.12
> | RSQLite version at creation time: 0.9-2
>   
>> the.cds<-cds(txdb)
>> the.cds
>>     
> GRanges with 264558 ranges and 1 elementMetadata value
>          seqnames               ranges strand   |    cds_id
>             <Rle>            <IRanges>  <Rle>   | <integer>
>      [1]     chr1     [ 69091,  70008]      +   |     10762
>      [2]     chr1     [367659, 368597]      +   |     10763
>      [3]     chr1     [721406, 721912]      +   |     10765
>      [4]     chr1     [861322, 861393]      +   |     10766
>      [5]     chr1     [865535, 865716]      +   |     10767
>      [6]     chr1     [865692, 865716]      +   |     10782
>      [7]     chr1     [866419, 866469]      +   |     10768
>      [8]     chr1     [871152, 871173]      +   |     10772
>      [9]     chr1     [871152, 871276]      +   |     10769
>      ...      ...                  ...    ... ...       ...
> [264550]     chrY [26951104, 26951167]      -   |    139000
> [264551]     chrY [26951604, 26951655]      -   |    139001
> [264552]     chrY [26952216, 26952307]      -   |    139002
> [264553]     chrY [26952582, 26952728]      -   |    139003
> [264554]     chrY [26959330, 26959332]      -   |    139004
> [264555]     chrY [27184245, 27184263]      -   |    139018
> [264556]     chrY [27184956, 27185061]      -   |    139019
> [264557]     chrY [27187916, 27188033]      -   |    139020
> [264558]     chrY [27190093, 27190170]      -   |    139021
>
> seqlengths
>                   chr1                  chr2 ... chr18_gl000207_random
>              249250621             243199373 ...                  4262
>   
>> ?id2name
>> cds.id.to.name<-id2name(txdb, feature.type="cds")
>> length(cds.id.to.name)
>>     
> [1] 264558
>   
>> sum(!is.na(cds.id.to.name))
>>     
> [1] 0 ## ALL NA's
>
>   
>> exon.id.to.name<-id2name(txdb, feature.type="exon")
>> exon.id.to.name[40000:40100]
>>     
> 40000 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010 40011 40012
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40013 40014 40015 40016 40017 40018 40019 40020 40021 40022 40023 40024 40025
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40026 40027 40028 40029 40030 40031 40032 40033 40034 40035 40036 40037 40038
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40039 40040 40041 40042 40043 40044 40045 40046 40047 40048 40049 40050 40051
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40052 40053 40054 40055 40056 40057 40058 40059 40060 40061 40062 40063 40064
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40065 40066 40067 40068 40069 40070 40071 40072 40073 40074 40075 40076 40077
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40078 40079 40080 40081 40082 40083 40084 40085 40086 40087 40088 40089 40090
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> 40091 40092 40093 40094 40095 40096 40097 40098 40099 40100
>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>   
>> length(exon.id.to.name)
>>     
> [1] 470051
>   
>> sum(!is.na(exon.id.to.name))
>>     
> [1] 0
>   
>> tx.id.to.n
>>     
> ################# they are all missing same is true for 
>   
>> sessionInfo()
>>     
> R version 2.13.0 Under development (unstable) (2010-09-20 r52949)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
>  [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8   
>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.16 BSgenome_1.17.7
> [3] Biostrings_2.17.47                 GenomicFeatures_1.1.12
> [5] GenomicRanges_1.1.25               IRanges_1.7.34
> [7] biomaRt_2.5.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.1     DBI_0.2-5         RCurl_1.4-3       RSQLite_0.9-2
> [5] rtracklayer_1.9.9 tools_2.13.0      XML_3.1-1
>   
>>     
>


