[BioC] problem generating Chinese hamster txdb from GFF file

Thu Aug 15 22:44:10 CEST 2013

Hi Guido,

The problem is that your keys are not valid.  The keys for GENEID can be 
retrieved with the keys method like this:

k <- keys(txdb, keytype="GENEID")

And if we look at them we can see that they don't look anything like the 
keys you are trying to extract data for:

head(k)
[1] "gene10"    "gene1000"  "gene10000" "gene10001" "gene10002" "gene10003"

So if you use keys like that (valid keys), then everything works fine.  
For example:

select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID")

      GENEID  TXNAME
1    gene10    rna8
2    gene10    rna7
3  gene1000  rna917
4 gene10000 rna9527
5 gene10001 rna9528
6 gene10002 rna9530
7 gene10002 rna9529
8 gene10003 rna9531

Also I noticed that you are getting the warning about deducing the exon 
ranks.  Are you sure you want to do that with a mammal?  This file 
contains a field called "exon_number" that looks like something you 
should probably try to use.

And finally, since you seem to want to use a real gene ID, you will have 
to extract those from the file separately.  This is because whoever 
created this file decided that they wanted to encode the actual gene 
symbols here in a peculiar way.  So you can do that step separately by 
using the rtracklayer import() method like this:

library(rtracklayer)
result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3")

If you look at what is in result, you will see (for example) that gene0 
is cross referenced to GeneID:100754456 (for example).

Hope this clarifies things,

   Marc

On 08/15/2013 10:20 AM, Hooiveld, Guido wrote:
> Hello,
> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer.
>
> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI.
> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/]
>
> Creating the txdb seems to go fine:
>
>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/", species="Cricetulus griseus")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gff") :
>    Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
>    None of the strings in your circ_seqs argument match your seqnames.
>
> # Characterization txdb
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/
> | Organism: Cricetulus griseus
> | miRBase build ID: NA
> | transcript_nrow: 21612
> | exon_nrow: 212163
> | cds_nrow: 183531
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013)
> | GenomicFeatures version at creation time: 1.13.29
> | RSQLite version at creation time: 0.11.4
> | DBSCHEMAVERSION: 1.0
>> columns(txdb)
> [1] "CDSID"      "CDSNAME"    "CDSCHROM"   "CDSSTRAND"  "CDSSTART"   "CDSEND"     "EXONID"
>   [8] "EXONNAME"   "EXONCHROM"  "EXONSTRAND" "EXONSTART"  "EXONEND"    "GENEID"     "TXID"
> [15] "EXONRANK"   "TXNAME"     "TXCHROM"    "TXSTRAND"   "TXSTART"    "TXEND"
>> keytypes(txdb)
> [1] "GENEID"   "TXID"     "TXNAME"   "EXONID"   "EXONNAME" "CDSID"    "CDSNAME"
> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF.
> Unfortunately, this doesn't seem to work:
>> keys <- c("100754456", "100759368", "100760238")
>> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID")
> [1] GENEID TXNAME
> <0 rows> (or 0-length row.names)
> Note: the human example from the vignette worked fine.
>
> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs:
>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name"))
>> head(GR)
> GRanges with 6 ranges and 5 metadata columns:
>              seqnames             ranges strand |     tx_id     tx_name         gene_id        cds_id
>                 <Rle>          <IRanges>  <Rle> | <integer> <character> <CharacterList> <IntegerList>
>    [1] NW_003613580.1 [ 736932,  738552]      + |         1        rna0           gene2           1,2
>    [2] NW_003613580.1 [2073869, 2085513]      + |         2        rna4           gene6     3,4,5,...
>    [3] NW_003613580.1 [2143936, 2241368]      + |         3        rna6           gene8    8,9,10,...
>    [4] NW_003613580.1 [2390123, 2449470]      + |         4        rna8          gene10  22,23,24,...
>    [5] NW_003613580.1 [2390123, 2555467]      + |         5        rna7          gene10  22,23,24,...
>    [6] NW_003613580.1 [3070833, 3147709]      + |         6        rna9          gene12  43,44,45,...
>               cds_name
>        <CharacterList>
>    [1]              NA
>    [2]              NA
>    [3]              NA
>    [4]              NA
>    [5]              NA
>    [6]              NA
>    ---
>    seqlengths:
>     NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
>                 NA             NA             NA ...             NA             NA             NA
>> tail(GR)
> GRanges with 6 ranges and 5 metadata columns:
>              seqnames    ranges strand |     tx_id     tx_name         gene_id        cds_id
>                 <Rle> <IRanges>  <Rle> | <integer> <character> <CharacterList> <IntegerList>
>    [1] NW_003694533.1 [29, 205]      - |     21607    rna22504       gene25148        183527
>    [2] NW_003697941.1 [ 4, 135]      - |     21608    rna22505       gene25149        183528
>    [3] NW_003698068.1 [ 1, 267]      + |     21609    rna22506       gene25150        183529
>    [4] NW_003698075.1 [ 1, 267]      - |     21610    rna22507       gene25151            NA
>    [5] NW_003704321.1 [24, 203]      - |     21611    rna22508       gene25152        183530
>    [6] NW_003717424.1 [10, 189]      + |     21612    rna22509       gene25155        183531
>               cds_name
>        <CharacterList>
>    [1]              NA
>    [2]              NA
>    [3]              NA
>    [4]              NA
>    [5]              NA
>    [6]              NA
>    ---
>    seqlengths:
>     NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
>                 NA             NA             NA ...             NA             NA             NA
>
>
> Any suggestions on how to properly generate a txdb from this GFF would be appreciated.
>
> Thanks,
> Guido
>
>> sessionInfo()
> R Under development (unstable) (2013-08-10 r63532)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18   Biobase_2.21.6          GenomicRanges_1.13.36
> [5] XVector_0.1.0           IRanges_1.19.21         BiocGenerics_0.7.3
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.17.2     Biostrings_2.29.14 bitops_1.0-5       BSgenome_1.29.1    DBI_0.2-7
>   [6] RCurl_1.95-4.1     Rsamtools_1.13.28  RSQLite_0.11.4     rtracklayer_1.21.9 stats4_3.1.0
> [11] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.7.0
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor