[BioC] problem generating Chinese hamster txdb from GFF file

Thu Aug 22 21:02:42 CEST 2013

Marc, thanks for coming back to this.
As said before, I obtained the GFF from the NCBI, and assumed these would comply with all standards. Apparently this is not the case for the CHO GFF, but remarkably the exon ranking is also not present in e.g. the latest GFF for the human genome available at the NCBI... ?? (ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/).  
Apparently, more people have noticed the lack of confirmation to the standards, see e.g.: http://blastedbio.blogspot.nl/2011/08/why-are-ncbi-gff3-files-still-broken.html
Anyway, I will contact the NCBI regarding this issue, and report back if needed.

Thanks again,
Guido

-----Original Message-----
From: Marc Carlson [mailto:mcarlson at fhcrc.org] 
Sent: Wednesday, August 21, 2013 01:54
To: Hooiveld, Guido
Cc: bioconductor at r-project.org
Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF file

Hi Guido,

Sorry for the delay.  Your reply seems to have snuck past my radar.

Below is the code I ended up using to process this file.  I looked more closely at the file you are trying to parse and I noticed that there is something very strange about it.  Many different features are labeled as being of type "exon", but only SOME of them actually have the exon ranking information.  :(

But what is really odd here is that the features which don't have an exon ranking don't seem to be the "same" as the ones that do. I would recommend that you get an explanation for that from the person who generated this file.

Anyhow I also made another quick update to GenomicFeatures to allow you to better make use of the other IDs that are in there. So with version
1.13.33 you should be able to do this (for example), and get gene ids that may be closer to what you want.  I aim to do a little more in this department, but I have some other things I have to do 1st.

library(GenomicFeatures)

txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3",
                                 format="gff3",
                                 gffGeneIdAttributeName="Name", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF",
                                 species="Cricetulus griseus")

   Marc

On 08/16/2013 01:59 AM, Hooiveld, Guido wrote:
> Hi Marc,
> Thanks for helpful comments!
> Keys, keytypes, columns, select,.... still difficult for me...  Moreover, it is the first time I am working with these type of files, so I still have to get used to all concepts and do's & don'ts.
> Anyway, I tried to regenerate the txdb utilizing the info on exon ranks. Unfortunately, this doesn't work in my hands... Am I missing something obviously, or is something going seriously wrong? Note that the error thrown is slightly different between the production and dev versions of the packages (see below).
>
> I will also check the functions in rtracklayer.
>
> Regards,
> Guido
>
>
> Dev:
>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", 
>> format="gff3", exonRankAttributeName="exon_number",
> + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF
> + F", species="Cricetulus griseus")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Error in if (any(splicings$exon_rank == 0) && !any(splicings$exon_rank <  :
>    missing value where TRUE/FALSE needed
>> sessionInfo()
> R Under development (unstable) (2013-08-10 r63532)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18   Biobase_2.21.6          rtracklayer_1.21.9      GenomicRanges_1.13.36   XVector_0.1.0           IRanges_1.19.24
> [8] BiocGenerics_0.7.4      BiocInstaller_1.11.4
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.17.2     Biostrings_2.29.15 bitops_1.0-5       BSgenome_1.29.1    DBI_0.2-7          RCurl_1.95-4.1     Rsamtools_1.13.29  RSQLite_0.11.4     stats4_3.1.0
> [10] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.7.0
>
> Production:
>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", 
>> format="gff3", exonRankAttributeName="exon_number",
> + dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF
> + F", species="Cricetulus griseus")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Prepare the 'metadata' data frame ... metadata: OK Now generating 
> chrominfo from available sequence names. No chromosome length information is available.
> Error in .normargSplicings(splicings, transcripts_tx_id) :
>    'splicings$exon_rank' must be an integer vector with no NAs In 
> addition: Warning message:
> In matchCircularity(chroms, circ_seqs) :
>    None of the strings in your circ_seqs argument match your seqnames.
>> sessionInfo()
> R version 3.0.1 Patched (2013-06-05 r62877)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 GenomicFeatures_1.12.3                  AnnotationDbi_1.22.6                    Biobase_2.20.1
> [5] GenomicRanges_1.12.4                    IRanges_1.18.2                          BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.16.0     Biostrings_2.28.0  bitops_1.0-5       BSgenome_1.28.0    DBI_0.2-7          RCurl_1.95-4.1     Rsamtools_1.12.3   RSQLite_0.11.4
>   [9] rtracklayer_1.20.4 stats4_3.0.1       tools_3.0.1        XML_3.98-1.1       zlibbioc_1.6.0
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org 
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Marc Carlson
> Sent: Thursday, August 15, 2013 22:44
> To: bioconductor at r-project.org
> Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF 
> file
>
> Hi Guido,
>
> The problem is that your keys are not valid.  The keys for GENEID can be retrieved with the keys method like this:
>
> k <- keys(txdb, keytype="GENEID")
>
> And if we look at them we can see that they don't look anything like the keys you are trying to extract data for:
>
> head(k)
> [1] "gene10"    "gene1000"  "gene10000" "gene10001" "gene10002" "gene10003"
>
>
> So if you use keys like that (valid keys), then everything works fine.
> For example:
>
> select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID")
>
>        GENEID  TXNAME
> 1    gene10    rna8
> 2    gene10    rna7
> 3  gene1000  rna917
> 4 gene10000 rna9527
> 5 gene10001 rna9528
> 6 gene10002 rna9530
> 7 gene10002 rna9529
> 8 gene10003 rna9531
>
>
> Also I noticed that you are getting the warning about deducing the exon ranks.  Are you sure you want to do that with a mammal?  This file contains a field called "exon_number" that looks like something you should probably try to use.
>
> And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately.  This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way.  So you can do that step separately by using the rtracklayer import() method like this:
>
> library(rtracklayer)
> result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3")
>
> If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example).
>
>
> Hope this clarifies things,
>
>
>     Marc
>
>
>
>
> On 08/15/2013 10:20 AM, Hooiveld, Guido wrote:
>> Hello,
>> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer.
>>
>> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI.
>> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/]
>>
>> Creating the txdb seems to go fine:
>>
>>> txdb <- 
>>> makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", 
>>> format="gff3", 
>>> dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GF
>>> F/", species="Cricetulus griseus")
>> extracting transcript information
>> Extracting gene IDs
>> extracting transcript information
>> Processing splicing information for gff3 file.
>> Deducing exon rank from relative coordinates provided Prepare the 
>> 'metadata' data frame ... metadata: OK Now generating chrominfo from 
>> available sequence names. No chromosome length information is available.
>> Warning messages:
>> 1: In .deduceExonRankings(exs, format = "gff") :
>>     Infering Exon Rankings.  If this is not what you expected, then 
>> please be sure that you have provided a valid attribute for 
>> exonRankAttributeName
>> 2: In matchCircularity(chroms, circ_seqs) :
>>     None of the strings in your circ_seqs argument match your seqnames.
>>
>> # Characterization txdb
>>> txdb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Supporting package: GenomicFeatures Data source: 
>> | ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/
>> | Organism: Cricetulus griseus
>> | miRBase build ID: NA
>> | transcript_nrow: 21612
>> | exon_nrow: 212163
>> | cds_nrow: 183531
>> | Db created by: GenomicFeatures package from Bioconductor Creation 
>> | time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013) GenomicFeatures 
>> | version at creation time: 1.13.29 RSQLite version at creation time: 
>> | 0.11.4
>> | DBSCHEMAVERSION: 1.0
>>> columns(txdb)
>> [1] "CDSID"      "CDSNAME"    "CDSCHROM"   "CDSSTRAND"  "CDSSTART"   "CDSEND"     "EXONID"
>>    [8] "EXONNAME"   "EXONCHROM"  "EXONSTRAND" "EXONSTART"  "EXONEND"    "GENEID"     "TXID"
>> [15] "EXONRANK"   "TXNAME"     "TXCHROM"    "TXSTRAND"   "TXSTART"    "TXEND"
>>> keytypes(txdb)
>> [1] "GENEID"   "TXID"     "TXNAME"   "EXONID"   "EXONNAME" "CDSID"    "CDSNAME"
>> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF.
>> Unfortunately, this doesn't seem to work:
>>> keys <- c("100754456", "100759368", "100760238") select(txdb, keys = 
>>> keys, columns="TXNAME", keytype="GENEID")
>> [1] GENEID TXNAME
>> <0 rows> (or 0-length row.names)
>> Note: the human example from the vignette worked fine.
>>
>> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs:
>>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", 
>>> "cds_id","cds_name"))
>>> head(GR)
>> GRanges with 6 ranges and 5 metadata columns:
>>               seqnames             ranges strand |     tx_id     tx_name         gene_id        cds_id
>>                  <Rle>          <IRanges>  <Rle> | <integer> <character> <CharacterList> <IntegerList>
>>     [1] NW_003613580.1 [ 736932,  738552]      + |         1        rna0           gene2           1,2
>>     [2] NW_003613580.1 [2073869, 2085513]      + |         2        rna4           gene6     3,4,5,...
>>     [3] NW_003613580.1 [2143936, 2241368]      + |         3        rna6           gene8    8,9,10,...
>>     [4] NW_003613580.1 [2390123, 2449470]      + |         4        rna8          gene10  22,23,24,...
>>     [5] NW_003613580.1 [2390123, 2555467]      + |         5        rna7          gene10  22,23,24,...
>>     [6] NW_003613580.1 [3070833, 3147709]      + |         6        rna9          gene12  43,44,45,...
>>                cds_name
>>         <CharacterList>
>>     [1]              NA
>>     [2]              NA
>>     [3]              NA
>>     [4]              NA
>>     [5]              NA
>>     [6]              NA
>>     ---
>>     seqlengths:
>>      NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
>>                  NA             NA             NA ...             NA             NA             NA
>>> tail(GR)
>> GRanges with 6 ranges and 5 metadata columns:
>>               seqnames    ranges strand |     tx_id     tx_name         gene_id        cds_id
>>                  <Rle> <IRanges>  <Rle> | <integer> <character> <CharacterList> <IntegerList>
>>     [1] NW_003694533.1 [29, 205]      - |     21607    rna22504       gene25148        183527
>>     [2] NW_003697941.1 [ 4, 135]      - |     21608    rna22505       gene25149        183528
>>     [3] NW_003698068.1 [ 1, 267]      + |     21609    rna22506       gene25150        183529
>>     [4] NW_003698075.1 [ 1, 267]      - |     21610    rna22507       gene25151            NA
>>     [5] NW_003704321.1 [24, 203]      - |     21611    rna22508       gene25152        183530
>>     [6] NW_003717424.1 [10, 189]      + |     21612    rna22509       gene25155        183531
>>                cds_name
>>         <CharacterList>
>>     [1]              NA
>>     [2]              NA
>>     [3]              NA
>>     [4]              NA
>>     [5]              NA
>>     [6]              NA
>>     ---
>>     seqlengths:
>>      NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
>>                  NA             NA             NA ...             NA             NA             NA
>>
>>
>> Any suggestions on how to properly generate a txdb from this GFF would be appreciated.
>>
>> Thanks,
>> Guido
>>
>>> sessionInfo()
>> R Under development (unstable) (2013-08-10 r63532)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
>> States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C 
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18   Biobase_2.21.6          GenomicRanges_1.13.36
>> [5] XVector_0.1.0           IRanges_1.19.21         BiocGenerics_0.7.3
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.17.2     Biostrings_2.29.14 bitops_1.0-5       BSgenome_1.29.1    DBI_0.2-7
>>    [6] RCurl_1.95-4.1     Rsamtools_1.13.28  RSQLite_0.11.4     rtracklayer_1.21.9 stats4_3.1.0
>> [11] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.7.0
>>
>> ---------------------------------------------------------
>> Guido Hooiveld, PhD
>> Nutrition, Metabolism & Genomics Group Division of Human Nutrition 
>> Wageningen University Biotechnion, Bomenweg 2
>> NL-6703 HD Wageningen
>> the Netherlands
>> tel: (+)31 317 485788
>> fax: (+)31 317 483342
>> email:      guido.hooiveld at wur.nl
>> internet:   http://nutrigene.4t.com
>> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
>> http://www.researcherid.com/rid/F-4912-2010
>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>