[BioC] makeTranscriptDbFromGFF Error for UCSC GTF File

Hervé Pagès hpages at fhcrc.org
Wed Jul 2 19:54:08 CEST 2014


Hi Dario, Marc,

FWIW, I get a different error. Like you, I downloaded the refGene table
in GTF format using the UCSC Table Browser web interface
(https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene).
Then:

   ## No problem with the parser (used internally by makeTranscriptDbFromGFF):

   library(rtracklayer)
   hg19_refGene <- import("hg19_refGene.gtf")

   ## Error with makeTranscriptDbFromGFF:

   > library(GenomicFeatures)
   > txdb <- makeTranscriptDbFromGFF("hg19_refGene.gtf", format="gtf")
   extracting transcript information
   Estimating transcript ranges.
   Extracting gene IDs
   Processing splicing information for gtf file.
   Deducing exon rank from relative coordinates provided
   Warning messages:
   1: In .deduceTranscriptsFromGTF(transcripts) :
     Some of your transcripts have exons on more than one chromsome.  We
   cannot deduce the order of these exons so these transcripts have been
   discarded.
   2: In .deduceExonRankings(exs, format = "gtf") :
     Infering Exon Rankings.  If this is not what you expected, then
   please be sure that you have provided a valid attribute for
   exonRankAttributeName
   Error in unlist(mapply(.assignRankings, starts, strands)) :
     error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in (function (starts, strands)  :
     Exon rank inference cannot accomodate trans-splicing.
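
If it helps to see which transcripts are involved, something along these
lines should list the transcript IDs whose exons sit on more than one
chromosome or on both strands. This is only a sketch in base R: the toy
data frame stands in for as.data.frame(hg19_refGene), and I am assuming
the imported object carries 'type' and 'transcript_id' columns, as is
usual for GTF. The IDs tx_A, tx_B and tx_C are made up for illustration.

```r
# Toy stand-in for as.data.frame(import("hg19_refGene.gtf")).
# Assumption: the column names seqnames/strand/type/transcript_id match
# what that call returns; the transcript IDs below are hypothetical.
gtf <- data.frame(
  seqnames      = c("chr1", "chr1", "chrX", "chrY", "chr2", "chr2"),
  start         = c(100, 300, 500, 500, 900, 700),
  strand        = c("+", "+", "+", "+", "+", "-"),
  type          = "exon",
  transcript_id = c("tx_A", "tx_A", "tx_B", "tx_B", "tx_C", "tx_C"),
  stringsAsFactors = FALSE
)

# Keep exon rows only.
exons <- gtf[gtf$type == "exon", ]

# Count the distinct chromosomes and strands seen for each transcript_id.
n_chrom  <- tapply(as.character(exons$seqnames), exons$transcript_id,
                   function(x) length(unique(x)))
n_strand <- tapply(as.character(exons$strand), exons$transcript_id,
                   function(x) length(unique(x)))

# Transcript IDs with exons on more than one chromosome / strand.
multi_chrom  <- names(n_chrom)[n_chrom > 1]
multi_strand <- names(n_strand)[n_strand > 1]

print(multi_chrom)
print(multi_strand)
```

On the real file, replacing 'gtf' with as.data.frame(hg19_refGene) should
be enough. Transcript IDs with exons on both strands look like the probable
trigger of the trans-splicing error, since the failing call passes each
transcript's strands to the ranking function, but I have not confirmed
that here.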

Cheers,
H.

 > sessionInfo()
R version 3.1.0 Patched (2014-06-21 r66002)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicFeatures_1.17.12 AnnotationDbi_1.27.8    Biobase_2.25.0
[4] rtracklayer_1.25.11     GenomicRanges_1.17.18   GenomeInfoDb_1.1.9
[7] IRanges_1.99.16         S4Vectors_0.0.9         BiocGenerics_0.11.2

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2            BBmisc_1.7               BiocParallel_0.7.5
 [4] biomaRt_2.21.0           Biostrings_2.33.10       bitops_1.0-6
 [7] brew_1.0-6               checkmate_1.1            codetools_0.2-8
[10] DBI_0.2-7                digest_0.6.4             fail_1.2
[13] foreach_1.4.2            GenomicAlignments_1.1.14 iterators_1.0.7
[16] plyr_1.8.1               Rcpp_0.11.2              RCurl_1.95-4.1
[19] Rsamtools_1.17.27        RSQLite_0.11.4           sendmailR_1.1-2
[22] stats4_3.1.0             stringr_0.6.2            tools_3.1.0
[25] XML_3.98-1.1             XVector_0.5.6            zlibbioc_1.11.1

On 07/02/2014 10:16 AM, Marc Carlson wrote:
> Hi Dario,
>
> That error says that some of the attributes have been formatted in a way
> that leaves them uninterpretable by the parser.  But what really puzzles
> me is why you want to parse this track as a GTF file at all?  The UCSC
> hg19 track is already available as a package here:
>
> http://www.bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html
>
>
> And if that is not actually the track you are trying for, then perhaps
> you should just use the makeTranscriptDbFromUCSC() function instead?
> That would be the more typical tool for making UCSC tracks into
> TranscriptDb objects.
>
> In contrast, using GTF or GFF files for making TranscriptDb objects is
> always a little risky because many of these files will not have been
> created with the intention of holding a transcriptome as data (which is
> the specific thing that a TranscriptDb object is meant to hold).  This
> is because the GTF and GFF file formats were not initially intended for
> the specific purpose of holding a transcriptome but were instead
> intended to be something more general.
>
> Hope this helps,
>
>
>   Marc
>
>
>
> On 07/02/2014 12:00 AM, Dario Strbenac wrote:
>> Hello,
>>
>> I used :
>>
>> > system.time(hg19 <- makeTranscriptDbFromGFF("/home/dario/data/Annotation/hg19.gtf", format = "gtf"))
>> Error in .parse_attrCol(attrCol, file, colnames) :
>>    Some attributes do not conform to 'tag value' format
>> Timing stopped at: 15.605 0.296 16.07
>>
>> I downloaded the GTF file from UCSC Table Browser. The table's name
>> was refGene. To me, it seems that the attributes are fine :
>>
>> > hg19table <- read.table("/home/dario/data/Annotation/hg19.gtf", sep = '\t', stringsAsFactors=FALSE)
>> > table(sapply(strsplit(hg19table[, 9], ' '), length))
>>       4
>> 967118
>>
>> I have R version 3.1.0 (2014-04-10) and GenomicFeatures 1.16.2
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


