[BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF

Mon Apr 21 20:12:00 CEST 2014

Hi Katja,

Sorry about the earlier confusion about where you posted your question.  
Because you posted the same question to two different mailing lists 
(double post), my email client 'lost' the thread that went to our 
regular mailing list.

Also thank you for mentioning this problem with deducing the 
transcripts.  However my solution for this problem will not be very 
ideal since it is fundamentally impossible to guess the transcripts exon 
ordering if the exons are not even on the same chromosomes.  So if your 
GFF file does not contain exon ordering for the transcripts then I fear 
that you are out of luck (as the information is clearly necessary and 
also not available).

To make the code more robust I will look into catching this situation 
when it comes up and throwing out transcripts that are like that (with a 
warning).  But this begs the question: do you really need a full 
TranscriptDb object?  What were you planning to do with this?

  Marc

On 04/15/2014 02:02 PM, Katja Hebestreit wrote:
> Thank you so much for pointing that out!!
>
> Unfortunately, there are a bunch of genes with exomes on different chromosomes. And this is also the case for newer versions of that .gtf file (refFlat from UCSC). Do you have an idea when you will fix that?
>
> Cheers,
> Katja
>
> ----- Original Message -----
> From: "Martin Morgan" <mtmorgan at fhcrc.org>
> To: katjah at stanford.edu, bioconductor at r-project.org
> Sent: Tuesday, April 15, 2014 1:27:30 PM
> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>
> On 04/15/2014 10:53 AM, Maintainer wrote:
>> Dear Bioconductor maintainers,
>>
>> I still have a problem using makeTranscriptDbFromGFF:
>>
>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>> format="gtf")
>> extracting transcript information
>> Estimating transcript ranges.
>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>      number of items to replace is not a multiple of replacement
>>
> If I
>
>     options(error=recover)
>
> then
>
>   > xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf")
> extracting transcript information
> Estimating transcript ranges.
> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>     number of items to replace is not a multiple of replacement length
>
> Enter a frame number, or 0 to exit
>
> 1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf")
> 2: .prepareGTFTables(gff, exonRankAttributeName)
> 3: .prepareGTFtranscripts(data)
> 4: .deduceTranscriptsFromGTF(transcripts)
>
> Selection: 4
> Called from: .deduceTranscriptsFromGTF(transcripts)
> Browse[1]>
>
> then look around a little, e.g.,
>
> Browse[1]> .deduceTranscriptsFromGTF
> ## ...
>
> I spot the offending line
>
>       for (i in seq_len(length(trns))) {
>           res[i, ] <- .deduceTranscriptRangeData(subs[[i]])
>
> and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]])
>
> Browse[1]> subs[[i]]
>          seqnames    start      end strand type  gene_id transcript_id exon_rank
> 47369      chr2 36245431 36245515      + exon Mir684-1      Mir684-1        NA
> 147043     chr5 23763346 23763430      + exon Mir684-1      Mir684-1        NA
> Browse[1]> .deduceTranscriptRangeData(subs[[i]])
> [1] "Mir684-1" "chr2"     "chr5"     "+"        "23763346" "36245515" "Mir684-1"
>
> I guess the problem is that the exons are coming from two separate chromosomes,
> and .deduceTranscriptRangeData does not handle that case successfully.
>
> Hopefully the code will be made more robust / efficient, but in the mean time
> you could edit your file to remove the troublesome Mir684-1.
>
> Martin
>
>
>> I uploaded the file here:
>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>>
>> I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but:
>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
>> When I am using the first 100,000 (or, last 100,000 lines, respectively) it works!
>>
>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>> --> works!
>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>> --> works!
>>
>> I have no idea what is going on. Also, the .gtf file is from UCSC.
>>
>> Do you have any idea what the problem is?
>> Thank you so much!
>> Katja
>>
>>
>> ----- Forwarded Message -----
>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>> To: "Katja Hebestreit" <katjah at stanford.edu>
>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>
>> Sent: Monday, April 14, 2014 7:37:23 PM
>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>
>> Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write
>> that one, so the maintainers of GenomicFeatures will need to help now.
>>
>>
>> On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at stanford.edu>wrote:
>>
>>> Great! Thank you so much, Micheal!
>>>
>>> Unfortunately, I still have problems with this file:
>>>
>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>>> format="gtf")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>>    number of items to replace is not a multiple of replacement
>>>
>>> To determine the gene lengths I wanted to use the exact .gtf file we used
>>> for counting the reads per gene some time ago.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>> Cc: "Vincent Carey" <stvjc at channing.harvard.edu>, "Michael Lawrence" <
>>> lawrence.michael at gene.com>, "Rsamtools Maintainer" <
>>> maintainer at bioconductor.org>, bioconductor at r-project.org
>>> Sent: Monday, April 14, 2014 2:55:06 PM
>>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>>
>>> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute.
>>> After that change, both the truncated and full file parse successfully.
>>> Btw, which genes are these? The UCSC genes are already available as a TxDb
>>> package.
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at stanford.edu
>>>> wrote:
>>>> Okay, I removed the whitspaces with
>>>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf
>>>>
>>>> Here is the file:
>>>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>>>>
>>>> But still, I get an error:
>>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>>>> format="gtf")
>>>> extracting transcript information
>>>> Estimating transcript ranges.
>>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>>>     number of items to replace is not a multiple of replacement
>>>>
>>>>
>>>> I tried to reduce the file further, in order to get a smaller example
>>> file
>>>> and to check which lines cause the problem, but:
>>>>
>>>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
>>>> When I am using the first 100,000 (or, last 100,000 lines, respectively)
>>> it
>>>> works!
>>>>
>>>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>>> --> works!
>>>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>>> --> works!
>>>>
>>>> I have no idea what is going on. Also, the .gtf file is from UCSC.
>>>>
>>>> Thanks again for helping with that annoying problem!
>>>> Katja
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Vincent Carey" <stvjc at channing.harvard.ed
>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>, "Rsamtools
>>>> Maintainer" <maintainer at bioconductor.org>, bioconductor at r-project.org
>>>> Sent: Monday, April 14, 2014 11:45:30 AM
>>>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>>>
>>>> remove the trailing whitespace at the end of every line
>>>>
>>>>
>>>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at stanford.edu
>>>>> wrote:
>>>>> You can download the file here:
>>>>>
>>>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf
>>>>>
>>>>> Using file I get the error:
>>>>>
>>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf")
>>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>>     Some attributes do not conform to 'tag value' format
>>>>>
>>>>> Thank you so much for helping!!
>>>>> Katja
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>,
>>>>> bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>>> maintainer at bioconductor.org>
>>>>> Sent: Monday, April 14, 2014 7:27:26 AM
>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>>> makeTranscriptDbFromGFF
>>>>> Well, I copied the text and replaced the spaces with tabs as
>>> appropriate
>>>>> and everything seemed to work fine, so you might to attach that
>>> fragment
>>>> of
>>>>> the file, just to be sure it isn't a formatting issue.
>>>>>
>>>>> Does import("file.gtf") work for you? If so, that should be good enough
>>>> for
>>>>> your use case.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <
>>> katjah at stanford.edu
>>>>>> wrote:
>>>>>> Actually, the error was not reproducible with the lines I attached.
>>> But
>>>>> it
>>>>>> is reproducible with those lines (four additional lines):
>>>>>>
>>>>>> chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000
>>>>    -
>>>>>>         .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -
>>>>     2
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -
>>>>     .
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -
>>>>     1
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -
>>>>     .
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -
>>>>     0
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     start_codon     3661427 3661429 0.000000
>>>>    -
>>>>>>         .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     exon    3660633 3661579 0.000000        -
>>>>     .
>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     stop_codon      4283062 4283064 0.000000
>>>>    -
>>>>>>         .       gene_id "Rp1"; transcript_id "Rp1";
>>>>>> chr1    mm9_refFlat     CDS     4283065 4283093 0.000000        -
>>>>     2
>>>>>>         gene_id "Rp1"; transcript_id "Rp1";
>>>>>>
>>>>>> Let me know if you like to get the entire file.
>>>>>>
>>>>>> Thank you!!
>>>>>> Katja
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>>>> maintainer at bioconductor.org>
>>>>>> Sent: Sunday, April 13, 2014 10:02:13 PM
>>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>>>> makeTranscriptDbFromGFF
>>>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <
>>> katjah at stanford.edu
>>>>>>> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I get an error when I try to import my gff file:
>>>>>>>
>>>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf")
>>>>>>>
>>>>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>>>>     Some attributes do not conform to 'tag value' format
>>>>>>>
>>>>>>> This is how my file looks like:
>>>>>>>
>>>>>>> chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000
>>>>>    -
>>>>>>>         .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>> chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -
>>>>>     2
>>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>> chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -
>>>>>     .
>>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>> chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -
>>>>>     1
>>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>> chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -
>>>>>     .
>>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>> chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -
>>>>>     0
>>>>>>>         gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>>
>>>>>>> I have the feeling that this has something to do with the missing
>>>> exon
>>>>>>> rank information in my file. Is that true? Is there a way to import
>>>>> this
>>>>>>> file? All I want to do is to determine the gene lengths.
>>>>>>>
>>>>>> It is most likely as the error says: some of your attributes are
>>>>> malformed.
>>>>>> Is that the entire file listed above, or is there more? If you could
>>>> get
>>>>> me
>>>>>> the file somehow I could diagnose the issue.
>>>>>>
>>>>>>
>>>>>>> Could anyone help? That would be awesome!
>>>>>>> Cheers,
>>>>>>> Katja
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>> R version 3.1.0 (2014-04-10)
>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>>    [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
>>>>>>>    [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
>>>>>>>    [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
>>>>>>>    [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
>>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>    methods
>>>>>>> [8] base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19  Biobase_2.23.6
>>>>>>> [4] GenomicRanges_1.16.0   GenomeInfoDb_0.99.32   IRanges_1.21.45
>>>>>>> [7] BiocGenerics_0.9.3     BiocInstaller_1.14.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>>    [1] BatchJobs_1.2           BBmisc_1.5
>>>>    BiocParallel_0.6.0
>>>>>>>    [4] biomaRt_2.20.0          Biostrings_2.32.0       bitops_1.0-6
>>>>>>>    [7] brew_1.0-6              BSgenome_1.32.0
>>> codetools_0.2-8
>>>>>>> [10] DBI_0.2-7               digest_0.6.4            fail_1.2
>>>>>>> [13] foreach_1.4.2           GenomicAlignments_1.0.0
>>> iterators_1.0.7
>>>>>>> [16] plyr_1.8.1              Rcpp_0.11.1             RCurl_1.95-4.1
>>>>>>> [19] Rsamtools_1.16.0        RSQLite_0.11.4
>>>>    rtracklayer_1.24.0
>>>>>>> [22] sendmailR_1.1-2         stats4_3.1.0            stringr_0.6.2
>>>>>>> [25] tools_3.1.0             XML_3.98-1.1            XVector_0.4.0
>>>>>>> [28] zlibbioc_1.10.0
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>> ________________________________________________________________________
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>