[BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF

Tue Apr 15 22:27:30 CEST 2014

On 04/15/2014 10:53 AM, Maintainer wrote:
> Dear Bioconductor maintainers,
>
> I still have a problem using makeTranscriptDbFromGFF:
>
> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
> format="gtf")
> extracting transcript information
> Estimating transcript ranges.
> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>     number of items to replace is not a multiple of replacement
>

If I

   options(error=recover)

then

 > xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf")
extracting transcript information
Estimating transcript ranges.
Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
   number of items to replace is not a multiple of replacement length

Enter a frame number, or 0 to exit

1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf")
2: .prepareGTFTables(gff, exonRankAttributeName)
3: .prepareGTFtranscripts(data)
4: .deduceTranscriptsFromGTF(transcripts)

Selection: 4
Called from: .deduceTranscriptsFromGTF(transcripts)
Browse[1]>

then look around a little, e.g.,

Browse[1]> .deduceTranscriptsFromGTF
## ...

I spot the offending line

     for (i in seq_len(length(trns))) {
         res[i, ] <- .deduceTranscriptRangeData(subs[[i]])

and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]])

Browse[1]> subs[[i]]
        seqnames    start      end strand type  gene_id transcript_id exon_rank
47369      chr2 36245431 36245515      + exon Mir684-1      Mir684-1        NA
147043     chr5 23763346 23763430      + exon Mir684-1      Mir684-1        NA
Browse[1]> .deduceTranscriptRangeData(subs[[i]])
[1] "Mir684-1" "chr2"     "chr5"     "+"        "23763346" "36245515" "Mir684-1"

I guess the problem is that the exons are coming from two separate chromosomes, 
and .deduceTranscriptRangeData does not handle that case successfully.

Hopefully the code will be made more robust / efficient, but in the mean time 
you could edit your file to remove the troublesome Mir684-1.

Martin

> I uploaded the file here:
> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>
> I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but:
> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
> When I am using the first 100,000 (or, last 100,000 lines, respectively) it works!
>
> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
>
> I have no idea what is going on. Also, the .gtf file is from UCSC.
>
> Do you have any idea what the problem is?
> Thank you so much!
> Katja
>
>
> ----- Forwarded Message -----
> From: "Michael Lawrence" <lawrence.michael at gene.com>
> To: "Katja Hebestreit" <katjah at stanford.edu>
> Cc: "Michael Lawrence" <lawrence.michael at gene.com>
> Sent: Monday, April 14, 2014 7:37:23 PM
> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>
> Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write
> that one, so the maintainers of GenomicFeatures will need to help now.
>
>
> On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at stanford.edu>wrote:
>
>> Great! Thank you so much, Micheal!
>>
>> Unfortunately, I still have problems with this file:
>>
>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>> format="gtf")
>> extracting transcript information
>> Estimating transcript ranges.
>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>   number of items to replace is not a multiple of replacement
>>
>> To determine the gene lengths I wanted to use the exact .gtf file we used
>> for counting the reads per gene some time ago.
>>
>>
>> ----- Original Message -----
>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>> To: "Katja Hebestreit" <katjah at stanford.edu>
>> Cc: "Vincent Carey" <stvjc at channing.harvard.edu>, "Michael Lawrence" <
>> lawrence.michael at gene.com>, "Rsamtools Maintainer" <
>> maintainer at bioconductor.org>, bioconductor at r-project.org
>> Sent: Monday, April 14, 2014 2:55:06 PM
>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>
>> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute.
>> After that change, both the truncated and full file parse successfully.
>> Btw, which genes are these? The UCSC genes are already available as a TxDb
>> package.
>>
>>
>> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at stanford.edu
>>> wrote:
>>
>>> Okay, I removed the whitspaces with
>>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf
>>>
>>> Here is the file:
>>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>>>
>>> But still, I get an error:
>>>
>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>>> format="gtf")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>>    number of items to replace is not a multiple of replacement
>>>
>>>
>>> I tried to reduce the file further, in order to get a smaller example
>> file
>>> and to check which lines cause the problem, but:
>>>
>>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
>>> When I am using the first 100,000 (or, last 100,000 lines, respectively)
>> it
>>> works!
>>>
>>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>>
>>> I have no idea what is going on. Also, the .gtf file is from UCSC.
>>>
>>> Thanks again for helping with that annoying problem!
>>> Katja
>>>
>>>
>>> ----- Original Message -----
>>> From: "Vincent Carey" <stvjc at channing.harvard.ed
>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>, "Rsamtools
>>> Maintainer" <maintainer at bioconductor.org>, bioconductor at r-project.org
>>> Sent: Monday, April 14, 2014 11:45:30 AM
>>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>>
>>> remove the trailing whitespace at the end of every line
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at stanford.edu
>>>> wrote:
>>>
>>>> You can download the file here:
>>>>
>>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf
>>>>
>>>> Using file I get the error:
>>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf")
>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>    Some attributes do not conform to 'tag value' format
>>>>
>>>> Thank you so much for helping!!
>>>> Katja
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>,
>>>> bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>> maintainer at bioconductor.org>
>>>> Sent: Monday, April 14, 2014 7:27:26 AM
>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>> makeTranscriptDbFromGFF
>>>>
>>>> Well, I copied the text and replaced the spaces with tabs as
>> appropriate
>>>> and everything seemed to work fine, so you might to attach that
>> fragment
>>> of
>>>> the file, just to be sure it isn't a formatting issue.
>>>>
>>>> Does import("file.gtf") work for you? If so, that should be good enough
>>> for
>>>> your use case.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>> wrote:
>>>>
>>>>> Actually, the error was not reproducible with the lines I attached.
>> But
>>>> it
>>>>> is reproducible with those lines (four additional lines):
>>>>>
>>>>> chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000
>>>   -
>>>>>        .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -
>>>    2
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -
>>>    .
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -
>>>    1
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -
>>>    .
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -
>>>    0
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     start_codon     3661427 3661429 0.000000
>>>   -
>>>>>        .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     exon    3660633 3661579 0.000000        -
>>>    .
>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1    mm9_refFlat     stop_codon      4283062 4283064 0.000000
>>>   -
>>>>>        .       gene_id "Rp1"; transcript_id "Rp1";
>>>>> chr1    mm9_refFlat     CDS     4283065 4283093 0.000000        -
>>>    2
>>>>>        gene_id "Rp1"; transcript_id "Rp1";
>>>>>
>>>>> Let me know if you like to get the entire file.
>>>>>
>>>>> Thank you!!
>>>>> Katja
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>>> maintainer at bioconductor.org>
>>>>> Sent: Sunday, April 13, 2014 10:02:13 PM
>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>>> makeTranscriptDbFromGFF
>>>>>
>>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I get an error when I try to import my gff file:
>>>>>>
>>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf")
>>>>>>
>>>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>>>    Some attributes do not conform to 'tag value' format
>>>>>>
>>>>>> This is how my file looks like:
>>>>>>
>>>>>> chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000
>>>>   -
>>>>>>        .       gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -
>>>>    2
>>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -
>>>>    .
>>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -
>>>>    1
>>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -
>>>>    .
>>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -
>>>>    0
>>>>>>        gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>
>>>>>> I have the feeling that this has something to do with the missing
>>> exon
>>>>>> rank information in my file. Is that true? Is there a way to import
>>>> this
>>>>>> file? All I want to do is to determine the gene lengths.
>>>>>>
>>>>>
>>>>> It is most likely as the error says: some of your attributes are
>>>> malformed.
>>>>> Is that the entire file listed above, or is there more? If you could
>>> get
>>>> me
>>>>> the file somehow I could diagnose the issue.
>>>>>
>>>>>
>>>>>>
>>>>>> Could anyone help? That would be awesome!
>>>>>> Cheers,
>>>>>> Katja
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>   [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
>>>>>>   [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
>>>>>>   [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
>>>>>>   [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
>>>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>   methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19  Biobase_2.23.6
>>>>>> [4] GenomicRanges_1.16.0   GenomeInfoDb_0.99.32   IRanges_1.21.45
>>>>>> [7] BiocGenerics_0.9.3     BiocInstaller_1.14.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>>   [1] BatchJobs_1.2           BBmisc_1.5
>>>   BiocParallel_0.6.0
>>>>>>   [4] biomaRt_2.20.0          Biostrings_2.32.0       bitops_1.0-6
>>>>>>   [7] brew_1.0-6              BSgenome_1.32.0
>> codetools_0.2-8
>>>>>> [10] DBI_0.2-7               digest_0.6.4            fail_1.2
>>>>>> [13] foreach_1.4.2           GenomicAlignments_1.0.0
>> iterators_1.0.7
>>>>>> [16] plyr_1.8.1              Rcpp_0.11.1             RCurl_1.95-4.1
>>>>>> [19] Rsamtools_1.16.0        RSQLite_0.11.4
>>>   rtracklayer_1.24.0
>>>>>> [22] sendmailR_1.1-2         stats4_3.1.0            stringr_0.6.2
>>>>>> [25] tools_3.1.0             XML_3.98-1.1            XVector_0.4.0
>>>>>> [28] zlibbioc_1.10.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793