[BioC] makeTranscriptDbFromGFF for Unstranded Transcripts

Wed May 8 20:09:41 CEST 2013

Hi Dario,

On 05/07/2013 10:00 PM, Dario Strbenac wrote:
> Hello,
>
> I have assembled transcripts based on unstranded RNA-seq data. It's not the ideal experimental design, of course. Is it not possible to make a TranscriptDb object from this ?

Right now, it's not possible. Note that in the TranscriptDb object
returned by makeTranscriptDbFromGFF(), we really want to see stranded
transcripts, exons, and CDSs. Because: (1) the purpose of a
TranscriptDb is to describe a gene model, and the strand information
(especially the exon strand) is an *essential* part of the gene model,
and (2) there are too many tools/too much code around that expect this.

The same could probably be said of GTF and GFF files, and I wonder if
storing a set of unstranded mRNAs, exons, CDSs in those files is
considered valid.

Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and
GFF files, we would need to automatically replace all missing strands
by a + or a -. This would need to be done in a blunt way e.g. all
missing mRNA/exon/CDS strands would be replaced with a + with a warning.
Trying to make makeTranscriptDbFromGFF() smarter by looking whether
the exons within a given mRNA have their start/end going up or down
when the exon rank is going up would be too problematic.

>
> makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number")

Ok, so you've managed to store the exon rank in your file. But that
means that you must have *implicitly* chosen a strand for your exons
right? Because, within a given mRNA, when the exon rank is going up,
start/end should normally go up on the + strand and down on the -
strand.

Why not just set the strand for the mRNAs / exons / CDSs in your
GTF file?

There is too much going on here so, at this point, I'm not convinced
makeTranscriptDbFromGFF() should decide the strand for you.

Cheers,
H.

> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length information is available.
> Error in .normargTranscripts(transcripts) :
>    values in 'transcripts$tx_strand' must be "+" or "-"
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> GenomicFeatures_1.12.1
>
> --------------------------------------
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319