[BioC] building a refseq-based transcriptDb: warnings of interest?

Sat Jul 24 08:12:22 CEST 2010

Hi Vince,

On 07/23/2010 02:50 AM, Vincent Carey wrote:
>> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> warnings()
> Warning messages:
> 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
> exon_locs$start[[i]],  ... :
>    UCSC data anomaly in transcript NM_017940: the cds cumulative length
> is not a multiple of 3
> 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
> exon_locs$start[[i]],  ... :
>    UCSC data anomaly in transcript NM_001037675: the cds cumulative
> length is not a multiple of 3
> 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
> exon_locs$start[[i]],  ... :
>    UCSC data anomaly in transcript NM_001039703: the cds cumulative
> length is not a multiple of 3
> 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
> exon_locs$start[[i]],  ... :
>
> and so on.  Does this need to be reported to UCSC?

Glad you bring this in the discussion.

If you look at the schema:

http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema

the refGene table has cdsStartStat and cdsStartEnd cols (in addition
to the  cdsStart and cdsEnd cols) which describe the status of each
CDS. My understanding is that only CDS with status 'cmpl' (complete)
are guaranteed to have a length that is a multiple of 3.

Currently makeTranscriptDbFromUCSC() imports all CDS in the TranscriptDb
object, regardless of their status, and issues a warning for each CDS
that doesn't look right. Maybe not the best approach. Should we allow
the user to filter CDSs based on this status? Or should we import only
complete CDSs? Or we import all the CDSs but we store in the metadata
table of the TranscriptDb object (and then display this in the show
method) the fact that not all the CDSs are complete? Then all
TranscriptDb objects made with makeTranscriptDbFromUCSC() would
be marked that way, except those obtained from the knownGene
table where, AFAIK, all the CDSs are guaranteed to be complete.

One difficulty with the design of TranscriptDb objects was to come
up with a db schema that would accommodate data coming from very
different places like UCSC and biomaRt, and then to implement
methods for extracting features from the db that would not be
specific to one source or another. This is why adding the cdsStartStat
and cdsStartEnd cols to our own db was discarded because those
cols are specific to UCSC (IIRC biomaRt/Ensembl doesn't provide
this info). Not even all transcript-like tables at UCSC have them.
And tables that have them don't necessarily use the same set of
values for this col (they use a MySQL enum type).

I guess it all depends what people want to do with those CDSs.

Cheers,
H.

>>   sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-06-30 r52417)
> Platform: x86_64-apple-darwin10.3.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
>
> other attached packages:
> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.15  IRanges_1.7.13
> [4] weaver_1.15.0         codetools_0.2-2       digest_0.4.2
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.17.5    Biobase_2.9.0      Biostrings_2.17.26 DBI_0.2-5
> [5] RCurl_1.4-2        RSQLite_0.9-1      XML_3.1-0          biomaRt_2.5.1
> [9] rtracklayer_1.9.3
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319