[BioC] UCSC data anomaly in 50638 transcript(s): the cds cumulative length is

Hervé Pagès hpages at fhcrc.org
Wed May 28 07:59:25 CEST 2014


Hi Adi,

Hope you don't mind that I'm cc'ing the list.

On 05/27/2014 04:17 PM, Tarca, Adi wrote:
> Dear Hervé,
>
> Should I worry about the warning below?
>
> I just want to overall some rna seq reads with know genes.

Do you mean "overlap"?

>
> Thanks,
>
> Adi
>
>  > txdb2=makeTranscriptDbFromUCSC(
>
> +              genome="hg19",
>
> +              tablename="knownGene")

Note that we provide a few "TxDb" packages that contain pre-computed
TranscriptDb objects for a few organisms and tracks:

   http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb

There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene
package.

>
> Download the knownGene table ... OK
>
> Download the knownToLocusLink table ... OK
>
> Extract the 'transcripts' data frame ... OK
>
> Extract the 'splicings' data frame ... OK
>
> Download and preprocess the 'chrominfo' data frame ... OK
>
> Prepare the 'metadata' data frame ... OK
>
> Make the TranscriptDb object ... OK
>
> Warning message:
>
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>
>    UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>
>    not a multiple of 3 for transcripts ‘uc001aaa.3’ ‘uc010nxr.1’
>
>    ‘uc009vis.3’ ‘uc009vjc.1’ ‘uc009vjd.2’ ‘uc009vit.3’
> ‘uc009viu.3’
>
>    ‘uc001aae.4’ ‘uc001aai.1’ ‘uc001aah.4’ ‘uc009vir.3’
> ‘uc009viq.3’
>
>    ‘uc001aac.4’ ‘uc009viv.2’ ‘uc009viw.2’ ‘uc009vix.2’
> ‘uc009viy.2’
>
>    ‘uc009viz.2’ ‘uc010nxs.1’ ‘uc009vje.2’ ‘uc009vjf.2’
> ‘uc009vjb.1’
>
>    ‘uc001aak.3’ ‘uc021oeg.2’ ‘uc001aaq.2’ ‘uc001aar.2’
> ‘uc021oeh.1’
>
>    ‘uc009vjk.2’ ‘uc001aau.3’ ‘uc001aax.1’ ‘uc021oej.1’
> ‘uc021oek.1’
>
>    ‘uc021oel.1’ ‘uc001abb.3’ ‘uc001abe.4’ ‘uc001abi.2’
> ‘uc001abj.3’
>
>    ‘uc009vjm.3’ ‘uc010nxw.2’ ‘uc001abl.3’ ‘uc002khh.3’
> ‘uc001abm.2’
>
>    ‘uc001abo.3’ ‘uc031pjj.1’ ‘uc001abp.2’ ‘uc021oem.2’
> ‘uc009vjn.2’
>
>    ‘uc009vjo.2’ ‘uc031pjk.1’ ‘uc001abt.4’ ‘uc001abu.1’
> ‘u [... truncated]

This warning is wrong. It's actually easy to check that all the CDS
have a cumulative length that is a multiple of 3:

   > cds_by_tx <- cdsBy(txdb2, by="tx")
   > table(sum(width(cds_by_tx)) %% 3L)
       0
   63691
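
If the table ever shows counts for remainders other than 0, the affected transcripts can be listed by name. Below is a minimal sketch of that step. `non_multiple_of_3` is a made-up helper name (it is not a GenomicFeatures function), and the toy lengths are invented for illustration only:

```r
# Hypothetical helper (not part of GenomicFeatures): given a named integer
# vector of cumulative CDS lengths, return the names of the transcripts
# whose length is not a multiple of 3.
non_multiple_of_3 <- function(cds_len) {
    names(cds_len)[cds_len %% 3L != 0L]
}

# Toy input, invented for illustration only.
toy_len <- c(txA=300L, txB=301L, txC=9L)
bad <- non_multiple_of_3(toy_len)
print(bad)

# With a real TranscriptDb object such as txdb2, the named vector of
# lengths could be obtained with:
#   cds_len <- sum(width(cdsBy(txdb2, by="tx", use.names=TRUE)))
```

For the hg19/knownGene object above, the result should be empty, since all 63691 CDS lengths are multiples of 3.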

Seems to be a regression introduced in BioC 2.14. Someone in Seattle
will work on a fix and we will notify the list when the fix is
available.

Otherwise, assuming the code in charge of issuing the warning is
working properly, you can get a legitimate warning like this for
some UCSC organism/track combinations (but AFAIK never for the
knownGene track). If all you want to do is find/count overlaps between
some RNA-seq reads and known genes, then you probably don't care about
CDS at all.
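
To make the overlap counting concrete, here is a small sketch using toy ranges in place of real data. The gene and read coordinates are invented; with real data the genes would typically come from something like `exonsBy(txdb2, by="gene")` and the reads from a BAM file (the file name in the comment is hypothetical). It only assumes the GenomicRanges package is installed:

```r
library(GenomicRanges)

# Toy "genes", invented for illustration only.
genes <- GRanges("chr1",
                 IRanges(start=c(100, 500), end=c(200, 600)),
                 strand=c("+", "-"))
names(genes) <- c("geneA", "geneB")

# Toy "reads" of width 50 with unspecified strand ("*"), so they can
# overlap genes on either strand.
reads <- GRanges("chr1", IRanges(start=c(90, 150, 550, 900), width=50))

# Number of reads overlapping each gene.
counts <- countOverlaps(genes, reads)
print(counts)

# With real data this could look like (not run here):
#   library(GenomicAlignments)                # readGAlignments(), BioC 2.14
#   exons_by_gene <- exonsBy(txdb2, by="gene")
#   reads <- readGAlignments("reads.bam")     # hypothetical file name
#   counts <- countOverlaps(exons_by_gene, reads)
```

Note that no CDS information is involved anywhere in this computation, only exon (or gene) ranges.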

Cheers,
H.

>
>  > sessioninfo()
>
> Error: could not find function "sessioninfo"
>
>  > sessionInfo()
>
> R version 3.0.3 (2014-03-06)
>
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
>
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>
> [8] base
>
> other attached packages:
>
> [1] gplots_2.13.0           RColorBrewer_1.0-5      PADOG_1.4.0
>
> [4] GSA_1.03                nlme_3.1-117            KEGGdzPathwaysGEO_1.1.3
>
> [7] Heatplus_2.8.0          marray_1.40.0           limma_3.18.13
>
> [10] org.Hs.eg.db_2.10.1     preprocessCore_1.24.0   GO.db_2.10.1
>
> [13] SPIA_2.14.0             KEGGgraph_1.20.0        graph_1.40.1
>
> [16] XML_3.98-1.1            KEGG.db_2.10.1          RSQLite_0.11.4
>
> [19] DBI_0.2-7               R2HTML_2.2.1            rtracklayer_1.22.7
>
> [22] Rsamtools_1.14.3        Biostrings_2.30.1       GenomicFeatures_1.14.5
>
> [25] AnnotationDbi_1.24.0    Biobase_2.22.0          GenomicRanges_1.14.4
>
> [28] XVector_0.2.0           IRanges_1.20.7          BiocGenerics_0.8.0
>
> [31] BiocInstaller_1.12.1    multicore_0.2
>
> loaded via a namespace (and not attached):
>
> [1] biomaRt_2.18.0     bitops_1.0-6       BSgenome_1.30.0    caTools_1.17
>
> [5] gdata_2.13.3       grid_3.0.3         gtools_3.4.0       KernSmooth_2.23-12
>
> [9] lattice_0.20-29    RCurl_1.95-4.1     stats4_3.0.3       tools_3.0.3
>
> Adi Laurentiu TARCA, Ph.D.
>
> Assistant Professor (Research),
> Department of Computer Science & Center for Molecular Medicine and
> Genetics, Wayne State University,
> Director, Bioinformatics and Computational Biology Unit,  Perinatology
> Research Branch (NICHD),
>
> 3990 John R., Office 4809,
> Detroit, Michigan 48201
> Tel: 1-313-5775305
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


