[BioC] UCSC data anomaly in 50638 transcript(s): the cds cumulative length is

Marc Carlson mcarlson at fhcrc.org
Thu May 29 20:26:25 CEST 2014


Hi Adi,

This issue was caused by some overzealous warning code. It was 
throwing a warning whenever a CDS was absent, and not *only* when the 
cumulative CDS length was not a multiple of 3, as the warning says. I 
have fixed this so that the code is more reasonable about what it 
thinks you need to be warned about.
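
The difference between the two behaviours can be sketched in base R. This is only an illustration and not the actual GenomicFeatures code: the function names `flag_anomalies_buggy` and `flag_anomalies_fixed`, the toy transcript names, and the convention that a transcript without a CDS has `NA` as its cumulative CDS length are all assumptions made for the example.

```r
# Hypothetical cumulative CDS lengths per transcript. NA marks a transcript
# with no CDS at all (an assumed convention for this sketch).
cds_len <- c(tx1 = 300L, tx2 = NA, tx3 = 301L, tx4 = NA)

# Overzealous check (hypothetical): a transcript is flagged unless its
# length is a confirmed multiple of 3, so a missing CDS is flagged too.
flag_anomalies_buggy <- function(len) {
  !(len %% 3L == 0L) | is.na(len)
}

# Stricter check (hypothetical): flag only transcripts that do have a CDS
# and whose cumulative CDS length is not a multiple of 3.
flag_anomalies_fixed <- function(len) {
  !is.na(len) & len %% 3L != 0L
}

buggy <- names(cds_len)[flag_anomalies_buggy(cds_len)]
fixed <- names(cds_len)[flag_anomalies_fixed(cds_len)]
```

With this toy input the first check flags the two CDS-less transcripts as well as `tx3`, while the second flags `tx3` only, which is the kind of over-reporting described above.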

  Marc


On 05/28/2014 08:52 AM, Tarca, Adi wrote:
> Dear Hervé,
> I have seen that type of warning in Google searches, but usually it was for one or a few transcripts.
> Since the problem seemed to affect maybe all of the transcripts, I was not sure that the table had been properly downloaded.
> Thank you for the clarification and for making others aware of the issue.
> Best regards,
> Adi
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Wednesday, May 28, 2014 1:59 AM
> To: Tarca, Adi
> Cc: bioconductor at r-project.org
> Subject: Re: UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>
> Hi Adi,
>
> Hope you don't mind that I'm cc'ing the list.
>
> On 05/27/2014 04:17 PM, Tarca, Adi wrote:
>> Dear Hervé,
>>
>> Should I worry about the warning below?
>>
>> I just want to overall some rna seq reads with known genes.
> Do you mean "overlap"?
>
>> Thanks,
>>
>> Adi
>>
>>   > txdb2=makeTranscriptDbFromUCSC(
>> +              genome="hg19",
>> +              tablename="knownGene")
> Note that we provide a few "TxDb" packages that contain pre-computed TranscriptDb objects for a few organisms and tracks:
>
>     http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb
>
> There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene package.
>
>> Download the knownGene table ... OK
>> Download the knownToLocusLink table ... OK
>> Extract the 'transcripts' data frame ... OK
>> Extract the 'splicings' data frame ... OK
>> Download and preprocess the 'chrominfo' data frame ... OK
>> Prepare the 'metadata' data frame ... OK
>> Make the TranscriptDb object ... OK
>> Warning message:
>> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>>     UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>>     not a multiple of 3 for transcripts ......u [... truncated]
> This warning is wrong. It's actually easy to check that all the CDS have a cumulative length that is a multiple of 3:
>
>     > cds_by_tx <- cdsBy(txdb2, by="tx")
>     > table(sum(width(cds_by_tx)) %% 3L)
>         0
>     63691
>
> Seems to be a regression introduced in BioC 2.14. Someone in Seattle will work on a fix and we will notify the list when the fix is available.
>
> Otherwise, assuming the code in charge of issuing the warning is working properly, you can get a legitimate warning like this for some combinations of UCSC organism/track (but AFAIK never for the knownGene track). If all you want to do is find/count overlaps between some RNA-seq reads and known genes, then you probably don't care about CDS at all.
>
> Cheers,
> H.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


