[BioC] What is the easiest way to make the GC content and length matrix for package cqn?

Fri Apr 11 02:11:27 CEST 2014

Hi Matt,

On 04/08/2014 03:23 PM, Hervé Pagès wrote:
>
>
> On 04/08/2014 02:33 PM, Harris A. Jaffee wrote:
>> I would like to know the answer!  This is all I have.
>>
>>> width(tr_by_gene)
>> IntegerList of length 25529
>> [["1"]] 6694
>> [["10"]] 9969
>> [["100"]] 32214
>> [["1000"]] 226516
>> [["10000"]] 355050 343564 343866 355040 343554 343856
>> [["100008586"]] 187588 6118 6109
>> [["100008587"]] 156 156 156 156 156 156 156 156
>> [["100008588"]] 1869 1869 1869 1869 1869 1869 1869
>> [["100008589"]] 188093 5066 5077 5070
>> [["100009613"]] 2915
>> ...
>> <25519 more elements>
>>
>> Then my guess would be to build the "DNAStringSetList" (see Biostrings)
>> underlying your transcripts, say x, and do
>>
>>     sapply(x, letterFrequency, letters="GC")
>>
>> Hopefully there's something better.
>
> Yes, that's the idea. You want to try avoiding the sapply() though,
> which is generally slow:
>
>    ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based
>    ## on but, sadly, the reference sequences are named differently).
>    ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's
>    ## easier to work with TranscriptDb objects obtained from UCSC.
>    library(BSgenome)
>    genome <- getBSgenome("GRCh38")
>
>    ## The following hack would not be needed if hg38 and GRCh38 were
>    ## using the same chromosome names. We'll keep only chr1-22, chrX,
>    ## chrY, chrM, because it's too complicated to safely map and rename
>    ## the other sequences.
>    seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25]
>    seqlevels(genome)[1:25] <- seqlevels(tr_by_gene)
>    genome(tr_by_gene) <- "GRCh38"

Don't know where you stand on this but FWIW I just added a new utility
to the new GenomeInfoDb package that makes it relatively easy to rename
the sequences in GRCh38 with the names used by UCSC:

     library(GenomeInfoDb)
     hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
     hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevels,
                                hg38_chrominfo$NCBI_seqlevels)
     seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]

so the above hack is not needed anymore.

Of course, things should be made even easier, e.g. maybe the user should
be able to just do:

   genome(genome) <- "hg38"

But I'll have to leave this for the next devel cycle.

HTH,
H.

>
>    ## Extract the transcript sequences grouped by gene:
>    trseq_by_gene <- getSeq(genome, tr_by_gene)
>
>    ## Compute the transcript GC counts grouped by gene:
>    af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE),
> baseOnly=TRUE)
>    trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene)
>
> Then:
>
>    > trGCcount_by_gene
>    IntegerList of length 24443
>    [["1"]] 3856
>    [["10"]] 3697
>    [["100"]] 17142
>    [["1000"]] 84322
>    [["100008586"]] 3156 2539 2537
>    [["100009613"]] 1578
>    [["100009676"]] 1463
>    [["10001"]] 6854 6854 6854 6854
>    [["10002"]] 2652 4134
>    [["10003"]] 20205
>    ...
>    <24433 more elements>
>
> Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14.
> I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package
> that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but
> presents them to the user with the UCSC names.
>
> Cheers,
> H.
>
>>
>> On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote:
>>
>>> Hello!
>>>
>>> I have some RNASeq data that I am analyzing with edgeR and I would
>>> like to use the cqn package to correct for GC bias.  I have aligned
>>> the data to the UCSC hg38 genome.
>>>
>>>>  From googling I have found the the bedtools 'nuc' command will give
>>>> me the GC content with ranges and the length. Providing that I have
>>>> a bed file of the hg38 genome.
>>>
>>> What I need to make sure of is that I am calculating the length and
>>> the GC content for the correct intervals. I bin my reads using
>>> GenomicFeatures like this (thank you Homer)
>>>
>>> # Load Libraries
>>> library(GenomicFeatures)
>>> library(GenomicRanges)
>>>
>>> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
>>>
>>> tr_by_gene <- transcriptsBy(txdb,'gene')
>>>
>>> library(Rsamtools)
>>>
>>> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
>>> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
>>> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
>>> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
>>>
>>> # counts reads overlapping the genes.
>>> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
>>> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
>>> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
>>> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
>>>
>>> # Create a count table
>>> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p,
>>> AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)
>>>
>>> rownames(countTable)=names(tr_by_gene)
>>>
>>> Is there a way to get lengths and gc content from 'tr_by-gene'? Is
>>> there a way to create a bed file from 'tr_by_gene' so that I can then
>>> use bedtools nuc?
>>>
>>> Thank you.
>>>
>>> Matt
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319