[BioC] What is the easiest way to make the GC content and length matrix for package cqn?

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 9 00:38:29 CEST 2014

On 04/08/2014 01:03 PM, Thornton, Matthew wrote:
> Hello!
> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias.  I have aligned the data to the UCSC hg38 genome.
>>From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.
> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)
> # Load Libraries
> library(GenomicFeatures)
> library(GenomicRanges)
> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
> tr_by_gene <- transcriptsBy(txdb,'gene')
> library(Rsamtools)
> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
> # counts reads overlapping the genes.
> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
> # Create a count table
> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)

For the preceding (I know you didn't ask!) you could represent all of your files

     files <- c(r_AE89_m="sorted_trimmed_AE89_minus.bam",

as a 'BamFileList'

     bam_files <- BamFileList(files)

write a 'COUNTER' function that takes features, reads, and possibly other arguments

     COUNTER <- function(features, reads, ...)
         ## ignore some arguments
         countOverlaps(features, reads)

And then perform the summary

     se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER)
     head(assay(se), 3)

Once this is working to your satisfaction, modify the recipe to count each file 
on a separate thread, limiting memory use by changing the 'yieldSize' (number of 
records read at any one time) on the BamFileList

     library(BiocParallel)                   # count one file per core
     bam_files <- BamFileList(files, yieldSize=1000000)
     se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER)

Add a param=ScanBamParam() argument to control which reads and possible other 
information (e.g., mapping quality) that gets read in and hence is available for 
use in the COUNTER function.

The help page ?summarizeOverlaps has some built-in functions that count in 
different, relevant, ways.


> rownames(countTable)=names(tr_by_gene)
> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?
> Thank you.
> Matt
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

More information about the Bioconductor mailing list