[BioC] What is the easiest way to make the GC content and length matrix for package cqn?

Harris A. Jaffee hj at jhu.edu
Tue Apr 8 23:33:18 CEST 2014


I would like to know the answer!  This is all I have.

> width(tr_by_gene)
IntegerList of length 25529
[["1"]] 6694
[["10"]] 9969
[["100"]] 32214
[["1000"]] 226516
[["10000"]] 355050 343564 343866 355040 343554 343856
[["100008586"]] 187588 6118 6109
[["100008587"]] 156 156 156 156 156 156 156 156
[["100008588"]] 1869 1869 1869 1869 1869 1869 1869
[["100008589"]] 188093 5066 5077 5070
[["100009613"]] 2915
...
<25519 more elements>

Then my guess would be to build the "DNAStringSetList" (see Biostrings)
underlying your transcripts, say x, and do

	sapply(x, letterFrequency, letters="GC")

Hopefully there's something better.

On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote:

> Hello!
> 
> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias.  I have aligned the data to the UCSC hg38 genome.
> 
>> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.
> 
> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)
> 
> # Load Libraries
> library(GenomicFeatures)
> library(GenomicRanges)
> 
> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
> 
> tr_by_gene <- transcriptsBy(txdb,'gene')
> 
> library(Rsamtools)
> 
> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
> 
> # counts reads overlapping the genes.
> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
> 
> # Create a count table
> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)
> 
> rownames(countTable)=names(tr_by_gene)
> 
> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?
> 
> Thank you.
> 
> Matt
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list