[BioC] Consensus sequence

Hervé Pagès hpages at fhcrc.org
Thu Sep 11 00:44:32 CEST 2014

Hi Jim,

On 09/08/2014 07:32 PM, James W. MacDonald wrote:
> I have a use case that I am not sure the best way to proceed.
> I have a non-model organism for which we would like to know the sequence
> for a set of genes. It's a bird species, and I can for example align to the
> zebra finch genome, then run bam_tally over a region to get  variants where
> my species differs from zebra finch, and then infer the sequence based on
> the reference and the variants.
> But that seems harder than it should be. Is there some function that I am
> missing that I can feed a GRanges and get back the consensus sequence for
> that region, based on say a simple vote of the reads that overlap it?

So basically you're going to use the same approach as if you had reads
from a zebra finch individual. The real gene sequences for that
individual would probably also differ from the reference sequence,
maybe less than for your bird that is not zebra finch, but that does
not really change the overall game right?

I agree you should try to avoid the intermediate step of variant
calling. Sounds a little bit simpler to aim directly for the
consensus sequence of your regions of interest. Here is one way
to do this:


   ## Uses pileLettersAt() and alpahabetFrequency() internally.
   ## Arguments in ... are passed to alpahabetFrequency().
   ## The 'param' arg must be like in stackStringsFromBam() (see
   ## ?stackStringsFromBam for the details).
   alphabetFrequencyFromBam <- function(file, index=file, param, ...)
     param <- GenomicAlignments:::.normarg_param(param)
     region <- bamWhich(param)
     bamWhat(param) <- "seq"
     gal <- readGAlignmentsFromBam(file, index=index, param=param)
     seqlevels(gal) <- names(region)
     qseq <- mcols(gal)$seq
     at <- GRanges(names(region),
     piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
     alphabetFrequency(piles, ...)

   ## A naive consensus string extraction: at each position, select
   ## the letter with highest frequency. Insert letter "." if all
   ## frequencies are 0 at a given position. Insert letter "*" is
   ## there is a tie.
   consensusStringFromFrequencyMatrix <- function(fm)
     selection <- apply(fm, 2,
                        function(x) {
                          i <- which.max(x)
                          if (x[i] == 0L)
                          if (sum(x == x[i]) != 1L)
     paste0(c(DNA_BASES, ".", "*")[selection], collapse="")


   ## Get some genes.
   txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene")
   my_genes <- sample(genes(txdb), 20)

   ## Extract the corresponding consensus sequences from your BAM file.
   my_gene_seqs <- sapply(seq_along(my_genes),
     function(i) {
       af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i],
       fm <- t(af[ , DNA_BASES])

IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences
are extracted from the plus strand (and going from 5' to 3' along that
strand), even for genes that belong to the minus strand.

It's not going to be very efficient (approx. 1 s per gene).


> Thanks,
> Jim
> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioconductor mailing list