[BioC] Consensus sequence

Thu Sep 11 01:33:37 CEST 2014

I should add that the code I sent previously makes sense if
the genes in your individual contains only a "small" number of
substitutions with respect to the reference genes. If they
contain indels and/or a high number of substitutions, it would
not be appropriated (it would still "work" but the consensus
sequences could not be trusted). The problem with indels is that
the alphabetFrequencyFromBam() function doesn't know how to
handle them (because it only knows how to report things that
align with nucleotides in the reference genome, but insertions
don't). Also having too many substitutions would make the
alignment process less reliable (would make it less likely
that reads coming from gene X align to the region
corresponding to gene X in the reference).

H.

On 09/10/2014 03:44 PM, Hervé Pagès wrote:
> Hi Jim,
>
> On 09/08/2014 07:32 PM, James W. MacDonald wrote:
>> I have a use case that I am not sure the best way to proceed.
>>
>> I have a non-model organism for which we would like to know the sequence
>> for a set of genes. It's a bird species, and I can for example align
>> to the
>> zebra finch genome, then run bam_tally over a region to get  variants
>> where
>> my species differs from zebra finch, and then infer the sequence based on
>> the reference and the variants.
>>
>> But that seems harder than it should be. Is there some function that I am
>> missing that I can feed a GRanges and get back the consensus sequence for
>> that region, based on say a simple vote of the reads that overlap it?
>
> So basically you're going to use the same approach as if you had reads
> from a zebra finch individual. The real gene sequences for that
> individual would probably also differ from the reference sequence,
> maybe less than for your bird that is not zebra finch, but that does
> not really change the overall game right?
>
> I agree you should try to avoid the intermediate step of variant
> calling. Sounds a little bit simpler to aim directly for the
> consensus sequence of your regions of interest. Here is one way
> to do this:
>
>    library(GenomicAlignments)
>
>    ## Uses pileLettersAt() and alpahabetFrequency() internally.
>    ## Arguments in ... are passed to alpahabetFrequency().
>    ## The 'param' arg must be like in stackStringsFromBam() (see
>    ## ?stackStringsFromBam for the details).
>    alphabetFrequencyFromBam <- function(file, index=file, param, ...)
>    {
>      param <- GenomicAlignments:::.normarg_param(param)
>      region <- bamWhich(param)
>      bamWhat(param) <- "seq"
>      gal <- readGAlignmentsFromBam(file, index=index, param=param)
>      seqlevels(gal) <- names(region)
>      qseq <- mcols(gal)$seq
>      at <- GRanges(names(region),
>                    IRanges(start(region[[1L]]):end(region[[1L]]),
>                            width=1L))
>      piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
>                             at)
>      alphabetFrequency(piles, ...)
>    }
>
>    ## A naive consensus string extraction: at each position, select
>    ## the letter with highest frequency. Insert letter "." if all
>    ## frequencies are 0 at a given position. Insert letter "*" is
>    ## there is a tie.
>    consensusStringFromFrequencyMatrix <- function(fm)
>    {
>      selection <- apply(fm, 2,
>                         function(x) {
>                           i <- which.max(x)
>                           if (x[i] == 0L)
>                             return(5L)
>                           if (sum(x == x[i]) != 1L)
>                             return(6L)
>                           i
>                         })
>      paste0(c(DNA_BASES, ".", "*")[selection], collapse="")
>    }
>
> Then:
>
>    ## Get some genes.
>    library(GenomicFeatures)
>    txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene")
>    my_genes <- sample(genes(txdb), 20)
>
>    ## Extract the corresponding consensus sequences from your BAM file.
>    my_gene_seqs <- sapply(seq_along(my_genes),
>      function(i) {
>        af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i],
>                                       baseOnly=TRUE)
>        fm <- t(af[ , DNA_BASES])
>        consensusStringFromFrequencyMatrix(fm)
>      })
>
> IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences
> are extracted from the plus strand (and going from 5' to 3' along that
> strand), even for genes that belong to the minus strand.
>
> It's not going to be very efficient (approx. 1 s per gene).
>
> Cheers,
> H.
>
>>
>> Thanks,
>>
>> Jim
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319