[BioC] Consensus sequence

Fri Sep 12 19:15:54 CEST 2014

Hi Jim,

Thanks for the feedback. Yesterday I added alphabetFrequencyFromBam()
to GenomicAlignments (in devel). Just applied your fix to it now.

Cheers,
H.

On 09/12/2014 08:06 AM, James W. MacDonald wrote:
> Hi Herve,
>
> Thanks! That works better than the kludgetastic function I set up.
>
> But note I had to add one line:
>
> alphabetFrequencyFromBam <- function(file, index=file, param, ...){
>      seqlevels(param) <- seqlevelsInUse(param)
>               ## added this
>      param <- GenomicAlignments:::.normarg_param(param)
>      region <- bamWhich(param)
>      bamWhat(param) <- "seq"
>      gal <- readGAlignmentsFromBam(file, index=index, param=param)
>      seqlevels(gal) <- names(region)
>      qseq <- mcols(gal)$seq
>      at <- GRanges(names(region),
>                    IRanges(start(region[[1L]]):end(region[[1L]]),
>                            width=1L))
>      piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
>                             at)
>      alphabetFrequency(piles, ...)
> }
>
> because of how normarg_param() treats GRanges with multiple seqlevels:
>
>  > bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626"))
> IRangesList of length 1
> $chr5
> IRanges of length 1
>        start     end width
> [1] 1307857 1308626   770
>
>  > bdnf <- GRanges("chr5", IRanges(1307857,1308626))
>  > bamWhich(GenomicAlignments:::.normarg_param(bdnf))
> IRangesList of length 1
> $chr5
> IRanges of length 1
>        start     end width
> [1] 1307857 1308626   770
>
>  > gns <- genes(txdb)
>  > bdnf2 <- gns["751584"]
>  > bdnf2
> GRanges with 1 range and 1 metadata column:
>           seqnames             ranges strand |         gene_id
>              <Rle>          <IRanges>  <Rle> | <CharacterList>
>    751584     chr5 [1307857, 1308626]      - |          751584
>    ---
>    seqlengths:
>                          chr1 ...      chrZ_KB442450_random
>                     118548696 ...                      2283
>  > bamWhich(GenomicAlignments:::.normarg_param(bdnf2))
> IRangesList of length 37096
> $chr1
> IRanges of length 0
>
> $chr2
> IRanges of length 0
>
> $chr3
> IRanges of length 0
>
> ...
> <37093 more elements>
>
> But this param instance still technically fulfills the criteria for
> stackStringsFromBam:
>
>  > unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2)))
> IRanges of length 1
>        start     end width       names
> [1] 1307857 1308626   770 chr5.751584
>
> Best,
>
> Jim
>
>
>
>
> On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     I should add that the code I sent previously makes sense if
>     the genes in your individual contains only a "small" number of
>     substitutions with respect to the reference genes. If they
>     contain indels and/or a high number of substitutions, it would
>     not be appropriated (it would still "work" but the consensus
>     sequences could not be trusted). The problem with indels is that
>     the alphabetFrequencyFromBam() function doesn't know how to
>     handle them (because it only knows how to report things that
>     align with nucleotides in the reference genome, but insertions
>     don't). Also having too many substitutions would make the
>     alignment process less reliable (would make it less likely
>     that reads coming from gene X align to the region
>     corresponding to gene X in the reference).
>
>     H.
>
>
>
>     On 09/10/2014 03:44 PM, Hervé Pagès wrote:
>
>         Hi Jim,
>
>         On 09/08/2014 07:32 PM, James W. MacDonald wrote:
>
>             I have a use case that I am not sure the best way to proceed.
>
>             I have a non-model organism for which we would like to know
>             the sequence
>             for a set of genes. It's a bird species, and I can for
>             example align
>             to the
>             zebra finch genome, then run bam_tally over a region to get
>             variants
>             where
>             my species differs from zebra finch, and then infer the
>             sequence based on
>             the reference and the variants.
>
>             But that seems harder than it should be. Is there some
>             function that I am
>             missing that I can feed a GRanges and get back the consensus
>             sequence for
>             that region, based on say a simple vote of the reads that
>             overlap it?
>
>
>         So basically you're going to use the same approach as if you had
>         reads
>         from a zebra finch individual. The real gene sequences for that
>         individual would probably also differ from the reference sequence,
>         maybe less than for your bird that is not zebra finch, but that does
>         not really change the overall game right?
>
>         I agree you should try to avoid the intermediate step of variant
>         calling. Sounds a little bit simpler to aim directly for the
>         consensus sequence of your regions of interest. Here is one way
>         to do this:
>
>             library(GenomicAlignments)
>
>             ## Uses pileLettersAt() and alpahabetFrequency() internally.
>             ## Arguments in ... are passed to alpahabetFrequency().
>             ## The 'param' arg must be like in stackStringsFromBam() (see
>             ## ?stackStringsFromBam for the details).
>             alphabetFrequencyFromBam <- function(file, index=file,
>         param, ...)
>             {
>               param <- GenomicAlignments:::.normarg___param(param)
>               region <- bamWhich(param)
>               bamWhat(param) <- "seq"
>               gal <- readGAlignmentsFromBam(file, index=index, param=param)
>               seqlevels(gal) <- names(region)
>               qseq <- mcols(gal)$seq
>               at <- GRanges(names(region),
>                             IRanges(start(region[[1L]]):__end(region[[1L]]),
>                                     width=1L))
>               piles <- pileLettersAt(qseq, seqnames(gal), start(gal),
>         cigar(gal),
>                                      at)
>               alphabetFrequency(piles, ...)
>             }
>
>             ## A naive consensus string extraction: at each position, select
>             ## the letter with highest frequency. Insert letter "." if all
>             ## frequencies are 0 at a given position. Insert letter "*" is
>             ## there is a tie.
>             consensusStringFromFrequencyMa__trix <- function(fm)
>             {
>               selection <- apply(fm, 2,
>                                  function(x) {
>                                    i <- which.max(x)
>                                    if (x[i] == 0L)
>                                      return(5L)
>                                    if (sum(x == x[i]) != 1L)
>                                      return(6L)
>                                    i
>                                  })
>               paste0(c(DNA_BASES, ".", "*")[selection], collapse="")
>             }
>
>         Then:
>
>             ## Get some genes.
>             library(GenomicFeatures)
>             txdb <- makeTranscriptDbFromUCSC(__genome="taeGut2",
>         table="refGene")
>             my_genes <- sample(genes(txdb), 20)
>
>             ## Extract the corresponding consensus sequences from your
>         BAM file.
>             my_gene_seqs <- sapply(seq_along(my_genes),
>               function(i) {
>                 af <- alphabetFrequencyFromBam(__bamfile, param=my_genes[i],
>                                                baseOnly=TRUE)
>                 fm <- t(af[ , DNA_BASES])
>                 consensusStringFromFrequencyMa__trix(fm)
>               })
>
>         IMPORTANT: The strand in 'my_genes' is ignored, that is, all
>         sequences
>         are extracted from the plus strand (and going from 5' to 3'
>         along that
>         strand), even for genes that belong to the minus strand.
>
>         It's not going to be very efficient (approx. 1 s per gene).
>
>         Cheers,
>         H.
>
>
>             Thanks,
>
>             Jim
>
>                  [[alternative HTML version deleted]]
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319