[BioC] Consensus sequence

Fri Sep 12 17:06:47 CEST 2014

Hi Herve,

Thanks! That works better than the kludgetastic function I set up.

But note I had to add one line:

alphabetFrequencyFromBam <- function(file, index=file, param, ...){
    seqlevels(param) <- seqlevelsInUse(param)
         ## added this
    param <- GenomicAlignments:::.normarg_param(param)
    region <- bamWhich(param)
    bamWhat(param) <- "seq"
    gal <- readGAlignmentsFromBam(file, index=index, param=param)
    seqlevels(gal) <- names(region)
    qseq <- mcols(gal)$seq
    at <- GRanges(names(region),
                  IRanges(start(region[[1L]]):end(region[[1L]]),
                          width=1L))
    piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
                           at)
    alphabetFrequency(piles, ...)
}

because of how normarg_param() treats GRanges with multiple seqlevels:

> bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626"))
IRangesList of length 1
$chr5
IRanges of length 1
      start     end width
[1] 1307857 1308626   770

> bdnf <- GRanges("chr5", IRanges(1307857,1308626))
> bamWhich(GenomicAlignments:::.normarg_param(bdnf))
IRangesList of length 1
$chr5
IRanges of length 1
      start     end width
[1] 1307857 1308626   770

> gns <- genes(txdb)
> bdnf2 <- gns["751584"]
> bdnf2
GRanges with 1 range and 1 metadata column:
         seqnames             ranges strand |         gene_id
            <Rle>          <IRanges>  <Rle> | <CharacterList>
  751584     chr5 [1307857, 1308626]      - |          751584
  ---
  seqlengths:
                        chr1 ...      chrZ_KB442450_random
                   118548696 ...                      2283
> bamWhich(GenomicAlignments:::.normarg_param(bdnf2))
IRangesList of length 37096
$chr1
IRanges of length 0

$chr2
IRanges of length 0

$chr3
IRanges of length 0

...
<37093 more elements>

But this param instance still technically fulfills the criteria for
stackStringsFromBam:

> unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2)))
IRanges of length 1
      start     end width       names
[1] 1307857 1308626   770 chr5.751584

Best,

Jim

On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> I should add that the code I sent previously makes sense if
> the genes in your individual contains only a "small" number of
> substitutions with respect to the reference genes. If they
> contain indels and/or a high number of substitutions, it would
> not be appropriated (it would still "work" but the consensus
> sequences could not be trusted). The problem with indels is that
> the alphabetFrequencyFromBam() function doesn't know how to
> handle them (because it only knows how to report things that
> align with nucleotides in the reference genome, but insertions
> don't). Also having too many substitutions would make the
> alignment process less reliable (would make it less likely
> that reads coming from gene X align to the region
> corresponding to gene X in the reference).
>
> H.
>
>
>
> On 09/10/2014 03:44 PM, Hervé Pagès wrote:
>
>> Hi Jim,
>>
>> On 09/08/2014 07:32 PM, James W. MacDonald wrote:
>>
>>> I have a use case that I am not sure the best way to proceed.
>>>
>>> I have a non-model organism for which we would like to know the sequence
>>> for a set of genes. It's a bird species, and I can for example align
>>> to the
>>> zebra finch genome, then run bam_tally over a region to get  variants
>>> where
>>> my species differs from zebra finch, and then infer the sequence based on
>>> the reference and the variants.
>>>
>>> But that seems harder than it should be. Is there some function that I am
>>> missing that I can feed a GRanges and get back the consensus sequence for
>>> that region, based on say a simple vote of the reads that overlap it?
>>>
>>
>> So basically you're going to use the same approach as if you had reads
>> from a zebra finch individual. The real gene sequences for that
>> individual would probably also differ from the reference sequence,
>> maybe less than for your bird that is not zebra finch, but that does
>> not really change the overall game right?
>>
>> I agree you should try to avoid the intermediate step of variant
>> calling. Sounds a little bit simpler to aim directly for the
>> consensus sequence of your regions of interest. Here is one way
>> to do this:
>>
>>    library(GenomicAlignments)
>>
>>    ## Uses pileLettersAt() and alpahabetFrequency() internally.
>>    ## Arguments in ... are passed to alpahabetFrequency().
>>    ## The 'param' arg must be like in stackStringsFromBam() (see
>>    ## ?stackStringsFromBam for the details).
>>    alphabetFrequencyFromBam <- function(file, index=file, param, ...)
>>    {
>>      param <- GenomicAlignments:::.normarg_param(param)
>>      region <- bamWhich(param)
>>      bamWhat(param) <- "seq"
>>      gal <- readGAlignmentsFromBam(file, index=index, param=param)
>>      seqlevels(gal) <- names(region)
>>      qseq <- mcols(gal)$seq
>>      at <- GRanges(names(region),
>>                    IRanges(start(region[[1L]]):end(region[[1L]]),
>>                            width=1L))
>>      piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
>>                             at)
>>      alphabetFrequency(piles, ...)
>>    }
>>
>>    ## A naive consensus string extraction: at each position, select
>>    ## the letter with highest frequency. Insert letter "." if all
>>    ## frequencies are 0 at a given position. Insert letter "*" is
>>    ## there is a tie.
>>    consensusStringFromFrequencyMatrix <- function(fm)
>>    {
>>      selection <- apply(fm, 2,
>>                         function(x) {
>>                           i <- which.max(x)
>>                           if (x[i] == 0L)
>>                             return(5L)
>>                           if (sum(x == x[i]) != 1L)
>>                             return(6L)
>>                           i
>>                         })
>>      paste0(c(DNA_BASES, ".", "*")[selection], collapse="")
>>    }
>>
>> Then:
>>
>>    ## Get some genes.
>>    library(GenomicFeatures)
>>    txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene")
>>    my_genes <- sample(genes(txdb), 20)
>>
>>    ## Extract the corresponding consensus sequences from your BAM file.
>>    my_gene_seqs <- sapply(seq_along(my_genes),
>>      function(i) {
>>        af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i],
>>                                       baseOnly=TRUE)
>>        fm <- t(af[ , DNA_BASES])
>>        consensusStringFromFrequencyMatrix(fm)
>>      })
>>
>> IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences
>> are extracted from the plus strand (and going from 5' to 3' along that
>> strand), even for genes that belong to the minus strand.
>>
>> It's not going to be very efficient (approx. 1 s per gene).
>>
>> Cheers,
>> H.
>>
>>
>>> Thanks,
>>>
>>> Jim
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

	[[alternative HTML version deleted]]