[BioC] Consensus sequence
James W. MacDonald
jmacdon at uw.edu
Fri Sep 12 17:06:47 CEST 2014
Hi Herve,
Thanks! That works better than the kludgetastic function I set up.
But note I had to add one line:
alphabetFrequencyFromBam <- function(file, index=file, param, ...){
seqlevels(param) <- seqlevelsInUse(param)
## added this
param <- GenomicAlignments:::.normarg_param(param)
region <- bamWhich(param)
bamWhat(param) <- "seq"
gal <- readGAlignmentsFromBam(file, index=index, param=param)
seqlevels(gal) <- names(region)
qseq <- mcols(gal)$seq
at <- GRanges(names(region),
IRanges(start(region[[1L]]):end(region[[1L]]),
width=1L))
piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
at)
alphabetFrequency(piles, ...)
}
because of how normarg_param() treats GRanges with multiple seqlevels:
> bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626"))
IRangesList of length 1
$chr5
IRanges of length 1
start end width
[1] 1307857 1308626 770
> bdnf <- GRanges("chr5", IRanges(1307857,1308626))
> bamWhich(GenomicAlignments:::.normarg_param(bdnf))
IRangesList of length 1
$chr5
IRanges of length 1
start end width
[1] 1307857 1308626 770
> gns <- genes(txdb)
> bdnf2 <- gns["751584"]
> bdnf2
GRanges with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <CharacterList>
751584 chr5 [1307857, 1308626] - | 751584
---
seqlengths:
chr1 ... chrZ_KB442450_random
118548696 ... 2283
> bamWhich(GenomicAlignments:::.normarg_param(bdnf2))
IRangesList of length 37096
$chr1
IRanges of length 0
$chr2
IRanges of length 0
$chr3
IRanges of length 0
...
<37093 more elements>
But this param instance still technically fulfills the criteria for
stackStringsFromBam:
> unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2)))
IRanges of length 1
start end width names
[1] 1307857 1308626 770 chr5.751584
Best,
Jim
On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> I should add that the code I sent previously makes sense if
> the genes in your individual contains only a "small" number of
> substitutions with respect to the reference genes. If they
> contain indels and/or a high number of substitutions, it would
> not be appropriated (it would still "work" but the consensus
> sequences could not be trusted). The problem with indels is that
> the alphabetFrequencyFromBam() function doesn't know how to
> handle them (because it only knows how to report things that
> align with nucleotides in the reference genome, but insertions
> don't). Also having too many substitutions would make the
> alignment process less reliable (would make it less likely
> that reads coming from gene X align to the region
> corresponding to gene X in the reference).
>
> H.
>
>
>
> On 09/10/2014 03:44 PM, Hervé Pagès wrote:
>
>> Hi Jim,
>>
>> On 09/08/2014 07:32 PM, James W. MacDonald wrote:
>>
>>> I have a use case that I am not sure the best way to proceed.
>>>
>>> I have a non-model organism for which we would like to know the sequence
>>> for a set of genes. It's a bird species, and I can for example align
>>> to the
>>> zebra finch genome, then run bam_tally over a region to get variants
>>> where
>>> my species differs from zebra finch, and then infer the sequence based on
>>> the reference and the variants.
>>>
>>> But that seems harder than it should be. Is there some function that I am
>>> missing that I can feed a GRanges and get back the consensus sequence for
>>> that region, based on say a simple vote of the reads that overlap it?
>>>
>>
>> So basically you're going to use the same approach as if you had reads
>> from a zebra finch individual. The real gene sequences for that
>> individual would probably also differ from the reference sequence,
>> maybe less than for your bird that is not zebra finch, but that does
>> not really change the overall game right?
>>
>> I agree you should try to avoid the intermediate step of variant
>> calling. Sounds a little bit simpler to aim directly for the
>> consensus sequence of your regions of interest. Here is one way
>> to do this:
>>
>> library(GenomicAlignments)
>>
>> ## Uses pileLettersAt() and alpahabetFrequency() internally.
>> ## Arguments in ... are passed to alpahabetFrequency().
>> ## The 'param' arg must be like in stackStringsFromBam() (see
>> ## ?stackStringsFromBam for the details).
>> alphabetFrequencyFromBam <- function(file, index=file, param, ...)
>> {
>> param <- GenomicAlignments:::.normarg_param(param)
>> region <- bamWhich(param)
>> bamWhat(param) <- "seq"
>> gal <- readGAlignmentsFromBam(file, index=index, param=param)
>> seqlevels(gal) <- names(region)
>> qseq <- mcols(gal)$seq
>> at <- GRanges(names(region),
>> IRanges(start(region[[1L]]):end(region[[1L]]),
>> width=1L))
>> piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
>> at)
>> alphabetFrequency(piles, ...)
>> }
>>
>> ## A naive consensus string extraction: at each position, select
>> ## the letter with highest frequency. Insert letter "." if all
>> ## frequencies are 0 at a given position. Insert letter "*" is
>> ## there is a tie.
>> consensusStringFromFrequencyMatrix <- function(fm)
>> {
>> selection <- apply(fm, 2,
>> function(x) {
>> i <- which.max(x)
>> if (x[i] == 0L)
>> return(5L)
>> if (sum(x == x[i]) != 1L)
>> return(6L)
>> i
>> })
>> paste0(c(DNA_BASES, ".", "*")[selection], collapse="")
>> }
>>
>> Then:
>>
>> ## Get some genes.
>> library(GenomicFeatures)
>> txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene")
>> my_genes <- sample(genes(txdb), 20)
>>
>> ## Extract the corresponding consensus sequences from your BAM file.
>> my_gene_seqs <- sapply(seq_along(my_genes),
>> function(i) {
>> af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i],
>> baseOnly=TRUE)
>> fm <- t(af[ , DNA_BASES])
>> consensusStringFromFrequencyMatrix(fm)
>> })
>>
>> IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences
>> are extracted from the plus strand (and going from 5' to 3' along that
>> strand), even for genes that belong to the minus strand.
>>
>> It's not going to be very efficient (approx. 1 s per gene).
>>
>> Cheers,
>> H.
>>
>>
>>> Thanks,
>>>
>>> Jim
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
More information about the Bioconductor
mailing list