[BioC] Consensus sequence
Hervé Pagès
hpages at fhcrc.org
Fri Sep 12 19:15:54 CEST 2014
Hi Jim,
Thanks for the feedback. Yesterday I added alphabetFrequencyFromBam()
to GenomicAlignments (in devel). Just applied your fix to it now.
Cheers,
H.
On 09/12/2014 08:06 AM, James W. MacDonald wrote:
> Hi Herve,
>
> Thanks! That works better than the kludgetastic function I set up.
>
> But note I had to add one line:
>
> alphabetFrequencyFromBam <- function(file, index=file, param, ...){
> seqlevels(param) <- seqlevelsInUse(param)
> ## added this
> param <- GenomicAlignments:::.normarg_param(param)
> region <- bamWhich(param)
> bamWhat(param) <- "seq"
> gal <- readGAlignmentsFromBam(file, index=index, param=param)
> seqlevels(gal) <- names(region)
> qseq <- mcols(gal)$seq
> at <- GRanges(names(region),
> IRanges(start(region[[1L]]):end(region[[1L]]),
> width=1L))
> piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
> at)
> alphabetFrequency(piles, ...)
> }
>
> because of how normarg_param() treats GRanges with multiple seqlevels:
>
> > bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626"))
> IRangesList of length 1
> $chr5
> IRanges of length 1
> start end width
> [1] 1307857 1308626 770
>
> > bdnf <- GRanges("chr5", IRanges(1307857,1308626))
> > bamWhich(GenomicAlignments:::.normarg_param(bdnf))
> IRangesList of length 1
> $chr5
> IRanges of length 1
> start end width
> [1] 1307857 1308626 770
>
> > gns <- genes(txdb)
> > bdnf2 <- gns["751584"]
> > bdnf2
> GRanges with 1 range and 1 metadata column:
> seqnames ranges strand | gene_id
> <Rle> <IRanges> <Rle> | <CharacterList>
> 751584 chr5 [1307857, 1308626] - | 751584
> ---
> seqlengths:
> chr1 ... chrZ_KB442450_random
> 118548696 ... 2283
> > bamWhich(GenomicAlignments:::.normarg_param(bdnf2))
> IRangesList of length 37096
> $chr1
> IRanges of length 0
>
> $chr2
> IRanges of length 0
>
> $chr3
> IRanges of length 0
>
> ...
> <37093 more elements>
>
> But this param instance still technically fulfills the criteria for
> stackStringsFromBam:
>
> > unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2)))
> IRanges of length 1
> start end width names
> [1] 1307857 1308626 770 chr5.751584
>
> Best,
>
> Jim
>
>
>
>
> On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> I should add that the code I sent previously makes sense if
> the genes in your individual contains only a "small" number of
> substitutions with respect to the reference genes. If they
> contain indels and/or a high number of substitutions, it would
> not be appropriated (it would still "work" but the consensus
> sequences could not be trusted). The problem with indels is that
> the alphabetFrequencyFromBam() function doesn't know how to
> handle them (because it only knows how to report things that
> align with nucleotides in the reference genome, but insertions
> don't). Also having too many substitutions would make the
> alignment process less reliable (would make it less likely
> that reads coming from gene X align to the region
> corresponding to gene X in the reference).
>
> H.
>
>
>
> On 09/10/2014 03:44 PM, Hervé Pagès wrote:
>
> Hi Jim,
>
> On 09/08/2014 07:32 PM, James W. MacDonald wrote:
>
> I have a use case that I am not sure the best way to proceed.
>
> I have a non-model organism for which we would like to know
> the sequence
> for a set of genes. It's a bird species, and I can for
> example align
> to the
> zebra finch genome, then run bam_tally over a region to get
> variants
> where
> my species differs from zebra finch, and then infer the
> sequence based on
> the reference and the variants.
>
> But that seems harder than it should be. Is there some
> function that I am
> missing that I can feed a GRanges and get back the consensus
> sequence for
> that region, based on say a simple vote of the reads that
> overlap it?
>
>
> So basically you're going to use the same approach as if you had
> reads
> from a zebra finch individual. The real gene sequences for that
> individual would probably also differ from the reference sequence,
> maybe less than for your bird that is not zebra finch, but that does
> not really change the overall game right?
>
> I agree you should try to avoid the intermediate step of variant
> calling. Sounds a little bit simpler to aim directly for the
> consensus sequence of your regions of interest. Here is one way
> to do this:
>
> library(GenomicAlignments)
>
> ## Uses pileLettersAt() and alpahabetFrequency() internally.
> ## Arguments in ... are passed to alpahabetFrequency().
> ## The 'param' arg must be like in stackStringsFromBam() (see
> ## ?stackStringsFromBam for the details).
> alphabetFrequencyFromBam <- function(file, index=file,
> param, ...)
> {
> param <- GenomicAlignments:::.normarg___param(param)
> region <- bamWhich(param)
> bamWhat(param) <- "seq"
> gal <- readGAlignmentsFromBam(file, index=index, param=param)
> seqlevels(gal) <- names(region)
> qseq <- mcols(gal)$seq
> at <- GRanges(names(region),
> IRanges(start(region[[1L]]):__end(region[[1L]]),
> width=1L))
> piles <- pileLettersAt(qseq, seqnames(gal), start(gal),
> cigar(gal),
> at)
> alphabetFrequency(piles, ...)
> }
>
> ## A naive consensus string extraction: at each position, select
> ## the letter with highest frequency. Insert letter "." if all
> ## frequencies are 0 at a given position. Insert letter "*" is
> ## there is a tie.
> consensusStringFromFrequencyMa__trix <- function(fm)
> {
> selection <- apply(fm, 2,
> function(x) {
> i <- which.max(x)
> if (x[i] == 0L)
> return(5L)
> if (sum(x == x[i]) != 1L)
> return(6L)
> i
> })
> paste0(c(DNA_BASES, ".", "*")[selection], collapse="")
> }
>
> Then:
>
> ## Get some genes.
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromUCSC(__genome="taeGut2",
> table="refGene")
> my_genes <- sample(genes(txdb), 20)
>
> ## Extract the corresponding consensus sequences from your
> BAM file.
> my_gene_seqs <- sapply(seq_along(my_genes),
> function(i) {
> af <- alphabetFrequencyFromBam(__bamfile, param=my_genes[i],
> baseOnly=TRUE)
> fm <- t(af[ , DNA_BASES])
> consensusStringFromFrequencyMa__trix(fm)
> })
>
> IMPORTANT: The strand in 'my_genes' is ignored, that is, all
> sequences
> are extracted from the plus strand (and going from 5' to 3'
> along that
> strand), even for genes that belong to the minus strand.
>
> It's not going to be very efficient (approx. 1 s per gene).
>
> Cheers,
> H.
>
>
> Thanks,
>
> Jim
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list