[BioC] Deleting object rows while looping - II

Hervé Pagès hpages at fhcrc.org
Tue Apr 30 22:07:36 CEST 2013


Hi there,

Yep, lot of good advice from Steve. I'll only add a few things:

(1) Yes, as Steve said, building an ever growing 'cons' is not good.
     And since you don't seem to make any use of that 'cons' object
     in the end anyway, except sending it to a file with write.table(),
     why not send the little pieces to the file in each iteration by
     calling write.table() with append=TRUE, instead of sticking them
     to the ever growing 'cons' object?

(2) Why not load your alignments in a GappedAlignments (renamed
     GAlignments) object instead of a data.frame?

     Then all the code you use before entering the big loop is simpler:

       library(Rsamtools)
       M <- "GAAGC" ### specify MID (individual)
       lib <- 25 ### specify library

       # genotyping settings:
       foldcov <- 3
       prop.1.2 <- 0.7
       sum.1.2 <- 15
       het <- 1/3
       min <- 2

       # infile upload:
       folder <- 'C:\\Documents and Settings\\daniel\\Desktop\\temp'
       infile <- paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="")
       outfile <- paste(folder, '/', 'lib_', lib, '_', M, 
".consensus.txt", sep="")
       param <- ScanBamParam(what="seq", tag="Z3",
                             reverseComplement=TRUE)
       f <- readGappedAlignments(infile, param=param)

    Then, if you try to display the first 4 alignments, you'll see
    something like:

       > f[1:4]
       GappedAlignments with 4 alignments and 2 metadata columns:
             seqnames strand       cigar    qwidth     start
                <Rle>  <Rle> <character> <integer> <integer>
         [1]     seq1      +         36M        36         1
         [2]     seq1      +         35M        35         3
         [3]     seq1      +         35M        35         5
         [4]     seq1      +         36M        36         6
                   end     width      ngap |
             <integer> <integer> <integer> |
         [1]        36        36         0 |
         [2]        37        35         0 |
         [3]        39        35         0 |
         [4]        41        36         0 |
                                              seq        Z3
                                   <DNAStringSet> <logical>
         [1] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG      <NA>
         [2]  CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT      <NA>
         [3]  AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC      <NA>
         [4] GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT      <NA>
         ---
         seqlengths:
          seq1 seq2
          1575 1584

     Looks good huh? ;-)

       # extract the relevant data:
       mcols(f)$seq <- narrow(mcols(f)$seq, start=6, end=94)

     Just to make sure: the above call to narrow() is trimming the
     first 5 bases so if you want to trim the last 5 bases and if
     your sequence is a 99-mer, you are good. But if it's a 100-mer,
     you would need to use end=95.

       chrpos <- paste(seqnames(f), start(f), sep=' ')
       u.chrpos <- sort(unique(chrpos))
       length(f) #total number of alignments
       length(u.chrpos) #this gives the number of unique loci
       length(f)/length(u.chrpos) #mean coverage per locus

     Note that I would not call this the "mean coverage per locus".
     A given locus receives more coverage than just from the
     alignments that start exactly at that locus but also coverage
     from alignments that start before that locus and end at or
     after that locus.

     A more accurate way to get the "mean coverage per locus" would
     be to do something like:

       f_cvg <- coverage(f)
       cvg_at_loci <- lapply(names(f_cvg),
                        function(seqname) {
                          cvg <- f_cvg[[seqname]]
                          cvg[start(f)[as.character(seqnames(f)) == 
seqname]]})
       mean(do.call(c, cvg_at_loci))

     Note that will give you a much higher "mean coverage per locus"!


(3) The big loop:

       c.thr <- as.integer(length(f)/length(u.chrpos)*foldcov) # 
relative coverage threshold based on average coverage

       # big loop:
       for (i in seq_along(u.chrpos)) {
           idx <- which(chrpos == u.chrpos[i])
           stack <- mcols(f)$seq[idx]
           stack.le <- length(stack)
           if (stack.le <= c.thr) {
               ...
           }
       }

     OK so your loop is not doing anything at loci for which
     stack.le > c.thr, and also, by looking deeper in the code (not
     shown above) it seems that it won't do anything either at loci
     for which stack.le == 1 i.e. if there is only 1 read that starts
     at the loci.

     So maybe getting rid upfront of those loci where you're not
     doing anything would reduce substantially the nb of loci that
     you actually need to examine?

     This can easily be done with:

       # big loop:
       tmp <- Rle(sort(chrpos))
       my.u.chrpos <- runValue(tmp)[runLength(tmp) <= c.thr & 
runLength(tmp) >= 2]
       for (i in seq_along(my.u.chrpos)) {
           idx <- which(chrpos == my.u.chrpos[i])
           stack <- mcols(f)$seq[idx]
           ...
       }

(4) Don't compute twice the same thing, especially if it's
     (potentially) an expensive operation:

     Not good:

       if (length(unique(stack))==1){
           ...
       }
       if(length(unique(stack))>1){
           ...
       }

     Better:

       u.stack <- unique(stack)
       if (length(u.stack) == 1) {
           ...
       }
       if (length(u.stack) > 1) {
           ...
       }

(5) Put all this code in a function. Even if you only use it in 1
     place. It's always a good exercise, and it almost always lead
     to better code.

(6) Direct comparison of the query sequences of the reads that are
     aligned to the same locus is not always meaningful. For example it's
     not meaningful if the alignments are not all on the same strand,
     unless you use reverseComplement=FALSE (which is the default, maybe
     for a good reason). So you should not use reverseComplement=TRUE.
     Even if you use reverseComplement=FALSE, comparing the query
     sequences is still not meaningful if the alignments don't have
     the same cigar. But maybe you know that all your alignments have
     a simple cigar, e.g. "99M" or something like that? If not, then
     you would need to check that.

Hope this helps,

Cheers,
H.


On 04/30/2013 09:52 AM, Steve Lianoglou wrote:
> Hi,
>
> This is, unfortunately, a bit too much code for me to digest right
> now, but a few points.
>
> (1) You are building an ever growing "something" that you are storing
> in cons: you start with `cons <- NULL` then are reconstructing it w/in
> every iteration w/in the loop, ie you have `cons <- c(cons, something)
>
> This is going to kill you -- can you set `cons` to be a "whatever" of
> the length you are trying to build? (it looks like it will be of
> length (or nrow) `length(u.chrpos)`
>
> (2) You create a `stack` object in each loop iteration that is some
> DNAStringSet of consisting of all the sequences at the chromosome --
> but then you don't use it as a DNAStringSet ... in fact you revert it
> to a character (as.character(stack)) often -- don't make a `stack`
> DNAStringSet to begin with.
>
> (3) It seems you are outputting a character vector per row that you
> then have to parse -- why not just output a data.frame with the
> relevant columns that you will eventually use.
>
> (4) Here is a shell of a script you would use if you wanted to know
> how many reads and how many of them are unique start at each locus:
>
> library(data.table)
> ## ... your init stuff up to building `dat`, which is where data.table
> ## will take over
> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
> result <- dat[, {
>    ## .N is simply the number of rows in this group -- read the data.table docs.
>    ## the `seqs` column is injected into the scope evaluated here
>    ## so I can manipulate the sequences within this chr/pos group directly
>    list(n.reads=.N, n.unique=length(unique(seqs)))
> }, by=c('chr', 'pos')]
>
> Now look at `result`, you will have a data.table with as many rows as
> there are unique chr,pos pairs, and you will have counted the number
> of reads, and number of unique reads at each locus.
>
> (5) Hopefully the speed with which (4) completes will encourage you to
> take the time to read through the data.table documentation to apply it
> to your problem. It will take me more time than I have to read through
> your code ... giving a small subset of your data and the expected
> output would be a lot more helpful for people trying to help you.
> Fortunately, you will end up largely swapping in the core of your for
> loop within the data.table `j` expression, eg:
>
> result <- dat[, {
>    ## a large core of your for loop will go here
>    ## don't make a DNAStringSet in here because you are not
>    ## using it anyway!
> }, by=c('chr', 'pos')]
>
> Anyway, good luck with it.
>
> -steve
>
>
> On Mon, Apr 29, 2013 at 11:52 PM, Daniel Berner <daniel.berner at unibas.ch> wrote:
>> Martin, Steve
>>
>> Thanks for your suggestions, much appreciated! My core problem is that the
>> operations inside the loop are too complex for me to allow implementing any
>> function() and apply() combination (there are multiple interleaved if() lines
>> etc). and I was not aware of the data.table package, but this also looks
>> hard to me, given the many steps performed within the loop.
>>
>> to be more specific, I want to reduce a number of Illumina short reads at
>> genomic loci to diploid or haploid (depending on coverage) consensus
>> genotypes, by evaluating several test criteria. Below I am copying the full
>> code. the input is an alignment in BAM format.
>>
>> Any suggestion to make this faster is most welcome, as the code takes more
>> than a week to run for each individual, and I have to process 96 of them (I
>> have multicore, but still)! It seems to me that the rate-limiting step is the
>> subsetting; the other manipulations are fast enough to work in such a
>> primitive loop.
>>
>> cheers,
>> Daniel Berner
>> Zoological Institute
>> University of Basel
>>
>> #########################
>> library(Rsamtools)
>> rm(list=ls())
>> M<-"GAAGC" ### specify MID (individual)
>> lib<-25 ### specify library
>>
>> # genotyping settings:
>> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats
>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats
>> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2
>> het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het
>> min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2)
>>
>> # infile upload:
>>   #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs
>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs
>> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path
>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path
>>   #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="")
>> param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
>> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute
>>
>> # extract the relevant data:
>> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
>> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
>> chrpos<-paste(chr, posi, sep=' ')
>> u.chrpos<-sort(unique(chrpos))
>> length(posi) #total number of alignments
>> length(u.chrpos) #this gives the number of unique loci
>> length(posi)/length(u.chrpos) #mean coverage per locus
>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage
>>
>> # derive the data object to process:
>> dat<-data.frame(cbind(chrpos, seqs))
>>
>> # clean the environment and release memory:
>> rm(f, chr, posi, seqs, chrpos) #no longer used
>> gc()
>>
>> # process the dat object while writing to 'cons'
>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
>> cons<-NULL
>> for (i in 1:length(u.chrpos)){
>>      stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2]))
>>      stack.le<-length(stack)
>>      if(stack.le<=c.thr){
>>          #monomorphic:
>>          if (length(unique(stack))==1){
>>              #well covered:
>>              if(stack.le>=sum.1.2){
>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ")
>>                  gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ")
>>                  cons<-rbind(cons, gtp.X, gtp.Y)
>>              }
>>              #poorly covered:
>>              if((stack.le<sum.1.2) && (stack.le>=min)){
>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ")
>>                  cons<-rbind(cons, gtp.X)
>>              }
>>          }
>>          #polymorphic:
>>          if(length(unique(stack))>1){
>>              cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>>              #no repeat issue:
>>              if((sum(cnts)/stack.le)>prop.1.2){
>>                  #heterozygote:
>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ")
>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ")
>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>                  }
>>                  #homozygote (minor variant too rare, artifact):
>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ")
>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ")
>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>                  }
>>                  #coverage poor, gtp unclear, haploid call:
>>                  if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ")
>>                      cons<-rbind(cons, gtp.X)
>>                  }
>>              }
>>          }
>>      }
>> }
>> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file
>> ### END
>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Department of Bioinformatics and Computational Biology
> Genentech
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list