[BioC] Deleting object rows while looping - II

Steve Lianoglou lianoglou.steve at gene.com
Tue Apr 30 18:52:15 CEST 2013


Hi,

This is, unfortunately, a bit too much code for me to digest right
now, but a few points.

(1) You are building an ever growing "something" that you are storing
in cons: you start with `cons <- NULL` then are reconstructing it w/in
every iteration w/in the loop, ie you have `cons <- c(cons, something)

This is going to kill you -- can you set `cons` to be a "whatever" of
the length you are trying to build? (it looks like it will be of
length (or nrow) `length(u.chrpos)`

(2) You create a `stack` object in each loop iteration that is some
DNAStringSet of consisting of all the sequences at the chromosome --
but then you don't use it as a DNAStringSet ... in fact you revert it
to a character (as.character(stack)) often -- don't make a `stack`
DNAStringSet to begin with.

(3) It seems you are outputting a character vector per row that you
then have to parse -- why not just output a data.frame with the
relevant columns that you will eventually use.

(4) Here is a shell of a script you would use if you wanted to know
how many reads and how many of them are unique start at each locus:

library(data.table)
## ... your init stuff up to building `dat`, which is where data.table
## will take over
dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
result <- dat[, {
  ## .N is simply the number of rows in this group -- read the data.table docs.
  ## the `seqs` column is injected into the scope evaluated here
  ## so I can manipulate the sequences within this chr/pos group directly
  list(n.reads=.N, n.unique=length(unique(seqs)))
}, by=c('chr', 'pos')]

Now look at `result`, you will have a data.table with as many rows as
there are unique chr,pos pairs, and you will have counted the number
of reads, and number of unique reads at each locus.

(5) Hopefully the speed with which (4) completes will encourage you to
take the time to read through the data.table documentation to apply it
to your problem. It will take me more time than I have to read through
your code ... giving a small subset of your data and the expected
output would be a lot more helpful for people trying to help you.
Fortunately, you will end up largely swapping in the core of your for
loop within the data.table `j` expression, eg:

result <- dat[, {
  ## a large core of your for loop will go here
  ## don't make a DNAStringSet in here because you are not
  ## using it anyway!
}, by=c('chr', 'pos')]

Anyway, good luck with it.

-steve


On Mon, Apr 29, 2013 at 11:52 PM, Daniel Berner <daniel.berner at unibas.ch> wrote:
> Martin, Steve
>
> Thanks for your suggestions, much appreciated! My core problem is that the
> operations inside the loop are too complex for me to allow implementing any
> function() and apply() combination (there are multiple interleaved if() lines
> etc). and I was not aware of the data.table package, but this also looks
> hard to me, given the many steps performed within the loop.
>
> to be more specific, I want to reduce a number of Illumina short reads at
> genomic loci to diploid or haploid (depending on coverage) consensus
> genotypes, by evaluating several test criteria. Below I am copying the full
> code. the input is an alignment in BAM format.
>
> Any suggestion to make this faster is most welcome, as the code takes more
> than a week to run for each individual, and I have to process 96 of them (I
> have multicore, but still)! It seems to me that the rate-limiting step is the
> subsetting; the other manipulations are fast enough to work in such a
> primitive loop.
>
> cheers,
> Daniel Berner
> Zoological Institute
> University of Basel
>
> #########################
> library(Rsamtools)
> rm(list=ls())
> M<-"GAAGC" ### specify MID (individual)
> lib<-25 ### specify library
>
> # genotyping settings:
> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats
> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats
> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2
> het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het
> min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2)
>
> # infile upload:
>  #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs
> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs
> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path
> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path
>  #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="")
> param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute
>
> # extract the relevant data:
> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
> chrpos<-paste(chr, posi, sep=' ')
> u.chrpos<-sort(unique(chrpos))
> length(posi) #total number of alignments
> length(u.chrpos) #this gives the number of unique loci
> length(posi)/length(u.chrpos) #mean coverage per locus
> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage
>
> # derive the data object to process:
> dat<-data.frame(cbind(chrpos, seqs))
>
> # clean the environment and release memory:
> rm(f, chr, posi, seqs, chrpos) #no longer used
> gc()
>
> # process the dat object while writing to 'cons'
> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
> cons<-NULL
> for (i in 1:length(u.chrpos)){
>     stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2]))
>     stack.le<-length(stack)
>     if(stack.le<=c.thr){
>         #monomorphic:
>         if (length(unique(stack))==1){
>             #well covered:
>             if(stack.le>=sum.1.2){
>                 gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ")
>                 gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ")
>                 cons<-rbind(cons, gtp.X, gtp.Y)
>             }
>             #poorly covered:
>             if((stack.le<sum.1.2) && (stack.le>=min)){
>                 gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ")
>                 cons<-rbind(cons, gtp.X)
>             }
>         }
>         #polymorphic:
>         if(length(unique(stack))>1){
>             cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>             #no repeat issue:
>             if((sum(cnts)/stack.le)>prop.1.2){
>                 #heterozygote:
>                 if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>                     gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ")
>                     gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ")
>                     cons<-rbind(cons, gtp.X, gtp.Y)
>                 }
>                 #homozygote (minor variant too rare, artifact):
>                 if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote
>                     gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ")
>                     gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ")
>                     cons<-rbind(cons, gtp.X, gtp.Y)
>                 }
>                 #coverage poor, gtp unclear, haploid call:
>                 if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>                     gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ")
>                     cons<-rbind(cons, gtp.X)
>                 }
>             }
>         }
>     }
> }
> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file
> ### END
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech



More information about the Bioconductor mailing list