[R] how to work with long vectors

Martin Morgan mtmorgan at fhcrc.org
Fri Nov 5 17:42:47 CET 2010


On 11/05/2010 09:13 AM, Changbin Du wrote:
> HI, Phil,
> 
> I used the following codes and run it overnight for 15 hours, this morning,
> I stopped it. It seems it is still not efficient.
> 
> 
>>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
>> names(matt)<-c("id","reads")
> 
>> dim(matt)
> [1] 3384766       2

[snip]

>>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>>
>>>  HI, Dear R community,
>>>>
>>>> I have one data set like this,  What I want to do is to calculate the
>>>> cumulative coverage. The following codes works for small data set (#rows
>>>> =
>>>> 100), but when feed the whole data set,  it still running after 24 hours.
>>>> Can someone give some suggestions for long vector?
>>>>
>>>> id                reads
>>>> Contig79:1    4
>>>> Contig79:2    8
>>>> Contig79:3    13
>>>> Contig79:4    14
>>>> Contig79:5    17
>>>> Contig79:6    20
>>>> Contig79:7    25
>>>> Contig79:8    27
>>>> Contig79:9    32
>>>> Contig79:10    33
>>>> Contig79:11    34
>>>>
>>>>
>>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>>> sep="\t", skip=0, header=F,fill=T) #
>>>> dim(matt)
>>>> [1] 3384766       2
>>>>
>>>> matt_plot<-function(matt, outputfile) {
>>>> names(matt)<-c("id","reads")
>>>>
>>>> cover<-matt$reads
>>>>
>>>>
>>>> #calculate the cumulative coverage.
>>>> + cover_per<-function (data) {
>>>> + output<-numeric(0)
>>>> + for (i in data) {
>>>> +           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>>> +           output<-c(output, x)
>>>> +                 }
>>>> + return(output)
>>>> + }
>>>>
>>>>
>>>> result<-cover_per(cover)

Hi Changbin

If I understand correctly, your contigs 'start' at position 1, and have
'width' equal to matt$reads. You'd like to know the coverage at the last
covered location of each contig in matt$reads.

## first time only
source("http://bioconductor.org")
biocLite("IRanges")

##
library(IRanges)
contigs = IRanges(start=1, width=matt$reads)
cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
as.vector(cvg[matt$reads]) / nrow(matt)  ## at the end of each contig

for a larger data set:

> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100))))
> contigs = IRanges(start=1, width=matt$reads)
> system.time(cvg <- coverage(contigs))
   user  system elapsed
  5.145   0.050   5.202

Martin

>>>>
>>>>
>>>> Thanks so much!
>>>>
>>>>
>>>> --
>>>> Sincerely,
>>>> Changbin
>>>> --
>>>>
>>>>        [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>> Changbin Du
>> DOE Joint Genome Institute
>> Bldg 400 Rm 457
>> 2800 Mitchell Dr
>> Walnut Creet, CA 94598
>> Phone: 925-927-2856
>>
>>
>>
> 
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the R-help mailing list