[R] making code (loop) more efficient

Wed Dec 16 03:20:13 CET 2020

Hi Jim,

Maybe my post is confusing.

"dd" came from my slow code, and I don't use it in the parallelized code.

For example, for one of my files:

> i = "retina.ENSG00000120647.wgt.RDat"
> a <- get(load(i))
> head(a)
                            top1          blup lasso enet
rs4980905:184404:C:A  0.07692622 -1.881795e-04     0    0
rs7978751:187541:G:C  0.62411425  9.934994e-04     0    0
rs2368831:188285:C:T  0.69529158  1.211028e-03     0    0
...

The slow code was posted only to show what was running very slowly; it
does run. What I really need help with is fixing the parallelized version.
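
One possible reading of the "object 'blup' not found" error, as a minimal
sketch rather than a confirmed diagnosis: if "a" is a matrix (an assumption;
the head(a) output above is consistent with it but does not prove it), then
a["blup"] looks for an element named "blup" instead of a column, returns NA,
and the resulting data.table has no column called blup. The toy matrix below
reuses the three rows shown above and stands in for a real file. The explicit
column names and the use of unname() and fixed = TRUE are choices made for
this illustration.

```r
library(data.table)

# Toy stand-in for one loaded weight matrix (values copied from head(a) above).
# Assumption: the real object is a numeric matrix with SNP ids as row names.
a <- matrix(
  c(0.07692622, -1.881795e-04, 0, 0,
    0.62411425,  9.934994e-04, 0, 0,
    0.69529158,  1.211028e-03, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("rs4980905:184404:C:A", "rs7978751:187541:G:C", "rs2368831:188285:C:T"),
    c("top1", "blup", "lasso", "enet")
  )
)

# A matrix has dimnames but no element names, so indexing with a single
# string does not select the column; it yields NA.
a["blup"]

# Select the column with an empty row index and name the columns explicitly.
data <- data.table(names = rownames(a), blup = unname(a[, "blup"]))

# Split "rsid:position:ref:eff" on ":" and drop the second piece (position).
nm1 <- c("rsid", "ref_allele", "eff_allele")
data[, (nm1) := tstrsplit(names, ":", fixed = TRUE)[-2]]

out <- data[, .(rsid, weight = blup, ref_allele, eff_allele)]
print(out)
```

If this matches the real files, the same two changes (a[, "blup"] and the
named columns) would go inside the foreach body; checking is.matrix(a) and
colnames(a) on a file that fails would confirm or rule out the assumption.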

On Tue, Dec 15, 2020 at 7:35 PM Jim Lemon <drjimlemon using gmail.com> wrote:
>
> Hi Ana,
> My guess is that in your second code fragment you are assigning the
> rownames of "a" and the _values_ contained in a$blup to the data.table
> "data". As I don't have much experience with data tables I may be
> wrong, but I suspect that the column name "blup" may not be visible or
> even present in "data". I don't see it in "dd" above this code
> fragment.
>
> Jim
>
> On Wed, Dec 16, 2020 at 11:12 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >
> > Hello,
> >
> > I wrote some terribly inefficient code that takes forever, but it does run.
> >
> > library(dplyr)
> > library(splitstackshape)
> >
> > datalist = list()
> > files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T)
> >
> > for(i in files)
> > {
> > a<-get(load(i))
> > names <- rownames(a)
> > data <- as.data.frame(cbind(names,a))
> > rownames(data) <- NULL
> > dd=na.omit(concat.split.multiple(data = data, split.cols = c("names"),
> > seps = ":"))
> > dd=select(dd,names_1,blup,names_3,names_4)
> > colnames(dd)=c("rsid","weight","ref_allele","eff_allele")
> > dd$WGT<-i
> > datalist[[i]] <- dd # add it to your list
> > }
> >
> > big_data = do.call(rbind, datalist)
> >
> > There are 17345 RDat files that this loop has to go through, and each
> > file has approximately 10,000 lines. All RDat files can be downloaded from
> > here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828 and
> > they are compressed in this file: GSE115828_retina_TWAS_wgts.tar.gz .
> > A subset of 3 of those .RDat files is here:
> > https://github.com/montenegrina/sample
> >
> > For one of those files, say i="retina.ENSG00000135776.wgt.RDat"
> > dd looks like this:
> >
> > > head(dd)
> >           rsid        weight ref_allele eff_allele
> > 1:  rs72763981  9.376766e-09          C          G
> > 2: rs144383755 -2.093346e-09          A          G
> > 3:   rs1925717  1.511376e-08          T          C
> > 4:  rs61827307 -1.625302e-08          C          A
> > 5:  rs61827308 -1.625302e-08          G          C
> > 6: rs199623136 -9.128354e-10         GC          G
> >                            WGT
> > 1: retina.ENSG00000135776.wgt.RDat
> > 2: retina.ENSG00000135776.wgt.RDat
> > 3: retina.ENSG00000135776.wgt.RDat
> > 4: retina.ENSG00000135776.wgt.RDat
> > 5: retina.ENSG00000135776.wgt.RDat
> > 6: retina.ENSG00000135776.wgt.RDat
> >
> > So, in an attempt to parallelize this, I did the following:
> >
> > library(parallel)
> > library(data.table)
> > library(foreach)
> > library(doSNOW)
> >
> > n <-  parallel::detectCores()
> > cl <- parallel::makeCluster(n, type = "SOCK")
> > doSNOW::registerDoSNOW(cl)
> > files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T)
> >
> > lst_out <- foreach::foreach(i = seq_along(files),
> >                   .packages = c("data.table") ) %dopar% {
> >
> >    a <- get(load(files[i]))
> >    names <- rownames(a)
> >    data <- data.table(names, a["blup"])
> >    nm1 <- c("rsid", "ref_allele", "eff_allele")
> >    data[,  (nm1) := tstrsplit(names, ":")[-2]]
> >    return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][,
> >                WGT := files[i]][])
> >  }
> > parallel::stopCluster(cl)
> >
> > big_data <- rbindlist(lst_out)
> >
> > I am getting this error:
> >
> > Error in { : task 7 failed - "object 'blup' not found"
> > > parallel::stopCluster(cl)
> >
> > Can you please advise,
> > Ana
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.