[BioC] Deleting object rows while looping - II

Steve Lianoglou lianoglou.steve at gene.com
Wed May 1 18:50:05 CEST 2013


Hi folks,

Wow -- that's a lot of great suggestions coming straight from the
bioc-wizards themselves.

I'm still not going to jump into the details on the logic of what
Daniel *really* wants, just want to make a comment on Martin's last
point:

> This type of operation is well-suited to data.table, though I'm not sure
> enough of the syntax and implementation to know whether Steve's
>
>
> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
> result <- dat[, {
>   list(n.reads=.N, n.unique=length(unique(seqs)))
> }, by=c('chr', 'pos')]
>
> is implemented efficiently -- I'm sure the .N is; just not whether clever
> thinking is used behind the scenes to avoid looping through function(x)
> length(unique(x)). The syntax is certainly clearer than my 'view' approach.

I only really used this as a pedagogical example to show that one
could access subsets of the columns directly by name within the `j`
expression of the `[.data.table` function.
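
A minimal sketch of that point, assuming the data.table package is installed. The toy values for chr, pos and seqs, and the output column names first.seq and seq.width, are invented for illustration and are not Daniel's data:

```r
library(data.table)

# Toy data, invented for illustration only
dat <- data.table(
  chr  = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  pos  = c(100L, 100L, 200L, 50L, 50L),
  seqs = c("ACGT", "ACGT", "TTGA", "GGCC", "GGCA")
)

# Inside `j`, columns are visible as plain variables, restricted to the
# rows of the current (chr, pos) group.
result <- dat[, list(first.seq = seqs[1], seq.width = nchar(seqs[1])),
              by = c("chr", "pos")]
```

Here `seqs` is referenced by name, with no `dat$seqs` and no explicit subsetting, and each group sees only its own rows.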

The .N is essentially a no-op to call as it is already computed for
you, but repeatedly calling a function within each grouped subset will
incur the overhead of a function call within each subgroup.
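
One way to sidestep the per-group call, sketched under assumptions: the toy data and the names per.group, n.reads, n.unique and two.pass are made up here, and this has not been benchmarked against the original. The idea is to drop duplicate (chr, pos, seqs) rows once up front and then let .N do the counting:

```r
library(data.table)

# Toy data, invented for illustration only
dat <- data.table(
  chr  = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  pos  = c(100L, 100L, 200L, 50L, 50L),
  seqs = c("ACGT", "ACGT", "TTGA", "GGCC", "GGCA")
)

# Version from the thread: length(unique(.)) is evaluated once per group.
per.group <- dat[, list(n.reads = .N, n.unique = length(unique(seqs))),
                 by = c("chr", "pos")]

# Alternative sketch: deduplicate (chr, pos, seqs) rows in one pass, then
# count with .N, so unique() is no longer called inside every group.
n.reads  <- dat[, list(n.reads = .N), by = c("chr", "pos")]
n.unique <- unique(dat, by = c("chr", "pos", "seqs"))[
  , list(n.unique = .N), by = c("chr", "pos")]
two.pass <- merge(n.reads, n.unique, by = c("chr", "pos"))
```

Both versions give the same counts on the toy data. Whether the second is actually faster depends on the number and size of the groups, so it would be worth timing on real input before adopting it.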

Still, I think the OP would notice a significant boost in performance
by simply naively translating his code using data.table -- if you
really wanted to eke out the last bit of performance (which isn't
really necessary if you're just doing things once, but if you're
building a pipeline, feel free) that'd be another convo ...

Anyway, it looks like there's a lot of good stuff in this thread
already. I'd be curious to hear back from Daniel when he tries a few
of these things. Also, wasn't aware of the new(?) `SplitDataFrame`
mojo -- very nice stuff.

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech
