[R] Data-mining using R

Fernando Henrique Ferraz Pereira da Rosa mentus at gmx.de
Fri May 9 02:35:02 CEST 2003


      Is it possible to use R as a data-mining tool? Here's the problem I've
got. I have a couple of data sets consisting of results from a cDNA
microarray experiment - the details about the biology don't really matter here, the
same theory applies for any other data-mining task (that's why I thought it'd
be more appropriate to post this on r-user).  Each of these datasets consists
of about 30000 rows by 20 to 30 columns. Let's say that each row represents
(very roughly speaking) a gene, and the columns are details about its level
of expression, reliability of the measurement, coordinates and so on.
      The main objective here is to identify some genes (rows) according to some
criteria. To do that, I want to be able to selectively
filter the rows, graph some convenient variables, do some further filtering
and so on.
      Let me take a more concrete example to make myself clear. Let's say
that I load a given dataset into a data frame, namely expr1. This data frame would
have the fields expr1$name, expr1$expression, expr1$reliability, expr1$x,
expr1$y and so on, containing, for instance, 26000 rows. Now from these 26000 I'd
like to select only those satisfying expr1$expression > 2000 and
expr1$reliability == 100, and plot expr1$x versus expr1$y for them. I'd then have
a reduced version of the first dataset. Let's say now that I want to narrow my
filter even more, selecting only (among the ones I have already selected) the
ones where expr1$x > 20.
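
      To illustrate the sequence of steps just described, here is a minimal
sketch in base R. The data are simulated (the values, the reliability levels and
the names sel1 and sel2 are made up for illustration), and subset() is only one
possible way of writing the filters, not necessarily the proper one.

```r
# Simulated stand-in for the real data; all values are invented.
set.seed(1)
n <- 26000
expr1 <- data.frame(
  name        = paste0("gene", seq_len(n)),
  expression  = runif(n, 0, 5000),
  reliability = sample(c(50, 75, 100), n, replace = TRUE),
  x           = runif(n, 0, 40),
  y           = runif(n, 0, 40),
  stringsAsFactors = FALSE
)

# First filter: expression above 2000 and reliability equal to 100.
sel1 <- subset(expr1, expression > 2000 & reliability == 100)

# Scatter plot of the two coordinates, for the selected rows only.
plot(sel1$x, sel1$y, xlab = "x", ylab = "y")

# Second filter, applied to the already reduced set.
sel2 <- subset(sel1, x > 20)
```

Each step returns an ordinary data frame, so the filters can be chained in any
order and each intermediate result can be plotted or filtered again.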
      This would be done many times and in different orders. I'd also like to be
able to take, among those 26000 rows, only the 100 with the greatest values of
expr1$x. And so on, many times, until I have found a set of suitable rows.
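
      As a rough sketch of the "100 greatest" selection, one possibility in base
R is to sort the row indices with order() and keep the first 100. The data below
are simulated and the name top100 is made up for illustration.

```r
# Simulated stand-in for the real data; all values are invented.
set.seed(1)
n <- 26000
expr1 <- data.frame(
  name = paste0("gene", seq_len(n)),
  x    = runif(n, 0, 40),
  stringsAsFactors = FALSE
)

# order() returns row indices sorted by x, largest first;
# the first 100 of them index the rows with the 100 greatest x values.
top100 <- expr1[order(expr1$x, decreasing = TRUE)[1:100], ]
```

The result is again a data frame, so it can be filtered further in the same way.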
      What is the proper way to do that using R, if any? I've played a
little with data frames (I could for instance use expr1$name[expr1$x > 20] to get
the names of those genes whose x > 20) but it seemed a little clumsy. Should
I keep trying to manipulate the data frame directly, or should I perhaps save
it in a MySQL database and do the queries using RMySQL? Or maybe there is a
better option?
      I know that these things I've said are pretty easy to implement using,
for instance M$ Excel (I've seen them working on it). You just select
drop-down menus and filter the rows to your liking. But I really would like to be
able to accomplish this task using R and other open source tools like MySQL,
Perl, etc.
      

Thank you in advance,
