[R] Do you use R for data manipulation?
ld7631 at gmail.com
Mon May 11 19:29:27 CEST 2009
I am not a statistician and not a computer scientist by education. I
consider myself an R novice and came to R - thanks to my boss - from
an SPSS background. I work for a market research company and the most
typical data files we deal with are not huge - up to several thousand
rows and up to a thousand variables.
I would say, on certain projects, most of what we do in R (if you look
at the number of lines in R we devote to a given task) is data
manipulation. The actual statistical method is frequently just a line
- all the rest is getting the data shaped right and then spitting out
the results of the analysis in a usable way.
I find R to be excellent in data manipulations that we perform. First
of all, it's great that you can always grab variables/cases you need
and ignore all the rest. In SPSS you just keep staring at all those
data and variables that you don't need - trying to find the one you do.
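For instance, grabbing just the variables and cases you need is a one-liner in R. A minimal sketch (the data frame and its column names are made up for illustration):

```r
# Hypothetical survey data; 'region', 'score' and 'junk' are invented names
survey <- data.frame(id     = 1:6,
                     region = c("N", "S", "N", "E", "S", "N"),
                     score  = c(3, 5, 4, 2, 5, 1),
                     junk   = letters[1:6])

# Keep only the rows and columns of interest; everything else is ignored
north <- survey[survey$region == "N", c("id", "score")]
print(north)
```

In SPSS you would scroll past the unused variables; here they simply never enter the working object.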
Second - I find R to be incredibly fast (as opposed to SPSS or Excel)
with the amounts of data we are dealing with.
And third - nothing is "written in stone" and your original data is
always untouched - you can always read it in again and again. For
example, if I create a new variable and make a mistake, I can always
fix the code, rerun that piece of the code and that gives me the
corrected object that contains that new variable. I never touch the
original data and hence - never "spoil" it.
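That workflow can be sketched in a few lines (all object and column names here are hypothetical): derived variables live in a working copy, so fixing a mistake just means correcting the line and rerunning it - the raw data frame is never modified.

```r
# Hypothetical raw data, standing in for a file read with read.csv() etc.
raw <- data.frame(price = c(10, 20, 30), qty = c(1, 2, 3))

# Work on a copy; a wrong formula is fixed by editing and rerunning this line
work <- raw
work$revenue <- work$price * work$qty

# 'raw' still has only its original columns
print(names(raw))
```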
On Mon, May 11, 2009 at 11:20 AM, ronggui <ronggui.huang at gmail.com> wrote:
> 2009/5/6 Emmanuel Charpentier <charpent at bacbuc.dyndns.org>:
>> Le mercredi 06 mai 2009 à 00:22 -0400, Farrel Buchinsky a écrit :
>>> Is R an appropriate tool for data manipulation and data reshaping and data
>> [ Large Snip ! ... ]
>> Depends on what you have to do.
>> I've done what can be more or less termed "data management" with almost
>> uncountable tools (from Excel (sigh...) to R with SQL, APL, Pascal, C,
>> Basic (in 1982 !), Fortran and even Lisp in passing...).
>> SQL has strong points : join is, to my tastes, more easily expressed in
>> SQL than in most languages, projection and aggregation are natural.
>> However, in SQL, there is no "natural" ordering of table rows, which
>> makes expressing algorithms using this order difficult. Try for example
>> to express the differences of a time series ... (it can be done, but it
>> is *not* a pretty sight).
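By contrast, the time-series difference Emmanuel mentions is a single call in R - a small sketch with made-up numbers:

```r
# Hypothetical series of four observations
x <- c(100, 103, 101, 108)

# Lagged differences rely on the vector's intrinsic ordering,
# which is exactly what SQL lacks
d <- diff(x)
print(d)
```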
>> On the other hand, R has some unique expressive possibilities (reshape()
>> comes to mind).
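As an illustration of the reshape() point, here is a minimal wide-to-long conversion with base R's reshape() (the data frame and its column names are invented):

```r
# Hypothetical wide data: one row per id, one column per time point
wide <- data.frame(id = 1:2, t1 = c(5, 6), t2 = c(7, 8))

# Stack t1/t2 into a single 'value' column indexed by 'time'
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "value",
                timevar = "time", times = 1:2, idvar = "id")
print(long)
```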
>> So I tend to use a combination of tools : except for very small samples,
>> I tend to manage my data in SQL and with associated tools (think data
>> editing, for example ; a simple form in OpenOffice's Base is quite easy
>> to create, can handle anything for which an ODBC driver exists, and
>> won't crap out for more than a few hundred lines...). Finer manipulation
>> is usually done in R with native tools and sqldf.
>> But, at least in my trade, the ability to handle Excel files is a must
>> (this is considered as a standard for data entry. Sigh ...). So the
>> first task is usually a) import data in an SQL database, and b) prepare
>> some routines to dump SQL tables / R dataframes in Excel for returning
>> back to the original data author...
> I don't think Excel is a standard tool for data entry. EpiData Entry
> is much more professional.
>> Emmanuel Charpentier
>> R-help at r-project.org mailing list
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> HUANG Ronggui, Wincent
> PhD Candidate
> Dept of Public and Social Administration
> City University of Hong Kong
> Home page: http://asrr.r-forge.r-project.org/rghuang.html
Dimitri.Liakhovitski at markettools.com