[R] Organisation of medium/large projects with multiple analyses

Mark Wardle mark at wardle.org
Thu Oct 26 23:00:30 CEST 2006


Dear all,

I'm still new to R, but have a fair amount of experience with general
programming. All of my data is stored in PostgreSQL, and I have a number
of R files that generate tables, results, graphs, etc. These are then
available to be imported into PowerPoint/LaTeX and so on.
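
To give a concrete idea of what each script does, a typical one ends up
looking roughly like the following (file names and columns are invented,
and I'm assuming the xtable package for the LaTeX export):

  library(xtable)                        # LaTeX table export
  d <- read.csv("patients.csv")          # stand-in for the real data pull (see below)

  pdf("age-histogram.pdf")               # figure destined for LaTeX/PowerPoint
  hist(d$age, main = "Age at onset")
  dev.off()

  tab <- data.frame(n = nrow(d), mean.age = mean(d$age, na.rm = TRUE))
  print(xtable(tab), file = "age-table.tex")   # fragment to \input{} into the paper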

I'm using version control (Subversion) and, as with most small projects,
now have an ever-increasing number of R scripts, each with fairly
specific features. With any growing project there are always issues
around interdependencies, shared functionality (e.g. accessing the same
data store), and old scripts breaking because of changes made elsewhere
(e.g. to the data schema). For example, I might have specific inclusion
and exclusion criteria for patients, and the SQL query implementing them
may need to appear in a number of analyses; I'm tempted to factor this
out into a project-specific data access library, but is that over the
top?
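
Roughly what I have in mind is a single project file along these lines
(a sketch only: the file name, driver, database and criteria are all
invented, and I'd use whichever DBI-compatible PostgreSQL driver is to
hand):

  ## data-access.R -- the one place that knows the schema and cohort definition
  library(DBI)

  connect.study.db <- function() {
    ## driver and database name are placeholders for whatever is actually in use
    dbConnect(dbDriver("PostgreSQL"), dbname = "ataxia_study")
  }

  ## the shared inclusion/exclusion criteria, written down once
  get.included.patients <- function(con) {
    dbGetQuery(con,
               "SELECT * FROM patients
                 WHERE consent = TRUE
                   AND age_at_onset IS NOT NULL")
  }

Each analysis script would then source() this (or load it from a
package) rather than carrying its own copy of the query.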

This is a very long-winded and roundabout way of asking how people
organise medium-sized projects. Do people create their own
project-specific "libraries" for shared functionality, or just use
"source()" liberally for this kind of thing? What about namespaces? I've
got unwieldy-sounding functions like ataxia.repeats.plot.alleles();
these functions are often not particularly generic and are only called
three or four times, but they do save repetition.
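
One half-way house I've wondered about, short of a proper package with a
NAMESPACE, is grouping such helpers in a list (or environment) so the
names stay manageable; a sketch with invented names:

  ## ataxia-plots.R -- related helpers grouped under a single name
  ataxia.plots <- list(
    repeat.alleles = function(d) {
      plot(d$allele1, d$allele2,
           xlab = "Allele 1 repeat length", ylab = "Allele 2 repeat length")
    },
    age.at.onset = function(d) {
      hist(d$age_at_onset, main = "Age at onset")
    }
  )

  ## used from an analysis script as:
  ##   source("ataxia-plots.R")
  ##   ataxia.plots$repeat.alleles(patients)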

Do you go to the effort of creating a library that solves your
particular problem, or reserve that effort for more generic
functionality? Do people keep all of their R scripts for a specific
project in separate files, or in one big file? I can see advantages
(knowing it all still works) and disadvantages (the time it takes to
re-run everything after minor changes) to both approaches, but it is
unclear to me which is "better". I do know that I've set up a variety of
analyses, moved on to other things, only to find later that old scripts
have stopped working because I've changed some interdependency. Does
anyone go as far as using test suites to check for sane output (apart
from doing things manually)? Note I'm not asking about how to run R on
all these scripts, as people have already suggested makefiles.
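
By "sane output" I mean no more than a handful of stopifnot() calls run
over the key objects once the scripts finish, for example (column and
function names invented, reusing the hypothetical data-access.R from
above):

  ## checks.R -- crude sanity checks run after the analysis scripts
  source("data-access.R")
  con <- connect.study.db()
  patients <- get.included.patients(con)

  stopifnot(nrow(patients) > 0)                # the cohort query still returns rows
  stopifnot(!any(duplicated(patients$id)))     # no duplicate patients
  stopifnot(all(patients$age_at_onset >= 0, na.rm = TRUE))
  cat("All checks passed\n")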

I realise these are vague, high-level questions and there won't be any
"right" or "wrong" answers, but I'd be grateful to hear about different
strategies for organising R analyses/files and how people solve these
problems. I've not seen this kind of thing covered in any of the
textbooks. Apologies for being so verbose!

Best wishes,

Mark



-- 
Dr. Mark Wardle
Clinical research fellow and Specialist Registrar in Neurology,
C2-B2 link, Cardiff University, Heath Park, CARDIFF, CF14 4XN. UK


