[R] Scripting in R -- pattern matching, logic, system calls, the works!

Tue Sep 16 17:31:51 CEST 2008

Instead of writing some long, ugly, "script", the way to use R is to
break problems down into distinct tasks. Reading data is one task, and
performing regressions on the data, plotting & summarising are
different tasks. Write functions to do each task in general, and then
use those functions.

So one task is reading the data from a Coverage dir. You want to do a
linear regression on the data, so you want to have the data stored as
a data frame. Following on from Don McQueen's good advice, here's a
function that does the job:

read.data.from.coverage.dir <- function(dir, pattern="Length_[0-9]+", min.length=0, max.length=Inf) {

    ## return a data frame with lengths in first column and means of
    ## file contents in second column

    files <- list.files(dir, pattern)
    lengths <- as.numeric(gsub("Length_", "", files, perl=TRUE))
    files <- files[lengths >= min.length && lengths <= max.length]
    get.mean.from.file <- function(file) mean(scan(file.path(dir,file), quiet=TRUE))
    data.frame(x=lengths, y=sapply(files, get.mean.from.file))
}

And here's a function, that uses the first one, to get all the data
from your various coverage dirs

get.all.data <- function(topdir) {
    coverage.dirs <- list.files(path=topdir, pattern="Coverage", full.names=TRUE)
    lapply(coverage.dirs, read.data.from.coverage.dir)
}

So now you can do

## read all the data
all.data <- get.all.data(topdir="~")

## perform all the regressions
regression.fits <- lapply(all.data, function(df) lm(y ~ x, data=df))

## summarise them
summaries <- lapply(regression.fits, summary)

## etc

All those commands are generating lists of objects; lapply is a
shorthand for doing a for loop over a list.

You can use sink() to redirect output, but it would probably be better
to create tables and/or figures in R first, then write them to files.

Dan

On Tue, Sep 16, 2008 at 07:01:42AM -0700, bioinformatics_guy wrote:
> 
> Don,
> Excellent advice.  I've gone back and done a bit of coding and wanted to see
> what you think and possibly "shore up" some of the technical stuff I am
> still having a bit of difficulty with.
> 
> I'll past the code I have to date with any important annotations:
> 
> topdir="~"
> library(gmodels)
> 
> setwd(topdir)
> 
> ### Will probably want to do two for loops as opposed to recursive
> files=list.files(path=topdir,pattern="Coverage")
> 
> for (i in files)
> {
>         dir=paste("~/hangers/",i,sep="")
> 
>         files2=list.files(path=dir,pattern="Length")
> 
>         ### Make an empty matrix that will have the independent variable as
> the filenum and the dependent variable
>         ### as the mean of the length or should I have two vectors for the
> regression.  Basically the Length_(\d+) is the independent variable (which
> is taken from the filename) which all the regressions will have and then
> inside the Length_(\d+) is a 1d set of numbers which I take the mean of
> which in turn becomes the dependent variable.  So in essence the points are:
> f(length)=mean(length$V1)
> f(45)=50
> f(50)=60
> etc ...
> 
> 
>         for (j in files2)
>         {
>         ## I just rearranged the following line but I'm not sure what the
> command is doing
>         ## I am assuming 'as.numeric' means take the input as a number
> instead of a string and the gsub has                #me stumped 
>        
>         filenum=as.numeric(gsub('Length_','',j))        
>         
>         ## Can I assign variables at the top instead of hardcoding? like
> upper=50 , lower=30?
>         ## And I don't need to put brackets for this if statement do I? 
> Does it basically just
>         ## say that if the filenum is outside those parameters, just go to
> the next j in files2?
>         if (filenum > 200 | filenum < -10) next
> 
>         dir2=paste("~/hangers",i,j,sep="/")
> 
>         tmp=read.table(dir2)
> 
>         mean(tmp($V1))
> 
>         Now should I put these in a matrix or a vector (all j values (length
> vs mean(tmp$V1) for each i iteration) 
>         }
> }
> 
> I think lastly, Id like to get a print out of each of the regressions (each
> iteration of i).  Is that when I use the summary command?  And, like in
> unix, can I redirect the output to a file?
> 
> Best
> 
> 
> Don MacQueen wrote:
> > 
> > I can't go through all the details, but hopefully this will help get 
> > you started.
> > 
> > If you look at the help page for the list.files() function, you will see
> > this:
> > 
> >       list.files(path = ".", pattern = NULL, all.files = FALSE,
> >                  full.names = FALSE, recursive = FALSE,
> >                  ignore.case = FALSE)
> > 
> > The "." in path means to start at your current working directory. 
> > Assuming your 5 Coverage directories are subdirectories of your 
> > current working directory, that's what you want.
> > 
> > Then, setting recursive to TRUE will cause it to also list the 
> > contents of all subdirectories. Since your Length files are in the 
> > Coverage subdirectories, that's what you want.
> > 
> > Finally, the pattern argument returns only files that match the 
> > pattern, so something like
> >    patter="Length"
> > should get you just the files you want.
> > 
> > The result is a character vector containing the names of all your 
> > Length files. Try it and see.
> > 
> > Then, a simple loop over the over the vector of filenames, with an 
> > appropriate scan() or read.table() command for each, will read the 
> > data in.
> > 
> > If you need to restrict the files, say Length_20, Length_25, 
> > Length_30, etc. then you'll have to do some more work.
> > Look at
> >     as.numeric(gsub( 'Length_', '', filename))
> > to get just the number part of the filename, as a number, and then 
> > you can use numeric inequalities to identify whether or not any 
> > particular file is to be processed.
> > 
> > Since you haven't shown what the contents of your files look like 
> > (two columns of numbers or what), I have no idea what to suggest for 
> > the part having to do with reading them in, plotting or doing linear 
> > regression.
> > 
> > The basic function for linear regression is lm().
> > 
> > 
> > Here is a summary:
> > 
> > files <- list.files( '~' , pattern='Length', recursive=TRUE)
> > 
> > for (fl in files) {
> > 
> >    ## optional, to restrict to only certain files
> >    filenum <- as.numeric(gsub( 'Length_', '', filename))
> > 
> >    ## skip to next file if it isn't in the correct number range
> >    if (filenum > 50 | filenum < 20) next
> > 
> >    ## a command to read the current file. perhaps:
> >    ## tmp <- read.table(fl)
> > 
> >    ## commands to do statistics on the data in the current file. perhaps:
> >    ## fit <- lm( y ~ y, data=tmp)
> > 
> >    ## some output
> >    cat('------ file =',fl,'-----\n')
> >    print(fit)
> > 
> > }
> > 
> > This example doesn't restrict only to certain Coverage subdirectories.
> > 
> > -Don
> > 
> > 
> > 
> > At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
> >>Im very new to R so this might be a very simple question.  First I'll lay
> out
> >>the hierarchy of my directories, goals.
> >>
> >>I have say 5 directories of form "Coverage_(some number)" and each one of
> >>these I have text files of form "Length_(some number)" which are comprised
> >>of say 30 numbers.  Each one of these Length files (which are basically
> >>incremented by 5 from 0 to 100, Length_(0,5,10,15,20) are to be averaged
> >>where the average is the y-value and the length is the x-value in a linear
> >>regression.
> >>
> >>What I want to do is, write a script that looks in each of the coverage
> >>directories and then reads in each of the files, takes the means, and
> plots
> >>them in form I specified above.  The catch is, what if I only want to plot
> >>say Length_(20-50) and what command/method is best for a linear
> regression?
> >>I've looked at m1(), but have not gotten it to work properly.
> >>
> >>Below is some of the code I've put together:
> >>
> >>topdir="~"
> >>
> >>setwd(topdir)
> >>
> >>### Took this function from a friend so I'm not sure what its doing
> besides
> >>grep-ing a directory?
> >>ll<-function(string)
> >>{
> >>	grep(string,dir(),value=T)
> >>}
> >>
> >>### I believe this is looking for all files of form below
> >>subdir = ll("Coverage_[1-9][0-9]$")
> >>
> >>### A for loop iterating through each of the sub directories.
> >>for (i in subdir)
> >>{     
> >>         #not sure what this line is doing as I found it on the internet
> >> on a
> >>similar function
> >>	setwd(paste(getwd(),i,sep="/"))
> >>         #This makes a vector of all the file names
> >>         filelist=ll("Length_")
> >>
> >>Can I use a regex or logic to only take the filelist variables I want?
> >>And can I now get the mean of each Length_* and set in a matrix (length x
> >>mean)?
> >>
> >>Then finally, how to do a linear regression of this.       
> >>
> >>--
> >>View this message in context: http:// www. 
> >>nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
> >>Sent from the R help mailing list archive at Nabble.com.
> >>
> >>______________________________________________
> >>R-help at r-project.org mailing list
> >>https:// stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide http:// www.
> R-project.org/posting-guide.html
> >>and provide commented, minimal, self-contained, reproducible code.
> > 
> > 
> > -- 
> > --------------------------------------
> > Don MacQueen
> > Environmental Protection Department
> > Lawrence Livermore National Laboratory
> > Livermore, CA, USA
> > 925-423-1062
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> > 
> 
> -- 
> View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
http://www.stats.ox.ac.uk/~davison