[R] Scripting in R -- pattern matching, logic, system calls, the works!

Don MacQueen macq at llnl.gov
Tue Sep 16 00:06:55 CEST 2008


I can't go through all the details, but hopefully this will help get 
you started.

If you look at the help page for the list.files() function, you will see this:

      list.files(path = ".", pattern = NULL, all.files = FALSE,
                 full.names = FALSE, recursive = FALSE,
                 ignore.case = FALSE)

The "." in path means to start at your current working directory. 
Assuming your 5 Coverage directories are subdirectories of your 
current working directory, that's what you want.

Then, setting recursive to TRUE will cause it to also list the 
contents of all subdirectories. Since your Length files are in the 
Coverage subdirectories, that's what you want.

Finally, the pattern argument restricts the result to files whose 
names match the pattern, so something like
   pattern="Length"
should get you just the files you want.

The result is a character vector containing the names of all your 
Length files. Try it and see.
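
If you want something to experiment with first, here is a small 
sketch. The temporary directory and the Coverage_10/Coverage_20 names 
are made up for the example (I don't know your actual layout), so 
treat it as an illustration only:

```r
## build a throwaway directory tree to experiment with (hypothetical names)
top <- file.path(tempdir(), "listfiles_demo")
for (d in c("Coverage_10", "Coverage_20")) {
  dir.create(file.path(top, d), recursive = TRUE, showWarnings = FALSE)
  for (n in c(0, 5, 10)) {
    writeLines(as.character(1:3), file.path(top, d, paste0("Length_", n)))
  }
}

## recursive = TRUE descends into the Coverage_* subdirectories;
## the returned names are relative to 'top' and include the subdirectory
found <- list.files(path = top, pattern = "Length", recursive = TRUE)
print(found)
```

Note that the names come back with the subdirectory in front 
(Coverage_10/Length_0 and so on), which matters later when you pull 
the number out of the name.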

Then, a simple loop over the vector of filenames, with an 
appropriate scan() or read.table() command for each, will read the 
data in.
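
A rough sketch of such a loop, assuming each file holds nothing but 
numbers separated by whitespace (an assumption on my part, since I 
haven't seen your files). The two demo files and the object name 
"means" are invented for the example:

```r
## two small hypothetical data files, one number per line
top <- file.path(tempdir(), "read_demo")
dir.create(top, showWarnings = FALSE)
writeLines(c("1", "2", "3"), file.path(top, "Length_5"))
writeLines(c("10", "20", "30"), file.path(top, "Length_10"))

## full.names = TRUE so each name can be passed straight to scan()
files <- list.files(top, pattern = "Length", full.names = TRUE)

## one mean per file, labelled with the file name
means <- numeric(length(files))
names(means) <- basename(files)
for (i in seq_along(files)) {
  vals <- scan(files[i], quiet = TRUE)  # read all numbers in the file
  means[i] <- mean(vals)
}
print(means)
```

If your files have several columns or a header line, read.table() 
would be the better choice than scan().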

If you need to restrict the files, say Length_20, Length_25, 
Length_30, etc. then you'll have to do some more work.
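Look at
    as.numeric(gsub( 'Length_', '', basename(filename)))
to get just the number part of the filename, as a number, and then 
you can use numeric inequalities to identify whether or not any 
particular file is to be processed. The basename() call is needed 
because with recursive=TRUE each name includes its Coverage 
directory, which would otherwise turn the result into NA.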
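
For instance, with some made-up file names of the kind 
list.files(recursive=TRUE) returns, the number can be pulled out and 
used as a filter roughly like this. basename() is there because the 
names carry their directory in front. The 20 to 50 range is the one 
from your question; everything else is hypothetical:

```r
## hypothetical names, as list.files(recursive = TRUE) would return them
files <- c("Coverage_10/Length_5", "Coverage_10/Length_20",
           "Coverage_20/Length_35", "Coverage_20/Length_60")

## basename() drops the directory part before the prefix is removed
filenum <- as.numeric(gsub("Length_", "", basename(files)))

## keep files numbered 20 through 50
keep <- files[filenum >= 20 & filenum <= 50]
print(keep)
```

This works on the whole vector at once, so the filtering can be done 
before the loop instead of inside it.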

Since you haven't shown what the contents of your files look like 
(two columns of numbers or what), I have no idea what to suggest for 
the part having to do with reading them in, plotting or doing linear 
regression.

The basic function for linear regression is lm().
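
A minimal sketch of a call, using invented numbers and assuming you 
end up with a data frame holding the length in a column x and the 
mean in a column y (those column names are my choice, not anything 
from your files):

```r
## made-up values: x is the Length number, y the mean of that file
dat <- data.frame(x = c(20, 25, 30, 35, 40),
                  y = c(41, 52, 59, 71, 80))

fit <- lm(y ~ x, data = dat)   # regress the means on the lengths
print(coef(fit))               # intercept and slope
```

summary(fit) gives standard errors and R-squared if you need more 
than the coefficients.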


Here is a summary:

files <- list.files( '~' , pattern='Length', recursive=TRUE, full.names=TRUE)

for (fl in files) {

   ## optional, to restrict to only certain files
   ## basename() drops the directory part of the name
   filenum <- as.numeric(gsub( 'Length_', '', basename(fl)))

   ## skip to next file if it isn't in the correct number range
   if (is.na(filenum) || filenum > 50 || filenum < 20) next

   ## a command to read the current file. perhaps:
   ## tmp <- read.table(fl)

   ## commands to do statistics on the data in the current file. perhaps:
   ## fit <- lm( y ~ x, data=tmp)

   ## some output
   cat('------ file =',fl,'-----\n')
   ## print(fit)   # uncomment once fit is created above

}

This example doesn't restrict only to certain Coverage subdirectories.
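
If you do need that, one possible approach (a sketch only, with 
made-up names and an arbitrary cutoff of 20) is to apply the same 
number-extraction trick to the directory part of each name:

```r
## hypothetical relative names, as returned with recursive = TRUE
files <- c("Coverage_10/Length_20", "Coverage_25/Length_20",
           "Coverage_40/Length_20", "notes/Length_readme")

## the directory part of each name, and the number embedded in it
covdir <- dirname(files)
covnum <- suppressWarnings(as.numeric(sub("^Coverage_", "", covdir)))

## keep Coverage_20 and above; non-Coverage directories give NA and are dropped
keep <- files[!is.na(covnum) & covnum >= 20]
print(keep)
```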

-Don



At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
>Im very new to R so this might be a very simple question.  First I'll lay out
>the hierarchy of my directories, goals.
>
>I have say 5 directories of form "Coverage_(some number)" and each one of
>these I have text files of form "Length_(some number)" which are comprised
>of say 30 numbers.  Each one of these Length files (which are basically
>incremented by 5 from 0 to 100, Length_(0,5,10,15,20) are to be averaged
>where the average is the y-value and the length is the x-value in a linear
>regression.
>
>What I want to do is, write a script that looks in each of the coverage
>directories and then reads in each of the files, takes the means, and plots
>them in form I specified above.  The catch is, what if I only want to plot
>say Length_(20-50) and what command/method is best for a linear regression?
>I've looked at m1(), but have not gotten it to work properly.
>
>Below is some of the code I've put together:
>
>topdir="~"
>
>setwd(topdir)
>
>### Took this function from a friend so I'm not sure what its doing besides
>grep-ing a directory?
>ll<-function(string)
>{
>	grep(string,dir(),value=T)
>}
>
>### I believe this is looking for all files of form below
>subdir = ll("Coverage_[1-9][0-9]$")
>
>### A for loop iterating through each of the sub directories.
>for (i in subdir)
>{     
>         #not sure what this line is doing as I found it on the internet on a
>similar function
>	setwd(paste(getwd(),i,sep="/"))
>         #This makes a vector of all the file names
>         filelist=ll("Length_")
>
>Can I use a regex or logic to only take the filelist variables I want?
>And can I now get the mean of each Length_* and set in a matrix (length x
>mean)?
>
>Then finally, how to do a linear regression of this.       
>
>--
>View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
>Sent from the R help mailing list archive at Nabble.com.
>


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062


