[R] Scripting in R -- pattern matching, logic, system calls, the works!

bioinformatics_guy wwwhitener at gmail.com
Tue Sep 16 16:01:42 CEST 2008


Don,
Excellent advice.  I've gone back and done a bit of coding and wanted to see
what you think and possibly "shore up" some of the technical stuff I am
still having a bit of difficulty with.

I'll paste the code I have to date with any important annotations:

topdir="~"
library(gmodels)

setwd(topdir)

### Will probably want to do two for loops as opposed to recursive
files=list.files(path=topdir,pattern="Coverage")

for (i in files)
{
        dir=paste("~/hangers/",i,sep="")

        files2=list.files(path=dir,pattern="Length")

        ### Make an empty matrix that will have the independent variable as
        ### the filenum and the dependent variable as the mean of the length,
        ### or should I have two vectors for the regression?  Basically the
        ### Length_(\d+) is the independent variable (which is taken from the
        ### filename) which all the regressions will have, and then inside the
        ### Length_(\d+) is a 1d set of numbers which I take the mean of,
        ### which in turn becomes the dependent variable.  So in essence the
        ### points are:
        ### f(length)=mean(length$V1)
        ### f(45)=50
        ### f(50)=60
        ### etc ...


        for (j in files2)
        {
        ## I just rearranged the following line but I'm not sure what the
        ## command is doing.
        ## I am assuming 'as.numeric' means take the input as a number
        ## instead of a string, and the gsub has me stumped.
       
        filenum=as.numeric(gsub('Length_','',j))        
        
        ## Can I assign variables at the top instead of hardcoding? like
        ## upper=50 , lower=30?
        ## And I don't need to put brackets for this if statement do I?
        ## Does it basically just say that if the filenum is outside those
        ## parameters, just go to the next j in files2?
        if (filenum > 200 | filenum < -10) next

        dir2=paste("~/hangers",i,j,sep="/")

        tmp=read.table(dir2)

        mean(tmp$V1)

        ## Now should I put these in a matrix or a vector (all j values,
        ## length vs mean(tmp$V1), for each i iteration)?
        }
}
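
For what it's worth, here is one way the loop above might be finished. It is
only a rough sketch under assumptions: that each Length file holds a single
column of numbers (which read.table() names V1), and that a data frame of
(length, mean) points per Coverage directory is wanted. The helper name
fit_coverage, the lower/upper arguments, and the demo files written to a
temporary directory are all invented for illustration; they are not the real
data.

```r
# Hypothetical helper: regress mean(V1) on the number in each Length_<n> name.
fit_coverage <- function(covdir, lower = 20, upper = 50) {
  files <- list.files(covdir, pattern = "^Length_[0-9]+$")
  # gsub() replaces each match of 'Length_' with '', so "Length_45"
  # becomes "45"; as.numeric() then turns that string into the number 45.
  lens <- as.numeric(gsub("Length_", "", files))
  keep <- lens >= lower & lens <= upper   # limits set once, not hardcoded
  files <- files[keep]
  lens <- lens[keep]
  # one mean per file; assumes a single column, read in as V1
  means <- vapply(file.path(covdir, files),
                  function(f) mean(read.table(f)$V1), numeric(1))
  pts <- data.frame(len = lens, avg = unname(means))
  lm(avg ~ len, data = pts)
}

# Made-up demo data in a temporary directory (not the real files).
top <- file.path(tempdir(), "hangers_demo")
for (cv in c("Coverage_10", "Coverage_20")) {
  dir.create(file.path(top, cv), recursive = TRUE, showWarnings = FALSE)
  for (n in seq(0, 100, by = 5)) {
    # every value equals 2*n + 1, so the file mean is exactly 2*n + 1
    writeLines(as.character(rep(2 * n + 1, 30)),
               file.path(top, cv, paste("Length_", n, sep = "")))
  }
}

# One regression per Coverage directory, kept in a named list.
fits <- list()
for (cv in list.files(top, pattern = "^Coverage_")) {
  fits[[cv]] <- fit_coverage(file.path(top, cv))
}
```

Keeping the points in a data frame rather than a matrix lets lm() refer to
the columns by name, and the list gives one fitted model per value of i.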

Lastly, I think I'd like to get a printout of each of the regressions (one
per iteration of i).  Is that when I use the summary command?  And, like in
unix, can I redirect the output to a file?
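
As far as I can tell from the help pages, summary() on an lm fit prints the
coefficient table, and sink() or capture.output() plays the role of the
shell's ">" redirection. A small sketch, assuming a fitted model is already
in hand; the points and the output file name below are invented for
illustration:

```r
# Invented points of the form f(length) = mean, like f(45)=50 earlier.
pts <- data.frame(len = c(20, 25, 30, 35, 40),
                  avg = c(41, 52, 60, 71, 79))
fit <- lm(avg ~ len, data = pts)

outfile <- file.path(tempdir(), "regressions.txt")  # hypothetical name

# sink() diverts everything printed to the console into the file.
sink(outfile)
cat("------ Coverage_10 ------\n")
print(summary(fit))
sink()  # restore normal console output

# capture.output() does the same for one expression; append = TRUE
# adds to the end of the file instead of overwriting it.
capture.output(print(coef(fit)), file = outfile, append = TRUE)
```

Inside a for loop the print() is needed explicitly, since results are not
auto-printed there the way they are at the prompt.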

Best


Don MacQueen wrote:
> 
> I can't go through all the details, but hopefully this will help get 
> you started.
> 
> If you look at the help page for the list.files() function, you will see
> this:
> 
>       list.files(path = ".", pattern = NULL, all.files = FALSE,
>                  full.names = FALSE, recursive = FALSE,
>                  ignore.case = FALSE)
> 
> The "." in path means to start at your current working directory. 
> Assuming your 5 Coverage directories are subdirectories of your 
> current working directory, that's what you want.
> 
> Then, setting recursive to TRUE will cause it to also list the 
> contents of all subdirectories. Since your Length files are in the 
> Coverage subdirectories, that's what you want.
> 
> Finally, the pattern argument returns only files that match the 
> pattern, so something like
>    pattern="Length"
> should get you just the files you want.
> 
> The result is a character vector containing the names of all your 
> Length files. Try it and see.
> 
> Then, a simple loop over the vector of filenames, with an 
> appropriate scan() or read.table() command for each, will read the 
> data in.
> 
> If you need to restrict the files, say Length_20, Length_25, 
> Length_30, etc. then you'll have to do some more work.
> Look at
>     as.numeric(gsub( 'Length_', '', filename))
> to get just the number part of the filename, as a number, and then 
> you can use numeric inequalities to identify whether or not any 
> particular file is to be processed.
> 
> Since you haven't shown what the contents of your files look like 
> (two columns of numbers or what), I have no idea what to suggest for 
> the part having to do with reading them in, plotting or doing linear 
> regression.
> 
> The basic function for linear regression is lm().
> 
> 
> Here is a summary:
> 
> files <- list.files( '~' , pattern='Length', recursive=TRUE)
> 
> for (fl in files) {
> 
>    ## optional, to restrict to only certain files
>    filenum <- as.numeric(gsub( 'Length_', '', basename(fl)))
> 
>    ## skip to next file if it isn't in the correct number range
>    if (filenum > 50 | filenum < 20) next
> 
>    ## a command to read the current file. perhaps:
>    ## tmp <- read.table(fl)
> 
>    ## commands to do statistics on the data in the current file. perhaps:
>    ## fit <- lm( y ~ x, data=tmp)
> 
>    ## some output
>    cat('------ file =',fl,'-----\n')
>    print(fit)
> 
> }
> 
> This example doesn't restrict only to certain Coverage subdirectories.
> 
> -Don
> 
> 
> 
> At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
>>I'm very new to R so this might be a very simple question.  First I'll lay out
>>the hierarchy of my directories, goals.
>>
>>I have say 5 directories of form "Coverage_(some number)" and in each one of
>>these I have text files of form "Length_(some number)" which are comprised
>>of say 30 numbers.  Each one of these Length files (which are basically
>>incremented by 5 from 0 to 100: Length_0, Length_5, Length_10, ...) is to be
>>averaged, where the average is the y-value and the length is the x-value in
>>a linear regression.
>>
>>What I want to do is, write a script that looks in each of the coverage
>>directories and then reads in each of the files, takes the means, and plots
>>them in the form I specified above.  The catch is, what if I only want to plot
>>say Length_(20-50) and what command/method is best for a linear regression?
>>I've looked at m1(), but have not gotten it to work properly.
>>
>>Below is some of the code I've put together:
>>
>>topdir="~"
>>
>>setwd(topdir)
>>
>>### Took this function from a friend so I'm not sure what it's doing besides
>>### grep-ing a directory?
>>ll<-function(string)
>>{
>>	grep(string,dir(),value=T)
>>}
>>
>>### I believe this is looking for all files of form below
>>subdir = ll("Coverage_[1-9][0-9]$")
>>
>>### A for loop iterating through each of the sub directories.
>>for (i in subdir)
>>{     
>>         #not sure what this line is doing as I found it on the internet on a
>>         #similar function
>>	setwd(paste(getwd(),i,sep="/"))
>>         #This makes a vector of all the file names
>>         filelist=ll("Length_")
>>
>>Can I use a regex or logic to only take the filelist variables I want?
>>And can I now get the mean of each Length_* and set in a matrix (length x
>>mean)?
>>
>>Then finally, how to do a linear regression of this.       
>>
>>--
>>View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
>>Sent from the R help mailing list archive at Nabble.com.
>>
>>______________________________________________
>>R-help at r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
> 
> 
> -- 
> --------------------------------------
> Don MacQueen
> Environmental Protection Department
> Lawrence Livermore National Laboratory
> Livermore, CA, USA
> 925-423-1062
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html
Sent from the R help mailing list archive at Nabble.com.


