[R] Writing a summary file in R

Wed Aug 3 22:57:58 CEST 2011

Just a very simple follow-up. In the summary table (listed as "summ" below),
the "TR" column I would like to display the total number of rows (i.e.
counts) which I have done via "NROW()" function.

However, in the "RG1" I would only like to count the number of rows with a
'totalread' count >= 1 (i.e. rows that don't contain zero).

This may be confusing given the data I've provided, but values in the
'totalreads' column don't have to be 1 or 0, they can be any value.
Therefore using "sum()" won't work in every case.

As you can see I've tried using NROW() below for "RG1" but it didn't work
out like I had planned.

For example, given the input data, "chr4 100 300" should have RG1=1 and
percent=0.5. Instead, it just counts every row regardless of value.

The solution is probably something very simple I'm overlooking, but if you
could help I'd appreciate it.

Below is the code I've slightly modified from David's reply:
###############################Code##############################

> colnames(data) <-
> c("chr","start","end","base1","base2","totalreads","methylation","strand")
> data
#this is the input file
####################################
     chr start end base1 base2 totalreads methylation strand
1   chr1   100 159   104   104          1        0.05      +
2   chr1   100 159   145   145          1        0.04      +
3   chr1   200 260   205   205          1        0.12      +
4   chr1   500 750   600   600          1        0.09      +
5   chr3   450 700   500   500          1        0.03      +
6   chr4   100 300   150   150          1        0.05      +
7   chr4   100 300   175   175          0        0.00      +
8   chr7   350 600   400   400          1        0.06      +
9   chr7   350 600   550   550          0        0.00      +
10  chr9   100 125   100   100          1        0.10      +
11 chr11   679 687   680   680          1        0.07      +
12 chr11   679 687   681   681          0        0.00      +
13 chr22   100 200   105   105          1        0.03      +
14 chr22   100 200   110   110          1        0.08      +
15 chr22   300 400   350   350          0        0.00      +
####################################

> splinp <- split(data, paste(data$chr, data$start))
> df <- as.data.frame(t(sapply(splinp, function(x) list(end=x$end[1],
> TR=NROW(x[['totalreads']]), RG1=NROW(x[['totalreads']]>=1),
> percent=(NROW(x[['totalreads']]>=1)/NROW(x[['totalreads']]))))))
> df

#######################
          end TR RG1 percent
chr1 100  159  2   2       1
chr1 200  260  1   1       1
chr1 500  750  1   1       1
chr11 679 687  2   2       1
chr22 100 200  2   2       1
chr22 300 400  1   1       1
chr3 450  700  1   1       1
chr4 100  300  2   2       1
chr7 350  600  2   2       1
chr9 100  125  1   1       1
#######################

> df.summ <- as.data.frame(t(sapply(splinp, function(x)
> summary(x$methylation))))
> summ<-cbind(df,df.summ)
> summ

#the finished output
###########################################
          end TR RG1 percent Min. 1st Qu. Median  Mean 3rd Qu. Max.
chr1 100  159  2   2       1 0.04  0.0425  0.045 0.045  0.0475 0.05
chr1 200  260  1   1       1 0.12  0.1200  0.120 0.120  0.1200 0.12
chr1 500  750  1   1       1 0.09  0.0900  0.090 0.090  0.0900 0.09
chr11 679 687  2   2       1 0.00  0.0175  0.035 0.035  0.0525 0.07
chr22 100 200  2   2       1 0.03  0.0425  0.055 0.055  0.0675 0.08
chr22 300 400  1   1       1 0.00  0.0000  0.000 0.000  0.0000 0.00
chr3 450  700  1   1       1 0.03  0.0300  0.030 0.030  0.0300 0.03
chr4 100  300  2   2       1 0.00  0.0125  0.025 0.025  0.0375 0.05
chr7 350  600  2   2       1 0.00  0.0150  0.030 0.030  0.0450 0.06
chr9 100  125  1   1       1 0.10  0.1000  0.100 0.100  0.1000 0.10
############################################

##############################################################

David Winsemius wrote:
> 
> On Jul 27, 2011, at 9:42 PM, Dennis Murphy wrote:
> 
>> Hi:
>>
>> Is this more or less what you're after?
>>
>> ## Note: This is the preferred way to send your data by e-mail.
>> ## I used dput(data-frame-name) to produce this,
>> ## where data-frame-name = 'df' on my end.
>> df <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr3",
>> "chr4", "chr4", "chr7", "chr7", "chr9", "chr11", "chr11", "chr22",
>> "chr22", "chr22"), V2 = c(100L, 100L, 200L, 500L, 450L, 100L,
>> 100L, 350L, 350L, 100L, 679L, 679L, 100L, 100L, 300L), V3 = c(159L,
>> 159L, 260L, 750L, 700L, 300L, 300L, 600L, 600L, 125L, 687L, 687L,
>> 200L, 200L, 400L), V4 = c(104L, 145L, 205L, 600L, 500L, 150L,
>> 175L, 400L, 550L, 100L, 680L, 681L, 105L, 110L, 350L), V5 = c(104L,
>> 145L, 205L, 600L, 500L, 150L, 175L, 400L, 550L, 100L, 680L, 681L,
>> 105L, 110L, 350L), V6 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
>> 1L, 1L, 0L, 1L, 1L, 0L), V7 = c(0.05, 0.04, 0.12, 0.09, 0.03,
>> 0.05, 0, 0.06, 0, 0.1, 0.07, 0, 0.03, 0.08, 0), V8 = c("+", "+",
>> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"
>> )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"
>> ), class = "data.frame", row.names = c(NA, -15L))
>>
>> ############
>> # This is the structure you should see:
>>> str(df)
>> 'data.frame':   15 obs. of  8 variables:
>> $ V1: chr  "chr1" "chr1" "chr1" "chr1" ...
>> $ V2: int  100 100 200 500 450 100 100 350 350 100 ...
>> $ V3: int  159 159 260 750 700 300 300 600 600 125 ...
>> $ V4: int  104 145 205 600 500 150 175 400 550 100 ...
>> $ V5: int  104 145 205 600 500 150 175 400 550 100 ...
>> $ V6: int  1 1 1 1 1 1 0 1 0 1 ...
>> $ V7: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
>> $ V8: chr  "+" "+" "+" "+" ...
>> ############
>>
>> # Method 1: Write a function and call ddply()
>> summfun <- function(d)  {
>>    dsum <- as.data.frame(as.list(summary(d[['V7']])))
>>    names(dsum) <- c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max')
>>    data.frame(V3 = d[1, 'V3'], dsum)
>>  }
>> library('plyr')
>> ddply(df, .(V1, V2), summfun)
>>
>> The idea behind summfun is this: ddply() prefers functions that take a
>> data frame as input and a data frame (or scalar) as output. dsum
>> converts summary(V7) to a data frame by first coercing it into a list
>> and then to a data frame. The names are changed for convenience. dsum
>> has one line, so we add V3 to the data frame before outputting it.
>> ddply() will attach the grouping variables to the output
>> automatically; however, you can put them into the output data frame
>> and ddply() will not duplicate the grouping variables in the output.
>>
>> The alternative in ddply(), which is simpler code, outputs the results
>> from summary() in different rows for each grouping. In this event, it
>> is useful to carry along the names of the summaries so that one can
>> recast the data with the cast() function from the reshape package:
>>
>> # Method 2: Summarize and reshape
>> # V3 is unnecessary but it is useful to carry it along for the output
>> u <- ddply(df, .(V1, V2, V3), summarise, summ = summary(V7),
>>                       summtype = names(summary(V7)))
>> library('reshape')
>> cast(u, V1 + V2 + V3 ~ summtype, value = 'summ')
>>
>> HTH,
>> Dennis
>>
>> PS: I may be one of those folks to whom David was referring in
>> relation to plyr :)
> 
> I've been really impressed at Dennis' facility with plyr, reshape, and  
> reshape2. Note that the 'reshape' function has nothing to do with the  
> 'reshape' package.  Here's what I came up with using base functions:
> 
>  > str(inpdat)
> 'data.frame':	15 obs. of  8 variables:
>   $ chromosome : chr  "chr1" "chr1" "chr1" "chr1" ...
>   $ startreg   : int  100 100 200 500 450 100 100 350 350 100 ...
>   $ endreg     : int  159 159 260 750 700 300 300 600 600 125 ...
>   $ base1      : int  104 145 205 600 500 150 175 400 550 100 ...
>   $ base2      : int  104 145 205 600 500 150 175 400 550 100 ...
>   $ totalreads : int  1 1 1 1 1 1 0 1 0 1 ...
>   $ methylation: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
>   $ strand     : chr  "+" "+" "+" "+" ...
> # The split into distinct 'chromosome' and 'startreg' categories:
> splinp <- split(inpdat, paste(inpdat$chromosome, inpdat$startreg) )
> 
> # Process within separate categories: the tapply, aggragate and by  
> functions are all related
> 
>  > df <- as.data.frame( t(sapply(splinp, function(x) list(chr=x 
> $chromosome[1], strt=x$startreg[1], end=x$endreg[1],  
> frac=sum(x[['totalreads']]>=1)/nrow(x) )) ) )
> # You often need the t() function when working with apply functions
>  > df
>              chr strt end frac
> chr1 100   chr1  100 159    1
> chr1 200   chr1  200 260    1
> chr1 500   chr1  500 750    1
> chr11 679 chr11  679 687  0.5
> chr22 100 chr22  100 200    1
> chr22 300 chr22  300 400    0
> chr3 450   chr3  450 700    1
> chr4 100   chr4  100 300  0.5
> chr7 350   chr7  350 600  0.5
> chr9 100   chr9  100 125    1
> 
>  > as.data.frame(t(sapply(splinp, function(x) summary(x 
> $methylation )) ) )
>            Min. 1st Qu. Median  Mean 3rd Qu. Max.
> chr1 100  0.04  0.0425  0.045 0.045  0.0475 0.05
> chr1 200  0.12  0.1200  0.120 0.120  0.1200 0.12
> chr1 500  0.09  0.0900  0.090 0.090  0.0900 0.09
> chr11 679 0.00  0.0175  0.035 0.035  0.0525 0.07
> chr22 100 0.03  0.0425  0.055 0.055  0.0675 0.08
> chr22 300 0.00  0.0000  0.000 0.000  0.0000 0.00
> chr3 450  0.03  0.0300  0.030 0.030  0.0300 0.03
> chr4 100  0.00  0.0125  0.025 0.025  0.0375 0.05
> chr7 350  0.00  0.0150  0.030 0.030  0.0450 0.06
> chr9 100  0.10  0.1000  0.100 0.100  0.1000 0.10
> 
> # The coup de grace: bind the columns together
> 
>  > df.summ <- as.data.frame(t(sapply(splinp, function(x) summary(x 
> $methylation )) ) )
>  > cbind(df, df.summ)
>              chr strt end frac Min. 1st Qu. Median  Mean 3rd Qu. Max.
> chr1 100   chr1  100 159    1 0.04  0.0425  0.045 0.045  0.0475 0.05
> chr1 200   chr1  200 260    1 0.12  0.1200  0.120 0.120  0.1200 0.12
> chr1 500   chr1  500 750    1 0.09  0.0900  0.090 0.090  0.0900 0.09
> chr11 679 chr11  679 687  0.5 0.00  0.0175  0.035 0.035  0.0525 0.07
> chr22 100 chr22  100 200    1 0.03  0.0425  0.055 0.055  0.0675 0.08
> chr22 300 chr22  300 400    0 0.00  0.0000  0.000 0.000  0.0000 0.00
> chr3 450   chr3  450 700    1 0.03  0.0300  0.030 0.030  0.0300 0.03
> chr4 100   chr4  100 300  0.5 0.00  0.0125  0.025 0.025  0.0375 0.05
> chr7 350   chr7  350 600  0.5 0.00  0.0150  0.030 0.030  0.0450 0.06
> chr9 100   chr9  100 125    1 0.10  0.1000  0.100 0.100  0.1000 0.10
> 
> -- 
> Best;
> 
> David.
> 
>>
>> On Wed, Jul 27, 2011 at 4:02 PM, a217 <ajn21 at case.edu> wrote:
>>> Hello,
>>>
>>> I have an input file:
>>> http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt
>>>
>>> where col 1 is chromosome, column2 is start of region, column 3 is  
>>> end of
>>> region, column 4 and 5 is base position, column 6 is total reads,  
>>> column 7
>>> is methylation data, and column 8 is the strand.
>>>
>>>
>>> I would like a summary output file such as:
>>> http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt  
>>> out.summary.txt
>>>
>>> where column 1 is chromosome, column 2 is start of region, column 3  
>>> is end
>>> of region, column 4 is total reads in general, column 5 is total  
>>> reads >=1,
>>> column 6 is (col4/col5) or the percentage, and at the end I'd like  
>>> to list 6
>>> more columns based on summary results from summary() function in R.
>>>
>>> The summary() function will be used to analyze all of the  
>>> methylation data
>>> (col7 from input) for each region (bounded by col2 and col3).
>>>
>>> For example for chr1 100 159 summary() gives:
>>>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  0.0400  0.0425  0.0450  0.0450  0.0475  0.0500
>>>
>>> which is simply the methylation data input into summary() only in  
>>> the region
>>> of chr1 100 159.
>>>
>>> I know how to perform all of the required functions line-by-line,  
>>> but the
>>> hard part for me is essentially taking the input data with multiple
>>> positions in each region and assigning all of the summary results  
>>> to one
>>> line identified by the region.
>>>
>>> If any of you have any suggestions I would appreciate it.
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius, MD
> West Hartford, CT
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

--
View this message in context: http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3716837.html
Sent from the R help mailing list archive at Nabble.com.