[R] apply a function down each column

Wed Jan 13 23:43:57 CET 2010

Thank you very much! It now works perfectly. I even extended it so
that it applies to the whole dataset:

data <- read.delim("mhc_data.txt", stringsAsFactors = FALSE)

# Count the letters two strings share (matched by letter, not by position)
lettermatch <- function(a, b) {
  tb <- merge(as.data.frame(table(strsplit(a, ""))),
              as.data.frame(table(strsplit(b, ""))), by = "Var1")
  sum(apply(tb[-1], 1, min))
}

# One row per pair of individuals, one column per sequence column
output <- matrix(ncol = ncol(data) - 1, nrow = nrow(data) / 2)
sim <- rep(0, nrow(data) / 2)

for (y in 2:ncol(data)) {
  for (x in 1:(nrow(data) / 2)) {
    a <- data[2 * x - 1, y]  # odd rows
    b <- data[2 * x, y]      # even rows
    sim[x] <- lettermatch(a, b)
  }
  output[, y - 1] <- sim
}
colnames(output) <- names(data)[-1]
rownames(output) <- 1:(nrow(data) / 2)

output
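
For what it's worth, the same pairing can also be written without the
explicit double loop. The following is only a sketch, not code anyone
in this thread posted: it reuses lettermatch() as defined above, pairs
odd and even rows with mapply(), and runs over the sequence columns
with sapply(). The small `dat` data frame is a hand-typed copy of the
first four rows of the sample data quoted below (C1/M1 and C2/M2),
standing in for the real file, and the names `odd`, `even` and
`pairwise` are made up for this illustration.

```r
# lettermatch() as posted in this thread
lettermatch <- function(a, b) {
  tb <- merge(as.data.frame(table(strsplit(a, ""))),
              as.data.frame(table(strsplit(b, ""))), by = "Var1")
  sum(apply(tb[-1], 1, min))
}

# Assumption: a hand-typed stand-in for the first four rows of the
# sample data from the thread, not the real input file
dat <- data.frame(
  Individuals = c("C1", "M1", "C2", "M2"),
  Seq1 = c("GGGG", "GGGG", "GGGG", "AGGG"),
  Seq2 = c("AATT", "AAAA", "AATT", "AACT"),
  Seq3 = c("CCGG", "GGGG", "CCGG", "CCGG"),
  Seq4 = c("CTTT", "GGGG", "CTTT", "CGTT"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, nrow(dat), by = 2)  # rows 1, 3, ... (the C individuals)
even <- seq(2, nrow(dat), by = 2)  # rows 2, 4, ... (the M individuals)

# For every sequence column, compare odd row i with even row i.
# mapply() calls lettermatch() once per pair of strings.
pairwise <- sapply(dat[-1], function(col) {
  mapply(lettermatch, col[odd], col[even], USE.NAMES = FALSE)
})
rownames(pairwise) <- seq_along(odd)

pairwise
```

mapply() is needed here because lettermatch() handles one pair of
strings at a time; handing it whole columns directly does not give one
result per pair. Note that sapply() only returns a matrix when there
are at least two pairs of rows.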

Laetitia

On 12.01.2010 at 18:31, Peter Ehlers wrote:

> Laetitia,
>
> I was just responding to your comment that "R complains
> about a syntax error". But I realize now that "2x" would
> probably cause an "unexpected symbol" error.
>
> Here's what I get when I run your loop; what do you get?
>
>> for (x in 1:(nrow(dat)-1)) {
> +  a <- as.character(dat[(2x-1),1])
> Error: unexpected symbol in:
> "for (x in 1:(nrow(dat)-1)) {
>  a <- as.character(dat[(2x"
>> b <- as.character(dat[(2x),1])
> Error: unexpected symbol in " b <- as.character(dat[(2x"
>> lettermatch(a,b)
> Error in strsplit(a, "") : object 'a' not found
>> }
> Error: unexpected '}' in "}"
>>
>
> and here's what I get when I fix the obvious syntax
> error:
>
>> for (x in 1:(nrow(dat)-1)) {
> +  a <- as.character(dat[(2*x-1),1])
> +  b <- as.character(dat[(2*x),1])
> +  lettermatch(a,b)
> + }
> Error in fix.by(by.x, x) : 'by' must specify valid column(s)
>>
>
> That leaves two problems:
> 1) you're looking at the wrong column in dat[,1]; that
>    should be dat[,2], etc.
> 2) that error message indicates that your index variable (x)
>    reaches invalid values: once x > nrow(dat)/2, 2*x points
>    past the last row.
>
> Try this:
>
> for (x in 1:(nrow(dat)/2)) {
>  a <- dat[(2*x-1),2]  # odd rows
>  b <- dat[(2*x),2]    # even rows
>  print(lettermatch(a,b))
> }
>
> You don't need the as.character() if you have character data.
> Always do a str(dat) before you do any analysis.
>
>  -Peter Ehlers
>
> Laetitia Schmid wrote:
>> Dear Peter,
>> thank you for the suggestion.
>> Unfortunately the star did not help. Did it work for you? For me it  
>> seems incomplete somehow.
>> Laetitia
>>
>> ________________________________________
>> From: Peter Ehlers [ehlers at ucalgary.ca]
>> Sent: Tuesday, January 12, 2010 09:54 AM
>> To: Laetitia Schmid
>> Cc: Steve Lianoglou; r-help at r-project.org
>> Subject: Re: [R] apply a function down each column
>>
>> See inline below.
>>
>> Laetitia Schmid wrote:
>>> Dear Steve,
>>> my solution looks like it should work, but it does not.
>>> I attached a text file with an extract of my data. Maybe you can
>>> try it yourself. I want to compare C1 with M1, C2 with M2, C3 with
>>> M3, ... for each column.
>>> I do not really know what the problem is. R complains about a
>>> syntax error.
>>> The function I am applying counts the letters the two strings have
>>> in common.
>>> Greg Hirson helped me to write it.
>>>
>>> lettermatch <- function(a, b) {
>>>   tb <- merge(as.data.frame(table(strsplit(a, ""))),
>>> as.data.frame(table(strsplit(b, ""))), by="Var1")
>>>   sum(apply(tb[-1], 1, min))
>>> }
>>>
>>> For example for the second column I tried:
>>>
>>> for (x in 1:(nrow(dat)-1)) {
>>> a <- as.character(dat[(2x-1),1])
>>
>> Shouldn't that be 2*x-1??
>>
>>  -Peter Ehlers
>>
>>> b <- as.character(dat[(2x),1])
>>> lettermatch(a,b)
>>> }
>>>
>>> or
>>>
>>> a <- as.character(dat[seq(1, nrow(dat), by=2),2])
>>> b <- as.character(dat[seq(2, nrow(dat), by=2), 2])
>>> all.results <- lettermatch(a,b)
>>>
>>> With "dat<-read.delim("data_lgs.txt",stringsAsFactors=FALSE)" I can
>>> drop the "as.character" calls in the code above.
>>>
>>> Laetitia
>>>
>>> Individuals    Seq1    Seq2    Seq3    Seq4
>>> C1    GGGG    AATT    CCGG    CTTT
>>> M1    GGGG    AAAA    GGGG    GGGG
>>> C2    GGGG    AATT    CCGG    CTTT
>>> M2    AGGG    AACT    CCGG    CGTT
>>> C3    AGGG    AACT    CCGG    CGTT
>>> M3    AGGG    AACT    CCGG    CGTT
>>> C4    GGGG    AATT    CCGG    CCTT
>>> M4    GGGG    AAAT    CGGG    CTTT
>>> C5    AGGG    ACTT    CCCG    CTTT
>>> M5    AGGG    CTTT    CCCC    CCTT
>>> C6    AGGG    CTTT    CCCC    CCTT
>>> M6    AAAG    CCTT    CCCC    CTTT
>>> C7    AAAG    ACCC    CCCG    GTTT
>>> M7    AAGG    AACC    CCGG    TTTT
>>> C8    GGGG    AATT    CCGG    CCTT
>>> M8    GGGG    AATT    CCGG    CCTT
>>> C9    GGGG    AAAA    GGGG    TTTT
>>> M9    GGGG    AAAA    GGGG    TTTT
>>> C11    AGGG    AAAC    CGGG    GGTT
>>> M11    GGGG    AATT    CCGG    CCTT
>>>
>>>
>>>
>>> On 11.01.2010 at 15:18, Steve Lianoglou wrote:
>>>
>>>> Hi,
>>>>
>>>> On Mon, Jan 11, 2010 at 8:41 AM, Laetitia Schmid <laetitia at gmt.su.se> wrote:
>>>>> Hello World,
>>>>> I have a function that makes pairwise comparisons between two
>>>>> strings. I would like to apply this function to my data (which
>>>>> consists of columns with different strings) in such a way that it
>>>>> compares the first with the second entry, then the third with the
>>>>> fourth, then the fifth with the sixth, and so on down each column.
>>>>> So (2x-1) and (2x) would be the entries to be compared!
>>>>>
>>>>> dat= my data:
>>>>>
>>>>> for the first column: compare dat[(2x-1),1] with dat[(2x),1] and x
>>>>> would be 1:i, i=length(dat[,1])
>>>>>
>>>>> I think the best way to do that is a loop:
>>>>>
>>>>> a <- as.character(dat[(2x-1),1])
>>>>> b <- as.character(dat[(2x),1])
>>>>>
>>>>> for (i in 1:length(dat[,1]) my_function(a, b))
>>>>>
>>>>> Can somebody help me to apply a function with a loop in the way I
>>>>> want to a column?
>>>> It seems as if you've got it already, haven't you?
>>>>
>>>> for (x in 1:(nrow(dat)-1)) {
>>>> a <- dat[(2x-1),1]
>>>> b <- dat[(2x), 1]
>>>> my_function(a,b)
>>>> }
>>>>
>>>>> Is there a specification of "tapply" for that?
>>>> I don't think so, but depending on what you want to do, the size of
>>>> your data, and the amount of RAM you have, it might be faster to
>>>> compare everything "at once" (assuming `my_function` can be
>>>> vectorized), for instance:
>>>>
>>>> a <- dat[seq(1, nrow(dat), by=2),1]
>>>> b <- dat[seq(2, nrow(dat), by=2), 1]
>>>> all.results <- my_function(a,b)
>>>>
>>>> Also, as an aside, I see you keep calling "as.character" on your
>>>> data when you extract it from your data.frame. Is your data being
>>>> converted to factors? You can set stringsAsFactors=FALSE if this is
>>>> the case and you are reading in data using read.table/delim/etc
>>>> (see: ?read.table)
>>>>
>>>> Hope that helps,
>>>>
>>>> -steve
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>> | Memorial Sloan-Kettering Cancer Center
>>>> | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> --
>> Peter Ehlers
>> University of Calgary
>> 403.202.3921
>>
>>
>