[R] How to apply a function to every element of a dataframe, when the function uses for each column and row different values to calculate with?

Wed Aug 21 00:01:13 CEST 2013

PLEASE do not crosspost to Rhelp and googlegroups. (removed that address.)

On Aug 20, 2013, at 9:43 AM, Jacqueline Oehri wrote:

> Dear R users
> 
> 
> I have a question concerning applying a function to each element of a dataframe:
> 
> 
> 1)
> --> I have a dataframe like this: "d":
> (column names: names of land-cover types, row names: coordinates, nr:
> row sums, nc: column sums)
> (look at the end of the mail for the structure of d, dput(d) )
> here, "d" has 14 rows and 6 columns:
> 
>> d
>       PL_7_1_7.txt PL_7_1_8.txt PUEH_4_0.txt PUEH_7_1_2.txt UEH_7_2_2.txt nr
> 821194            0            0            0              0             0    29
> 821202            0            0            0              0             0     8
> 821206            1            0            0              0             0     2
> 827162            1            0            0              0             0     6
> 827166            0            1            1              1             1    17
> 827178            0            0            0              0             0     0
> 827182            1            0            0              0             0     4
> 827186            0            0            0              0             0    16
> 827190            0            0            0              0             0    16
> 827194            0            0            0              0             0    18
> 827198            0            0            0              0             0    19
> 827206            0            0            0              0             0    19
> 833166            0            0            0              0             0     8
> nc               86          120          905            300           309 18733
> 
> 
> -->And I want to apply the following function "f" to each element xij
> of the dataframe "d":
> (xij is the element of the dataframe "d" at row no. "i" and column
> no. "j"; x11 is therefore the element in the first row and the first
> column, which in the case of "d" is equal to "0".)
> 
> f = (x[i][j] -((nr[i]*nc[j])/n))^2/((nr[i]*nc[j])/n)

Looks like you are trying to reinvent the chisq.test function. These are snippets of that code with the continuity correction material removed:

  n  <- sum(x)               # grand total, as defined inside chisq.test
  sr <- rowSums(x)
  sc <- colSums(x)
  E  <- outer(sr, sc, "*")/n
  STATISTIC <- sum( ..see below.. ) 

You would probably remove the sum and go with

  fmat <- (abs(x - E) )^2/E

(I'm not sure why that abs is in the `chisq.test` code.)
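
Here is a minimal sketch of that calculation on a made-up 3 x 2 count matrix (the data and the names x and fmat are only for illustration, not taken from your post). Once the continuity correction is gone, squaring makes abs() redundant, so it is dropped here, and the cell contributions should add up to the statistic that chisq.test() reports:

```r
# Toy 3 x 2 count matrix (hypothetical data, for illustration only)
x <- matrix(c(1, 0,
              0, 4,
              2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("A", "B")))

n  <- sum(x)       # grand total
sr <- rowSums(x)   # row totals, the nr[i] of the question
sc <- colSums(x)   # column totals, the nc[j] of the question

# outer() builds the whole matrix of nr[i] * nc[j] products in one call
E <- outer(sr, sc, "*") / n

# contribution of every cell to the chi-squared statistic
fmat <- (x - E)^2 / E

# compare with chisq.test(); warnings about small expected counts are muted
ref <- unname(suppressWarnings(chisq.test(x, correct = FALSE))$statistic)
all.equal(sum(fmat), ref)
```

Because outer() returns every nr[i]*nc[j] product at once, no loop or apply() over the individual cells is needed.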

> 
> 
> so that in the end I will have a new dataframe "e", which contains the
> results of the function "f" as its elements instead of the original
> values! (do you know what I mean?)
> Do you have any hints how to do that?
> 
> 2) After this, I wanted to find for EACH ROW in "e" the maximum
> value in the row and link the column name of this
> maximum value to the respective row name,
> so that in the end I will know for each row name,

Just index the column names by the row-wise result of which.max:

 colnames(m) [ apply(m, 1, which.max) ]  # (be sure to remove the "nr" column)

Test to see if I'm missing anything:

chisq.test        # to see the code

# posting dput() on the corner of your matrix was a good idea:

m <- d[rownames(d) != "nc", colnames(d) != "nr"]   # drop the margin row and column
m <- data.matrix(m)
n <- sum(m)
sr <- rowSums(m)
sc <- colSums(m)
E <- outer(sr, sc, "*")/n
fmat <- (abs(m - E))^2/E
colnames(m)[apply(m, 1, which.max)]

 [1] "PL_7_1_7.txt" "PL_7_1_7.txt" "PL_7_1_7.txt" "PL_7_1_7.txt"
 [5] "PL_7_1_8.txt" "PL_7_1_7.txt" "PL_7_1_7.txt" "PL_7_1_7.txt"
 [9] "PL_7_1_7.txt" "PL_7_1_7.txt" "PL_7_1_7.txt" "PL_7_1_7.txt"
[13] "PL_7_1_7.txt"

The sum of the expected counts checks out (it equals n):
> sum(E, na.rm=TRUE)
[1] 7

> round(fmat, 3)
       PL_7_1_7.txt PL_7_1_8.txt PUEH_4_0.txt PUEH_7_1_2.txt UEH_7_2_2.txt
821194          NaN          NaN          NaN            NaN           NaN
821202          NaN          NaN          NaN            NaN           NaN
821206        0.762        0.143        0.143          0.143         0.143
827162        0.762        0.143        0.143          0.143         0.143
827166        1.714        0.321        0.321          0.321         0.321
827178          NaN          NaN          NaN            NaN           NaN
827182        0.762        0.143        0.143          0.143         0.143
827186          NaN          NaN          NaN            NaN           NaN
827190          NaN          NaN          NaN            NaN           NaN
827194          NaN          NaN          NaN            NaN           NaN
827198          NaN          NaN          NaN            NaN           NaN
827206          NaN          NaN          NaN            NaN           NaN
833166          NaN          NaN          NaN            NaN           NaN

With a more complete data set than the excerpt you posted, you would get fewer NaN rows; they come from the zero denominators (a row whose counts sum to zero has E equal to zero everywhere).
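
If the "nr" column and the "nc" row of d really hold the totals of your full table (my assumption, going by your description), you could probably feed those margins to outer() directly instead of recomputing them from the excerpt. A sketch on three rows and two columns copied by hand from your dput() output:

```r
# Three data rows plus the "nc" margin row, copied from the posted dput(d)
d <- data.frame(
  PL_7_1_7.txt = c(1, 0, 0, 86),
  PL_7_1_8.txt = c(0, 1, 0, 120),
  nr           = c(2, 17, 0, 18733),
  row.names    = c("821206", "827166", "827178", "nc")
)

keep_row <- rownames(d) != "nc"
keep_col <- colnames(d) != "nr"

m  <- data.matrix(d[keep_row, keep_col])   # the counts themselves
nr <- d[keep_row, "nr"]                    # supplied row totals
nc <- unlist(d["nc", keep_col])            # supplied column totals
n  <- d["nc", "nr"]                        # supplied grand total

E    <- outer(nr, nc) / n                  # nr[i] * nc[j] / n for every cell
fmat <- (m - E)^2 / E                      # rows with nr == 0 still give NaN
round(fmat, 3)
```

Only rows whose supplied total is zero (here "827178") remain NaN under this approach.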

Note that which.max of c(0,0,0,0,0) is 1, so be aware that there is ambiguity when the row count is zero.
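
To take the maxima of the computed values rather than of the raw counts, one option is a small wrapper that returns NA for rows with nothing to compare. This is only a sketch: best_col() is a name made up here, and the matrix is typed in from a few of the rounded values shown above plus one all-NaN row.

```r
# A few rounded contributions from the output above, plus an all-NaN row
fmat <- matrix(c(0.762, 0.143, 0.143,
                 NaN,   NaN,   NaN,
                 1.714, 0.321, 0.321),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("821206", "827178", "827166"),
                               c("PL_7_1_7.txt", "PL_7_1_8.txt", "PUEH_4_0.txt")))

# best_col() is a hypothetical helper: name of the largest entry in a row,
# or NA when every entry is missing
best_col <- function(row) {
  if (all(is.na(row))) return(NA_character_)  # is.na() is TRUE for NaN too
  names(row)[which.max(row)]                  # which.max() discards NA/NaN
}

best <- apply(fmat, 1, best_col)
best
```

Ties are still resolved in favour of the first column, because which.max() returns the first maximum it finds.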

-- 
David

> which column name "fits
> best to it", i.e. which column name had the biggest value for this
> respective row.
> For example, in dataframe "d", in the third row called "821206", the
> maximum value lies in the first column, which is named "PL_7_1_7.txt".
> In this example I would link the name "821206" somehow to the name
> "PL_7_1_7.txt".
> 
> Do you have any suggestions for me on how best to do this, or
> where I should look up possible solutions? I'm really lost...

> 
> What I tried until now was this:
> 
>> 
> f.good <- function(x, nr, nc, n) {
>  n <-  d[14,6]
>  nr <- d[,6]
>  nc <- d[14,]
>  z1 <- (x-((nr*nc)/n))^2/((nr*nc)/n)
>  return(z1)
> }
> 
> and then I wanted to use the "apply" function:
> 
>> 
> apply(d, c(1,2), f.good)
> 
> but it never worked at all, and I think I'm far away from a solution!
> 
> Can somebody help me out and give me a hint what to do? Does somebody
> know a clever way to achieve tasks 1) & 2)?
> 
> I'm very glad about every input!
> 
> Thanks a lot already!!! Have a nice day!
> 
> Best wishes,
> Jacqueline
> 
> 
>> dput(d)
> structure(list(PL_7_1_7.txt = c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
> 0, 0, 0, 86), PL_7_1_8.txt = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
> 0, 0, 0, 120), PUEH_4_0.txt = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
> 0, 0, 0, 905), PUEH_7_1_2.txt = c(0, 0, 0, 0, 1, 0, 0, 0, 0,
> 0, 0, 0, 0, 300), UEH_7_2_2.txt = c(0, 0, 0, 0, 1, 0, 0, 0, 0,
> 0, 0, 0, 0, 309), nr = c(29, 8, 2, 6, 17, 0, 4, 16, 16, 18, 19,
> 19, 8, 18733)), .Names = c("PL_7_1_7.txt", "PL_7_1_8.txt", "PUEH_4_0.txt",
> "PUEH_7_1_2.txt", "UEH_7_2_2.txt", "nr"), row.names = c("821194",
> "821202", "821206", "827162", "827166", "827178", "827182", "827186",
> "827190", "827194", "827198", "827206", "833166", "nc"), class = "data.frame")
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA