[R] help for a loop procedure

Thu Jan 27 17:30:15 CET 2011

On Thu, Jan 27, 2011 at 11:30:37AM +0100, Serena Corezzola wrote:
> Hello everybody!
> 
> 
> 
> I'm trying to define the optimal number of surveys to detect the highest
> number of species within a monitoring season/session.
> 
> To do this I want to run all the possible combinations between a set of
> samples and to calculate the total number of species for each combination of
> 2, 3, 4, ... n sample events, so that at the end I will be able to define which
> is the lowest number of samples that I need to obtain the best result.
> 
> 
> 
> I've already done this operation manually, just to see if it works, but the
> point is that some of my datasets have more than 30 samples and more than 35
> species, so that the number of combinations will be HUGE!
> 
> So here is the question: I need to find a way for R to make all possible
> combinations of samples automatically, and then to automatically return the
> total number of species in every combination.
> 
> I've tried to search for a loop script, or something like that. However, I'm
> relatively new to R and I don't know what I need to do. Can anyone help me?
> 
> 
> 
> Here I've written a simple example of the operations I need to do, just to
> make my problem clearer.
> 
> 
> 
> My dataset (matrix) has sample events by rows (U1,U2,U3) and detected
> species by columns.
> 
> 
> 
> U<-read.table("C:\\Documents\\tre_usc.txt",header=T,row.names=1,sep="\t",dec = ",")

Hello:

For simplicity of preparing a reply, let me include your data
as an R command.

  U <- structure(list(Aadi = c(0L, 0L, 0L), Aagl = c(0L, 0L, 0L),
  Apap = c(0L, 0L, 0L), Aage = c(0L, 0L, 0L), Bdia = c(7L, 4L, 0L), 
  Beup = c(0L, 2L, 0L), Crub = c(5L, 1L, 0L), Carc = c(0L, 0L, 0L), 
  Cpam = c(1L, 0L, 14L)), .Names = c("Aadi", "Aagl", "Apap", "Aage", 
  "Bdia", "Beup", "Crub", "Carc", "Cpam"), class = "data.frame", 
  row.names = c("U1", "U2", "U3"))

     Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1    0    0    0    0    7    0    5    0    1
  U2    0    0    0    0    4    2    1    0    0
  U3    0    0    0    0    0    0    0    0   14

> First, I've created from this matrix all the subsets based on single
> samples,
> 
> 
> 
> U1 <- U [c(1), ]
> 
> U2 <- U [c(2), ]
> 
> U3 <- U [c(3), ]
>
[...] 

> 
> then I've combined them, summing each time the values of the chosen lines
> (total no. of combinations = 4).
> 
> 
> 
> U12<-U1+U2
> 
> U13<-U1+U3
> 
> U23<-U2+U3
> 
> U123<-U1+U2+U3
> 
[...]
> 
> 
> Then I've applied the command "length" to find the number of species for
> every new combination.
> 
> 
> 
> length(U12[U12>0])
> 
> [1]  4
> 
> 
> 
> length(U13[U13>0])
> 
> [1] 3
> 

This can be partially automated as follows:

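  UM <- as.matrix(U)
  # each row of A marks the samples included in one combination
  A <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(0, 0, 1),
  c(1, 1, 0),
  c(1, 0, 1),
  c(0, 1, 1),
  c(1, 1, 1))
  # build row names such as "U12" from the indices of the selected samples
  rownam <- rep("U", times=nrow(A))
  for (i in seq_len(ncol(A))) {
    rownam[A[, i] == 1] <- paste(rownam[A[, i] == 1], i, sep="")
  }
  dimnames(A) <- list(rownam, NULL)
  # sums of the selected samples, one row per combination
  C <- A %*% UM
  C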

       Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1      0    0    0    0    7    0    5    0    1
  U2      0    0    0    0    4    2    1    0    0
  U3      0    0    0    0    0    0    0    0   14
  U12     0    0    0    0   11    2    6    0    1
  U13     0    0    0    0    7    0    5    0   15
  U23     0    0    0    0    4    2    1    0   14
  U123    0    0    0    0   11    2    6    0   15

  rowSums(C != 0)

    U1   U2   U3  U12  U13  U23 U123 
     3    3    1    4    3    4    4 
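
The matrix A above is typed in by hand. As a minimal sketch, assuming
all subsets of every size are wanted and the table is small enough to
enumerate, the same indicator matrix could be built with combn(). The
helper name subset_indicators is hypothetical and not part of the
original reply.

```r
# Toy data from the thread (3 sample events, 9 species).
U <- data.frame(
  Aadi = c(0L, 0L, 0L), Aagl = c(0L, 0L, 0L), Apap = c(0L, 0L, 0L),
  Aage = c(0L, 0L, 0L), Bdia = c(7L, 4L, 0L), Beup = c(0L, 2L, 0L),
  Crub = c(5L, 1L, 0L), Carc = c(0L, 0L, 0L), Cpam = c(1L, 0L, 14L),
  row.names = c("U1", "U2", "U3"))
UM <- as.matrix(U)
n <- nrow(UM)

# Hypothetical helper: 0/1 indicator matrix with one row per subset of
# size k taken from n samples.
subset_indicators <- function(n, k) {
  idx <- combn(n, k)                  # k x choose(n, k); one subset per column
  A <- matrix(0, nrow = ncol(idx), ncol = n)
  # set A[j, i] to 1 whenever sample i belongs to subset j
  A[cbind(rep(seq_len(ncol(idx)), each = k), as.vector(idx))] <- 1
  # names like "U12"; ambiguous once n >= 10, so use a separator there
  rownames(A) <- paste0("U", apply(idx, 2, paste, collapse = ""))
  A
}

# Stack the indicator matrices for k = 1, ..., n and count the species
# with a nonzero sum in each combination.
A <- do.call(rbind, lapply(seq_len(n), subset_indicators, n = n))
richness <- rowSums(A %*% UM != 0)
richness
```

On the toy data this gives the same rows, in the same order, as the
hand-written A, and the same counts as rowSums(C != 0) above. It is
only practical for small n, since the number of rows grows as 2^n - 1.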

> Now I need to do this with 10 and 32 sample events... :(

If I understand you correctly, your real table U has 32 rows
and you want to consider all subsets of at most 10 rows. If this
is so, then the number of combinations is

  sum(choose(32, 1:10))
  # [1] 107594212

A matrix with this many rows and 35 columns requires about 30 GB
of memory when stored as double-precision numbers. How do you want
to summarize the results? There may be a more efficient way to
compute the required parameters.
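
The arithmetic behind this estimate can be sketched as follows. The
assumption here is that every entry is stored as an 8-byte double,
which is what a matrix product such as A %*% UM returns in R.

```r
n_rows <- sum(choose(32, 1:10))   # subsets of at most 10 of 32 rows
n_cols <- 35                      # species columns, as stated in the question
bytes  <- n_rows * n_cols * 8     # assumption: 8 bytes per double entry
bytes / 1e9                       # size in GB (10^9 bytes), roughly 30
```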

For example, the average number of species present in a random
selection of k rows can be computed easily: each column (species)
can be considered individually, and the probability that a column
has a nonzero sum over the selected rows can be computed without
actually constructing all the subsets.
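
One way this could look in R is sketched below. It assumes the k rows
are drawn without replacement, every subset being equally likely, and
that all counts are nonnegative, so a column sum is nonzero exactly
when at least one selected row is nonzero. If a species occurs in m of
the n rows, the probability of missing it is
choose(n - m, k) / choose(n, k). The function name expected_richness
is hypothetical.

```r
# Toy data from the thread, entered row by row.
UM <- matrix(c(0, 0, 0, 0, 7, 0, 5, 0,  1,
               0, 0, 0, 0, 4, 2, 1, 0,  0,
               0, 0, 0, 0, 0, 0, 0, 0, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("U1", "U2", "U3"),
                             c("Aadi", "Aagl", "Apap", "Aage", "Bdia",
                               "Beup", "Crub", "Carc", "Cpam")))

# Hypothetical helper: expected number of species detected in a
# uniformly random subset of k rows, without enumerating the subsets.
expected_richness <- function(UM, k) {
  n <- nrow(UM)
  m <- colSums(UM != 0)                     # rows in which each species occurs
  p <- 1 - choose(n - m, k) / choose(n, k)  # P(species appears in the subset)
  sum(p)                                    # expectation is the sum over species
}

sapply(1:3, function(k) expected_richness(UM, k))
```

For the toy data the results are 7/3, 11/3 and 4, which agree with the
averages of the exhaustive counts for the subsets of size 1, 2 and 3.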

If you need a parameter that is harder to compute than the
average, simulation is an option. In that case, instead of
generating all subsets, one generates only a smaller number
of randomly chosen subsets of k rows for each given k.
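
A minimal sketch of such a simulation is given below, again on the toy
data. The function name simulate_richness, the default of 1000
replicates and the seed are all arbitrary choices made for
illustration.

```r
# Toy data from the thread, entered row by row.
UM <- matrix(c(0, 0, 0, 0, 7, 0, 5, 0,  1,
               0, 0, 0, 0, 4, 2, 1, 0,  0,
               0, 0, 0, 0, 0, 0, 0, 0, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("U1", "U2", "U3"), NULL))

# Hypothetical helper: species counts for n_sim random subsets of k rows.
simulate_richness <- function(UM, k, n_sim = 1000) {
  replicate(n_sim, {
    rows <- sample(nrow(UM), k)                  # k distinct row indices
    sum(colSums(UM[rows, , drop = FALSE]) != 0)  # species with nonzero sum
  })
}

set.seed(1)                    # arbitrary seed, for reproducibility only
sims <- simulate_richness(UM, k = 2)
mean(sims)                     # estimate of the average number of species
table(sims)                    # estimated distribution of the species count
```

The vector sims can then be summarized in any way that is needed, for
example with quantile() or min(), which is the advantage over the
closed-form average.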

Petr Savicky.