[R] working with taxonomic trees: sampling

Olga Lyashevska olga at herenstraat.nl
Mon Feb 1 13:32:32 CET 2010


Dear all,

I am working with taxonomic data, represented as a list of classes,  
orders, families, genera and finally species.

 > class(mydata)
[1] "data.frame"
 > mode(mydata)
[1] "list"
 > names(mydata)
[1] "tclass"   "torder"   "tfamily"  "tgenus"   "tspecies"
 > length(mydata$tclass)
[1] 161590

The first 10 rows look like the following:

 > mydata[1:10,]
        tclass        torder            tfamily       tgenus                  tspecies
1  Chlorophyta Chlorophyceae     Dunaliellaceae Collodictyon   Collodictyontriciliatum
2  Chlorophyta Chlorophyceae     Dunaliellaceae Collodictyon      Collodictyonciliatum
3  Chlorophyta Chlorophyceae     Dunaliellaceae Collodictyon  Collodictyonsemiciliatum
4  Chlorophyta Chlorophyceae     Dunaliellaceae   Dunaliella          Dunaliellasalina
5  Chlorophyta Chlorophyceae     Dunaliellaceae   Dunaliella        Dunaliellabardawil
6  Chlorophyta Chlorophyceae     Dunaliellaceae   Dunaliella     Dunaliellatertiolecta
7  Chlorophyta Chlorophyceae Chlamydomonadaceae Brachiomonas     Brachiomonassubmarina
8  Chlorophyta Chlorophyceae Chlamydomonadaceae Brachiomonas       Brachiomonassimplex
9  Chlorophyta Chlorophyceae Chlamydomonadaceae Brachiomonas Brachiomonasellipsoidalis
10 Chlorophyta Chlorophyceae Chlamydomonadaceae Brachiomonas      Brachiomonaswestiana

In total I have 115 unique classes, containing 733 orders, which in
turn contain 16 185 families, etc.

What I am trying to do is obtain a subtree made up of, say, n1 random
classes, containing n2 random orders (restricted to those that belong
to the classes chosen earlier), containing n3 random families, and so
on all the way down to species, where the number of species will be
n5.

So the elements I choose at each level are constrained by the elements
already chosen at the level above. If I randomly choose, say, 3 classes
A, B and C, I want to restrict the randomly chosen orders (say a1, a2,
a3, b1, b2) to those that belong to the classes already chosen.
Similarly, I need to restrict the families to those that belong to the
chosen orders, and hence to classes A, B and C.
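
The restriction described above can be sketched on toy data as follows. This is only a sketch under assumptions: the data frame `toy`, its labels, the function name `sample_subtree` and the counts in `n` are all hypothetical, and only three of the five levels are shown. At each level it samples among the children of what has been kept so far and then drops every row outside the sample.

```r
# Toy taxonomy with hypothetical labels (the real mydata has 161590 rows)
toy <- data.frame(
  tclass   = c("A", "A", "A", "A", "B", "B", "C", "C"),
  torder   = c("a1", "a1", "a2", "a2", "b1", "b1", "c1", "c2"),
  tspecies = paste0("sp", 1:8),
  stringsAsFactors = FALSE
)

# Walk down the levels. At each level, sample only among the values still
# present in d, then keep just the rows that belong to the sampled values.
sample_subtree <- function(d, levels, n) {
  for (i in seq_along(levels)) {
    candidates <- unique(d[[levels[i]]])
    k <- min(n[i], length(candidates))        # cannot pick more than exist
    chosen <- candidates[sample.int(length(candidates), k)]
    d <- d[d[[levels[i]]] %in% chosen, , drop = FALSE]
  }
  d   # every remaining row is a complete class -> order -> species path
}

set.seed(1)
sub <- sample_subtree(toy, c("tclass", "torder", "tspecies"), n = c(2, 2, 3))
```

Because rows are dropped at every step, a class whose orders were not picked disappears from the result, so the final number of classes can be smaller than the n1 requested.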

So I want to obtain a subtree spanning all taxonomic levels, with a
randomly defined number of elements at each level, but in such a way
that I do not end up with orphaned nodes, i.e. species without classes.

I have been trying to use 'sample' as follows:

# pick 10 random classes (I want this count to be random as well)
tcla <- sample(mydata$tclass, 10, replace=TRUE)
# keep the orders that belong to the classes chosen above;
# %in% is needed here, since == would compare element by element
torder1 <- mydata$torder[mydata$tclass %in% tcla]
# pick 10 orders from the classes that are already chosen
tord <- sample(torder1, 10, replace=TRUE)

etc. all the way down to species level.
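
When matching a column against a vector of chosen values, `==` and `%in%` behave differently. The small example below, with hypothetical vectors, shows the difference: `==` recycles the shorter vector and compares position by position, while `%in%` tests membership for each row.

```r
tclass <- c("A", "A", "B", "B", "C", "C")         # hypothetical class column
torder <- c("a1", "a2", "b1", "b2", "c1", "c2")   # matching order column
tcla   <- c("B", "A")                             # two chosen classes

# '==' recycles tcla to B, A, B, A, B, A and compares position by position,
# so a row matches only when the recycled value happens to line up with it.
by_equal <- torder[tclass == tcla]
by_equal   # "a2" "b1"

# '%in%' asks, for every row, whether its class is among the chosen ones.
by_in <- torder[tclass %in% tcla]
by_in      # "a1" "a2" "b1" "b2"
```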

The problem with this approach is that I may obtain branches without
any leaves. How can I get rid of those branches?
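
One possible way to prune such branches, sketched on toy data: keep only the rows whose value was picked at every level, then read the surviving elements back from those rows. The data frame `toy` and the picks in `picked` are hypothetical, and this is one option among several, not a definitive answer.

```r
toy <- data.frame(
  tclass   = c("A", "A", "B", "B", "C"),
  torder   = c("a1", "a2", "b1", "b1", "c1"),
  tspecies = c("s1", "s2", "s3", "s4", "s5"),
  stringsAsFactors = FALSE
)

# Suppose these were picked level by level (hypothetical picks).
# Class A and order a1 were picked, but none of their species were.
picked <- list(
  tclass   = c("A", "B"),
  torder   = c("a1", "b1"),
  tspecies = c("s3", "s4")
)

# A row survives only if its value is among the picks at every level.
keep <- Reduce(`&`, lapply(names(picked), function(l) toy[[l]] %in% picked[[l]]))
pruned <- toy[keep, , drop = FALSE]

# Elements that still have at least one leaf below them.
survivors <- lapply(pruned, unique)
survivors$tclass   # "B"
survivors$torder   # "b1"
```

Class A and order a1 drop out because no picked species sits below them, so what remains contains only complete paths from class to species.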

Finally, I want to repeat this procedure, say, 1000 times, each time
obtaining a different number of elements at each taxonomic level.
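
The repetition could be sketched with `replicate`, drawing the number of elements at each level at random on every run. Everything named here (`toy`, `levels_down`, `one_draw`) is hypothetical, and drawing the count uniformly between 1 and the number of available candidates is an assumption made only for illustration.

```r
toy <- data.frame(
  tclass   = rep(c("A", "B", "C"), each = 4),
  torder   = rep(paste0("o", 1:6), each = 2),
  tspecies = paste0("s", 1:12),
  stringsAsFactors = FALSE
)
levels_down <- c("tclass", "torder", "tspecies")

# One random subtree: at each level pick a random number of the values
# still available, then keep only the rows that belong to them.
one_draw <- function(d) {
  for (l in levels_down) {
    candidates <- unique(d[[l]])
    k <- sample.int(length(candidates), 1)   # random count, 1..all available
    chosen <- candidates[sample.int(length(candidates), k)]
    d <- d[d[[l]] %in% chosen, , drop = FALSE]
  }
  d
}

set.seed(42)
draws <- replicate(1000, one_draw(toy), simplify = FALSE)

# Number of distinct elements per level in each draw (3 x 1000 matrix)
sizes <- sapply(draws, function(d) {
  sapply(d[levels_down], function(x) length(unique(x)))
})
```

Each element of `draws` is a data frame of complete paths, and each column of `sizes` records how many classes, orders and species that draw ended up with.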

Sorry for this long-winded post, I hope it is clear what I am trying  
to do.

I would appreciate any tips!

Thanks,
Olga


