[R] Counting/processing a character vector

Wed Feb 18 15:21:49 CET 2009

Apologies, Jim Holtman has pointed out a couple of problems/queries with
my original email that I would like to clear up.

Firstly, I introduced a typo when trying to be helpful. In my email
below, I had incorrectly typed out one of the species codes I would
count:

10000000
16220602
20110000
24000000
40320203 ## This should have been 40210102
45140000
45630600 == 7 "species" present.

Secondly, the criteria I laid out might suggest that in the 10 rows of
example I quoted, I would count both:

45630000
45630600

This is not what I wanted, and apologies that this was not clear. I only
want to count 45630600 because it is more "specific" about what creature
this is than 45630000. I cannot be sure that the individual recorded as
45630000 is not also a 45630600, so I should not count both, as this
could be double counting.

These data are species counts, and sometimes it is not possible to
identify an individual to species level. Sometimes we can't even get the
genus, or even the family, which is why we sometimes have a count for the
family (45630000) as well as for the genus (45630600) in the same
sample/site. How precise the identification is depends on how much of
the individual there is to identify it from.
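
As a rough illustration of the paragraph above, the level to which an
individual was identified can be read off a code by dropping its trailing
"00" pairs and counting the pairs that remain. This is only a sketch: the
names `codes`, `stems` and `level` are made up here, and it assumes every
code is exactly 8 characters long with at least the order filled in.

```r
## a few codes taken from the example rows
codes <- c("45630000", "45630600", "10000000", "16220602")

## drop trailing "00" pairs, e.g. "45630000" -> "4563"
stems <- sub("(00)+$", "", codes)

## one pair left = order, two = family, three = genus, four = species
## (assumes no code is all zeros, which would give an index of 0)
level <- c("order", "family", "genus", "species")[nchar(stems) / 2]

data.frame(sppcode = codes, stem = stems, level = level)
```

Here 45630000 comes out as a family-level record and 45630600 as a
genus-level record, as described above.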

So I want to count a higher level category only if I have not counted a
lower level category contained within it.
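
One possible way to express this rule in R is sketched below; it is not a
tested solution for the full data set. The helper `count_specific()` and
the object `ex` are hypothetical names introduced for illustration, and
the sketch assumes 8-character codes in which a level is never "00" when
a lower level is filled in. A code is skipped when another code in the
same site starts with the same non-zero pairs and carries on further.

```r
## Hypothetical helper: count the most specific codes within one site
count_specific <- function(codes) {
    ## drop trailing "00" pairs, e.g. "45630000" -> "4563"
    stems <- unique(sub("(00)+$", "", codes))
    ## a stem is superseded when another, longer stem begins with it
    superseded <- vapply(stems, function(s) {
        any(stems != s & substr(stems, 1, nchar(s)) == s)
    }, logical(1))
    sum(!superseded)
}

## the ten rows shown by head(dat, n = 10) in the original email
ex <- data.frame(id = factor(rep("10307", 10)),
                 sppcode = c("10000000", "16220602", "20000000",
                             "20110000", "24000000", "40210000",
                             "40210102", "45140000", "45630000",
                             "45630600"),
                 stringsAsFactors = FALSE)

## one count per site (level of the id factor)
n_spp <- tapply(ex$sppcode, ex$id, count_specific)
n_spp
```

For these ten rows this gives 7 for site 10307, matching the corrected
list above. The comparison of every stem against every other stem is
quadratic in the number of distinct codes per site, which should be fine
for a few hundred sites but is an assumption about the size of the data.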

I hope this is a little bit clearer? And no, I did not come up with this
coding system nor the idea to use "counts" of "species" in this
way... ;-)

Apologies if my original email caused unnecessary confusion.

All the best,

G

On Wed, 2009-02-18 at 13:37 +0000, Gavin Simpson wrote:
> Dear List,
> 
> I have a data set stored in the following format:
> 
> > head(dat, n = 10)
>       id  sppcode abundance
> 1  10307 10000000         1
> 2  10307 16220602         2
> 3  10307 20000000         5
> 4  10307 20110000         2
> 5  10307 24000000         1
> 6  10307 40210000        83
> 7  10307 40210102        45
> 8  10307 45140000         1
> 9  10307 45630000         1
> 10 10307 45630600        41
> > str(dat)
> 'data.frame':	111 obs. of  3 variables:
>  $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
>  $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
>  $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...
> 
> that represent counts of species, recorded with a particular coding
> system. The abundance column is not needed for this particular
> operation, but is present in the data files.
> 
> I am interested in counting entries (rows) in the sppcode component of
> dat. The sppcode takes a particular format: Order Family Genus Species,
> with 2 alphanumeric digits allocated for each level of the hierarchy. I
> want to know how many species there are in each site (the id factor),
> but I should only count a higher level entry if there are no lower
> levels present.
> 
> For example, for the above data excerpt (just the headed rows), I would
> count the following rows:
> 
> 10000000
> 16220602
> 20110000
> 24000000
> 40320203
> 45140000
> 45630600 == 7 "species" present.
> 
> To be more specific, I don't count 45630000 (row 9) because there exists
> a sppcode for this 'id' where either of the next two pairs of digits are
> not all 0's.
> 
> In words, I want to count all rows where WWXXYYZZ are ZZ != 00, then,
> rows where ZZ == 00 only if the WWXXYY combination has not been counted
> yet.
> 
> An example data set has been placed in my University web space and can
> be read into R with the following:
> 
> ## read example csv data
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
>                 colClasses = c("factor","character","numeric"))
> ## show the data
> head(dat, n = 10)
> 
> And the sppcode variable can be broken out into the 4 levels if required via:
> 
> ## split out the four levels of categorisation:
> dat2 <- data.frame(dat,
>                    order = with(dat, substr(sppcode, 1, 2)),
>                    family = with(dat, substr(sppcode, 3, 4)),
>                    genus = with(dat, substr(sppcode, 5, 6)),
>                    species = with(dat, substr(sppcode, 7, 8)))
> 
> The actual data set/problem contains several hundred different id's.
> 
> I can't see an efficient way of processing these data in the manner
> described. Any help would be most gratefully received.
> 
> Many thanks,
> 
> Gavin
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
