[R] subsetting character vector into groups of numerics

Tue Oct 29 03:29:57 CET 2002

Date: Tue, 29 Oct 2002 15:27:34 +1300
From: Patrick Connolly <p.connolly at hortresearch.co.nz>
To: Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk>
Subject: Re: [R] subsetting character vector into groups of numerics

On Tue, 29-Oct-2002 at 12:56AM +0100, Peter Dalgaard BSA wrote:

|> Patrick Connolly <p.connolly at hortresearch.co.nz> writes:
|> 

[...]

|> > I can't rely on the closing parenthesis as the last character in the
|> > vector, though the subgroup could be clearly defined without it.
|> > Numbers are obvious to the eye, but are not always separated from one
|> > another consistently.  Part of the reason for this exercise is to
|> > check that the Group is made up of the Subgroups with no elements
|> > missing, so getting Group is not simply a matter of concatenating the
|> > subgroups.
|> > 
|> > 
|> > Ideas appreciated.
|> 
|> Hmm... You seem to be telling us what the format is not. If you want
|> us to come up with something for the machine to do, it's not too
|> useful that things are "obvious to the eye"! 

Sorry.  Trying to keep down the verbosity, I made it too brief.  My
main point was that the number of spaces is not always consistent, so
the method couldn't rely on, say, each subgroup beginning with a '('
character, with the subgroups separated by ') (' and the end marked by
a ')'.

|> 
|> If the format is consistently like the above with subgroups in (),
|> then you could start with using some of the deeper magic of gsub() to
|> turn the format into something which would be easier to split into
|> individual vectors, e.g.
|> 
|> > gsub("\\(([^)]*)\\)", "/\\1", x)
|> [1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

In any case, that method will work for 

.... 92 5(92 12) (....
and
.... 92 5 (92 12) (....

so the space before the "(" character is not critical.  I was
concerned it would throw a spanner in the works.  When I do a check to
see that the Group is made up of all the Subgroups, I'll be able to
detect if there are any cases of a '(' without a succeeding ')'.  It's
so hard to get good data-entry help these days. :-)
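
A minimal sketch of that spacing check follows. The two input strings are hypothetical stand-ins shaped like the fragments above, not the real data, and scan(text = ...) is used as a shorthand available in current versions of R rather than the textConnection() approach quoted further down.

```r
# Hypothetical inputs: identical apart from the space before the first "("
tight <- "81 92 5(92 12) (81 78)"
loose <- "81 92 5 (92 12) (81 78)"

# Pattern from the thread: each "(...)" becomes "/" plus its contents
pat <- "\\(([^)]*)\\)"
parts_tight <- strsplit(gsub(pat, "/\\1", tight), "/")[[1]]
parts_loose <- strsplit(gsub(pat, "/\\1", loose), "/")[[1]]

# scan() ignores leading and trailing whitespace in each piece
nums_tight <- lapply(parts_tight, function(p) scan(text = p, quiet = TRUE))
nums_loose <- lapply(parts_loose, function(p) scan(text = p, quiet = TRUE))

# Both spellings should yield the same list of numeric vectors
same_result <- identical(nums_tight, nums_loose)
```

The stray spaces survive gsub() and strsplit() but disappear once the pieces are read as numbers, which is why the space before "(" does not matter.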

|> 
|> [What was that? Well, "(" is a special grouping operator in regular
|> expressions; it isn't part of the RE as such, but things inside (..)
|> can be referred to with backreferences like \1, which of course needs
|> to be entered as "\\1". \( is an actual left parenthesis, again
|> written with the doubled backslash. [^)]* is a sequence consisting of
|> any character except right parentheses (which is not a grouping
|> operator when it sits within square brackets). So we're finding the
|> bits of text delimited by ( and ) and replacing them with a / and the
|> content of the parentheses. Got it? Don't worry if you don't, I didn't
|> get it right till the 12th try either! The important thing is knowing
|> that this kind of stuff is possible if you stare at it long enough.]
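
The pieces of that pattern can be tried one at a time. This is only a sketch on a short hypothetical string; regmatches() and gregexpr() are base R functions used here to list what the pattern matches, and were not part of the original suggestion.

```r
x <- "5(92 12) (81 78 5)"   # hypothetical input, not the real data

# "\\(" and "\\)" match literal parentheses; "[^)]*" matches a run of
# characters that are not ")"; the unescaped "(" ... ")" pair captures
# that run so the replacement can refer to it as "\\1".
replaced <- gsub("\\(([^)]*)\\)", "/\\1", x)

# The same pattern, without the capture group, lists the matched spans
matched <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]
```

Here `replaced` holds the string with each parenthesised span turned into a "/"-prefixed run, and `matched` holds the spans themselves, parentheses included.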

In my case, I needed a bit more help.  That solution is brilliant.
Thanks for the explanation of it too.  It covers everything I can
think of except the occasion where a '(' or ')' is missing.  I know
the final ")" is absent in a few places.  It's probably easiest for me
to do a test and add that character if required before using gsub,
then check if the Groups tally with the subgroups to determine if
there is anything missing.  Those should be rare enough to fix in the
data file instead of trying to come up with a generic method of
detecting them and making the requisite modifications.
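
A rough sketch of that plan, under stated assumptions: the record below is hypothetical, the only defect handled is a single missing ")" at the very end, and the variable names are made up for illustration. Anything else would still need fixing in the data file.

```r
# Hypothetical record whose final ")" was left off at data entry
x <- "12 78 5 9(12 78) (5 9"

# Count parentheses by deleting every other character
n_open  <- nchar(gsub("[^(]", "", x))
n_close <- nchar(gsub("[^)]", "", x))

# Append one ")" only when exactly one closing parenthesis is missing
if (n_open == n_close + 1) x <- paste0(x, ")")

parts <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")[[1]]
nums  <- lapply(parts, function(p) scan(text = p, quiet = TRUE))

group     <- nums[[1]]        # the numbers before the first "("
subgroups <- unlist(nums[-1]) # all subgroup numbers pooled together

# TRUE when the pooled subgroups hold exactly the Group's numbers
tallies <- identical(sort(group), sort(subgroups))

# Group members that appear in no subgroup (empty when nothing is missing)
missing_from_subgroups <- setdiff(group, subgroups)
```

Records where `tallies` is FALSE are the ones to look up and correct by hand.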

|> 
|> Now that it is in an easier format we can use strsplit to get
|> individual parts:
|> 
|> > s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x),"/")

I probably would have got that if I'd got that far.

[...]

|> and once we have those we might use scan() on each string to get the
|> numbers. This requires the use of a text connection, like this
|> 
|> > lapply(s[[1]], function(x)scan(textConnection(x)))

I'd never had occasion to use textConnection before and was completely
ignorant of its existence.  Certainly simpler than my idea of
exporting text files and then using a Perl script and then importing
back in.

|> Read 12 items
|> Read 2 items
|> Read 6 items
|> Read 4 items
|> [[1]]
|>  [1] 12 78 23  9 76 43  2 15 41 81 92  5
|> 
|> [[2]]
|> [1] 92 12
|> 
|> [[3]]
|> [1] 81 78  5 76  9 41
|> 
|> [[4]]
|> [1] 23  2 15 43
|> 
|> ...
|> 
|> Your turn!

Can't improve on that!  It's so close to what I require we could call
it a day.  Thanks again.
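
For reference, the quoted steps can be run end to end as below. The input string is an assumption, reconstructed from the gsub() output shown earlier since the original x is not reproduced in this message, and scan_string() is a hypothetical helper that closes each text connection after use.

```r
# Input reconstructed from the gsub() output quoted above (an assumption)
x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

# Hypothetical helper: read the numbers in one string through a text
# connection, closing the connection when the function exits
scan_string <- function(txt) {
  con <- textConnection(txt)
  on.exit(close(con))
  scan(con, quiet = TRUE)
}

s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")
groups <- lapply(s[[1]], scan_string)

# Number of values found in the Group and in each subgroup
lengths_found <- sapply(groups, length)
```

With this input the four element counts should agree with the "Read ... items" lines in the quoted output.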

best

-- 
Patrick Connolly
HortResearch
Mt Albert
Auckland
New Zealand 
Ph: +64-9 815 4200 x 7188
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~
I have the world`s largest collection of seashells. I keep it on all
the beaches of the world ... Perhaps you`ve seen it.  ---Steven Wright 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~
