[R] Help with text separation

Sarah Goslee sarah.goslee at gmail.com
Mon Nov 14 16:40:43 CET 2011


Hi,

On Mon, Nov 14, 2011 at 8:54 AM, Michael Griffiths
<griffiths at upstreamsystems.com> wrote:
> Thank you Sarah,
>
> Your reply was very helpful. I have the added difficulty that I am not only
> dealing with single A-Z characters, but quite often have the following
> situation:
>
> form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
>
> and again, I need to remove the '+CTA*help' part of the character string.
> However, in another instance I may have
>
> form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
>
>
> In this case I would need to remove 'Sentence*LEGAL+' from form.
>
> Can this be accomplished in the same manner?

Regular expressions are *very* powerful, so yes. You should read a good
intro to regular expressions, paying careful attention to the word markers,
then take a look at the specifics of R's implementation (see ?regex).
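
As a rough illustration of what the word markers do, here is a minimal
sketch. The names x and words are made up for this example; the only
assumption is R's default regular expression engine, where \\< and \\>
match the empty string at the start and end of a word.

```r
# Illustrative only: 'x' and 'words' are hypothetical names, not from the thread.
x <- "~Sentence*LEGAL+Intro/Intro1"

# \\< and \\> anchor the match to the start and end of a word;
# \\w matches a letter, a digit or an underscore.
words <- regmatches(x, gregexpr("\\<\\w+\\>", x))[[1]]
print(words)
```

Each variable name comes back as its own element, and the operators
(~, *, +, /) are left out because they are not word characters.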

Why do I send you to the help? Because the possible answers all look a
lot like this:
> form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
> gsub("\\+\\<\\w*\\>\\*\\<\\w*\\>", "", form)
[1] "~Sentence*LEGAL+Intro+Intro/Intro1+benefit+benefit/benefit1+product+action+mean"
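
Note that the pattern above only removes terms that follow a '+', so the
leading 'Sentence*LEGAL' is still there. One possible way to catch that
case too is a second pass anchored at the start of the string. This is
only a sketch: drop_star_terms is a hypothetical helper name, and it
assumes the formula always starts with '~' and contains no spaces.

```r
# Hypothetical helper, not part of base R or of the original thread.
drop_star_terms <- function(form) {
  # Pass 1: a 'name*name' term that follows a '+'; drop the '+' with it.
  out <- gsub("\\+\\w+\\*\\w+", "", form)
  # Pass 2: a 'name*name' term directly after the leading '~';
  # drop it together with its trailing '+', if there is one.
  sub("^~\\w+\\*\\w+\\+?", "~", out)
}

form <- "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"
cleaned <- drop_star_terms(form)
print(cleaned)
```

Two simple passes are easier to read and to check than one pattern that
tries to cover both positions at once.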

Sarah

>
> Many thanks, once again, for your help
>
> Mike Griffiths
>
>
>
> On Mon, Nov 14, 2011 at 12:09 PM, Sarah Goslee <sarah.goslee at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Mon, Nov 14, 2011 at 4:20 AM, Michael Griffiths
>> <griffiths at upstreamsystems.com> wrote:
>> > Good morning R list,
>> >
>> > My apologies if this has *already* been answered elsewhere, but I have
>> > not found the answer that I am looking for.
>> >
>> > I have a character string, i.e.
>> >
>> >
>> > form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')
>> >
>> > Now, my aim is to find the position of all those instances of '*' and to
>> > remove said '*'. However, I would also like to remove the preceding
>> > variable name before the '*', the math operator preceding this, and also
>> > the variable name after the '*'. So, here I would like to remove '+L*M'
>>
>> You just want to get rid of them? gsub() it is.
>>
>> I've changed your formula a little bit to better demonstrate what's going
>> on:
>> > form<-c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M')
>> > gsub(" \\+ [A-Z] \\* [A-Z]", "", form)
>> [1] "~ A + C / D + E + E / F * G + H + I + J + K"
>>
>> That regular expression will take out a
>> space
>> +
>> space
>> any capital letter
>> space
>> *
>> space
>> any capital letter.
>>
>> It will take out all occurrences of that sequence, but won't take out
>> occurrences of * not in that sequence.
>>
>> If you don't want the spaces, you don't need them. Just take them out
>> of the regular expression as well.
>>
>> Not that strsplit() was remotely the right tool here, but you can
>> split into characters without a separator:
>> > form <- 'abcd'
>> > strsplit(form, '')
>> [[1]]
>> [1] "a" "b" "c" "d"
>>
>> Sarah
>>
>> > So, far I have come up with the following code:
>> >
>> > parts<-strsplit(form,' ')
>> > index<-which(unlist(parts)=="*")
>> > for (i in 1:length(index)){
>> >    parts[[1]][index[i]]<-list(NULL)
>> >    parts[[1]][index[i]+1]<-list(NULL)
>> >    parts[[1]][index[i]-1]<-list(NULL)
>> >    parts[[1]][index[i]-2]<-list(NULL)
>> > }
>> > new.form<-unlist(parts)
>> >
>> > form<-new.form[0]
>> > for (i in 1: length(new.form)){
>> >    form<-paste(form,new.form[i], sep="")
>> > }
>> >
>> > However, as you can see, I have had to use strsplit in what I consider
>> > a rather clumsy manner, as the character string (form) has to be in a
>> > certain format. All variables and maths operators require a space
>> > between them in order for strsplit to work in the manner I require.
>> >
>> > I would very much like to accomplish what the above code already does,
>> > but without requiring the initial character string to contain the
>> > aforementioned spaces.
>> >
>> > If the list can offer help, I would be most appreciative.
>> >
>> > Yours
>> >
>> > Mike Griffiths
>> >
>> >
>> >
>> --
>> Sarah Goslee
>> http://www.functionaldiversity.org
>
>
>


