[R] using an array of strings with strsplit, issue when including a space in split criteria

Tony Breyal tony.breyal at googlemail.com
Tue Sep 8 15:30:28 CEST 2009


SOLVED.

Thanks to a reply off-list it appears that the 'space' in "published
11" is actually some kind of multibyte character. If I physically
delete the 'space' and replace it by using the spacebar on my
keyboard, then strsplit() behaves as expected.
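
For anyone hitting the same symptom, here is a minimal sketch of how such a look-alike space could be spotted from within R. The thread never identifies the actual character, so the no-break space "\u00a0" below is only an assumed stand-in, and the variable names are illustrative.

```r
# ASSUMPTION: "\u00a0" (no-break space) stands in for the unidentified
# character; the thread never says which character it actually was.
txt <- c("sales to 23 August 2008 published 29 August",
         "sales to 6 September 2008 published\u00a011 September")

# Reproduce the symptom: only the first string contains "published ".
pieces <- vapply(strsplit(txt, "published ", fixed = TRUE), length, integer(1))
print(pieces)

# Flag strings holding any code point outside the ASCII range.
has_non_ascii <- vapply(txt, function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
print(has_non_ascii)

# List the offending code points in the second string.
codes <- utf8ToInt(txt[2])
odd_codes <- codes[codes > 127L]
print(odd_codes)
```

Under that assumption the second string is flagged and the reported code point is 160; a different pasted character would show a different value.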

I had got the text from a hyperlink and copied and pasted it into R. It
did not occur to me that the 'spaces' might be something else. However,
I am surprised that it worked in the first instance for both of the
kind posters above. Perhaps I'm just unlucky with the local settings on
my Vista PC :S
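
Instead of retyping the space by hand, pasted text could probably be normalised before splitting. This is only a sketch: it again assumes the culprit is the no-break space "\u00a0", which the thread does not confirm, so substitute whatever character the pasted text really contains.

```r
# ASSUMPTION: the pasted 'space' is "\u00a0" (no-break space); substitute
# whichever character the pasted text really contains.
txt <- c("sales to 23 August 2008 published 29 August",
         "sales to 6 September 2008 published\u00a011 September")

# Swap the look-alike for an ordinary space, then split as before.
clean <- gsub("\u00a0", " ", txt, fixed = TRUE)
parts <- strsplit(clean, "published ", fixed = TRUE)
print(parts)
```

With the replacement in place, both strings split into two pieces.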

Cheers everyone, much appreciated!
Tony



On 8 Sep, 11:57, Tony Breyal <tony.bre... at googlemail.com> wrote:
> UPDATE:
>
> I'm not sure why, but on my Windows XP 64bit machine, I ran the same
> code again and this time it is not working even though it worked
> previously. This has been done using the Rgui --vanilla command.
>
> > x <- c("Weekly sales figures to 30 August 2008 published 5 September", "Weekly sales figures to 6 September 2008 published 11 September")
> > strsplit(x, 'published ', fixed=TRUE)
>
> [[1]]
> [1] "Weekly sales figures to 30 August 2008 "
> [2] "5 September"
>
> [[2]]
> [1] "Weekly sales figures to 6 September 2008 published 11 September"
>
> O/S: Windows XP 64bit Pro; Service Pack 2
>
> > sessionInfo()
>
> R version 2.9.2 (2009-08-24)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;
> LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
>
> On 8 Sep, 09:47, Tony Breyal <tony.bre... at googlemail.com> wrote:
>
>
>
> > After further investigation it appears that the problem is specific to
> > my Vista PC. I am able to get the correct results using R 2.9.2 on a
> > Windows XP 64bit machine. However, I do not know why this does not work
> > on my Vista PC. The following was done after rebooting Vista.
>
> > From CMD.exe I ran the following line:
>
> > C:\Program Files\R\R-2.9.2\bin>Rgui --vanilla
>
> > This opened up R.
>
> > ### R 2.9.2 START ###
>
> > > txt <- c("sales to 23 August 2008 published 29 August",
> > + "sales to 6 September 2008 published 11 September")
>
> > > strsplit(txt, 'published', fixed=TRUE)
>
> > [[1]]
> > [1] "sales to 23 August 2008 " " 29 August"
>
> > [[2]]
> > [1] "sales to 6 September 2008 " " 11 September"
>
> > > strsplit(txt, 'published ', fixed=TRUE)
>
> > [[1]]
> > [1] "sales to 23 August 2008 " "29 August"
>
> > [[2]]
> > [1] "sales to 6 September 2008 published 11 September"
>
> > > sessionInfo()
>
> > R version 2.9.2 (2009-08-24)
> > i386-pc-mingw32
>
> > locale:
> > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> > Kingdom.1252;LC_MONETARY=English_United
> > Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
>
> > ### R 2.9.2 END ###
>
> > The exact same thing happened when I used R 2.9.0 and R 2.8.1 on this
> > same Vista computer.
>
> > ### R 2.9.0 ###
>
> > > sessionInfo()
>
> > R version 2.9.0 (2009-04-17)
> > i386-pc-mingw32
>
> > locale:
> > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> > Kingdom.1252;LC_MONETARY=English_United
> > Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> > attached base packages:
> > [1] stats     graphics  grDevices datasets  utils     methods   base
>
> > other attached packages:
> > [1] rcom_2.1-3     rscproxy_1.3-1
>
> > loaded via a namespace (and not attached):
> > [1] tools_2.9.0
>
> > ### R 2.8.1 ###
>
> > > sessionInfo()
>
> > R version 2.8.1 (2008-12-22)
> > i386-pc-mingw32
>
> > locale:
> > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> > Kingdom.1252;LC_MONETARY=English_United
> > Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
>
> > My computer details are:
> > Windows Vista Ultimate
> > Service Pack 1
> > Manufacturer: Dell
> > Rating: 3.4
> > Processor: Intel Core 2 Duo CPU E6750 @ 2.66 GHz
> > Memory (RAM): 4.00 GB
> > System type: 32-bit Operating System
>
> > 2009/9/8 Gabor Grothendieck <ggrothendi... at gmail.com>:
>
> > > I am using the exact same version of R as you also on Vista
> > > but can't reproduce your result.  For me it splits properly.
>
> > > Try starting R like this (modify path if needed) from the
> > > Windows cmd line:
>
> > > "\Program Files\R\R-2.9.2\bin\Rgui" --vanilla
>
> > > and then try it.
>
> > > On Mon, Sep 7, 2009 at 11:40 AM, Tony Breyal<tony.bre... at googlemail.com> wrote:
> > >> Dear all,
>
> > >> I'm having a problem understanding why a split does not occur in
> > >> the 2nd use of the function strsplit below:
>
> > >> # text strings
> > >>> txt <- c("sales to 23 August 2008 published 29 August",
> > >> + "sales to 6 September 2008 published 11 September")
>
> > >> # first use
> > >>> strsplit(txt, 'published', fixed=TRUE)
> > >> [[1]]
> > >> [1] "sales to 23 August 2008 " " 29 August"
>
> > >> [[2]]
> > >> [1] "sales to 6 September 2008 " " 11 September"
>
> > >> # second use, but with a space ' ' in the split
> > >>> strsplit(txt, 'published ', fixed=TRUE)
> > >> [[1]]
> > >> [1] "sales to 23 August 2008 " "29 August"
>
> > >> [[2]]
> > >> [1] "sales to 6 September 2008 published 11 September"
>
> > >> Thank you kindly for any help in advance.
> > >> Tony
>
> > >> O/S: Win Vista Ultimate
> > >>> sessionInfo()
> > >> R version 2.9.2 (2009-08-24)
> > >> i386-pc-mingw32
>
> > >> locale:
> > >> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;
> > >> LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> > >> attached base packages:
> > >> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> > >> other attached packages:
> > >> [1] RODBC_1.3-0
>
> > >> ______________________________________________
> > >> R-h... at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > >> and provide commented, minimal, self-contained, reproducible code.
>
> > --
> > Tony Breyal



