[R] Please assist -- Unable to remove '-' character from char vector--

David Winsemius dwinsemius at comcast.net
Mon Apr 25 21:21:01 CEST 2016


> On Apr 25, 2016, at 2:32 AM, Sunny Singha <sunnysingha.analytics at gmail.com> wrote:
> 
> Thank you Jim,
> The code did help me get what I needed.
> Also, I learnt that there are different types of dashes
> (en-dash/em-dash/hyphen), as explained on this site:
> http://www.punctuationmatters.com/hyphen-dash-n-dash-and-m-dash/
> 
> I achieved it by executing the command below after going through this page
> on Stack Overflow:
> http://stackoverflow.com/questions/9223795/how-to-correctly-deal-with-escaped-unicode-characters-in-r-e-g-the-em-dash
> 
> splitends<-sapply(end,strsplit,"-|\u2013|,")
> 
> where '\u2013' is, I guess, the Unicode escape for the en-dash character
> in the range values.
> I had scraped the HTML table from this web page:
> https://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger
> and the range values do have en-dash characters.
> 
> For now the issue is resolved but how does one capture values similar
> to  '\u2013' for other possible special cases to be specified in the
> regex ?
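
One way to see which code points a vector actually contains, so that they can be written into a regex, is utf8ToInt(). The sketch below assumes a UTF-8 session; the helper name non_ascii_codepoints and the shortened sample vector are made up for illustration.

```r
# Hypothetical helper: report the non-ASCII code points found in a
# character vector as \uXXXX labels that can be pasted into a pattern.
non_ascii_codepoints <- function(x) {
  # utf8ToInt() takes one string at a time, so loop over the elements
  cp <- unlist(lapply(enc2utf8(x), utf8ToInt))
  # drop NA and plain ASCII (code points 0..127), keep each value once
  cp <- unique(cp[!is.na(cp) & cp > 127L])
  sprintf("\\u%04X", cp)
}

# Shortened copy of 'end'; the en-dashes are written as \u2013 escapes
end <- c("2001-", "1993\u20132007, 2010-", "1984\u20131992, 1996-")
found <- non_ascii_codepoints(end)
print(found)
```

Anything the helper reports is a character that a plain '-' in the pattern will not match.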

It's possible to target a range of Unicode characters using a regex character class, which has its own range operator, '-'. (R's sequence operator ':' fails in my efforts; inside a character class it is just a literal colon.)

x <- "\"em\u2013dash\" \"em–dash\" \" em \u2016 dash\""
gsub('[\u2013-\u2016]', "", x)   # removes everything from U+2013 to U+2016
#[1] "\"emdash\" \"emdash\" \" em  dash\""
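
Applied to the year values from the original post, the same idea might look like the sketch below. It assumes a UTF-8 session; the vector is a shortened copy of 'end' with the en-dashes written as \u2013 escapes, and the names parts and last_year are made up here. The class uses the regex range operator '-' to cover U+2010 through U+2015 (hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar).

```r
end <- c("2001-", "1993\u20132007, 2010-", "1984\u20131992, 1996-", "2015-")

# Split on runs of ASCII hyphen, comma, whitespace, or any character in
# U+2010..U+2015. The leading '-' in the class is a literal hyphen.
parts <- strsplit(end, "[-,[:space:]\u2010-\u2015]+")

# Keep the last piece of each element, i.e. the most recent year
last_year <- vapply(parts, function(p) p[length(p)], character(1))
print(last_year)
```

Listing the code points explicitly keeps the pattern readable; a wider range can be substituted if other dash-like characters turn up in the scraped table.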

-- 
David.
> 
> Regards,
> Sunny Singha.
> 
> 
> On Mon, Apr 25, 2016 at 12:39 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
>> Hi Sunny,
>> Try this:
>> 
>> # notice that I have replaced the fancy hyphens with real hyphens
>> end<-c("2001-","1992-","2013-","2013-","2013-","2013-",
>> "1993-2007","2010-","2012-","1984-1992","1996-","2015-")
>> splitends<-sapply(end,strsplit,"-")
>> last_bit <- function(x) return(x[length(x)])
>> sapply(splitends,last_bit)
>> 
>> Jim
>> 
>> On Mon, Apr 25, 2016 at 4:35 PM, Sunny Singha
>> <sunnysingha.analytics at gmail.com> wrote:
>>> Hi,
>>> I have a char vector with year values. Some cells have a single year
>>> value like '2001-' and some have a range like 1996-2007.
>>> I need to remove the hyphen character '-' from all the values within the
>>> character vector named 'end'. After removing the hyphen I need to
>>> get the last number from the cells that hold year ranges, i.e. if the
>>> cell has the range 1996-2007, the code should return 2007.
>>> 
>>> How could I get this done?
>>> 
>>> Below are the values within this char vector:
>>> 
>>>> end
>>> [1] "2001-"            "1992-"            "2013-"            "2013-"
>>>          "2013-"            "2013-"
>>> [7] "2003-"            "2010-"            "2009-"            "1986-"
>>>          "2012-"            "2003-"
>>> [13] "2005-"            "2013-"            "2003-"            "2013-"
>>>          "1993–2007, 2010-" "2012-"
>>> [19] "1984–1992, 1996-" "2015-"            "2009-"            "2000-"
>>>          "2005-"            "1997-"
>>> [25] "2012-"            "1997-"            "2002-"            "2006-"
>>>          "1992-"            "2007-"
>>> [31] "1997-"            "1982-"            "2015-"            "2015-"
>>>          "2010-"            "1996–2007, 2011-"
>>> [37] "2004-"            "1999-"            "2007-"            "1996-"
>>>          "2013-"            "2012-"
>>> [43] "2012-"            "2010-"            "2011-"            "1994-"
>>>          "2014-"
>>> 
>>> I tried the command below: gsub('[-|,]', '', end)
>>> This removed the hyphen character from the single values but not from
>>> the cells holding year ranges. The result of the command is shown
>>> below. Please guide.
>>> 
>>>> gsub('[-|,]', '', end)
>>> [1] "2001"           "1992"           "2013"           "2013"
>>>  "2013"           "2013"           "2003"
>>> [8] "2010"           "2009"           "1986"           "2012"
>>>  "2003"           "2005"           "2013"
>>> [15] "2003"           "2013"           "1993–2007 2010" "2012"
>>>  "1984–1992 1996" "2015"           "2009"
>>> [22] "2000"           "2005"           "1997"           "2012"
>>>  "1997"           "2002"           "2006"
>>> [29] "1992"           "2007"           "1997"           "1982"
>>>  "2015"           "2015"           "2010"
>>> [36] "1996–2007 2011" "2004"           "1999"           "2007"
>>>  "1996"           "2013"           "2012"
>>> [43] "2012"           "2010"           "2011"           "1994"
>>>  "2014"
>>> 
>>> Regards,
>>> Sunny Singha
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 

David Winsemius
Alameda, CA, USA


