[R] Help with regex replacements

Chris Evans chr|@ho|d @end|ng |rom p@yctc@org
Tue Jun 27 19:16:23 CEST 2023


I am sure this is easy for people who are good at regexps but I'm 
failing with it.  The situation is that I have hundreds of lines of 
Ukrainian translations of some English. They contain things like this:

1"Я досяг того, чого хотів"2"Мені вдалося зробити бажане"3"Я досяг 
(досягла) того, чого хотів (хотіла)"4"Я досяг(-ла) речей, яких хотілося 
досягти"5"Я досяг/ла того, чого хотів/ла"6"Я досяг\\досягла того, чого 
прагнув\\прагнула."7"Я досягнув(ла) того, чого хотів(ла)"

Using dput():

tmp <- structure(list(Text = c(
    "Я досяг того, чого хотів",
    "Мені вдалося зробити бажане",
    "Я досяг (досягла) того, чого хотів (хотіла)",
    "Я досяг(-ла) речей, яких хотілося досягти",
    "Я досяг/ла того, чого хотів/ла",
    "Я досяг\\досягла того, чого прагнув\\прагнула",
    "Я досягнув(ла) того, чого хотів(ла)")),
  row.names = c(NA, -7L),
  class = c("tbl_df", "tbl", "data.frame"))

Those show four different ways translators have handled gendered words:

1) Ignore them and (I'm guessing) only give the masculine
2) Give the feminine form of the word (or just the feminine suffix) in
   brackets
3) Give the feminine form/suffix prefixed by a forward slash
4) Give the feminine form/suffix prefixed by backslash (here a double
   backslash)

I would like just to drop all these feminine gendered options. (Don't
worry, they'll get back in later.) So I would like to replace

1) anything between brackets with nothing!
2) anything between a forward slash and the next space with nothing
3) anything between a backslash and the next space with nothing

but preserving the rest of the text.

I have been trying to achieve this using str_replace_all() but I am
failing utterly. Here's a silly little example of my failures. This was
just trying to get the text I wanted to replace (as I was trying to
simplify the issues for my tired wetware):

> tmp %>%
+   as_tibble() %>%
+   rename(Text = value) %>%
+   mutate(Text = str_replace_all(Text, fixed("."), "")) %>%
+   filter(row_number() < 4) %>%
+   mutate(Text2 = str_replace(Text, "\\(.*\\)", "\\1"))
Error in `mutate()`:
ℹ In argument: `Text2 = str_replace(Text, "\\(.*\\)", "\\1")`.
Caused by error in `stri_replace_first_regex()`:
! Trying to access the index that is out of bounds. (U_INDEX_OUTOFBOUNDS_ERROR)
Run `rlang::last_trace()` to see where the error occurred.

I have tried gurgling around the internet but am striking out so
throwing myself on the list. Apologies if this is trivial but I'd hate
to have to clean these hundreds of lines by hand though it's starting to
look as if I'd achieve that faster by hand than I will by banging my
ignorance of R regexp syntax on the problem.

TIA,

Chris
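
For reference, here is a minimal sketch of one way the three replacements
described above might be written with stringr. It is not a definitive
answer. The names txt, pattern and cleaned are illustrative only, and the
test strings are copied from the dput() output. The error in the failing
example appears to arise because "\\(" and "\\)" match literal brackets
and so create no capture group, leaving "\\1" with nothing to refer to.
The sketch sidesteps that by deleting the matches rather than using a
backreference.

```r
library(stringr)

# Test strings copied from the dput() output in the post
txt <- c("Я досяг того, чого хотів",
         "Мені вдалося зробити бажане",
         "Я досяг (досягла) того, чого хотів (хотіла)",
         "Я досяг(-ла) речей, яких хотілося досягти",
         "Я досяг/ла того, чого хотів/ла",
         "Я досяг\\досягла того, чого прагнув\\прагнула",
         "Я досягнув(ла) того, чого хотів(ла)")

# Two alternatives joined by "|":
#   \s*\([^)]*\)  optional whitespace, "(", anything up to the next ")", ")"
#   [/\\]\S*      a forward slash or a backslash, then all non-space characters
# Backslashes are doubled again because this is an R string literal.
pattern <- "\\s*\\([^)]*\\)|[/\\\\]\\S*"

# str_remove_all() deletes every match and leaves the rest of the text intact
cleaned <- str_remove_all(txt, pattern)
print(cleaned)
```

Note that "\\S*" follows the stated rule literally (everything up to the
next space), so punctuation attached to a feminine form, such as a
trailing full stop, would be removed along with it. Restricting the
match to letters and hyphens would be one way to avoid that, if it
matters for the real data.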

-- 
Chris Evans (he/him)
Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, 
University of Roehampton, London, UK.
Work web site: https://www.psyctc.org/psyctc/
CORE site: http://www.coresystemtrust.org.uk/
Personal site: https://www.psyctc.org/pelerinage2016/


