[R] Best way to test for numeric digits?

Wed Oct 18 20:54:32 CEST 2023

At 19:35 on 18/10/2023, Leonard Mada wrote:
> Dear Rui,
> 
> On 10/18/2023 8:45 PM, Rui Barradas wrote:
>> split_chem_elements <- function(x, rm.digits = TRUE) {
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   if(rm.digits) {
>>     stringr::str_replace_all(mol, regex, "#") |>
>>       strsplit("#|[[:digit:]]") |>
>>       lapply(\(x) x[nchar(x) > 0L])
>>   } else {
>>     strsplit(x, regex, perl = TRUE)
>>   }
>> }
>>
>> split.symbol.character = function(x, rm.digits = TRUE) {
>>   # Perl is partly broken in R 4.3, but this works:
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   s <- strsplit(x, regex, perl = TRUE)
>>   if(rm.digits) {
>>     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
>>   }
>>   s
>> }
> 
> You have a glitch (mol is hardcoded) in the code of the first function. 
> The times are similar, after correcting for that glitch.
> 
> Note:
> - grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", ...)!
> - corrected results below;
> 
> Sincerely,
> 
> Leonard
> #######
> 
> split_chem_elements <- function(x, rm.digits = TRUE) {
>    regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>    if(rm.digits) {
>      stringr::str_replace_all(x, regex, "#") |>
>        strsplit("#|[[:digit:]]") |>
>        lapply(\(x) x[nchar(x) > 0L])
>    } else {
>      strsplit(x, regex, perl = TRUE)
>    }
> }
> 
> split.symbol.character = function(x, rm.digits = TRUE) {
>    # Perl is partly broken in R 4.3, but this works:
>    regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>    s <- strsplit(x, regex, perl = TRUE)
>    if(rm.digits) {
>      s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
>    }
>    s
> }
> 
> mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
> mol10000 <- rep(mol, 10000)
> 
> system.time(
>    split_chem_elements(mol10000)
> )
> #   user  system elapsed
> #   0.58    0.00    0.58
> 
> system.time(
>    split.symbol.character(mol10000)
> )
> #   user  system elapsed
> #   0.67    0.00    0.67
> 
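The note above about "[[:digit:]]" versus "[0-9]" can be checked with a rough base-R sketch like the one below. The test vector `tokens` and the names `t_posix` and `t_range` are made up for illustration, and the sketch times one call on a long vector rather than many calls inside lapply(), so the ratio may differ from the one reported above and will vary by machine, locale, and R version.

```r
# Hypothetical test vector: element symbols mixed with counts.
tokens <- rep(c("C", "Cl", "3", "F", "Li", "4", "Al", "16"), 50000)

# Time the POSIX character class against the explicit ASCII range.
t_posix <- system.time(i1 <- grep("[[:digit:]]", tokens, invert = TRUE))
t_range <- system.time(i2 <- grep("[0-9]", tokens, invert = TRUE))

# For plain ASCII input both patterns must select the same elements.
stopifnot(identical(i1, i2))

# Elapsed seconds for each pattern (values depend on the machine).
print(c(posix_class = t_posix[["elapsed"]], ascii_range = t_range[["elapsed"]]))
```

Note that the two patterns are only guaranteed to agree on ASCII input; in some locales "[[:digit:]]" may also match non-ASCII digits.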
Hello,

You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all with
stringi::stri_replace_all_regex, and the improvement is significant.

split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     # Argument order is (str, pattern, replacement): put "#" at each split point.
     stringi::stri_replace_all_regex(x, regex, "#") |>
       strsplit("#|[0-9]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}

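# Timings as originally posted, taken with "#" passed as the pattern and
# regex as the replacement in the stringi call; re-measure before relying on them.
# system.time(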
#   split_chem_elements(mol10000)
# )
#  user  system elapsed
#  0.06    0.00    0.09
# system.time(
#   split.symbol.character(mol10000)
# )
#  user  system elapsed
#  0.25    0.00    0.28
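
A timing comparison is only meaningful if both approaches return the same result, so a quick cross-check is worth running first. The sketch below uses base R only: gsub(..., perl = TRUE) stands in for the stringr/stringi replacement call (an assumption, chosen to avoid a package dependency), and the helper names `split_marker` and `split_direct` are hypothetical.

```r
regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"

# Marker approach: insert "#" at every split point, then split on "#" or a digit.
# Base gsub() is a stand-in for the stringr/stringi call (assumption).
split_marker <- function(x) {
  gsub(regex, "#", x, perl = TRUE) |>
    strsplit("#|[0-9]") |>
    lapply(\(s) s[nchar(s) > 0L])
}

# Direct approach: split on the lookaround regex, then drop tokens with digits.
split_direct <- function(x) {
  strsplit(x, regex, perl = TRUE) |>
    lapply(\(s) s[grep("[0-9]", s, invert = TRUE)])
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
res_marker <- split_marker(mol)
res_direct <- split_direct(mol)

# Both approaches should agree before their timings are compared.
stopifnot(identical(res_marker, res_direct))
print(res_marker[[1]])
# "C"  "Cl" "F"
```

If the stopifnot() call fails after swapping in another replacement function, the arguments to that function are the first thing to check.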

Hope this helps,

Rui Barradas
