[R] Best way to test for numeric digits?

Fri Oct 20 19:27:18 CEST 2023

Leonard,

Since speed/efficiency now seems to be your main consideration, maybe a step back might help.

Are there simplifying assumptions that are valid or can you make it simpler, such as converting everything to the same case?

Your sample data was this and I assume your actual data is similar and far longer.

c("Li", "Na", "K",  "2", "Rb", "Ca", "3")

So rather than use complex and costly regular expressions, or other full searches, can you just assume all entries start with either an uppercase letter or a numeral and test for those using something simple like:
> substr(c("Li", "Na", "K",  "2", "Rb", "Ca", "3"), 1, 1)
[1] "L" "N" "K" "2" "R" "C" "3"

If you save that in a variable you can check whether it is greater than or equal to "A" (or perhaps "0") and less than or equal to "Z" (or perhaps "9"), and see if such a test is faster.

orig <- c("Li", "Na", "K",  "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)
elements_bool <- initial >= "A" & initial <= "Z"

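The last line produces a Boolean vector you can use to index your original and toss away the entries that start with digits. Note that string comparison in R follows the collation order of the current locale, so only in the C locale is this range test certain to also exclude lower case letters and other Unicode symbols.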

orig_elements <- orig[elements_bool]

> orig
[1] "Li" "Na" "K"  "2"  "Rb" "Ca" "3" 
> orig_elements
[1] "Li" "Na" "K"  "Rb" "Ca"
> orig[!elements_bool]
[1] "2" "3"
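If the locale dependence of ">=" and "<=" on strings is a concern, a membership test on the first character is one possible alternative. This is only a sketch: the names is_element, is_digit and big are introduced here for illustration, and the timing lines are there to try on your own data, not to claim a particular result.

```r
# Sketch only: is_element, is_digit and big are names made up for this example.
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)

# %in% compares for exact equality, so it does not depend on locale collation.
is_element <- initial %in% LETTERS
is_digit <- initial %in% as.character(0:9)

orig[is_element]  # "Li" "Na" "K" "Rb" "Ca"
orig[is_digit]    # "2" "3"

# Rough timing on a longer vector; numbers will differ between machines.
big <- rep(orig, 1e5)
system.time(substr(big, 1, 1) %in% LETTERS)
system.time(grepl("^[A-Z]", big))
```

Whether the substr() route actually beats the regular expression is something to measure on your real data.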

Another approach you might consider, depending on your needs, is to encapsulate your data as a column in a data.frame, tibble or other such construct, and generate additional columns along the way that keep your information consolidated. That could be efficient, especially if you shift some of your logic to faster compiled functionality, perhaps using packages that fit your needs better, such as data.table or dplyr and other things in the tidyverse. And note that if you use pipelines, for many purposes the new built-in pipe may be faster.
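As a rough illustration of the data.frame idea, here is a minimal base R sketch. The column names symbol, initial and is_element are hypothetical, and the native pipe needs R 4.1.0 or later.

```r
# Sketch only: the column names below are invented for this example.
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")

# Keep the original strings and the derived information side by side.
df <- data.frame(symbol = orig, stringsAsFactors = FALSE)
df$initial <- substr(df$symbol, 1, 1)
df$is_element <- df$initial %in% LETTERS

# Native pipe (R >= 4.1.0): keep the element rows, then pull out the symbols.
elements <- df |> subset(is_element) |> getElement("symbol")
elements
```

The same steps could be written with data.table or dplyr; base R is used here only to avoid assuming any package is installed.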

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Leonard Mada via R-help
Sent: Wednesday, October 18, 2023 10:59 AM
To: R-help Mailing List <r-help using r-project.org>
Subject: [R] Best way to test for numeric digits?

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there 
any better ways?

I was working to extract chemical elements from a formula, something 
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
     # Perl is partly broken in R 4.3, but this works:
     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
     # stringi::stri_split(x, regex = regex);
     s = strsplit(x, regex, perl = TRUE);
     if(rm.digits) {
         s = lapply(s, function(s) {
             isNotD = is.na(suppressWarnings(as.numeric(s)));
             s = s[isNotD];
         });
     }
     return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Sincerely,

Leonard

Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = TRUE)

# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = TRUE)
