[R] Find the dataset(s) that contain(s) non-ASCII characters

Christophe Dutang dutangc at gmail.com
Mon Apr 4 21:19:20 CEST 2016


Dear list,

I maintain a package containing only datasets (152 of them): http://dutangc.free.fr/pub/RRepos/web/CASdatasets-index.html

When running R CMD check on the package, I get the following NOTE:
* checking data for non-ASCII characters ... NOTE
 Note: found 4 marked UTF-8 strings

I wonder how to find which dataset(s) (all stored as rda files) contain non-ASCII characters.

The iconv function lets us find or replace non-ASCII characters:
iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")
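
As a minimal illustration of this call, the sketch below applies it to a small made-up vector (the values are assumptions chosen for the example, not taken from the package). Every byte that cannot be converted to ASCII is replaced by the marker string, so grep() on the marker gives the positions of the offending elements.

```r
# Illustrative vector; the values are made up for this example.
x <- c("Le Mans", "Orl\u00e9ans", "Paris")

# Bytes that cannot be represented in ASCII are replaced by the marker string.
y <- iconv(x, "UTF-8", "ASCII", sub = "I_WAS_NOT_ASCII")

# Positions of the elements that contained non-ASCII characters.
bad <- grep("I_WAS_NOT_ASCII", y)
print(bad)
```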

I use the following function to detect non-ASCII characters.

testASCII <- function(idata)
{
 # indices of the factor columns, then of the character columns
 col <- (1:NCOL(idata))[sapply(idata, is.factor)]
 col <- c(col, (1:NCOL(idata))[sapply(idata, is.character)])
 for(i in col)
 {
   x <- idata[, i]
   cat(colnames(idata)[i], "\n")
   # try both source encodings; unconvertible bytes become the marker string
   res <- grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
   res <- c(res, grep("I_WAS_NOT_ASCII", iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")))
   # print each offending row index once
   if(length(res) > 0)
     cat(unique(res), "\n")
 }
}

Unfortunately, I have not yet found which rda file contains non-ASCII characters among the 56 most recent datasets. Is there a faster way to detect non-ASCII characters than manually loading each dataset and calling testASCII(), for example by working directly on the rda files?
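
One possible way to automate the load-and-test step is sketched below. It is only a sketch under stated assumptions: countMarked() and scanRda() are hypothetical helpers written for this message, not functions from any package, and the "data" directory name is an assumption. Each rda file is loaded into its own environment, and strings whose Encoding() is not "unknown" are counted, on the assumption that these are the "marked" strings the NOTE refers to. Only values and factor levels are inspected; names and other attributes are not.

```r
# Hypothetical helper: counts the strings marked as UTF-8, latin1 or bytes
# in an object, descending into lists and data frames.
countMarked <- function(x)
{
  if(is.factor(x)) x <- levels(x)
  if(is.character(x)) return(sum(Encoding(x) != "unknown"))
  if(is.list(x)) return(sum(vapply(x, countMarked, numeric(1)), 0))
  0
}

# Hypothetical helper: loads every .rda file of 'dir' into a fresh environment
# and returns the number of marked strings per file, keeping only files
# where that number is positive.
scanRda <- function(dir = "data")
{
  files <- list.files(dir, pattern = "\\.rda$", full.names = TRUE)
  counts <- setNames(numeric(length(files)), basename(files))
  for(i in seq_along(files))
  {
    env <- new.env()
    objs <- load(files[i], envir = env)   # names of the objects in the file
    counts[i] <- sum(vapply(objs,
      function(nm) countMarked(get(nm, envir = env)), numeric(1)))
  }
  counts[counts > 0]
}
```

Calling scanRda() on the data directory of the package sources should then return a named vector with one entry per suspicious file, which narrows down where to run testASCII().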

Any comment is welcome.

Regards, Christophe


> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
---------------------------------------
Christophe Dutang
LMM, UdM, Le Mans, France
web: http://dutangc.free.fr
