[Rd] issue with unz()?

Fri Feb 10 01:53:05 CET 2017

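If you use check.names=FALSE in your call to read.csv you can see that
the first column name starts with the 3 bytes ef bb bf, which is the
UTF-8 "byte-order mark" that Microsoft applications like to put at the
start of a text file stored in UTF-8.  With the default
check.names=TRUE, make.names() presumably keeps the leading "ï" and
turns the other two characters into dots, which would give the "ï.."
prefix you are seeing.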

> v0514 <- read.csv(unz(temp, file0514[1]), stringsAsFactors=FALSE, check.names=FALSE)
> names(v0514)[1]
[1] "ï»¿Accident_Index"
> charToRaw(names(v0514)[1])
 [1] ef bb bf 41 63 63 69 64 65 6e 74 5f 49 6e 64 65 78
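
The same check can be wrapped in a small function.  This is only a
sketch: has_utf8_bom() is a made-up helper name, not something from
base R, and it looks at a single string at a time.

```r
# Hypothetical helper (not part of base R): TRUE if the bytes of the
# string `x` start with the UTF-8 byte-order mark ef bb bf.
has_utf8_bom <- function(x) {
  bytes <- charToRaw(x)
  length(bytes) >= 3 &&
    identical(bytes[1:3], as.raw(c(0xef, 0xbb, 0xbf)))
}

# Build a test string byte by byte so the example does not depend on
# how this message itself is encoded.
with_bom <- rawToChar(as.raw(c(0xef, 0xbb, 0xbf, 0x41, 0x63, 0x63)))
has_utf8_bom(with_bom)
has_utf8_bom("Accident_Index")
```

Because it compares raw bytes, the result should not depend on the
locale R is running in.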

I thought that adding fileEncoding="UTF-8-BOM" or perhaps
encoding="UTF-8-BOM" would take care of the issue, but neither works
for me.  You can remove the three BOM bytes by hand with substring():

> substring(names(v0514)[1],4)
[1] "Accident_Index"
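
substring() counts characters, so the offset of 4 relies on the BOM
showing up as three separate characters, as it does in the session
above.  A byte-level variant avoids that assumption.  The sketch below
uses a made-up helper name, strip_utf8_bom(), and drops any encoding
mark on the strings it changes, which is harmless for plain ASCII
column names.

```r
# Hypothetical helper (not part of base R): remove a leading UTF-8
# byte-order mark (ef bb bf) from each element of a character vector.
strip_utf8_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
      # rawToChar() returns a string without an encoding mark; set
      # Encoding() afterwards if the rest contains non-ASCII text.
      s <- rawToChar(bytes[-(1:3)])
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

# Self-contained demonstration on strings built from raw bytes.
nm <- c(rawToChar(as.raw(c(0xef, 0xbb, 0xbf, 0x41, 0x62))), "xyz")
cleaned <- strip_utf8_bom(nm)
```

With the data from this thread it would be applied as
names(v0514) <- strip_utf8_bom(names(v0514)) after reading with
check.names=FALSE.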

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Feb 9, 2017 at 4:13 PM, jing hua zhao <jinghuazhao at hotmail.com> wrote:
> Dear R-devel,
>
>
> I appear to see differences in behavior of unz between Windows and Linux.
>
>
> url0514 <- "http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19_Data_2005-2014.zip"
> file0514 <- c("Vehicles0514.csv","Casualties0514.csv","Accidents0514.csv")
>
> temp <- tempfile()
> download.file(url0514,temp)
> a0514 <<- read.csv(unz(temp, file0514[3]))
>
> c0514 <<- read.csv(unz(temp, file0514[2]))
>
> v0514 <<- read.csv(unz(temp, file0514[1]))
>
>
> Under Windows, I noticed that there are variables named ï..Accident_Index in the objects [a|c|v]0514, but this is not the case if the zip file contains only one file, i.e.,
>
> file2015 <- c("Vehicles_2015.csv","Casualties_2015.csv","Accidents_2015.csv")
> url2015 <- "http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_2015.zip"
> download.file(url2015,temp)
> v2015 <<- read.csv(unz(temp, file2015[1]))
> c2015 <<- read.csv(unz(temp, file2015[2]))
> a2015 <<- read.csv(unz(temp, file2015[3]))
>
>
> so to combine [a|c|v]0514 and [a|c|v]2015, I need to add something like
>
>
> names(a0514)[names(a0514)=="ï..Accident_Index"] <- "Accident_Index"
> names(c0514)[names(c0514)=="ï..Accident_Index"] <- "Accident_Index"
> names(v0514)[names(v0514)=="ï..Accident_Index"] <- "Accident_Index"
>
>
> This is unnecessary under Linux (RHEL), since there the Accident_Index columns have no ï.. prefix.
>
>
> Am I missing anything here?
>
>
> Many thanks,
>
> Jing Hua Zhao
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel