[R] iconv question: SQL Server 2005 to R

Thu Oct 10 08:33:32 CEST 2013

On 09/10/2013 10:37, Milan Bouchet-Valat wrote:
> On Tuesday 08 October 2013 at 16:02 -0700, Ira Sharenow wrote:
>> A colleague is sending me quite a few files that have been saved with MS
>> SQL Server 2005. I am using R 2.15.1 on Windows 7.
>>
>> I am trying to read in the files using standard techniques. Although the
>> file has a csv extension, when I open it in Excel or WordPad and choose
>> Save As, the format shown is Unicode Text. Notepad indicates that the
>> encoding is Unicode. Right now I have to do a few things from within Excel
>> (such as Text to Columns) and eventually save as a true csv file before I
>> can read it into R and then use it.
>>
>> Is there an easy way to solve this from within R? I am also open to easy
>> SQL Server 2005 solutions.
>>
>> I tried the following from within R.
>>
>> testDF = read.table("Info06.csv", header = TRUE, sep = ",")
>>
>>> testDF2 =  iconv(x = testDF, from = "Unicode", to = "")
>>
>> Error in iconv(x = testDF, from = "Unicode", to = "") :
>>
>> unsupported conversion from 'Unicode' to '' in codepage 1252
>>
>> # The next line did not produce an error message
>>
>>> testDF3 =  iconv(x = testDF, from = "UTF-8" , to = "")
>>
>>> testDF3[1:6,  1:3]
>>
>> Error in testDF3[1:6, 1:3] : incorrect number of dimensions
>>
>> # The next line did not produce an error message
>>
>>> testDF4 =  iconv(x = testDF, from = "macroman" , to = "")
>>
>>> testDF4[1:6,  1:3]
>>
>> Error in testDF4[1:6, 1:3] : incorrect number of dimensions
>>
>>>   Encoding(testDF3)
>>
>> [1] "unknown"
>>
>>>   Encoding(testDF4)
>>
>> [1] "unknown"
>>
>> This is the first few lines from WordPad
>>
>> Date,StockID,Price,MktCap,ADV,SectorID,Days,A1,std1,std2
>>
>> 2006-01-03 00:00:00.000, at Stock1,2.53,467108197.38,567381.144444444,4,133.14486997089,-0.0162107939626307,0.0346283580367959,0.0126471695454834
>>
>> 2006-01-03 00:00:00.000, at Stock2,1.3275,829803070.531114,6134778.93292,5,124.632223896458,0.071513138376339,0.0410694546850102,0.0172091268025929
> What's the actual problem? You did not state one. Do you get accented
> characters that are not printed correctly after importing the file? In
> the two lines above it does not look like there are any non-ASCII
> characters in this file, so encoding would not matter.

The file is most likely UCS-2, which is what Windows tools label
"Unicode".  That has embedded NULs, so the encoding does matter.  Almost
all 8-bit encodings in common use extend ASCII: others do not, in general.
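
A minimal sketch of one way to handle this from within R, assuming the
files really are little-endian UCS-2/UTF-16 with a byte-order mark. The
temporary file and its three columns are hypothetical stand-ins for
Info06.csv, not the poster's data. The idea is to declare the encoding when
the file is read, via the fileEncoding argument of read.csv(), rather than
calling iconv() afterwards. iconv() works on character vectors, not data
frames, which is presumably why the subsetting attempts quoted above failed.

```r
# Hypothetical demo file standing in for Info06.csv: written as UTF-16LE
# with a byte-order mark, which is what Windows tools call "Unicode".
demo_path <- tempfile(fileext = ".csv")
csv_text <- "Date,StockID,Price\n2006-01-03,Stock1,2.53\n2006-01-03,Stock2,1.3275\n"
utf16 <- iconv(csv_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), utf16), demo_path)

# Inspect the first bytes: after the two-byte BOM, each ASCII character is
# followed by a NUL byte, so the file is not valid in any 8-bit encoding.
first_bytes <- readBin(demo_path, what = "raw", n = 10)
print(first_bytes)

# Declare the encoding at read time so R re-encodes while reading.
testDF <- read.csv(demo_path, header = TRUE, fileEncoding = "UTF-16LE")
str(testDF)
```

If this assumption about the encoding is wrong for the real files, the
first few bytes from readBin() should show what they actually contain.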

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595