[R] reading and frequency analysis of Spanish text

Sam Thomas sam.thomas at revelanttech.com
Wed Aug 5 21:03:51 CEST 2009


I used the readDOC function in tm.  

After storing the document locally on a Windows PC:

library(tm)  # readDOC calls the external antiword program, which must be installed

langren.sp.path <- "C:\\text\\"  # store the file by itself in this directory

langren.corpus <- Corpus(DirSource(langren.sp.path),
                         readerControl = list(reader = readDOC(AntiwordOptions = "-t"),
                                              language = "spa", load = TRUE))

(langren.sp.file <- langren.corpus[[1]])[1:10]


I think the default encoding for antiword is latin1, but antiword's -m option can handle other mappings.

Sam Thomas

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Friendly
Sent: Wednesday, August 05, 2009 2:19 PM
To: R-Help
Subject: [R] reading and frequency analysis of Spanish text

For an historical paper I'm working on, I have some Spanish plaintext,
presently in the form of a Word .doc file,
http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc

and also some ciphered text from the same original source.  The ultimate
goal is to use some frequency analysis of letters and word lengths in
the plaintext to help decode the ciphered text.
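
Once the plaintext is available as a character string, the letter and
word-length counts described above could be sketched roughly as follows.
The sample sentence is a made-up stand-in, not text from the document,
and the object names are only illustrative. Note that how [[:alpha:]]
treats accented letters may depend on the locale.

```r
# Hypothetical sample text; substitute the real plaintext once it is read in.
txt <- "en un lugar de la mancha de cuyo nombre no quiero acordarme"

# Letter frequencies: split into single characters, keep letters only.
chars <- strsplit(tolower(txt), "")[[1]]
letters.only <- chars[grepl("[[:alpha:]]", chars)]
letter.freq <- sort(table(letters.only), decreasing = TRUE)

# Word-length distribution: split on runs of non-letters, drop empty pieces.
words <- strsplit(tolower(txt), "[^[:alpha:]]+")[[1]]
words <- words[nchar(words) > 0]
word.len <- table(nchar(words))

head(letter.freq)  # most common letters first
word.len           # number of words of each length
```

The same two tables computed on the ciphered text would give something to
compare against, assuming the cipher preserves word boundaries.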

For now, I'm stuck on how to read the Spanish plaintext into R as a text
string, given that it is in a Word .doc file using some form of latin1
encoding.  From Word, I can Save As... plain text (.txt), but I'm worried
about losing character encoding information and I don't see anything in
the list of Other encodings presented that seems helpful.
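
If the Save As route is taken, one way to keep the encoding information
is to tell R what the file's encoding is when reading it. The sketch
below assumes the saved file is latin1; the temporary file and its
contents are made up so that the example runs on its own.

```r
# Write a small latin1-encoded file to stand in for the Word "Save As" output.
# (The file and its contents are hypothetical.)
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0xf1, 0x6f, 0x0a)), tmp)  # "año" plus newline, in latin1

# Mark the strings as latin1 on the way in, then convert them to UTF-8.
lines <- readLines(tmp, encoding = "latin1")
lines.utf8 <- enc2utf8(lines)

Encoding(lines.utf8)               # encoding mark after conversion
nchar(lines.utf8, type = "chars")  # counts characters, not bytes
```

If the accented letters come out wrong, the assumed encoding is the first
thing to check.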

A naive attempt to read the .doc file directly gives:

 > langren.sp.file <- "http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
 >
 > langren.txt <- scan(langren.sp.file, encoding="latin1")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, 
na.strings,  :
  scan() expected 'a real', got 'ÐÏࡱá'
 >
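
The error has two parts: scan() reads numbers unless told otherwise via
its what argument, and a .doc file is a binary container rather than
text. The first characters in the message, 'ÐÏ', are the bytes D0 CF,
which begin the signature of an OLE2 compound file. A rough sketch of
checking for that signature before trying to read a file as text is
below; looks.like.doc is a hypothetical helper, and the two temporary
files are stand-ins so the example runs without the original document.

```r
# The 8-byte signature that begins OLE2 compound files such as legacy .doc.
ole2.magic <- as.raw(c(0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1))

# Hypothetical helper: TRUE if the file starts with that signature.
looks.like.doc <- function(path) {
  header <- readBin(path, what = "raw", n = 8)
  identical(header, ole2.magic)
}

# Stand-in files so the sketch runs without the original document.
fake.doc <- tempfile(fileext = ".doc")
writeBin(c(ole2.magic, as.raw(0x00)), fake.doc)
plain.txt <- tempfile(fileext = ".txt")
writeLines("texto llano", plain.txt)

looks.like.doc(fake.doc)   # binary Word container: needs a converter
looks.like.doc(plain.txt)  # plain text: safe to read as text
```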

Can someone help?

-- 
Michael Friendly     Email: friendly AT yorku DOT ca 
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
