[R] text analysis errors

Rasmus Liland jrsl at posteo.no
Thu Jan 7 14:06:45 CET 2021


On 2021-01-07 11:34 +1100, Jim Lemon wrote:
> On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud
> <gob.allingrud using gmail.com> wrote:
> >
> > Hello all,
> >
> > I have asked this question on many forums without response. Although
> > I have made some progress myself, I am stuck on how to resolve a
> > particular error message.
> >
> > I have a question about text-analysis packages and code. The general
> > idea is that I am trying to run readability analyses on a collection
> > of about 4,000 Word files. I would like to do any of a number of such
> > analyses, but the immediate problem is getting R to recognize the
> > files as data ready for analysis; I keep getting error messages. Let
> > me show what I have done so far. I have three separate commands
> > because the original folder of 4,000 files was evidently too
> > voluminous to be read in its entirety, so I split the files into
> > three roughly equal folders, called ‘WPSCASES’ one through three.
> > Here is my code, with the error message for each command recorded
> > below:
> >
> > token <-
> > tokenize("/Users/Gordon/Desktop/WPSCASESONE/",lang="en",doc_id="sample")
> >
> > The code is the same for the other folders; the name of the folder is
> > different, but otherwise identical.
> >
> > The error message reads:
> >
> > *Error in nchar(tagged.text[, "token"], type = "width") : invalid multibyte
> > string, element 348*
> >
> > The error message is the same for the other two commands, but the
> > 'element' number differs: it is 925 for the second folder and 4302
> > for the third.
> >
> > token2 <-
> > tokenize("/Users/Gordon/Desktop/WPSCASES2/",lang="en",doc_id="sample")
> >
> > token3 <-
> > tokenize("/Users/Gordon/Desktop/WPSCASES3/",lang="en",doc_id="sample")
> >
> > These are the other commands if that's helpful.
> >
> > I’ve tried to work out whether the ‘element’ that the error message
> > mentions corresponds to the file at that position in the folder, but
> > since folder 3 does not contain 4,300 files, that seems unlikely.
> > Please let me know if you can figure out how to fix this so that I
> > can start to use ‘koRpus’ commands, like ‘readability’ and its
> > progeny.
> >
> > Thank you,
> > Gordon
> 
> Hi Gordon,
> Looks to me as though you may have to extract the text from the Word
> files. Export As Text.

Hi!  

quanteda::tokenize says it needs a 
character vector or «corpus» object as 
input:

	https://www.rdocumentation.org/packages/quanteda/versions/0.99.12/topics/tokenize
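
A minimal sketch against that API 
(untested; note that newer quanteda 
versions renamed tokenize() to 
tokens(), and the two strings below 
are made up):

	library(quanteda)
	## a named character vector is valid input
	txt <- c(doc1 = "This is a sample sentence.",
	         doc2 = "And here is another one.")
	tokens(txt)  # tokenize(txt) in quanteda <= 0.99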

... or is this tokenize from the 
tokenizers package?  I found something 
about «doc_id» here:

	https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html
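
Whichever tokenize() that is, «invalid 
multibyte string» usually means some 
input is not valid UTF-8 in the 
current locale.  Assuming the text has 
already been extracted to plain-text 
files (the path below is just the 
first folder), a sketch to find the 
offenders:

	files <- list.files("~/Desktop/WPSCASESONE",
	                    full.names = TRUE)
	bad <- Filter(function(f) {
	        txt <- readLines(f, warn = FALSE)
	        !all(validUTF8(txt))
	}, files)
	bad  # files with invalid byte sequences

iconv(x, from = "latin1", to = "UTF-8") 
can often repair such lines.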

You can convert docx to markdown using 
pandoc:

	pandoc --from docx --to markdown $inputfile

odt also works, and many others.  
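
If you would rather drive pandoc from 
R, an untested sketch (assumes pandoc 
is on your PATH):

	docx <- Sys.glob("~/Desktop/WPSCASESONE/*.docx")
	for (f in docx) {
	        out <- sub("\\.docx$", ".md", f)
	        system2("pandoc",
	                c("--from", "docx",
	                  "--to", "markdown",
	                  "-o", shQuote(out),
	                  shQuote(f)))
	}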

I believe pandoc is bundled with 
RStudio, but I have never used it from 
there myself, so take that part with a 
grain of salt.

To read doc, I use wvHtml:

	wvHtml $inputfile - 2> /dev/null | w3m -dump -T text/html
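
... and wrapped in R as a hypothetical 
helper (untested; needs wv and w3m 
installed):

	read_doc <- function(f) {
	        ## shell out to wvHtml + w3m and
	        ## capture the dumped text
	        cmd <- sprintf(
	                "wvHtml %s - 2>/dev/null | w3m -dump -T text/html",
	                shQuote(f))
	        paste(system(cmd, intern = TRUE),
	              collapse = "\n")
	}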

Rasmus
