[R] readPDF() -- unsure how to install xpdf to make this work?

Tony Breyal tony.breyal at googlemail.com
Mon Nov 17 23:17:53 CET 2008


Hi all, thank you so very much for your help, I have now got it
working! I thought I had replied already, but I don't think it got
through, so this is a repost for anyone who searches on this
topic...

After adding the directory to the Path variable, I should have
restarted my laptop. I had assumed that Windows would update the path
automatically, but apparently that didn't happen on my uni laptop
(Windows XP SP2).
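
For anyone who hits the same thing: a rough way to check from inside R
whether the tools are visible, and to extend the PATH for the current R
session only (so no restart is needed), might look like the sketch
below. The directory in xpdf.dir is just the install location used in
this thread, so treat it as an assumption and adjust it for your machine.

```r
# Assumed install directory (the one used in this thread); adjust as needed.
xpdf.dir <- "C:\\Program Files\\xpdf"

# Append the directory to PATH for this R session only.
# .Platform$path.sep is ";" on Windows and ":" elsewhere.
old.path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old.path, xpdf.dir, sep = .Platform$path.sep))

# Sys.which() returns the full path of each tool, or "" if it is not found.
tools <- Sys.which(c("pdftotext", "pdfinfo"))
print(tools)
```

If both entries print as empty strings, R still cannot see the tools,
and the directory in xpdf.dir is probably not where the xpdf executables
live.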

Also, I received 2 private emails about how to use the readPDF
function, so here is how you do it:

### R START ###
> library(tm)
> my.path <- 'C:\\Documents and Settings\\tony\\Desktop\\pdfs\\' #put your pdfs in here
> Corpus(DirSource(my.path), readerControl = list(reader=readPDF))
A text document collection with 1 text document
Warning message:
In readLines(filename, encoding = encoding) :
  incomplete final line found on 'C:\Documents and Settings\tony\Desktop\pdfs\/r-intro.pdf'
>
### R END ###

I'm not quite sure what consequence that warning has, but otherwise
it works fine for me.
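
For what it's worth, as far as I can tell readLines() gives that warning
whenever the last line of a file does not end with a newline, and the
text is still read in full. A small sketch that reproduces it with a
throwaway temporary file (the file and its contents are made up for
illustration, they are not the pdftotext output):

```r
# Write a two-line file whose last line has no trailing newline.
tmp <- tempfile(fileext = ".txt")
cat("first line\nlast line without newline", file = tmp)

# Catch the warning so we can look at its message, then carry on reading.
msgs <- character(0)
txt <- withCallingHandlers(
  readLines(tmp),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)  # the "incomplete final line" warning text
print(txt)   # both lines are still returned
```

If the warning is a nuisance, readLines(tmp, warn = FALSE) switches it
off for files you read yourself.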

Cheers,
Tony Breyal



On 16 Nov, 23:49, "joris meys" <jorism... at gmail.com> wrote:
> Hi Tony,
>
> You can name several variables 'Path' without problems. So you best
> restore the original variable PATH to its original value (or it ain't
> going to work any more) and just add a new one, call that PATH as
> well, and add the directory C:\Program Files\xpdf , like Uwe already
> suggested.
>
> This should make it work (I hope)
>
> Kind regards
> Joris
>
> On Sun, Nov 16, 2008 at 9:15 PM, Tony Breyal <tony.bre... at googlemail.com> wrote:
> > Hi Joris, there is already a variable called 'Path', therefore i
> > appended the directory path to the other strings already in the value
> > section:
>
> > Name: Path
> > Value: %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;
> > %SystemRoot%\system32\nls;%SystemRoot%\system32\nls\ENGLISH;C:\Program
> > Files\Novell\ZENworks\;C:\Program Files\Common Files\Teleca Shared;C:
> > \Program Files\QuickTime\QTSystem\;C:\Program Files\xpdf\
>
> > Still didn't work I'm afraid, but cheers for the suggestion.
>
> > Tony Breyal
>
> > On 16 Nov, 20:00, "joris meys" <jorism... at gmail.com> wrote:
> >> Try putting "PATH" under name, and the directory path (not the file)
> >> under value. That looks more appropriate to me...
>
> >> Kind regards
> >> Joris Meys
>
> >> On Sun, Nov 16, 2008 at 8:41 PM, Tony Breyal <tony.bre... at googlemail.com> wrote:
> >> > Hi,
>
> >> > Uwe -- ahh, thank you kindly, I was able to do a web search after
> >> > reading your post above in order to find a guide on how to set the
> >> > path in Windows (I wasn't aware that this is how a file is made
> >> > available to the system). I haven't got it to work yet, but at least
> >> > I'm on the right track! Also, just after reading your post, I
> >> > discovered the system() function in R, what a wonderful thing that is!
>
> >> > Clair -- I'm still working on getting the files to be accessible to
> >> > the system, but in the meantime I have just discovered the system()
> >> > function in R, which is a workaround for the moment... so using your
> >> > example, you could do:
> >> > ## R code
> >> >> system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/clair/Desktop/test/r-intro.pdf"'), wait=FALSE)
>
> >> > the above will create a new text document in your c:/../test folder.
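
The same workaround can be stretched to a whole folder. The sketch below
is only an illustration: txt_name() and pdf_to_text() are hypothetical
helpers made up here, pdf.dir is the folder from the example above, and
it uses system2() (available in more recent versions of R) and waits for
each conversion to finish rather than passing wait=FALSE. It relies on
pdftotext accepting the output text file as its second argument.

```r
# Hypothetical helper: name of the .txt file that goes with a .pdf file.
txt_name <- function(pdf) sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)

# Hypothetical helper: convert one PDF with pdftotext and return the path
# of the .txt file, or NA if the tool is missing or the conversion fails.
pdf_to_text <- function(pdf, exe = "pdftotext") {
  if (!nzchar(Sys.which(exe))) {
    warning("cannot find '", exe, "', skipping ", pdf)
    return(NA_character_)
  }
  out <- txt_name(pdf)
  status <- system2(exe, args = c(shQuote(pdf), shQuote(out)))
  if (status == 0) out else NA_character_
}

# Assumed folder (from the example above); list.files() finds every PDF in it.
pdf.dir <- "C:/Documents and Settings/clair/Desktop/test"
pdfs <- list.files(pdf.dir, pattern = "\\.pdf$", full.names = TRUE,
                   ignore.case = TRUE)
txts <- vapply(pdfs, pdf_to_text, character(1))
```

Any entry of txts that comes back as NA is a file that was not converted.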
>
> >> > Now obviously, we want to use the readPDF() function in package: tm.
> >> > so on my uni laptop, running windows XP, this is what i have done:
>
> >> > 1. Click through: start >> control panel >> system
> >> > 2. Click the Advanced tab.
> >> > 3. Click Environment variables.
> >> > 4. Click New (under 'system') to add a new variable name and value.
> >> >  4a. name: pdftotext
> >> >  4b. value: C:\Program Files\xpdf\pdftotext.exe
> >> > 5. Click New (under 'system') to add a new variable name and value.
> >> >  5a. name: pdfinfo
> >> >  5b. value: C:\Program Files\xpdf\pdfinfo.exe
>
> >> > In theory, I think, that should work. However, so far it hasn't, so I'm
> >> > not quite sure what to do. But at least in the meantime we have the
> >> > system() function as a workaround. If you can figure out what I'm doing
> >> > wrong (probably something obvious knowing me!) please do let me know.
>
> >> > Cheers,
> >> > Tony Breyal
>
> >> > On 16 Nov, 18:14, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
> >> >> clair.crossup... at googlemail.com wrote:
> >> >> > I never said it *should* work.
>
> >> >> > I was simply trying something out that works on other types of files
> >> >> > I've needed in the past (eg: html, csv, dat, etc.). I don't know the
> >> >> > details of the pdf format, but I thought it was worth a try, certainly
> >> >> > no harm in experimenting, and hence I learned that pdfs aren't stored
> >> >> > in the same way that other files i've used in the past are. that's
> >> >> > fine, good to learn new things.
>
> >> >> > As for trying the readPDF() function, yes, I have downloaded and used
> >> >> > xpdf to convert pdfs into plain text since reading the OP email.
> >> >> > However, how can you make xpdf available to the system so that
> >> >> > readPDF() works in R? I don't know, hence why I posted in this thread.
>
> >> >> > You clearly seem to have a solution, fancy sharing?
>
> >> >> Sure, I thought that could not be a real question:
> >> >> Set your environment variable PATH so that it additionally points to the
> >> >> directory where these tools are installed. As you would do for any other
> >> >> software that is to be called without knowledge where it is installed.
>
> >> >> Uwe Ligges
>
> >> >> > Clair Crossupton xx
>
> >> >> > On 16 Nov, 12:34, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
> >> >> >> clair.crossup... at googlemail.com wrote:
> >> >> >>> Hello, I was just wondering if you had found a solution? I am having
> >> >> >>> the same difficulty of converting pdf's into plain text documents in
> >> >> >>> R. I originally thought I could use the readLines() function, but as
> >> >> >>> you can see below that did not work.
> >> >> >> Why the hell should it? It is designed to read *text* files. And what
> >> >> >> you get below is exactly how your PDF file looks like if you read it as
> >> >> >> text which it is NOT. Why do you not also go the readPDF() way (and yes,
> >> >> >> it is not always possible nor reliable to go that way).
>
> >> >> >> Uwe Ligges
>
> >> >> >>> R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
> >> >> >>> R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
> >> >> >>> R> download.file(url = my.url, destfile=my.destfile, mode='wb')
> >> >> >>> R> txt <- readLines(my.destfile)
> >> >> >>> R> txt
> >> >> >>> [1] "%PDF-1.4"
> >> >> >>> [2] "%ÐÔÅØ"
> >> >> >>> [3] "1 0 obj <<"
> >> >> >>> [4] "/Length 587 "
> >> >> >>> [5] "/Filter /FlateDecode"
> >> >> >>> [6] ">>"
> >> >> >>> [7] "stream"
> >> >> >>> [8] "xÚmTM ¢@\020½ó+z\017&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf
> >> >> >>> \017'W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê(R)\027[vïÖæ6ïWÛ7ñÑTÙÖvb
> >> >> >>> \030¯"uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G› "
> >> >> >>> Warm Regards,
> >> >> >>> Clair
> >> >> >>> On 13 Nov, 15:10, Tony Breyal <tony.bre... at googlemail.com> wrote:
> >> >> >>>> Dear R-Help,
> >> >> >>>> I need to convert a set of '.pdf' files into an equivalent set of
> >> >> >>>> '.txt' files. This is so that i can do some text mining on the
> >> >> >>>> content.
> >> >> >>>> In the latest R-News letter
> >> >> >>>> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
> >> >> >>>> 'tm' for text mining is mentioned. In that lovely package, there
> >> >> >>>> is a function called 'readPDF()'. In order
> >> >> >>>> to use this, ?readPDF says
> >> >> >>>>     "Note that this PDF reader needs both the tools pdftotext and
> >> >> >>>> pdfinfo installed and accessable on your system."
> >> >> >>>> These tools are available from http://www.foolabs.com/xpdf/download.html
> >> >> >>>> I am able to download this and use it easily from a dos window to
> >> >> >>>> convert a pdf file into a txt file.
> >> >> >>>> Question: how do i make these tools available to R, so that i can use
> >> >> >>>> the readPDF() function?
> >> >> >>>> Thank you in advance for any help, and I hope the above made sense.
> >> >> >>>> Tony Breyal
> >> >> >>>> ###OS = Windows Vista Ultimate
> >> >> >>>> > sessionInfo()
> >> >> >>>> R version 2.8.0 (2008-10-20)
> >> >> >>>> i386-pc-mingw32
> >> >> >>>> locale:
> >> >> >>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> >> >> >>>> attached base packages:
> >> >> >>>> [1] grid      stats     graphics  grDevices utils     datasets  methods   base
> >> >> >>>> other attached packages:
> >> >> >>>> [1] tm_0.3-1           XML_1.98-1         Snowball_0.0-3     RWeka_0.3-14
> >> >> >>>> [5] rJava_0.6-0        Matrix_0.999375-16 lattice_0.17-15    filehash_2.0
> >> >> >>>> loaded via a namespace (and not attached):
> >> >> >>>> [1] proxy_0.4-1
> >> >> >>>> ______________________________________________
> >> >> >>>> R-h... at r-project.org mailing list
> >> >> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >> >>>> and provide commented, minimal, self-contained, reproducible code.


