[R] Reading PDF files with German umlauts using tabulizer

Wolfgang Grond
Tue Sep 6 11:39:52 CEST 2022


Dear all,

I am having trouble reading PDF files written in German.

I want to extract text and tables with the tabulizer package, and
everything goes well as long as I read English texts.

When I run the same code

text <- extract_text(file = "Pub_001.pdf")

on documents in German, the umlauts are not recognized.

They are either replaced by a combination of characters:

Instead of

"Entmischung und Kristallisation in Gläsern des Systems"
                                      --
I get

"Entmischung und Kristallisation in GHisern des Systems"
                                      --

or they are replaced by plain ASCII characters, like this:

Instead of

"In Gläsern des Systems"
       -
I get

"In Glasern des Systems"
       -

When I open the file in Adobe Reader, it tells me that the encoding is "Ansi".
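
To tell apart the two cases above (umlaut really missing from the returned
string versus only displayed wrongly), the characters can be inspected
directly. This is a minimal sketch in base R only; the strings `expected` and
`got` and the helper `has_umlaut()` are hypothetical stand-ins for
illustration, not output from the actual file.

```r
# Hypothetical stand-ins for strings returned by extract_text():
# `expected` is the correct text, `got` mimics the ASCII-only result.
expected <- "In Gl\u00e4sern des Systems"
got      <- "In Glasern des Systems"

# Code points of the German umlauts and the sharp s
umlaut_codes <- utf8ToInt("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df")

# TRUE if the string contains at least one of those characters
has_umlaut <- function(s) any(utf8ToInt(s) %in% umlaut_codes)

has_umlaut(expected)
has_umlaut(got)

# Code point of the sixth character in each string
utf8ToInt(expected)[6]  # 228, a-umlaut
utf8ToInt(got)[6]       # 97, plain "a"
```

If such a check returns FALSE for the real output, the umlaut is absent from
the extracted text itself, which would suggest the problem lies in the
extraction step or in the PDF's text layer rather than in how R prints the
string.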

Is there a way to read this file correctly?

Thanks in advance for any ideas.

Regards


