[R] Reading PDF files with German umlauts using tabulizer

Wed Sep 7 10:03:06 CEST 2022

Hi!

The package "tabulizer" seems to have been removed from the package
repositories, so it is a bit hard to test.

I found the documentation, and the signature of "extract_text" is:

extract_text(file, pages = NULL, area = NULL, password = NULL,
  encoding = NULL, copy = FALSE)

So have you tried setting the "encoding" argument?
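
A minimal sketch of what I mean, untested against tabulizer itself. I am
assuming that "encoding" takes the same values as R's Encoding<- (for
example "latin1" or "UTF-8"). The base R part shows what such a declaration
does to the bytes of "Gläsern" as an "Ansi" (Latin-1 / Windows-1252) file
would store them. The extract_text() call at the end is guarded and only
runs if tabulizer and Pub_001.pdf are available; the value "latin1" is my
guess from the "Ansi" report, not something I have verified.

```r
# Bytes of "Gläsern" as stored in Latin-1 / Windows-1252 ("Ansi"): ä is 0xE4.
ansi_bytes <- as.raw(c(0x47, 0x6c, 0xe4, 0x73, 0x65, 0x72, 0x6e))
x <- rawToChar(ansi_bytes)

# Declare what the bytes mean; this does not change the bytes themselves.
Encoding(x) <- "latin1"

# Convert to UTF-8 so the umlaut is stored as the two bytes c3 a4.
fixed <- enc2utf8(x)
charToRaw(fixed)  # 47 6c c3 a4 73 65 72 6e

# Assumption: tabulizer is installed and the PDF is in the working directory.
# "latin1" is a guess, not a documented recommendation.
if (requireNamespace("tabulizer", quietly = TRUE) &&
    file.exists("Pub_001.pdf")) {
  text <- tabulizer::extract_text(file = "Pub_001.pdf", encoding = "latin1")
  print(substr(text, 1, 200))
}
```

If the umlauts are still wrong with every value you try, the problem is
probably not one that the "encoding" argument can fix.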

HTH,
Kimmo

On Tue, 2022-09-06 at 11:39 +0200, Wolfgang Grond wrote:
> Dear all,
> 
> I have some trouble reading PDF files in German.
> 
> I want to extract text and tables with the tabulizer package, and
> everything goes well as long as I read English texts.
> 
> When I try the same code
> 
> text <- extract_text(file = "Pub_001.pdf")
> 
> with documents in German, the umlauts are not recognized.
> 
> They are either replaced by a combination of characters:
> 
> Instead of
> 
> "Entmischung und Kristallisation in Gläsern des Systems"
>                                       --
> I get
> 
> "Entmischung und Kristallisation in GHisern des Systems"
>                                       --
> 
> or they are replaced by a plain ASCII character, like this:
> 
> Instead of
> 
> "In Gläsern des Systems"
>        -
> I get
> 
> "In Glasern des Systems"
>        -
> 
> Opening the file with Adobe Reader tells me that the encoding is "Ansi".
> 
> Is there a way to read this file correctly?
> 
> Thanks in advance for any idea.
> 
> Regards
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.