[R] textual analysis - transforming several pdf to txt - naming the files

Cecília Carmo
Wed Jul 5 11:14:14 CEST 2023


I am taking my first steps in textual analysis with R.
I have PDF files consisting of company reports for several years (one file corresponds to one company and one year).
My idea is to start by converting all my PDF files into txt files for further processing and analysis (this will allow me to group the files by company or by year, depending on the analysis to be performed later).
I do not have in-depth knowledge of programming in R; I just adapt code that I find to my needs. Here is my first question, about a script I'm adapting:

My PDF files are in one directory named "pdfs". The names of my files are, for example, SONAE2020FS.pdf and EDP2021GS.pdf.
I want to convert them to txt and give the txt files the same names as the PDF files: SONAE2020FS.txt, EDP2021GS.txt.
I'm running the following script, but the names of the txt files that I obtain are pdftext1, pdftext2, pdftext3...
What do I need to change?
Thank you very much,

Cecília Carmo
Universidade de Aveiro - Portugal


dirpath <- ("/Users/ceciliacarmo/documents/RTextualAnalysis/data/pdfs")

library(pdftools)
library(dplyr)

convertpdf2txt <- function(dirpath){
  files <- list.files(dirpath, full.names = TRUE)
  x <- sapply(files, function(x){
    x <- pdftools::pdf_text(x) %>%
      paste0(collapse = " ") %>%
      stringr::str_squish()
    return(x)
  })
}

# apply function
txts <- convertpdf2txt(here::here("data", "pdf/"))

# add names to txt files
names(txts) <- paste0(here::here("data","pdftext"), 1:length(txts), sep = "")
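
A likely cause is the last line of the script: it replaces whatever names txts already has with pdftext1, pdftext2, and so on, and nothing in the script writes a file to disk. Because sapply() over an unnamed character vector uses the elements as names by default, txts should already be named by the full PDF paths before that line runs. Below is a minimal sketch of one way to derive the matching .txt names and write the files. The helpers pdf_to_txt_path() and write_txts() are hypothetical names introduced only for illustration (they are not part of pdftools), and writing the txt files next to the PDFs is an assumption.

```r
# Hypothetical helper: turn PDF paths into .txt paths with the same base name.
# By default the .txt files go into the same directory as the PDFs (an assumption).
pdf_to_txt_path <- function(pdf_paths, out_dir = dirname(pdf_paths)) {
  stems <- tools::file_path_sans_ext(basename(pdf_paths))  # "SONAE2020FS.pdf" -> "SONAE2020FS"
  file.path(out_dir, paste0(stems, ".txt"))
}

# Hypothetical helper: write each extracted text to its own .txt file.
write_txts <- function(txts, txt_paths) {
  stopifnot(length(txts) == length(txt_paths))
  for (i in seq_along(txts)) {
    writeLines(txts[[i]], txt_paths[[i]])
  }
  invisible(txt_paths)
}

# Usage with the script from the post (commented out because it needs the PDFs):
# txts      <- convertpdf2txt(dirpath)
# txt_paths <- pdf_to_txt_path(names(txts))   # names(txts) are the PDF paths
# write_txts(txts, txt_paths)
# names(txts) <- basename(txt_paths)          # e.g. "SONAE2020FS.txt"
```

If the directory also contains files other than PDFs, passing pattern = "\\.pdf$" to list.files() inside convertpdf2txt() would keep them out of the conversion.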






