[R] Extracting information from a tm corpus.

Shawn Way SWay at meco.com
Thu Mar 2 17:23:46 CET 2017


I'm trying to use the tm package to extract text from a corpus of documents.  I can read in a set of PDFs and filter the corpus down to the documents that contain a term, for example, "hot water".  I can also list those documents with the names() function, but I cannot work out how to get the matching lines out of the corpus.

I would like to end up with a corpus that holds only the filtered content, i.e. the lines containing the term.

I can manage this for a single document using something like:

  library(tm)
  library(tidyverse)
  library(tidytext)
  library(stringr)

  # Read every PDF in ./pdfs into a corpus and lower-case the text
  cname <- file.path(".", "pdfs")
  docs <- Corpus(DirSource(cname), readerControl = list(reader = readPDF))
  docs <- tm_map(docs, content_transformer(tolower))

  # Keep only the documents that contain the search term
  search.par <- "18"
  docs_filtered <- docs %>%
      tm_filter(FUN = function(x) any(grepl(search.par, content(x))))

  # Lines of the first filtered document that contain the search term
  content(docs_filtered[[1]])[grep(search.par, content(docs_filtered[[1]]))]

This gives me the lines that contain the term "18" in the first document of the corpus.  Is there any way to do this for all the documents in the corpus?

Ideally, I would end up with the lines containing the search term for each document in the corpus, so that they can be printed, at least to the screen.
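
One way this might be done for every document at once is to apply the same grep() call to each document with lapply(). The sketch below is not taken from the tm documentation. It uses base R only, and the helper name matching_lines and the stand-in list example_docs are made up for illustration. It assumes each document's content is a character vector with one element per line, which is what the single-document subsetting above relies on.

```r
# Hypothetical helper: collect the lines matching `pattern` in every document.
# `doc_lines` is assumed to be a named list of character vectors,
# one list element per document and one string per line.
matching_lines <- function(doc_lines, pattern) {
  # value = TRUE makes grep() return the matching lines, not their indices
  hits <- lapply(doc_lines, function(x) grep(pattern, x, value = TRUE))
  # Drop documents that have no matching line at all
  hits[lengths(hits) > 0]
}

# Stand-in data in place of a real corpus (invented for this example)
example_docs <- list(
  "a.pdf" = c("hot water at 18 degrees", "nothing here"),
  "b.pdf" = c("no match in this file"),
  "c.pdf" = c("page 18", "section 180", "end")
)

result <- matching_lines(example_docs, "18")

# Print the matching lines to the screen, grouped by document name
print(result)
```

With a real corpus, the input list could presumably be built with something like lapply(docs_filtered, content), using names(docs_filtered) as the names. That step is an assumption and is untested here.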

Thank you!

Shawn Way
   


