[R] Help request: Parsing docx files for key words and appending to a spreadsheet

jim holtman jholtman at gmail.com
Fri Dec 29 19:33:22 CET 2023


Check out the 'officer' package.
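
As a rough starting point, here is a minimal sketch of how that could look.
It assumes each Lexis+ article contains a paragraph beginning with "Subject:"
followed by entries such as "FAST FASHION (72%); RETAILERS (45%)"; that
layout is an assumption, so check it against your own files. The helper names
(parse_subject, read_subject, collect_keywords) and the output file name are
made up for illustration. Only read_docx() and docx_summary() come from
officer; everything else is base R.

```r
# Sketch only: the "Subject:" paragraph layout is an assumption about Lexis+.

# Turn "Subject: FAST FASHION (72%); RETAILERS (45%)" into a data frame of
# keywords and percentages, keeping rows at or above the threshold.
parse_subject <- function(subject, threshold = 50) {
  empty <- data.frame(keyword = character(0), pct = numeric(0),
                      stringsAsFactors = FALSE)
  if (length(subject) != 1 || is.na(subject)) return(empty)
  # Drop the leading "Subject:" label if present
  subject <- sub("^[[:space:]]*Subject:[[:space:]]*", "", subject,
                 ignore.case = TRUE)
  # Each hit looks like "KEYWORD (NN%)"
  pattern <- "[^;()]+?\\s*\\([0-9]+%\\)"
  hits <- regmatches(subject, gregexpr(pattern, subject, perl = TRUE))[[1]]
  if (length(hits) == 0) return(empty)
  keyword <- trimws(sub("[[:space:]]*\\([0-9]+%\\)$", "", hits))
  pct <- as.numeric(sub(".*\\(([0-9]+)%\\)$", "\\1", hits))
  out <- data.frame(keyword = keyword, pct = pct, stringsAsFactors = FALSE)
  out[out$pct >= threshold, , drop = FALSE]
}

# Read one .docx with officer and return its first "Subject:" paragraph,
# or NA if none is found.
read_subject <- function(path) {
  doc <- officer::read_docx(path)
  body <- officer::docx_summary(doc)   # one row per paragraph or table cell
  txt <- body$text[!is.na(body$text)]
  line <- grep("^[[:space:]]*Subject:", txt, value = TRUE, ignore.case = TRUE)
  if (length(line) == 0) NA_character_ else line[1]
}

# Walk a directory, skip articles with no keyword at or above the threshold,
# and write the rest to a CSV that a spreadsheet program can open.
collect_keywords <- function(dir, out_file = "keywords.csv", threshold = 50) {
  files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE)
  rows <- lapply(files, function(f) {
    kw <- parse_subject(read_subject(f), threshold)
    if (nrow(kw) == 0) return(NULL)    # article does not qualify
    data.frame(file = basename(f), kw, stringsAsFactors = FALSE)
  })
  result <- do.call(rbind, rows)
  if (!is.null(result)) write.csv(result, out_file, row.names = FALSE)
  invisible(result)
}

# Try the parser on a made-up subject line before pointing it at real files
parse_subject("Subject: FAST FASHION (72%); RETAILERS (45%)")
```

Calling collect_keywords("path/to/articles") would then process a whole
directory. The title, author, word count and page fields could be pulled
from the same docx_summary() data frame once you know which paragraphs
hold them.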

Thanks

Jim Holtman
*Data Munger Guru*


*What is the problem that you are trying to solve? Tell me what you want to
do, not how you want to do it.*


On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv at gmail.com> wrote:

> Hello
>
> I am trying to work through a problem, but feel like I've gone down a
> rabbit hole. I'd very much appreciate any help.
>
> The task: I have several directories of multiple (some directories, up
> to 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that
> I want to iterate through to append to a spreadsheet only those articles
> that satisfy a condition (i.e., a specific keyword is present for >= 50%
> coverage of the subject matter). Lexis+ has a very specific structure
> and keywords are given in the row "Subject".
>
> I'd like to be able to accomplish the following:
>
> (1) Append the title, the month, the author, the number of words, and
> page number(s) to a spreadsheet
>
> (2) Read each article and extract keywords (in the docs, these are
> listed in 'Subject' section as a list of keywords with a percentage
> showing the extent to which the keyword features in the article (e.g.,
> FAST FASHION (72%))) and to append the keyword and the % coverage to the
> same row in the spreadsheet. However, I want to ensure that the keyword
> coverage meets the threshold of >= 50%; if not, then pass onto the next
> article in the directory. Rinse and repeat for the entire directory.
>
> So far, I've tried working through some Stack Overflow-based solutions,
> but most seem to use the textreadr package, which is now deprecated;
> others use either the officer or the officedown packages. However, these
> packages don't appear to do what I want the program to do, at least not
> in any of the examples I have found, nor in the vignettes and relevant
> package manuals I've looked at.
>
> My first question: is what I intend to do even possible using R?
> If it is, then where do I start with this? If these docx files were
> converted to UTF-8 plain text, would that make the task easier?
>
> I am not a confident coder, and am really only just getting my head
> around R so appreciate a steep learning curve ahead, but of course, I
> don't know what I don't know, so any pointers in the right direction
> would be a big help.
>
> Many thanks in anticipation
>
> Andy
>
