[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Andy phaedrusv using gmail.com
Fri Dec 29 21:37:38 CET 2023


Thanks - I'll have a look at these options too.

I'm happy to send over a sample document, but wasn't sure whether
attachments are allowed. The documents come from Lexis+, which requires
user credentials to log in, but I could upload a file somewhere if that
would help. Any suggestions for a good place to do so?


On 29/12/2023 20:25, Dr Eberhard W Lisse wrote:
> I would also look at https://pandoc.org perhaps which can
> export a number of formats...
>
> And for spreadsheets https://github.com/jqnatividad/qsv is my
> go-to weapon.  Can also read and write XLSX and others.
>
> A sample document or two would always be helpful...
>
> el
>
> On 29/12/2023 21:01, CALUM POLWART wrote:
>> It sounded like he looked at officer but I would agree
>>
>> content <- officer::docx_summary(officer::read_docx("filename.docx"))
>>
>> Would get the text content into an object called content.
>>
>> That object is a data.frame so you can then manipulate it.
>> To be more specific, we might need an example of the DF
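
As a rough illustration of manipulating that data.frame, here is a
minimal sketch. It uses an invented stand-in rather than a real Lexis+
document, and it assumes the summary has `content_type` and `text`
columns, which should be checked against the actual output. The word
count is a crude whitespace split, not what Word itself would report.

```r
# Stand-in for the data.frame returned by
# officer::docx_summary(officer::read_docx("filename.docx")).
# Column names are an assumption to verify; the rows are invented.
content <- data.frame(
  doc_index    = 1:4,
  content_type = c("paragraph", "paragraph", "table cell", "paragraph"),
  style_name   = c("heading 1", NA, NA, NA),
  text         = c("An example title", "Body text of the article.", "cell", ""),
  stringsAsFactors = FALSE
)

# Keep only non-empty paragraph text.
is_para    <- content$content_type == "paragraph" & nzchar(content$text)
paragraphs <- content$text[is_para]

# Rough word count: split each paragraph on runs of whitespace.
word_count <- sum(lengths(strsplit(paragraphs, "[[:space:]]+")))
```
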
> [...]
>>> On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv using gmail.com>
>>> wrote:
> [...]
>>>> I'd like to be able to accomplish the following:
>>>>
>>>> (1) Append the title, the month, the author, the number of
>>>> words, and page number(s) to a spreadsheet
>>>>
>>>> (2) Read each article and extract keywords (in the docs,
>>>> these are listed in the 'Subject' section as a list of
>>>> keywords, each with a percentage showing the extent to which
>>>> the keyword features in the article, e.g., FAST FASHION (72%))
>>>> and append the keyword and the % coverage to the same
>>>> row in the spreadsheet.  However, I want to ensure that
>>>> the keyword coverage meets the threshold of >= 50%; if
>>>> not, then pass on to the next article in the directory.
>>>> Rinse and repeat for the entire directory.
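
The keyword step in (2) could be sketched in base R roughly as follows.
This is only a sketch under assumptions: `extract_keywords()` is a
hypothetical helper, and the layout of the 'Subject' text (upper-case
keywords, each followed by a percentage in parentheses, separated by
semicolons) is guessed from the single FAST FASHION (72%) example, not
taken from a real Lexis+ file.

```r
# Hypothetical helper: pull "KEYWORD (NN%)" pairs out of the Subject text
# and keep only those at or above the coverage threshold.
extract_keywords <- function(subject_text, threshold = 50) {
  # An upper-case phrase followed by a percentage in parentheses.
  pattern <- "[A-Z][A-Z0-9 &'/-]*\\([0-9]{1,3}%\\)"
  hits <- regmatches(subject_text, gregexpr(pattern, subject_text))[[1]]
  if (length(hits) == 0) {
    return(data.frame(keyword = character(0), coverage = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Strip the "(NN%)" part to get the keyword, and keep NN as an integer.
  keyword  <- trimws(sub(" *\\([0-9]{1,3}%\\)$", "", hits))
  coverage <- as.integer(sub("^.*\\(([0-9]{1,3})%\\)$", "\\1", hits))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}

# Invented example line, for illustration only.
kw <- extract_keywords(
  "Subject: FAST FASHION (72%); RETAILERS (64%); SUSTAINABILITY (45%)"
)
```

An article for which the result has zero rows would simply be skipped
before moving on to the next file in the directory.
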
> [...]


