[R] Creating a web site checker using R

Chris Evans chrishold sending from psyctc.org
Thu Aug 8 19:54:41 CEST 2019


I use R a great deal, but its considerable web crawling power isn't something I've used. I don't want to reinvent a cyberwheel and I suspect someone has already done what I want: a program that would run once a day (easy for me to set up as a cron task), crawl from a single root of a web site (mine) and get the file size and a CRC or some similar check value for each page as pulled off the site (obviously, I'd want it not to follow off-site links). The other key thing would be for it to store the values and URLs and to be capable of being run either in "create/update database" mode or in "check pages" mode, with the "check pages" run Emailing me a warning if a page has changed.

The reason I want this is that two of my sites have recently had content "disappear": neither I nor the ISP can see what happened, and we lack the very useful diagnostic of the date when the change happened, which might have mapped it to some component of WordPress, plugins or themes having updated.
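To make the two modes concrete, here is a minimal base-R sketch of the fingerprint-and-compare step only. It assumes the page bodies have already been fetched into a named character vector (names are URLs); the function names page_fingerprint, make_snapshot and compare_snapshots are hypothetical, and MD5 stands in for "a CRC or some similar check value". Fetching and Emailing are deliberately left out.

```r
# Hypothetical helper: size (bytes) and MD5 of one page body held as a string.
page_fingerprint <- function(body) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(body), tmp)            # write the exact bytes, no newline added
  data.frame(size = file.size(tmp),
             md5  = unname(tools::md5sum(tmp)),
             stringsAsFactors = FALSE)
}

# "Create/update database" mode: one row per URL.
# `pages` is assumed to be a named character vector (names = URLs).
make_snapshot <- function(pages) {
  fp <- do.call(rbind, lapply(pages, page_fingerprint))
  rownames(fp) <- NULL
  data.frame(url = names(pages), fp, stringsAsFactors = FALSE)
}

# "Check pages" mode: compare a fresh snapshot with the stored one.
compare_snapshots <- function(old, new) {
  both <- merge(old, new, by = "url", suffixes = c(".old", ".new"))
  list(changed = both$url[both$md5.old != both$md5.new],
       missing = setdiff(old$url, new$url),
       added   = setdiff(new$url, old$url))
}

# Illustrative use with made-up pages:
old <- make_snapshot(c("https://example.org/"      = "<p>hello</p>",
                       "https://example.org/about" = "<p>about</p>"))
new <- make_snapshot(c("https://example.org/"      = "<p>hello!</p>"))
compare_snapshots(old, new)
```

The stored snapshot could be kept between runs with saveRDS() and readRDS(). If the script simply prints any differences, a cron setup that mails job output would take care of the warning, though that depends on how mail is configured on the machine.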

I am failing to find anything of this kind, and all the services that offer site checking of this sort are prohibitively expensive for me (my sites are zero income and either personal or offering free utilities and information).

If anyone has done this, or something similar, I'd love to hear from you if you are willing to share it.  Failing that, I think I will have to create this myself, but I know it will take me days as this isn't my area of R expertise and, to be brutally honest, I'm a pretty poor programmer.  If I go that way, I'm sure people will be able to point me to things I can (legitimately) recycle in parts to help construct it.
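One building block that might be recyclable is the "don't follow off-site links" filter. The following is only a rough base-R sketch: extract_links and on_site_links are hypothetical names, the regular expression is a crude stand-in for a proper HTML parser, and the site root is assumed to be just scheme plus host (for example "https://example.org").

```r
# Hypothetical helper: pull double-quoted href values out of raw HTML.
# A real HTML parser would be more robust; this is only an illustration.
extract_links <- function(html) {
  m <- regmatches(html, gregexpr('href="[^"]+"', html, ignore.case = TRUE))[[1]]
  sub('^href="', "", sub('"$', "", m), ignore.case = TRUE)
}

# Keep only links under the site root. Root-relative links ("/about") are
# resolved against the root; other hosts, mailto:, bare #fragments and
# relative links without a leading slash are dropped.
on_site_links <- function(links, root) {
  root <- sub("/+$", "", root)               # normalise: no trailing slash
  rel <- grepl("^/[^/]", links) | links == "/"
  links[rel] <- paste0(root, links[rel])
  links <- sub("#.*$", "", links)            # drop fragments
  keep <- links == root | startsWith(links, paste0(root, "/"))
  unique(links[keep])
}

# Illustrative use with made-up HTML:
html <- paste('<a href="/about">A</a>',
              '<a href="https://example.org/blog#top">B</a>',
              '<a href="https://other.com/">C</a>')
on_site_links(extract_links(html), "https://example.org/")
```

A crawler would apply this to each fetched page, queue any on-site URLs it has not yet visited, and stop when the queue is empty.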

Thanks in advance,

Chris

-- 
Chris Evans <chris using psyctc.org> Skype: chris-psyctc
Visiting Professor, University of Sheffield <chris.evans using sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans using roehampton.ac.uk> and other places but this <chris using psyctc.org> remains my main Email address.
I have "semigrated" to France, see: https://www.psyctc.org/pelerinage2016/semigrating-to-france/ If you want to book a time to talk, I am trying to keep that to Thursdays and my diary is now available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take you to my blog which started with earlier joys in France and Spain!


