[R] Parsing aspects of a url path in R

Ben Tupper ben.bighair at gmail.com
Thu Mar 6 18:52:03 CET 2014


Hi,

The XML package has a handy function, parseURI(), that slices and dices a URL into its components (scheme, server, path, and so on).

library(XML)
parseURI('http://www.mdd.com/food/pizza/index.html')
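
To cover both parts of the question, the same call can be applied to each element of the vector. This is only a sketch: `urls` is the vector from the original post under a different name, and it calls parseURI() once per URL because I am not certain it accepts a whole vector. The `server` and `path` fields are the ones I would expect to hold the host and everything after it.

```r
library(XML)

urls <- c("http://www.mdd.com/food/pizza/index.html",
          "http://www.mdd.com/build-your-own/index.html",
          "http://www.mdd.com/special-deals.html",
          "http://www.genius.com/find-a-location.html",
          "http://www.google.com/hello.html")

# Parse one URL per call (assumption: parseURI() is not vectorized)
parsed <- lapply(urls, parseURI)

# Pull the host and the path out of each parsed URI
servers <- vapply(parsed, function(p) p$server, character(1))
paths   <- vapply(parsed, function(p) p$path, character(1))

# Drop a leading "www." to get the bare domain
domains <- sub("^www\\.", "", servers)
```

Here `domains` should match what parser() returns in the original post, and `paths` should hold the part after ".com", leading slash included.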

Might that help?
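
If you would rather not depend on the XML package, a base R regular expression may be enough for simple URLs like these. The sketch below assumes every URL starts with http:// or https:// and has no port, user info, or query string. `urls` is the vector from the original post, renamed.

```r
urls <- c("http://www.mdd.com/food/pizza/index.html",
          "http://www.mdd.com/build-your-own/index.html",
          "http://www.mdd.com/special-deals.html",
          "http://www.genius.com/find-a-location.html",
          "http://www.google.com/hello.html")

# Remove the scheme and host: everything up to the first "/" after "://"
paths <- sub("^https?://[^/]+", "", urls)

# Capture the host on its own, then drop a leading "www."
hosts   <- sub("^https?://([^/]+).*$", "\\1", urls)
domains <- sub("^www\\.", "", hosts)

print(paths)
print(domains)
```

A proper URI parser is the safer choice once the URLs get messier than this; the regex is only meant for the tidy case shown.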

Cheers,
Ben

On Mar 6, 2014, at 12:23 PM, Abraham Mathew <abmathewks at gmail.com> wrote:

> Let's say that I have the following character vector with a series of url
> strings. I'm interested in extracting some information from each string.
> 
> url = c("http://www.mdd.com/food/pizza/index.html",
>         "http://www.mdd.com/build-your-own/index.html",
>         "http://www.mdd.com/special-deals.html",
>         "http://www.genius.com/find-a-location.html",
>         "http://www.google.com/hello.html")
> 
> - First, I want to extract the domain name followed by .com. After
> struggling with this for a while, reading some regular expression
> tutorials, and reading through stack overflow, I came up with the following
> solution. Perfect!
> 
>> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://",
> "", x), "/"), "[[", 1))
>> parser(url)
> [1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
> 
> - Second, I want to extract everything after .com in the original url.
> Unfortunately, I don't know the proper regular expression to assign in
> order to get the desired result. Can anyone help?
> 
> Output should be
> /food/pizza/index.html
> /build-your-own/index.html
> /special-deals.html
> 
> If anyone has a solution using the stringr package, that'd be of interest
> also.
> 
> 
> Thanks.
> 
> -- 
> 
> Abraham Mathew
> Analytics Strategist
> Minneapolis, MN
> 720-648-0108
> 
> abmathewks at gmail.com
> Twitter <https://twitter.com/abmathewks>
> LinkedIn <http://www.linkedin.com/pub/abraham-mathew/29/21b/212/>
> Blog <https://mathewanalytics.wordpress.com/>
> Tumblr <http://iwearstyle.tumblr.com/>
> Pinterest <http://pinterest.com/amathew123/>
