[R] Parsing aspects of a url path in R

arun smartpink111 at yahoo.com
Thu Mar 6 19:13:45 CET 2014


Try:
gsub(".*\\.com","",url)
[1] "/food/pizza/index.html"     "/build-your-own/index.html"
[3] "/special-deals.html"        "/find-a-location.html"     
[5] "/hello.html"               


  gsub(".*www\\.([[:alpha:]]+\\.com).*","\\1",url)
#[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
A.K.
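
The two patterns above depend on every address containing ".com" and "www.". Here is a minimal sketch of a variant that keys on the URL structure instead, assuming each element has the form scheme://host/path. The names path and domain are only illustrative and do not come from the thread.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# Path: remove the scheme and the host, i.e. everything up to (but not
# including) the first "/" that follows "://".
path <- sub("^[a-z]+://[^/]+", "", url)

# Domain: capture the host, skipping an optional leading "www.".
domain <- sub("^[a-z]+://(www\\.)?([^/]+).*$", "\\2", url)

print(path)
print(domain)
```

Because the patterns are anchored at the start of the string and stop at the first "/", they should not be confused by a ".com" that appears later in the path.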


On Thursday, March 6, 2014 12:37 PM, Abraham Mathew <abmathewks at gmail.com> wrote:
Let's say that I have the following character vector containing a series of
URL strings. I'm interested in extracting some information from each string.

url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

- First, I want to extract the domain name followed by .com. After
struggling with this for a while, reading some regular expression
tutorials, and reading through Stack Overflow, I came up with the following
solution. Perfect!

> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://",
"", x), "/"), "[[", 1))
> parser(url)
[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
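
For readers following along, the one-liner above can be unpacked into separate steps. This is only a sketch of the same idea: it assumes every element starts with "http://", and the intermediate names (no_scheme, pieces, host, domain) are made up for illustration. It uses vapply() rather than sapply() so that the result is always a character vector.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# Step 1: drop the leading "http://".
no_scheme <- sub("^http://", "", url)

# Step 2: split on "/" so the first piece of each element is the host.
pieces <- strsplit(no_scheme, "/", fixed = TRUE)
host <- vapply(pieces, `[[`, character(1), 1)

# Step 3: strip a leading "www." from the host.
domain <- sub("^www\\.", "", host)

print(domain)
```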

- Second, I want to extract everything after .com in the original URL.
Unfortunately, I don't know the proper regular expression to use in
order to get the desired result. Can anyone help?

Output should be
/food/pizza/index.html
/build-your-own/index.html
/special-deals.html

If anyone has a solution using the stringr package, that'd be of interest
also.
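
On the stringr request, here is a minimal sketch that assumes the stringr package is installed (the block does nothing otherwise). It reuses patterns adapted from the base R reply with str_replace() and str_match(). The names path and domain are illustrative only.

```r
url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# Assumption: stringr is available; skip quietly if it is not.
if (requireNamespace("stringr", quietly = TRUE)) {
  # Remove everything up to and including ".com".
  path <- stringr::str_replace(url, ".*\\.com", "")

  # str_match() returns a matrix; column 2 holds the first capture group,
  # here the host name that follows "www.".
  domain <- stringr::str_match(url, "www\\.([[:alpha:]]+\\.com)")[, 2]

  print(path)
  print(domain)
}
```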


Thanks.

-- 

Abraham Mathew
Analytics Strategist
Minneapolis, MN
