[R] regex - extracting src url

Tue Mar 22 06:13:14 CET 2016

?strsplit  # I think
My "solution" assumes a fixed format for the tags, as shown in your
example: the URL must be the first double-quoted string in each one.
If that is not the case, it won't work.

> y <- '<IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
+ BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'

> y  ## checking that the URL is as expected

[1] "<IMG SRC=\"https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement\">"

> lapply(strsplit(y,"\""),"[",2) ## should also work when y is a vector of such tags

[[1]]
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
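
If the attributes do not always come in that order, a pattern anchored on the SRC attribute may be safer. What follows is only a minimal sketch: the second tag and its example.com URL are made up for illustration, and the names url1, by_position, by_pattern and by_gsub are mine, not from your data. It assumes the URL itself contains no quote characters.

```r
url1 <- "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

# Two tags: the one from the example, plus a hypothetical one with SRC not first
y <- c(
  paste0('<IMG SRC="', url1, '"\nBORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'),
  '<img border="0" src="https://example.com/pixel.gif" alt="x">'
)

# Positional approach (as above), simplified to a character vector.
# For the second tag this picks up "0", not the URL.
by_position <- vapply(strsplit(y, "\""), `[`, character(1), 2)

# Pattern approach: capture whatever sits between the quotes after src=
pat <- "src[[:space:]]*=[[:space:]]*[\"']([^\"']*)[\"']"
m <- regexec(pat, y, ignore.case = TRUE)
by_pattern <- vapply(
  regmatches(y, m),
  function(x) if (length(x) >= 2) x[2] else NA_character_,  # NA when no src found
  character(1)
)

# Variant of the gsub() call from the original post: a negated character
# class replaces the lazy (.*?), and (?is) makes the PCRE match case-insensitive
# and lets "." cross the embedded newline.
by_gsub <- gsub("(?is)<img.+?src=[\"']([^\"']*)[\"'].*?>", "\\1", y, perl = TRUE)
```

Here by_pattern and by_gsub should both give the two URLs, while by_position only gets the first one right. The negated class [^"']* stops at the first closing quote whichever regex engine is used, so it does not depend on how lazy quantifiers are handled.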

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 21, 2016 at 9:44 PM, Omar André Gonzáles Díaz
<oma.gonzales at gmail.com> wrote:
> Hi, I have a data frame with a column containing HTML, like this:
>
> <IMG SRC="
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
> BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">
>
>
> I need to get this:
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?
>
>
> I've got this so far:
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
> BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement
>
>
> This is the code I've used:
>
> carreras_normal$Impression.Tag..image. <-
> gsub("<img.+?src=[\"'](.*?)[\"'].*?>","\\1",carreras_normal$Impression.Tag..image.,
>                                   ignore.case = T)
>
>
>
> *But I still need to get rid of this part:*
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
> ?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*
>
>
> Thank you for your help.
>
> Omar Gonzáles.
>