[R] regex - extracting src url

Tue Mar 22 11:27:46 CET 2016

On 03/22/2016 12:44 AM, Omar André Gonzáles Díaz wrote:
> Hi, I have a DF with a column with "html", like this:
>
> <IMG SRC="
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
> BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">
>
>
> I need to get this:
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?
>
>
> I've got this so far:
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
> BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement
>
>
> This is the code I've used:
>
> carreras_normal$Impression.Tag..image. <-
> gsub("<img.+?src=[\"'](.*?)[\"'].*?>","\\1",carreras_normal$Impression.Tag..image.,
>                                    ignore.case = T)
>
>
>
> *But I still need to get rid of this part:*
>
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
> ?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*
>
>
> Thank you for your help.

You're querying an XML string, so use XPath, e.g., via the XML package:

 > as.character(xmlParse(y)[["//IMG/@SRC"]])
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

`xmlParse()` translates the character string into an XML document. `[[`
subsets the document to extract a single element. "//IMG/@SRC" follows
the XPath specification (the section at
https://www.w3.org/TR/xpath-31/#abbrev provides a quick guide):
starting from the root of the document, it finds a node labeled IMG,
at any depth, and selects that node's attribute labeled SRC.
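
As a minimal, self-contained sketch of the steps above: the string `y`
and the example.com URL are made up for illustration, and the tag is
self-closed so that it is well-formed XML, which is an assumption about
the data. If the real strings are not well-formed, `htmlParse()` from
the same package may be a more forgiving parser.

```r
library(XML)

# Hypothetical input: one well-formed (self-closed) IMG element.
y <- '<IMG SRC="https://example.com/track;ord=1?" BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement"/>'

# asText = TRUE makes it explicit that y is XML content, not a file name.
doc <- xmlParse(y, asText = TRUE)

# `[[` with an XPath expression returns the first match;
# as.character() reduces the attribute to a plain string.
src <- as.character(doc[["//IMG/@SRC"]])
print(src)
```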

A variation, if there were several IMG tags to be extracted, would be

   xpathSApply(xmlParse(y), "//IMG/@SRC", as.character)
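
A small sketch of that variation, again with made-up data: the wrapping
`<div>` element and the example.com URLs are assumptions, added only so
that one well-formed document holds two IMG tags.

```r
library(XML)

# Hypothetical document with two IMG elements under a single root node.
y2 <- paste0(
  '<div>',
  '<IMG SRC="https://example.com/a?" ALT="Advertisement"/>',
  '<IMG SRC="https://example.com/b?" ALT="Advertisement"/>',
  '</div>'
)

doc2 <- xmlParse(y2, asText = TRUE)

# Apply as.character() to every SRC attribute matched by the XPath query;
# unname() drops any names attached to the simplified result.
urls <- unname(xpathSApply(doc2, "//IMG/@SRC", as.character))
print(urls)
```

The result should be a character vector with one element per IMG tag, in
document order.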

>
> Omar Gonzáles.
>
