[R] translating HTML character entities to accented characters

David L Carlson dcarlson at tamu.edu
Sun Aug 12 22:36:44 CEST 2012


This may work for your needs with a little fine tuning. Special and accented
characters can be represented in HTML with a character name or a numeric
value. For example, " can be represented as &quot; or as &#34; and it
appears from your example that both are used. I've attached a
dput(HTMLChars) to the end of this message with the concordances. The
following works on your data, but I haven't included any error checking.
Assuming the lines of your .csv file are in a character vector called txt
(e.g. from readLines()) and the data.frame HTMLChars is loaded:

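# Search for &Name; (names such as &sup2; and &frac14; contain digits)
lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alnum:]]+;", txt))))
lsta <- data.frame(Name=lsta)
matches <- merge(HTMLChars, lsta)
# seq_len() skips the loop if nothing matched; fixed=TRUE matches literally
for (i in seq_len(nrow(matches))) {
     txt <- gsub(matches$Name[i], matches$Character[i], txt, fixed=TRUE)
}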

# Search for &#Number;
lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
lstn <- data.frame(Number=lstn)
matches <- merge(HTMLChars, lstn)
# seq_len() skips the loop if nothing matched; fixed=TRUE matches literally
for (i in seq_len(nrow(matches))) {
     txt <- gsub(matches$Number[i], matches$Character[i], txt, fixed=TRUE)
}

txt now contains the converted characters.
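
The two loops can also be folded into one function, so the import can be
rerun whenever the database export changes. This is only a sketch:
decode_entities() and the three-row table mini are made-up names for
illustration, and the encoded sample strings are a guess at how a few of
the names in the example might have looked in the .csv file.

```r
# Hypothetical helper wrapping the two loops; not part of the original post.
decode_entities <- function(txt, table) {
  # Named (&eacute;) and numeric (&#233;) forms are handled the same way:
  # each is looked up in the matching column of the table.
  for (col in c("Name", "Number")) {
    found <- unique(unlist(regmatches(txt, gregexpr("&#?[[:alnum:]]+;", txt))))
    hits <- table[table[[col]] %in% found, , drop = FALSE]
    for (i in seq_len(nrow(hits))) {
      # fixed = TRUE treats the entity as a literal string, not a regex
      txt <- gsub(hits[[col]][i], hits$Character[i], txt, fixed = TRUE)
    }
  }
  txt
}

# Three-row stand-in for HTMLChars, for illustration only
mini <- data.frame(
  Character = c("\u00e8", "\u00e9", "\u00fc"),
  Number    = c("&#232;", "&#233;", "&#252;"),
  Name      = c("&egrave;", "&eacute;", "&uuml;"),
  stringsAsFactors = FALSE
)

# Assumed encodings of three names from the example
x <- c("Fr&egrave;re de Montizon", "Ni&#233;pce", "Sch&uuml;pbach")
decode_entities(x, mini)
```

With the full table the call would be decode_entities(txt, HTMLChars).
Entities that are not in the table are left untouched.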

dput(HTMLChars)
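structure(list(Character = c("\"", "'", "&", "<", ">", "\u00a0", "¡", 
"¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "\u00ad", 
"®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º", 
"»", "¼", "½", "¾", "¿", "×", "÷", "À", "Á", "Â", "Ã", "Ä", "Å", 
"Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", 
"Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", 
"á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", 
"î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "ø", "ù", "ú", "û", 
"ü", "ý", "þ"), Number = c("&#34;", "&#39;", "&#38;", "&#60;", 
"&#62;", "&#160;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;", 
"&#166;", "&#167;", "&#168;", "&#169;", "&#170;", "&#171;", "&#172;", 
"&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;", 
"&#180;", "&#181;", "&#182;", "&#183;", "&#184;", "&#185;", "&#186;", 
"&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#215;", "&#247;", 
"&#192;", "&#193;", "&#194;", "&#195;", "&#196;", "&#197;", "&#198;", 
"&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;", 
"&#206;", "&#207;", "&#208;", "&#209;", "&#210;", "&#211;", "&#212;", 
"&#213;", "&#214;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;", 
"&#221;", "&#222;", "&#223;", "&#224;", "&#225;", "&#226;", "&#227;", 
"&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;", 
"&#235;", "&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;", 
"&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#248;", "&#249;", 
"&#250;", "&#251;", "&#252;", "&#253;", "&#254;"), Name = c("&quot;", 
"&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;", "&iexcl;", "&cent;", 
"&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;", 
"&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;", 
"&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;", 
"&para;", "&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;", 
"&frac14;", "&frac12;", "&frac34;", "&iquest;", "&times;", "&divide;", 
"&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;", "&Aring;", 
"&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;", "&Euml;", 
"&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;", "&Ntilde;", 
"&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;", 
"&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;", 
"&szlig;", "&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;", 
"&aring;", "&aelig;", "&ccedil;", "&egrave;", "&eacute;", "&ecirc;", 
"&euml;", "&igrave;", "&iacute;", "&icirc;", "&iuml;", "&eth;", 
"&ntilde;", "&ograve;", "&oacute;", "&ocirc;", "&otilde;", "&ouml;", 
"&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;", "&yacute;", 
"&thorn;")), .Names = c("Character", "Number", "Name"), row.names = c(NA, 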
100L), class = "data.frame")
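
Since the value in a numeric entity is the code point of the character,
the Number column can be cross-checked against the Character column, and
numeric entities can be decoded without any table. A minimal sketch,
assuming a UTF-8 capable R session; mini and decode_numeric() are
hypothetical names used only for this illustration.

```r
# Small stand-in for HTMLChars, for illustration only
mini <- data.frame(
  Character = c("\u00e9", "\u00fc", "\u00df"),
  Number    = c("&#233;", "&#252;", "&#223;"),
  stringsAsFactors = FALSE
)

# Code point of each character, formatted as a numeric entity
expected <- paste0("&#", vapply(mini$Character, utf8ToInt, integer(1)), ";")
all(expected == mini$Number)

# Going the other way: strip everything but the digits and convert
# the resulting code point back to a character
decode_numeric <- function(entity) {
  intToUtf8(as.integer(gsub("[^[:digit:]]", "", entity)))
}
decode_numeric("&#223;")
```

The same check on the full table would replace mini with HTMLChars.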

-------
David

> -----Original Message-----
> From: Michael Friendly [mailto:friendly at yorku.ca]
> Sent: Friday, August 10, 2012 12:14 PM
> To: dcarlson at tamu.edu
> Cc: 'R-help'
> Subject: Re: [R] translating HTML character entities to accented
> characters
> 
> Thanks, David
> 
> I need an all-R solution for this, because the author.csv file is
> exported from a database that enforces the HTML
> encoding and the import into R may have to be repeated several times as
> the database is updated.
> 
> -Michael
> 
> On 8/10/2012 12:40 PM, David L Carlson wrote:
> > It's not quite an R solution, but I just pasted your examples into a
> > script window in R and saved it as chars.html. Then I opened it in
> > Firefox and pasted the results here (with returns inserted to match
> > your original).
> >
> >> grep("&", author$lname, value=TRUE)
> > [1] "Frère de Montizon" "Lumière"
> > [3] "Lumière" "Niépce"
> > [5] "Süssmilch" "Schüpbach"
> >> grep("&", author$birthplace, value=TRUE)
> > [1] "Marbach, Württemberg"
> > [2] "Côte-d'Or"
> > [3] "Chalon-sur-Saône, Saône-et-Loire"
> > [4] "Groß Särchen, Germany"
> >> apropos("HTML")
> > For a CSV file you would want to preserve the lines by adding <br> to
> > the end of each line first.
> >
> > ----------------------------------------------
> > David L Carlson
> > Associate Professor of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> project.org] On Behalf Of Michael Friendly
> >> Sent: Friday, August 10, 2012 11:15 AM
> >> To: R-help
> >> Subject: [R] translating HTML character entities to accented
> characters
> >>
> >> I've imported a .csv file where character strings that contained
> >> accented characters were written as HTML
> >> character entities.  Is there a function that works on a vector to
> >> translate them back to accented (latin1) characters?
> >>
> >> Some examples:
> >>
> >>   > grep("&", author$lname, value=TRUE)
> >> [1] "Frère de Montizon" "Lumière"
> >> [3] "Lumière"           "Niépce"
> >> [5] "Süssmilch"           "Schüpbach"
> >>   > grep("&", author$birthplace, value=TRUE)
> >> [1] "Marbach, Württemberg"
> >> [2] "Côte-d'Or"
> >> [3] "Chalon-sur-Saône, Saône-et-Loire"
> >> [4] "Groß Särchen, Germany"
> >>   > apropos("HTML")
> >>
> >> thx,
> >> -Michael
> >>
> >> --
> >> Michael Friendly     Email: friendly AT yorku DOT ca
> >> Professor, Psychology Dept.
> >> York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
> >> 4700 Keele Street    Web:   http://www.datavis.ca
> >> Toronto, ONT  M3J 1P3 CANADA
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-
> >> guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> 
> 


