Comparison with Existing Solutions
To the best of my knowledge, there are already several solutions to
this problem, so why wrap up a new function at all? The answer is
performance.
First, there is an existing package, textutils, that contains many
utilities for working with text data. The function of interest here is
HTMLdecode.
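For instance (a quick illustration of my own; it assumes textutils is installed, and the input strings are sample entities I made up, not from the original post):

```r
library(textutils)

# Decode a few common HTML entity references back to plain characters
HTMLdecode(c("&amp;", "&lt;b&gt;bold&lt;/b&gt;"))
```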
Second, there is a function by SO user Stibu (here) that uses the
xml2 package:
```r
unescape_html2 <- function(str) {
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
```
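The `#_|` sentinel is what makes this vectorized: the whole character vector is joined into a single document, parsed with one read_html() call, and then split back apart. A quick check (the sample strings are my own; this assumes xml2 is installed and the function above is defined):

```r
# Round-trip a small vector through a single parse
unescape_html2(c("caf&eacute;", "1 &lt; 2 &amp;&amp; 3 &gt; 2"))
#> [1] "café"            "1 < 2 && 3 > 2"
```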
Third, I took the code from Christoph (here) and wrote an R wrapper
for the C function. That function is xutils::html_decode.
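The benchmark below operates on a character vector called strings; the actual vector used in the post is defined elsewhere, so here is a stand-in of my own, along with a sanity check that the three implementations agree on it (this assumes xutils, xml2, and textutils are all installed, and that unescape_html2 from above is defined):

```r
# A stand-in input vector (the post's actual `strings` is defined elsewhere)
strings <- c("&amp;", "caf&eacute;", "1 &lt; 2", "no entities here")

# All three decoders should produce identical results on this input
identical(xutils::html_decode(strings), unescape_html2(strings))
identical(xutils::html_decode(strings), textutils::HTMLdecode(strings))
```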
Now, let’s test the performance of the three, using the bench package.
```r
bench::mark(
  html_decode(strings),
  unescape_html2(strings),
  textutils::HTMLdecode(strings)
)
#> # A tibble: 3 × 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 html_decode(strings)             5.01µs   5.65µs   173183.        0B     17.3
#> 2 unescape_html2(strings)         93.88µs  99.58µs     9751.     539KB     24.2
#> 3 textutils::HTMLdecode(strings)   5.84ms   5.97ms      167.     381KB     83.5
```
Clearly, the speed of html_decode is unparalleled: it is roughly 18
times faster than unescape_html2 and about 1000 times faster than
textutils::HTMLdecode, with zero memory allocation.
Note:
When testing the results, I discovered a bug in
textutils::HTMLdecode
and reported it here.
The maintainer fixed it immediately. As of this writing (Feb. 16, 2021),
the development version of textutils has this bug fixed,
but the CRAN version may not. This means that if you test the
performance yourself with a previous version of textutils,
you may run into an error; installing the development version will
solve it.