This package collects some of the functions I have used when dealing with data.

library(xutils)

Text-related Functions

`html_decode`

Currently, there is only one function, html_decode, which replaces HTML entities such as &amp;amp; with the characters they represent, such as &.

This function is a thin wrapper around C++ code provided by Christoph on Stack Overflow.

Example

An example would be

strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
html_decode(strings)
#> [1] "abcd"  "& ' >" "&"     "€ <"

It works very well!
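To make the expected behaviour concrete without installing anything, here is a minimal base-R sketch of entity decoding. It is not how xutils implements html_decode: the name decode_basic_entities is made up for illustration, and it only knows the five predefined XML entities, whereas html_decode also handles others such as &euro;.

```r
# Illustrative base-R decoder for the five predefined XML entities only.
# NOT the xutils implementation; decode_basic_entities is a hypothetical name.
decode_basic_entities <- function(x) {
  entities <- c(
    "&lt;"   = "<",
    "&gt;"   = ">",
    "&apos;" = "'",
    "&quot;" = "\"",
    "&amp;"  = "&"   # decoded last so that "&amp;lt;" becomes "&lt;", not "<"
  )
  for (entity in names(entities)) {
    # fixed = TRUE treats the entity as a literal string, not a regex
    x <- gsub(entity, entities[[entity]], x, fixed = TRUE)
  }
  x
}

decode_basic_entities(c("abcd", "&amp; &apos; &gt;", "&amp;lt;"))
```

Each gsub call rescans every string, so a loop like this scales poorly as the entity table grows. That is one reason a compiled decoder is attractive.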

Comparison with Existing Solutions

To the best of my knowledge, there are already several solutions to this problem. So why wrap up a new function? Because of performance.

First, the existing package textutils contains many functions for handling strings and text. The one of interest here is HTMLdecode.

Second, there is a function by Stack Overflow user Stibu (here) that uses the xml2 package:

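unescape_html2 <- function(str){
  # Join all strings with a separator and wrap them in a single element
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  # Parsing as HTML and extracting the text decodes the entities
  parsed <- xml2::xml_text(xml2::read_html(html))
  # Split on the literal separator to recover the individual strings
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}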
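The join-and-split step of unescape_html2 can be tried on its own, without xml2. The sketch below leaves out the parsing and only shows the separator round trip; the sample strings are made up for illustration.

```r
strings <- c("abcd", "&amp;", "x &lt; y")
sep <- "#_|"  # the separator used by unescape_html2

# Join everything into one string, as unescape_html2 does before parsing.
joined <- paste0(strings, collapse = sep)

# fixed = TRUE matters: read as a regular expression, "#_|" would mean
# "#_" OR the empty string, which would split between every character.
pieces <- strsplit(joined, sep, fixed = TRUE)[[1]]
identical(pieces, strings)

# Caveat: strsplit() drops a trailing empty piece, so an input whose
# last element is "" comes back one element shorter.
trailing <- strsplit(paste0(c("a", ""), collapse = sep), sep, fixed = TRUE)[[1]]
length(trailing)
```

The approach also assumes that no input string contains the separator itself.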

Third, I took the code from Christoph (here) and wrote an R wrapper for the C function. This function is xutils::html_decode.

Now let's compare their performance using the bench package.

bench::mark(
  html_decode(strings),
  unescape_html2(strings),
  textutils::HTMLdecode(strings)
)
#> # A tibble: 3 × 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 html_decode(strings)             5.01µs   5.65µs   173183.        0B     17.3
#> 2 unescape_html2(strings)         93.88µs  99.58µs     9751.     539KB     24.2
#> 3 textutils::HTMLdecode(strings)   5.84ms   5.97ms      167.     381KB     83.5

Clearly, html_decode is by far the fastest of the three, with a median time of 5.65µs against 99.58µs and 5.97ms for the other two.
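To put the gap in relative terms, the ratios can be computed from the median column of the table above. The snippet below is only that arithmetic on the reported numbers. It does not rerun the benchmark, and results on another machine will differ.

```r
# Median timings copied from the bench::mark output above, in microseconds.
median_us <- c(
  html_decode    = 5.65,
  unescape_html2 = 99.58,
  HTMLdecode     = 5.97 * 1000  # 5.97ms expressed in microseconds
)

# How many times slower each function is than html_decode.
slowdown <- round(median_us / median_us[["html_decode"]], 1)
slowdown
```

On these numbers, unescape_html2 is roughly 18 times slower than html_decode, and textutils::HTMLdecode is about 1,000 times slower.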

Note:

When checking the results, I discovered a bug in textutils::HTMLdecode and reported it here. The maintainer fixed it immediately. As of this writing (Feb. 16, 2021), the development version of textutils includes the fix, but the CRAN version may not. If you run the benchmark yourself with an older version of textutils, you may get an error; installing the development version resolves it.
