---
title: "Introduction"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This package collects some of the functions I have used when dealing with data.
```{r setup}
library(xutils)
```
# Text-related Functions
## `html_decode`
Currently, there is only one function, `html_decode`, which replaces HTML entities such as
`&amp;` with the characters they represent (here, `&`).
This function is a thin wrapper around C++ code provided by **Christoph**
on [Stack Overflow](https://stackoverflow.com/a/1082191/10437891).
### Example
For example:
```{r}
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
html_decode(strings)
```
It works very well!
### Comparison with Existing Solutions
To the best of my knowledge, there are already several solutions to this problem. So why
write a new function? Because of performance.
First, there is an existing package, `textutils`, which contains many functions for dealing with text.
The one of interest here is `HTMLdecode`.
Second, there is a function by Stack Overflow user **Stibu**
([here](https://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r/65909574#65909574))
that uses the `xml2` package.
The function is:
```{r}
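unescape_html2 <- function(str){
  # Wrap the joined strings in a dummy element so they parse as one HTML node
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  # Split the decoded text back into one element per input string
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}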
```
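The function above avoids parsing each string separately: it joins the whole vector with a separator, parses once, and splits the result. The following is a minimal sketch of just the join/split step in base R, without `xml2`. The input vector `x` is hypothetical, and the sketch assumes the separator `#_|` never occurs in the data.

```{r}
# Hypothetical input, used only to illustrate the join/split step
x <- c("a", "b|c", "d")
sep <- "#_|"

# Collapse the vector into a single string so it can be processed in one call
joined <- paste0(x, collapse = sep)

# fixed = TRUE matches the separator literally; otherwise "|" would be
# read as the regular-expression alternation operator
restored <- strsplit(joined, sep, fixed = TRUE)[[1]]
identical(restored, x)

# Caveat: strsplit() drops a trailing empty string, so the result can be
# shorter than the input when the last element is ""
with_empty <- c("a", "")
length(strsplit(paste0(with_empty, collapse = sep), sep, fixed = TRUE)[[1]])
```

This trade-off (one parse call in exchange for a reserved separator) is worth keeping in mind when comparing the approaches.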
Third, I took the code from **Christoph**
([here](https://stackoverflow.com/a/1082191/10437891))
and wrote an R wrapper around it.
This function is `xutils::html_decode`.
Now let's compare the performance of the three, using the `bench` package.
```{r}
bench::mark(
  html_decode(strings),
  unescape_html2(strings),
  textutils::HTMLdecode(strings)
)
```
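The benchmark above uses a vector of only four strings. As a rough sketch of how one might check whether the ranking holds for larger inputs, the chunk below repeats the same vector many times. The name `big_strings` and the repetition count are arbitrary choices for illustration, and the chunk is not evaluated here, so no timings are claimed.

```{r, eval = FALSE}
# Hypothetical larger input: the example vector repeated 1000 times
big_strings <- rep(strings, times = 1000)

bench::mark(
  html_decode(big_strings),
  unescape_html2(big_strings),
  textutils::HTMLdecode(big_strings),
  # check = TRUE (the default) makes bench::mark() stop with an error
  # if the expressions do not all return equal results
  check = TRUE,
  # report timings relative to the fastest expression
  relative = TRUE
)
```

Because `check = TRUE` compares the return values, a benchmark like this also doubles as a consistency test between the three implementations.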
In this benchmark, `html_decode` is clearly the fastest of the three.
**Note**:
While comparing the results, I discovered a bug in `textutils::HTMLdecode` and reported it
[here](https://github.com/enricoschumann/textutils/issues/3). The maintainer fixed it immediately.
As of this writing (Feb. 16, 2021), the development version of `textutils` has this bug fixed,
but the CRAN version may not. This means that if you run the benchmark yourself with an earlier version
of `textutils`, you may run into an error; installing the development version will resolve it.
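
One possible way to install the development version is sketched below. It assumes the `remotes` package is available and that the development version lives in the GitHub repository linked in the issue above; the chunk is not evaluated when the vignette is built.

```{r, eval = FALSE}
# Assumption: the 'remotes' package is used for installing from GitHub
install.packages("remotes")

# Install the development version of textutils from its GitHub repository
remotes::install_github("enricoschumann/textutils")

# Confirm which version is now installed
packageVersion("textutils")
```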