trinker/textclean

Convert Hexadecimal characters to correct character encoding

pgensler opened this issue · 2 comments

I have some data that has clearly been scraped from the web:

wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend

# Sample data
test <- "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend"

Part of the issue is that converting the hexadecimal representation of Unicode characters requires a function similar to this:

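library(XML)

# Parse the string as HTML so entities are decoded, then return the text of the first <p> node.
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}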

Use stringi to transliterate the accented characters to their ASCII equivalents (for example, â becomes a):
test <- stringi::stri_trans_general(test, "Latin-ASCII")

Should this be a feature of textclean, or is it a task better left to the XML package or stringi? So far I've been able to decode the characters with stringi, but I figured it might be better to have a wrapper function that does the above more cleanly.
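A minimal sketch of what such a wrapper might look like, assuming the scraped text contains numeric character references such as `&#226;` or `&#xE2;` (an assumption; the sample data above shows already-decoded characters). The name `decode_numeric_entities` is hypothetical and not part of textclean, XML, or stringi; the sketch uses base R only.

```r
# Hypothetical helper: decode decimal (&#226;) and hexadecimal (&#xE2;)
# numeric character references in a character vector. Base R only.
decode_numeric_entities <- function(x) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  m <- gregexpr(pattern, x, perl = TRUE)
  # Replace each matched reference with the character it encodes.
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    vapply(refs, function(ref) {
      # Strip the leading "&#" and the trailing ";".
      body <- substr(ref, 3L, nchar(ref) - 1L)
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2L), 16L)  # hexadecimal form
      } else {
        strtoi(body, 10L)                 # decimal form
      }
      intToUtf8(code)
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- decode_numeric_entities("1981 Ch&#226;teau de Beaucastel Ch&#xE2;teauneuf-du-Pape")
```

Named entities such as `&eacute;` are not handled by this sketch; the HTML-parsing approach above covers those. The result could then be passed to `stringi::stri_trans_general(decoded, "Latin-ASCII")` if plain ASCII is wanted.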

I've had some serious issues reading files that contain these characters into R: readr throws errors about the file's encoding not being UTF-8, so I'm not sure whether this is truly an R encoding issue or just something that is not well documented or well handled. Thanks for your package; I have really enjoyed working with it and with the documentation behind qdap.
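For the file-reading errors, one thing that might be worth checking is whether the file is in a legacy encoding rather than UTF-8. The sketch below assumes Latin-1 (an assumption; the actual encoding of the scraped file is not stated) and uses base R only. The temporary file is a stand-in for the real data.

```r
# Write the Latin-1 bytes for "Château" to a temporary file
# (a stand-in for the scraped data; 0xE2 is "â" in Latin-1).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x43, 0x68, 0xE2, 0x74, 0x65, 0x61, 0x75, 0x0A)), path)

# Read the lines without declaring an encoding, then check
# whether what came back is valid UTF-8.
raw_lines <- readLines(path, warn = FALSE)
is_utf8 <- validUTF8(raw_lines)

# Re-encode from the assumed source encoding to UTF-8.
fixed <- iconv(raw_lines, from = "latin1", to = "UTF-8")
unlink(path)
```

With readr, the equivalent step would be to declare the encoding when reading, for example `readr::read_lines(path, locale = readr::locale(encoding = "latin1"))`, again assuming the file really is Latin-1.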

I use stringi all the time. It's my go-to for cleaning. textclean doesn't replace stringi functionality; it complements it. That being said, I think it's important to make this relationship clear in the textclean documentation to avoid user frustration.

This is related: #18