Convert Hexadecimal characters to correct character encoding
pgensler opened this issue · 2 comments
I have some data that has clearly been scraped from the web:
wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
# Sample data
test <- "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend"
Part of the issue is that converting the hexadecimal (HTML entity) representation of Unicode characters requires a function similar to this:
library(XML)

decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
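A lighter-weight alternative (a sketch, assuming the xml2 package is available; the function name decode2 and the sample entity are hypothetical) is to let read_html() decode numeric character references such as "&#xE2;" while parsing and extract the text with xml_text():

```r
library(xml2)

# Wrap the raw text in a <p> tag so it parses as a fragment, then pull
# the decoded text back out. read_html() resolves numeric entities.
decode2 <- function(x) {
  xml_text(read_html(paste0("<p>", x, "</p>")))
}

decode2("Ch&#xE2;teau")
# "Château"
```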
Then use stringi to transliterate the characters to plain ASCII:
test <- stringi::stri_trans_general(test, "Latin-ASCII")
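For illustration (assuming stringi is installed), the "Latin-ASCII" transform strips diacritics from the sample wine name:

```r
library(stringi)

# Transliterate accented Latin characters to their plain-ASCII equivalents.
x <- "1981 Château de Beaucastel Châteauneuf-du-Pape"
stri_trans_general(x, "Latin-ASCII")
# "1981 Chateau de Beaucastel Chateauneuf-du-Pape"
```

Note this is lossy: the accents are discarded, which is fine for matching or deduplication but not if the original spelling must be preserved.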
Should this be a feature included in textclean, or is this a task better suited to the XML package/stringi? Currently I've been able to decode the characters with stringi, but figured it may be better to have a wrapper function that does the above in a cleaner fashion.
I've had some serious issues reading files with these characters into R, as readr throws errors when the file's encoding is not UTF-8, so I'm not sure whether this is truly an R encoding issue or just not well documented/handled. Thanks for your package; I have really enjoyed working with it and the documentation behind qdap.
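One way around the readr errors (a sketch, assuming readr is installed; the temporary file stands in for your scraped data file) is to declare the file's actual encoding via locale(), so readr re-encodes to UTF-8 on read instead of choking on non-UTF-8 bytes:

```r
library(readr)

# Write a small Latin-1 encoded sample file to a temporary path.
path <- tempfile(fileext = ".txt")
con <- file(path, encoding = "latin1")
writeLines("wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape", con)
close(con)

# Declaring the encoding lets readr convert the text to UTF-8 on read;
# without it, readr assumes UTF-8 and the accented bytes are invalid.
read_lines(path, locale = locale(encoding = "Latin1"))

# When the encoding is unknown, guess_encoding() suggests likely candidates.
guess_encoding(path)
```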
I use stringi all the time; it's my go-to for cleaning. textclean doesn't replace stringi's functionality, it complements it. That said, I think it's important to make this relationship clear in the textclean documentation to avoid user frustration.