Characters in text files are valid according to declared encoding
jeanetteclark opened this issue · 0 comments
Status : ⌛ Not Started
Description
Check for text values within the correct ranges for declared encoding.
e.g., ASCII files contain only bytes in the range \x00 to \x7F (ASCII is a 7-bit encoding; bytes \x80 to \xFF are not ASCII)
e.g., UTF-8 encoded text files contain only well-formed UTF-8 byte sequences, i.e., sequences that decode to code points in U+0000 to U+10FFFF, excluding the surrogate range U+D800 to U+DFFF
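For the ASCII case, the check reduces to confirming that no byte in the file exceeds \x7F. A minimal sketch in base R, assuming the file is small enough to read into memory; `is_ascii_file` is a hypothetical helper name, not part of any existing check suite:

```r
# Hypothetical helper: TRUE if every byte of the file is 7-bit ASCII (0x00 to 0x7F).
is_ascii_file <- function(path) {
  # Read the whole file as raw bytes so no encoding conversion is applied.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  all(as.integer(bytes) <= 0x7F)
}

# Example: one pure ASCII file, one file containing the Latin-1 byte 0xE9.
ascii_file <- tempfile()
writeBin(charToRaw("plain text\n"), ascii_file)
latin1_file <- tempfile()
writeBin(c(charToRaw("caf"), as.raw(0xE9)), latin1_file)

is_ascii_file(ascii_file)   # TRUE
is_ascii_file(latin1_file)  # FALSE
```

Working on raw bytes avoids any dependence on the session locale, which seems safer than reading the file as text first.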
Priority
- Data Quality: Required
Issues
- Most files don't have a declared encoding? So I'm not sure how we would check for this other than assuming most things we see are UTF-8 (or maybe ASCII??) unless declared otherwise. Thoughts @mbjones?
Procedure
- in R, we could use validUTF8(), which tests whether each element of a character vector is valid UTF-8
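A rough sketch of how that could look, assuming we fall back to UTF-8 when no encoding is declared (as suggested under Issues). `invalid_utf8_lines` is a hypothetical helper name; only `readLines()`, `validUTF8()`, and `which()` are actual base R functions:

```r
# Hypothetical helper: return the line numbers of a text file that are not valid UTF-8.
# Assumption: the file is treated as UTF-8 because no other encoding is declared.
invalid_utf8_lines <- function(path) {
  # readLines() returns the bytes of each line without validating them.
  lines <- readLines(path, warn = FALSE)
  # validUTF8() returns one logical per line; keep the indices that fail.
  which(!validUTF8(lines))
}

# Example: the second line contains a lone Latin-1 byte (0xE9), which is not valid UTF-8.
mixed_file <- tempfile()
writeBin(c(charToRaw("ok\ncaf"), as.raw(0xE9), charToRaw("\nfine\n")), mixed_file)

invalid_utf8_lines(mixed_file)  # 2
```

Returning the offending line numbers rather than a single TRUE/FALSE would let the check report where the problem is. An empty result means the file passed.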