NCEAS/metadig-checks

Characters in text files are valid according to declared encoding

jeanetteclark opened this issue · 0 comments

Status : ⌛ Not Started

Description

Check that text values fall within the valid ranges for the declared encoding.

e.g., ASCII files only contain bytes in the range \x00 to \x7F (ASCII is a 7-bit encoding)
e.g., Unicode-encoded text files only contain byte sequences that are well-formed for the declared encoding (e.g., UTF-8)
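
A minimal sketch of the two example checks, written in Python for illustration. The function names `is_valid_ascii` and `is_valid_utf8` are hypothetical and not part of metadig; the sketch assumes the file contents are already available as raw bytes.

```python
def is_valid_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range 0x00-0x7F."""
    return all(b <= 0x7F for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        # Strict decoding raises on any malformed byte sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Every valid ASCII file is also valid UTF-8, so a file that passes the first check always passes the second.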

Priority

  • Data Quality: Required

Issues

  • Most files don't have a declared encoding, so I'm not sure how we would check this other than by assuming that everything we see is UTF-8 (or maybe ASCII?) unless declared otherwise. Thoughts, @mbjones?

Procedure

  • in R, we could use validUTF8(), which reports whether each element of a character vector is valid UTF-8
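
For comparison, the same idea could be sketched in Python. This is only an illustration under stated assumptions: the function name `first_invalid_offset` is hypothetical, and the fallback to UTF-8 when no encoding is declared reflects the open question raised under Issues, not a decision.

```python
from typing import Optional


def first_invalid_offset(data: bytes,
                         declared_encoding: Optional[str] = None) -> Optional[int]:
    """Return the byte offset of the first invalid sequence, or None if valid.

    Assumption: when no encoding is declared, fall back to UTF-8.
    An unrecognized encoding name raises LookupError.
    """
    encoding = declared_encoding or "utf-8"
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None
```

Reporting the offset rather than a plain pass/fail lets a check point the user at the offending byte. Note that single-byte encodings such as Latin-1 accept every byte value, so this kind of check can only fail for encodings that have invalid sequences (ASCII, UTF-8, UTF-16, and similar).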