Truss Interview: CSV Normalization

This is my response to the CSV normalization problem.

How to Run

I would prefer this to be run on macOS 11.2+; I developed it on macOS 10.15.7. This program requires Python 3.7. Depending on your setup, your default version of Python may be Python 2.7, which has been deprecated (so hopefully it is not your default version), or Python 3.7. You can determine which version of Python you use by opening Terminal and running python --version. If you find yourself on Python 2.7, please substitute python3.7 for python and pip3.7 for pip in the following instructions (or however you invoke Python 3.7 — you may very well not be using the laptop you got before starting college, and have proper Python version switching configured).

  1. Make sure you have Python 3.7 installed.
  2. Install pytz, a library which handles timezone calculations, if you do not have it already. You may do so by running: pip install pytz
  3. Install rfc3339 by running: pip install rfc3339
  4. Run the program with the following command: python normalizer.py < [input CSV file].csv > [output CSV file].csv

In an ideal world, I would have created an easily replicable environment (perhaps using Docker) in which pytz, rfc3339, and Python 3.7 were installed.

The file output.csv in this directory was generated by running python normalizer.py < sample-with-broken-utf8.csv > output.csv

Comments, Admissions of Inadequacy, and Future Rabbit Holes

Timestamps

There is a lot I do not know about time formats. There was a point at which I naively thought that .isoformat() would be sufficient for converting the timestamp into RFC 3339, but this fun Venn diagram showed me why that would be a mistake. I had two options at this point:

  1. Learn everything there is to know about RFC 3339, or at least do the hacky thing for the sake of the exercise and check all of the time formats and their ISO conversions for RFC 3339 compliance.
  2. Use a library built by someone who had already done that.

After beginning to read this document, I chose the latter. Had I more time, I would more properly vet the rfc3339 library and its developer "henry", learn more, and perhaps go in a different direction.

There is no shortage of rabbit holes I could go down when it comes to time. While I was directed to assume that the source timestamp was in Pacific Time, I could have used the date to determine whether Daylight Saving Time applied. Furthermore, I could have used the zip code column (if provided) to determine whether the presumed submitter of the row's data hailed from Hawaii or Arizona, where Daylight Saving Time is not observed.
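As a sketch of why pytz earns its keep here: localize() resolves from the date itself whether Daylight Saving Time applies, so the same wall-clock time gets a different offset in winter than in summer. The dates below are hypothetical examples, not rows from the sample data.

```python
from datetime import datetime

import pytz

pacific = pytz.timezone("US/Pacific")

# Same wall-clock time on two dates: pytz picks PST in January, PDT in July.
winter = pacific.localize(datetime(2011, 1, 15, 11, 0))
summer = pacific.localize(datetime(2011, 7, 15, 11, 0))

print(winter.strftime("%Z %z"))  # PST -0800
print(summer.strftime("%Z %z"))  # PDT -0700
```

This is also why localize() is preferred over passing tzinfo= directly to the datetime constructor, which would pin a single fixed offset regardless of the date.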

UTF-8 Error Checking

I wrote this in Python 3, which mercifully uses UTF-8 by default. Python 3.7 has some extra perks, namely allowing engineers to reconfigure the encoding of the std* wrappers. A quick look at the Python 3.7 library docs reveals that I could set errors=, which I gathered (and then confirmed) meant I could choose "replace", which swaps invalid UTF-8 byte sequences for the Unicode Replacement Character.
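A minimal sketch of that reconfiguration (Python 3.7+ only; guarded with hasattr here so it degrades gracefully on older interpreters or replaced streams):

```python
import sys

# TextIOWrapper.reconfigure() (new in 3.7) lets us swap the error handler
# on the already-open stdin, so invalid bytes decode to U+FFFD ('\ufffd')
# instead of crashing the CSV reader with UnicodeDecodeError.
if hasattr(sys.stdin, "reconfigure"):
    sys.stdin.reconfigure(encoding="utf-8", errors="replace")
```

With errors="replace" in effect, a byte sequence like b"caf\xe9" (Latin-1, not valid UTF-8) reads back as "caf\ufffd" rather than raising.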

Now, it isn't always feasible to assume that the world runs on Python 3.7. If I had to assume Python 3.6 or lower, I would have gone back to the drawing board. Here's the StackOverflow post from which I stole my stdin reconfigure technique. While I didn't have time to fully research it, I might poke around the suggestion of wrapping the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding (which looks like it should work in theory), or take the filename in as an argument and open the file directly rather than reading it through sys.stdin.
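That 3.6-compatible fallback might be sketched as follows (the function name is mine, not from the program; it wraps any binary stream, defaulting to sys.stdin.buffer):

```python
import io
import sys


def open_utf8_lenient(stream=None):
    """Wrap a binary stream so invalid UTF-8 bytes decode to U+FFFD
    instead of raising UnicodeDecodeError. Works on Python 3.6 and below,
    where sys.stdin.reconfigure() is unavailable."""
    raw = stream if stream is not None else sys.stdin.buffer
    return io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
```

In the program this would replace reads from sys.stdin with reads from open_utf8_lenient(); passing an io.BytesIO makes it easy to unit-test the decoding behavior without touching real stdin.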

Things That Will Break This (An Incomplete List)/Things I Would Do With Unlimited Time

  • This program assumes a CSV file which contains precisely the column names listed in the prompt (case-sensitive), or at least the ones which require explicit processing. The columns do not, however, have to be in a particular order. On the flip side, this is perhaps too lenient in that it allows for the exclusion of columns which do not require explicit processing (e.g. "Address", "Notes") and the addition of other columns (e.g. "Icecream"). I opted for more rather than less flexibility, but could adapt it to be stricter.
  • Had I had more time, I would have done more rigorous testing on my timezone conversion. In general, in "real life" I probably would have started by writing tests, with very specific test cases for each column type.
  • I don't do nearly as much error checking as I should have, and I did not customize messages based on the part of the program (and the specific data issue) that caused the error. This is partially due to time constraints; as with the note above, I would usually not save error handling for last.
  • In a real-life program, it would be better to hardcode less. For instance, I'd probably pass the timezones into normalize_timestamp() rather than fixing them inside it, and read them from configuration or arguments somewhere.
  • This readme is more of a ramble than it is proper documentation.
  • There are likely many more issues which I will realize only after submitting this.
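The parameterization mentioned in the list above might look like the following sketch. The function name comes from the list; the default zones and the timestamp format string are assumptions I've made for illustration, not necessarily what the program uses.

```python
from datetime import datetime

import pytz


def normalize_timestamp(raw, src_tz=pytz.timezone("US/Pacific"),
                        dst_tz=pytz.timezone("US/Eastern"),
                        fmt="%m/%d/%y %I:%M:%S %p"):
    """Parse a naive timestamp string, attach the source zone (DST-aware
    via pytz.localize), and convert to the destination zone. Zones and
    format are parameters instead of hardcoded constants."""
    naive = datetime.strptime(raw, fmt)
    return src_tz.localize(naive).astimezone(dst_tz)
```

Callers could then inject zones from configuration, and tests could pin both zones to UTC to make expected values trivial to compute.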