Truss Interview: CSV Normalization

This is my response to the CSV normalization problem.

How to Run

I would prefer this to be run on macOS 11.2+; I developed it on macOS 10.15.7. This program requires Python 3.7. Depending on your setup, your default version of Python may be Python 2.7, which has been deprecated (so hopefully it is not your default version), or Python 3.7. You can determine which version of Python you use by opening Terminal and running python --version. If you find yourself on Python 2.7, please substitute python3.7 for python and pip3.7 for pip in the following instructions (or however you invoke Python 3.7 — you may very well not be using the laptop you got before starting college, and have proper Python version switching configured).

  1. Make sure you have Python 3.7 installed.
  2. Install pytz, a library which handles timezone calculations, if you do not have it already. You may do so by running: pip install pytz
  3. Install rfc3339 by running: pip install rfc3339
  4. Run the program with the following command: python normalizer.py < [input CSV file].csv > [output CSV file].csv

In an ideal world, I would have created an easily replicable environment (perhaps using Docker) in which pytz, rfc3339, and Python 3.7 were installed.

The file output.csv in this directory was generated by running python normalizer.py < sample-with-broken-utf8.csv > output.csv

Comments, Admissions of Inadequacy, and Future Rabbit Holes

Timestamps

There is a lot I do not know about time formats. There was a point at which I naively thought that .isoformat() would be sufficient for converting the timestamp into RFC 3339, but this fun Venn diagram showed me why that would be a mistake. I had two options at this point:

  1. Learn everything there is to know about RFC 3339, or at least do the hacky thing for the sake of the exercise and check all of the time formats and their ISO conversions for RFC 3339 compliance.
  2. Use a library built by someone who had already done that.

After beginning to read this document, I chose the latter. Had I more time, I would more properly vet the rfc3339 library and its developer "henry", learn more, and perhaps go in a different direction.

There is no shortage of rabbit holes I could go down when it comes to time. While I was directed to assume that the source timestamp was in Pacific Time, I could have used the date to determine whether Daylight Saving Time applied. Furthermore, I could have used the zip code column (if provided) to determine whether the presumed submitter of the row's data hailed from Hawaii or Arizona, where Daylight Saving Time is not observed.
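As a sketch of why pytz earns its keep here: localize() resolves from the date itself whether Daylight Saving Time applies, so the same wall-clock time gets a different offset in winter than in summer. The dates below are hypothetical examples, not rows from the sample data.

```python
from datetime import datetime

import pytz

pacific = pytz.timezone("US/Pacific")

# Same wall-clock time on two dates: pytz picks PST in January, PDT in July.
winter = pacific.localize(datetime(2011, 1, 15, 11, 0))
summer = pacific.localize(datetime(2011, 7, 15, 11, 0))

print(winter.strftime("%Z %z"))  # PST -0800
print(summer.strftime("%Z %z"))  # PDT -0700
```

This is also why localize() is preferred over passing tzinfo= directly to the datetime constructor, which would pin a single fixed offset regardless of the date.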

UTF-8 Error Checking

I wrote this in Python 3, which mercifully uses UTF-8 by default. Python 3.7 has some extra perks, namely allowing engineers to reconfigure the encoding of the std* wrappers. A quick look at the Python 3.7 library docs reveals that I could set errors=, which I gathered (and then confirmed) meant I could choose "replace", which swaps invalid UTF-8 byte sequences for the Unicode Replacement Character.
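A minimal sketch of that reconfiguration (Python 3.7+ only; guarded with hasattr here so it degrades gracefully on older interpreters or replaced streams):

```python
import sys

# TextIOWrapper.reconfigure() (new in 3.7) lets us swap the error handler
# on the already-open stdin, so invalid bytes decode to U+FFFD ('\ufffd')
# instead of crashing the CSV reader with UnicodeDecodeError.
if hasattr(sys.stdin, "reconfigure"):
    sys.stdin.reconfigure(encoding="utf-8", errors="replace")
```

With errors="replace" in effect, a byte sequence like b"caf\xe9" (Latin-1, not valid UTF-8) reads back as "caf\ufffd" rather than raising.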

Now, it isn't always feasible to assume that the world runs on Python 3.7. If I had to assume Python 3.6 or lower, I would have gone back to the drawing board. Here's the StackOverflow post from which I stole my stdin reconfigure technique. While I didn't have time to fully research it, I might poke around the suggestion of wrapping the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding (which looks like it should work in theory), or take the filename in as an argument and open the file directly rather than reading it through sys.stdin.
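That 3.6-compatible fallback might be sketched as follows (the function name is mine, not from the program; it wraps any binary stream, defaulting to sys.stdin.buffer):

```python
import io
import sys


def open_utf8_lenient(stream=None):
    """Wrap a binary stream so invalid UTF-8 bytes decode to U+FFFD
    instead of raising UnicodeDecodeError. Works on Python 3.6 and below,
    where sys.stdin.reconfigure() is unavailable."""
    raw = stream if stream is not None else sys.stdin.buffer
    return io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
```

In the program this would replace reads from sys.stdin with reads from open_utf8_lenient(); passing an io.BytesIO makes it easy to unit-test the decoding behavior without touching real stdin.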

Things That Will Break This (An Incomplete List)/Things I Would Do With Unlimited Time

  • This program assumes a CSV file which contains precisely the column names listed in the prompt (case-sensitive), or at least the ones which require explicit processing. The columns do not, however, have to be in a particular order. On the flip side, this is perhaps too lenient in that it allows for the exclusion of columns which do not require explicit processing (e.g. "Address", "Notes") and the addition of other columns (e.g. "Icecream"). I opted for more rather than less flexibility, but could adapt it to be stricter.
  • Had I had more time, I would have done more rigorous testing on my timezone conversion. In general, in "real life" I probably would have started by writing tests, with very specific test cases for each column type.
  • I don't do nearly as much error checking as I should have, and I did not customize messages based on the part of the program (and the specific data issue) that caused the error. This is partially due to time constraints; as with the note above, I would usually not save error handling for last.
  • In a real-life program, it would be better to hardcode less. For instance, I'd probably pass the timezones into normalize_timestamp() rather than fixing them inside it, and read them from configuration or arguments somewhere.
  • This readme is more of a ramble than it is proper documentation.
  • There are likely many more issues which I will realize only after submitting this.
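The parameterization mentioned in the list above might look like the following sketch. The function name comes from the list; the default zones and the timestamp format string are assumptions I've made for illustration, not necessarily what the program uses.

```python
from datetime import datetime

import pytz


def normalize_timestamp(raw, src_tz=pytz.timezone("US/Pacific"),
                        dst_tz=pytz.timezone("US/Eastern"),
                        fmt="%m/%d/%y %I:%M:%S %p"):
    """Parse a naive timestamp string, attach the source zone (DST-aware
    via pytz.localize), and convert to the destination zone. Zones and
    format are parameters instead of hardcoded constants."""
    naive = datetime.strptime(raw, fmt)
    return src_tz.localize(naive).astimezone(dst_tz)
```

Callers could then inject zones from configuration, and tests could pin both zones to UTC to make expected values trivial to compute.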