The 'home-task' project is the REST API service. It accesses the DB and serves the clients.

The 'csv-monitor' project monitors the players.csv file. After the file is saved, it checks for changes and sends a message per new/modified player, over RabbitMQ, to the 'home-task' service (see "S3 file watcher" below).

The flow for detecting whether a player is new or modified: when the service starts, I check whether a db_state.json file exists. For each player, db_state.json maps the playerID to the hash of that player's line in the file. If db_state.json doesn't exist, or if its last modified time is earlier than that of players.csv, I read players.csv line by line; for each line I compare its SHA-256 hash with the hash stored in db_state.json for the same playerID (if one exists), and if they differ I add/update the hash in the file. If db_state.json does exist and its last modified time is newer, I read it and keep it as a Map in memory. I use a WatchService to be notified when players.csv is modified; on each notification I scan the file again, compare the hashes, and send queue messages for the players that changed.
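
A minimal sketch of that change-detection step, assuming Java 17 and illustrative names (`CsvChangeDetector` and `publishPlayerChanged` are not the real classes/methods); loading/persisting db_state.json and the WatchService loop are omitted:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the change-detection step: compare the SHA-256 of each
// CSV line with the hash stored for the same playerID (loaded from db_state.json).
public class CsvChangeDetector {

    // playerID -> SHA-256 hex of that player's line, as last persisted to db_state.json
    private final Map<String, String> state = new HashMap<>();

    public void scanCsv(Path playersCsv) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<String> lines = Files.readAllLines(playersCsv, StandardCharsets.UTF_8);
        for (int i = 1; i < lines.size(); i++) {                     // skip the header line
            String line = lines.get(i);
            String playerId = line.substring(0, line.indexOf(','));  // first column is playerID
            String hash = HexFormat.of().formatHex(
                    sha256.digest(line.getBytes(StandardCharsets.UTF_8)));
            if (!hash.equals(state.get(playerId))) {
                state.put(playerId, hash);               // remember the new hash
                publishPlayerChanged(playerId, line);    // send the new/modified player over RabbitMQ
            }
        }
        // ...then persist `state` back to db_state.json for the next run / restart
    }

    private void publishPlayerChanged(String playerId, String line) {
        // placeholder: the real code builds a message and publishes it to the 'home-task' queue
    }
}
```
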
assumptions:

  • As long as we deal with a file of baseball players, we can assume it will contain a reasonable number of rows.
  • Almost half of the players have died and many have retired, so it makes sense not to expect a large number of updates (also judging by the nature of the fields, although new ones can be added).
  • We might want to search or filter by specific columns, so each player is stored as a document, with each column in the file corresponding to a field in the object and the playerID used as the _id field in MongoDB (see the sketch after this list).
  • Because the task was to serve the file's content, there might be a need to keep serving the content even as it changes. I assume the file is where the content is managed, and there's no other interface for managing the players' data.
  • I assume the data in the file can be handled as is, with no need for modification or validation (for example, rejecting a future or negative birth year, or a month/day out of range).
  • Because each player's data should not change often, it makes sense to cache it, but because I'm not sure whether the DB will be changed by other sources, I add a TTL to the cache.
  • No authentication is needed - everyone can access the API.
  • No proxy server or load balancer is used. No throttling, rate limiting or DDoS protection.
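
As a sketch of the document-per-player assumption, a hypothetical mapping might look like the following (assuming the project uses Spring Data MongoDB; the field names are only examples, not the real schema):

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;

// Hypothetical mapping: each CSV column becomes a field, and the playerID
// column is reused as Mongo's _id.
@Document(collection = "players")
public class Player {

    @Id
    private String playerId;     // stored as _id

    private Integer birthYear;
    private Integer birthMonth;
    private Integer birthDay;
    private String nameFirst;
    private String nameLast;
    // ...one field per remaining CSV column

    // getters/setters omitted for brevity
}
```
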

not covered:

  • Synchronizing the DB with the in-memory state and the CSV file - Although I assumed the Excel file is the interface for managing the data, I don't know whether other systems/people are going to alter the database content. I could add an "updated_time" field to each document that is automatically set whenever the document is inserted or updated; then, when the system loads, it would sync with the documents updated since the last sync (the last sync time is also saved in the DB, so the same clock is used).
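
A rough sketch of that catch-up query, assuming the synchronous MongoDB Java driver and a hypothetical `updated_time` date field:

```java
import static com.mongodb.client.model.Filters.gt;

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Date;

// Hypothetical catch-up sync: read only the documents touched since the last sync.
public class DbSync {

    public void syncSince(MongoCollection<Document> players, Date lastSyncTime) {
        for (Document doc : players.find(gt("updated_time", lastSyncTime))) {
            // refresh the in-memory state / db_state.json entry for this player
        }
        // afterwards, save the new sync time in the DB so the same clock is used next time
    }
}
```
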

  • load testing - Didn't test how many requests per second can be served, with different cache hit/miss ratios.

  • security - Didn't create a RabbitMQ user with non-default username and password.

  • file synchronization edge cases - An update fails, then another update for the same player succeeds, and then the first one succeeds, leaving stale data. If the file is large, it will be locked for a while, preventing users from updating it; this happens every time the file is saved, and can happen a lot with Excel's 'auto save' feature. I also use the modified time of a local file and compare it to the modified time of an S3 file.

  • failures and shut-downs - I didn't test scenarios where a service shuts down, or there is no internet connection, at different times (on startup, when reading the file, when updating the DB, etc.). Also, if updating the

  • using shared classes - The same DTO class and enum are copied into both projects instead of being extracted into a shared library and used from there.

  • large files - Not tested on very large files, or on heavy-write scenarios where many players are updated or added to the file at once. This can slow down response times for the endpoints.

  • pagination - Didn't have time to add pagination to the reactive implementation.

  • batch updates - The current implementation sends one queue message per player that needs to be updated. The main issue is that the DB is updated once per document instead of once per batch of documents (a possible alternative is sketched below).
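
A sketch of what the batched version could look like with the MongoDB Java driver's bulkWrite (collection handling and batching of the queue messages are assumed, not taken from the current code):

```java
import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

// Hypothetical batch upsert: one bulkWrite per batch of changed players
// instead of one DB round trip per queue message.
public class PlayerBatchWriter {

    public void upsertBatch(MongoCollection<Document> players, List<Document> changedPlayers) {
        List<WriteModel<Document>> ops = new ArrayList<>();
        for (Document doc : changedPlayers) {
            ops.add(new ReplaceOneModel<>(
                    eq("_id", doc.get("_id")),
                    doc,
                    new ReplaceOptions().upsert(true)));   // insert new players, replace modified ones
        }
        if (!ops.isEmpty()) {
            players.bulkWrite(ops);
        }
    }
}
```
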

  • using a csv reader package - The CSV format is simple, so just splitting each line and accessing the fields by their index seemed to be enough.
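
Roughly, the parsing is just the sketch below (column positions assumed); a quoted field containing an embedded comma would break it, which is when a CSV reader library would pay off:

```java
// Illustrative index-based parsing of a players.csv line.
public final class PlayerLine {

    public static String playerId(String line) {
        String[] fields = line.split(",", -1);   // -1 keeps trailing empty fields
        return fields[0];                        // first column is playerID
    }
}
```
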

  • IDE warnings - Some warnings remain that there was no time left to handle.

  • logs and alerts - I don't write logs to a service such as AWS CloudWatch, and I'm not notified when an error occurs.

  • memory and cpu testing, monitoring and alerting - I didn't check the CPU and memory consumption, especially for the file watcher, the synchronization with the DB, and the "get all players" endpoint.

  • scaling - I didn't try to run the REST service on multiple machines, or test the service with multiple clients.

  • S3 file watcher - I use a WatchService to monitor the players file when the file is local. The local-file implementation is commented out in the code, and instead I use an S3 file. For monitoring the S3 file, I could use S3 bucket event notifications, where the event destination is an SQS queue or a Lambda function that forwards the event to our RabbitMQ queue (see the sketch below).
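
A rough sketch of the Lambda option, assuming the AWS Lambda Java events library and the RabbitMQ Java client; the queue name, environment variable, and message format are made up for illustration:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

// Hypothetical bridge: an S3 bucket notification triggers this Lambda, which forwards
// the "players.csv changed" event to the RabbitMQ queue the csv-monitor flow already uses.
public class S3ChangeForwarder implements RequestHandler<S3Event, Void> {

    @Override
    public Void handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String key = record.getS3().getObject().getKey();   // e.g. "players.csv"
            publish("players.csv.changed:" + key);
        });
        return null;
    }

    private void publish(String message) {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(System.getenv("RABBIT_HOST"));           // assumed env var
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.queueDeclare("players-changed", true, false, false, null);  // illustrative queue
            channel.basicPublish("", "players-changed", null,
                    message.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```
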

  • names - I guess the names of the projects could be better.