Implementation of Siden Fullstack Coding Task
We have a file (potentially very large, up to 100 GB) that contains many words, with each word on a new line (file sample - https://drive.google.com/file/d/1mmFQVmWJT4entvG5OVOSxcG_uT9XUSAa/view?usp=sharing)
- Implement "PUT" HTTP endpoint which will save/replace file on the disk
- Implement "GET "HTTP endpoint which will return "processed" file without word duplicates
- Implementation must run in 8GB of RAM. Assume that you will not have enough memory to fit the entirety of the file in memory
I chose to use a simple MySQL database to store all the words of the file, together with a streaming mechanism so that neither the PUT nor the GET request requires much memory, regardless of file size. When I first read the prompt, my initial thought was to use some sort of map. However, given that the file may be up to 100 GB and there is no restriction on the words, the original file could consist entirely of unique words, making the processed file also 100 GB in size. Although a distributed map would be possible to implement, I felt that a simple MySQL database gets the job done, as it allows batch queries.
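As a rough illustration of this idea, deduplication can be delegated to the database by making the word column the primary key and inserting batches with `INSERT IGNORE`. The sketch below assumes a local MySQL instance, a `words` table, and the `mysql-connector-python` driver; the table name, column name, and credentials are placeholders rather than the actual implementation.

```python
# Sketch: deduplication via the database (assumes mysql-connector-python and a local MySQL instance).
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="words_db"  # placeholder credentials
)
cur = conn.cursor()

# The word itself is the primary key, so the database rejects duplicate rows.
cur.execute("CREATE TABLE IF NOT EXISTS words (word VARCHAR(255) PRIMARY KEY)")

def insert_batch(batch):
    """Insert a batch of words; INSERT IGNORE silently skips words that are already stored."""
    cur.executemany("INSERT IGNORE INTO words (word) VALUES (%s)", [(w,) for w in batch])
    conn.commit()

insert_batch(["apple", "banana", "apple"])  # only one "apple" row ends up in the table
```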
Given that the uploaded file can be up to 100 GB, the PUT request does not take the file itself as a parameter, but rather a URL pointing to the file. This way the file can be streamed in smaller chunks. The chunk size I chose for now is 1024 words, although this can easily be changed in the current implementation, so the system only needs to hold at most 1024 words in memory at any given time. The GET request uses a similar batch architecture: the query loads at most 1024 entries into memory at a time and streams them over HTTP to the end user.
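The streaming on both sides might look roughly like the sketch below. It assumes a Flask app, the `requests` library for pulling the file from the supplied URL, and the `insert_batch` helper and `words` table from the previous sketch; `BATCH_SIZE`, the endpoint path, and the query parameter name are illustrative assumptions, not the actual code.

```python
# Sketch: streaming PUT and GET handlers (assumes Flask, requests, and the table/helper above).
import requests
from flask import Flask, Response, request

app = Flask(__name__)
BATCH_SIZE = 1024  # maximum number of words held in memory at any given time

@app.route("/file", methods=["PUT"])
def put_file():
    url = request.args["url"]  # the client sends a URL, not the file itself
    batch = []
    with requests.get(url, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):  # one word per line
            if line:
                batch.append(line)
            if len(batch) >= BATCH_SIZE:
                insert_batch(batch)  # INSERT IGNORE drops duplicates
                batch = []
    if batch:
        insert_batch(batch)
    return "", 204

@app.route("/file", methods=["GET"])
def get_file():
    def generate():
        cur.execute("SELECT word FROM words")
        while True:
            rows = cur.fetchmany(BATCH_SIZE)  # at most 1024 rows in memory at a time
            if not rows:
                break
            yield "\n".join(row[0] for row in rows) + "\n"
    return Response(generate(), mimetype="text/plain")
```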
The runtime complexity of the PUT is O(n), where n is the total number of words in the uploaded file, including duplicates. Each word is processed only once.
The space complexity is O(m), where m is the total number of unique words; only unique words are persisted in the database. Thanks to the batch processing, memory usage stays bounded by the batch size rather than growing with the file.
- the requirements only asked for PUT and GET, but a POST that merges additional files into the existing word set could easily have been added.
- although I did not test this on an 8 GB machine, this method should work, albeit slowly, for a file that is 100 GB in size.
- additional constraints could be placed on the MySQL buffers themselves to ensure the database never uses too much memory.
- a NoSQL database could also be used, but given that the structure is quite simple and joins of any kind would never be required, a SQL database handles the job well.
- in a production setup I would have stress-tested the system, but for now I only tested the basics in the test directory; a minimal example of such a check is sketched below.
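A basic end-to-end check might look like the following sketch. It assumes the server runs locally on port 5000 with the hypothetical /file endpoints from the earlier sketch and uses the `requests` library; the sample-file URL is a placeholder, not part of the actual test suite.

```python
# Sketch: a basic end-to-end check (assumes the server runs locally and the requests library).
import requests

BASE = "http://localhost:5000/file"           # hypothetical endpoint from the sketch above
SAMPLE_URL = "https://example.com/words.txt"  # placeholder URL of a small word file

def test_put_then_get_has_no_duplicates():
    # Upload by URL, then download the processed file and verify it contains no duplicates.
    assert requests.put(BASE, params={"url": SAMPLE_URL}).status_code == 204
    words = requests.get(BASE).text.splitlines()
    assert len(words) == len(set(words))
```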