A company has a business process that requires processing of text files from external customers. Some information about the text files:
- Each line of the text file is an individual record and can be processed independently of the other records in the text file
- The size of the file can range from a few KB to a few GB
- The lines are ordered by the Date field in ascending order.
Record format (one record per line):
[Date in YYYY-MM-DDThh:mm:ssZ format][space][Email Address][space][Session Id in GUID format]
Example:
2020-12-04T11:14:23Z jasdfne.dsdfoe@esdfmsdfail.com 3f30eb2b-a637-4c93-a633-b4567bc37457
Write an application to parse the text files above and expose an HTTP API to serve the data contained in the text files.
- Expose an HTTP POST route on the / path
- Accept ISO 8601 UTC timestamps only
- The result array must be ordered by eventTime from earliest to latest
- The ordering of the keys within the JSON objects does not matter
Example request:
```
curl -XPOST localhost:8279/ -d '{"filename":"sample1.txt", "from":"2020-07-06T23:00:00Z", "to":"2021-07-06T23:00:00Z"}'
```
Example response:
```json
[
  {
    "eventTime": "2000-01-01T03:05:58Z",
    "email": "test123@test.com",
    "sessionId": "97994694-ea5c-4da7-a4bb-d6423321ccd0"
  },
  {
    "eventTime": "2000-01-01T04:05:58Z",
    "email": "test456@test.com",
    "sessionId": "97994694-ea5c-4da7-a4bb-d6423321ccd1"
  }
]
```
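As an illustration of the contract above, here is a minimal controller sketch. findSessionsInTimeRange() does exist in this project (in SessionDataRepositoryImpl); the controller, request, and record names below are illustrative and may not match the actual code:
```java
import java.time.Instant;
import java.util.List;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SessionQueryController {

    // Illustrative request/response shapes matching the examples above.
    record QueryRequest(String filename, String from, String to) {}
    record SessionRecord(String eventTime, String email, String sessionId) {}

    // Assumed repository interface; the project's SessionDataRepositoryImpl
    // provides findSessionsInTimeRange().
    interface SessionDataRepository {
        List<SessionRecord> findSessionsInTimeRange(Instant from, Instant to);
    }

    private final SessionDataRepository repository;

    public SessionQueryController(SessionDataRepository repository) {
        this.repository = repository;
    }

    @PostMapping("/")
    public List<SessionRecord> query(@RequestBody QueryRequest request) {
        // Instant.parse() accepts only valid ISO 8601 UTC instants such as
        // 2020-07-06T23:00:00Z and throws DateTimeParseException otherwise.
        Instant from = Instant.parse(request.from());
        Instant to = Instant.parse(request.to());
        // The repository is expected to return rows already ordered by
        // eventTime ascending; request.filename() is ignored in this sketch.
        return repository.findSessionsInTimeRange(from, to);
    }
}
```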
At a high level, this application has two main tasks:
- [1] Ingesting data from the text files and storing it in a structured format, for easy querying.
- [2] Exposing an HTTP endpoint for accessing the ingested data.
Task [1] is handled fairly robustly in this preliminary implementation, in that:
- Incoming data files are processed in a continuous and memory-efficient way (streaming).
- Data files can be added while the application is running, and they will be processed and added to the DB.
- New record formats can easily be plugged in using Camel components.
- New data sources (other than files) can easily be plugged in.
- The embedded DB is located in the directory named derby_data_dir under the working directory.
Task [2] can be optimized further, as explained under "Why Derby embedded DB" below.
Due to the nature of the problem, this is a reduced-scope implementation; multiple optimizations are possible.
Why use Apache Camel and a DB instead of something from scratch:
- Camel is purpose-built for applications like this, where data needs to flow from one system to another, including file watching, parsing, and transforming.
Why Java + Spring Boot:
- Spring Boot gets us an easy server + API, and in particular Camel and DB support.
- It reduces the amount of custom code, which minimizes testing effort.
- Fastest path to an MVP (minimum viable product), since all the solution components needed for this problem are pre-built.
About the correctness of the output: correctness was verified manually with a reduced dataset. Also, the timestamp format is natively supported by the database (Derby's TIMESTAMP type).
File-IO-based search was ruled out because:
- It consumes a lot of time on each API call, repeating the same work.
- We cannot exploit the timestamp format for efficient querying unless the data is in memory.
- If we preprocess the files, create an index, and search using binary/n-ary search, we are effectively recreating the strategy databases already use.
Why Derby embedded DB: Ideally, for range queries we should go with a time-series DB, or a DB with an efficient range-query implementation. MongoDB could be used, but that would mean running a separate service, which is avoided in this preliminary implementation.
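As a rough sketch of how this looks with embedded Derby, the table can use Derby's native TIMESTAMP type plus an index on the event-time column (table and column names here are assumptions, not the project's actual schema):
```java
import org.springframework.jdbc.core.JdbcTemplate;

public class SchemaSetupSketch {
    // Illustrative Derby DDL: an identity primary key (the extra "Id" field
    // mentioned in the notes below), a native TIMESTAMP column, and an index
    // on the event-time column that keeps range queries cheap.
    public void createSchema(JdbcTemplate jdbc) {
        jdbc.execute("CREATE TABLE session_data ("
                + "id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, "
                + "event_time TIMESTAMP NOT NULL, "
                + "email VARCHAR(320) NOT NULL, "
                + "session_id VARCHAR(36) NOT NULL)");
        jdbc.execute("CREATE INDEX idx_event_time ON session_data (event_time)");
    }
}
```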
Additional Notes:
- There is an extra field (Id) in the API result, introduced after the screenshots were captured. It is the primary key of the table (instead of the earlier SessionId GUID).
- The Dockerfile and dockerization in this project are basic, using the fat jar of the Spring Boot application.
Choice of database: Derby is not the right choice long term. For the existing implementation, we could split the database at around a million records, but ideally we would run a time-series database and have the Camel route output data into it. Load balancing: use Camel's built-in load-balancing component to direct messages to different DB instances.
Known improvements:
- On an API call, check whether Camel is still processing; if the result is not found, respond with an appropriate error advising a retry after some time.
- Change to an appropriate database.
- Use application-level logic for partitioning data across DB instances.
- Use a cache for recent range queries, and query the DB only for new ranges.
- Implement pagination for the REST API, so that large ranges don't time out or exceed HTTP body size limits.
- File-based logging for easier debugging; relatively easy thanks to Spring Boot.
- Midway through the implementation, to speed up DB insertions, I switched to batch update mode, so one of the Camel processors from the original design is unused.
- Use prepared statements for the SQL in the record search implementation (SessionDataRepositoryImpl.findSessionsInTimeRange()) to mitigate SQL injection; see the sketch after this list.
- The API spec is currently non-standard because there is only one API. We should use a Swagger spec, which would also give us documentation without much effort.
- Although implemented in Java 17, the code doesn't really use many of its new features; we could use records for some classes.
- Add more negative test cases to the unit tests.
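A sketch of what the prepared-statement range query and the batch insert mentioned above could look like, reusing the illustrative session_data schema from the earlier DDL sketch (the project's actual repository code may differ):
```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class SessionDataRepositorySketch {

    record SessionRecord(String eventTime, String email, String sessionId) {}

    private final JdbcTemplate jdbc;

    public SessionDataRepositorySketch(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // The ? placeholders make this a prepared statement: from/to are bound
    // as TIMESTAMP parameters instead of being concatenated into the SQL
    // string, which mitigates SQL injection. (Assumes the JVM runs in UTC
    // so Derby TIMESTAMP values round-trip cleanly.)
    public List<SessionRecord> findSessionsInTimeRange(Instant from, Instant to) {
        return jdbc.query(
            "SELECT event_time, email, session_id FROM session_data "
                + "WHERE event_time BETWEEN ? AND ? ORDER BY event_time ASC",
            (rs, rowNum) -> new SessionRecord(
                rs.getTimestamp("event_time").toInstant().toString(),
                rs.getString("email"),
                rs.getString("session_id")),
            Timestamp.from(from), Timestamp.from(to));
    }

    // Spring JDBC batch update: one round trip per batch instead of one
    // INSERT per record.
    public void batchInsert(List<SessionRecord> batch) {
        jdbc.batchUpdate(
            "INSERT INTO session_data (event_time, email, session_id) VALUES (?, ?, ?)",
            batch.stream()
                .map(r -> new Object[] {
                    Timestamp.from(Instant.parse(r.eventTime())),
                    r.email(), r.sessionId() })
                .toList());
    }
}
```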
How it works:
- This is a Spring Boot web application.
- The meat of the solution is in FileRoute.java.
- The File component of the Apache Camel library is used for ingesting the data from the text files.
- Camel runs continuously in multi-threaded mode, ingesting data from the files placed under the input_files_folder configured in application.properties; it unmarshals each record line of the text file into a model and routes this structured data to the DB (a condensed sketch follows this list).
- Unmarshalling is aided by the Camel Bindy data format, which readily parses the records.
- Camel uses multiple threads (as many as there are available cores on the system).
- There is an index on the eventTime column to speed up range queries.
- To speed up data insertion into the table, Spring JDBC batch update is used.
- Insertion of 3 million records has been tested (it takes roughly 4 minutes).
- To avoid some boilerplate code, the Lombok Java library is used.
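A condensed, illustrative sketch of such a route follows; the real logic is in FileRoute.java, and the bean name, batch size/timeout, and model class below are assumptions:
```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;
import org.apache.camel.model.dataformat.BindyType;
import org.apache.camel.processor.aggregate.GroupedBodyAggregationStrategy;

public class FileRouteSketch extends RouteBuilder {

    // Bindy model for one record line; the separator is a single space,
    // matching "[Date][space][Email Address][space][Session Id]".
    @CsvRecord(separator = " ")
    public static class SessionRecordLine {
        @DataField(pos = 1) private String eventTime;
        @DataField(pos = 2) private String email;
        @DataField(pos = 3) private String sessionId;
        // getters/setters omitted (Lombok generates them in the project)
    }

    @Override
    public void configure() {
        // Watch the configured folder; split each file line by line in
        // streaming mode, so multi-GB files are never fully loaded into memory.
        from("file:{{input_files_folder}}?noop=true")
            .split(body().tokenize("\n")).streaming()
            .unmarshal().bindy(BindyType.Csv, SessionRecordLine.class)
            // Group records into batches (1000, or whatever arrives within a
            // second) to feed the Spring JDBC batch insert.
            .aggregate(constant(true), new GroupedBodyAggregationStrategy())
                .completionSize(1000).completionTimeout(1000)
            .to("bean:sessionDataRepository?method=batchInsert");
    }
}
```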
Remember to set the input_files_folder property in application.properties, or pass the command-line parameter -Dspring-boot.run.arguments=--input_files_folder=/input/files/path/,
pointing it to a directory on your system that contains the sample*.txt files with records in the given format:
Eg: 2000-01-01T17:25:49Z dsfgedsdfrisdfgc_strdfgosin@adsfgdamsdfgs.co.uk dfgdfg-dfgdfg-3425fgh-ghgh-dfghdfgh4345
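For example, application.properties could contain (the path is a placeholder):
```properties
# Directory watched by the Camel file route
input_files_folder=/path/to/input-text-files
```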
Run command:
```
./mvnw spring-boot:run [-Dspring-boot.run.arguments=--input_files_folder=/location-of-input-text-files/]
```
If using Docker (testing pending):
```
docker run -p 8080:8080 app:latest
```
**Note:** The first run might take time to download dependencies, build, and process the text data.