Strategy:
Since the log file is very large, pre-processing it into an index-like structure is efficient. To do that, we create a dictionary whose keys are line numbers and whose values are the byte offsets of those lines from the start of the file. Given any line number, we can then look up its position in O(1) time (a dictionary is just a hash map), seek to that offset, and read the line directly without scanning the file.
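For illustration, a minimal sketch of the pre-processing step (build_index is a hypothetical helper, not necessarily the project's actual code):

def build_index(log_path):
    # Map each line number to its byte offset from the start of the file.
    index = {}
    offset = 0
    with open(log_path, "rb") as f:
        for line_number, line in enumerate(f):
            index[line_number] = offset
            offset += len(line)
    return index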
If we are given a 1 TB log file, its index dictionary will not be more than 100 MB, since it only stores a key/value pair per line and both are integers. Once a request comes in, we perform a binary search over the line numbers using start_date as the target value: at each step we seek to the indexed offset, read that line's timestamp, and narrow the range. This finds the line number from which to start reading the logs in O(log(n)) time, where n is the number of lines in the log file.
Now that we have the starting line number, we can read line by line sequentially, without bringing the entire file into memory, and append the results to an array, stopping once a line's timestamp exceeds end_date.
So essentially, every request is served in O(log(n) + r) time, where r is the number of lines in the given range (end_date - start_date) and n is the total number of lines in the log file.
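Under these assumptions, a minimal sketch of the request path (read_timestamp and fetch_logs are hypothetical helpers, and each log line is assumed to begin with its ISO 8601 timestamp, followed by a space and the message):

def read_timestamp(f, offset):
    # Seek to the indexed line and return its leading ISO 8601 timestamp.
    f.seek(offset)
    return f.readline().split(b" ", 1)[0].decode()

def fetch_logs(log_path, index, start_date, end_date):
    # ISO 8601 timestamps in a fixed format sort chronologically as plain
    # strings, so string comparison is enough here.
    logs = []
    with open(log_path, "rb") as f:
        lo, hi = 0, len(index)
        while lo < hi:  # binary search: first line with timestamp >= start_date
            mid = (lo + hi) // 2
            if read_timestamp(f, index[mid]) < start_date:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(index):
            f.seek(index[lo])
            for line in f:  # sequential scan until end_date is passed
                time, message = line.decode().rstrip("\n").split(" ", 1)
                if time > end_date:
                    break
                logs.append({"time": time, "message": message})
    return logs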
Input/Output Description:
The application is accessed through a POST API and responds with JSON. The input/output format is as follows:
INPUT:
{ "start_date": "2020-01-01T15:41:52.301Z", "end_date": "2020-02-18T07:34:09.451Z", "api_key": "qp0jnJKA78" }
- start_date: The start date, in ISO 8601 format as given in the log file.
- end_date: The end date, in ISO 8601 format as given in the log file.
- api_key: The API key is there purely for authentication; use the same key in all requests.
OUTPUT:
{ "start_date": "2020-01-01T15:41:52.301Z", "end_date": "2020-02-18T07:34:09.451Z", "logs_count": 96172, "logs": [ { "time": "2020-01-01T15:42:01.290Z", "message": "Response 200 sent to 79.122.157.242 for /about" }, { "time": "2020-01-01T15:42:28.385Z", "message": "Querying table customers" }, ... .. . ] }
- start_date: Same as the input.
- end_date: Same as the input.
- logs_count: Total number of logs in the given range.
- logs: Array of objects, each containing a timestamp and a log message.
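For example, the endpoint can be called from Python with the third-party requests library (a client-side sketch; the endpoint URL and key match the examples above):

import requests

resp = requests.post(
    "http://0.0.0.0:8000/fetch_logs",
    json={
        "start_date": "2020-01-01T15:41:52.301Z",
        "end_date": "2020-02-18T07:34:09.451Z",
        "api_key": "qp0jnJKA78",
    },
)
data = resp.json()
print(data["logs_count"], "logs in range")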
Installation steps:
The project requires Python 3.6 or above. To check whether you have it, run:

python3.6

in a terminal. Once Python is installed, follow these steps:

- Install virtualenv:

sudo apt-get install python3-pip
sudo pip3 install virtualenv
- Clone the project and navigate into the home-project folder, then create a new virtual environment:

virtualenv -p python3.6 env
- Activate the virtual environment:

source env/bin/activate

- Install the dependencies:

pip install -r requirements.txt
- Start the API server:

python run.py
- Use Postman to make API requests, or use curl from the terminal:

curl --location --request POST 'http://0.0.0.0:8000/fetch_logs' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "start_date": "2020-01-01T15:41:52.301Z", "end_date": "2020-02-18T07:34:09.451Z", "api_key": "qp0jnJKA78" }'
The application uses asyncio, so requests from multiple clients can be served asynchronously.
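As an illustrative sketch only (aiohttp is an assumption here; the project may use a different asyncio framework), a handler for the /fetch_logs endpoint could look like this, reusing the hypothetical build_index and fetch_logs helpers from the Strategy section:

import asyncio
from aiohttp import web

API_KEY = "qp0jnJKA78"         # hypothetical: matches the example requests above
LOG_PATH = "app.log"           # hypothetical path to the log file
INDEX = build_index(LOG_PATH)  # built once at startup

async def handle_fetch_logs(request):
    body = await request.json()
    if body.get("api_key") != API_KEY:
        return web.json_response({"error": "invalid api_key"}, status=401)
    # The file scan is blocking I/O, so run it in a thread pool to keep
    # the event loop free to serve other clients.
    logs = await asyncio.get_event_loop().run_in_executor(
        None, fetch_logs, LOG_PATH, INDEX, body["start_date"], body["end_date"])
    return web.json_response({
        "start_date": body["start_date"],
        "end_date": body["end_date"],
        "logs_count": len(logs),
        "logs": logs,
    })

app = web.Application()
app.router.add_post("/fetch_logs", handle_fetch_logs)
web.run_app(app, host="0.0.0.0", port=8000)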