- Loop through all files in the folder
- Parse each file and load its content into a separate table, using a thread pool
- Process all loaded events and load them into a second table, using a thread pool
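A minimal sketch of the three steps above, using a standard ExecutorService. The in-memory maps standing in for the two database tables, and the uppercase transform standing in for real event processing, are placeholder assumptions for illustration only:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class PipelineSketch {

    // In-memory stand-ins for the two tables (assumed, for the sketch).
    static final Map<String, List<String>> rawEvents = new ConcurrentHashMap<>();
    static final List<String> processedEvents =
            Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) throws Exception {
        // Sample input folder with two small log files.
        Path folder = Files.createTempDirectory("logs");
        Files.write(folder.resolve("a.log"), List.of("evt1", "evt2"));
        Files.write(folder.resolve("b.log"), List.of("evt3"));

        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Steps 1 + 2: loop through the files and parse each one on the pool,
        // loading its lines into the first "table".
        List<Future<?>> parses = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.log")) {
            for (Path file : files) {
                parses.add(pool.submit(() -> {
                    try {
                        rawEvents.put(file.getFileName().toString(),
                                      Files.readAllLines(file));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
        }
        for (Future<?> f : parses) f.get(); // wait for all parse tasks

        // Step 3: process every loaded event into the second "table",
        // again on the pool. Uppercasing is a placeholder transform.
        List<Future<?>> processes = new ArrayList<>();
        for (List<String> events : rawEvents.values()) {
            processes.add(pool.submit(() ->
                    events.forEach(e -> processedEvents.add(e.toUpperCase()))));
        }
        for (Future<?> f : processes) f.get();
        pool.shutdown();

        System.out.println(processedEvents.size()); // 3 events processed
    }
}
```

The two-phase wait (all parses finish before processing starts) mirrors the batch nature of the demo; a streaming agent would instead hand events downstream as they are parsed.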
Split a large file into chunks, either by line count or by size:
split -l 200 large_file.log
split -b 500MB large_file.log
-
From the IDE, make sure annotation processing is enabled, since the project uses Lombok: https://www.jetbrains.com/help/idea/configuring-annotation-processing.html
-
Run DemoApplication.java with a parameter pointing to a folder that contains log files. Ex:
/home/andrei/Projects/logfile-processor
This is only a demo application. For real-world production use, a streaming solution is preferable to a batch-processing one, since it is more suitable for logs.
Possible options:
-
AWS Kinesis + Elasticsearch.
Ex: https://aws.amazon.com/getting-started/projects/build-log-analytics-solution/
-
Stream logs to Kafka and process them with Apache Spark.
-
Load logs into any other distributed cache (like Redis) and process them in batches with Hadoop MapReduce.
-
Spring Batch in parallel mode with file chunking.
The application could also be extended to become a lightweight log streaming agent.