The purpose of this project is to implement batch processing using Akka. It is more a proof of concept than a production-ready solution; there is no special error handling.
The program reads a file of comma-separated values (CSV), converts the lines to records and then processes them. The time needed is recorded. A non-parallel version also exists so that the processing speed can be compared.
These are the basic requirements:
- the sequence of data in the output must be the same as in the input even if there is no sequential number in the input data
- it must be possible to process files with an arbitrary number of records without memory or performance problems
For this test, the following processing is done:
- creation of a record which contains the original csv value
- creation of a copy of the record
- calculation of a Fibonacci number or a thread sleep
- when writing the output, the original csv value is used so that input and output can be compared using the diff command. This is to ensure that the records are written in the correct order.
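The test processing listed above can be sketched in plain Java. This is only an illustration: `SampleRecord`, `SampleProcessing` and the `result` field are hypothetical names, not classes taken from the project.

```java
import java.math.BigInteger;

// Hypothetical record type; the project's real record class is not shown in this README.
final class SampleRecord {
    final String originalCsv; // kept verbatim so the output can be diffed against the input
    BigInteger result;        // assumed field holding the outcome of the simulated work

    SampleRecord(String originalCsv) {
        this.originalCsv = originalCsv;
    }

    // Creates a copy of the record.
    SampleRecord copy() {
        SampleRecord c = new SampleRecord(originalCsv);
        c.result = result;
        return c;
    }
}

final class SampleProcessing {
    // Iterative Fibonacci, used here only to consume some CPU time.
    static BigInteger fibonacci(int n) {
        BigInteger a = BigInteger.ZERO;
        BigInteger b = BigInteger.ONE;
        for (int i = 0; i < n; i++) {
            BigInteger next = a.add(b);
            a = b;
            b = next;
        }
        return a;
    }

    static SampleRecord process(String csvLine, int n) {
        SampleRecord original = new SampleRecord(csvLine); // record with the original csv value
        SampleRecord copy = original.copy();               // copy of the record
        copy.result = fibonacci(n);                        // simulated work
        return copy;
    }

    public static void main(String[] args) {
        SampleRecord r = process("1,foo,bar", 30);
        System.out.println(r.originalCsv + " -> " + r.result);
    }
}
```

Because the original CSV value travels with the record, the output step can write it back unchanged.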
The following sections describe the actors used and the messages passed between them. The program uses a pull pattern, which ensures that the work to be done is distributed among the actors that are not busy at the moment. I took the idea for this from this blog entry.
The Reader is the main actor of the whole program. The CSV2Record actors register with the Reader, and the Reader gets the message to process a file. It reads the lines from the input file and makes sure that there is only ever a limited number of records in the system. This prevents the system from being flooded with data and keeps the Writer from having to buffer too many processed records if one record takes a little longer to process.
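The limit on the number of records in the system can be pictured with a small plain-Java model. It is a sketch only, without Akka: `BoundedReader`, `maxInFlight` and the method names are assumptions made for this example, not the project's API.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;

// Hypothetical model of the "limited number of records" rule; not the project's Reader actor.
final class BoundedReader {
    private final Iterator<String> lines;
    private final int maxInFlight; // assumed upper bound of records in the system
    private int inFlight = 0;
    private long nextId = 0;       // sequential record id
    final Queue<String> ready = new ArrayDeque<>(); // records waiting for a worker

    BoundedReader(Iterator<String> lines, int maxInFlight) {
        this.lines = lines;
        this.maxInFlight = maxInFlight;
    }

    // Reads lines until the limit is reached or the input is exhausted.
    // Returns the number of lines read.
    int fill() {
        int read = 0;
        while (inFlight < maxInFlight && lines.hasNext()) {
            ready.add(nextId++ + ":" + lines.next());
            inFlight++;
            read++;
        }
        return read;
    }

    // Models the effect of a RecordsWritten message: written records free
    // slots, so the reader may read that many new lines.
    int recordsWritten(int count) {
        inFlight -= count;
        return fill();
    }
}
```

With a limit of 2 and five input lines, the first `fill()` reads two lines, and further lines are read only as `recordsWritten(...)` reports completed output.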
The Reader receives the following messages:
- Register is sent by a CSV2Record to register with the Reader. This message must be resent regularly because the Reader removes CSV2Record actors that do not re-register within a certain time. This prevents the Reader from sending data to actors that are no longer available.
- InitReader contains information about the file to process and is sent at the beginning by the Inbox of the system.
- GetWork is sent by a CSV2Record when this actor can process some data.
- RecordReceived contains the record id and is sent by the Writer when it has received the corresponding record. The Reader then removes this record from the list of records that may have to be sent again.
- RecordsWritten is sent by the Writer when it has written some data to the output. The message contains the number of written records and enables the Reader to read the next data.
- SendAgain is sent by the Reader itself. The purpose of this message is to check whether some records need to be sent again because their processing takes too long.
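The Register handling described above, where workers that stop re-registering are dropped, could look roughly like the following. `WorkerRegistry`, the string worker ids and the explicit timestamps are assumptions for this sketch; the real Reader works with actor references and its own timing.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical registry; it only models "remove workers that did not re-register in time".
final class WorkerRegistry {
    private final Map<String, Long> lastSeen = new HashMap<>(); // worker id -> time of last Register
    private final long timeoutMillis;                           // assumed expiry interval

    WorkerRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called for every Register message; refreshes the timestamp.
    void register(String workerId, long nowMillis) {
        lastSeen.put(workerId, nowMillis);
    }

    // Removes workers whose last Register is older than the timeout.
    // Returns the number of removed workers.
    int evictStale(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > timeoutMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean isRegistered(String workerId) {
        return lastSeen.containsKey(workerId);
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic and easy to test.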
The Reader sends the following messages:
- WorkAvailable is sent to all registered CSV2Record actors when new data is available.
- DoWork contains the record id and the CSV line and is sent to a CSV2Record actor in response to a GetWork message.
- WorkDone is sent to the Inbox when all records are processed or there has been an error.
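The pull pattern behind WorkAvailable, GetWork and DoWork can be modeled without Akka as a queue that only hands out work on request. `PullDispatcher` and its log strings are invented for this illustration; in the real program these are messages between actors.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, single-threaded model of the pull pattern; not the project's code.
final class PullDispatcher {
    private final Queue<String> pending = new ArrayDeque<>();
    final List<String> log = new ArrayList<>(); // records which "messages" would be sent

    // New data arrived: queue it and notify every registered worker (WorkAvailable).
    void addWork(String line, List<String> registeredWorkers) {
        pending.add(line);
        for (String worker : registeredWorkers) {
            log.add("WorkAvailable -> " + worker);
        }
    }

    // A worker asks for work (GetWork). It gets a DoWork payload if something
    // is pending, otherwise null, and then waits for the next WorkAvailable.
    String getWork(String worker) {
        String line = pending.poll();
        if (line != null) {
            log.add("DoWork(" + line + ") -> " + worker);
        }
        return line;
    }
}
```

Work is never pushed to a busy worker: a line leaves the queue only when some worker has asked for it, so whichever worker is free first gets the next record.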
The Writer is responsible for writing the records it receives to the output file while keeping the original order of the records. To achieve this, the Writer has an internal buffer that stores the records which cannot be written yet because they were processed faster than records that must be written before them. These records are managed using the record id, which is generated sequentially by the Reader.
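The buffering by record id can be sketched as a small reorder buffer. This is a simplified model, assuming ids start at 0 and increase by 1; `OrderedWriter` is a hypothetical name, and the output is collected in a list instead of being written to a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reorder buffer; not the project's Writer actor.
final class OrderedWriter {
    private final Map<Long, String> buffer = new HashMap<>(); // records that arrived too early
    private long nextIdToWrite = 0;                            // assumption: ids start at 0
    final List<String> output = new ArrayList<>();             // stands in for the output file

    // Accepts a processed record and writes every record that is now in order.
    // Returns the number of records written by this call.
    int accept(long id, String line) {
        if (id < nextIdToWrite) {
            return 0; // already written, e.g. a record that was sent again
        }
        buffer.put(id, line);
        int written = 0;
        String next;
        while ((next = buffer.remove(nextIdToWrite)) != null) {
            output.add(next);
            nextIdToWrite++;
            written++;
        }
        return written;
    }

    int buffered() {
        return buffer.size();
    }
}
```

The return value of `accept` corresponds to the count that a RecordsWritten message would carry back to the Reader.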
The Writer receives the following messages:
- InitWriter contains the name and encoding of the output file and is sent by the Inbox object.
- ProcessRecord contains the record id, the processed record and the original CSV line.
The Writer sends the following messages:
- InitReady is sent to the Inbox when the InitWriter message was received and the internal structures are initialized; it contains a flag signaling success.
- RecordReceived contains a record id and is sent to the Reader when a ProcessRecord message was received, to keep the Reader from sending the record into the system again.
- RecordsWritten is sent to the Reader when output records have been written. The message contains the number of written records.
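How RecordReceived keeps records from being sent again can be illustrated with a table of pending record ids. The class `PendingRecords`, the resend interval and the explicit timestamps are assumptions for this sketch; the actual bookkeeping lives inside the Reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping of records that may have to be sent again.
final class PendingRecords {
    private final Map<Long, Long> sentAt = new LinkedHashMap<>(); // record id -> time last sent
    private final long resendAfterMillis;                         // assumed resend interval

    PendingRecords(long resendAfterMillis) {
        this.resendAfterMillis = resendAfterMillis;
    }

    // The Reader handed the record to a worker.
    void sent(long id, long nowMillis) {
        sentAt.put(id, nowMillis);
    }

    // The Writer confirmed the record (RecordReceived): it never needs resending.
    void recordReceived(long id) {
        sentAt.remove(id);
    }

    // What a SendAgain check would look at: ids that are still unconfirmed
    // after the resend interval.
    List<Long> dueForResend(long nowMillis) {
        List<Long> due = new ArrayList<>();
        for (Map.Entry<Long, Long> e : sentAt.entrySet()) {
            if (nowMillis - e.getValue() >= resendAfterMillis) {
                due.add(e.getKey());
            }
        }
        return due;
    }
}
```

A record stays in the table only between being handed out and being confirmed by the Writer, so the table is bounded by the number of records in the system.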
The CSV2Record actor converts a line in CSV format into a record. For simulation purposes, the actor also consumes some processing time.
The CSV2Record receives the following messages:
- WorkAvailable is sent from the Reader to the registered CSV2Record actors when there is data to be processed.
- DoWork (contains the record id and the original line read) is sent from the Reader after the CSV2Record has sent a GetWork message to the Reader.
The CSV2Record sends the following messages:
- Register is sent regularly to the Reader.
- GetWork is sent to the Reader when the actor is ready to do some work, either after the previous work has been processed or after the actor has received a WorkAvailable message.
- ProcessRecord (contains the record id, the original line and the processed record) is sent to a RecordModifier.
The RecordModifier actor does the processing of the record.
The RecordModifier receives the following message:
- ProcessRecord (see CSV2Record)
The RecordModifier sends the following message:
- ProcessRecord (contains the processed record) is sent to the Writer.
This is a Maven project in which the appassembler-maven-plugin is configured to run during the package phase. A simple mvn package command therefore builds everything and creates the directory target/AkkaBatch. Its bin subdirectory contains scripts for starting the program; the etc subdirectory contains the akkabatch.conf configuration file.