Ingestor for match data streams, storing them in S3.
Etymology: We "cook" raw matches into pseudorecordio files and prepare them for ETL jobs that read from S3.
This is a really simple Spark streaming application that does the following:
- Subscribe to the match Kafka topic for a region, e.g. bacchus.matches.NA
- Parse matches from the Kafka topic (inefficient, TODO optimize)
- Extract the version from each message
- Split the stream into one RDD per unique version
- Convert each RDD to a DataFrame and write a pseudorecordio file
  $region/$version/$time.protolist
  to an S3 bucket (a rough sketch of the whole flow follows below).
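A minimal sketch of that flow, assuming the Spark Streaming Kafka 0-10 direct stream API; parseMatch, extractVersion, and writeProtolist are hypothetical stand-ins for the real parsing and upload code, and the broker address, group id, and batch interval are placeholders:

```scala
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object MatchIngestorSketch {
  // Hypothetical stand-ins for the real parse, version-extraction, and upload code.
  type Match = Array[Byte]
  def parseMatch(bytes: Array[Byte]): Match = ???
  def extractVersion(m: Match): String = ???
  def writeProtolist(matches: RDD[Match], region: String, version: String): Unit = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("match-ingestor"), Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id" -> "match-ingestor"                       // placeholder consumer group
    )

    // Subscribe to the per-region match topic.
    val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
      ssc, PreferConsistent,
      Subscribe[String, Array[Byte]](Seq("bacchus.matches.NA"), kafkaParams))

    // Deep-parse every message (the step flagged as inefficient above), then split
    // each batch by version and write one protolist file per version to S3.
    stream.map(record => parseMatch(record.value())).foreachRDD { rdd =>
      val versions = rdd.map(extractVersion).distinct().collect()
      versions.foreach { version =>
        writeProtolist(rdd.filter(m => extractVersion(m) == version), region = "NA", version = version)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```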
The largest optimization would be to send the version alongside the Kafka message rather than inside it, so we can route records without a full deep parse of our very intricate Match message.
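One way that could look, assuming the match producers can be changed and the brokers support record headers (Kafka 0.11+); the match-version header name is made up for illustration:

```scala
import java.nio.charset.StandardCharsets
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.header.Headers
import org.apache.kafka.common.header.internals.RecordHeader

// Producer side (hypothetical): attach the version as a header next to the serialized Match.
def withVersionHeader(topic: String, matchBytes: Array[Byte], version: String): ProducerRecord[String, Array[Byte]] = {
  val record = new ProducerRecord[String, Array[Byte]](topic, matchBytes)
  record.headers().add(new RecordHeader("match-version", version.getBytes(StandardCharsets.UTF_8)))
  record
}

// Consumer side: read the version straight off the headers, no deep parse needed.
def versionFromHeaders(headers: Headers): Option[String] =
  Option(headers.lastHeader("match-version"))
    .map(h => new String(h.value(), StandardCharsets.UTF_8))
```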
Additionally, the upload implementation currently writes to temporary files on the local file system in order to interface with the S3 TransferManager. An issue has been opened requesting the ability to pass byte arrays directly to the TransferManager, which would remove the need to go through the file system: aws/aws-sdk-java#964
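For reference, the temp-file workaround looks roughly like this sketch (bucket and key are supplied by the caller; in the real code the TransferManager would typically be created once and reused rather than per upload):

```scala
import java.nio.file.Files
import com.amazonaws.services.s3.transfer.TransferManagerBuilder

// Sketch of the current workaround: spill the serialized protolist to a temp file,
// hand that file to TransferManager, and clean up afterwards.
def uploadProtolist(bytes: Array[Byte], bucket: String, key: String): Unit = {
  val tmp = Files.createTempFile("cooked-match-", ".protolist")
  val transferManager = TransferManagerBuilder.standard().build()   // in practice, share one instance
  try {
    Files.write(tmp, bytes)
    transferManager.upload(bucket, key, tmp.toFile).waitForCompletion()
  } finally {
    transferManager.shutdownNow()
    Files.deleteIfExists(tmp)
  }
}
```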