Spark 2.4.x DataSourceV2 implementation for WARC

Primary language: Scala

WARCDataSource

Work in progress.

Spark DataSourceV2 for reading WARC.

Unit Tests

The unit test is not self-contained: it relies on one of the April 2019 WARC files being present in $HOME/Downloads. The gradle runSpark task makes the same assumption. A complete file is used because of line-ending peculiarities between Unix and Windows: I wanted to be certain that grabbing just a few records was not itself causing the problems I experienced while parsing records.
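To illustrate why line endings matter here, the following is a minimal sketch of reading a WARC record's header block from raw bytes. It is not this project's parser; `WarcHeaderSketch`, `WarcHeader`, and `parseHeader` are hypothetical names introduced for the example. The WARC format terminates header lines with CRLF and gives the payload size in bytes via `Content-Length`, so a hand-copied excerpt whose line endings were rewritten can shift every byte offset. The sketch accepts both CRLF and bare LF so the two cases can be compared.

```scala
import java.nio.charset.StandardCharsets
import scala.annotation.tailrec

// Hypothetical helper, not part of this repository.
object WarcHeaderSketch {

  /** Version line, named header fields, and the byte offset where the content block starts. */
  final case class WarcHeader(version: String, fields: Map[String, String], bodyOffset: Int)

  /** Reads one line starting at `from`, accepting CRLF or bare LF.
    * Returns the line without its terminator and the offset of the next line.
    */
  private def readLine(bytes: Array[Byte], from: Int): Option[(String, Int)] =
    if (from >= bytes.length) None
    else {
      val lf = bytes.indexOf('\n'.toByte, from)
      val end = if (lf < 0) bytes.length else lf
      val next = if (lf < 0) bytes.length else lf + 1
      // Drop a trailing CR so CRLF and LF input yield the same text.
      val textEnd = if (end > from && bytes(end - 1) == '\r'.toByte) end - 1 else end
      Some((new String(bytes, from, textEnd - from, StandardCharsets.UTF_8), next))
    }

  /** Parses the header block of the record beginning at `start`.
    * Returns None if the first line is not a "WARC/" version line.
    */
  def parseHeader(bytes: Array[Byte], start: Int = 0): Option[WarcHeader] =
    readLine(bytes, start) match {
      case Some((version, afterVersion)) if version.startsWith("WARC/") =>
        @tailrec
        def loop(pos: Int, acc: Map[String, String]): WarcHeader =
          readLine(bytes, pos) match {
            case Some((line, next)) if line.nonEmpty =>
              val idx = line.indexOf(':')
              val updated =
                if (idx > 0) acc + (line.substring(0, idx).trim -> line.substring(idx + 1).trim)
                else acc // ignore lines that are not "Name: value"
              loop(next, updated)
            case Some((_, next)) => WarcHeader(version, acc, next) // blank line ends the header
            case None            => WarcHeader(version, acc, bytes.length)
          }
        Some(loop(afterVersion, Map.empty))
      case _ => None
    }
}
```

Parsing the same record with CRLF and with LF endings gives identical fields but different `bodyOffset` values, which is one way an excerpt with converted line endings can end up disagreeing with offsets computed against the original file.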