helgeho/ArchiveSpark

Filter an RDD of archive records and save a new WARC file

dportabella opened this issue · 4 comments

Is it possible to filter an RDD of archive records and save it back to an archive?
Something like:

ArchiveSpark.load(sc, WarcHdfsSpec("/cdx/path/*.cdx", "/path/to/warc/and/arc"))
.filter(r => domain(r.url) == "epfl.ch")
.save(sc, WarcHdfsOut("/tmp/filtered.cdx", "/tmp/filtered.warc"))

Hi David, and thanks for the suggestion. We've already considered this, and it is actually fairly easy to implement; it's just a matter of time. However, now that there is actual demand, I'll try to add this feature as soon as possible. I'll keep you posted!

Hi helgeho, I've created a gist using the warcbase library. You might adapt it to ArchiveSpark:
https://gist.github.com/dportabella/3caf261c218a4448a03a14dbc06fe730

Hi David,

I have great news for you, we just pushed ArchiveSpark 2.5 and (besides many bug fixes) it now has support for this:

After loading a CDX/WARC dataset, you can now filter / select the records and save them as new WARC(.gz) files with custom headers and even generate corresponding CDX records for it.
If you have imported de.l3s.archivespark.specific.warc._ in your code, you can now call saveAsWarc on your dataset; it has the following signature:

cdxWarcRDD.saveAsWarc(path: String, info: WarcMeta, generateCdx: Boolean = true)

WarcMeta is a case class that lets you specify metadata for your new WARC files, e.g.:

val meta = WarcMeta(publisher = "Your Name")
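Putting the pieces together, here is a minimal end-to-end sketch. The paths, the SparkContext `sc`, the record field used for the domain check, and the exact `WarcHdfsSpec` arguments are illustrative assumptions; only `saveAsWarc` and `WarcMeta` are confirmed above:

```scala
// Sketch only: assumes ArchiveSpark 2.5+ and an existing SparkContext `sc`.
import de.l3s.archivespark._
import de.l3s.archivespark.specific.warc._

// Load a CDX/WARC dataset (paths are placeholders).
val records = ArchiveSpark.load(sc,
  WarcHdfsSpec("/cdx/path/*.cdx", "/path/to/warc/and/arc"))

// Keep only records for epfl.ch. A simple substring check is used here;
// a real implementation should parse the URL host properly.
val filtered = records.filter(r => r.originalUrl.contains("epfl.ch"))

// Write new WARC(.gz) files and generate corresponding CDX records.
val meta = WarcMeta(publisher = "Your Name")
filtered.saveAsWarc("/tmp/filtered", info = meta, generateCdx = true)
```

The `originalUrl` accessor is an assumption about the record API; substitute whichever field exposes the record's URL in your ArchiveSpark version.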

More detailed instructions will follow soon...
Hope this helps!

that's great news, thx! :)