Filter an RDD of archive records and save a new WARC file
dportabella opened this issue · 4 comments
Is it possible to filter an RDD of archive records and save it back to an archive?
Something like:
ArchiveSpark.load(sc, WarcHdfsSpec("/cdx/path/*.cdx", "/path/to/warc/and/arc"))
.filter(r => domain(r.url) == "epfl.ch")
.save(sc, WarcHdfsOut("/tmp/filtered.cdx", "/tmp/filtered.warc"))
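The `domain` helper in the snippet above is not defined anywhere; here is one possible sketch of it in plain Scala, using `java.net.URI` (the name and behavior are assumptions, not part of any library):

```scala
import java.net.URI

// Hypothetical helper: extract the host from a URL and drop a leading
// "www." so that "http://www.epfl.ch/page" and "https://epfl.ch/" both
// map to "epfl.ch". Returns "" if the URL has no host.
def domain(url: String): String = {
  val host = new URI(url).getHost
  if (host == null) "" else host.stripPrefix("www.")
}
```

A stricter version might also lower-case the host or compare registered domains instead of full hosts; this is the minimal form the filter above needs.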
Hi David, and thanks for the suggestion. We've considered this already and it is actually fairly easy to implement, it's just a matter of time. However, now that there is actual demand, I'll try to add this feature as soon as possible. I'll keep you posted!
Hi helgeho, I've created a gist using the warcbase library. You might adapt it to ArchiveSpark:
https://gist.github.com/dportabella/3caf261c218a4448a03a14dbc06fe730
Hi David,
I have great news for you, we just pushed ArchiveSpark 2.5 and (besides many bug fixes) it now has support for this:
After loading a CDX/WARC dataset, you can now filter / select the records and save them as new WARC(.gz) files with custom headers and even generate corresponding CDX records for it.
If you have imported de.l3s.archivespark.specific.warc._ in your code, you can now call saveAsWarc on your dataset, with the following signature:
cdxWarcRDD.saveAsWarc(path: String, info: WarcMeta, generateCdx: Boolean = true)
WarcMeta is a case class that lets you specify some metadata for your new WARC files, e.g.:
val meta = WarcMeta(publisher = "Your Name")
More detailed instructions will follow soon...
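Putting the pieces from this thread together, a minimal end-to-end sketch might look like this (the filter predicate and the domain helper are placeholders, and the paths are examples; only WarcHdfsSpec, WarcMeta, and saveAsWarc come from the thread itself):

```scala
// Sketch only: assumes ArchiveSpark 2.5 and a running SparkContext (sc).
import de.l3s.archivespark.ArchiveSpark
import de.l3s.archivespark.specific.warc._

val records = ArchiveSpark.load(sc,
  WarcHdfsSpec("/cdx/path/*.cdx", "/path/to/warc/and/arc"))

// Keep only records from epfl.ch; domain() is a hypothetical helper,
// not part of ArchiveSpark.
val filtered = records.filter(r => domain(r.url) == "epfl.ch")

// Write the filtered records as new WARC(.gz) files, generating
// matching CDX records alongside them (generateCdx defaults to true).
val meta = WarcMeta(publisher = "Your Name")
filtered.saveAsWarc("/tmp/filtered", info = meta, generateCdx = true)
```

This is not runnable outside a Spark deployment with the ArchiveSpark dependency on the classpath, so treat it as an outline of the call sequence rather than a tested program.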
Hope this helps!
that's great news, thx! :)