An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
Updated build for scala 2.12/spark 3.1.2+?
#27 opened by lesleyodu - 0
WARCType metadata
#25 opened by parismic - 1
Extracting information from warc/metadata
#21 opened by parismic - 1
java.lang.ClassNotFoundException: io.circe.Json
#20 opened by parismic - 0
saveAsWarc() does not reproduce input
#18 opened by parismic - 1
unresolved dependency for hadoop-core
#16 opened by parismic - 5
duplicate filename with different path
#13 opened by dportabella - 0
Problem with WarcGzHdfs
#17 opened by parismic - 2
Question on using "filterExists"
#15 opened by xw0078 - 26
cdx format, includes json
#4 opened by borissmidt - 3
load a warc archive without a cdx file
#9 opened by dportabella - 2
add this example to the doc
#8 opened by dportabella - 4
broken link in the doc
#2 opened by dportabella - 3
saveAsWarc with same warcPaths as the source
#12 opened by dportabella - 1
saveAsWarc compressed file
#11 opened by dportabella - 1
WarcHdfsSpec not found
#7 opened by dportabella - 3