helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

Scala · MIT

Issues

Updated build for scala 2.12/spark 3.1.2+?
#27 opened 6 months ago by lesleyodu
0
WARC files written in ArchiveSpark incompatible with warcio
#26 opened 4 years ago by parismic
0
WARCType metadata
#25 opened 4 years ago by parismic
7
Can ArchiveSpark read and process binary payload in warc files?
#24 opened 5 years ago by aysunakarsu
1
How to get the Array[Byte] from the InputStream in ArchiveSpark3 efficiently
#22 opened 6 years ago by parismic
3
Extracting information from warc/metadata
#21 opened 6 years ago by parismic
2
java.lang.ClassNotFoundException: io.circe.Json
#20 opened 6 years ago by parismic
1
Unknown connection error when downloading from wayback
#19 opened 6 years ago by thusithaC
0
saveAsWarc() does not reproduce input
#18 opened 6 years ago by parismic
2
unresolved dependency for hadoop-core
#16 opened 6 years ago by parismic
1
Missing location (offset/filename) in CDX generated from uncompressed WARC
#14 opened 6 years ago by xw0078
5
duplicate filename with different path
#13 opened 6 years ago by dportabella
1
Problem with WarcGzHdfs
#17 opened 6 years ago by parismic
0
Question on using "filterExists"
#15 opened 7 years ago by xw0078
2
cdx format, includes json
#4 opened 7 years ago by borissmidt
26
silent error if a CDX entry does not exist in the warc path
#10 opened 7 years ago by dportabella
3
load a warc archive without a cdx file
#9 opened 7 years ago by dportabella
4
add this example to the doc
#8 opened 7 years ago by dportabella
2
broken link in the doc
#2 opened 7 years ago by dportabella
4
saveAsWarc with same warcPaths as the source
#12 opened 7 years ago by dportabella
3
saveAsWarc compressed file
#11 opened 7 years ago by dportabella
1
WarcHdfsSpec not found
#7 opened 7 years ago by dportabella
1
cannot resolve 'BenchmarkMeasure' function in benchmark subproject
#6 opened 7 years ago by xw0078
3
filter a rdd of archive records, and save a new warc file
#3 opened 7 years ago by dportabella
4
It's not clear how to use the mapEnrich function
#5 opened 7 years ago by borissmidt
2