helgeho/ArchiveSpark

duplicate filename with different path

dportabella opened this issue · 1 comments

ArchiveSpark throws this error when used on the CommonCrawl:
Exception in thread "main" java.lang.RuntimeException: duplicate filename: CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz

CommonCrawl is divided into 100 "segments", with 300 WARC files in each segment. The directory layout is crawl-data/CC-MAIN-2016-36/segments/$SEG/warc/$FILE. All 100 segments contain a file named CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz (and the same holds for other file names). This should not be a problem, because the CDX file provides the full path to each file. However, it seems that ArchiveSpark only takes the file name into account and ignores the path, so it expects every WARC file name to be unique under the root directory.

Example line from the CDX file:

1,108,84,185)/503-nginx.html 20160824041351 {"url": "http://185.84.108.1/503-nginx.html", "mime": "text/html", "status": "200", "digest": "XQZBAMNV2YHHFEJ4GWNFLWUW43FKUUWF", "length": "1708", "offset": "552647", "filename": "crawl-data/CC-MAIN-2016-36/segments/1471982290765.41/warc/CC-MAIN-20160823195810-00248-ip-10-153-172-175.ec2.internal.warc.gz"}
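
To make the collision concrete, here is a minimal sketch that pulls the `filename` field out of JCDX lines like the one above and groups the full paths by base name. The helper names (`filenameOf`, `baseName`, `collisions`) are made up for illustration, and the regex is a shortcut that assumes the field layout shown in the example line; it is not a real JSON parser.

```scala
object DuplicateWarcNames {
  // Shortcut (assumption): grab the "filename" value with a regex
  // instead of parsing the JSON part of the line properly.
  private val FilenameField = "\"filename\":\\s*\"([^\"]+)\"".r

  // Full WARC path as recorded in the CDX line, if the field is present.
  def filenameOf(cdxLine: String): Option[String] =
    FilenameField.findFirstMatchIn(cdxLine).map(_.group(1))

  // Last path component, i.e. what a loader keyed on the bare file name would see.
  def baseName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  // Base names that are shared by more than one distinct full path.
  def collisions(paths: Seq[String]): Map[String, Seq[String]] =
    paths.distinct.groupBy(baseName).filter { case (_, ps) => ps.size > 1 }
}
```

Running `collisions` over the `filename` values of a CDX file would list every base name that occurs in more than one segment, while the full paths themselves remain distinct.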

How to solve this?

Sorry, I only saw this just now. Yes, this is indeed an issue at the moment: ArchiveSpark expects WARC filenames to be unique. Currently the only workaround is a loop like this (pseudo-Scala):

var rdd = ArchiveSpark.load(... ".../segment1/...")
for (seg <- (2 to 100)) rdd = rdd.union(ArchiveSpark.load(... ".../segment" + seg + "/..."))

Spark performs unions only virtually: the resulting RDD just keeps references to the RDDs involved, so this will be quick and should not have any disadvantages.
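
The loop above can also be written as a fold over per-segment paths. The sketch below is plain Scala and deliberately does not guess the `ArchiveSpark.load` arguments, which are elided in the snippet: `load` and `union` are passed in as functions. `segmentPaths` and `loadAll` are hypothetical helpers, not part of ArchiveSpark, and the path layout is the one quoted in the issue.

```scala
object SegmentUnion {
  // One WARC directory per segment, following the layout from the issue:
  // <crawlRoot>/segments/<segment>/warc
  def segmentPaths(crawlRoot: String, segments: Seq[String]): Seq[String] =
    segments.map(seg => s"$crawlRoot/segments/$seg/warc")

  // Load every path separately and combine the results pairwise.
  // With Spark, `load` would wrap the ArchiveSpark.load call for one segment
  // and `union` would be `_ union _` on the resulting RDDs (assumption).
  def loadAll[R](paths: Seq[String])(load: String => R)(union: (R, R) => R): R = {
    require(paths.nonEmpty, "at least one segment path is required")
    paths.map(load).reduce(union)
  }
}
```

Because each load only ever sees one segment, file names stay unique within a load. Note that the segment directory in the CDX line quoted in the issue is named 1471982290765.41 rather than a running number, so the list of segment names would have to come from the crawl's own listing.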

However, the CDX format you are using here is not the regular CDX provided by the Internet Archive, but JCDX. This is not natively supported by ArchiveSpark yet, but there exists an (unofficial / untested) DataSpec for it: https://github.com/trafficdirect/ArchiveSparkJCDX