archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Scala · Apache-2.0
Issues
- 1
`s3a` URLs don't work as in documentation
#556 opened by acruise - 2
Include last modified date for a resource
#546 opened by ruebot - 0
DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS
#544 opened by ruebot - 1
org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8
#542 opened by ruebot - 0
Add ARCH text files derivatives
#540 opened by ruebot - 1
Remove http headers, and html on webpages()
#538 opened by ruebot - 2
ARC reader string vs int error on record length
#492 opened by ruebot - 0
- 0
Add domain column to webpages()
#534 opened by ruebot - 1
Extract gzip data from transfer-encoded WARC
#493 opened by ianmilligan1 - 1
Replace Java ARC/WARC record processing library
#494 opened by ruebot - 1
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca)
#529 opened by JakeBickUKGWA - 0
Include timestamp in crawl date
#525 opened by ruebot - 1
Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat
#521 opened by ruebot - 0
Scaladocs haven't been created since 0.90.0 release
#522 opened by ruebot - 2
ExtractDomains returns non-Apex Domains
#519 opened by ruebot - 0
ARC file name appearing in `url` list
#516 opened by ianmilligan1 - 1
- 0
crawl_date is not included on binary information jobs when documentation says it is
#512 opened by ruebot - 1
Update required Scala version to 2.12
#509 opened by ruebot - 0
- 3
Extract hyperlinks from wayback machine
#501 opened by yxzhu16 - 0
Python implementation of .all() has .keepValidPages() incorrectly applied to it
#502 opened by ruebot - 4
Update Read.me w/ citation information
#497 opened by SamFritz - 1
Split tf into its own repo
#498 opened by ruebot - 2
Release 0.80.0 JAR produces error; 0.80.1 fatjar built on repo works
#495 opened by ianmilligan1 - 4
Change master branch to main branch
#490 opened by ruebot - 0
GitHub action - Run isort and black on Python code
#488 opened by ruebot - 0
Add Google Java Formatter as a GitHub action
#484 opened by ruebot - 0
Add scalafmt GitHub action
#486 opened by ruebot - 5
Packages build is often broken - should we support it?
#483 opened by ruebot - 1
Implement SaveToDisk in Python
#478 opened by ruebot - 10
- 5
Python UDFs - class or not?
#467 opened by ruebot - 6
Broken link in documentation
#476 opened by sepastian - 0
Improve udfs/package.scala test coverage
#473 opened by ruebot - 2
Remove tabDelimit
#471 opened by ianmilligan1 - 0
Remove Extract Entities
#469 opened by ruebot - 1
Remove ExtractImageDetailsDF.scala
#464 opened by ruebot - 1
github-stite-deploy uses password based authentication which is being deprecated by GitHub
#461 opened by ruebot - 0
For extractor (spark-submit) job, set Spark app name to be the extractor job name.
#458 opened by ruebot - 0
DomainFrequencyExtractor should remove WWW prefix
#456 opened by ruebot - 7
Update Java 8 instructions for MacOS
#445 opened by ianmilligan1 - 0
Add parquet as an app format option
#448 opened by ruebot - 1
Update PlainTextExtractor to just extract text
#452 opened by ruebot - 0
- 2
Remove RDD options from app
#449 opened by ruebot - 0
Add spark-submit to README
#444 opened by ruebot - 0
Remove GraphXML and ExtractGraphX
#442 opened by ruebot - 5