archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Scala · Apache-2.0

Issues

`s3a` URLs don't work as in documentation
#556 opened 4 months ago by acruise
1
Include last modified date for a resource
#546 opened 2 years ago by ruebot
2
DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS
#544 opened 2 years ago by ruebot
0
org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8
#542 opened 2 years ago by ruebot
1
Add ARCH text files derivatives
#540 opened 2 years ago by ruebot
0
Remove HTTP headers and HTML from webpages()
#538 opened 2 years ago by ruebot
1
ARC reader string vs int error on record length
#492 opened 2 years ago by ruebot
2
Discard date RDD filter only takes a single string, not a list of strings.
#532 opened 2 years ago by ruebot
0
Add domain column to webpages()
#534 opened 2 years ago by ruebot
0
Extract gzip data from transfer-encoded WARC
#493 opened 2 years ago by ianmilligan1
1
Replace Java ARC/WARC record processing library
#494 opened 2 years ago by ruebot
1
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca)
#529 opened 2 years ago by JakeBickUKGWA
1
Include timestamp in crawl date
#525 opened 2 years ago by ruebot
0
Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat
#521 opened 3 years ago by ruebot
1
Scaladocs haven't been created since 0.90.0 release
#522 opened 3 years ago by ruebot
0
ExtractDomains returns non-Apex Domains
#519 opened 3 years ago by ruebot
2
ARC file name appearing in `url` list
#516 opened 3 years ago by ianmilligan1
0
WARC-Target-URI in Wget warc files is not parsed properly
#514 opened 3 years ago by javieraespinosa
1
crawl_date is not included on binary information jobs when documentation says it is
#512 opened 3 years ago by ruebot
0
Update required Scala version to 2.12
#509 opened 3 years ago by ruebot
1
Migrate CI infrastructure from TravisCI to GitHub Actions
#506 opened 4 years ago by ruebot
0
Extract hyperlinks from wayback machine
#501 opened 4 years ago by yxzhu16
3
Python implementation of .all() has .keepValidPages() incorrectly applied to it
#502 opened 4 years ago by ruebot
0
Update Read.me w/ citation information
#497 opened 4 years ago by SamFritz
4
Split tf into its own repo
#498 opened 4 years ago by ruebot
1
Release 0.80.0 JAR produces error; built 0.80.1 fatjar built on repo works
#495 opened 4 years ago by ianmilligan1
2
Change master branch to main branch
#490 opened 4 years ago by ruebot
4
GitHub action - Run isort and black on Python code
#488 opened 4 years ago by ruebot
0
Add Google Java Formatter as a GitHub action
#484 opened 4 years ago by ruebot
0
Add scalafmt GitHub action
#486 opened 4 years ago by ruebot
0
Packages build is often broken - should we support it?
#483 opened 4 years ago by ruebot
5
Implement SaveToDisk in Python
#478 opened 4 years ago by ruebot
1
PEP8 Naming - UDFs, App method names, DataFrame names, and filters.
#468 opened 4 years ago by ruebot
10
Python UDFs - class or not?
#467 opened 4 years ago by ruebot
5
Broken link in documentation
#476 opened 4 years ago by sepastian
6
Improve udfs/package.scala test coverage
#473 opened 4 years ago by ruebot
0
Remove tabDelimit
#471 opened 4 years ago by ianmilligan1
2
Remove Extract Entities
#469 opened 4 years ago by ruebot
0
Remove ExtractImageDetailsDF.scala
#464 opened 4 years ago by ruebot
1
github-site-deploy uses password-based authentication, which is being deprecated by GitHub
#461 opened 4 years ago by ruebot
1
For extractor (spark-submit) job, set Spark app name to be the extractor job name.
#458 opened 4 years ago by ruebot
0
DomainFrequencyExtractor should remove WWW prefix
#456 opened 4 years ago by ruebot
0
Update Java 8 instructions for macOS
#445 opened 4 years ago by ianmilligan1
7
Add parquet as an app format option
#448 opened 4 years ago by ruebot
0
Update PlainTextExtractor to just extract text
#452 opened 4 years ago by ruebot
1
Add datathon derivatives to app (binary info, web pages, web graph)
#447 opened 4 years ago by ruebot
0
Remove RDD options from app
#449 opened 4 years ago by ruebot
2
Add spark-submit to README
#444 opened 4 years ago by ruebot
0
Remove GraphXML and ExtractGraphX
#442 opened 4 years ago by ruebot
0
Use Monochromatic Ids instead of hash to produce network identifiers.
#440 opened 4 years ago by greebie
5