internetarchive/heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Java NOASSERTION
Issues
Provided seed files are updated (the more the job is repeated, the more they are modified)
#558 opened by cgr71ii - 0
How to change auth type?
#556 opened by Moogamouth - 1
Support for silent option when running a job
#563 opened by melker - 5
Maven build fails due to HTTP only upstream servers
#481 opened by Jauchi - 3
Question about memory usage
#462 opened by naveen17797 - 0
archive web crawler - crawl speed
#562 opened by solaceten - 4
Error when more than 125 jobs are instantiated
#559 opened by cgr71ii - 0
WARNING: politessDelay unset, returning default 5000
#553 opened by cgr71ii - 0
Error: Could not find or load main class org.archive.crawler.Heritrix Caused by: java.lang.ClassNotFoundException: org.archive.crawler.Heritrix
#546 opened by Moogamouth - 3
Question about the size of the 'state' directory
#498 opened by cgr71ii - 2
RateLimitGuard.authenticate() authentication failure
#474 opened by troloff - 3
Exclude PDF-Files
#453 opened by oschihin - 0
Question re: cloudfront.net
#487 opened by carj - 0
How to scale Heritrix with Kubernetes?
#466 opened by naveen17797 - 2
How to cite?
#463 opened by Querela - 1
Impact of log4j CVE-2021-44228 on heritrix3?
#451 opened by bnewbold - 4
Authentication on servers using Oauth2
#446 opened by AndreSchmutz - 10
cannot resolve these dependencies
#443 opened by oldRabbitForz - 1
Resume a crawl for later
#500 opened by JenPho - 6
Questions about TransclusionDecideRule
#496 opened by cgr71ii - 1
Implicit max. value of URI cost and precedence (?) should raise warning if exceeded
#502 opened by cgr71ii - 5
Time is not stopped when Disk Space Monitor is triggered and report files are removed
#499 opened by cgr71ii - 0
Bean reference missing inherited properties
#497 opened by ato - 1
${launchId} is not being replaced (sometimes)
#495 opened by cgr71ii - 1
Apple Silicon docker images
#471 opened by ziodave - 2
ExtractorHTML matches srcset attribute case-sensitively
#477 opened by ato - 2
Heritrix not ignoring robots.txt
#479 opened by Elletra - 0
HTTP/2 protocol
#472 opened by kauka-1 - 0
"java.lang.NoClassDefFoundError: Could not initialize class org.archive.util.CLibrary" on Apple Silicon
#467 opened by ziodave - 0
Heritrix crashing on malformed Content-Length header
#449 opened by krakan - 1
Commas in srcset-URLs are not handled correctly
#458 opened by grob - 3
Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453)
#455 opened by oschihin - 1
[Question] SEVERE Configuration problem: Unable to locate Spring NamespaceHandler for XML schema namespace
#437 opened by naveen17797 - 2
force HTTP11 true - breaking warc records
#432 opened by Feribv