centic9/CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents of certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.
Java, BSD-2-Clause
Stargazers
- alexyorke (Microsoft)
- berezovskyi (127.0.0.1)
- berkaineuland AI
- bskaggs
- burf2000 (Burf.co Search engine)
- centic9
- chrismattmann (Mattmann.AI)
- craigpfeifer (@Lightning-AI)
- decalage2
- ericharley (Providence, RI)
- harshalrj25
- hoagy-davis-digges
- husrevbeyazisik
- jaypat87
- jjangsangy (San Francisco, CA)
- keithjjones
- keyboardsamurai (Cologne, Germany)
- kotobot (Headway EdTech)
- MessianNil (IISc)
- mikalv (Sigterm.no)
- mkr
- morskyjezek (University of Michigan School of Information)
- mzhaox
- oooooleg
- oriefrati (@Autofleet)
- pankajdev73
- punkeel (Munich, Germany)
- puzzlepeaches (@sprocketsecurity)
- rharang
- samalloing (KB - National Library of the Netherlands)
- sebastian-nagel (@commoncrawl)
- svip-Admin
- tballison (Rhapsode Consulting LLC)
- tbpalsulich (Google)
- viamin (ID.me)
- walletma