tballison
File processing and search. Founder, Rhapsode Consulting LLC. Chair/VP, Apache Tika. Committer, Apache POI, PDFBox, Lucene/Solr, Nutch, OpenNLP.
Pinned Repositories
CC-MAIN-2021-31-PDF-UNTRUNCATED
commoncrawl-fetcher-lite
Simplified version of a common crawl fetcher
cord-19
Data munging for CORD-19
file-observatory
Single server/laptop grade file-observatory
lucene-addons
Standalone versions of LUCENE_5205 and other patches: SpanQueryParser, Concordance and Co-occurrence stats
mp4parser
A Java API to read, write and create MP4 files
quaerite
Search relevance evaluation toolkit
rhapsode
Advanced desktop search/corpus exploration prototype
SimpleCommonCrawlExtractor
Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries
tika-gui-v2
Unofficial user interface for Apache Tika
tballison's Repositories
tballison/quaerite
Search relevance evaluation toolkit
tballison/commoncrawl-fetcher-lite
Simplified version of a common crawl fetcher
tballison/file-observatory
Single server/laptop grade file-observatory
tballison/tika-gui-v2
Unofficial user interface for Apache Tika
tballison/SimpleCommonCrawlExtractor
Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries
tballison/CC-MAIN-2021-31-PDF-UNTRUNCATED
tballison/cord-19
Data munging for CORD-19
tballison/share
Public share
tballison/awesome-digital-preservation
Carefully curated list of awesome digital preservation resources.
tballison/hodgepodge
One-off dev repo, very experimental
tballison/language-detector
Language Detection Library for Java
tballison/tika-addons
Addons not part of the official Tika release
tballison/any23
Apache Anything To Triples (Any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents.
tballison/apachestuff
tballison/commons-compress
Mirror of Apache Commons Compress
tballison/commons-io
Apache Commons IO
tballison/droid
DROID (Digital Record and Object Identification)
tballison/hadoop-safe-tika
tballison/incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
tballison/junrar
Plain Java unrar util (formerly a SourceForge project)
tballison/metadata-extractor
Extracts Exif, IPTC, XMP, ICC and other metadata from image files
tballison/nanite
Nanite - a friendly swarm of format-identifying robots.
tballison/nutch
Apache Nutch is an extensible and scalable web crawler
tballison/opensearch-java
Java Client for OpenSearch
tballison/oss-fuzz
OSS-Fuzz - continuous fuzzing for open source software.
tballison/poi
Mirror of Apache POI
tballison/tika-arlington-pdf-model
Simple wrapper around the Arlington PDF model's TestGrammar
tballison/tika-detector-stormcrawler
Wraps the charset detection logic from StormCrawler as a Tika module
tballison/tika-docker
Convenience Docker images for Apache Tika Server
tballison/tika-eval-multi-comparer
Demo tika-eval-multi-comparer