/ddex

API for document data extraction


DDEx Project

Extracting data independently of file formats

DDEx - Document Data Extractor - is a framework that lets applications transparently open documents and extract their content, regardless of file format.

We are working to provide support for:

  • OLE2 file formats [.doc, .xls, .ppt]
  • OOXML file formats [.docx, .xlsx, .pptx]
  • ODF file formats [.odt, .ods, .odp]
  • CSV
  • PDF
  • Google Docs (minimal support)

Goal, Challenges, Differentials

DDEx is based on the Builder design pattern and can be extended to support other formats. It aims to decouple content extraction from content processing: it handles the diversity of file formats and gives access to a document's content independently of its format.

DDEx sits between multiple format-specific APIs (such as Apache POI and ODFDOM) and the application, offering a common interface. It encapsulates the extraction and performs it independently of format, so applications can reuse a document's content in other contexts.
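As a rough illustration of how the Builder pattern can separate extraction from processing, here is a minimal sketch. It is not the DDEx API: every name in it (`ContentBuilder`, `Extractor`, `CsvExtractor`, `PlainTextExtractor`, `WordListBuilder`) is hypothetical, and the two "formats" are plain strings, so that the example stays self-contained and does not depend on Apache POI or ODFDOM.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: all type names below are hypothetical, not DDEx classes.
public class BuilderSketch {

    // Builder role: receives content pieces and knows nothing about file formats.
    public interface ContentBuilder<T> {
        void addText(String text);
        T getResult();
    }

    // Director role: understands one format and knows nothing about processing.
    public interface Extractor {
        void extract(String rawContent, ContentBuilder<?> builder);
    }

    // Treats the input as comma-separated cells, one row per line.
    public static class CsvExtractor implements Extractor {
        @Override
        public void extract(String rawContent, ContentBuilder<?> builder) {
            for (String line : rawContent.split("\n")) {
                for (String cell : line.split(",")) {
                    String trimmed = cell.trim();
                    if (!trimmed.isEmpty()) {
                        builder.addText(trimmed);
                    }
                }
            }
        }
    }

    // Treats the input as whitespace-separated words.
    public static class PlainTextExtractor implements Extractor {
        @Override
        public void extract(String rawContent, ContentBuilder<?> builder) {
            for (String word : rawContent.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    builder.addText(word);
                }
            }
        }
    }

    // One possible way of processing the content: collect every piece into a list.
    public static class WordListBuilder implements ContentBuilder<List<String>> {
        private final List<String> words = new ArrayList<>();

        @Override
        public void addText(String text) {
            words.add(text);
        }

        @Override
        public List<String> getResult() {
            return words;
        }
    }

    public static void main(String[] args) {
        // The same builder type works with either extractor.
        WordListBuilder fromCsv = new WordListBuilder();
        new CsvExtractor().extract("a, b\nc,d", fromCsv);

        WordListBuilder fromText = new WordListBuilder();
        new PlainTextExtractor().extract("a b  c d", fromText);

        System.out.println(fromCsv.getResult());
        System.out.println(fromText.getResult());
    }
}
```

In this sketch, supporting a new format means adding another `Extractor`, and processing the content differently means adding another `ContentBuilder`; neither change touches the other side.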


Who is using DDEx?

DDEx originated in academia and has since been used by other Ph.D. and M.Sc. students in their research. It is also used by other projects and is associated with academic works, such as: