MetaDataExtractor

Python repo to extract metadata from a variety of documents (MS Office docs, PDF, images).

Launch with:

python3 -m pip install -r requirements.txt

python main.py

This creates a JSON file, "metadata.json", at the root of the repo.

You will also find a Shiny app in the visualization folder. To use it, convert the JSON file to CSV with the code below and store the result in /visualization/data/. For some reason Python segfaults when this code is embedded in the repo, so run it separately in your favorite IDE to avoid that.

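import pandas as pd

# Path to the JSON file produced by main.py; adjust it to your setup.
path = 'data/data/metadata.json'

# read_json puts each top-level key in a column; transpose to flip rows and columns.
temp = pd.read_json(path)
df = temp.T

# Written to the current working directory; move it to /visualization/data/ for the Shiny app.
df.to_csv('metadata.csv')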
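If pandas is the source of trouble, the same conversion can be sketched with the standard library only. This is a minimal sketch, not part of the repo: it assumes metadata.json is a mapping of document name to a flat dictionary of metadata fields (which is what the transpose above suggests), and the function name json_to_csv and the "document" column name are made up for illustration.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Write one CSV row per document from a {name: {field: value}} JSON file.

    Assumed layout (not confirmed by the repo): top-level keys are document
    names, values are flat dictionaries of metadata fields.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))

    # Collect every field seen across documents, preserving first-seen order.
    fields = []
    for meta in records.values():
        for key in meta:
            if key not in fields:
                fields.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in fields that a given document does not have.
        writer = csv.DictWriter(handle, fieldnames=["document"] + fields, restval="")
        writer.writeheader()
        for name, meta in records.items():
            writer.writerow({"document": name, **meta})

    return len(records)
```

Calling json_to_csv('metadata.json', 'visualization/data/metadata.csv') would then write the CSV directly where the Shiny app expects it, provided both paths match your checkout.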