/docx-to-html

Primary LanguageJavaScriptApache License 2.0Apache-2.0

DOCX to HTML

This project takes a DOCX file and converts it to HTML.

Getting Started

Everything is self contained in the docker containers and connected using docker-compose. All dependencies are installed by docker and self contained so all you need to do is download the app, build it and run it.

Prerequisites

You need to have docker installed on your machine.

https://docs.docker.com/install/

Deployment and Development - Download the app, build it and run it

Download the app:

git clone https://github.com/podaac/docx-to-html/

For development - build the docker containers using docker-compose:

Install the node modules in the dev build

cd frontend && npm install 

Note: If you do not yet have node.js (required dependency to execute the "npm" command) installed, it's highly recommended to first install the Node Version Manager (NVM) available here: https://github.com/nvm-sh/nvm#installing-and-updating

To install node.js using NVM:

nvm install node

Then build the Docker container

cd docx-to-html && docker-compose up -d --build

This command builds each docker container and installs all of the dependencies needed. It also runs the app as a daemon using -d, this allows you to exit your terminal session and continue to run the app.

The Frontend is exposed on port 8083. To access the app locally, go to http://localhost:8083/

In production, point to port 8083 or change the exposed port in docker-compose.yml and frontend/prod.docx2html-react-frontend.Dockerfile.

To kill the app:

docker-compose down

To run the app without building again:

docker-compose up -d

Deployment for production

Uses prod.docker-compose.yml file and the production dockerfiles for each container. You will need to change the POST request URL to point to the backend if it is not at localhost on your machine. Change POST request URL in frontend/src/components/MainContent.js

docker-compose -f docker-compose.yml -f prod.docker-compose.yml up -d --build

In production, point to port 8083 or change the exposed port in the prod.docker-compose.yml file and frontend/prod.docx2html-react-frontend.Dockerfile.

If you need to change ports here is a list of where those files live:

Built With

Web Frameworks

Docker Images:

Dependencies

End to End App Workflow

App Architecture Backend Architecture

The user uplodas a file or multiple files on the frontend. React uses Fetch for a POST request to the Flask Backend. The request gets handled in app/api.py. It saves the .docx file to the server in /backend/ and sends the file to converter/handle_input.py.

converter/convertDOCX2HTML.py converts the file to an HTML string, deletes the .docx file and returns the HTML string.

converter/parse_html.py is where all of the parsing and HTML manipulation takes place. It converts the HTML to a Beautiful Soup Object and parses it.

If there are .tiff, .emf or .wmf images, they get converted to PNG in converter/image_converter.py. They get saved as a tmp file while being converted and all the tmp files get deleted once the conversion is over.

converter/handle_input.py returns the parsed HTML to app/api.py to be returned as JSON to the frontend.

The frontend saves the HTML as a file to the users computer.

All containers are set to always be restarted. No data is being saved.

In the dev environment the files are linked to the container using 'volumes' so you do not have to rebuild the container if changes are made. The React frontend hot reloads so you don't need to restart the server. If you change anything in the Flask backend, you need to reboot the server to see any changes.

In the production environment 'volumes' are removed for security purposes.

Authors

  • Andrew Joseph - Initial work

License

This project is licensed under Apache 2.0 - see the LICENSE file for details

Acknowledgments

  • JPL Mentors: David Moroni & Suresh Vannan
  • Help and Support: Wilbert Veit, Ying Chen, Allan Yu, Sandra Cosic